User:JHernandez (WMF)/How to migrate from g00gle docs to wikitext

Why is the title weird? Because you can't create pages with "google" in the title here. Now...

How do you convert a Google Drive doc (old Google Docs) into a wiki page? Here is one way:

Parsing

  1. File -> Download as -> HTML zipped
  2. Unzip into a single HTML file
  3. Convert it to CommonMark and then to MediaWiki wikitext with pandoc:
    1. pandoc -f html -t commonmark document.html > document.md
    2. pandoc -f commonmark -t mediawiki document.md > document.wiki
    3. Or use http://pandoc.org/try/ if your document is small enough
  4. Clean up the resulting wikitext (see the next section)
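
The steps above can also be scripted. The following is a minimal sketch, not a definitive tool: it assumes pandoc is installed and on your PATH, and that the downloaded zip contains at least one .html file. The function names (extract_html, pandoc_commands, convert) are made up for this example. It uses pandoc's -o option to write the output file instead of the shell redirection shown in step 3.

```python
import subprocess
import zipfile
from pathlib import Path


def extract_html(zip_path, out_dir):
    """Extract the first .html member of the export zip and return its path."""
    with zipfile.ZipFile(zip_path) as archive:
        html_names = [n for n in archive.namelist() if n.lower().endswith(".html")]
        if not html_names:
            raise ValueError(f"no HTML file found in {zip_path}")
        return Path(archive.extract(html_names[0], out_dir))


def pandoc_commands(html_path):
    """Build the two pandoc invocations from step 3 as argument lists."""
    html_path = Path(html_path)
    md_path = html_path.with_suffix(".md")
    wiki_path = html_path.with_suffix(".wiki")
    return [
        ["pandoc", "-f", "html", "-t", "commonmark", str(html_path), "-o", str(md_path)],
        ["pandoc", "-f", "commonmark", "-t", "mediawiki", str(md_path), "-o", str(wiki_path)],
    ]


def convert(zip_path, out_dir="."):
    """Unzip, then run both pandoc steps. Requires pandoc on PATH."""
    html_path = extract_html(zip_path, out_dir)
    for command in pandoc_commands(html_path):
        subprocess.run(command, check=True)
    return html_path.with_suffix(".wiki")
```

Calling convert("document.zip") would leave document.wiki (or whatever the HTML file inside the zip is named, with a .wiki suffix) next to the extracted HTML, ready for the cleanup below.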

Cleanup

  • Check your headers
    • If the top-level headers start with a single =, add one heading level to every header (==Hello== becomes ===Hello===) to match MediaWiki's heading levels
    • Make sure no header is left with only one =, as that level is reserved for the page title in MediaWiki
  • Remove empty <p></p>
  • If you don't want to keep the comments
    • Find #cmref and remove the comment markup and text
    • Find #cmnt and remove the links to the comment content
  • Fix external URLs
    • Find https://www.google.com/url?q= and
      • remove it (Google Docs adds it to all URLs)
      • remove the query parameters at the end of the URL (example: &sa=D&ust=1521458237517000&usg=AFQjCNF8rpnKXKe5SBJfu3UHPfEJa4rGxA)
    • Fix the URL that remains: it is URL-encoded and probably broken if it uses characters like =, (), or anything outside [a-zA-Z:/]
  • If you have links to headings in the document, you will have to change them manually:
    • Find #h. in the document and clear the hash ID from the link's URL
    • Change each URL target to the MediaWiki section anchor (usually the name of the section with _ instead of spaces)
      • You can use the TOC to check the IDs: copy the link address of a TOC entry, or click the TOC link and look at the end of the URL in the browser's address bar
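
Several of the cleanup steps above are mechanical enough to script. The following is a rough sketch under some assumptions: headings appear one per line in the usual == Title == form, and Google's redirect links have the shape shown in the example above (the q= value followed by &sa=...&ust=...&usg=...). All function names here are hypothetical helpers for this page. Comment removal and the #h. heading links are left for manual editing.

```python
import re
from urllib.parse import unquote

# A wikitext heading line such as "== Title ==" (same number of = on both sides).
HEADING_RE = re.compile(r"^(=+)[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)
# Google's redirect wrapper: the real target is the URL-encoded q= value,
# optionally followed by tracking parameters up to the next space or "]".
GOOGLE_REDIRECT_RE = re.compile(
    r"https?://www\.google\.com/url\?q=([^\s\]&]+)(?:&[^\s\]]*)?"
)
EMPTY_P_RE = re.compile(r"<p>\s*</p>\n?")


def bump_headings(wikitext):
    """Add one level to every heading, but only if a single-= heading exists."""
    levels = [len(m.group(1)) for m in HEADING_RE.finditer(wikitext)]
    if 1 not in levels:
        return wikitext
    return HEADING_RE.sub(
        lambda m: f"={m.group(1)} {m.group(2)} {m.group(1)}=", wikitext
    )


def remove_empty_paragraphs(wikitext):
    """Drop <p></p> pairs that contain nothing but whitespace."""
    return EMPTY_P_RE.sub("", wikitext)


def unwrap_google_links(wikitext):
    """Replace each Google redirect URL with its URL-decoded target."""
    return GOOGLE_REDIRECT_RE.sub(lambda m: unquote(m.group(1)), wikitext)


def section_anchor(title):
    """Approximate a section anchor: the section name with _ instead of spaces."""
    return "#" + title.strip().replace(" ", "_")


def clean_wikitext(wikitext):
    """Apply the automatic cleanup steps in order."""
    wikitext = remove_empty_paragraphs(wikitext)
    wikitext = unwrap_google_links(wikitext)
    return bump_headings(wikitext)
```

The decoded URLs are still worth a look afterwards: if a target contains a space or a ] once decoded, the surrounding wikitext link will need fixing by hand. section_anchor is only the "usually" case described above, so check the result against the TOC.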

See also