User:JHernandez (WMF)/How to migrate from g00gle docs to wikitext
Why is the title weird? Because you can't create pages with "google" in the title here. Now...
How do you convert a Google Drive doc (old Google Docs) into a wiki page? Here is a way:
Parsing
- File -> Download as -> HTML zipped
- Unzip the archive to get a single HTML file
- Convert it to CommonMark and then to MediaWiki with pandoc:
pandoc -f html -t commonmark document.html > document.md
pandoc -f commonmark -t mediawiki document.md > document.wiki
- Or use http://pandoc.org/try/ if your document is small enough
- Proceed to clean up the wikitext, see next section
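The steps above can be scripted. The following is a minimal sketch, assuming `pandoc` is installed and on the `PATH`; the function names (`extract_html`, `pandoc_commands`, `convert`) are made up for illustration, and it uses pandoc's `-o` option instead of shell redirection.

```python
import subprocess
import zipfile
from pathlib import Path


def extract_html(zip_path, out_dir):
    """Extract the first .html member of the downloaded archive."""
    with zipfile.ZipFile(zip_path) as archive:
        names = [n for n in archive.namelist() if n.lower().endswith(".html")]
        if not names:
            raise ValueError("no HTML file found in archive")
        return Path(archive.extract(names[0], out_dir))


def pandoc_commands(html_path):
    """Build the two pandoc invocations: HTML -> CommonMark -> MediaWiki."""
    html_path = Path(html_path)
    md_path = html_path.with_suffix(".md")
    wiki_path = html_path.with_suffix(".wiki")
    return [
        ["pandoc", "-f", "html", "-t", "commonmark", str(html_path), "-o", str(md_path)],
        ["pandoc", "-f", "commonmark", "-t", "mediawiki", str(md_path), "-o", str(wiki_path)],
    ]


def convert(zip_path, out_dir="."):
    """Unzip the export and run both pandoc steps; returns the .wiki path."""
    html_path = extract_html(zip_path, out_dir)
    for command in pandoc_commands(html_path):
        subprocess.run(command, check=True)  # raises if pandoc fails
    return html_path.with_suffix(".wiki")
```

Calling `convert("document.zip")` should leave a `.wiki` file next to the extracted HTML, which still needs the cleanup described in the next section.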
Cleanup
- Check your headers
  - If the top level headers start with one =, add one heading level to all headers (==Hello== becomes ===Hello===) to match MediaWiki header levels
  - Make sure you don't have any headers with only one =, as that level is the page title in the MediaWiki page
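The header check can be automated with a small script. This is a rough sketch, assuming every heading sits on its own line with matching = signs on both sides; `demote_headings` is a hypothetical helper name, and the result should still be reviewed by eye.

```python
import re

# A heading line: 1-6 "=" signs, the title, then the same number of "=" signs.
HEADING = re.compile(r"^(={1,6})([^=\n].*?)\1[ \t]*$", re.MULTILINE)


def demote_headings(wikitext):
    """Add one level to every heading, but only if a single-= heading exists."""
    levels = [len(m.group(1)) for m in HEADING.finditer(wikitext)]
    if 1 not in levels:
        return wikitext  # nothing uses the page-title level, leave as is

    def demote(match):
        # MediaWiki headings stop at level 6, so cap there.
        marks = "=" * min(len(match.group(1)) + 1, 6)
        return marks + match.group(2) + marks

    return HEADING.sub(demote, wikitext)
```

For example, `demote_headings("=Top=\n==Sub==\n")` gives `"==Top==\n===Sub===\n"`, while a document whose top level is already == is returned unchanged.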
- Remove empty <p></p>
- If you don't want comments around
  - Find #cmref and remove the comments markup and text
  - Find #cmnt and remove the links to the comments content
- Fix external URLs
  - Find https://www.google.com/url?q= and
    - remove it (Google Docs adds it to all URLs)
    - remove the query parameters at the end of the URL (example: &sa=D&ust=1521458237517000&usg=AFQjCNF8rpnKXKe5SBJfu3UHPfEJa4rGxA)
    - Fix the URL that remains, as it is URL encoded and probably broken if it uses characters like =, (), or anything that is not [a-zA-Z:/]
      - You can use https://urldecode.org/ for that
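The URL fixes can also be done in one pass instead of by hand. Here is a minimal sketch, assuming the real target's own & characters are percent-encoded inside the wrapped link, so that the first literal & marks the start of the tracking parameters; `unwrap_google_url` and `fix_external_urls` are names invented for this example.

```python
import re
from urllib.parse import unquote

PREFIX = "https://www.google.com/url?q="
# A wrapped link runs until whitespace or the closing "]" of a wikitext link.
WRAPPED = re.compile(r"https://www\.google\.com/url\?q=[^\s\]]+")


def unwrap_google_url(url):
    """Strip the redirect prefix and tracking parameters, then URL-decode."""
    if not url.startswith(PREFIX):
        return url
    target = url[len(PREFIX):]
    target = target.split("&", 1)[0]  # drops &sa=...&ust=...&usg=...
    # Decoded spaces would end a wikitext external link early, so re-encode them.
    return unquote(target).replace(" ", "%20")


def fix_external_urls(wikitext):
    """Unwrap every Google redirect link found in the wikitext."""
    return WRAPPED.sub(lambda m: unwrap_google_url(m.group(0)), wikitext)
```

Running `fix_external_urls` over the converted text replaces each wrapped link in place; it is worth clicking a few of the results afterwards to confirm they still resolve.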
- If you have links to headings in the document, you will have to manually change them:
  - Find #h. in the document and clear the hash id that is in the URL of the link
  - Change each URL target to the MediaWiki section id (usually the name of the section with _ instead of spaces)
    - You can use the TOC to check the ids by copying the link address on the TOC entry or clicking the TOC link and looking at the end of the URL in the browser's address bar
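Part of the heading-link fix can be scripted once you know which section each link should point to. The sketch below assumes the Google ids look like #h. followed by letters, digits, or underscores, and that the MediaWiki anchor is just the section name with spaces turned into underscores (the usual case mentioned above, but still worth checking against the TOC). The mapping from ids to section names has to be written by hand, and all function names are hypothetical.

```python
import re

# Assumed shape of Google Docs heading ids, e.g. "#h.gjdgxs".
HEADING_LINK = re.compile(r"#h\.[0-9A-Za-z_]+")


def section_anchor(title):
    """Section name with spaces replaced by underscores, prefixed with #."""
    return "#" + "_".join(title.split())


def find_heading_links(wikitext):
    """List the distinct Google heading ids that still need a new target."""
    return sorted(set(HEADING_LINK.findall(wikitext)))


def retarget_heading_links(wikitext, titles_by_id):
    """Swap each known Google id for the matching section anchor."""
    def swap(match):
        title = titles_by_id.get(match.group(0))
        return section_anchor(title) if title else match.group(0)

    return HEADING_LINK.sub(swap, wikitext)
```

A typical use is to call `find_heading_links` first, look up each id's section by hand, and then pass a dictionary such as `{"#h.gjdgxs": "Project plan"}` to `retarget_heading_links`; ids missing from the dictionary are left untouched so they can be found again later.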