Convert Socialtext wiki to MediaWiki
This page describes how to convert a Socialtext wiki to MediaWiki using Linux. It is based on a single conversion and is by no means exhaustive; it was tested with a wiki comprising only a few hundred pages and files, and it could be improved a lot.
Socialtext wiki is similar to Kwiki.
The procedure described below can:
- convert pages, retaining the essential syntax
- convert files
- convert histories of pages and files
and cannot:
- convert tables (re-edit them manually; mostly you only need to add the table start and end syntax)
- convert most other Socialtext features
- convert user-association of edits
- and much more
Introduction
Our Socialtext wiki is stored as files located in a directory named
data
The tree contains one directory per page (below data/{WORKSPACE}), with one index.txt containing the current version of the page and several {date}.txt files containing older revisions. (Workspaces are separate branches of a Socialtext wiki.)
The attached files are located within a directory named
plugin
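For orientation, the tree looks roughly like this; the workspace, page, and attachment names below are made up, and the attachment layout is deduced from the migration scripts further down:
data/
  myworkspace/
    Some_Page/
      index.txt             (current revision)
      20091015120000.txt    (older revision)
      20090914083000.txt
plugin/
  myworkspace/
    attachments/
      20091015-1-2345.txt   (attachment metadata)
      20091015-1-2345/      (directory containing the attached file itself)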
Put all the following files and dirs (except for the new wiki) into one working directory and proceed as follows.
Install MediaWiki
- install a current MediaWiki
- allow upload of all files
$wgEnableUploads = true;
$wgStrictFileExtensions = false;
$wgCheckFileExtensions = false;
- modify php.ini and reload apache2 (to be able to upload bigger files)
post_max_size = 32M
upload_max_filesize = 32M
Copy the original files to the new host
Copy these directories (use scp, not rsync, since we don't want symlinks; the index.txt files are symlinks):
- data
- plugin
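For example, assuming the old Socialtext installation lives under /var/www/socialtext (user, host, and source paths are placeholders; adjust them to your setup):
scp -r user@old.socialtext.host:/var/www/socialtext/data .
scp -r user@old.socialtext.host:/var/www/socialtext/plugin .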
Script to convert a single page
Create a script conv.py to convert a single page. It takes the file name of a page revision as its first argument.
#!/usr/bin/python
import re
import sys

filename = sys.argv[1]
f = open(filename, "r")
text = f.read()
(header, content) = text.split('\n\n', 1)

# trim content lines
lines = content.split('\n')
lines2 = [line.strip() for line in lines]
content = '\n'.join(lines2)

# headings
p = re.compile('^\^\^\^\^(.*)$', re.M)
content = p.sub('====\\1 ====', content)
p = re.compile('^\^\^\^(.*)$', re.M)
content = p.sub('===\\1 ===', content)
p = re.compile('^\^\^(.*)$', re.M)
content = p.sub('==\\1 ==', content)
p = re.compile('^\^(.*)$', re.M)
content = p.sub('=\\1 =', content)

# bold
p = re.compile('([^\*]+)\*([^\*]+)\*', re.M)
content = p.sub('\\1\'\'\'\\2\'\'\'', content)

# link
p = re.compile('\[([^\]]+)\]', re.M)
content = p.sub('[[\\1]]', content)

# file
p = re.compile('{file: ([^}]+)}', re.M)
content = p.sub('[[Media:\\1]]', content)

# image
p = re.compile('{image: ([^}]+)}', re.M)
content = p.sub('[[Bild:\\1]]', content)

# item level 1
p = re.compile('\342\200\242\011', re.M)
content = p.sub('* ', content)

# table, only partially, do the rest manually!
# you have to add {|... , |} , and check for errors due to empty cells
p = re.compile('[^\n]\|', re.M)
content = p.sub('\n|', content)
p = re.compile('\|\s*\|', re.M)
content = p.sub('|-\n|', content)

# lines with many / * + symbols were used as separator lines...
p = re.compile('[\/]{15,200}', re.M)
content = p.sub('----', content)
p = re.compile('[\*]{15,200}', re.M)
content = p.sub('----', content)
p = re.compile('[\+]{15,200}', re.M)
content = p.sub('----', content)

# external links
p = re.compile('\"([^\"]+)\"<http(.*)>\s*\n', re.M)
content = p.sub('[http\\2 \\1]\n\n', content)
p = re.compile('\"([^\"]+)\"<http(.*)>', re.M)
content = p.sub('[http\\2 \\1]', content)

# add categories
content += '\n'
header_lines = header.split('\n')
for line in header_lines:
    if re.match('^[Cc]ategory: ', line):
        category = re.sub('^[Cc]ategory: (.*)$', '\\1', line)
        content += '[[Category:' + category + ']]\n'

# departments / workspaces
if re.match('data/zsi-fe', filename):
    content += '[[Category:FE]]\n'
if re.match('data/zsi-ac', filename):
    content += '[[Category:AC]]\n'
if re.match('data/zsi-tw', filename):
    content += '[[Category:TW]]\n'

print content
Test it like this:
./conv.py data/{WORKSPACE}/{PAGENAME}/{REVISION}
Just copy the resulting wiki text into a page of the new MediaWiki and use the preview function.
Adapt the Python script to your needs until most pages are translated correctly.
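To give an idea of what the script handles, a made-up page revision file like this one:
Subject: Project status
Date: 2009-10-15 12:00:00 GMT
Category: Projects

^^ Current state

The milestone is *done*, see {file: report.pdf} and
"our homepage"<http://example.org>
is turned into roughly the following MediaWiki text (plus a workspace category if the path matches one of the data/zsi-* prefixes):
== Current state ==

The milestone is '''done''', see [[Media:report.pdf]] and
[http://example.org our homepage]

[[Category:Projects]]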
Script to upload a single file
The MediaWiki API does not yet have action=upload. Get upload.pl.
The script has to be modified to use our new server instead of mediawiki.blender.org. Also edit the username and password. Create a directory called 'upload', put some content there, and test uploading.
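A quick test could look like this; the files.txt format shown here simply mirrors what the migration script further down writes (four lines: the file name prefixed with '>', the file name again, the version/date, and a comment) and may need adjusting to the upload.pl version you downloaded:
mkdir -p upload
cp /tmp/test.png upload/test.png
cat > upload/files.txt <<'EOF'
>test.png
test.png
2009-10-15 12:00:00 GMT
(test upload)
EOF
./upload.pl upload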
Script to migrate pages
Use this script (which calls ./conv.py) to migrate pages. The pages will be uploaded in chronological order:
#!/bin/sh
wikiurl="http://NAME.OF.NEW.SERVER/mediawiki/api.php"
lgname="WikiSysop"
lgpassword="*************"

# login
login=$(wget -q -O - --no-check-certificate --save-cookies=/tmp/converter-cookies.txt \
    --post-data "action=login&lgname=$lgname&lgpassword=$lgpassword&format=json" \
    $wikiurl)
#echo $login

# get edittoken
edittoken=$(wget -q -O - --no-check-certificate --save-cookies=/tmp/converter-cookies.txt \
    --post-data "action=query&prop=info|revisions&intoken=edit&titles=Main%20Page&format=json" \
    $wikiurl)
#echo $edittoken
token=$(echo $edittoken | sed -e 's/.*edittoken.:.\([^\"]*\)...\".*/\1/')
token="$token""%2B%5C"
#echo $token

# test editing with a test page
#cmd="action=edit&title=test1&summary=autoconverted&format=json&text=test1&token=$token&recreate=1&notminor=1&bot=1"
#editpage=$(wget -q -O - --no-check-certificate --load-cookies=/tmp/converter-cookies.txt --post-data $cmd $wikiurl)
#echo $editpage
#exit

# loop over all pages except for dirs in the list of excludes
find data -not -path "data/help*" -type f -and -not -name ".*" | sort | while read n; do
    pagedir=$(echo $n | sed -e 's/.*\/\(.*\)\/index.txt/\1/')
    if [[ "`grep -q $pagedir excludes; echo $?`" == "0" ]]; then
        echo "omitting $pagedir"
    else
        echo "parsing $pagedir"
        workspace=$(echo $n | sed -e 's/.*\/\(.*\)\/[^\/]\+\/index.txt/\1/')
        pagename=$(egrep '^Subject:' $n | head -n 1 | sed -e 's/^Subject: \(.*\)/\1/')
        pagedate=$(egrep '^Date:' $n | head -n 1 | sed -e 's/^Date: \(.*\)/\1/')
        echo "$workspace $pagedir -------------- $pagename"
        text=$(./conv.py $n)
        text1=$(php -r 'print urlencode($argv[1]);' "$text")
        pagename1=$(php -r 'print urlencode($argv[1]);' "$pagename")
        pagedate1=$(php -r 'print urlencode($argv[1]);' "$pagedate")
        cmd="action=edit&title=$pagename1&summary=$pagedate1+autoconverted+from+socialtextwiki&format=json&text=$text1&token=$token&recreate=1&notminor=1&bot=1"
        editpage=$(wget -q -O - --no-check-certificate --load-cookies=/tmp/converter-cookies.txt --post-data $cmd $wikiurl)
        #echo $editpage
    fi
done
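The loop skips every page whose directory name is listed in a file named excludes in the working directory (one name per line). A minimal excludes file could look like this; the page names are made up:
old_test_page
sandbox
meeting_scratchpad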
Script to migrate files
Use this script (which calls ./upload.pl) to migrate files. The files will be uploaded in chronological order:
#!/bin/sh
find plugin -path 'plugin/zsi*/attachments/*.txt' | sort | while read f; do
    if [[ "`grep -q 'Control: Deleted' $f; echo $?`" != "0" ]]; then
        d=${f/.txt}
        filenameNew=$(egrep '^Subject:' $f | sed -e 's/Subject: \(.*\)/\1/')
        filenameOrig=$(ls -1 $d | head -n 1)
        version=$(egrep '^Date: ' $f | sed -e 's/Date: \(.*\)/\1/')
        #echo "---------------------------"
        #echo $filenameOrig
        #echo "$filenameNew"
        rm upload/*
        cp $d/$filenameOrig "upload/$filenameNew"
        # prepare upload
        echo -e ">$filenameNew\n$filenameNew\n$version\n(autoconverted from socialtext wiki)" > upload/files.txt
        # upload
        ./upload.pl upload
    fi
done
Notes
- Socialtext wiki REST API (unused)