Manual:Pywikibot/transwikiimport.py

transwikiimport.py is a Pywikibot script used to transfer pages from a source wiki to a target wiki.

The complete edit history can be imported, and a recursive import of all transcluded pages and templates is also possible.

Internal links are not repaired!

The script gives access to all options available on Special:Import and through the corresponding API (see API:Import).

Examples

Transfer all pages in category "Query service" from the source wiki to the home wiki, import all versions of the page (full history), assign the changes to the locally existing accounts, set an appropriate summary, do not overwrite existing pages when they have the same name:

$ python pwb.py transwikiimport -interwikisource:en -cat:"Query service" -fullhistory -assignknownusers -summary:"Pages from the category Query service copied."

Copy the page "Page:How to become famous.djvu/333" from the source wiki to the home wiki, import all versions of the page (full history), assign the changes to the locally existing accounts, set an appropriate summary, do not overwrite existing pages when they have the same name:

$ python pwb.py transwikiimport -interwikisource:en -page:"How to become famous.djvu/333" -fullhistory -assignknownusers -summary:"Page copied from the oldwiki."

Copy all pages "Page:How to become famous.djvu/?" from the source wiki to the home wiki, import all versions of the page (full history), assign the changes to the locally existing accounts, set an appropriate summary, do not overwrite existing pages when they have the same name:

Copy pages 111–222 of "Page:How to become famous.djvu/?" from the source wiki to the home wiki, import all versions of the page (full history), assign the changes to the locally existing accounts, set an appropriate summary, do not overwrite existing pages when they have the same name:

On a Linux system, a one-liner that achieves this could look like:

$ for i in {111..222} ; do python3 pwb.py transwikiimport -interwikisource:en -page:"How to become famous.djvu/$i" -fullhistory -assignknownusers -summary:"Page copied from the oldwiki." ; done

On a Windows machine one could use:

> FOR /L %A IN (111,1,222) DO python pwb.py transwikiimport -interwikisource:en -page:"How to become famous.djvu/%A" -fullhistory -assignknownusers -summary:"Page copied from the oldwiki."
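
The same loop can also be written in a platform-independent way. The following is a minimal Python sketch, assuming that pwb.py lies in the current directory; the function name build_import_commands is hypothetical and not part of Pywikibot. It only builds the argument lists and does not start any import by itself.

```python
import sys


def build_import_commands(first, last,
                          base_title="How to become famous.djvu",
                          source="en",
                          summary="Page copied from the oldwiki."):
    """Return one pwb.py argument list per page number in [first, last]."""
    commands = []
    for number in range(first, last + 1):
        commands.append([
            sys.executable, "pwb.py", "transwikiimport",
            f"-interwikisource:{source}",
            f"-page:{base_title}/{number}",
            "-fullhistory",
            "-assignknownusers",
            f"-summary:{summary}",
        ])
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True) to run the import for that page. Because the arguments are passed as a list, no shell quoting of the page title is needed.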


Parameters

Parameter Description
-interwikisource: The interwiki code of the source wiki.
-fullhistory: Include all versions of the page.
-includealltemplates: All templates and transcluded pages will be copied (dangerous).
-assignknownusers: If the user exists on the target wiki, assign the edits to them.
-correspondingnamespace: The number of the corresponding namespace.
-rootpage: Import as subpages of ...
-summary: Log entry import summary.
-tags: Change tags to apply to the entry in the import log and to the null revision on the imported pages.
-test: No import, the names of the pages are output.
-overwrite: Existing pages are skipped by default. Use this option to overwrite pages.
-target: Use page generator of the target site.
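
How the parameters above combine into a command line can be sketched with a small helper. This is only an illustration: build_args and the two tuples are hypothetical names, and the helper merely formats the options listed in the table (without -tags, whose value format is not described here) instead of calling Pywikibot.

```python
# Options from the table above that take no value.
BOOLEAN_FLAGS = ("fullhistory", "includealltemplates", "assignknownusers",
                 "test", "overwrite", "target")
# Options from the table above that are written as -name:value.
VALUE_OPTIONS = ("correspondingnamespace", "rootpage", "summary")


def build_args(interwikisource, generators, **options):
    """Return the argument list that follows 'python pwb.py'."""
    unknown = set(options) - set(BOOLEAN_FLAGS) - set(VALUE_OPTIONS)
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    args = ["transwikiimport", f"-interwikisource:{interwikisource}"]
    args.extend(generators)  # e.g. ["-cat:Query service"]
    for name in BOOLEAN_FLAGS:
        if options.get(name):
            args.append(f"-{name}")
    for name in VALUE_OPTIONS:
        value = options.get(name)
        if value is not None:
            args.append(f"-{name}:{value}")
    return args
```

For example, build_args("en", ["-cat:Query service"], fullhistory=True, test=True) yields the arguments for a dry run that only prints the page names.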

Pages to work on can be specified using any of:


Generators and filters available

Generator options
Parameter Description
-cat Work on all pages which are in a specific category. Argument can also be given as "-cat:categoryname" or as "-cat:categoryname|fromtitle" (using # instead of | is also allowed in this one and the following)
-catr Like -cat, but also recursively includes pages in subcategories, sub-subcategories etc. of the given category. Argument can also be given as "-catr:categoryname" or as "-catr:categoryname|fromtitle".
-subcats Work on all subcategories of a specific category. Argument can also be given as "-subcats:categoryname" or as "-subcats:categoryname|fromtitle".
-subcatsr Like -subcats, but also includes sub-subcategories etc. of the given category. Argument can also be given as "-subcatsr:categoryname" or as "-subcatsr:categoryname|fromtitle".
-uncat Work on all pages which are not categorised.
-uncatcat Work on all categories which are not categorised.
-uncatfiles Work on all files which are not categorised.
-file Read a list of pages to treat from the named text file. Page titles in the file may be either enclosed with brackets (example: [[Page]]), or be separated by new lines. Argument can also be given as "-file:filename".
-filelinks Work on all pages that use a certain image/media file. Argument can also be given as "-filelinks:filename".
-search Work on all pages that are found in a MediaWiki search across all namespaces.
-logevents Work on articles that were on a specified Special:Log. The value may be a comma separated list of these values:
logevent,username,start,end

or for backward compatibility:

logevent,username,total

To use the default value, use an empty string. You have options for every type of logs given by the log event parameter which could be one of the following:

spamblacklist, titleblacklist, gblblock, renameuser, globalauth, gblrights, gblrename, abusefilter, massmessage, thanks, usermerge, block, protect, rights, delete, upload, move, import, patrol, merge, suppress, tag, managetags, contentmodel, review, stable, timedmediahandler, newusers

By default, 10 pages are returned.

Examples:

-logevents:move gives pages from move log (usually redirects)
-logevents:delete,,20 gives 20 pages from deletion log
-logevents:protect,Usr gives pages from protect by user Usr
-logevents:patrol,Usr,20 gives 20 patrolled pages by Usr
-logevents:upload,,20121231,20100101 gives upload pages from 2010, 2011, and 2012
-logevents:review,,20121231 gives review pages from the beginning until 31 Dec 2012
-logevents:review,Usr,20121231 gives review pages by user Usr from the beginning until 31 Dec 2012
In some cases it must be given as -logevents:"move,Usr,20"
-interwiki Work on the given page and all equivalent pages in other languages. This can, for example, be used to fight multi-site spamming. Attention: this will cause the bot to modify pages on several wiki sites, this is not well tested, so check your edits!
-links Work on all pages that are linked from a certain page. Argument can also be given as "-links:linkingpagetitle".
-liverecentchanges Work on pages from the live recent changes feed. If used as -liverecentchanges:x, work on x recent changes.
-imagesused Work on all images that are contained on a certain page. Can also be given as "-imagesused:linkingpagetitle".
-newimages Work on the most recent new images. If given as -newimages:x, will work on x newest images.
-newpages Work on the most recent new pages. If given as -newpages:x, will work on x newest pages.
-recentchanges Work on the pages with the most recent changes. If given as -recentchanges:x, will work on the x most recently changed pages. If given as -recentchanges:offset,duration it will work on pages changed from 'offset' minutes with 'duration' minutes of timespan.

Examples:
-recentchanges:20 - gives the 20 most recently changed pages
-recentchanges:120,70 - will give pages with 120 offset minutes and 70 minutes of timespan
-recentchanges:visualeditor,10 - gives the 10 most recently changed pages marked with 'visualeditor'
-recentchanges:"mobile edit,60,35" - will retrieve pages marked with 'mobile edit' for the given offset and timespan

rctags are supported, and the rctag must be the very first parameter part.
-unconnectedpages Work on the most recent unconnected pages to the Wikibase repository. Given as -unconnectedpages:x, will work on the x most recent unconnected pages.
-ref Work on all pages that link to a certain page. Argument can also be given as "-ref:referredpagetitle".
-start Specifies that the robot should go alphabetically through all pages on the home wiki, starting at the named page. Argument can also be given as "-start:pagetitle". You can also include a namespace. For example, "-start:Template:!" will make the bot work on all pages in the template namespace. The default value is start:!
-prefixindex Work on pages commencing with a common prefix.
-transcludes Work on all pages that use a certain template. Argument can also be given as "-transcludes:Title".
-unusedfiles Work on all description pages of images/media files that are not used anywhere. Argument can be given as "-unusedfiles:n" where n is the maximum number of articles to work on.
-lonelypages Work on all articles that are not linked from any other article. Argument can be given as "-lonelypages:n" where n is the maximum number of articles to work on.
-unwatched Work on all articles that are not watched by anyone. Argument can be given as "-unwatched:n" where n is the maximum number of articles to work on.
-property:name Work on all pages with a given property name from Special:PagesWithProp.
-usercontribs Work on all articles that were edited by a certain user. (Example : -usercontribs:DumZiBoT)
-weblink Work on all articles that contain an external link to a given URL; may be given as "-weblink:url"
-withoutinterwiki Work on all pages that don't have interlanguage links. Argument can be given as "-withoutinterwiki:n" where n is the total to fetch.
-mysqlquery Takes a MySQL query string like "SELECT page_namespace, page_title FROM page WHERE page_namespace = 0" and works on the resulting pages. See Manual:Pywikibot/MySQL.
-sparql Takes a SPARQL SELECT query string including ?item and works on the resulting pages.
-sparqlendpoint Specify SPARQL endpoint URL (optional). (Example : -sparqlendpoint:http://myserver.com/sparql)
-searchitem Takes a search string and works on Wikibase pages that contain it. Argument can be given as "-searchitem:text", where text is the string to look for, or "-searchitem:lang:text", where lang is the language to search items in.
-random Work on random pages returned by Special:Random. Can also be given as "-random:n" where n is the number of pages to be returned.
-randomredirect Work on random redirect pages returned by Special:RandomRedirect. Can also be given as "-randomredirect:n" where n is the number of pages to be returned.
-google Work on all pages that are found in a Google search. You need a Google Web API license key. Note that Google doesn't give out license keys anymore. See google_key in config.py for instructions. Argument can also be given as "-google:searchstring".
-yahoo Work on all pages that are found in a Yahoo search. Depends on python module pYsearch. See yahoo_appid in config.py for instructions.
-page Work on a single page. Argument can also be given as "-page:pagetitle", and supplied multiple times for multiple pages.
-pageid Work on a single pageid. Argument can also be given as "-pageid:pageid1,pageid2,." or "-pageid:'pageid1|pageid2|..'" and supplied multiple times for multiple pages.
-linter Work on pages that contain lint errors. The Linter extension must be available on the site. -linter selects all categories. -linter:high, -linter:medium or -linter:low selects all categories for that priority. Single categories can be selected with commas, as in -linter:cat1,cat2,cat3. Adding '/int' identifies the Lint ID to start querying from, e.g. -linter:high/10000. -linter:show just shows the available categories.
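
The comma-separated value of -logevents described in the table above can be illustrated with a small parser. This is a sketch only and not Pywikibot's own parsing code: the function name parse_logevents is hypothetical, and the rule used to tell the backward-compatible "total" apart from a "start" timestamp (a third field of fewer than 8 digits with no "end") is an assumption made for this example.

```python
def parse_logevents(value):
    """Split a -logevents value into its named fields."""
    parts = value.split(",")
    if not 1 <= len(parts) <= 4:
        raise ValueError("expected logevent[,username[,start[,end]]]")
    parts += [""] * (4 - len(parts))  # pad omitted trailing fields
    logevent, username, third, end = parts
    result = {
        "logevent": logevent,
        "username": username or None,  # empty string means "any user"
        "total": None,
        "start": None,
        "end": end or None,
    }
    if third:
        # Assumption: short digit strings are the legacy 'total',
        # anything else is treated as a start timestamp.
        if not end and third.isdigit() and len(third) < 8:
            result["total"] = int(third)
        else:
            result["start"] = third
    return result
```

With this rule, "delete,,20" is read as a total of 20 pages, while "review,,20121231" is read as a start timestamp, which matches the examples given for -logevents.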
Filter options
Parameter Description
-catfilter Filter the page generator to only yield pages in the specified category. See -cat generator for argument format.
-grep A regular expression that needs to match the article otherwise the page won't be returned. Multiple -grep:regexpr can be provided and the page will be returned if content is matched by any of the regexpr provided. Case insensitive regular expressions will be used and dot matches any character, including a newline.
-grepnot Like -grep, but return the page only if the regular expression does not match.
-intersect Work on the intersection of all the provided generators.
-limit When used with any other argument -limit:n specifies a set of pages, work on no more than n pages in total.
-namespaces
-namespace
-ns
Filter the page generator to only yield pages in the specified namespaces. Separate multiple namespace numbers or names with commas.

Examples:

-ns:0,2,4 -ns:Help,MediaWiki

You may use a leading "not" to exclude the namespace. Examples:

-ns:not:2,3 -ns:not:Help,File

If used with -newpages/-random/-randomredirect/-linter generators, -namespace/-ns must be provided before -newpages/-random/-randomredirect/-linter. If used with -recentchanges generator, efficiency is improved if -namespace is provided before -recentchanges.

If used with -start generator, -namespace/-ns shall contain only one value.
-onlyif A claim the page needs to contain, otherwise the item won't be returned. The format is property=value,qualifier=value. Multiple (or none) qualifiers can be passed, separated by commas.

Examples:
P1=Q2 (property P1 must contain value Q2)
P3=Q4,P5=Q6,P6=Q7 (property P3 with value Q4 and qualifiers: P5 with value Q6 and P6 with value Q7)

Value can be page ID, coordinate in format: latitude,longitude[,precision] (all values are in decimal degrees), year, or plain string. The argument can be provided multiple times and the item page will be returned only if all claims are present. Argument can be also given as "-onlyif:expression".
-onlyifnot A claim the page must not contain, otherwise the item won't be returned. For usage and examples, see -onlyif above.
-ql Filter pages based on page quality. This is only applicable if contentmodel equals 'proofread-page', otherwise it has no effect. Valid values are in range 0-4. Multiple values can be comma-separated.
-subpage -subpage:n filters pages to only those that have depth n i.e. a depth of 0 filters out all pages that are subpages, and a depth of 1 filters out all pages that are subpages of subpages.
-titleregex A regular expression that needs to match the article title otherwise the page won't be returned. Multiple -titleregex:regexpr can be provided and the page will be returned if title is matched by any of the regexpr provided. Case insensitive regular expressions will be used and dot matches any character.
-titleregexnot Like -titleregex, but return the page only if the regular expression does not match.
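
The matching rule of -titleregex and -titleregexnot (case-insensitive, a page passes if any of the given expressions matches, or if none matches for the "not" variant) can be sketched as follows. The function filter_titles is a hypothetical stand-in that works on plain title strings; using re.search rather than a full match is an assumption of this sketch.

```python
import re


def filter_titles(titles, patterns, *, negate=False):
    """Yield titles matched by any pattern (or by none if negate is set)."""
    compiled = [re.compile(pattern, re.IGNORECASE) for pattern in patterns]
    for title in titles:
        matched = any(regex.search(title) for regex in compiled)
        if matched != negate:
            yield title
```

Called with the single pattern r"\.djvu/\d+$", the filter keeps only titles such as "Page:How to become famous.djvu/111"; with negate=True it keeps everything else.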


Global arguments available

These options override the settings in user-config.py.

Global options
Parameter Description Config variable
-dir:PATH Read the bot's configuration data from directory given by PATH, instead of from the default directory.  
-config:file The user config filename. Default is user-config.py. user-config.py
-lang:xx Set the language of the wiki you want to work on, overriding the configuration in user-config.py. xx should be the language code. mylang
-family:xyz Set the family of the wiki you want to work on, e.g. wikipedia, wiktionary, wikitravel, ... This will override the configuration in user-config.py. family
-user:xyz Log in as user 'xyz' instead of the default username. usernames
-daemonize:xyz Immediately return control to the terminal and redirect stdout and stderr to file xyz. (only use for bots that require no input from stdin).  
-help Show the help text.  
-log Enable the log file, using the default filename 'script_name-bot.log' Logs will be stored in the logs subdirectory. log
-log:xyz Enable the log file, using 'xyz' as the filename. logfilename
-nolog Disable the log file (if it is enabled by default).  
-maxlag Sets a new maxlag parameter to a number of seconds. Defer bot edits during periods of database server lag. Default is set by config.py maxlag
-putthrottle:n
-pt:n
-put_throttle:n
Set the minimum time (in seconds) the bot will wait between saving pages. put_throttle
-debug:item
-debug
Enable the log file and include extensive debugging data for component "item" (for all components if the second form is used). debug_log
-verbose
-v
Have the bot provide additional console output that may be useful in debugging. verbose_output
-cosmeticchanges
-cc
Toggles the cosmetic_changes setting made in config.py or user-config.py to its inverse and overrules it. All other settings and restrictions are untouched. cosmetic_changes
-simulate Disables writing to the server. Useful for testing and debugging of new code (if given, doesn't do any real changes, but only shows what would have been changed). simulate
-<config var>:n You may use all given numeric config variables as option and modify it with command line.  

Warning

The parameter -test disables the import and the bot prints the names of the pages that would be imported.

Since importing pages is an exceptional and potentially dangerous operation, it should be carried out carefully and tested in advance.

The -test parameter can help to find out which pages would be imported and what the target of the import would be.

However, it does not print the titles of the transcluded pages (e.g. templates) if -includealltemplates is set. This option is quite dangerous: if the title of an existing page on the home wiki clashes with the title of one of the linked pages, the existing page would be overwritten and the histories would be merged (if the imported version is newer).

Even if -overwrite is not set the linked page can be overwritten.

The -correspondingnamespace option is needed only if the namespaces on the two wikis do not correspond to one another.

correspondingnamespace and rootpage are mutually exclusive.

target and rootpage are mutually exclusive. (This combination does not seem to be feasible.)
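
The two exclusion rules above can be checked before a run is started. The following is a minimal sketch with hypothetical names (check_option_conflicts and the options dictionary are not part of the script); it only reports conflicts and does not reproduce the script's own validation.

```python
EXCLUSIVE_PAIRS = (
    ("correspondingnamespace", "rootpage"),
    ("target", "rootpage"),
)


def _is_set(value):
    # Namespace 0 is a valid value, so compare by identity instead of truthiness.
    return value is not None and value is not False


def check_option_conflicts(options):
    """Return one message per pair of mutually exclusive options in use."""
    errors = []
    for first, second in EXCLUSIVE_PAIRS:
        if _is_set(options.get(first)) and _is_set(options.get(second)):
            errors.append(f"-{first} and -{second} are mutually exclusive")
    return errors
```

An empty result means that none of the documented conflicts applies to the chosen options.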

If the target page already exists, the target page will be overwritten if -overwrite is set or skipped otherwise.

The module gives access to all parameters of the API (and special page).

However for most scenarios the following parameters should be avoided:

  • overwrite (by default set as False)
  • target (by default set as False)
  • includealltemplates (by default set as False)

transwikiimport.py is also compatible with the transferbot module (see its manual page).

Rights

transwikiimport.py requires an appropriate flag to be set on the account.

Special:Import itself can only be accessed by administrators, transwiki importers, or importers.

Interwikisource

The list of wikis that can be used as a source is defined in the variable $wgImportSources.

It can be viewed on Special:Import.
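
The allowed sources can presumably also be read through the API. The sketch below only builds a request URL for the MediaWiki paraminfo module, on the assumption that its description of the import module lists the values accepted for interwikisource. The function name import_paraminfo_url and the example endpoint are hypothetical.

```python
from urllib.parse import urlencode


def import_paraminfo_url(api_url):
    """Return a URL that asks the API to describe the 'import' module."""
    query = urlencode({
        "action": "paraminfo",
        "modules": "import",
        "format": "json",
    })
    return f"{api_url}?{query}"
```

For a wiki whose API endpoint is https://wiki.example.org/w/api.php, the resulting URL can be opened in a browser or fetched with any HTTP client; no login should be needed just to read the parameter description.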