Manual:Importing XML dumps
This page describes methods to import XML dumps. XML dumps contain the content of a wiki (wiki pages with all their revisions), without the site-related data. An XML dump does not create a full backup of the wiki database; the dump does not contain user accounts, images, edit logs, etc.
The Special:Export page of any MediaWiki site, including any Wikimedia site and Wikipedia, creates an XML file (content dump). See meta:Data dumps and Manual:DumpBackup.php. XML files are explained in more detail on meta:Help:Export.
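If you administer the source wiki yourself, such a dump can also be generated from the command line with dumpBackup.php. A minimal sketch (the maintenance path and output file name are examples, not fixed names):

# Run from the wiki installation directory; adjust paths for your setup.
php maintenance/dumpBackup.php --current > dump.xml   # current revisions only
php maintenance/dumpBackup.php --full > dump.xml      # full history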
What to import?
How to import?
There are several methods for importing XML dumps.
Importing very large XML dumps (such as the English Wikipedia)
This section is empty. You can help MediaWiki.org by expanding it.
Utiliser Special:Import
Special:Import can be used by wiki users with the import permission (by default, users in the sysop group) to import a small number of pages (about 100 should be safe). Trying to import large dumps this way may result in timeouts or connection failures.
- See meta:Help:Import for a detailed description.[1]
- Large files might be rejected due to limits in the PHP configuration; see meta:Help:Import#Large-scale_transfer.
You are asked to give an interwiki prefix. For instance, if you exported from the English Wikipedia, you have to type 'en'.
Changing permissions
See Manual:User rights.
To allow all registered editors to import (not recommended), the lines to add to "LocalSettings.php" would be:
$wgGroupPermissions['user']['import'] = true;
$wgGroupPermissions['user']['importupload'] = true;
Possible problems
To use transwiki import, PHP safe_mode must be off and open_basedir must be empty (both are settings in php.ini). Otherwise the import fails.
If you get errors like this:
Warning: XMLReader::open(): Unable to open source data in /.../wiki/includes/Import.php on line 53
Warning: XMLReader::read(): Load Data before trying to read in /.../wiki/includes/Import.php on line 399
and Special:Import shows "Import failed: Expected <mediawiki> tag, got ", this may be caused by a fatal error on a previous import, which leaves libxml in a broken state across the entire server, or by another PHP script on the same server having disabled the entity loader (a PHP bug).
This happens on MediaWiki versions prior to 1.26. The solution is to restart the web server service (Apache, etc.), or to write and execute a script that calls libxml_disable_entity_loader(false); (see task T86036).
Using importDump.php, if you have shell access
- Recommended method for general use, but slow for very big data sets.
- See: Manual:importDump.php, including tips on how to use it for large wikis.
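As a rough sketch (the file names and paths here are placeholders; see Manual:importDump.php for the options available in your version), a typical run from the wiki installation directory looks like:

# Import the dump (plain XML or gzipped), then update recent changes.
php maintenance/importDump.php --conf LocalSettings.php dump.xml
php maintenance/rebuildrecentchanges.php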
Using the importTextFiles.php maintenance script
MediaWiki version: ≤ 1.23
MediaWiki version: ≥ 1.27
If you have a lot of content converted from another source (several word processor files, content from another wiki, etc), you may have several files that you would like to import into your wiki. In MediaWiki 1.27 and later, you can use the importTextFiles.php maintenance script.
You can also use the edit.php maintenance script for this purpose.
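For illustration, a sketch of both scripts (the user name, summary, and file names are placeholders; check php maintenance/<script>.php --help for the options supported by your MediaWiki version):

# importTextFiles.php creates one page per file, deriving each title from the file name.
php maintenance/importTextFiles.php --user "Admin" --summary "Initial import" *.txt

# edit.php edits (or creates) a single page, reading the wikitext from standard input.
php maintenance/edit.php -u "Admin" -s "Initial import" "Page title" < PageContent.txt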
rebuildall.php
For large XML dumps, you can run rebuildall.php, but it will take a long time, because it has to parse all pages. It is not recommended for large data sets.
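If you do run it, the invocation itself takes no arguments (run from the wiki installation directory):

php maintenance/rebuildall.php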
Using pywikibot, pagefromfile.py and Nokogiri
pywikibot is a collection of tools written in Python that automate work on Wikipedia or other MediaWiki sites. Once installed on your computer, you can use the specific tool 'pagefromfile.py', which lets you upload a wiki file to Wikipedia or other MediaWiki sites. The XML file created by dumpBackup.php can be transformed into a wiki file suitable for processing by 'pagefromfile.py' using a simple Ruby program similar to the following (the program transforms all XML files in the current directory, which is needed if your MediaWiki site is a family):
# -*- coding: utf-8 -*-
# dumpxml2wiki.rb
require 'rubygems'
require 'nokogiri'
# This program dumpxml2wiki reads MediaWiki xml files dumped by dumpBackup.php
# on the current directory and transforms them into wiki files which can then
# be modified and uploaded again by pywikipediabot using pagefromfile.py on a MediaWiki family site.
# The text of each page is searched with xpath and its title is added on the first line as
# an html comment: this is required by pagefromfile.py.
#
Dir.glob("*.xml").each do |filename|
  input = Nokogiri::XML(File.read(filename), nil, 'UTF-8')
  puts filename # print the name of each .xml file as it is processed
  File.open("out_" + filename + ".wiki", 'w') do |f|
    input.xpath("//xmlns:text").each do |n|
      # The title lives in the enclosing <page> element (text -> revision -> page);
      # the export format is namespaced, hence the xmlns: prefix.
      title = n.parent.parent.at_xpath("xmlns:title").content
      f.puts "\n{{-start-}}<!--'''#{title}'''-->#{n.content}\n{{-stop-}}"
    end
  end
end
For example, here is an excerpt of a wiki file produced by the command 'ruby dumpxml2wiki.rb' (two pages that can then be uploaded by pagefromfile.py: a template and a second page, which is a redirect):
{{-start-}}<!--'''Template:Lang_translation_-pl'''--><includeonly>Tłumaczenie</includeonly>
{{-stop-}}
{{-start-}}#REDIRECT[[badania demograficzne]]<!--'''ilościowa demografia'''-->
<noinclude>
[[Category:Termin wielojęzycznego słownika demograficznego (pierwsze wydanie)|ilościowa demografia]]
[[Category:Termin wielojęzycznego słownika demograficznego (pierwsze wydanie) (redirect)]]
[[Category:10]]</noinclude>
{{-stop-}}
The program accesses each XML file, extracts the text within the <text> </text> markup of each page, looks up the corresponding title in the enclosing page element, and encloses them in the paired {{-start-}}<!--'''Title of the page'''--> {{-stop-}} commands used by 'pagefromfile' to create or update a page. The name of the page is placed in an HTML comment, surrounded by three quotes, on the same first start line. Note that the name of the page can be written in Unicode. Sometimes it is important that the page starts directly with a command, as for a #REDIRECT; in that case the comment giving the name of the page must come after the command, but still on the first line.
Note that the XML dump files produced by dumpBackup.php declare a default namespace:
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
In order to access the text node using Nokogiri, you need to prefix your path with 'xmlns':
input.xpath("//xmlns:text")
Nokogiri is an HTML, XML, SAX, and Reader parser that can search documents via XPath or CSS3 selectors; it belongs to the newer generation of XML parsers for Ruby.
Example of the use of 'pagefromfile' to upload the output wiki text file:
python pagefromfile.py -file:out_filename.wiki -summary:"Reason for changes" -lang:pl -putthrottle:01
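With recent Pywikibot (core) releases, scripts are normally run through the pwb.py wrapper, so the equivalent call would look like the following (verify against your Pywikibot installation):

python pwb.py pagefromfile -file:out_filename.wiki -summary:"Reason for changes" -lang:pl -putthrottle:01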
How to import logs?
Exporting and importing logs with the standard MediaWiki scripts often proves very hard; an alternative for importing is the pages_logging.py script in the WikiDAT tool, as suggested by Felipe Ortega.
Troubleshooting
Merging histories, revision conflicts, editing summaries, and other difficulties
Interwiki links
If you get the message
Page "meta:Blah blah" is not imported because its name is reserved for external linking (interwiki).
the problem is that some of the pages to be imported have a prefix that is also used for interwiki linking. For example, pages with the prefix 'Meta:' would conflict with the interwiki prefix meta:, which by default links to https://meta.wikimedia.org.
You can do any of the following.
- Remove the prefix from the interwiki table. This will preserve page titles, but prevent interwiki linking through that prefix.
- Example: you will preserve page titles 'Meta:Blah blah' but will not be able to use the prefix 'meta:' to link to meta.wikimedia.org (although it will be possible through a different prefix).
- How to do it: before importing the dump, run the query
DELETE FROM interwiki WHERE iw_prefix='prefix'
(note: do not include the colon in the prefix). Alternatively, if you have enabled editing the interwiki table, you can simply go to Special:Interwiki and click the 'Delete' link on the right side of the row belonging to that prefix.
- Replace the unwanted prefix in the XML file with "Project:" before importing. This will preserve the functionality of the prefix as an interwiki link, but will replace the prefix in the page titles with the name of the wiki they are imported into, and might be quite a pain to do on large dumps (a scripted example is given after this list).
- Example: replace all 'Meta:' with 'Project:' in the XML file. MediaWiki will then replace 'Project:' with the name of your wiki during importing.
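For moderately sized dumps, that replacement can be scripted. A minimal sketch with GNU sed (the file name is a placeholder, and only page titles are rewritten here; links in the page text keep the original prefix):

# Rewrite the conflicting prefix in page titles before importing.
sed -i 's/<title>Meta:/<title>Project:/g' dump.xml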
See also
- Manual:DumpBackup.php
- meta:Data dumps/Other tools
- Manual:System administration#Getting data
- Manual:Configuring file uploads#Set maximum size for file uploads – May come in handy if you are doing massive imports
- Manual:Errors and Symptoms#Fatal error: Allowed memory size of nnnnnnn bytes exhausted .28tried to allocate nnnnnnnn bytes.29 – Settings that may need to be changed if you are doing massive imports
- Manual:ImportImages.php - for importing images.
- Manual:Importing external content
- Manual:Importing Wikipedia infoboxes tutorial
References
- ↑ See Manual:XML Import file manipulation in CSharp for a C# code sample that manipulates an XML import file.