Incremental dumps/File format/XML output
The XML output from incremental dumps should be exactly the same as the current XML dumps, with the following exceptions. Any exception not listed here is most likely a bug and should be reported.
The exceptions (from most serious to least):
- Revisions of a page are ordered by their ID in history dumps. The current XML dumps don't specify any order.
- The <restrictions> tag is omitted.
  The page_restrictions field in the database is not used anymore, so the <restrictions> tag doesn't provide accurate information about the restrictions of a page.
- The id attribute is missing for the <text> tag in stub dumps.
  This is currently used in the dump infrastructure for creating pages dumps, but is not useful to users.
- Comments that are 255 bytes long and end in an invalid UTF-8 sequence are shortened.
  In the current dumps, the invalid sequence is replaced with U+FFFD REPLACEMENT CHARACTER. In the XML produced by incremental dumps, the invalid sequence is removed.
  This applies only to the last character of full-length comments. In other cases, incremental dumps use U+FFFD REPLACEMENT CHARACTER, just like current dumps.
- Anonymous IPv6 contributors whose address is not in full form (i.e. it contains ::) will be normalized to full form. This should be very rare; the addresses should almost always be in full form already.
- The minor tag is consistently written as <minor /> (with space).
  In current dumps, this is inconsistent: pages dumps use <minor />, while stub dumps use <minor/>.
  This could affect users who read the dumps using regular expressions or similar methods; it doesn't make any difference for those who use XML parsers.
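For consumers that process the dumps without a full XML parser, the following Python sketch illustrates how the last three differences could be handled. It is not part of the dump tooling. All function names are hypothetical, and it assumes that the invalid sequence at the end of a 255-byte comment is a multi-byte character cut off by the length limit. The exact textual "full form" of IPv6 addresses is not specified here, so the sketch compares parsed addresses instead of strings.

```python
import codecs
import ipaddress
import re

MAX_COMMENT_BYTES = 255  # full-length comment, as described above

# Accepts both <minor/> and <minor /> (any amount of whitespace).
MINOR_TAG = re.compile(r"<minor\s*/>")


def is_minor(revision_xml: str) -> bool:
    """Return True if the revision fragment contains a minor tag."""
    return MINOR_TAG.search(revision_xml) is not None


def same_ip(a: str, b: str) -> bool:
    """Compare IPv6 addresses regardless of '::' compression or letter case."""
    return ipaddress.IPv6Address(a) == ipaddress.IPv6Address(b)


def decode_like_current(raw: bytes) -> str:
    """Every invalid sequence becomes U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def decode_like_incremental(raw: bytes) -> str:
    """Drop a cut-off trailing character of a full-length comment;
    replace any other invalid sequence with U+FFFD."""
    if len(raw) == MAX_COMMENT_BYTES:
        decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
        # final=False leaves an incomplete trailing sequence undecoded,
        # so it never reaches the output.
        return decoder.decode(raw, final=False)
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    truncated = b"a" * 254 + b"\xc3"  # first byte of a 2-byte character
    print(len(decode_like_current(truncated)))      # 255 (ends in U+FFFD)
    print(len(decode_like_incremental(truncated)))  # 254 (sequence removed)
    print(is_minor("<minor/>"), is_minor("<minor />"))
    print(same_ip("2001:DB8::1", "2001:db8:0:0:0:0:0:1"))
```

Comparing addresses through ipaddress.IPv6Address and matching the tag with optional whitespace keeps such code working against both the current and the incremental output.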