Incremental dumps/File format/Specification

This page describes the current format of the incremental dumps file. It is far from finished, which means the format can change daily and that this page can easily become out of date.

The format is binary; the file contains various objects and can also contain free space (remaining after deleted objects).

Data types

When reading an object, the type of the next piece of data is always known from context. This means that objects don't contain any information about which field comes next or what its type is.

The encoding for various data types is as follows:

integers: 1, 2, 4 and 6-byte unsigned integers are saved directly in little-endian order (6-byte integers are used to represent offsets in the file). Signed integers are first cast to unsigned integers of the same size and then saved the same way as unsigned integers.
timestamps: timestamps from 1 January 2000 to beyond 2100 with second accuracy are represented as 4-byte unsigned integers. The integer is not the number of seconds from the start date, but is instead directly calculated from parts of the date as (((((Year - 2000) * 12 + Month - 1) * 31 + Day - 1) * 24 + Hour) * 60 + Minute) * 60 + Second.
strings: strings are saved as the length of the string (n) followed by n bytes of its content. For short strings (those that are guaranteed to be at most 255 bytes long), the length is a 1-byte integer; for long strings, it is a 4-byte integer.
generic lists: lists are saved as a 4-byte count of items (n) followed by n items. The size of each item depends on its type and can be variable (e.g. for a list of strings). This is basically a representation of vector<T>.
generic maps: maps are saved as a 2-byte count of items (n) followed by n key-value pairs. This is basically a representation of map<TKey, TValue>.
generic pair: pairs are saved as the first item followed by the second item of the pair. Pairs are typically used as the value of a map. This is basically a representation of pair<T1, T2>.
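
The encodings above can be sketched in Python as follows. This is only an illustration, not the code used by the dump tools; all function names are made up for this example, and strings are kept as raw bytes because this page does not state their character encoding.

```python
import io


def write_uint(value: int, size: int) -> bytes:
    """Encode an unsigned integer of `size` bytes (1, 2, 4 or 6), little-endian."""
    return value.to_bytes(size, "little")


def write_int(value: int, size: int) -> bytes:
    """Encode a signed integer by reinterpreting it as unsigned of the same size."""
    return (value % (1 << (8 * size))).to_bytes(size, "little")


def read_uint(stream, size: int) -> int:
    data = stream.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of data")
    return int.from_bytes(data, "little")


def encode_timestamp(year, month, day, hour, minute, second) -> int:
    """Pack the date parts using the formula given for timestamps."""
    months = (year - 2000) * 12 + month - 1
    days = months * 31 + day - 1
    return ((days * 24 + hour) * 60 + minute) * 60 + second


def decode_timestamp(value: int):
    """Undo encode_timestamp by peeling off the parts in reverse order."""
    value, second = divmod(value, 60)
    value, minute = divmod(value, 60)
    value, hour = divmod(value, 24)
    value, day = divmod(value, 31)
    year, month = divmod(value, 12)
    return year + 2000, month + 1, day + 1, hour, minute, second


def write_string(data: bytes, long: bool = False) -> bytes:
    """Short strings use a 1-byte length, long strings a 4-byte length."""
    return write_uint(len(data), 4 if long else 1) + data


def read_string(stream, long: bool = False) -> bytes:
    length = read_uint(stream, 4 if long else 1)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of data")
    return data


def read_list(stream, read_item):
    """4-byte count followed by that many items."""
    return [read_item(stream) for _ in range(read_uint(stream, 4))]


def read_map(stream, read_key, read_value):
    """2-byte count followed by that many key-value pairs."""
    result = {}
    for _ in range(read_uint(stream, 2)):
        key = read_key(stream)  # the key is stored before its value
        result[key] = read_value(stream)
    return result
```

Because the type of each field is known from context, the caller picks the reader to use, for example `read_list(stream, read_string)` for a list of short strings.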

File header

The file header always starts at offset 0 and contains the offsets of the indexes, which can be used to access the data stored in the file.

4 bytes magic number: MWID
1 byte file format version: 1
1 byte data version: 1
1 byte dump kind flags:
- 0x01 for pages dump: a dump with revision text
- 0x02 for current dump: a dump without old revisions of pages
- 0x04 for articles dump: a dump that doesn't contain pages from talk namespaces and the User namespace
6 bytes offset to the end of the file
6 bytes offset to the root of the page id index
6 bytes offset to the root of the revision id index
6 bytes offset to the root of the text group id index
6 bytes offset to the root of the model & format index
6 bytes offset to the free space index
6 bytes offset to the site info object
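
As an illustration, the header layout above could be read like this. The sketch assumes that the magic number is stored as the four ASCII bytes of MWID; the `Header` type, its field names and the `KIND_*` constants are invented for this example.

```python
from typing import NamedTuple

# Dump kind flags, as listed in the header description.
KIND_PAGES = 0x01
KIND_CURRENT = 0x02
KIND_ARTICLES = 0x04

# 4 bytes magic + 3 single-byte fields + 7 six-byte offsets.
HEADER_SIZE = 4 + 3 + 7 * 6


class Header(NamedTuple):
    format_version: int
    data_version: int
    kind_flags: int
    end_of_file: int
    page_id_index_root: int
    revision_id_index_root: int
    text_group_id_index_root: int
    model_format_index_root: int
    free_space_index: int
    site_info: int


def parse_header(data: bytes) -> Header:
    if len(data) < HEADER_SIZE:
        raise ValueError("header is truncated")
    if data[:4] != b"MWID":  # assumption: magic number is plain ASCII
        raise ValueError("not an incremental dump file")
    offsets = [
        int.from_bytes(data[7 + 6 * i:13 + 6 * i], "little")
        for i in range(7)
    ]
    return Header(data[4], data[5], data[6], *offsets)
```

A reader would typically check `kind_flags & KIND_PAGES` right away, since that flag decides how revision objects have to be parsed.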

Site info

The site info object contains metadata about the whole wiki and its namespaces.

1 byte object kind: 0x21
short string name of the dump (e.g. enwiki)
short string timestamp of the dump
short string XML language code (e.g. en for English or cs for Czech)
short string site name
short string base URL
short string “generator”: the version of MediaWiki used
1 byte site case:
- 0x01 for first letter
- 0x02 for case sensitive
map of namespaces
- the key is a signed 2-byte integer namespace id
- the value is a pair of case (see above) and short string namespace name
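
A possible reader for the site info object is sketched below. Decoding strings as UTF-8 is an assumption (this page does not name the encoding), and the dictionary keys are made up for the example.

```python
import io


def read_exact(stream, size: int) -> bytes:
    data = stream.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of data")
    return data


def read_uint(stream, size: int) -> int:
    return int.from_bytes(read_exact(stream, size), "little")


def read_short_string(stream) -> str:
    # Assumption: strings are UTF-8 encoded.
    return read_exact(stream, read_uint(stream, 1)).decode("utf-8")


def read_site_info(stream) -> dict:
    if read_uint(stream, 1) != 0x21:
        raise ValueError("not a site info object")
    info = {}
    for field in ("name", "timestamp", "language",
                  "site_name", "base_url", "generator"):
        info[field] = read_short_string(stream)
    info["case"] = read_uint(stream, 1)
    namespaces = {}
    for _ in range(read_uint(stream, 2)):  # map: 2-byte count of pairs
        ns_id = int.from_bytes(read_exact(stream, 2), "little", signed=True)
        ns_case = read_uint(stream, 1)      # first item of the pair
        ns_name = read_short_string(stream)  # second item of the pair
        namespaces[ns_id] = (ns_case, ns_name)
    info["namespaces"] = namespaces
    return info
```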

Index

The file currently contains 5 indexes:

page id index maps 4 byte page ids to 6 byte offsets of the corresponding page object
revision id index maps 4 byte revision ids to 6 byte offsets of the corresponding revision object
text group id index maps 4 byte text group ids to 6 byte offsets of the corresponding text group object
model format index maps 1 byte synthetic id to a pair of short strings representing model and format
- this index is used to save space; using it, a revision's model and format can be represented as a single byte
free space index maps 6 byte offsets of free space blocks to their 4 byte lengths

The index is saved as a B-tree,[1] with leaf nodes on the last level and inner nodes on levels above.

A leaf node is saved as:

1 byte object kind: 0x01
map of keys to values

An inner node is saved as:

1 byte object kind: 0x02
2 bytes count (n)
n keys
(n + 1) 6 byte offsets to child nodes
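
The node layouts can be illustrated by a routine that walks a whole index and yields every key-value pair. It is only a sketch: it handles indexes whose keys and values are fixed-size integers (so not the model format index), and it visits every child instead of searching, because this page does not say which child holds a key that is equal to a key in an inner node. The function names are invented for the example.

```python
def read_uint(data: bytes, pos: int, size: int):
    """Return (value, new position) for a little-endian unsigned integer."""
    chunk = data[pos:pos + size]
    if len(chunk) != size:
        raise EOFError("unexpected end of data")
    return int.from_bytes(chunk, "little"), pos + size


def iter_index(data: bytes, offset: int, key_size: int = 4, value_size: int = 6):
    """Yield (key, value) pairs from the index node at `offset` and below."""
    kind, pos = read_uint(data, offset, 1)
    count, pos = read_uint(data, pos, 2)
    if kind == 0x01:  # leaf node: a map of keys to values
        for _ in range(count):
            key, pos = read_uint(data, pos, key_size)
            value, pos = read_uint(data, pos, value_size)
            yield key, value
    elif kind == 0x02:  # inner node: n keys, then n + 1 child offsets
        pos += count * key_size  # the keys are not needed for a full walk
        for _ in range(count + 1):
            child, pos = read_uint(data, pos, 6)
            yield from iter_index(data, child, key_size, value_size)
    else:
        raise ValueError("not an index node")
```

For the free space index, the same routine would be called with `key_size=6` and `value_size=4`.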

Page

The page object describes a single page and references its revisions. It is saved as:

1 byte object kind: 0x11
4 bytes page id
2 bytes namespace id
short string page title
short string redirect target; if it's empty, page is not a redirect
list of 4 byte ids of revisions of this page
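
A minimal sketch of reading a page object follows. It assumes that the namespace id is signed, as in the site info namespace map, and that titles are UTF-8; the function and key names are invented for the example.

```python
import io


def read_exact(stream, size: int) -> bytes:
    data = stream.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of data")
    return data


def read_uint(stream, size: int) -> int:
    return int.from_bytes(read_exact(stream, size), "little")


def read_short_string(stream) -> str:
    # Assumption: strings are UTF-8 encoded.
    return read_exact(stream, read_uint(stream, 1)).decode("utf-8")


def read_page(stream) -> dict:
    if read_uint(stream, 1) != 0x11:
        raise ValueError("not a page object")
    page = {"id": read_uint(stream, 4)}
    # Assumption: signed, matching the namespace ids in the site info map.
    page["namespace"] = int.from_bytes(read_exact(stream, 2), "little", signed=True)
    page["title"] = read_short_string(stream)
    redirect = read_short_string(stream)
    page["redirect"] = redirect or None  # empty string: not a redirect
    count = read_uint(stream, 4)         # generic list: 4-byte count
    page["revision_ids"] = [read_uint(stream, 4) for _ in range(count)]
    return page
```

Each revision id can then be looked up in the revision id index to find the offset of the corresponding revision object.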

Revision

The revision object describes a revision of a page. It is saved as:

1 byte object kind: 0x12
4 bytes revision id
1 byte flags (the applicable values are ORed together)
- 0x01: minor edit
- 0x02: the model of this revision is wikitext, the format is text/x-wiki
  - this is a special value to save an additional byte for the most common model & format
- 0x04: the contributor for this revision is not an IP-address anonymous user
- 0x08: the contributor is an IPv4 anonymous user
- 0x10: the contributor is an IPv6 anonymous user
- 0x20: the text of this revision was deleted
- 0x40: the comment of this revision was deleted
- 0x80: the contributor of this revision was deleted
4 byte id of the parent revision
4 byte timestamp of the revision
if the “contributor was deleted” flag is not set:
- embedded object describing the user who made this revision
if the “comment was deleted” flag is not set:
- short string comment
if the “model & format is wikitext” flag is not set:
- 1 byte id of model & format of this revision
if the “text was deleted” flag is not set:
- 20 bytes (little endian) SHA-1 of the revision
- if this is a pages dump (with text):
  - 4 bytes id of the text group that contains the text for this revision
  - 1 byte index into the text group of the text of this revision
- else (a stub dump):
  - 4 byte revision text length
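
Since most fields of a revision are conditional on the flags, a sketch of a reader may help. The names are invented for the example, strings and the SHA-1 are kept as raw bytes, and the embedded user object follows the layouts in the User section. Two points are assumptions: exactly one of the three contributor flags is expected to be set, and the caller passes `pages_dump` according to the 0x01 dump kind flag from the file header.

```python
import io

MINOR, WIKITEXT, NAMED_USER, IPV4, IPV6 = 0x01, 0x02, 0x04, 0x08, 0x10
TEXT_DELETED, COMMENT_DELETED, CONTRIBUTOR_DELETED = 0x20, 0x40, 0x80


def read_exact(stream, size: int) -> bytes:
    data = stream.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of data")
    return data


def read_uint(stream, size: int) -> int:
    return int.from_bytes(read_exact(stream, size), "little")


def read_short_string(stream) -> bytes:
    return read_exact(stream, read_uint(stream, 1))


def read_user(stream, flags: int) -> dict:
    if flags & IPV4:
        return {"ipv4": read_uint(stream, 4)}
    if flags & IPV6:
        return {"ipv6": read_exact(stream, 16)}
    if flags & NAMED_USER:
        return {"id": read_uint(stream, 4), "name": read_short_string(stream)}
    # Assumption: one of the three contributor flags is always set.
    raise ValueError("no contributor kind flag set")


def read_revision(stream, pages_dump: bool) -> dict:
    if read_uint(stream, 1) != 0x12:
        raise ValueError("not a revision object")
    rev = {"id": read_uint(stream, 4)}
    flags = rev["flags"] = read_uint(stream, 1)
    rev["parent_id"] = read_uint(stream, 4)
    rev["timestamp"] = read_uint(stream, 4)
    if not (flags & CONTRIBUTOR_DELETED):
        rev["user"] = read_user(stream, flags)
    if not (flags & COMMENT_DELETED):
        rev["comment"] = read_short_string(stream)
    if not (flags & WIKITEXT):
        rev["model_format_id"] = read_uint(stream, 1)
    if not (flags & TEXT_DELETED):
        rev["sha1"] = read_exact(stream, 20)
        if pages_dump:
            rev["text_group_id"] = read_uint(stream, 4)
            rev["text_index"] = read_uint(stream, 1)
        else:  # stub dump
            rev["text_length"] = read_uint(stream, 4)
    return rev
```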

Text group

A text group contains a group of revision texts compressed together to achieve better compression.

For saving, the texts are concatenated using the null character ('\0') as the delimiter. If a text is deleted from the group, it is replaced by the UTF-8 encoded Unicode noncharacter U+FFFF.

Specific texts from the group are accessed by a 1-byte index (which means a group can contain at most 256 texts).

A text group is saved as:

1 byte object kind: 0x31
long string LZMA-compressed string containing the texts from the group (see above for details)
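
The sketch below shows how a text could be extracted from a text group. Decompression is deliberately left out, because this page does not say which LZMA container is used; `get_text` therefore expects the already decompressed bytes. The function names are invented for the example.

```python
# UTF-8 encoding of U+FFFF, which marks a deleted text: b"\xef\xbf\xbf"
DELETED_TEXT = "\uffff".encode("utf-8")


def read_compressed_payload(data: bytes, offset: int) -> bytes:
    """Return the still-compressed long string of the text group at `offset`."""
    if data[offset] != 0x31:
        raise ValueError("not a text group object")
    length = int.from_bytes(data[offset + 1:offset + 5], "little")
    payload = data[offset + 5:offset + 5 + length]
    if len(payload) != length:
        raise EOFError("unexpected end of data")
    return payload


def get_text(decompressed: bytes, index: int):
    """Return the text at the 1-byte `index`, or None if it was deleted."""
    if not 0 <= index <= 255:
        raise IndexError("text index must fit into one byte")
    texts = decompressed.split(b"\0")
    if index >= len(texts):
        raise IndexError("no such text in this group")
    text = texts[index]
    return None if text == DELETED_TEXT else text
```

Accessing texts by position works the same way whether or not the data ends with a trailing null character, which this page leaves open.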

User

The contributor who made a revision can be saved in one of three possible formats.

For anonymous IPv4 editors:

4 byte integer of the IP address

For anonymous IPv6 editors:

16 bytes of the IP address

For other editors (including normal logged-in editors, historical anomalies, …):

4 byte integer user id (can be 0)
short string user name

Notes

[1] Technically, it's not a B-tree, because it doesn't follow the rules of a B-tree when it comes to deletion.
