Requests for comment/Schema update for multiple content objects per revision (MCR) in XML dumps

DRAFT for discussion

Request for comment (RFC)
Schema update for multiple content objects per revision (MCR) in XML dumps
Component General
Creation date
Author(s) ArielGlenn, Daniel Kinzler (WMDE)
Document status implemented
See Phabricator.

We need to update the XML export schema (https://www.mediawiki.org/xml/) so that it accommodates multiple content objects per revision.

Background

Currently, each revision is associated with one piece of content, which may reside directly in the text table or may be retrieved via an address in the text table pointing to some external storage cluster.

By October 1, 2018, Multi-Content Revisions [1] is expected to be writeable on Commons (citation needed); this means that each revision may be associated with multiple pieces of content, connected via entries in the slots table. These pieces of content may, as before, reside directly in the text table or be retrievable from some external storage cluster. In either case, a reference will now be stored in the content table.

XML dumps of page content with full revision history are made available every month [2] for various uses, including bots that fix up content, researchers that do analysis, and sites that maintain local or public mirrors of Wikimedia projects. Additionally, users may export collections of pages from Wikimedia projects as XML, using Special:Export. The schema for these dumps will need to be updated so that multiple pieces of content can be provided for a revision.

MCR introduces four tables whose data will need to be added to the dumps, either directly or as part of XML-formatted output: slots, content, content_models and slot_roles.

Problem

XML dumps of revision content are generated so that we re-use the previous dump content to the extent possible; this is faster than querying the database server for each content blob, and it avoids extra load on those servers. Thus, the content dumps are generated in two passes, first writing out all of the metadata for each piece of content (the so-called 'stub dumps'), and then writing out the content itself (the 'revision content dumps'). We should be sure that the new schema permits this.
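The two-pass approach described above can be sketched roughly as follows. This is a minimal illustration, not the actual dump code: `StubEntry`, `write_stubs`, `write_content`, the `fetch_blob` callback, and keying the previous dump by `content_id` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class StubEntry:
    """Metadata for one piece of content, as written in the first pass (hypothetical shape)."""
    rev_id: int
    content_id: int
    sha1: str


def write_stubs(rows: Iterable[Tuple[int, int, str]]) -> List[StubEntry]:
    """Pass 1 ('stub dump'): record metadata only; no content blob is loaded."""
    return [StubEntry(rev_id, content_id, sha1) for rev_id, content_id, sha1 in rows]


def write_content(
    stubs: Iterable[StubEntry],
    previous_dump: Dict[int, str],
    fetch_blob: Callable[[int], str],
) -> Iterator[Tuple[StubEntry, str]]:
    """Pass 2 ('revision content dump'): attach text to each stub.

    Text is taken from the previous dump when available (assumed here to be
    keyed by content_id); the database is queried only for content that the
    previous dump does not contain.
    """
    for stub in stubs:
        text = previous_dump.get(stub.content_id)
        if text is None:
            text = fetch_blob(stub.content_id)
        yield stub, text
```

The point of the split is that the second pass needs nothing beyond what the first pass wrote plus a way to look up content, so any new schema has to carry enough per-content metadata in the stubs for that lookup to work.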

The October 1, 2018 deadline is not far away. If this RFC were adopted, and code to generate dumps containing multi-content revisions were written and published by then without maintaining basic backwards compatibility, dumps users would have virtually no time to rewrite their tools or reconfigure their workflows for processing the new dumps.

It would be nice if the schema treated the content in all slots identically as to format, but doing this right away means that we'd break backwards compatibility; doing it later means folks would have to update their tools twice in a short (some months) period of time. Instead we have a compromise which will make everyone just a little bit unhappy.

Discussion/background reading

Proposal

The new tables to be accounted for are:

Table            Fields
content          content_id, content_size, content_sha1, content_model, content_address
slots            slot_revision_id, slot_role_id, slot_content_id, slot_origin
content_models   model_id, model_name
slot_roles       role_id, role_name

Of these, content_id, content_size, content_sha1 and content_model correspond to fields or attributes of the text in the existing dumps, so their values should simply be swapped in for those. The role_name corresponding to a given slot_role_id should be published, since it tracks a specific piece of content over multiple revisions. slot_origin should be published so that dumps users can easily see which pieces of content changed in a given revision, even in 'stubs' dumps. The remaining fields either duplicate other information or can be ignored.
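To illustrate why publishing slot_origin helps, here is a small sketch of how a dumps user might pick out the slots that a revision actually changed. It assumes that slot_origin holds the ID of the revision in which the slot's content first appeared, so that a slot is new in a revision exactly when slot_origin equals slot_revision_id; the `SlotRow` type and `changed_roles` function are hypothetical names for this example.

```python
from typing import Iterable, List, NamedTuple


class SlotRow(NamedTuple):
    """One row of the slots table, using the field names listed above."""
    slot_revision_id: int
    slot_role_id: int
    slot_content_id: int
    slot_origin: int


def changed_roles(rows: Iterable[SlotRow], rev_id: int) -> List[int]:
    """Return the role ids whose content was introduced by revision rev_id.

    Assumption: slot_origin is the id of the revision where the content
    originated, so inherited slots have slot_origin != slot_revision_id.
    """
    return sorted(
        row.slot_role_id
        for row in rows
        if row.slot_revision_id == rev_id and row.slot_origin == rev_id
    )
```

Because this check needs only metadata, it would work on a 'stubs' dump without loading any content.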

In general, id numbers associated with role names or content model names aren't useful to dump processors; we should avoid exposing those and use the full names, expecting that they will not change over time.
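A minimal sketch of that substitution, assuming the slot_roles and content_models tables have been loaded into plain id-to-name mappings. The function name `describe_slot`, the output keys, and the dictionary-based row representation are hypothetical, chosen only for illustration.

```python
from typing import Any, Dict, Mapping


def describe_slot(
    slot: Mapping[str, Any],
    content: Mapping[str, Any],
    role_names: Mapping[int, str],
    model_names: Mapping[int, str],
) -> Dict[str, Any]:
    """Build the per-slot metadata a dump would expose.

    Numeric role and model ids are replaced by their names; the ids
    themselves are deliberately left out of the result.
    """
    return {
        "role": role_names[slot["slot_role_id"]],
        "origin": slot["slot_origin"],
        "model": model_names[content["content_model"]],
        "content_id": content["content_id"],
        "size": content["content_size"],
        "sha1": content["content_sha1"],
    }
```

A consumer of such output never needs the slot_roles or content_models tables, which is the practical benefit of publishing names instead of ids.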

Revision changes

Under the current XML schema, pages are written out with one or all of their revisions; ordering is not specified but we assume that ordering is by revision id. We won't need to alter anything else, so only the portion of the schema dealing with revisions is shown here.

Schema

Below, the current and proposed new schemas:

Stubs dumps output

Below, the current, proposed transitional and proposed final schemas applied to the same revision and content, 'stubs' dump output (values for content and slot table fields are made up for this example, to show at least one extra chunk of content in the output):

Revision content dumps output

Below, the current, proposed transitional and proposed final schemas applied to the same revision and content, 'revision content dump' output (values for content and slot table fields are made up for this example):