User:Brooke Vibber/Compacting the revision table round 2

Overview

edit

Per ongoing discussion in ArchCom and at WikiDev17 about performance, future requirements, and future-proofing for table size it's proposed to do a major overhaul of the revision table, combining the following improvements:

  • Normalization of frequently duplicated data to separate tables, reducing the dupe strings to integer keys
  • Separation of content-specific from general-revision metadata to support:
    • Multi-content revisions allowing for storing of multiple content blobs per revision -- not related to compaction, but of great interest for structured data additions planned for multimedia and articles
  • general reduction in revision table width / on-disk size will make schema changes easier in future
  • trying to avoid inconsistencies in live index deployments
    • ideally all indexes should fit on all servers, making it easier to switch database backend around in production

cf other & older notes:

Current

edit
revision
Column Type
rev_id int
rev_page -> page_id int
rev_text_id -> old_id int
rev_comment varbinary(767)
rev_user -> user_id int
rev_user_text varchar(255)
rev_timestamp binary(14)
rev_minor_edit tinyint
rev_deleted tinyint
rev_len int
rev_parent_id -> rev_id int
rev_sha1 varbinary(32)
rev_content_model varbinary(32)
rev_content_format varbinary(64)
archive
Column Type
ar_id int
ar_namespace int
ar_title varchar(255)
ar_text mediumblob
ar_comment varbinary(255)
ar_user int
ar_user_text varbinary(255)
ar_timestamp binary(14)
ar_minor_edit tinyint
ar_flags tinyblob
ar_rev_id int
ar_text_id int
ar_deleted tinyint
ar_len int
ar_page_id int
ar_parent_id int
ar_sha1 varbinary(32)
ar_content_model varbinary(32)
ar_content_format varbinary(64)

Provisional

edit

/tables.sql (needs updating)

revision
Column Type Change
rev_id bigint int -> bigint
rev_page -> page_id int
rev_comment_id -> comment_id bigint Moved to own table.
rev_actor -> actor_id bigint Separated user/user_text to own table.
rev_timestamp binary(14)
rev_minor_edit tinyint
rev_deleted tinyint
rev_parent_id -> rev_id bigint int -> bigint
archive
Column Type Change
ar_id bigint int -> bigint
ar_namespace int
ar_title varchar(255)
ar_page_id int
ar_rev_id bigint int -> bigint
ar_comment_id -> comment_id bigint Moved to own table.
ar_actor -> actor_id bigint Separated user/user_text to own table.
ar_timestamp binary(14)
ar_minor_edit tinyint
ar_deleted tinyint
ar_parent_id -> rev_id bigint int -> bigint
slots (>1 per revision)
Column Type Change
slot_revision -> rev_id/ar_rev_id bigint Link back to revision or archive
slot_content -> cont_id bigint Link to content
slot_role -> cr_id smallint Link to content_role
content (>1 per revision)
Column Type Change
cont_id bigint N/A
cont_address int Replaces rev_text_id.
cont_len int Moved; no other change.
cont_sha1 varbinary(32) Moved; no other change.
cont_model -> cm_id smallint Link to content_model
cont_format -> cf_id smallint Link to content_format
comment (1 per revision)
Column Type Change
comment_id bigint N/A
comment_text mediumblob varbinary(767) -> mediumblob (for phab:T6714)
actor (<<1 per revision)
Column Type Change
actor_id bigint N/A
actor_user -> user_id int Link to user; moved from rev_user
actor_text varchar(255) Moved from rev_user_text; no other change.
content_model (few; ~10)
Column Type Change
cm_id smallint N/A
cm_model varbinary(32) Moved; no other change.
content_format (few; ~5)
Column Type Change
cf_id smallint N/A
cf_format varbinary(64) Moved; no other change.
content_role (few; ~5)
Column Type Change
cr_id smallint N/A
cr_role varbinary(32) New concept for MCR.

TODO: consider further changes to archive table (full revdel?)

TODO: maybe switch text table around and change the text_id ref to a URL? (In progress of thought)

TODO: plan updates to other tables that have the user_text pattern.

Thoughts

edit
That seems like a lot of tables!
Most of them are the small tables for inlining strings -- content models, content formats, content slot roles for MCR, and user refs/IP addresses for actor. These should save a fair chunk of duplicated space. Additionally the MCR split between revision & content makes each of the two tables smaller and more malleable.
What happened to rev_text_id?
content.cont_address replaces it.
It may be an open question whether we want to make that change immediately, or whether to change the 'text' table as well, etc.
Why isn't rev_deleted moved to content?
rev_deleted is a bitfield and most of its options apply to things that aren't part of a Content object, such as the edit comment and the username. If separately "rev-deleting" just one content item is needed, a second bitfield or flag will need to be added on the content table too...
What about rev_len, rev_sha1 -- do they belong in content?
Not sure about this. Do we need to keep the fields for summing from multiple content objects?
How hard will it be to change queries?
Those that WHERE on rev_user/rev_user_text directly, or read fields directly, etc will need to be updated. :(
Things that just use Revision::*FromConds() and the accessor functions will be able to fall back to lazy loading without
Stuff that touches rev_text_id directly will need changing.
Stuff that wants to pre-join a bunch of data may need changing. May be able to add abstractions on Revision function to hide some of that, or build new abstractions that are less freaky.
Stuff that inserts manually will need updating; stuff that uses Revision accessors should remain safe, but should update to MCR-specific interfaces in future
What would a transition look like? What kind of background processing and what kind of downtime to expect?
See #Transition -- we'll need a transitionary mode in MediaWiki where a background process runs filling out the new tables. This may take some time -- several weeks sounds likely for the biggest sites. This may increase load on servers and will require additional disk space.
Most likely we will first add the new fields to revision, allowing them to be filled out without disturbing ongoing actions, and then remove the now redundant fields afterwards.
In principle, once the background process is complete, it should be possible to switch a wiki to read-only, flip its mode, and then switch back to read-write with little downtime for editors.
This could also allow quickly aborting/reverting to the previous state of revisions if things look bad in read-only... but if a rollback after going read-write is desired, that's a lot freakier to deal with.
What about manual queries (tool labs, etc) that might not want to update?
In principle a view can be created that replicates the classic view of the revision table... except for that text_id maybe. #Compatibility view
Space and performance concerns about indexes; possibility of "sharding" different sub-tables with specific indexes?
Could consider extra tables to support usage patterns, as long as they all fit on standard servers. #Denormalization
How will the archive table be transitioned?
In current model, archive may drop some legacy fields from ancient MediaWiki versions, which may require upgrade of existing rows in these circumstances: separation of inline text storage to 'text' table, setup of 'content' rows, and creation of an ar_rev_id for those that lack it (to be used in content slot role assignment).
It's possible we could make a bigger change where deleted rows stay in the 'revision' table (full 'revision deletion') but this is not yet decided.

Transition plan

edit

Clean up archive table (it's conceivable this has already been done, but expect surprises)

  • background process finishes any outstanding migrations on the archive table for deleted revs
    • migrate any inline text storage to text table
    • assign ar_rev_id if none present

Intermediate schema

  • intermediate schema will have all the old fields *and* all the new fields
  • apply intermediate schema on DB replicas while out of rotation, until all servers ready

Transition mode

  • keep reading from the old fields, but write to the new fields too when updating/inserting
  • set up the content model, content format, slot role entries.
  • background process churns through existing rows:
    • break out comment, actor entries
    • break out content entries and create slot rows assigning them to the revs

Provisional switchover

  • (?) activate full usage of the new fields, but write to the old fields to when updating/inserting
  • (?) gives us a chance to test without losing data

Full switchover

  • activate full usage of new fields, ignore use of the old fields

Final schema

  • drop the old fields on DB replicas out of rotation, until all servers applied.


Compatibility view

edit

Something like this could work for providing a back-compatible view of revision:

Main pain points are rev_text_id, if we change it to a full address in content it's harder to just use a reference. And whether to explicitly pull a main content slot or what...

create view revision_old
as select
rev_id,
rev_page,
0 as rev_text_id, -- XXXX no exact match?
comment_text as rev_comment,
actor_user as rev_user,
actor_text as rev_user_text,
rev_timestamp,
rev_minor_edit,
rev_deleted,
cont_len as rev_len,
rev_parent_id,
cont_sha1 as rev_sha1,
cm_name as rev_content_model,
cf_name as rev_content_format,
FROM revision
left join slots on slot_revision=rev_id and slot_role=1 -- ????
left join content on content_id=slot_content
left join content_model on cm_id=cont_model
left join content_format on cf_id=cont_format
left join actor on actor_id=rev_user_entry
left join comment on comment_id=rev_comment_id;

Denormalization

edit

Part of the space usage of the revision table is the several indexes on it:

CREATE INDEX /*i*/rev_page_id ON /*_*/revision (rev_page, rev_id); -- used for direct current-ver lookups
CREATE INDEX /*i*/rev_timestamp ON /*_*/revision (rev_timestamp); -- is this used directly? looks bogus
CREATE INDEX /*i*/page_timestamp ON /*_*/revision (rev_page,rev_timestamp); -- used for history, editing, lots of stuff
CREATE INDEX /*i*/user_timestamp ON /*_*/revision (rev_user,rev_timestamp); -- used for contribs
CREATE INDEX /*i*/usertext_timestamp ON /*_*/revision (rev_user_text,rev_timestamp); -- used for contribs
CREATE INDEX /*i*/page_user_timestamp ON /*_*/revision (rev_page,rev_user,rev_timestamp); -- what is this for? seems weird

(Question: are there any extra indexes added in production or tools that might be needed too?)

There might or might not be benefit to creating denormalized summary tables containing only a few keys rows and then the indexes needed, like this:

--
-- Summary table to isolate the usage of the indexes for Special:contributions
-- and ApiQueryUserContributions.
--
CREATE TABLE /*_*/contribs (
  -- key to rev_id
  contribs_revision bigint unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,

  -- Key to user_entry.ue_id of the user who made this edit.
  -- ue_user or ue_user_text will have the actual id or IP address.
  contribs_user_entry bigint unsigned NOT NULL default 0,

  -- The timestamp
  rev_timestamp binary(14) NOT NULL default '',
) /*$wgDBTableOptions*/;
CREATE INDEX /*i*/user_entry_timestamp ON /*_*/revision (rev_user_entry,rev_timestamp);

This can then be joined to the main tables to get the rest of the information. Does this make sense? We're not sure... yet :)

Questions and comments

edit

Use the talk page please