User:Brooke Vibber/Compacting the revision table round 2

Overview

Per ongoing discussion in ArchCom and at WikiDev17 about performance, future requirements, and future-proofing for table size it's proposed to do a major overhaul of the revision table, combining the following improvements:

Normalization of frequently duplicated data to separate tables, reducing the dupe strings to integer keys
Separation of content-specific from general-revision metadata to support:
- Multi-content revisions allowing for storing of multiple content blobs per revision -- not related to compaction, but of great interest for structured data additions planned for multimedia and articles
general reduction in revision table width / on-disk size will make schema changes easier in future
trying to avoid inconsistencies in live index deployments
- ideally all indexes should fit on all servers, making it easier to switch database backend around in production

cf other & older notes:

Current

revision
Column	Type
rev_id	`int`
rev_page -> page_id	`int`
rev_text_id -> old_id	`int`
rev_comment	`varbinary(767)`
rev_user -> user_id	`int`
rev_user_text	`varchar(255)`
rev_timestamp	`binary(14)`
rev_minor_edit	`tinyint`
rev_deleted	`tinyint`
rev_len	`int`
rev_parent_id -> rev_id	`int`
rev_sha1	`varbinary(32)`
rev_content_model	`varbinary(32)`
rev_content_format	`varbinary(64)`

archive
Column	Type
ar_id	`int`
ar_namespace	`int`
ar_title	`varchar(255)`
ar_text	`mediumblob`
ar_comment	`varbinary(255)`
ar_user	`int`
ar_user_text	`varbinary(255)`
ar_timestamp	`binary(14)`
ar_minor_edit	`tinyint`
ar_flags	`tinyblob`
ar_rev_id	`int`
ar_text_id	`int`
ar_deleted	`tinyint`
ar_len	`int`
ar_page_id	`int`
ar_parent_id	`int`
ar_sha1	`varbinary(32)`
ar_content_model	`varbinary(32)`
ar_content_format	`varbinary(64)`

Provisional

/tables.sql (needs updating)

revision
Column	Type	Change
rev_id	`bigint`	`int` -> `bigint`
rev_page -> page_id	`int`	—
rev_comment_id -> comment_id	`bigint`	Moved to own table.
rev_actor -> actor_id	`bigint`	Separated `user`/`user_text` to own table.
rev_timestamp	`binary(14)`	—
rev_minor_edit	`tinyint`	—
rev_deleted	`tinyint`	—
rev_parent_id -> rev_id	`bigint`	`int` -> `bigint`

archive
Column	Type	Change
ar_id	`bigint`	`int` -> `bigint`
ar_namespace	`int`	—
ar_title	`varchar(255)`	—
ar_page_id	`int`	—
ar_rev_id	`bigint`	`int` -> `bigint`
ar_comment_id -> comment_id	`bigint`	Moved to own table.
ar_actor -> actor_id	`bigint`	Separated `user`/`user_text` to own table.
ar_timestamp	`binary(14)`	—
ar_minor_edit	`tinyint`	—
ar_deleted	`tinyint`	—
ar_parent_id -> rev_id	`bigint`	`int` -> `bigint`

slots (>1 per revision)
Column	Type	Change
slot_revision -> rev_id/ar_rev_id	`bigint`	Link back to revision or archive
slot_content -> cont_id	`bigint`	Link to content
slot_role -> cr_id	`smallint`	Link to content_role

content (>1 per revision)
Column	Type	Change
cont_id	`bigint`	N/A
cont_address	`int`	Replaces rev_text_id.
cont_len	`int`	Moved; no other change.
cont_sha1	`varbinary(32)`	Moved; no other change.
cont_model -> cm_id	`smallint`	Link to content_model
cont_format -> cf_id	`smallint`	Link to content_format

comment (1 per revision)
Column	Type	Change
comment_id	`bigint`	N/A
comment_text	`mediumblob`	varbinary(767) -> mediumblob (for phab:T6714)

actor (<<1 per revision)
Column	Type	Change
actor_id	`bigint`	N/A
actor_user -> user_id	`int`	Link to user; moved from `rev_user`
actor_text	`varchar(255)`	Moved from `rev_user_text`; no other change.

content_model (few; ~10)
Column	Type	Change
cm_id	`smallint`	N/A
cm_model	`varbinary(32)`	Moved; no other change.

content_format (few; ~5)
Column	Type	Change
cf_id	`smallint`	N/A
cf_format	`varbinary(64)`	Moved; no other change.

content_role (few; ~5)
Column	Type	Change
cr_id	`smallint`	N/A
cr_role	`varbinary(32)`	New concept for MCR.

TODO: consider further changes to archive table (full revdel?)

TODO: maybe switch text table around and change the text_id ref to a URL? (In progress of thought)

TODO: plan updates to other tables that have the user_text pattern.

Thoughts

That seems like a lot of tables!: Most of them are the small tables for inlining strings -- content models, content formats, content slot roles for MCR, and user refs/IP addresses for actor. These should save a fair chunk of duplicated space. Additionally the MCR split between revision & content makes each of the two tables smaller and more malleable.
What happened to rev_text_id?: content.cont_address replaces it.; It may be an open question whether we want to make that change immediately, or whether to change the 'text' table as well, etc.
Why isn't rev_deleted moved to content?: rev_deleted is a bitfield and most of its options apply to things that aren't part of a Content object, such as the edit comment and the username. If separately "rev-deleting" just one content item is needed, a second bitfield or flag will need to be added on the content table too...
What about rev_len, rev_sha1 -- do they belong in content?: Not sure about this. Do we need to keep the fields for summing from multiple content objects?
How hard will it be to change queries?: Those that WHERE on rev_user/rev_user_text directly, or read fields directly, etc will need to be updated. :(; Things that just use Revision::*FromConds() and the accessor functions will be able to fall back to lazy loading without; Stuff that touches rev_text_id directly will need changing.; Stuff that wants to pre-join a bunch of data may need changing. May be able to add abstractions on Revision function to hide some of that, or build new abstractions that are less freaky.; Stuff that inserts manually will need updating; stuff that uses Revision accessors should remain safe, but should update to MCR-specific interfaces in future
What would a transition look like? What kind of background processing and what kind of downtime to expect?: See #Transition -- we'll need a transitionary mode in MediaWiki where a background process runs filling out the new tables. This may take some time -- several weeks sounds likely for the biggest sites. This may increase load on servers and will require additional disk space.; Most likely we will first add the new fields to revision, allowing them to be filled out without disturbing ongoing actions, and then remove the now redundant fields afterwards.; In principle, once the background process is complete, it should be possible to switch a wiki to read-only, flip its mode, and then switch back to read-write with little downtime for editors.; This could also allow quickly aborting/reverting to the previous state of revisions if things look bad in read-only... but if a rollback after going read-write is desired, that's a lot freakier to deal with.
What about manual queries (tool labs, etc) that might not want to update?: In principle a view can be created that replicates the classic view of the revision table... except for that text_id maybe. #Compatibility view
Space and performance concerns about indexes; possibility of "sharding" different sub-tables with specific indexes?: Could consider extra tables to support usage patterns, as long as they all fit on standard servers. #Denormalization
How will the archive table be transitioned?: In current model, archive may drop some legacy fields from ancient MediaWiki versions, which may require upgrade of existing rows in these circumstances: separation of inline text storage to 'text' table, setup of 'content' rows, and creation of an ar_rev_id for those that lack it (to be used in content slot role assignment).; It's possible we could make a bigger change where deleted rows stay in the 'revision' table (full 'revision deletion') but this is not yet decided.

Transition plan

Clean up archive table (it's conceivable this has already been done, but expect surprises)

background process finishes any outstanding migrations on the archive table for deleted revs
- migrate any inline text storage to text table
- assign ar_rev_id if none present

Intermediate schema

intermediate schema will have all the old fields *and* all the new fields
apply intermediate schema on DB replicas while out of rotation, until all servers ready

Transition mode

keep reading from the old fields, but write to the new fields too when updating/inserting
set up the content model, content format, slot role entries.
background process churns through existing rows:
- break out comment, actor entries
- break out content entries and create slot rows assigning them to the revs

Provisional switchover

(?) activate full usage of the new fields, but write to the old fields to when updating/inserting
(?) gives us a chance to test without losing data

Full switchover

activate full usage of new fields, ignore use of the old fields

Final schema

drop the old fields on DB replicas out of rotation, until all servers applied.

Compatibility view

Something like this could work for providing a back-compatible view of revision:

Main pain points are rev_text_id, if we change it to a full address in content it's harder to just use a reference. And whether to explicitly pull a main content slot or what...

create view revision_old
as select
rev_id,
rev_page,
0 as rev_text_id, -- XXXX no exact match?
comment_text as rev_comment,
actor_user as rev_user,
actor_text as rev_user_text,
rev_timestamp,
rev_minor_edit,
rev_deleted,
cont_len as rev_len,
rev_parent_id,
cont_sha1 as rev_sha1,
cm_name as rev_content_model,
cf_name as rev_content_format,
FROM revision
left join slots on slot_revision=rev_id and slot_role=1 -- ????
left join content on content_id=slot_content
left join content_model on cm_id=cont_model
left join content_format on cf_id=cont_format
left join actor on actor_id=rev_user_entry
left join comment on comment_id=rev_comment_id;

Denormalization

Part of the space usage of the revision table is the several indexes on it:

CREATE INDEX /*i*/rev_page_id ON /*_*/revision (rev_page, rev_id); -- used for direct current-ver lookups
CREATE INDEX /*i*/rev_timestamp ON /*_*/revision (rev_timestamp); -- is this used directly? looks bogus
CREATE INDEX /*i*/page_timestamp ON /*_*/revision (rev_page,rev_timestamp); -- used for history, editing, lots of stuff
CREATE INDEX /*i*/user_timestamp ON /*_*/revision (rev_user,rev_timestamp); -- used for contribs
CREATE INDEX /*i*/usertext_timestamp ON /*_*/revision (rev_user_text,rev_timestamp); -- used for contribs
CREATE INDEX /*i*/page_user_timestamp ON /*_*/revision (rev_page,rev_user,rev_timestamp); -- what is this for? seems weird

(Question: are there any extra indexes added in production or tools that might be needed too?)

There might or might not be benefit to creating denormalized summary tables containing only a few keys rows and then the indexes needed, like this:

--
-- Summary table to isolate the usage of the indexes for Special:contributions
-- and ApiQueryUserContributions.
--
CREATE TABLE /*_*/contribs (
  -- key to rev_id
  contribs_revision bigint unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,

  -- Key to user_entry.ue_id of the user who made this edit.
  -- ue_user or ue_user_text will have the actual id or IP address.
  contribs_user_entry bigint unsigned NOT NULL default 0,

  -- The timestamp
  rev_timestamp binary(14) NOT NULL default '',
) /*$wgDBTableOptions*/;
CREATE INDEX /*i*/user_entry_timestamp ON /*_*/revision (rev_user_entry,rev_timestamp);

This can then be joined to the main tables to get the rest of the information. Does this make sense? We're not sure... yet :)

Questions and comments

Use the talk page please