User:Brooke Vibber/Compacting the revision table


provisional work in progress

The revision table in the MediaWiki database is a bottleneck for performance and storage on large sites. It's frequently queried along several different access patterns, carries a number of indexes and some variable-size data, and is traditionally seen as difficult to shard or partition.

TODO add details of table size in production here

Row size

'Wide' rows increase on-disk storage and in-memory buffering needs, as well as the time and performance expense of altering the table. By reducing the size of the rows, even a very 'tall' table with many rows will be more compact and should put less strain on the database servers.

Columns that are not needed, or can be stored more efficiently in different ways, are candidates for improvement.

Things that are particularly big now:

  • rev_comment
    • Sometimes very large; more often moderate in size with lots of duplication (section titles or common shortcuts), or empty
    • Never used as a lookup key.
  • rev_user_text
    • Lots of duplication and it's often literally a copy of user.user_name.
    • Used as a key for Contributions lookups, adding to index size.
    • QUESTION: are prefix searches ever done on this for IP ranges? If so they'd be indexed awkwardly.
  • rev_content_model, rev_content_type
    • Lots of duplication of fixed strings, not used as a lookup key.
    • There is already a plan in place to deduplicate these into indirected integers.
      • TODO add reference

String indirection

One strategy for compacting rows is indirecting potentially large string fields to a separate table, indexed by a more compact integer.

This can provide de-duplication of common strings, reducing total storage and buffering requirements, and the join can sometimes be skipped if the indirected field is not needed.
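
A minimal sketch of the pattern, using the usertext mapping proposed below (column sizes and key names are illustrative assumptions, not final):

    CREATE TABLE usertext (
      ut_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      ut_text VARBINARY(255) NOT NULL,
      UNIQUE KEY ut_text (ut_text)
    ) ENGINE=InnoDB;

    -- revision then stores only the integer rev_user_text_id, and the
    -- string is recovered with a join when it's actually needed:
    SELECT rev_id, ut_text AS rev_user_text
    FROM revision
    JOIN usertext ON ut_id = rev_user_text_id
    WHERE rev_id = 12345;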

Compatibility concerns

Because existing fields will be dropped, existing code that touches the revision table directly will need to be updated to include joins. Tools operating on replicas will also need to be updated, potentially a large burden for community tools provided by volunteers.

To aid in conversion, a view could be created that looks like the old revision table layout, pulling from the new, more compact revision table and the indirected strings.
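
For instance, assuming the revmeta layout proposed below, the compatibility view might look roughly like this (column list abbreviated; names are the proposal's, not final):

    CREATE VIEW revision AS
    SELECT
      rm.rev_id,
      rm.rev_page,
      -- ... other revmeta columns carried through unchanged ...
      c.c_text   AS rev_comment,
      ut.ut_text AS rev_user_text
    FROM revmeta rm
    JOIN comment  c  ON c.c_id   = rm.rev_comment_id
    JOIN usertext ut ON ut.ut_id = rm.rev_user_text_id;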

TODO do a quick audit and see if central places will take care of most of this

TODO performance concerns about views? Would it perform well enough to just keep existing code as-is?

Proposal

Schema

Create a revmeta table with all the fields from revision, except with these fields turned into string indirections (a DDL sketch follows the list):

  • rev_comment_id -> comment.c_id, comment.c_text
  • rev_user_text_id -> usertext.ut_id, usertext.ut_text
  • rev_content_model_id -> contentmodel.cm_id, contentmodel.cm_text
  • rev_content_type_id -> contenttype.ct_id, contenttype.ct_text
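
A rough DDL sketch of the indirected columns (all other revision fields would be carried over unchanged; exact types are assumptions):

    CREATE TABLE revmeta (
      rev_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
      -- ... all other revision fields carried over unchanged ...
      rev_comment_id       BIGINT UNSIGNED NOT NULL, -- -> comment.c_id
      rev_user_text_id     BIGINT UNSIGNED NOT NULL, -- -> usertext.ut_id
      rev_content_model_id INT UNSIGNED NOT NULL,    -- -> contentmodel.cm_id
      rev_content_type_id  INT UNSIGNED NOT NULL     -- -> contenttype.ct_id
    ) ENGINE=InnoDB;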

Separate tables for each type are proposed for locality, though all string indirections could use the same table in theory.

De-duplicating the revision comments will require looking up matching comments, and a large index on the comment text may be counterproductive; consider adding a hash field or using a limited-length (prefix) index.
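
For example, the comment table could carry a hash column so that lookups never need an index on the full text (the hash choice and sizes here are assumptions):

    CREATE TABLE comment (
      c_id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      c_hash INT UNSIGNED NOT NULL, -- e.g. CRC32 of c_text
      c_text BLOB NOT NULL,
      KEY c_hash (c_hash)
    ) ENGINE=InnoDB;

    -- find an existing copy: narrow candidates by hash, then verify
    -- the full text to weed out collisions
    SELECT c_id FROM comment
    WHERE c_hash = CRC32('/* trivia */')
      AND c_text = '/* trivia */';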

TODO: best practices for string indirections?

Transition

Because existing fields will be dropped, this can't be done with just an online ALTER TABLE and a master switch.

A possible "mostly-online" deployment with a brief switchover period:

  • Add code to MediaWiki core with support for old revision and new revmeta layouts, with a config switch.
    • revmeta table will be the new compact version
    • usertext and revcomment (?) tables to hold string indirection
  • Run in a transitional mode where we primarily use the classic revision table:
    • Run a background process that crawls classic revision and upserts to the new tables
    • Have code that inserts/updates into revision also upsert to the new tables (see the sketch after this list)
  • Once conversion is complete, switch to read-only for maintenance:
    • rename revision out of the way for later archival/removal
    • create revision compatibility view onto the new tables
  • Switch to using new table layout and disable read-only.
  • Indirection tables could be pre-seeded "online" while running on the old schema, then when a switchover happens only the compact revmeta rows remain to be filled in.
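
A sketch of the get-or-insert step that the background crawler and the live write path could share (names from the schema sketches above; not race-safe as written, but a race here only creates a harmless duplicate string, not incorrect data):

    -- look for an existing copy of the comment text
    SELECT c_id FROM comment
    WHERE c_hash = CRC32('fix typo') AND c_text = 'fix typo';

    -- if no row came back, insert one
    INSERT INTO comment (c_hash, c_text)
    VALUES (CRC32('fix typo'), 'fix typo');

    -- then upsert the revmeta row pointing at the resulting c_id
    INSERT INTO revmeta (rev_id, rev_comment_id /* , ... */)
    VALUES (12345, LAST_INSERT_ID() /* , ... */)
    ON DUPLICATE KEY UPDATE rev_comment_id = VALUES(rev_comment_id);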

Major deployment concerns

  • requires room for the old tables and the new tables during transition
  • increases write activity on live databases
    • the background process can be throttled as necessary, but live upserts will be doubled.

Row count

Alternatively, or in combination, we could attack the large row count by partitioning/sharding the revision table.

Basic theory

Many rows of revision data will be relatively rarely used, especially older data, but our schema lumps the entire 15-year history of Wikipedia into one data set that must sit on the same disk and share the same caching.

Possible alternative: multiple tables with the same layout and same rev_id space but separate storage/buffering. These separate tables can be queried in the same way and results combined as if they're a single table in a single server.
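
A sketch of how a page-history query might combine results across two such partitions (table names hypothetical):

    (SELECT rev_id, rev_timestamp FROM revision_hot  WHERE rev_page = 123)
    UNION ALL
    (SELECT rev_id, rev_timestamp FROM revision_cold WHERE rev_page = 123)
    ORDER BY rev_timestamp DESC
    LIMIT 50;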

Partitioning axes

  • 'hot' vs 'cold' by time/id/current-ness
  • along namespaces, types, page update frequency, or other boundaries

Compatibility

If the partitioned tables can live on the same database server, a union view can probably be provided that is compatible with the existing revision table for many purposes. However, such a view may or may not be efficient to pull from, and may not provide the performance or storage-space characteristics desired from partitioning.
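
Such a view might be as simple as the following (partition names hypothetical). Note that MySQL can't use the MERGE algorithm for a view containing UNION, so queries against it go through a temporary table, which is part of the efficiency concern:

    CREATE VIEW revision AS
    SELECT * FROM revision_hot
    UNION ALL
    SELECT * FROM revision_cold;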

If the separate tables live on separate servers, life is harder: code must fetch from multiple databases (serial? parallel?) and possibly combine/interleave results. Potentially a lot of code updates required.

TODO check MySQL proxying facilities to see if we can present the multiple tables as a single table for queries

Proposal

None yet on this end.