User:Brooke Vibber/Compacting the revision table
provisional work in progress
The revision table in the MediaWiki database is a bottleneck for performance and storage on large sites. It's frequently used along several different access patterns, has a bunch of keys and some variable-sized data, and is traditionally seen as difficult to shard or partition.
TODO add details of table size in production here
Row size
On-disk storage and in-memory buffering needs, and the time/performance expense of alterations to the table, are increased by tables having 'wide' rows. Reducing the size of the rows means that even a very 'tall' table with many rows will be more compact, which should reduce strain on database servers.
Columns that are not needed, or can be stored more efficiently in different ways, are candidates for improvement.
Things that are particularly big now:
rev_comment
- Sometimes very large; often moderate in size with lots of duplication (section titles or common shortcuts), or empty.
- Never used as a lookup key.
rev_user_text
- Lots of duplication and it's often literally a copy of user.user_name.
- Used as a key for Contributions lookups, adding to index size.
- QUESTION: are prefix searches ever done on this for IP ranges? If so, they'd be indexed awkwardly.
rev_content_model, rev_content_type
- Lots of duplication of fixed strings, not used as a lookup key.
- There is already a plan in place to deduplicate these into indirected integers.
- TODO add reference
String indirection
One strategy for compacting rows is indirecting potentially large string fields to a separate table, indexed by a more compact integer.
This can provide de-duplication of common strings, reducing total storage and buffering requirements, and the join can sometimes be skipped if the indirected field is not needed.
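As a minimal sketch of the access pattern (hypothetical names, anticipating the tables proposed below), a query that needs the comment joins through the integer id, while a metadata-only query skips the join entirely:

 -- History listing including comments, via the indirection join (illustrative names).
 SELECT r.rev_id, r.rev_timestamp, c.c_text AS rev_comment
   FROM revmeta r
   JOIN comment c ON c.c_id = r.rev_comment_id
  WHERE r.rev_page = 12345
  ORDER BY r.rev_timestamp DESC;

 -- Metadata-only queries never touch the comment strings at all.
 SELECT rev_id, rev_timestamp FROM revmeta WHERE rev_page = 12345;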
Compatibility concerns
Because existing fields will be dropped, existing code that touches the revision table directly will need to be updated to include joins. Tools operating on replicas will also need to be updated, potentially a large burden for community tools provided by volunteers.
To aid in conversion, a view could be created that looks like the old revision table layout, pulling from the new, more compact revision table and the indirected strings.
TODO do a quick audit and see if central places will take care of most of this
TODO performance concerns about views? Would it perform well enough to just keep existing code as-is?
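A rough sketch of what such a view might look like, assuming the revmeta and indirection tables proposed below and that the original revision table has been renamed out of the way; whether it performs well enough is exactly the open question above:

 -- Compatibility view presenting the old revision layout over the new tables (sketch).
 CREATE VIEW revision AS
 SELECT rm.rev_id,
        rm.rev_page,
        rm.rev_timestamp,
        c.c_text   AS rev_comment,
        ut.ut_text AS rev_user_text,
        cm.cm_text AS rev_content_model
        -- ...plus the remaining revision columns...
   FROM revmeta rm
   JOIN comment c       ON c.c_id   = rm.rev_comment_id
   JOIN usertext ut     ON ut.ut_id = rm.rev_user_text_id
   JOIN contentmodel cm ON cm.cm_id = rm.rev_content_model_id;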
Proposal
Schema
Create a revmeta table with all the fields from revision, except the following, which are turned into string indirections:
- rev_comment_id -> comment.c_id, comment.c_text
- rev_user_text_id -> usertext.ut_id, usertext.ut_text
- rev_content_model_id -> contentmodel.cm_id, contentmodel.cm_text
- rev_content_type_id -> contenttype.ct_id, contenttype.ct_text
Separate tables for each type are proposed for locality, though all string indirections could use the same table in theory.
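A rough sketch of the shape this could take; column names and types here are illustrative, not a final schema:

 -- New compact revision metadata table (illustrative subset of columns).
 CREATE TABLE revmeta (
   rev_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
   rev_page INT UNSIGNED NOT NULL,
   rev_timestamp BINARY(14) NOT NULL,
   rev_comment_id BIGINT UNSIGNED NOT NULL,          -- -> comment.c_id
   rev_user_text_id BIGINT UNSIGNED NOT NULL,        -- -> usertext.ut_id
   rev_content_model_id SMALLINT UNSIGNED NOT NULL,  -- -> contentmodel.cm_id
   rev_content_type_id SMALLINT UNSIGNED NOT NULL    -- -> contenttype.ct_id
   -- ...plus the remaining revision fields, unchanged...
 );

 -- Example indirection table; a unique index on the text supports de-duplication.
 CREATE TABLE usertext (
   ut_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
   ut_text VARBINARY(255) NOT NULL,
   UNIQUE KEY ut_text (ut_text)
 );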
De-duplicating the revision comments will require being able to look up matching comments, and a large index on the comment text may be counterproductive; consider adding a hash field or using a limited-length prefix index.
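One possible shape for the hash approach (again illustrative, not a worked-out design): store a cheap hash next to the text, index only the hash, and compare the full text to weed out collisions before inserting a new row.

 -- Comment indirection table keyed for lookup by hash (sketch only).
 CREATE TABLE comment (
   c_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
   c_hash INT UNSIGNED NOT NULL,   -- e.g. CRC32 of c_text; collisions are possible
   c_text BLOB NOT NULL,
   KEY c_hash (c_hash)
 );

 -- Find an existing comment; insert only if the lookup returns nothing.
 SELECT c_id FROM comment
  WHERE c_hash = CRC32('fix typo') AND c_text = 'fix typo';
 INSERT INTO comment (c_hash, c_text) VALUES (CRC32('fix typo'), 'fix typo');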
TODO: best practices for string indirections?
Transition
Because existing fields will be dropped, this can't be done with just an online ALTER TABLE and a master switch.
A possible "mostly-online" deployment with a brief switchover period:
- Add code to MediaWiki core with support for the old revision and new revmeta layouts, with a config switch.
  - revmeta table will be the new compact version
  - usertext and revcomment (?) tables to hold the string indirections
- Run in a transitionary mode where we primarily use the classic revision table:
  - Run a background process that crawls classic revision and upserts to the new tables (see the sketch after this list)
  - Have code that inserts/updates into revision also upsert to the new tables
- Once conversion is complete, switch to read-only for maintenance:
  - rename revision out of the way for later archival/removal
  - create a revision compatibility view onto the new tables
- Switch to using the new table layout and disable read-only.
- Indirection tables could be pre-seeded "online" while running on the old schema, then when a
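As a sketch of what one batch of the background crawl might amount to, assuming the revmeta, comment, and usertext tables sketched above plus analogous contentmodel/contenttype tables (a real implementation would be a throttled maintenance script rather than a single statement):

 -- Copy one bounded batch of classic revision rows into revmeta, resolving each
 -- string to its indirection id (assumes the string rows were inserted already).
 INSERT INTO revmeta
        (rev_id, rev_page, rev_timestamp,
         rev_comment_id, rev_user_text_id, rev_content_model_id, rev_content_type_id)
 SELECT r.rev_id, r.rev_page, r.rev_timestamp,
        c.c_id, ut.ut_id, cm.cm_id, ct.ct_id
   FROM revision r
   JOIN comment c       ON c.c_hash = CRC32(r.rev_comment) AND c.c_text = r.rev_comment
   JOIN usertext ut     ON ut.ut_text = r.rev_user_text
   JOIN contentmodel cm ON cm.cm_text = r.rev_content_model
   JOIN contenttype ct  ON ct.ct_text = r.rev_content_type
  WHERE r.rev_id BETWEEN 1000000 AND 1000999           -- crawl in bounded batches
 ON DUPLICATE KEY UPDATE rev_page = VALUES(rev_page);   -- safe to re-run over a range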
Major deployment concerns
- requires room for the old tables and the new tables during transition
- increases write activity on live databases
- the background process can be throttled as necessary, but live upserts will be doubled.
Row count
Alternately, or in combination, we could attack the large row count by partitioning/sharding the revision table.
Basic theory
Many rows of revision data will be relatively rarely used, especially older data, but our schema lumps the entire 15-year history of Wikipedia into one data set that must sit on the same disk and share the same caching.
Possible alternative: multiple tables with the same layout and same rev_id space but separate storage/buffering. These separate tables can be queried in the same way and results combined as if they're a single table in a single server.
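As a toy illustration, assuming hypothetical revision_recent and revision_old tables split on a rev_id cutoff (the names and cutoff value are placeholders; the actual partitioning axis is discussed below):

 -- Two tables with identical layout sharing the rev_id space (illustrative names).
 CREATE TABLE revision_old    LIKE revision;
 CREATE TABLE revision_recent LIKE revision;

 -- Distribute existing rows by an example rev_id cutoff; the rarely-read 'cold'
 -- data can then live on cheaper storage with its own buffering budget.
 INSERT INTO revision_old    SELECT * FROM revision WHERE rev_id <  700000000;
 INSERT INTO revision_recent SELECT * FROM revision WHERE rev_id >= 700000000;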
Partitioning axes
- 'hot' vs 'cold' by time/id/current-ness
- along namespaces, types, page update frequency, or other boundaries
Compatibility
If the partitioned tables can live in the same database server, a union view can probably be provided that's compatible with the existing revision table for many purposes; however, this may or may not be efficient to pull from, and may not provide the performance or storage space characteristics desired from partitioning.
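With the hypothetical revision_recent/revision_old split sketched above on one server, and the original revision table renamed away, such a union view might be as simple as:

 -- Union view presenting the partitions as a single revision table (sketch).
 CREATE VIEW revision AS
     SELECT * FROM revision_recent
     UNION ALL
     SELECT * FROM revision_old;

A view containing a UNION generally can't be merged into the outer query and may be materialized, which is one reason it may not pull efficiently.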
If the separate tables live on separate servers, life is harder: code must fetch from multiple databases (serial? parallel?) and possibly combine/interleave results. Potentially a lot of code updates required.
TODO check MySQL proxying facilities to see if we can present the multiple tables as a single table for queries
Proposal
None yet on this end.