Core Platform Team/Initiative/Shard Revision Table/Initiative Description

< Shard Revision Table

Summary

The ability to do multiple MySQL read queries in parallel, hitting multiple shards at the same time, would allow sharding of the revision table to be done without duplication of data, and without any client latency impact. However, this comes at the expense of duplicating the read query rate. A read query that includes an index can be done without duplicating read query traffic if shards are split by the same index. However, this adds complexity in the application.

Joins on small tables could be supported by replicating those tables into all shards. Joins on large tables, such as a join of the revision table with itself, would have to be emulated in MediaWiki.

The minimum viable product is to distribute rows into shards randomly and to duplicate all read queries asynchronously. If approved, the project plan would then be:

  • Evaluate existing asynchronous/parallel MySQL clients in PHP. Choose one.
  • Provide an interface within MediaWiki with the following capabilities:
    • Distributing SELECT queries to all shards, reassembling and sorting results.
    • Distributing DELETE and UPDATE queries to all shards.
    • Sending INSERT queries to a single shard based on some simple policy, e.g. fixed or random.
  • Remove problematic joins from MediaWiki.
  • Provide a maintenance script for moving data between shards.
  • Work with DBAs to deploy the system to the Wikimedia cluster.
Significance and Motivation

Splitting up storage over multiple servers will allow faster schema changes and possibly cheaper hardware. In some cases, this can increase our capacity. This will enable us to handle more applications like JADE.

Outcomes

Increase the scalability of the platform for future applications and new types of content, as well as a growing user base and amount of content

  1. Schema change speed. This process directly slows down any feature that requires database changes. The amount of improvement we get is dependent on the number of shards we introduce.
  2. Performance. We should have parity with current MYSQL access. We should have no performance degradation.
Baseline Metrics

None given

Target Metrics

None given

Stakeholders
  • Extension developers
  • Scoring
  • WMDE (Wikibase)
Known Dependencies/Blockers

Actor and Comment Table Schema migrations

MCR (Slot and Content Table) Schema migrations