Core Platform Team/Initiatives/Shard Revision Table

Initiative Description

Summary

The ability to run multiple MySQL read queries in parallel, hitting multiple shards at the same time, would allow the revision table to be sharded without duplicating data and without any client latency impact. However, this comes at the cost of multiplying the read query rate by the number of shards. A read query that filters on an indexed column can avoid that extra read traffic if the shards are split on the same index, at the cost of added complexity in the application.
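
As a minimal sketch of that index-based split, assuming a fixed shard count and rev_page as the sharding key (both hypothetical; neither has been decided):

  <?php
  // Minimal sketch, not a decided design: shards are chosen by
  // hashing the sharding key, so a query that filters on rev_page
  // needs only one shard, while any other query must fan out.
  const SHARD_COUNT = 4; // hypothetical

  function shardFor( int $revPage ): int {
      return $revPage % SHARD_COUNT;
  }

  $shard = shardFor( 12345 ); // all revisions of this page share a shard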

Joins on small tables could be supported by replicating those tables into all shards. Joins on large tables, such as a join of the revision table with itself, would have to be emulated in MediaWiki.
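
A sketch of what such emulation could look like, assuming a hypothetical $shards array holding one mysqli connection per shard; the column names are from the current revision table, and error handling and the empty-result case are omitted:

  <?php
  // Illustrative only: emulate "revision r JOIN revision p ON
  // r.rev_parent_id = p.rev_id" by running two fan-out queries
  // and joining the result sets in PHP.

  function fetchAll( array $shards, string $sql ): array {
      $rows = [];
      foreach ( $shards as $conn ) {
          $res = $conn->query( $sql );
          while ( $row = $res->fetch_assoc() ) {
              $rows[] = $row;
          }
      }
      return $rows;
  }

  // Pass 1: the revisions of interest, from every shard.
  $revs = fetchAll( $shards,
      'SELECT rev_id, rev_parent_id FROM revision WHERE rev_page = 42' );

  // Pass 2: fetch the parents by ID, then join in memory.
  $ids = implode( ',', array_map( 'intval', array_column( $revs, 'rev_parent_id' ) ) );
  $parents = [];
  foreach ( fetchAll( $shards,
      "SELECT rev_id, rev_timestamp FROM revision WHERE rev_id IN ($ids)" ) as $p ) {
      $parents[$p['rev_id']] = $p;
  }
  foreach ( $revs as &$rev ) {
      $rev['parent'] = $parents[$rev['rev_parent_id']] ?? null;
  }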

The minimum viable product is to distribute rows into shards randomly and to duplicate all read queries asynchronously. If approved, the project plan would then be:

  • Evaluate existing asynchronous/parallel MySQL clients in PHP. Choose one.
  • Provide an interface within MediaWiki with the following capabilities:
    • Distributing SELECT queries to all shards, then reassembling and sorting the results (see the sketch after this list).
    • Distributing DELETE and UPDATE queries to all shards.
    • Sending INSERT queries to a single shard based on some simple policy, e.g. fixed or random.
  • Remove problematic joins from MediaWiki.
  • Provide a maintenance script for moving data between shards.
  • Work with DBAs to deploy the system to the Wikimedia cluster.
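
To make the SELECT fan-out above concrete, here is a sketch using mysqli's built-in asynchronous mode (MYSQLI_ASYNC with mysqli_poll()), one candidate client the first step would evaluate. Hosts, credentials, and the sort column are illustrative assumptions.

  <?php
  // Sketch of the MVP read path: broadcast one SELECT to every shard
  // without blocking, then merge and re-sort the rows in PHP.
  $shards = [
      new mysqli( 'shard0.example', 'wiki', 'secret', 'wikidb' ),
      new mysqli( 'shard1.example', 'wiki', 'secret', 'wikidb' ),
  ];
  $sql = 'SELECT rev_id, rev_timestamp FROM revision WHERE rev_page = 42';

  // Issue the query on every shard asynchronously.
  foreach ( $shards as $conn ) {
      $conn->query( $sql, MYSQLI_ASYNC );
  }

  // Reap results as each shard finishes.
  $rows = [];
  $pending = $shards;
  while ( $pending ) {
      $read = $err = $reject = $pending;
      if ( mysqli_poll( $read, $err, $reject, 1 ) < 1 ) {
          continue;
      }
      foreach ( $read as $conn ) {
          $res = $conn->reap_async_query();
          while ( $row = $res->fetch_assoc() ) {
              $rows[] = $row;
          }
          unset( $pending[array_search( $conn, $pending, true )] );
      }
  }

  // Reassemble: apply the ORDER BY across the merged set in PHP.
  usort( $rows, function ( $a, $b ) {
      return $a['rev_timestamp'] <=> $b['rev_timestamp'];
  } );
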
Significance and Motivation

Splitting storage across multiple servers will allow faster schema changes and possibly cheaper hardware, and in some cases can increase our capacity. This will enable us to support more applications like JADE.

Outcomes

Increase the scalability of the platform for future applications and new types of content, as well as for a growing user base and a growing amount of content.

  1. Schema change speed. Long-running schema changes directly slow down any feature that requires database changes. The improvement we get depends on the number of shards we introduce: with k shards, an ALTER touches roughly 1/k of the rows on each shard, and the shards can be altered in parallel.
  2. Performance. We should have parity with current MySQL access, i.e. no performance degradation.
Baseline Metrics

None given

Target Metrics

None given

Stakeholders
  • Extension developers
  • Scoring
  • WMDE (Wikibase)
Known Dependencies/Blockers

Actor and Comment Table Schema migrations

MCR (Slot and Content Table) Schema migrations

Epics, User Stories, and Requirements

  • Survey all the code that uses the revision table in order to get the full scope of the work
  • Investigate options for asynchronous/parallel MySQL queries
  • Write an RFC covering the whole project. This includes:
    • Possible implementation options:
      • an abstract DB layer (a possible shape is sketched after this list)
      • making the business logic aware of multiple databases
    • Migration path
  • Depending on the implementation, potentially reach out to extension developers about changes
  • Implement
  • Migrate
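
One hypothetical shape for the abstract-DB-layer option (all names are illustrative, not an agreed API):

  <?php
  // Sketch of the "abstract DB layer" option: business logic keeps
  // issuing single logical queries and this layer fans them out.
  interface ShardedDatabase {
      /** Broadcast a SELECT to every shard and merge the results. */
      public function selectAll( string $sql ): array;

      /** Apply a DELETE or UPDATE on every shard. */
      public function deleteOrUpdateAll( string $sql ): void;

      /** Send an INSERT to one shard chosen by the insert policy. */
      public function insertOne( string $sql ): void;
  }

Under the second option, business logic would instead receive per-shard connections and do this bookkeeping itself.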

Time and Resource Estimates

Estimated Start Date

April 2019

Actual Start Date

None given

Estimated Completion Date

None given

Actual Completion Date

None given

Resource Estimates

TBD; we will know after the analysis in the first step of the project. The effort should be comparable to the Actor/Comment table project.

Collaborators
  • SRE

Open Questions

  • Do we issue parallel or serial MySQL queries? Plain blocking PHP calls cannot run MySQL queries in parallel, so we would need an asynchronous client or library.
  • How do we split up the table, and over which indexes?
  • How much of a speed improvement do we want for schema changes? This will determine the number of shards.
  • How do we define and measure our metrics?
