User:Anomie/Abstract schema

This page holds some notes on the idea of abstracting MediaWiki's schema. The goal is to eventually make this into an RFC, but first a lot of things need to be written down.

Problem

MediaWiki claims to support five databases: MySQL/MariaDB, SQLite, PostgreSQL ("PG"), Microsoft SQL Server ("MSSQL"), and Oracle. For normal runtime queries, we have abstractions that make all of these work reasonably well.

But at the DDL level it's a completely different story. One major piece of (and source of) technical debt is the fact that MediaWiki does not have a database schema, it has four (SQLite largely reuses the MySQL schema), and most schema changes have to be written five times, once for each supported database. In practice, this means schema changes for the less-supported databases are often omitted, or, when not omitted, are often merged without being tested.

  • MySQL/MariaDB always works, since it's the one used on Wikimedia wikis and most third-party installations.
  • SQLite is fairly well supported, since the WMF CI infrastructure uses it for some tests. But its schema changes are often extremely annoying to write, since many have to be done as "create the new version of the table, copy all the data over, drop the old table, and rename the new one".
  • PostgreSQL (PG) mostly works, and is tested in WMF CI. There are a few WMF developers who can easily run tests (and will do so if asked), and it's included in Linux distributions.
  • MSSQL and Oracle probably don't work all that well. No WMF developers who use either of them care enough to have made themselves known, and the few interested volunteer developers can't devote much time to supporting them. Licensing prevents us from adding them to CI, even if someone were willing to do the work to make it technically possible.

Requirements

We can easily improve the situation by abstracting the schema and schema change definitions, with code per database to translate that into the actual DDL statements. To do this, we need:

  • An abstract data structure for schemas.
  • An abstract data structure for schema changes.
  • APIs to deal with these data structures. Functions needed include:
    • Turn a schema structure into something resembling tables.sql, for human review.
    • Populate an empty database with tables corresponding to a schema structure (i.e. the installer).
    • Compare a schema structure with an existing database and report differences.
    • Turn a schema change structure into a set of SQL commands, for human review and WMF DBAs.
    • Alter an existing database according to the instructions in a schema change structure (i.e. update.php).
    • Given a schema structure and a schema change structure, output a new schema structure that is the result of applying those changes.
  • Implementations of those APIs for supported databases.
  • Integration with MediaWiki, including translation of the existing tables.sql and schema change files.

Requirements at the database level are in /DB Requirements.

Abstract data structures

The sections below intentionally specify a data structure rather than a file format. We can use any generic file format as long as it can represent the structure correctly.

The features we need from a file format are:

  • Arrays, i.e. an ordered list of values.
  • Unordered key-value maps. Keys are always strings.
  • Ordered key-value maps (although we could work around that by making "columns" an array of unordered maps and moving the column name inside the map; see the example below).
  • Basic data types: strings, integers, booleans, null. Maybe floating point numbers too.

Using JSON as the format on disk would seem to be the current standard, although T212460 suggests we should use PHP array format.

For simplicity, let's define the canonical on-disk ordering of an unordered map as sorting it by key.
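
To make the ordered-map question concrete, here is how a couple of columns might look in PHP array format. The column names, types, and defaults here are purely illustrative.

  // Columns as an ordered key-value map: the order is carried by the map itself.
  $columns = [
      'cl_from' => [ 'type' => 'integer', 'nullable' => false, 'default' => 0 ],
      'cl_to' => [ 'type' => 'binary', 'nullable' => false, 'default' => '' ],
  ];

  // The workaround using only unordered maps: an array of unordered maps,
  // with the column name moved inside each map.
  $columns = [
      [ 'name' => 'cl_from', 'type' => 'integer', 'nullable' => false, 'default' => 0 ],
      [ 'name' => 'cl_to', 'type' => 'binary', 'nullable' => false, 'default' => '' ],
  ];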

Schema format

The root of the data structure is an unordered map (see the example after this list):

  • "tables": map, probably unordered:
    • Each key is the table name. Values are unordered maps ("table definition"):
      • "comment": (string, optional) A description of the purpose of the table. This may be stored into the DB if the DB supports it.
      • "columns": An ordered map (but see below):
        • Each key is the column name. Values are unordered maps ("column definition"):
          • "comment": (string, optional) A description of the purpose of the column. This may be stored into the DB if the DB supports it.
          • "type": (string) Type of the column, see /DB Requirements#Data types.
          • "nullable": (boolean, default false) Whether the column is declared NOT NULL (false) or NULL (true). Some types don't support this flag.
          • "default": (type depends on "type", optional) If present, the DEFAULT for the column. If omitted and the column is nullable, the default should be NULL (most if not all of our DBs do this automatically anyway). If omitted and the column isn't nullable, then the column has no default and must be specified on every INSERT.
          • "references": (optional) Defines a foreign key constraint for this column. See /DB Requirements#Foreign keys for semantics. An unordered map:
            • "table": (string) The table being referenced.
            • "column": (string) The column being referenced.
            • "on update": (string, default "restrict") The action to take if the referenced column is updated. See /DB Requirements#Foreign keys.
            • "on delete": (string, default "restrict") The action to take if the referenced row is deleted. See /DB Requirements#Foreign keys.
          • Other types may have additional flags. See the type documentation for details.
      • "pk": (array of 1+ strings, optional) The names of the columns making up the primary key.
      • "indexes": (optional) An unordered map. See /DB Requirements#Indexes for semantics.
        • Keys are the index names. Values are unordered maps ("index definition"):
          • "comment": (string, optional) A description of the purpose of the index. This may be stored into the DB if the DB supports it.
          • "columns": (array of 1+ strings or unordered maps) The columns making up the index. If a string, the string is the column name. If an unordered map, it has the following keys:
            • "column": (string) The column name.
            • "length": (int, optional) The prefix length for text-type columns. See /DB Requirements#Column-prefix indexes for details.
            • "direction": (string, default 'ASC') One of the constants 'ASC' or 'DESC'.
          • "unique": (boolean, default false) If true, the index also defines a unique constraint. See /DB Requirements#Unique indexes for semantics.
      • "initRows": (optional) A valid value for the $a parameter to IDatabase::insert(). When the table is first created, these rows are inserted.

Two schema data structures can be merged. At the root level, values for each key are merged independently:

  • "tables": If both schema files define the same key, an error is raised.

Implementation notes:

  • Yes, a schema is just a bunch of tables. We could add views later, if we decide to start using views in the schema. Other things seem unlikely to be sanely abstracted.
  • Someday the "pk" should be non-optional. But first we'd need to fix MediaWiki's schema.
  • The "installer" will have to make at least two passes: one to create all the tables and one to create any foreign keys after all the tables are created.
  • While the columns are defined as an ordered map, MediaWiki must not actually depend on the order, since some DBs don't support sane schema changes for adding columns in the middle of a table.
  • This proposal has yet to specify how to deal with "fulltext" indexes. See /DB Requirements#Fulltext indexes.

Schema change format

The root of the data structure is an array of unordered maps defining individual changes (see the example after this list):

  • "name": (string) A short human-readable name, as might be displayed by update.php
  • "comment": (string, optional) A longer description of the change.
  • "order": (string) A unique key for this update that also defines its order. Sorting uses PHP's version_compare function. The format string would look like "1.31.0-wmf.12+core+123" or "1.31.0-wmf.12+extension-MyCoolExtension+123", where "1.31.0-wmf.12" is the MediaWiki version number, the next field is a constant identifying the source, and the final field is a serial number in case there are multiple schema changes within one MW version.
  • "check": (array, optional) Precondition check to run. If the check returns false, the individual schema change is skipped. The array consists of an operation followed by 0 or more parameters to the operation. Operations are:
    • "AND": Parameters are "check" arrays. They are evaluated in order. If any of those checks returns false, the "AND" immediately returns false (without evaluating the rest). Returns true if all pass.
    • "OR": Parameters are "check" arrays. They are evaluated in order. If any of those checks returns true, the "OR" immediately returns true (without evaluating the rest). Returns false if none pass.
    • "NOT": There must be exactly one parameter, which is a "check" array. Evaluates that check and returns the logical inverse.
    • Any boolean-returning function defined in the #Introspector interface.
  • "updates": (array of arrays) The actual updates to perform. Each update-array consists of a function name on the #SchemaOperations interface followed by any parameters that function requires. If any update fails, the rest are not run.

Two schema change arrays can be merged by concatenating them. Note that "order" values must remain unique after combining.

Implementation notes:

  • Ideally we'd say to wrap each change in a transaction so a failure in the middle of running the "updates" can be cleanly rolled back, but MySQL, for one, can't actually do that (its DDL statements cause implicit commits).

APIs

We already have IDatabase. This proposes adjusting that slightly and adding two more interfaces, Introspector and SchemaOperations, and a class SchemaUpdater that is not subclassed per database.

Changes to IDatabase proposed here are relatively minor (a rough sketch follows this list):

  • Add a method to return an Introspector attached to the IDatabase.
  • Add a method to return a SchemaOperations attached to the IDatabase.
  • Deprecate the following methods in favor of corresponding methods on Introspector: fieldExists, indexExists, tableExists
  • Deprecate the following methods in favor of corresponding methods on SchemaOperations: setSchemaVars
  • Remove encodeBlob and decodeBlob, or somehow make them mandatory for MySQL to work right. Experience teaches us that anything that's required for some non-MySQL database but works in MySQL is likely to be forgotten.
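
A rough sketch of the two new accessors; the method names below are placeholders, not a settled API.

  interface IDatabase {
      // ... existing methods ...

      /** @return Introspector An Introspector attached to this connection. */
      public function getIntrospector();

      /** @return SchemaOperations A SchemaOperations attached to this connection. */
      public function getSchemaOperations();
  }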

As for implementations of IDatabase,

  • Many public methods on Database that aren't in IDatabase should be deprecated, and eventually either be made protected/private or removed entirely.
  • We should remove database-specific table renaming, e.g. PostgreSQL renaming "user" to "mwuser". Quoted identifiers work in every database.
  • The "IMaintainableDatabase" interface looks like it should mostly be split into Introspector and SchemaOperations.

Introspector

This interface has methods for introspecting the schema of an existing database.

I'm not going to try to anticipate all the methods it might need, but off the top of my head the possibilities include:

  • bool tableExists( string $table )
  • bool columnExists( string $table, string $column )
  • bool columnIsType( string $table, string $column, string $type )
  • bool columnIsNullable( string $table, string $column )
  • bool columnHasDefault( string $table, string $column, mixed $value )
  • bool indexExists( string $table, string $index )
  • bool indexIsUnique( string $table, string $index )
  • string[] listTables()
  • string[] listColumns( string $table )
  • string[] listIndexes( string $table )
  • array getTableDefinition( string $table )
  • array getColumnDefinition( string $table, string $column )
  • array getIndexDefinition( string $table, string $index )

Implementation notes:

  • We can very likely have a base class that implements most of the boolean methods in terms of the get-definition methods (see the sketch after this list).
  • Should columnExists() and indexExists() return false if the table doesn't exist, or throw exceptions? I lean towards the former.
  • We'll also want a version that introspects a schema structure directly with no backing database. This will be used with the similar no-DB SchemaOperations to handle the "Given a schema structure and a schema change structure, output a new schema structure that is the result of applying those changes" use case.
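
A minimal sketch of that base-class idea, assuming the method names above; the class name is made up.

  abstract class IntrospectorBase implements Introspector {
      public function columnExists( string $table, string $column ) {
          // Leaning towards returning false rather than throwing if the table is missing.
          return $this->tableExists( $table )
              && in_array( $column, $this->listColumns( $table ), true );
      }

      public function columnIsNullable( string $table, string $column ) {
          $def = $this->getColumnDefinition( $table, $column );
          return $def['nullable'] ?? false;
      }

      public function indexIsUnique( string $table, string $index ) {
          $def = $this->getIndexDefinition( $table, $index );
          return $def['unique'] ?? false;
      }

      // ... and so on for the other boolean methods ...
  }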

SchemaOperations

This interface has methods for making changes to a schema.

I'm not going to try to anticipate all the methods it might need, but off the top of my head the possibilities include:

  • bool createTable( string $table, array $tableDefinition )
  • bool dropTable( string $table )
  • bool setTableComment( string $table, string|null $comment )
  • bool addColumn( string $table, string $column, array $columnDefinition )
  • bool dropColumn( string $table, string $column )
  • bool renameColumn( string $table, string $oldname, string $newname )
  • bool alterColumn( string $table, string $column, array $columnDefinition )
  • bool setColumnDefault( string $table, string $column, mixed $default )
  • bool removeColumnDefault( string $table, string $column )
  • bool setColumnNullable( string $table, string $column, bool $nullable )
  • bool setColumnComment( string $table, string $column, string|null $comment )
  • bool addIndex( string $table, string $index, array $indexDefinition )
  • bool dropIndex( string $table, string $index )
  • bool setIndexComment( string $table, string $index, string|null $comment )
  • bool runMaintenanceScript( string $className )

Implementation notes:

  • Can we just use setColumnDefault( $table, $column, null ) instead of having removeColumnDefault()? Or is there benefit in explicitly declaring DEFAULT NULL in some database?
  • Should it be an error to drop a table/column/index that doesn't exist, or just return true?
  • Should it be an error to create a table or add a column or index that already exists with the same definition, or just return true?
  • We'll also want a version that updates a schema structure directly rather than using a backing database. This will be used with the similar no-DB Introspector to handle the "Given a schema structure and a schema change structure, output a new schema structure that is the result of applying those changes" use case.
  • I'm hoping we can completely leave out any equivalent to the existing sourceFile() and sourceStream() methods.
  • Low-level: All interaction with IDatabase should go via calls to IDatabase::query(), and we'll need a way to override that for the "output an SQL file" use cases (see the sketch below).
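
A rough sketch of that last point, assuming a hypothetical MySQL implementation: every DDL statement goes through one method, which the "output an SQL file" use case could override to collect statements instead of running them. Identifier quoting is omitted for brevity.

  class MySQLSchemaOperations /* implements SchemaOperations */ {
      /** @var IDatabase */
      private $db;

      public function __construct( IDatabase $db ) {
          $this->db = $db;
      }

      public function dropIndex( string $table, string $index ) {
          return $this->schemaQuery(
              'DROP INDEX ' . $index . ' ON ' . $this->db->tableName( $table ),
              __METHOD__
          );
      }

      // The single funnel for DDL. A subclass for the "output an SQL file" use
      // case would override this to record $sql rather than executing it.
      protected function schemaQuery( $sql, $fname ) {
          return (bool)$this->db->query( $sql, $fname );
      }

      // ... other SchemaOperations methods would be built the same way ...
  }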

SchemaUpdater

This is the class that turns a schema structure or schema update structure into calls to methods on Introspector and SchemaOperations.
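
Very roughly, and with all names beyond Introspector and SchemaOperations made up, applying a schema change structure might look like:

  class SchemaUpdater {
      /** @var Introspector */
      private $introspector;
      /** @var SchemaOperations */
      private $ops;

      public function __construct( Introspector $introspector, SchemaOperations $ops ) {
          $this->introspector = $introspector;
          $this->ops = $ops;
      }

      public function applyChanges( array $changes ) {
          // Apply the changes in "order" order.
          usort( $changes, function ( $a, $b ) {
              return version_compare( $a['order'], $b['order'] );
          } );

          foreach ( $changes as $change ) {
              if ( isset( $change['check'] ) && !$this->evaluateCheck( $change['check'] ) ) {
                  continue; // Precondition failed; skip this change.
              }
              foreach ( $change['updates'] as $update ) {
                  $func = array_shift( $update );
                  if ( !$this->ops->$func( ...$update ) ) {
                      // If any update fails, the rest are not run.
                      throw new RuntimeException( "Schema change '{$change['name']}' failed at $func" );
                  }
              }
          }
      }

      private function evaluateCheck( array $check ) {
          $op = array_shift( $check );
          if ( $op === 'AND' ) {
              foreach ( $check as $sub ) {
                  if ( !$this->evaluateCheck( $sub ) ) {
                      return false;
                  }
              }
              return true;
          }
          if ( $op === 'OR' ) {
              foreach ( $check as $sub ) {
                  if ( $this->evaluateCheck( $sub ) ) {
                      return true;
                  }
              }
              return false;
          }
          if ( $op === 'NOT' ) {
              return !$this->evaluateCheck( $check[0] );
          }
          // Otherwise, a boolean-returning Introspector method.
          return (bool)$this->introspector->$op( ...$check );
      }
  }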

Implementation notes:

  • I'm hoping we can completely leave out any equivalent to the existing sourceFile() and sourceStream() methods.

MediaWiki integration

Implementation notes:

  • For CI we currently have an old version of tables.sql, which we load into SQLite and then apply all the schema changes to, to see if the result is an equivalent database. I propose we do the same thing with the abstract schema, and possibly that we run that check against the testing database as well.
    • We might even save the abstract schema at every release and check upgrades from all of them.
  • We might make a maintenance script like maintenance/generateLocalAutoload.php
  • For MySQL, we'll need to make sure that the code will work with both the old tables.sql schema and the new abstract schema, and we'll want to keep to a minimum any schema changes needed to make an existing database match the abstract schema. For other databases that restriction isn't needed.
    • We'll probably want to make the use of foreign keys optional, and disable it on Wikimedia wikis, at least at first.
  • We'll likely (especially for non-MySQL) need to make a special schema change that does nothing to an installation using the abstract schema but will update an old tables.sql installation to match the abstract schema.

Prior art
