User talk:Anomie/Abstract schema

Abstract SQL dumps?

Legoktm (talkcontribs)

If you have a wiki in Postgres, there's no good migration path to MySQL. Could we introduce some abstract SQL dump format? It could also be used by Wikimedia, in addition to the MySQL dumps.

It would be neat if the schema definition could indicate whether the data in the column is private or not, so that the abstract dump could just omit that information if --public or something was provided to the dump script (T140977). It could also (partially?) replace the logic that's maintained in maintain-views.yaml.
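
Hypothetically, each column entry could carry a flag the dump script understands; the "public" key in this snippet is purely a made-up illustration of the idea:

"columns": {
    "user_name": { "type": "string", "public": true },
    "user_password": { "type": "string", "public": false },
    "user_email": { "type": "string", "public": false }
}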

Probably a separate project in itself, but wanted to mention the idea to make sure any necessary groundwork wouldn't be missed.

Tgr (WMF) (talkcontribs)

The abstraction layer would have to be revdelete-aware to properly support --public.

Anomie (talkcontribs)

An "abstract dump format" seems outside the scope of this proposal. Although this proposal's implicit fixing of various schema inconsistencies might make it more feasible to use pg_dump --inserts or other existing tools to do such a migration.

Indicating public data seems far more complex than something sensibly maintained in the core schema, as is indicated by the complexity of the logic for some tables in the file you link.

MZMcBride (talkcontribs)

I think a private/public flag, or the ability to flag and annotate specific fields, would be useful functionality, even if it were just a comment/docstring near the schema definition for a particular table column.

Reply to "Abstract SQL dumps?"

Moving Forwards

Skizzerz (talkcontribs)

What work still needs to be done on this before we can get this officially on the roster of RFCs to be discussed? I'd like to get this moving forwards sooner rather than later.

Anomie (talkcontribs)

If you're fine with it from an MSSQL perspective, I think we're good to put more details into the Phab task and submit it for scheduling. It doesn't seem like anyone is going to start discussing any of the other questions in the various "implementation notes" sections before the real RFC.

When I do that, I'll probably present this page as one alternative and Doctrine's migration stuff as another.

Skizzerz (talkcontribs)

Works for me! I don't particularly care which way we end up going (Doctrine would save effort in some areas and add effort in others), and either way I believe we'd need the schema representation / JSON thing so that we can be friendlier to the install/upgrade process (as Legoktm pointed out, loading JSON files is a lot less fragile than partially loading extension PHP files).

Reply to "Moving Forwards"

Implementation format

Legoktm (talkcontribs)

Since I've been poking at the installer lately, I'd like to suggest that, if possible, the schema definition should use a non-executable format (not PHP; JSON or XML would be fine) so that the installer can read it without executing any code. Currently the installer autoloads all of an extension's classes and runs the schema change hook, so if that hook checks a configuration variable, the variable won't be set. There are plenty of other scenarios where this implementation method falls apart.

In my ideal world the installer could just read the schema out of a JSON file (or your favorite format), so there's no need to half-load extensions like we currently do. It doesn't need to be in extension.json, maybe a parallel dbschema.json that extension.json points to.
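
For example (the "DatabaseSchema" key here is hypothetical; nothing like it exists yet):

{
    "name": "MyExtension",
    "DatabaseSchema": "dbschema.json"
}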

Skizzerz (talkcontribs)

I think the implementation would need to consist of two pieces:

  1. JSON schema (as described on the page) which core and extensions use to define tables.
  2. On the PHP side, we process the schema and then hand it off to Doctrine's DBAL library to handle the actual SQL generation, which gives us less to maintain on the DDL side of things.

Using Doctrine directly means that the schema files would need to be PHP instead of JSON, which runs into the issues you've mentioned in the installer and upgrader. As such, I think we still need an abstraction layer on top of Doctrine. If this sounds good, I can work that into the actual page and then we're probably at the point where we can kick off the actual RFC.
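
As a rough sketch of step 2 (assuming Doctrine DBAL 2.x; the table definition here is illustrative):

use Doctrine\DBAL\Platforms\MySqlPlatform;
use Doctrine\DBAL\Schema\Schema;

// Build Doctrine schema objects from our parsed JSON definition.
$schema = new Schema();
$table = $schema->createTable( 'actor' );
$table->addColumn( 'actor_id', 'bigint', [ 'unsigned' => true, 'autoincrement' => true ] );
$table->addColumn( 'actor_name', 'string', [ 'length' => 255 ] );
$table->setPrimaryKey( [ 'actor_id' ] );
$table->addUniqueIndex( [ 'actor_name' ] );

// Ask the platform for CREATE TABLE statements; we can then run them
// through our existing Database classes or write them to a .sql file.
$statements = $schema->toSql( new MySqlPlatform() );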

Anomie (talkcontribs)

I never planned to use Doctrine.

I haven't looked too closely at Doctrine, but any schema abstraction layer seems very likely to be closely tied to its own database abstraction layer. Unless we plan to replace MediaWiki's Database class with Doctrine (a far bigger project), I suspect trying to use it only for schema abstraction would have issues.

See for comparison some of the issues identified for cakephp's abstraction in Topic:Uamlkn3gywc8kdds.

Skizzerz (talkcontribs)

Doctrine's schema layer purports to be wholly disconnected from the rest of the ORM, so that it can be used alongside other things. I think it brings a lot to the table, mainly by allowing us to not need to create all of that migration logic ourselves. Furthermore, using Doctrine would allow for downgrades/rollbacks to previous versions, which if we rolled our own would likely not be supported at least initially simply because it'd be extra engineering effort for very little benefit.

To address the points you listed in the other topic:

  • This layer has functions that output the SQL to run, and it's up to you to take those SQL statements and run them through whatever db layer you use. This allows deploying that SQL using our existing Database classes for normal installs/upgrades as well as generating .sql files for deployment on larger clusters.
  • The varchar vs varbinary thing looks to be an issue. We'd have to build our own custom MySQL platform (extending the shipped one) to override the method that generates string fields to return varbinary instead of varchar (ditto for blob vs text on the text fields); see the sketch after this list.
  • It has native support for binary/varbinary/blob fields, but these are mapped to PHP resource streams rather than strings. This would be good for the cases where we're storing true binary data in a column as opposed to UTF-8 text.
  • Translating connection configuration is straightforward given that we're taking it in as separate variables.
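
On the second point, the override might look roughly like this (method names per DBAL 2.x, unverified):

use Doctrine\DBAL\Platforms\MySqlPlatform;

// Sketch only: map "string" columns to VARBINARY and CLOB/"text" columns
// to BLOB, matching MediaWiki's existing MySQL conventions.
class MWMySqlPlatform extends MySqlPlatform {
    protected function getVarcharTypeDeclarationSQLSnippet( $length, $fixed ) {
        $length = $length ?: 255;
        return $fixed ? "BINARY($length)" : "VARBINARY($length)";
    }

    public function getClobTypeDeclarationSQL( array $field ) {
        return 'BLOB'; // instead of TEXT/MEDIUMTEXT/LONGTEXT
    }
}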

In the long term, I would like to replace MediaWiki's Database classes with something less MySQL-centric (building queries by concatenating strings works great when you assume everything behaves the way it does in MySQL, but supporting other DBMSes then requires nasty fragile hacks, in many cases using regex replacement and the like). But that is outside the scope of this RFC, and it is likely better served by upgrading the internals of our Database class to assemble queries in a more structured manner instead of piecing together a SQL string, rather than by replacing the class wholesale with something else (that keeps largely the same API for backwards compatibility, doesn't throw away all of our institutional knowledge of what MediaWiki actually needs, etc.)

Anomie (talkcontribs)

I suspect the "nasty fragile hacks in a lot of cases using regex replacement" are not actually required, but are an artifact of people in the distant past taking the easy way to try to get something that functions for an immediate use case with minimal work put into writing the code.

I'm a bit skeptical of claims about rolling back a schema change. How do you roll back the data loss when a table or column is dropped? ... Looking at their documentation, it seems it works by simply having you write both the schema change and the "undo" schema change separately, with handling of lost data left up to the implementor. If you don't fill in the down() method, the "rollback" won't work.
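
In their terms, something like this (a sketch based on my reading of their docs; namespace per their 1.x releases):

use Doctrine\DBAL\Migrations\AbstractMigration;
use Doctrine\DBAL\Schema\Schema;

class Version20180101000000 extends AbstractMigration {
    public function up( Schema $schema ) {
        // The actual change: drop the old column.
        $schema->getTable( 'archive' )->dropColumn( 'ar_text' );
    }

    public function down( Schema $schema ) {
        // The "undo" you have to write yourself. It restores the column,
        // but the dropped data is gone for good.
        $schema->getTable( 'archive' )->addColumn( 'ar_text', 'blob', [ 'notnull' => false ] );
    }
}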

The "PHP resource streams" thing doesn't seem particularly relevant. First, that seems to be part of their DB abstraction rather than their schema abstraction. Second, PHP strings are already inherently binary, so "resource streams" aren't really needed anyway.

Skizzerz (talkcontribs)

In MSSQL, the two main hacks were to support LIMIT/OFFSET and GROUP_CONCAT. I've reworked the LIMIT/OFFSET code to make use of a new feature in SQL Server 2012 that greatly reduces the hackiness (patch still pending, and it still requires a bit of regex), but the GROUP_CONCAT hack will need to remain until MediaWiki stops relying on this MySQL-specific feature. The main issue with the original LIMIT/OFFSET code was that the method is passed a string and is expected to return a string, rather than getting anything more structured. The method basically assumes that the limit/offset stuff goes at the end of the string, which is simply not true in this case; I needed to actually move clauses around and mess with subqueries whenever offsets were involved. There are other ways to implement offsets, but all of them require wrapping the query in a subquery or CTE and moving or copying the select list.
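
For reference, the 2012+ form is just a clause after ORDER BY (table and column names illustrative):

-- MySQL
SELECT page_title FROM page ORDER BY page_id LIMIT 50 OFFSET 100;

-- SQL Server 2012+ (requires an ORDER BY)
SELECT page_title FROM page ORDER BY page_id
OFFSET 100 ROWS FETCH NEXT 50 ROWS ONLY;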

I don't see any particular need or requirement to be able to roll back changes, which is why I said that should we make our own, that feature would likely not be included.

Anomie (talkcontribs)

I don't think you can claim a concatenating aggregate function is a MySQL-specific feature, since every other database supports it too (although with no standardized syntax). BTW, it looks like MSSQL implemented it too in SQL Server 2017, as STRING_AGG.
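
For comparison (column names illustrative):

-- MySQL
SELECT cl_from, GROUP_CONCAT(cl_to SEPARATOR '|') FROM categorylinks GROUP BY cl_from;

-- SQL Server 2017+
SELECT cl_from, STRING_AGG(cl_to, '|') FROM categorylinks GROUP BY cl_from;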

For older versions, would it be possible to make a workalike with CREATE AGGREGATE? Or does that sort of thing always require external DLLs that we couldn't include with MediaWiki?

Skizzerz (talkcontribs)

Yes, that requires an external DLL, which requires sysadmin (root) level access to MSSQL both to enable the loading of external DLLs and to run the CREATE AGGREGATE statement. In any case, this conversation is diverging a lot from the purpose of this talk page (discussing the RFC). If you want to talk more about our existing database abstraction layer and its shortcomings in terms of non-MySQL support, feel free to poke me on IRC :)

Reply to "Implementation format"

Existing solutions to avoid the "not invented here" mentality

JCrespo (WMF) (talkcontribs)

Not necessarily to use as is, but maybe the abstraction code (from Phinx, for example) could be copied, if it has a compatible license?

Anomie (talkcontribs)

That looks potentially interesting, but at a first look through the documentation I see some potential issues:

  • One thing I don't see in there is the ability to output SQL files for you to apply in a controlled manner on WMF sites; or would you have to let it apply the changes directly?
  • I suspect it'll want to use varchar and text rather than varbinary and blob for "string" and "text" columns.
  • It doesn't have a generic "blob" column type (it has one, but it's MySQL-only), which would seem to leave us no good option for storing compressed article text and such.
  • And we'd have to figure out how to auto-translate MediaWiki's connection configuration into its syntax, or else have duplicate configuration.

As for copying, it looks to use an MIT license which should be compatible.

JCrespo (WMF) (talkcontribs)

Phinx was just an example. I am afraid of doing everything from scratch rather than contributing to or forking an existing project.

Skizzerz (talkcontribs)

The --dry-run param on it supposedly prints the SQL to stdout instead of executing it, so there's that. I need to take a deeper dive to see how customizable this is for our uses; my personal preference is that if we make use of a third-party library, we use the library directly instead of porting all the code over (and thereby taking on its maintenance burden ourselves).

Tgr (WMF) (talkcontribs)

The most popular such library is probably Doctrine (DBAL, more specifically).

Skizzerz (talkcontribs)

Better link (leads directly to the bits talking about the schema abstraction). It appears that this is decoupled from the ORM, so it could be usable by us as well. Documentation is incomplete on their site, so we'd need to dig into the code in order to determine what does what.

It also looks like Doctrine builds on PDO behind-the-scenes. While not an insurmountable barrier, we don't use PDO in MediaWiki and would therefore have extra installation requirements if we were to make use of it.

Tgr (WMF) (talkcontribs)

They used to build on PDO, and the API tries to match the PDO API, but they have quite a few native drivers these days, including mysqli.

Reply to "Existing solutions to avoid the "not invented here" mentality"

Prior efforts here

😂 (talkcontribs)
Anomie (talkcontribs)

I've seen the latter in the past, but I didn't refer to it when writing this. I don't care much for having to write a parser for a custom flat text file format when we could just represent the data structure more directly in PHP or JSON.

I hadn't seen the former.

Reply to "Prior efforts here"

runMaintenanceScript

MaxSem (talkcontribs)

What's the use case for bool runMaintenanceScript( string $className ) in SchemaOperations? Why can't generic Maintenance::runChild() be used?

Anomie (talkcontribs)

Because every operation in the schema change data structure has to be a call to a method on SchemaOperations. And also because Maintenance::runChild() doesn't actually run the script; something still needs to call ->execute().

The actual implementation in a SchemaOperationsBase class could look something like

public function runMaintenanceScript( $className ) {
    // runChild() only instantiates the script; we still have to run it.
    $task = $this->maintenance->runChild( $className );
    return $task->execute();
}

Except I hope to put all the new code except the "MediaWiki integration" bits somewhere under includes/libs/rdbms, and MW's Maintenance class isn't (currently) a library that could be depended on by a librarized rdbms.

If nothing else I suppose update.php could inject a callback with the above code into the SchemaOperations instance when it's setting everything up.
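
Something like this, say (setMaintenanceRunner is a made-up name for illustration):

// Hypothetical wiring in update.php, so the librarized rdbms code never
// has to depend on MediaWiki's Maintenance class directly.
$schemaOps->setMaintenanceRunner( function ( $className ) use ( $maintenance ) {
    $task = $maintenance->runChild( $className );
    return $task->execute();
} );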

More generally, I want to be able to define something like (from https://gerrit.wikimedia.org/r/c/393929/)

protected function migrateArchiveText() {
    if ( $this->db->fieldExists( 'archive', 'ar_text', __METHOD__ ) ) {
        $this->output( "Migrating archive ar_text to modern storage.\n" );
        $task = $this->maintenance->runChild( MigrateArchiveText::class, 'migrateArchiveText.php' );
        $task->setForce();
        if ( $task->execute() ) {
            $this->applyPatch( 'patch-drop-ar_text.sql', false,
                'Dropping ar_text and ar_flags columns' );
        }
    }
}

in the generic syntax. In the current proposal that could be an entry like this in the schema updates array

[
    'name' => 'Migrate archive ar_text to modern storage',
    'order' => '1.31.0-wmf.??+core+1',
    'check' => [ 'fieldExists', 'archive', 'ar_text' ],
    'operations' => [
        [ 'runMaintenanceScript', 'MigrateArchiveText' ],
        [ 'setColumnDefault', 'archive', 'ar_text_id', 0 ],
        [ 'setColumnNullable', 'archive', 'ar_text_id', false ],
        [ 'dropColumn', 'archive', 'ar_flags' ],
        [ 'dropColumn', 'archive', 'ar_text' ],
    ],
]

(although the "setForce()" bit in the original would still need an equivalent)

Reply to "runMaintenanceScript"

Partial indexes

Skizzerz (talkcontribs)

I'd rather the data structure split partial indexes into something easier for a machine to parse, rather than relying on special syntax such as "column_name(20)". Perhaps something like the following:

  • Allow tables.tname.indexes.iname.columns to be an array of 1+ "index objects", where an index object is either a string (to represent an index which covers the full column), or an unordered map with the keys "column" (required; value is the string column name), "length" (optional; value is an integer length of the column index), and "direction" (optional; value is either the string "asc" or "desc"). Example:
"columns": [
    "column_1",
    { "column": "column_2", "length": 20 },
    { "column": "column_3", "direction": "desc" }
]

This corresponds to an index on column_1, column_2(20), column_3 DESC, as in the DDL sketch below. We can add other keys to the index object should the need arise, or we can skip the shorthand string form and require that it be an array of objects in order to make parsing and schema validation easier.
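
In MySQL DDL terms, that example comes out as:

CREATE INDEX iname ON tname (column_1, column_2(20), column_3 DESC);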

Notwithstanding the representation, "index the full column" may be a non-starter in MSSQL, as indexes can be at most 900 bytes long. If the full column is larger, MSSQL will need some other way of indexing it. I'm still running tests, but I believe that if you create a computed column (based on the prefix length you care about) and then index that computed column instead of the main column, the query optimizer will select your index should the query be able to make use of it. I'm not 100% sure on this, though, and need to perform additional testing to verify. It may also be a feature in the optimizer gated by SKU (as in, available in Enterprise but not Standard or Express).
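
The approach I'm testing looks roughly like this (untested sketch; names illustrative):

-- Computed column holding just the prefix we care about...
ALTER TABLE tname ADD column_2_prefix AS CAST(LEFT(column_2, 20) AS varchar(20));
-- ...and an index on the computed column instead of the wide one.
CREATE INDEX iname_prefix ON tname (column_2_prefix);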

Anomie (talkcontribs)

Yeah, you're probably right. I'd rather keep the string shorthand since that's the most common case.

MySQL has a similar limit, although I don't know offhand the exact value. That's why indexes on "text" and "blob" type columns require the length. I'd like to prohibit them entirely, but MediaWiki's MySQL schema does have indexes on a few tinyblob columns: ipblocks.ipb_address, ipblocks.ipb_range_start, and ipblocks.ipb_range_end.
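
e.g. the range index on those columns looks roughly like this in tables.sql (paraphrased from memory):

CREATE INDEX ipb_range ON ipblocks (ipb_range_start(8), ipb_range_end(8));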

Anomie (talkcontribs)

Actually, since it's only those few tinyblob columns, let's go ahead and disallow non-fulltext indexes on "text" and "blob" columns when size is not tiny.

Reply to "Partial indexes"