Currently, there is no page_content_format field, and "Each row in the content_model table will refer to a unique tuple of (content_model, content_format)." If a page's content model allows multiple formats, which model ID would be used? Perhaps there should be separate content_model and content_format tables.
Talk:Requests for comment/Content model storage
Yeah, that's a good idea. I talked with @Duesentrieb briefly and he liked the idea of having two tables. Will update the RfC shortly.
Separate content_model and content_format tables are much better, thanks.
Perhaps it would also be useful to put the default format for each model into the content_model table, as cm_default_format. This would be useful during migration, for filling in NULL values in fields like rev_content_format.
Maybe I'm misunderstanding what you meant, but wouldn't we fill in NULL with whatever ContentHandler::getDefaultFormat() returns? Why does it need to be in the database table?
Having it in the database would allow us to fill in the blanks without leaving the database. It's an option to consider, which may come in handy for some use cases. But you may very well be right that we don't need it.
I would rather keep it in PHP, mostly because you already need PHP code to "fill in the blanks" for things like namespace names. In any case, I'd like to leave it out for now because I don't think we should have the default specified in two places (PHP and database). Plus one of the goals of this RfC is to make changing the defaults easier, and this would somewhat work against that.
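For illustration, the "keep the default in code" option could be sketched as below. This is only a rough Python stand-in for what ContentHandler::getDefaultFormat() would do on the PHP side; the DEFAULT_FORMATS map and the resolve_format helper are hypothetical names invented for this sketch, not part of MediaWiki.

```python
# Hypothetical code-side table of default formats, standing in for
# ContentHandler::getDefaultFormat(). Not stored in the database.
DEFAULT_FORMATS = {
    "wikitext": "text/x-wiki",
    "javascript": "text/javascript",
    "css": "text/css",
    "json": "application/json",
}


def resolve_format(model, stored_format):
    """Return the stored format, or the model's default if the field is NULL."""
    if stored_format is not None:
        return stored_format
    # A NULL (None) value means "use the default for this model".
    return DEFAULT_FORMATS[model]


print(resolve_format("wikitext", None))
print(resolve_format("json", "application/json"))
```

With this approach the default lives in exactly one place, so changing it does not require touching any rows that still hold NULL.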
Instead of adding columns to various tables (page, revision, archive), I suggest creating a separate table that holds metadata about revision content (at least the model and format, but we will also want things like the role/slot, blob address, and hash there later, for multi-content-revision support; see Phab:T107595). The table would have at least these fields:
CREATE TABLE /*_*/content (
cnt_revision INT NOT NULL,
cnt_model INT NOT NULL,
cnt_format INT NOT NULL,
PRIMARY KEY (cnt_revision)
) /*$wgDBTableOptions*/;
This table can then be used to acquire the model and format for a given revision by joining cnt_revision against page_latest, rev_id, or ar_rev_id.
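A minimal sketch of such a join, run against an in-memory SQLite database rather than a real MediaWiki schema. The column names cm_id, cm_model, cf_id and cf_format are assumptions made for this example (the RfC text quoted here does not spell them out), and the revision table is reduced to the two columns the query needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Lookup tables; column names are assumed for this sketch.
CREATE TABLE content_model  (cm_id INTEGER PRIMARY KEY, cm_model  TEXT NOT NULL UNIQUE);
CREATE TABLE content_format (cf_id INTEGER PRIMARY KEY, cf_format TEXT NOT NULL UNIQUE);
-- Heavily simplified revision table.
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER NOT NULL);
-- The proposed content table, one row per revision.
CREATE TABLE content (
  cnt_revision INTEGER NOT NULL PRIMARY KEY,
  cnt_model    INTEGER NOT NULL,
  cnt_format   INTEGER NOT NULL
);
INSERT INTO content_model  VALUES (1, 'wikitext'), (2, 'json');
INSERT INTO content_format VALUES (1, 'text/x-wiki'), (2, 'application/json');
INSERT INTO revision VALUES (100, 7), (101, 8);
INSERT INTO content  VALUES (100, 1, 1), (101, 2, 2);
""")


def model_and_format(conn, rev_id):
    """Return (model name, format name) for a revision, or None if unknown."""
    return conn.execute(
        """
        SELECT cm_model, cf_format
        FROM revision
        JOIN content        ON cnt_revision = rev_id
        JOIN content_model  ON cm_id = cnt_model
        JOIN content_format ON cf_id = cnt_format
        WHERE rev_id = ?
        """,
        (rev_id,),
    ).fetchone()


print(model_and_format(conn, 100))
```

The same join shape would apply when starting from the page or archive table instead of revision.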
If we want to support multiple content "slots" per revision (as per Phab:T107595), cnt_revision would no longer be sufficient to identify the desired content. A cnt_role field would be added to identify the role the content plays in the revision (e.g. main, style, categories, meta, blame, etc). cnt_role would reference a content_role table defined in the same way as content_model and content_format. cnt_revision and cnt_role form a unique key. The table would then look like this:
CREATE TABLE /*_*/content (
cnt_revision INT NOT NULL,
cnt_role INT NOT NULL,
cnt_model INT NOT NULL,
cnt_format INT NOT NULL,
-- more fields to add for multi-content-revision support:
-- cnt_address, cnt_hash, cnt_logical_size, cnt_is_primary, etc
PRIMARY KEY (cnt_revision, cnt_role)
) /*$wgDBTableOptions*/;
CREATE TABLE /*_*/content_role (
cr_id smallint NOT NULL PRIMARY KEY AUTO_INCREMENT,
cr_role VARBINARY(32) NOT NULL
) /*$wgDBTableOptions*/;
CREATE INDEX /*i*/cr_role ON /*_*/content_role (cr_role);
When joining against page_latest, rev_id, etc., cnt_role will then have to be fixed (e.g. to the "main" role) to allow a unique match per revision.
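To make the uniqueness point concrete, here is a small sketch using an in-memory SQLite database, assuming a stripped-down revision table and made-up sample rows. The rows_per_revision helper is hypothetical; it counts how many content rows each revision matches, with and without the role fixed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE content_role (cr_id INTEGER PRIMARY KEY, cr_role TEXT NOT NULL UNIQUE);
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY);
CREATE TABLE content (
  cnt_revision INTEGER NOT NULL,
  cnt_role     INTEGER NOT NULL,
  cnt_model    INTEGER NOT NULL,
  cnt_format   INTEGER NOT NULL,
  PRIMARY KEY (cnt_revision, cnt_role)
);
INSERT INTO content_role VALUES (1, 'main'), (2, 'categories');
INSERT INTO revision VALUES (100), (101);
-- Revision 100 has two slots, revision 101 only the main one.
INSERT INTO content VALUES (100, 1, 1, 1), (100, 2, 2, 2), (101, 1, 1, 1);
""")


def rows_per_revision(conn, role=None):
    """Count joined content rows per revision, optionally fixing the role."""
    sql = "SELECT rev_id, COUNT(*) FROM revision JOIN content ON cnt_revision = rev_id"
    params = ()
    if role is not None:
        sql += " JOIN content_role ON cr_id = cnt_role WHERE cr_role = ?"
        params = (role,)
    sql += " GROUP BY rev_id ORDER BY rev_id"
    return conn.execute(sql, params).fetchall()


print(rows_per_revision(conn))          # role not fixed
print(rows_per_revision(conn, "main"))  # role fixed to "main"
```

Without the role condition, a revision with several slots yields several rows; once the role is fixed, the (cnt_revision, cnt_role) key guarantees at most one row per revision.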
My only concern with splitting this into another table is making sure that the tables stay in sync and, if they get out of sync, handling the failure gracefully. While it should mostly work properly with our transaction handling, we occasionally have freak accidents where page_latest isn't updated properly or something, so it's not 100% perfect. Using the same table (revision/archive) would neatly avoid that problem.
Manual:Coding conventions/Database#Table naming says, "Table names should be singular nouns: user, page, revision, etc." So shouldn't the new table's name be content_model, not content_models, in order to follow this convention?
I wasn't aware of that, updated. Thanks!
Hi legoktm,
I think this proposal makes sense and should be done. I have one (maybe dumb) question: the mapping of content model IDs and formats sounds like it could be done as a PHP array in a global variable, too. Maybe I'm missing a point you considered, but wouldn't it be possible to write this as a global array instead of a new database table? And if yes, why shouldn't we do that?
Thanks for your answer!
It could. But then we run into the same problem we currently have with namespaces: each extension picks an arbitrary number for its namespaces, and those numbers end up conflicting.
Setting it as a global would also encourage usage by developers, which we don't want. They should not be aware of what the id is, and shouldn't care either.
This proposal is similar to what we currently do with change tags. Everywhere in code (except for the ChangeTags class) we refer to tags by their string name, but in the database they are stored as integers, and there is a table that maps the integer to the string name.
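A rough sketch of that name-to-integer mapping, again using an in-memory SQLite database. The acquire_model_id helper, the cm_id and cm_model column names, and the 'ExampleExtensionContent' model are all hypothetical and only serve to show the idea: callers pass the string name, and the integer is assigned by the database rather than chosen by an extension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE content_model ("
    " cm_id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " cm_model TEXT NOT NULL UNIQUE)"
)


def acquire_model_id(conn, name):
    """Return the integer ID for a model name, creating the row if needed."""
    row = conn.execute(
        "SELECT cm_id FROM content_model WHERE cm_model = ?", (name,)
    ).fetchone()
    if row is not None:
        return row[0]
    # First time this name is seen: the database hands out the next ID,
    # so two extensions can never pick the same number by accident.
    cur = conn.execute("INSERT INTO content_model (cm_model) VALUES (?)", (name,))
    return cur.lastrowid


wikitext_id = acquire_model_id(conn, "wikitext")
extension_id = acquire_model_id(conn, "ExampleExtensionContent")
print(wikitext_id != extension_id)
```

This sketch ignores concurrent writers; a real implementation would need to handle two requests trying to insert the same name at once. Note that adding a new model is just a row insert, with no schema change involved.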
Ok, yeah, that makes sense. Thanks for your explanation! :)
I added a new section at the bottom of your RFC. It contains statistics from which I conclude that using ENUM would be a good alternative.
Hi,
Using an enum is not feasible. First off, the page/revision/archive tables are in MediaWiki core, and should not contain any references to extensions (Flow, Wikibase, Scribunto, etc.). Currently any extension can define arbitrary content models and formats, like MassMessage's MassMessageList content type (example).
Also, using an enum means we would need a schema change any time we wish to add a new content model or format.
Finally, I would have preferred discussing your alternative proposal on the talk page before we added it to the RfC. I've removed it from the page now because it doesn't fit our requirements.