Wikimedia Developer Summit/2017/Multi-Content-Revisions

Authors of these notes: Brion, James F., Addshore, Roan, Daniel + 3 unnamed authors

Video recording of session


  • Trying to go over what it is & why we want it
  • Related to scaling the schema
  • Who thinks they know what multi-content revisions is and how it works? (some hands)
  • mediawiki.org has extensive design documents (link below)

https://www.mediawiki.org/wiki/Multi-Content_Revisions

Basically, fixes two things:

1) "stop embedding weird shit in wikitext".

  • E.g. template data embedded in sub-pages with custom page properties; image metadata stored on the image description page with templates
  • Special parser tags sometimes used to store the data (e.g. templatedata)
  • These things are usually edited separately from other content, often with a custom syntax or editor.
  • Other things embed custom XML, JSON, etc.
  • Categories seem a similar kind of annotation. Some might argue infoboxes too!
  • Pretty sure we want to move image metadata, page quality, etc out... less certain about categories & infoboxes but it'd be nice if we could.
  • These things are typically edited separately from the rest of the content -- you edit the wikitext *or* you edit the page quality template, usually don't do both at once.
  • Editing template metadata is already separate from editing the template itself.
  • With MCR, multiple content objects (multiple data blobs) go into a single revision. Every revision has all slots; if you edit only one slot, the other slots simply don't change. This should reduce storage bloat as users edit: there is less data to store if we don't have to update the page text blob when editing a smaller metadata blob


2) Properly support associated content.

  • For instance, a template page is one page, and a subpage has the documentation. Or separate namespaces -- a Lua module in one namespace, documentation in another, with a mostly-standard naming convention. Gadgets have JS and CSS parts, distinguished by an 'extension' convention on the title. Some of these conventions are hardcoded in MW, others are hardcoded in user scripts, templates etc., and still others exist only in the minds of (some of) the users
  • If you update a template, it would be nice to *atomically* update the template and the documentation together.
  • If you update a lua module, update the module and its style template together.
  • Previewing similar changes atomically would help during development of templates/modules (before saving).
  • Lots of things use this mechanism of associated pages....
  • MCR gives atomic updates across multiple aspects of a page when you *do* want to make multiple changes together.
  • Page history will also be all together -- don't have to go searching for doc separate from code. Diff view will show them together.


Questions/comments so far? Who thinks this is a good thing? (most/many hands go up) Oh good :)

cscott: so edits can be atomic, but don't have to be? can do either?
Daniel: yes, storage infrastructure allows an edit to span multiple slots, or if you only want to update one at a time that works too.
James F: Theoretically you could enforce atomicity. E.g. if you add a new CSS class you also have to use it in the same edit, or it won't let you save
Daniel: The standard mechanism for editing text-based extra slots will probably be multiple text boxes. Other content types will need a custom editor.
cscott: are there different permissions on the different sections?
Daniel: Not right away
cscott: Would there be technical problems adding that?
Daniel: Don't think there would be big challenges, but it would mean encoding the slot name in the page protection table... Permissions are currently per-namespace...?
cscott: Strawman: I store translations, and I want only admins to edit translations.
Daniel: MW doesn't currently really have a mechanism for requiring specific perms for specific types of content. But let's discuss this later. Don't see big problems, would need to add one dimension in config/storage for perms.

Ok, so how is that going to work, and what are the implications for the storage/database layer? We have to split the revision table... Currently the revision table covers two conceptually separate aspects:

  1. information about the revision itself (who did it, summary, when?)
  2. information about the content itself (where is it stored, how big is it, what's the hash, etc)

When looking at Recent Changes you only need (1)...

If we have MCR, (2) needs to be split out, since we may have multiple content entries per revision. Indirection is needed. If we just split in two, we'd have to repeat the content info multiple times. Better to split it into three parts: a revision table, a content table, and a relationship table between them (see the SQL sketch after the list below):

  • on the left, the revision data table <- changes on every edit
  • center table holds pointer to revision, the slot role, and pointer to the content details
  • on the right, table with the content details. <- this item doesn't have to change if not edited.
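
A minimal SQL sketch of that three-table split (table and column names here are illustrative assumptions, not the actual proposed schema):

  -- Revision metadata: one row per edit (who, when, summary).
  CREATE TABLE revision (
    rev_id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    rev_page      INT UNSIGNED NOT NULL,
    rev_user      INT UNSIGNED NOT NULL,
    rev_timestamp BINARY(14) NOT NULL,
    rev_comment   VARBINARY(767) NOT NULL
  );

  -- Center table: one row per (revision, slot role); tall but narrow.
  CREATE TABLE slots (
    slot_revision_id INT UNSIGNED NOT NULL,
    slot_role_id     SMALLINT UNSIGNED NOT NULL, -- e.g. main, quality
    slot_content_id  BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (slot_revision_id, slot_role_id)
  );

  -- Content metadata: rows are reused unchanged across revisions.
  CREATE TABLE content (
    content_id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    content_size    INT UNSIGNED NOT NULL,
    content_sha1    VARBINARY(32) NOT NULL,
    content_model   SMALLINT UNSIGNED NOT NULL,
    content_address VARBINARY(255) NOT NULL -- pointer into blob storage
  );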

Roles/slots could be bound to content models but don't have to be; the main slot wouldn't be, for example. Code that assumes revisions have only one content object would hit the main slot.

Normalized model:

Rev 1
   |--- main -> content 1 -> (blob storage)
   +-- quality -> content 2 -> (blob storage)

What happens when you edit? (this is harder to draw in ascii):

Rev 2
   |--- main -> content 1 (unchanged, shared with Rev 1)
   +-- quality -> content 3 (changed relative to Rev 1's content 2)
Rev 3:
  • similar but touches different bits :)
  • (keeps going :D)

In some cases you do edit multiple slots at once; the model supports both cleanly and doesn't have to duplicate unchanged items.
Reverts would just recycle old content -- no new content rows, but still new rows in the revision and center tables.
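
As a hedged sketch over the illustrative tables above, here is what editing only the quality slot would write; wrapping the statements in one transaction is what makes multi-slot edits atomic:

  BEGIN;

  -- A new revision row is always written, whatever was edited.
  INSERT INTO revision (rev_page, rev_user, rev_timestamp, rev_comment)
  VALUES (7, 42, '20170110120000', 'update quality rating');
  SET @rev = LAST_INSERT_ID();

  -- The edited quality slot gets a new content row...
  INSERT INTO content (content_size, content_sha1, content_model, content_address)
  VALUES (64, 'c0ffee', 3, 'blob:xyz');
  SET @quality = LAST_INSERT_ID();

  -- ...while the unchanged main slot reuses the existing content row
  -- (content 1 here). A revert likewise just reuses old content ids.
  INSERT INTO slots (slot_revision_id, slot_role_id, slot_content_id)
  VALUES (@rev, 1 /* main */, 1),
         (@rev, 2 /* quality */, @quality);

  COMMIT;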

Roan: planning to have hash lookups for content objects? How do we detect reverts?
Daniel: application layer knows it's a revert
Tim: no automatic detection proposed here, similar to existing revert of single-content revs
Roan: manual reverts are possible though
Daniel: yes, that's harder.
Tim: automatic revert detection could require indexing on the hash, and the hash is often biggest thing on that content table...
cscott: backend blob storage might "know" from the hash (cf git), reducing need to detect this
Daniel: possible with content-addressable blob storage, yes
One more complication: hashes are something we need to think about because they are relatively big. Ideally we would like to have a hash for each content object *and* a hash for the combined content, allowing you to identify revisions with fully identical content. Could help with some common cases for the API and for tools looking at revision hashes.
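
One way the combined hash could be derived on demand rather than stored, by hashing the per-slot hashes in a fixed role order (purely a sketch over the illustrative tables above; the proposal doesn't pin this down):

  SELECT SHA1(GROUP_CONCAT(c.content_sha1
              ORDER BY s.slot_role_id SEPARATOR '')) AS combined_hash
  FROM slots s
  JOIN content c ON c.content_id = s.slot_content_id
  WHERE s.slot_revision_id = @rev;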
Tim: how useful is it to detect precisely byte-identical?
Gabriel: common case is edits, ...
Tim: if we store diffs instead of content, could reduce our storage by 90%?
Daniel: yes.... a bit different scope. Touches different areas of our storage.
Gabriel: related though, concentrates on blobs. Locality for compression?
Brad: can mostly abstract this on the storage layer....
Tim: like Gabriel says, need the locality information to do this most efficiently.
Daniel: We're getting pretty deep into nitty-gritty storage stuff. :) Do we want to address this in content metadata, or push it entirely out to the storage layer? In the proposal for the blob layer, you pass it the ID of the previous blob, which it can either use to optimize or ignore.
Gabriel: Potential bottleneck, additional latency... UUID for content detail? (notetaker: expand this?)
(is this correct?) Could do separate tables for each content type with a common revision id. (?) Or delegate the detail to the storage layer....
Tim: think we did talk about this in an ArchCom meeting
Daniel: content _object_ is in SQL; content blob _bytes_ are in separate storage service (or SQL or whatever) .... might be possible to include the content object data in restbase too....?
Gabriel: why storing so much in SQL; not sure best place for division
Daniel: Need to be able to know what's there, enumerate it etc. Might not _need_ to all be in SQL, but probably need a pure-SQL implementation so can run MediaWiki without external service. Doesn't feel like a huge problem?
Gabriel: Revision table is large, won't this make it bigger?
Daniel: splitting the revision table into three pieces does add some overhead -- the central table linking revisions to content objects is very "tall" but "narrow". (Reusing content objects means the total space overhead shouldn't be as big as it looks.)
Gabriel: (always adding three rows?)
Roan: always add a revision row and center-table rows; only add new content rows when needed.
cscott: do we always need to add 3 rows to the center table? Or can we use the structure of revision ids to associate them?
Daniel: may be possible, but leery of not having clear connections in the DB. Historically I was trying to just split between revision-data and content-data tables, but the 3-table solution seemed cleaner. Can do some of this incrementally -- blob storage can be reused as-is. Have carefully thought about the migration path and recovery options: we can abort without losing data. If we change the storage model more, it gets harder...?
Gabriel: most secondary content is keyed by revision already...
Daniel: most of it is keyed to the current revision only (page props etc.) :) some by rev_id though

Should be achievable through small incremental steps... Willing to reconsider the trinary table layout if necessary.
In terms of database queries: the queries needed to find the right content entries are more expensive with the <= mode. The trinary table layout is very efficient to load (a single index lookup on integer columns).
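
To make that comparison concrete, a hedged sketch of the two lookup styles (the "<=" mode being the alternative where a slot row stays valid for all later revisions until replaced):

  -- Trinary layout: a single indexed equi-join on integer columns.
  SELECT s.slot_role_id, c.*
  FROM slots s
  JOIN content c ON c.content_id = s.slot_content_id
  WHERE s.slot_revision_id = @rev;

  -- "<=" mode: for each role, find the newest slot row at or before the
  -- target revision -- a correlated subquery per role (and in practice it
  -- must also be restricted to slot rows belonging to the same page).
  SELECT s.slot_role_id, c.*
  FROM slots s
  JOIN content c ON c.content_id = s.slot_content_id
  WHERE s.slot_revision_id = (
    SELECT MAX(s2.slot_revision_id) FROM slots s2
    WHERE s2.slot_role_id = s.slot_role_id
      AND s2.slot_revision_id <= @rev
  );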

Brian Wolff: the backend makes sense. But want to know more about the semantic model of the separate items... is there always a main content object and subsidiaries whose meaning is determined by the main one, or...? Want to make sure it is well defined...
Daniel: idea is not to just let users decide to use extra slots; mediated by the software. Take real use-cases and turn them into well-supported sub-content items.
JamesF: concrete example: we want an extension that takes over the Template namespace with a template blob (wikitext), a documentation blob (also wikitext), a template styles blob (CSS), and a template data blob (JSON) -> make templates fit together...?

Brian Wolff: is there a common data model? What part of the code will be responsible for determining how to do stuff...
Daniel: that will depend! Some cases may just expose multiple edit boxes. Others will need specific editor types.
Brian Wolff: want some specific examples of cases to support. Seems a little hand-wavy (things that look cool)? Want something more specific.
Daniel: structured data for media files is definitely a want!
cscott: parsing team currently has a bunch of page properties with invisible tags, want to handle these sensibly.
Roan: VE currently has to do a lot of magic around inserting and recording these.
cscott: could filter RC feed or page history by type -- for instance only see those that change metadata. Could be very useful for some workflows.
Daniel: from conferences, hackathons etc. we have seen lots of things that could benefit from tying into revision versioning. Potentially we could even store old image data in a slot, but it's not a super-high priority. Categories desperately need a better API for editing them! Just finding out what's there is hard. It is extremely important for us to have this kind of storage... some cases we're committed to (image metadata) really need it, others could benefit but need it less.

Roan: on the wiki page there are some PHP example code bits / interface definitions -- want to talk a bit about this now or have more slides?
Daniel: (out of slides) could talk about it a bit, but things get complicated with examples. :)
(time check: 24 minutes remaining)

(internal PHP interface proposals)

RevisionContentInfo, RevisionSlotLookup
RevisionSlotLookup->getRevisionSlots(): interface to query the slots available for a revision. Give it a revision id and either the subset of slots you want, or it queries all slots. It gives you back the content metadata for each slot.
(Actual blob storage is indirected in the load/store interface.)

Do we need additional context there for, say, RESTBase stuff?
Gabriel, Tim: would be nice to have page id yeah
Gabriel: if no use case *requires* not having a page id, would be great to include it for locality on the storage backend.
Daniel: added concept of "hints" ... could be relevant here... could include page id or other context stuff at save time. (Do we need it at other times like lookup?)
Gabriel: could also associate metadata with the page by including page id...
Daniel: I would probably use a dedicated interface for page-related metadata that's not revisioned, rather than adding it here.
Gabriel: namespacing etc might be relevant?
Brad: probably get that from the page, not from this interface
Daniel: PageUpdateController interface takes the hints along with setting up a page/slots update. Can definitely pass the page id in there.
JamesF: a big concern about MCR is that Operations is wary of large-scale database changes. They're also concerned about the *current* database architecture. Would this make it worse? Jaime seems kind of OK with mitigation strategies like sharding tables. Views for making everyone happy?
Daniel: Will never make everyone happy. :)
Tim: Gabriel's proposing fetching things via RESTBase; think these could be bridged... pluggable storage system. Instead of the content table being directly queried in MW, could keep it behind an abstraction layer.
Daniel: That's the idea of the storage service yeah (for the blobs at least)
Daniel: To reply to James... too bad Jaime is not here right now. :) He will be in the later session on schema scalability. In my mind, the revision table will not have more entries -- and the content metadata table will not have significantly more (most edits will touch only the one main text entry). The tallest table will be the center id table -- but it's very narrow, all integers, and should be efficient. We could shard/partition it, but how? The revision table could be partitioned on page id... to do the same with the tall table we would have to add the page id to it. That makes it slightly wider, but it could be a benefit for sharding, in which case the extra width is less pressing.
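
A hedged sketch of that sharding idea (hypothetical, MySQL syntax; note MySQL requires the partitioning column to be part of the primary key, so the key grows too):

  -- Adding the page id makes the tall table slightly wider...
  ALTER TABLE slots
    ADD COLUMN slot_page_id INT UNSIGNED NOT NULL,
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (slot_page_id, slot_revision_id, slot_role_id);

  -- ...but lets it be partitioned by page, like the revision table.
  ALTER TABLE slots PARTITION BY HASH (slot_page_id) PARTITIONS 16;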
(pulls up table w/ estimates on storage impact)

Gabriel: coming back to page update controller; can enforce structure. But would this apply to legacy content?
Daniel: just for new revisions. Migration of legacy content shouldn't be affected... Split the tables and batchwise insert! Kill the old tables when sure everything's fine. Actual in-DB migration there, so shouldn't explode...
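
The batchwise split might look roughly like this (a sketch over the illustrative tables above; old_revision and its columns stand in for the pre-split revision table, and the real thing would run as a maintenance script in small batches):

  -- Copy content metadata out of the old revision table, one rev-id
  -- range at a time so replication lag stays manageable.
  INSERT INTO content (content_id, content_size, content_sha1,
                       content_model, content_address)
  SELECT rev_id, rev_len, rev_sha1, rev_model_id, rev_text_address
  FROM old_revision
  WHERE rev_id BETWEEN @start AND @start + 999;

  -- Every migrated revision gets exactly one 'main' slot row, reusing
  -- the rev_id as the content_id for a stable 1:1 mapping.
  INSERT INTO slots (slot_revision_id, slot_role_id, slot_content_id)
  SELECT rev_id, 1 /* main */, rev_id
  FROM old_revision
  WHERE rev_id BETWEEN @start AND @start + 999;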
Gabriel: what if we change some of the layout in future?
Daniel: Storage layer shouldn't care there, new stuff in future should remain consistent. (But be careful when changing plans.)
Lots of stuff on the wiki pages, be sure to check it out everybody!
Things like user permissions could be enforced by page update controller....
"Derived slots" have been dropped from the proposal, but auto-maintained slots could be handled in the page update controller too. (For instance 'blame maps' maintained as an extra slot, auto-generated from edit info.)
If we want to *stop* having a slot, tell the PageUpdateController to stop propagating it in new edits. That's something that might be harder with other models (<=).

?: revision deletion would still work on the level of revisions, not slots?
Daniel: yes for now at least...
Gabriel: can operate on the summary and other bits...
Daniel: those flags are in the rev table though. The mechanism would stay the same for now. If there's a need for per-slot flags, we could add that to content in the future. Keeping it minimal for now.
?: will summary be a slot too?
Daniel: Don't intend to, but .... it's possible.
JamesF: thought of making separate tables too, maybe slots would be good here. :)
Daniel: thought a lot about metadata and derived slots... let's not treat everything as a nail just because we have a hammer now. :D Separately, if we want to normalize the DB... it might make sense to have separate tables for title and summary. Will come back to that at the schema session.

Daniel: anyone think it's completely crazy? :D
(no hands go up)
(?): I'm much more convinced than I used to be! :D

Daniel: point of contention between self & Gabriel... keeping this at the SQL layer vs. separating it out to a separate backend... we need to work this out. Trying to stay relatively conservative in the changes. But it does change some of our biggest, most important tables. So trying to keep it conceptually not too far off from what was there before. Any tools on Labs that look directly at the rev table will need to be ported... it's a lot easier if some stuff stays the same and the rest is mostly just a different join. It's still a big step! The goal of moving out of SQL entirely is maybe more risky, but we should think about it.

Brad: i think MediaWiki needs an SQL implementation even if WMF doesn't need it.
Daniel: I agree
Gabriel: I'm actually not totally against SQL; we do need transactions and such for core bits. :) Main concern is the scalability & flexibility of the SQL layer, and how we structure the interfaces so that in the future we can migrate at least some stuff out.
Daniel: agree with these concerns! Happy to discuss further the need to store the primary relation in the DB.
In practical terms, hashes are possibly a difficulty: they take a lot of room, etc. Do we really need to store them as much?
Tim: not 100% sure we have a real use case. ;)
Daniel: they _do_ take up lots of room in narrow tall tables.
?: can we store half a hash? 64-bit...
cscott: depends on what we need the hash for...
JamesF: if this room doesn't know what the hashes are for, we're in trouble ;)