Wikimedia Developer Summit/2016/T114019/Minutes


  • Session name: Dumps 2.0 for realz (planning/architecture)
  • Meeting goal: Make headway on the question: "what should the xml/sql/other dumps infrastructure look like in order to meet current/future user needs, and how can we get there?"
  • Meeting style: Depending on participants, either Problem-solving or Strawman
  • Phabricator task link:

Topics for discussion

  • use cases for the dumps, known and desired
  • where we currently fall short or are expected to fall short in the future
  • an ideal architecture for dumps that would address the main issues would look like... what?

General notes


Problems/Desired Features

  • Ariel: would like to get to the list of big things that users lack from dumps, or will find they need soon, or things that are broken about dumps. How would we meet these needs in an ideal world, not bound by the existing code
  • Ariel: Things users dislike (Also listed here - )
  • requests for different formats (JSON, for example) generally ignored
  • speed of generation, once a month is not enough, partial dumps are not enough.
  • reliability: broken "pretty often" is not fun to hear
  • download speed: pretty slow to download the dumps
  • incremental dumps don't exist
  • Wikidata perspective: Wikidata has its own dumps and it would be nice to have this integrated and have a machine generated index of all the dumps, and a nice user interface on top of all of it
  • Aaron: dumps available in "line-by-line" chunked format so analysis type processing can be parallelized -> some kind of unique section marker. One of the use cases is insurance against anything bad happening, protecting the corpus. Some data is not available, mediawiki core had to be changed to make some of it available
  • Mark: how does this related to sanitization for labs?
  • Ariel: not sure, but we want to make sure anything that's not staff-only should be available
  • Mark: don't want to solve this problem twice, sanitization. A wiki farm doing the same dumping process would run into too many problems, so the next implementation of dumps should be friendly to third party use
  • Nuria: is this worth it for other sites that don't have the scale problem?
  • Mark: we should be consistent in our SOA approach, so even if this isn't the highest priority we should make sure it gets done just in case someone else needs it
  • Ariel: agree that we need to have the baseline dump script available for smaller wikis, and simple to use
  • Andrew: sort of a philosophical question about how far do we go to make our stack very easy to use for third parties.
  • Andrew: no XML please, hard to deal with in distributed systems (strings get cut off and the resulting pieces are invalid XML)
  • Ariel: when we talk about improvements we're talking about two areas: one python script that does the dumping, and one mediawiki code base that I (Ariel) am not as familiar with
  • Ariel: three layers of jobs, hideous and organic, doesn't account for any new dumps (wikidata, html), would love to chuck the code base (both python and mw code)

Thoughts on Architecture

  • Ariel: let's talk architecture
  • Marko: Event Bus could be a good solution, formats just become a matter of adaptors that dump in different ways
  • Dan: snapshot vs. full stream of changes
  • Ariel: I want to get the status at any arbitrary moment in time.
  • debating stream vs. snapshot
  • Dan: hybrid approach of dumping DB state for certain things (like categories), dumping the full event stream for other things (like revisions)
  • Andrew: would take a long time to recompute the state at an arbitrary period in time, might not be worth it for large wikis
  • Ariel: it'd be interesting to see those new streams, but I'd want to see some use cases before we invest effort
  • Nuria: can we do a prototype of doing dumps off of Event Bus?
  • Andrew: what takes a long time?
  • Andrew: do you have dedicated slaves that generate these dumps?
  • Ariel: yes, we have a separate cluster
  • Mark: this logic was created in 2006, and back then it made sense, but now it could be faster / better to just dump the full db every time
  • Andrew: thinks we should look into stopping the slaves and dumping
  • Ariel: revisions get deleted, so we have to re-generate full history sometimes
  • Mark: right, otherwise we could use a git style system
  • Mark: what's consuming the time
  • Ariel: file processing (reading the old dumps, uncompressing / re-compressing)
  • Mark: new compressions considered?
  • Ariel: yes, but we're not getting an order of magnitude better
  • How much time does the querying the database take?
  • Ariel: we only query in a few cases, so it's not a bottleneck
  • For Wikibase, we just iterate over the database and read stuff from external storage. It takes about 11 hours to dump all current revisions of wikidata. A lot of it is waiting for external storage. About 70% in snapshot service.
  • Wikibase also has bz2, which took hours to create, so we use pbzip2, and some libraries can't read these. Takes 4 or 5 hours to compress
  • Nuria: many different projects
  • dumps getting slower and slower, performance
  • incremental dumps are separate from that, and more like a user requested feature
  • Andrew: why is the dump job not distributed
  • Ariel: don't want to complicate and add even more layers to a complicated system
  • manually figure out which jobs to run and babysit problems like PHP memory leaks
  • would love to just dump things in a centralized place that farms out the work to a distributed systems
  • Marko: Event Bus can be used to distribute the jobs to available consumers
  • Ariel: my plan: take 100,000 pages, send them off and say run them, time limit max 1 hour, get the job back and it looks like it only ran 20,000 jobs, split them up, throw them back to the queue and run again
  • Nuria: seems like a logistics problem, because we'll have to spend resources working on a new solution and maintaining the old system
  • Andrew: 15 min. warning

Next steps

  • Ariel: I want to walk out with a list of things we should investigate further, possible solutions, people interested in following up
  • Possible Ideas:
  • Event Bus for the incremental part of the dumps
  • Change Propagation system to coordinate the jobs
  • HTML dumps to use the html directly instead of asking mediawiki to render it
  • Ariel: would like to have a system where someone just sets up a little puppet configuration, and that's all that's needed to run a specific dump job
  • Mark: think there should be consideration about what is the actual format we want to dump. HTML and raw event dumps are different enough that transformations need to be thought of ahead of time (more so than XML to JSON)
  • Dan: idea: use heuristics to regenerate data only when it makes sense. For example, if most revisions are deleted within 5 days, don't do daily regeneration of revisions that are more than 5 days old. Those would still change, but we can regenerate them monthly and still keep the level of vandalism in the dumps nomially low
  • Mark: if we had rolling regeneration and de-coupled the streams of raw data from the output, we could have flexibility to generate different outputs and they would all be as clean as possible
  • Andrew: should look at hadoop because this problem matches the map-reduce paradigm pretty well
  • Nuria: maybe celery's easier as an interrim solution, since Hadoop doesn't talk to mediawiki easily
  • (?) be able to view the state of a wiki page at that moment in time (not just wikitext but how the templates would expand, what category it's in, etc), what would be needed, are there use cases?
  • Post-meeting discussion: keep pages plus their revision content as single items in HDFS, incremental update means adding pages, removing pages? Regenerating means rolling these up into small downloadable files?

Jaime discussion notes

  • incremental dumps for sql tables. some of these tables we could keep track of the last row dumped and then dump everything inserted afterwards.
  • some tables are insert only, these in particular can be easily managed for incrementals
  • private data and dumping only certain columns based on values of flags in other columns (or even fields in other tables)
  • would it make sense for performance to not rely on MW managing the decision about which data is public? do we want to rely on the views we use for replication to labs?
  • want a more formalized process for changes to schema, where every item in every table is annotated by the devloper as private, public, or depends on certain flag values.
  • formalize the db schema change workflow. See: This will not cover cases where the use of a field is changed; look at past changes for examples.
  • note that not all changes to dbs are done via mw but some are directly out of band, need to track those.
  • note tokudb + mysqldump of all tables = huge speed win. takes 12 hours. use this as a basis for filtering out private data? but there is still es. use the "labs view"
  • : this is the production-side filtering, done with replication filters + triggers. Maintained by production DBAs (Jaime)
  • Maintained by labs (was Coren) Views restricting access to certain fields based on flags in other fields, or based on user privileges
  • if we had a db with only public data in it, how useful would that be? we could export it as full sql dbs and provide the data in that format. so bypass mwbzutils conversion tools!
  • sort revision content by primary key order and retrieve that way, might be much faster, up to ten times because of mysql db insert order - close physical proximity on the disk! and then a read of one item pulls in the rest etc
  • jaime mysql wrapper abstracting lb class, for use by monitoring and db admin, in development, eventually want to remove dependence on MW so this would be a separate library. currently opens new connection for every request (need this to be dependent on an option)

Hoo discussion notes

  • wants api for the user to find out which dumps are in progress, complete, one-click download of latest <lang> wiki<type> etc.
  • desired: way to capture data from history that is not revision/page metadata and would only be present in a full dump of the specific date (e.g. category changes, we don't have a history of those)

Adam Wight discussion notes

  • I don't have anything architecturey yet, but from a risk management perspective, the migration path seems a bit unclear. I tried to imagine a low-investment way to experiment with Dumps 2.0, in task T126722.
  • Would symmetric nodes be better than a master scheduler node? How much complexity that would add? This would give us extra fault tolerance and make it simple to run on additional clusters when needed. One easy (but not high availability) division of labor would be to have a master scheduler assign an entire dump to a worker scheduler, which can then subcontract chunks to other workers in turn. And pay them very little indeed... :S
  • We should use multiple brokers for high availability. Redis Cluster would give us that for free, otherwise it seems to be supported by Celery somehow.
  • Can someone who has survived a Celery integration please vouch that it's easy and fun? @Yuvipanda: ?
  • I like JCrespo's idea of using WMF Labs views as our sanitization, mostly because that lets us reuse existing logic. Dumping from a paused Labs slave seems ideal, if the network and filesystem considerations aren't too horrible. Presumably, the Labs sanitization happens on the main cluster, so we should be able to replicate a slave from there, to avoid crossing network segments?
  • We need some statistics about the time it takes to suppress revisions in oversight situations. Rather than risk mirroring these bad changes, we might want to delay incremental (and full) dumps by a day or two. I guess that would look like, is that we would dump from an up-to-date, sanitized database, but daily incrementals for example would actually contain 2 days > age > 1 day old revisions.
  • The discussion about dump formats is really interesting, and IMO should take place independently of this framework rewrite. A lot of the dump consumption use cases sound badly broken, we should collect those stories and identify pain points. I agree with apergos that a diversity of formats is great, but at the same time we should choose a canonical set and then put energy into improving that to cover 90% of potential uses, include indexes and stuff, and provide reference consumer libraries.
  • gnosygnu's is really onto something, historical dumps should only change when the formats are updated.
  • Just a tiny detail, I think it's important that our runner framework interfaces with chunk processing jobs at the command-line, to minimize the amount of glue we need. Jobs will be written in multiple languages, lets not mess with bindings. On that note, the object store interface should be something simple to access from any language, maybe Redis?

Gabriel Wicke discussion notes


The main needs seem to be to speed up the dump process by doing incrementals and to allow third parties to incrementally update their dumps.

For HTML dumps we resolved this by using compressed sqdblite dbs. Lots of tools for sqlite, only issue is concurrency, but since updates take so little time you can do without.

Current revisions for the main namespace for en wikipedia are 200GB uncompressed, 27.4GB with xz compression, it takes less than two hours to walk through all page titles and revisions, retrieve information via the api and compare to current sqlite db contents. Note that we uncompress the db first, run this, then recompress, but even so it's not that expensive. The db is currently page-title based instead of page id, so page renames are a bit of a drag. For large dbs (with revision history) we would want to divide up the db into several smaller ones, probably by page id range. Storing revisions per page together would mean better compression too.

We were looking into brotli compression, not enough support for it yet. Have a look at benchmarks:

So once you have a full dump, an incremental works like this: uncompress full db, get a working copy, get the changes for the page id range, apply those, recompress, atomically move to new location. If it fails you just get a new copy of the db and uncompress again, have many many shards so it's fast. This assumes you get a feed of relevant events. There is no authentication right now for the events feed til we go to kafka 0.9, you use a kafka consumer, there's a python library for that.

Eventbus stuff:

Misc notes:

  • if people download new dumps, how do we support people with xml based tools? write a converter
  • not all fields in xml dumps are yet in the sql dbs, but adding additional fields is relatively easy
  • most folks have import step from xml to db or something else. some people also stream it, most researchers don't stream
  • note that db format allows parallel access right away (but you do have to uncompress first)

Supporting OpenZIM production:

  • zim format is good for random access into a file, has lzma compression
  • a while ago they switched to restbase (the openzim / kiwix folks), they want the html dumps instead of hitting the restbase api for everything
  • they need hardware to really get their job done, maybe share our server for the html dumps? or just rent a server somewhere, they have been doing by themselves so far
  • funding is covered right now, they are mainly interested in comaintaining and being integrated into normal dump process somehow
  • there's a ticket open for network isolation about isolating services from each other, that might help, but this is very manual right now, mark and faidon are worried about doing this right now for that reason:

For sql table dumps, maybe leave as is, if we need incrementals, we could consider providing a script that reads sqlite & writes each row to mysql (or generally $DB)

We might allow users to apply changes from eventbus themselves;(we had talked about daily event logs for download with a tool to apply them). Most of those API requests (for recent changes) can be cached, so shouldn't cause terribly high load on our backend infrastructure, varnish would cache and serve these

Milimetric notes


The whole team participated in a discussion so these notes are a summary of that discussion.

EventBus discussion


Concern raised about a recent site outage related to MW failure to send events when kafka server was down. This turned out to be an hhvm bug in fsockopen implementation, so that when you try to connect to a dead host, it will not respect the timeout you pass it. Ticket: So we really should not have normally the situation where MW app servers are backed up waiting to write events. However, we then have the problem of how not to miss events from MW.

For MW an event is 'committed' when sql transaction is complete, but then there is the possibility of failure to emit the event. One possibility: have a process comb through dbs periodically looking for events not emitted; this would require db schema change to have some field indicating that an action had its corresponding event emitted. Another option less likely; have a process reading the binlog and feeding the entries to a script that generates the appropriate events.

The binlog approach is potentially hard because we have to translate from that format to get actual clean events. But it might be a good way if used in addition to MW emitting events, because events produced from these two sources could be mutually compared and cleansed.

kafka mysql binlog libs:

Question: how well do the binlog producers replicate events, how many events do they miss? What about if they're put under stress? kill -9, memory constraints, etc. existing event stream, improvements pending:

Users might want only a subset of the dumps, e.g. some wiki project, do we want to apply filtering to events somehow? Is the api just 'fast enough'? We always have to provide full history dumps though because right to fork. Depending on answers to the previous, look into store pre-computed partial dumps (would be trading storage and computation for bandwidth and potentially helping researchers); haivng a daily file (santized, reformatted) of events for users to grab and apply, and for processing incrementals, instead of having to real time process events; converter scripts that convert our canonical format to format the users want (maybe we can get them to write those scripts!)

Output formats: Aaron wanted line by line format so each revision is in its own object (record, whatever) atomic so it can be dealt with, json event level data would be good, you could partition it in folders and import it into hive. Note that we can't just do a straight up grouping of revisions by page id, because there are too many revisions in some pages to deal with. We do however want to connect the revisions by page somehow and order them.

A weird problem: sometimes events are received that are a month old or even older. Why? How far back? what happens when we receive these events, how would we process them?

Question: what is the nature of the events that we need to store? Do we really get revision events from over a year in the past? Knowing this will help us decide on an incremental file storage scheme. Without knowing this, we might have to store files containing revisions in two separate directory hierarchies, so we can access page-lev el information and time-based information (THE FOLLOWING IS NOT A SUGGESTION!! :)): :-)

  • <<project_id>>/<<page_id>>/<<year-month-day-hour>>_revision_metadata.json
  • <<project_id>>/<<year-month-day-hour>>/<<page_id>>_revision_metadata.json
  • <<project_id>>/<<year-month-day-hour>>/<<page_id>>/<<revision_id>>/

But if we do know the nature of the events, we may be able to be more efficient

we want to have the text in one canonical location and then (like indices) have multiple "Pointers" to the text. (otherwise we run the very real risk of the text copies getting out of sync, and then we're screwed.) this lets us not have to do very expensive sorting operations. Time-stamped based? One thing about how revisions are applied is that imports of pages + their revisions from other wikis and subsequent merges can really screw the pooch for that (just a thing to keep in mind). Note that undelete of pages now finally means we keep the old page id at any rate, that change was recently merged.

  • Schemas:
  • EventLogging:


Halfak notes


Output format thoughts


Having an output format useful for distributed computing (Hadoop etc) would be good. So JSON records -- one line per revision (page partitioned and chronological) for import into Hadoop/Spark. The current dumps get converted to "RevDoc" format, then sorted into timestamp order, split up and imported. (Sorting is necessary because we don't produce revisions per page in a guaranteed order, see and comments. RevDoc format information here: Note that some information (namespaces, dbname) is lost during this process, and while the api could be queried for that info at the time of the import into Hadoop, things might have changed since the dump was run.

It would be nice to have modular output formats that can be maintained by the interested parties (e.g. Aaron like Revdocs, so he maintains revdocs output) so data gets into hadoop as described above (let's say) and then halfak's converted job takes over to create the desired output. So we need a modular way to add new output formats and a way to schedule these conversion jobs. E.g.: Halfak writes code to perform conversion, system then implements conversion with each new dump (or incremental).

We could also have a converter directly for events emitted to EventBus that can convert to RevDoc format without worrying about ordering.

Secondary data sets


Secondary data sets that could be generated from dumps,e.g. diffs between revisions, are used frequently for research. Collapsing differences between several revisions is cheap. Diffs (generally applying ) N^3 at worst case -- (guess) Takes about 100 days to generate diffs for Enwiki's articles current revs vs prev revs, on one core. On hadoop, we have this down to 3 days -- Could get down to 24 hours possibly (note that includes compression time) using bz2 (sadly) because of ecosystem consistency considerations. Anyways a pipeline produces several intermediate sets of data when going from xml to (data I process in hadoop), the intermediate sets are useful, want to be able to do diff generation on these too.

Other issues


what do we do with page imports as events with old data? Do we want a known event order (token with each event in serial) so we can know that we are missing X from the stream and ask for the back events?

What about incremental dumps with sensitive content, i.e. deleting (rev oversight) old revisions that make it into the dumps?

Is it worth looking into hadoop as a service for cost reasons? We wouldn't be able to give them private tables/wikis/columns. What about bandwidth for downloaders? AWS backend would cover that, there's issue of cost of bw course. Current example: the Research Team is working with a vendor that provides Cloudera package services (Hadoop/Spark/etc.) on a Analytics Team-scale cluster for ~$5k per month. (public data only!)

Action items with owners

  • Create a project and get everyone on it for further discussion (Ariel) - DONE
  • identify other action items from discussion on project , coordinate further discussion (Ariel) - IN PROGRESS

DON’T FORGET: When the meeting is over, copy any relevant notes (especially areas of agreement or disagreement, useful proposals, and action items) into the Phabricator task.

See for more details.

Post-session supplemental notes


Lightweight VCS as intermediary


Update 2016-03-06: Brion is thinking along different lines, more akin to a relatively lightweight version-control system as an intermediary between the MediaWiki database on the one hand, and a local data set on the other.

Notes & questions mirrored from blog post:

The basic data model for the main content dumps hasn’t changed much in 10 years or so, when I switched us from raw blobs of SQL ‘INSERT’ statements to an XML data stream in order to abstract away upcoming storage schema changes… fields have been added over the years, and there have been some changes in how the dumps are generated to partially parallelize the process but the core giant-XML-stream model is scaling worse and worse as our data sets continue to grow.

One possibility is to switch away from the idea of producing a single data snapshot in a single or small set of downloadable files… perhaps to a model more like a software version control system, such as git.

Specific properties that I think will help:

  • the master data set can change often, in small increments (so can fetch updates frequently)
  • updates along a branch are ordered, making updates easier to reason about
  • local data set can be incrementally updated from the master data set, no matter how long between updates (so no need to re-download entire .xml.bz2 every month)
  • network protocol for updates, and access to versioned storage within the data set can be abstracted behind a common tool or library (so you don’t have to write Yet Another Hack to seek within compressed stream or Yet Another Bash Script to wget the latest files)

Some major open questions:

  • Does this model sound useful for people actually using Wikipedia data dumps in various circumstances today?
  • What help might people need in preparing their existing tools for a switch to this kind of model?
  • Does it make sense to actually use an existing VCS such as git itself? Or are there good reasons to make something bespoke that’s better-optimized for the use case or easier to embed in more complex cross-platform tools?
  • When dealing with data objects removed from the wiki databases for copyright/privacy/legal issues, does this have implications for the data model and network protocol?
    • git’s crypto-hash-based versioning tree may be tricky here
    • Do we need a way to both handle the “fast-forward” updates of local to master and to be able to revert back locally (eg to compare current and old revisions)
  • Technical issues in updating the master VCS from the live wikis?
    • Push updates immediately from MediaWiki hook points, or use an indirect notify+pull?
    • Does RCStream expose enough events and data already for the latter, or something else needed to ‘push’?
    • Can update jobs for individual revisions be efficient enough or do we need more batching?