Incremental dumps

Name and contact information

  • Name: Petr Onderka
  • Email: gsvick@gmail.com
  • IRC or IM networks/handle(s):
    • jabber: gsvick@gmail.com
  • Location: Prague, Czech Republic
  • Typical working hours: 15:00–22:00 CEST (13:00–20:00 UTC)

Synopsis

Mailing list thread

Currently, creating a database dump of the larger Wikimedia sites takes a very long time, because it is always done from scratch. Creating a new dump based on the previous one could be much faster, but that is not feasible with the current XML format. This project proposes a new binary format for the dumps that allows efficient modification of a dump, and thus creating a new dump based on the previous one.

Another benefit is that this format would also allow seeking, so a user can directly access just the data they are interested in. A similar format will also be created that allows downloading only the changes made since the last dump and applying them to a previously downloaded dump.
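
To illustrate why a seekable binary format helps, below is a minimal sketch in C++ of one possible layout; the structures and field names are my own placeholders, not the actual format, which has yet to be designed. The header points to an index mapping page IDs to file offsets, so a reader can jump straight to a single page instead of scanning the whole file.

  // Minimal sketch of a seekable dump layout. All structure and field names here
  // are hypothetical; they only illustrate the idea of an index, not the actual
  // file format. A real reader would also parse fields explicitly instead of
  // relying on in-memory struct layout.
  #include <cstdint>
  #include <cstdio>
  #include <map>

  struct DumpHeader {
      char     magic[4];        // file signature
      uint32_t formatVersion;
      uint64_t pageIndexOffset; // where the page index starts in the file
      uint64_t pageCount;
  };

  struct PageIndexEntry {
      uint32_t pageId;
      uint64_t pageOffset;      // offset of this page's record in the file
  };

  // Load the index once; afterwards any page is reachable with a single seek.
  std::map<uint32_t, uint64_t> loadPageIndex(std::FILE* f) {
      DumpHeader header;
      std::fseek(f, 0, SEEK_SET);
      std::fread(&header, sizeof header, 1, f);

      std::map<uint32_t, uint64_t> index;
      std::fseek(f, static_cast<long>(header.pageIndexOffset), SEEK_SET);
      for (uint64_t i = 0; i < header.pageCount; i++) {
          PageIndexEntry entry;
          std::fread(&entry, sizeof entry, 1, f);
          index[entry.pageId] = entry.pageOffset;
      }
      return index;
  }

  // Jump straight to the record of one page instead of scanning the whole dump.
  // (A real implementation would use 64-bit seeks for dumps larger than 2 GB.)
  bool seekToPage(std::FILE* f, const std::map<uint32_t, uint64_t>& index,
                  uint32_t pageId) {
      auto it = index.find(pageId);
      if (it == index.end())
          return false;
      return std::fseek(f, static_cast<long>(it->second), SEEK_SET) == 0;
  }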

Deliverables

I will start working right after my final exam on 27 June.

Required by midterm

  • script for creating dumps in the new format
    • using basic compression
    • includes full history dumps, current version dumps and “stub” dumps (which contain metadata only)
  • library for users to read the dumps

Detailed timeline

28 June – 30 June
  • create proposal of the file format
  • set up my working environment
1 July – 7 July
  • write code for creating and updating a stub current dump based on a range of revisions (getting the required information from PHP, saving it in C/C++)
8 July – 14 July
  • add other dump types (with history, with page text, with both; the text of each page will be compressed separately using a general-purpose algorithm like 7z; see the compression sketch after this timeline)
15 July – 21 July
  • handle articles and revisions that were deleted or undeleted
22 July – 28 July
  • write library for reading dumps
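
To make the per-page compression idea above concrete, here is a rough sketch assuming each revision's text is compressed on its own; zlib is used only as a stand-in for a general-purpose compressor like 7z/LZMA, and the function names are illustrative, not part of the project:

  // Per-revision compression sketch: each text blob is compressed independently,
  // so a reader can later decompress only the revisions it needs.
  #include <cstddef>
  #include <stdexcept>
  #include <string>
  #include <vector>
  #include <zlib.h>

  std::vector<unsigned char> compressText(const std::string& text) {
      uLongf compressedSize = compressBound(static_cast<uLong>(text.size()));
      std::vector<unsigned char> buffer(compressedSize);

      int result = compress(buffer.data(), &compressedSize,
                            reinterpret_cast<const Bytef*>(text.data()),
                            static_cast<uLong>(text.size()));
      if (result != Z_OK)
          throw std::runtime_error("compression failed");

      buffer.resize(compressedSize); // keep only the bytes actually written
      return buffer;
  }

  // The original size has to be stored alongside the compressed blob.
  std::string decompressText(const std::vector<unsigned char>& compressed,
                             std::size_t originalSize) {
      std::string text(originalSize, '\0');
      uLongf destSize = static_cast<uLongf>(originalSize);

      int result = uncompress(reinterpret_cast<Bytef*>(&text[0]), &destSize,
                              compressed.data(),
                              static_cast<uLong>(compressed.size()));
      if (result != Z_OK)
          throw std::runtime_error("decompression failed");

      text.resize(destSize);
      return text;
  }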

Required by final deadline

  • script for creating dumps
    • using smarter compression techniques for better space efficiency
    • will also create incremental dumps (for all three dump types) in a similar format, containing only the changes since the last dump
  • user script for applying an incremental dump to the previous dump (see the sketch after this list)
  • the file format of the dumps will be fully documented
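
To show how applying an incremental dump could work, here is a rough C++ sketch with hypothetical record types and an in-memory stand-in for the dump; the real diff format will only be defined during the project. The diff dump is treated as a list of change records that are replayed on top of the previous dump:

  // Rough sketch of applying a diff dump to an existing dump. The record types
  // and the in-memory representation are hypothetical placeholders.
  #include <cstdint>
  #include <map>
  #include <string>
  #include <vector>

  enum class ChangeType : uint8_t {
      NewPage,
      NewRevision,
      DeleteRevision,
      DeletePage
  };

  struct ChangeRecord {
      ChangeType  type;
      uint32_t    pageId;
      uint32_t    revisionId; // unused for page-level changes
      std::string text;       // revision text for NewPage / NewRevision
  };

  // Stand-in for the on-disk dump: page id -> (revision id -> revision text).
  using Dump = std::map<uint32_t, std::map<uint32_t, std::string>>;

  // Replay the change records on top of the previously downloaded dump.
  void applyDiffDump(Dump& dump, const std::vector<ChangeRecord>& changes) {
      for (const ChangeRecord& change : changes) {
          switch (change.type) {
          case ChangeType::NewPage:
          case ChangeType::NewRevision:
              dump[change.pageId][change.revisionId] = change.text;
              break;
          case ChangeType::DeleteRevision:
              dump[change.pageId].erase(change.revisionId);
              break;
          case ChangeType::DeletePage:
              dump.erase(change.pageId);
              break;
          }
      }
  }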

Timeline

1 August – 7 August
  • reading directly from MediaWiki, including deleted and undeleted pages and revisions
8 August – 14 August
  • creating diff dumps (they contain the changes since the last dump and can be applied to an existing dump)
15 August – 28 August (2 weeks)
  • implementing and tweaking revision text delta compression (see the sketch after this timeline); decreasing dump size in other ways
29 August – 11 September (2 weeks)
  • tweaking performance of reading and writing dumps
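
As a rough illustration of revision text delta compression, the sketch below stores a revision as its difference from the previous revision, keeping only the shared prefix length, shared suffix length, and the changed middle; this is a simplification, and the planned implementation may use a proper binary diff instead:

  // Simplified sketch of revision text delta compression: consecutive revisions
  // are usually almost identical, so storing only the changed middle already
  // shrinks most revisions dramatically.
  #include <algorithm>
  #include <cstddef>
  #include <string>

  struct TextDelta {
      std::size_t prefixLength; // bytes unchanged at the start
      std::size_t suffixLength; // bytes unchanged at the end
      std::string replacement;  // the part that actually changed
  };

  TextDelta makeDelta(const std::string& previous, const std::string& current) {
      std::size_t maxCommon = std::min(previous.size(), current.size());

      std::size_t prefix = 0;
      while (prefix < maxCommon && previous[prefix] == current[prefix])
          ++prefix;

      std::size_t maxSuffix = maxCommon - prefix;
      std::size_t suffix = 0;
      while (suffix < maxSuffix &&
             previous[previous.size() - 1 - suffix] == current[current.size() - 1 - suffix])
          ++suffix;

      return TextDelta{prefix, suffix,
                       current.substr(prefix, current.size() - prefix - suffix)};
  }

  std::string applyDelta(const std::string& previous, const TextDelta& delta) {
      return previous.substr(0, delta.prefixLength)
           + delta.replacement
           + previous.substr(previous.size() - delta.suffixLength);
  }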

Optional

  • user library for downloading only required parts of a dump (and only if they changed since the last download)
  • script to convert from the new format to the old format
  • SQL dumps?

About you

For quite some time now, I've been interested in working with Wikipedia data, usually to create some sort of report out of it. Over time, I have accessed this data in pretty much every form: XML dumps,[1] SQL dumps,[2] the API,[3] and SQL queries against the database on Toolserver.[4] I haven't done much in this area lately (partly due to work and school), but the interest is still there.

I am also interested in compression and file formats in general, specifically things like the format of git's pack files or Protocol Buffers. This has been more of a passive interest, though, since I hadn't found anywhere to use it. Until now.

I'm also among the top users on Stack Overflow for the mediawiki and wikipedia tags.

Participation

If it's clear what I'm supposed to do (or what I want to do), I tend to work alone. For example, in my last GSoC (for another project), where the interface was set in stone, I didn't communicate with my mentor (or the community) much. This project is much more open-ended, though, so I plan to talk with my mentor more (and, to a lesser degree, with the community).

I will certainly publish my changes to a public git repository at least daily. If I get access to my own branch in the official repo, I will push my changes there; otherwise, I will use GitHub. I already have some experience with Wikimedia's Gerrit, though I don't expect to use it much in this project.

Past open source experience

Lots of my work is open source in name only: I have published the code, but no one else ever worked on it.[5]

But for my bachelor thesis, I needed to extend the MediaWiki API, so I did just that. During my work on the thesis, I also noticed some bugs in the API, so I fixed them.

Last summer, I participated in GSoC for Mono. The goal was to finish the implementation of the TPL Dataflow concurrency library, which I did successfully.

Updates

June report

As mentioned, I actually started working on 28 June, so there isn't much to report.

What I did do:

July report

Most of the work planned for July is done. The application can create dumps in the new format (which can handle incremental updates efficiently) from an existing XML dump. It can then convert a dump in the new format back to XML. The generated XML dumps are the same as the originals (with a few expected exceptions).

The original plan was to have a library for reading the new dumps, but this was changed to XML output, because that's more convenient for current users of dumps.

There are two items that I planned for July but didn't actually do: reading directly from MediaWiki (instead of from an existing XML dump) and handling of deleted and undeleted pages and revisions.

August report

The timeline slipped in August. I have completed the first two planned features: creating incremental dumps directly from MediaWiki (i.e. not from XML dumps) and diff dumps (which can be used to update an existing dump). Work on compression, which was supposed to be finished in August, is currently in progress.

Notes

  1. Used in generating the Dusty articles report.
  2. I have written a library for processing SQL dumps from .NET and use it in the Category cycles report.
  3. My bachelor thesis was a library for accessing the API from C#.
  4. Used in generating cleanup listings for WikiProjects.
  5. Most of the work mentioned in the above notes belongs in this category.