Parsoid/Round-trip testing/Cassandra

The Parsoid code includes a round-trip testing system that is used to test code changes on a collection of 160k Wikipedia articles from 16 languages. The system consists of a server that hands out tasks and presents results, and clients that do the testing and report back to the server. Each client converts a wiki article to HTML, converts that HTML back to wikitext, and finally classifies any differences as either semantic or purely syntactic. This is essentially a map/reduce-style workflow: distributing the work across about 40 cores in different VMs lets us finish a round-trip run over all 160k pages in around 4 hours.
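
As a rough illustration of the client-side test, here is a minimal sketch in JavaScript. The two toy converters are hypothetical stand-ins for Parsoid's real parser and serializer, and the rule used to separate syntactic from semantic differences (comparing the HTML rendered from both wikitext versions) is an assumption made for this example, not a description of Parsoid's actual classifier.

```javascript
// Toy stand-ins for the real converters (hypothetical, bold markup only).
function toyWikitextToHtml(wikitext) {
    return wikitext
        .trim()
        .replace(/\s+/g, ' ')                        // collapse runs of whitespace
        .replace(/'''(.+?)'''/g, '<b>$1</b>');       // '''bold''' -> <b>bold</b>
}

function toyHtmlToWikitext(html) {
    // <b>bold</b> -> '''bold'''
    return html.replace(/<b>(.+?)<\/b>/g, (match, text) => "'''" + text + "'''");
}

// Round-trip one page and classify the outcome.
function roundTrip(wikitext, toHtml, toWikitext) {
    const html = toHtml(wikitext);
    const newWikitext = toWikitext(html);
    if (newWikitext === wikitext) {
        return 'clean';
    }
    // Assumption: a difference is purely syntactic when both wikitext
    // versions render to the same HTML, and semantic otherwise.
    return toHtml(newWikitext) === html ? 'syntactic' : 'semantic';
}

// A deliberately lossy serializer that drops bold markup.
const lossyHtmlToWikitext = html => html.replace(/<\/?b>/g, '');

console.log(roundTrip("'''a''' b", toyWikitextToHtml, toyHtmlToWikitext));
console.log(roundTrip("'''a'''   b", toyWikitextToHtml, toyHtmlToWikitext));
console.log(roundTrip("'''a''' b", toyWikitextToHtml, lossyHtmlToWikitext));
```

In this sketch the extra spaces in the second call are lost in the round trip without changing the rendered HTML, while the lossy serializer in the third call changes what the page renders to.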

Problem statement / introduction

The round-trip server has become a bottleneck that prevents us from scaling up the test system with more clients to process more pages. We are currently using a MySQL backend (after migrating from SQLite earlier). A few months' worth of round-trip results take up 31 GB on disk, and queries on this data slow down considerably as the database grows.

We recently saw very good results when testing Cassandra as a backend for a revision store for page content. Apart from the obvious benefits of replication, automatic fail-over and scalability for writes, we were impressed by the compression ratios (~19% of input text, including indexes and overhead) achieved when storing many revisions of a wiki page in consecutive blocks on disk. Test results for a given wiki page have similar characteristics (typically small changes between revisions), so they should compress similarly well.

A challenge for data modeling with Cassandra is its relatively limited ability to query the database. Cassandra specializes in queries that can be processed efficiently by reading a contiguous chunk of storage on one of the replicas. There is no (efficient) support for range queries on the primary 'partition' key, as this key is used to map an entry to a node in the DHT. There are also no joins, and only very limited support for filtering a non-contiguous result set. For more complex queries (a list of regressions, for example) this means that information often needs to be pre-computed and denormalized. Compared to relational systems, data modeling is driven very heavily by the main queries expected. Overall, moving from the relational MySQL schema to Cassandra will require a full redesign of the data model.
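
To make the query-driven modeling concrete, here is a small in-memory sketch in JavaScript. It does not talk to Cassandra. `ToyTable`, `resultsByPage`, `regressionsByCommit` and `recordResult` are hypothetical names invented for this example, and commits are assumed to be identified by increasing integers that arrive in order. Each "table" keeps one partition per key with rows sorted by a clustering key, and the regression list is pre-computed at write time because it cannot be produced later with a join.

```javascript
// In-memory stand-in for a table: one partition per partition key,
// rows inside a partition kept sorted by their clustering key.
class ToyTable {
    constructor() {
        this.partitions = new Map();
    }

    // Writing the same (partitionKey, clusteringKey) twice overwrites the row.
    upsert(partitionKey, clusteringKey, value) {
        const rows = (this.partitions.get(partitionKey) || [])
            .filter(row => row.clusteringKey !== clusteringKey);
        rows.push({ clusteringKey, value });
        rows.sort((a, b) => (a.clusteringKey < b.clusteringKey ? -1 : 1));
        this.partitions.set(partitionKey, rows);
    }

    // The cheap kind of query: a contiguous slice of a single partition.
    // Omitted bounds mean "unbounded".
    slice(partitionKey, from, to) {
        return (this.partitions.get(partitionKey) || []).filter(row =>
            (from === undefined || row.clusteringKey >= from) &&
            (to === undefined || row.clusteringKey <= to));
    }
}

// One table per expected query (hypothetical layout).
const resultsByPage = new ToyTable();        // history of results for a page
const regressionsByCommit = new ToyTable();  // regressions seen at a commit

// Store a result and denormalize the regression information on write.
function recordResult(page, commit, errorCount) {
    const earlier = resultsByPage.slice(page, undefined, commit - 1);
    const previous = earlier.length ? earlier[earlier.length - 1].value : null;
    resultsByPage.upsert(page, commit, errorCount);
    if (previous !== null && errorCount > previous) {
        regressionsByCommit.upsert(commit, page, { before: previous, after: errorCount });
    }
}

recordResult('Foo', 1, 0);
recordResult('Foo', 2, 3);   // more errors than at commit 1: a regression
recordResult('Bar', 1, 2);
recordResult('Bar', 2, 2);   // unchanged: not a regression
console.log(regressionsByCommit.slice(2));
```

Both reads used here (the history of one page, the regressions for one commit) touch a single partition, which is the access pattern the paragraph above describes as efficient.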

Cassandra bindings

We have tested Cassandra and node.js in the Rashomon revision store prototype. The node-cassandra-cql bindings used there worked well, and can also be used to hook up Cassandra to the round-trip server.

Getting started

git clone https://github.com/gwicke/testreduce.git

Quick start on Debian

If you are running Debian / Ubuntu, try adding this to /etc/apt/sources.list:

deb http://parsoid.wmflabs.org:8080/debian wmf-production/

Now install testreduce:

apt-get update
apt-get install testreduce

If everything went well you should have a test server running at http://localhost:8001/

General install

You need node.js 0.10 and MySQL. node.js is available in most current Linux distros (sudo apt-get install nodejs nodejs-legacy npm on Debian) and for OS X. It might also work on Windows (we have heard positive reports), but we don't really support Windows. The main developers all use Debian or Ubuntu Linux.

cd testreduce
npm install

To try the MySQL version of the server, you also need to install MySQL and create a database and a user.

In mysql:

CREATE USER 'testreduce'@'localhost';
CREATE DATABASE testreduce;
GRANT ALL ON testreduce.* TO 'testreduce'@'localhost';
FLUSH PRIVILEGES;

Create the db:

mysql -u testreduce testreduce < sql/create_everything.mysql

^[1]^[2]

Now copy server.settings.js.example to server.settings.js and change the following settings:

user testreduce
database testreduce
password "" (empty string)

Now start the server, which will be available at http://localhost:8001/:

node server

When this is working, it is time to import some titles to test:

cd articles
./initAll.sh

Start the server again. You can then run some round-trip tests by installing Parsoid and starting a test client:

git clone https://gerrit.wikimedia.org/r/mediawiki/services/parsoid
cd parsoid
npm install
cd tests/client
cp config.example.js config.js
node client

Cassandra setup

For most systems, see the Cassandra download page. On Debian, simply add deb http://www.apache.org/dist/cassandra/debian 20x main to /etc/apt/sources.list, and then run:

gpg --keyserver pgp.mit.edu --recv-keys F758CE318D77295D
gpg --export --armor F758CE318D77295D | sudo apt-key add -
apt-get update
apt-get install cassandra openjdk-7-jdk libjna-java libjemalloc1

In /etc/cassandra/cassandra-env.sh, change this line (near the end) to point to localhost:

JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=localhost"

(Re)start cassandra: service cassandra restart. The command

nodetool status

should return information and show your node (and any other nodes) as being up. Example output:

root@xenon:~# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns   Host ID                               Rack
UN  127.0.0.1  336.9 KB   256     100.0%  1d4b5052-63db-428b-8a62-0c8b15fdae10  rack1

Now you can start playing with Cassandra using the cqlsh command-line interface. See the Cassandra 2.0 and CQL 3.1 docs for more Cassandra background.

Next steps

- Check out the current round-trip server code and familiarize yourself with node.js.
- Read up on and play with Cassandra. You can also play with the Rashomon storage service prototype as a relatively simple example of node.js and Cassandra.
- Read up on eventually consistent systems and idempotence. Some starting points (feel free to edit):
  - Eventually Consistent
  - Dynamo
  - Building on Quicksand
  - Spanner as an example of a different trade-off with a good use of logical time that tracks GPS time
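
To illustrate why idempotence matters for a test server that receives retried reports, here is a minimal JavaScript sketch. The function names, the report fields and the `page@commit` key format are all hypothetical and chosen only for this example. It contrasts a counter update, which changes state on every delivery, with a keyed write, which can be repeated safely.

```javascript
// Non-idempotent update: every delivery changes the state again.
function countReport(stats, report) {
    stats.pagesTested += 1;
    return stats;
}

// Idempotent update: the report is stored under a key derived from what it
// describes, so delivering it again overwrites the same entry.
function storeReport(results, report) {
    results.set(`${report.page}@${report.commit}`, report.outcome);
    return results;
}

// Simulate a client that sends the same report twice, for example after a
// lost acknowledgement.
const report = { page: 'Foo', commit: 'abc123', outcome: 'syntactic' };
const stats = { pagesTested: 0 };
const results = new Map();
for (let attempt = 0; attempt < 2; attempt++) {
    countReport(stats, report);
    storeReport(results, report);
}

console.log(stats.pagesTested); // 2: the retry was counted twice
console.log(results.size);      // 1: the retry left the stored state unchanged
```

In an eventually consistent store, where writes may be retried or replayed, the keyed form is the one that stays correct without extra coordination.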

Contacting us

IRC: we hang out in the IRC channel #mediawiki-parsoid.

Notes

[1] Your root password might be the one entered during installation. If that does not work, the following should give you a root CLI on Debian/Ubuntu: sudo mysql --defaults-file=/etc/mysql/debian.cnf
[2] While in the command prompt of the mysql client program, you must enter this first: create user testreduce