The Parsoid code includes a round-trip testing system that is used to test code changes on a collection of 160k Wikipedia articles from 16 languages. The system consists of a server that hands out tasks and presents results, and clients that do the testing and report back to the server. Each client converts a wiki article to HTML, converts that HTML back to wikitext, and finally classifies any differences as either semantic or purely syntactic. This is essentially a map/reduce-style workflow: distributing the work to about 40 cores in different VMs lets us finish a round-trip run over all 160k pages in around 4 hours.
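The classification step can be illustrated with a minimal sketch. This is not Parsoid's actual diff classifier: `normalize` is a hypothetical stand-in that assumes whitespace-only changes are purely syntactic, and all function names are invented for illustration.

```javascript
// Toy classifier for round-trip results. NOT Parsoid's real algorithm.
// Assumption: differences that vanish after collapsing whitespace are
// "purely syntactic"; everything else is treated as "semantic".
function normalize(wikitext) {
  return wikitext.replace(/\s+/g, ' ').trim();
}

// Map step: classify one page as 'none', 'syntactic' or 'semantic'.
function classifyRoundTrip(original, roundTripped) {
  if (original === roundTripped) {
    return 'none';
  }
  if (normalize(original) === normalize(roundTripped)) {
    return 'syntactic';
  }
  return 'semantic';
}

// Reduce step: count result classes over many pages.
function summarize(results) {
  const counts = { none: 0, syntactic: 0, semantic: 0 };
  for (const r of results) {
    counts[classifyRoundTrip(r.original, r.roundTripped)] += 1;
  }
  return counts;
}
```

In this sketch a client would run `classifyRoundTrip` per page, and the server would aggregate the reported classes much like `summarize` does.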
Problem statement / introduction
The round-trip server has become a bottleneck in the round-trip test system, preventing us from scaling up with more clients to process more pages. We are currently using a MySQL backend (after migrating from SQLite earlier). A few months' worth of round-trip results take up 31 GB on disk, and queries on this data slow down considerably as the database grows.
We recently saw very good results when testing Cassandra as a backend for a revision store for page content. Apart from the obvious benefits of replication, automatic fail-over and scalability for writes, we were impressed by the compression ratios (~19% of input text, including indexes and overhead) achieved when storing many revisions of a wiki page in consecutive blocks on disk. Test results for a given wiki page typically show similarly small changes between revisions, so they should compress similarly well.
A challenge for data modeling with Cassandra is its relatively limited ability to query the database. Cassandra specializes in queries that can be efficiently processed by reading a contiguous chunk of storage on one of the replicas. There is no (efficient) support for range queries on the primary 'partition' key, as this is used to map an entry to a node in the DHT. There are also no joins, and only very limited support for filtering a non-contiguous result set. For more complex queries (a list of regressions, for example) this means that information often needs to be pre-computed and denormalized. Compared to relational systems, data modeling is driven very heavily by the main queries expected. Overall, this means that moving from the relational MySQL schema to Cassandra will require a full redesign of the data model.
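To make the idea of query-driven, denormalized modeling more concrete, here is a small in-memory sketch. It does not use Cassandra or any Cassandra API. `ToyTable`, `resultsByPage`, `regressionsByCommit` and the field names are all hypothetical, and the only "efficient" read it offers is fetching one partition in clustering order, which loosely mirrors the constraint described above.

```javascript
// Toy model of partitioned storage. Hypothetical names; no Cassandra involved.
class ToyTable {
  constructor() {
    // partition key -> rows kept sorted by clustering key
    this.partitions = new Map();
  }

  // Upsert one row: writing the same (partition, clustering) key twice
  // replaces the earlier row instead of appending a duplicate.
  insert(partitionKey, clusteringKey, value) {
    const rows = (this.partitions.get(partitionKey) || []).filter(
      (r) => r.clusteringKey !== clusteringKey
    );
    rows.push({ clusteringKey, value });
    rows.sort((a, b) =>
      a.clusteringKey < b.clusteringKey ? -1 : a.clusteringKey > b.clusteringKey ? 1 : 0
    );
    this.partitions.set(partitionKey, rows);
  }

  // The only read offered: one whole partition, in clustering order.
  readPartition(partitionKey) {
    return (this.partitions.get(partitionKey) || []).map((r) => r.value);
  }
}

// One table per expected query.
const resultsByPage = new ToyTable(); // partition: page, clustering: commit sequence
const regressionsByCommit = new ToyTable(); // partition: commit sequence, clustering: page

// Write path: store the result and pre-compute the regression entry,
// because there is no join available at read time.
function recordResult(page, commitSeq, semanticErrors) {
  const earlier = resultsByPage
    .readPartition(page)
    .filter((r) => r.commitSeq < commitSeq);
  const previous = earlier[earlier.length - 1];
  resultsByPage.insert(page, commitSeq, { commitSeq, semanticErrors });
  if (previous && semanticErrors > previous.semanticErrors) {
    regressionsByCommit.insert(commitSeq, page, {
      page,
      from: previous.semanticErrors,
      to: semanticErrors,
    });
  }
}

// Read path: the regression list for a commit is a single partition read.
function regressionsFor(commitSeq) {
  return regressionsByCommit.readPartition(commitSeq);
}
```

The trade-off in this sketch is that the write path does extra work and stores data twice so that the list of regressions becomes a single contiguous read. A real schema would also have to decide how to handle results that arrive out of order.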
We have tested Cassandra and node.js in the Rashomon revision store prototype. The node-cassandra-cql bindings used there worked well, and can also be used to hook up Cassandra to the round-trip server.
git clone https://github.com/gwicke/testreduce.git
Quick start on Debian
If you are running Debian / Ubuntu, try adding this to /etc/apt/sources.list:
deb http://parsoid.wmflabs.org:8080/debian wmf-production/
Now install testreduce:
apt-get update
apt-get install testreduce
If everything went well you should have a test server running at http://localhost:8001/
You need node.js 0.10 and MySQL, which are available in most current Linux distros (sudo apt-get install nodejs nodejs-legacy npm on Debian) and for OS X. It might also work on Windows (we heard positive reports), but we don't really support Windows. The main developers all use Debian or Ubuntu Linux.
cd testreduce
npm install
To try the MySQL version of the server, you also need to install MySQL and create a db and user:
create user testreduce;
create database testreduce;
GRANT ALL ON testreduce.* TO 'testreduce'@'localhost';
flush privileges;
Create the db:
mysql -u testreduce testreduce < sql/create_everything.mysql
Now copy server.settings.js.example to server.settings.js and change the following settings:
user: testreduce
database: testreduce
password: "" (empty string)
Now start the server at http://localhost:8001/:
When this is working, it is time to import some titles to test:
cd articles
./initAll.sh
Start the server again. You can then run some round-trip tests by installing Parsoid and starting a test client:
git clone https://gerrit.wikimedia.org/r/mediawiki/services/parsoid
cd parsoid
npm install
cd tests/client
cp config.example.js config.js
node client
See the Cassandra download page for most systems. On Debian, simply add
deb http://www.apache.org/dist/cassandra/debian 20x main to /etc/apt/sources.list, and then do
gpg --keyserver pgp.mit.edu --recv-keys F758CE318D77295D
gpg --export --armor F758CE318D77295D | sudo apt-key add -
apt-get update
apt-get install cassandra openjdk-7-jdk libjna-java libjemalloc1
In /etc/cassandra/cassandra-env.sh, change this line (near the end) to point to localhost:
Then restart Cassandra with service cassandra restart. The command nodetool status should return information and show your node (and the other nodes) as being up. Example output:
root@xenon:~# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns    Host ID                               Rack
UN  127.0.0.1  336.9 KB  256     100.0%  1d4b5052-63db-428b-8a62-0c8b15fdae10  rack1
Now you can start playing with Cassandra using the cqlsh CLI interface. See the Cassandra 2.0 and CQL 3.1 docs for more Cassandra background.
- Check out the current round-trip server code and familiarize yourself with node.js
- Read up on and play with Cassandra. You can also play with the Rashomon storage service prototype as a relatively simple example of node + cassandra.
- Read up on eventually consistent systems and idempotence. Some starting points (feel free to edit):
- Eventually Consistent
- Building on Quicksand
- Spanner as an example of a different trade-off with a good use of logical time that tracks GPS time
- IRC: we hang out in the IRC channel #mediawiki-parsoid.
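To make the idempotence point from the reading list concrete: in an eventually consistent system a client may retry or re-deliver a report, so the write path should tolerate duplicates and reordering. The following minimal sketch assumes that a result is fully identified by its page and commit and that a given page and commit always produce the same outcome. `taskKey`, `applyReports`, `snapshot` and `countNaively` are hypothetical names, not part of the round-trip server.

```javascript
// Toy illustration of idempotent result reporting. Hypothetical names.

// Assumption: (page, commit) uniquely identifies a test result.
function taskKey(report) {
  return report.page + '@' + report.commit;
}

// Idempotent write path: each report overwrites the entry for its key,
// so replaying or reordering reports leads to the same final state.
function applyReports(reports) {
  const store = new Map();
  for (const report of reports) {
    store.set(taskKey(report), report.outcome);
  }
  return store;
}

// Deterministic serialization of the store, for comparing final states.
function snapshot(store) {
  const entries = Array.from(store.entries());
  entries.sort((a, b) => (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0));
  return JSON.stringify(entries);
}

// Non-idempotent counter-example: every delivery is counted,
// so a retried report inflates the total.
function countNaively(reports) {
  let total = 0;
  for (let i = 0; i < reports.length; i++) {
    total += 1;
  }
  return total;
}
```

With this design, delivering the same two reports once in order, or four times in a shuffled order, yields the same `snapshot`, while `countNaively` drifts with every retry.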
- ↑ Your root password might be the one entered during installation. If that does not work either, this should work on Debian/Ubuntu to get a root cli:
sudo mysql --defaults-file=/etc/mysql/debian.cnf
- ↑ While in the command prompt for the mysql client program, you must enter this first:
create user testreduce