User:GWicke/Notes/Storage/Cassandra testing

Testing Cassandra as a backend for the Rashomon storage service. See also User:GWicke/Notes/Storage, Requests for comment/Storage service.

Hosts:

cerium 10.64.16.147
praseodymium 10.64.16.149
xenon 10.64.0.200

Cassandra docs (we are testing ~~2.0.1~~ 2.0.2 (latest changes)):

Setup

Cassandra node setup

apt-get install cassandra openjdk-7-jdk libjna-java libjemalloc1

Set up /etc/cassandra/cassandra.yaml according to the docs. Main things to change:

listen_address, rpc_address: set to external IP of this node
seed_provider / seeds: set to list of other cluster node IPs: "10.64.16.147,10.64.16.149,10.64.0.200"

(Re)start cassandra: service cassandra restart. The command

nodetool status

should return information and show your node (and the other nodes) as being up. Example output:

root@xenon:~# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns   Host ID                               Rack
UN  10.64.16.149  91.4 KB    256     33.4%  c72025f6-8ad8-4ab6-b989-1ce2f4b8f665  rack1
UN  10.64.0.200   30.94 KB   256     32.8%  48821b0f-f378-41a7-90b1-b5cfb358addb  rack1
UN  10.64.16.147  58.75 KB   256     33.8%  a9b2ac1c-c09b-4f46-95f9-4cb639bb9eca  rack1

Rashomon setup

The cassandra bindings used need node 0.10. For Ubuntu precise LTS, we need to do some extra work [1]:

apt-get install python-software-properties python g++ make
add-apt-repository ppa:chris-lea/node.js
apt-get update
apt-get install build-essential nodejs # this ubuntu package also includes npm and nodejs-dev

On Debian unstable, we'd just do apt-get install nodejs npm and get the latest node including security fixes rather than the old Ubuntu PPA package.

Now onwards to the actual rashomon setup:

# temporary proxy setup for testing
npm config set https-proxy http://brewster.wikimedia.org:8080
npm config set proxy http://brewster.wikimedia.org:8080
cd /var/lib
https_proxy=brewster.wikimedia.org:8080 git clone https://github.com/gwicke/rashomon.git
cd rashomon
# will package node_modules later
npm install
cp contrib/upstart/rashomon.conf /etc/init/rashomon.conf
adduser --system --no-create-home rashomon
service rashomon start

Create the revision tables (on one node only):

cqlsh < cassandra-revisions.cql

Cassandra issues

With the default settings and without working jna (see install instructions above), cassandra on one node ran out of heap space during a large compaction. The resulting state was inconsistent enough that it would not restart cleanly. The quick fix was wiping the data on that replica and re-joining the cluster.
- This was caused by multi-threaded compaction, which is scheduled for removal in 2.1. Moving back to the default setting and reducing the number of concurrent writes a bit eliminated this problem. Tweaking the GC settings (see below) also helped.
Stopping and restarting the cassandra service with service cassandra stop did not work. Faidon tracked this down to a missing '$' in the init script: [2]. Fixed in 2.0.2.
Compaction was fairly slow for a write benchmark. Changed compaction_throughput_mb_per_sec: 16 to compaction_throughput_mb_per_sec: 48 in cassandra.yaml. Compaction is also niced and single-threaded, so during high load it will use less disk bandwidth than this upper limit. See [3] for background.
Not relevant for our current use case, but good to double-check if we wanted to start using CAS: bugs in 2.0.0 Paxos implementation. The relevant bugs [4][5][6] seem to be fixed in 2.0.1.

Tests

Dump import, 600 writers

cpu idle on cassandra test cluster during write test. Hyperthreading enabled, so physical cores really all used. Test starts at spike (truncated table and re-started) and ends when the load drops. Anti-entropy and compaction activity after test ends.

Compaction activity after the write benchmark. The LSM is merged in several phases. This reclaims a lot of space and improves read efficiency.

Six writer processes working on one of these dumps ([7][8][9][10][11][12]) with up to 100 concurrent requests each. Rashomon uses write consistency level quorum for these writes, so 2 nodes out of three need to ack. The Cassandra commit log is placed on an SSD, data files on rotating metal RAID1.

6537159 revisions in 42130s (155/s); total size 85081864773
6375223 revisions in 42040s (151/s); total size 84317436542
6679729 revisions in 39042s (171/s); total size 87759806169
5666555 revisions in 32704s (173/s); total size 79429599007
5407901 revisions in 32832s (164/s); total size 72518858048
6375236 revisions in 37758s (168/s); total size 84318152281
==============================================================
37041803 revisions total, 493425716820 total bytes (459.5G)
879/s, 11.1MB/s
du -sS on revisions table, right after test: 
     162 / 153 / 120 G (avg 31.5% of raw text)
du -sS on revisions table, after some compaction activity: 
     85G (18.4% of raw text)
du -sS on revisions table, after full compaction: 
     73.7G (16% of raw text)

clients, rashomon and cassandra on the same machine
clients and cassandra CPU-bound, rashomon using little CPU
basically no IO wait time despite data on spinning disks. Compaction too throttled for heavy writes, but low wait even with a higher max compaction bandwidth cap. In a pure write workload all reads and writes are sequential. Cassandra also uses posix_fadvise for read-ahead and page cache optimization.

Write test 2: Dump import, 300 writers

Basically the same setup, except:

clients on separate machine
Cassandra 2.0.2
additional revision index maintained, which allows revision retrieval by oldid
better error handling in Rashomon
client connects to random Rashomon, and Rashomon uses set of Cassandra backends (instead of just localhost)

With default heap size, one node ran out of he heap about 2/3 through the test. Eventually a second node suffered the same, which let all remaining saves fail as there was no more quorum.

A similar failure happened before in preliminary testing. Setting the heap size to 1/2 of the RAM (7G instead of 4G on these machines) fixed this in the follow-up test, the same way it did before the first write test run.

Write test 3: Dump import, 300 writers

Same setup as in Write test 2, except heap limit increased from 4G default to 7G.

Fairly constant write throughput over the duration of the test. Some dumps finished before others, so there is some gradual ramping down at the end.

A good amount of IO wait during heavy writing, with spikes hitting 20% or so for a few seconds. This seems to correlate with client timeouts when pushed to the limit (see retry count).

With higher compaction throughput CPU time and IO are mostly balanced. Still CPU-bound most of the time.

5407950 revisions in 46007s (117.5/s); Total size: 72521960786; 36847 retries
5666609 revisions in 52532s (107.8/s); Total size: 79431029856; 41791 retries
6375283 revisions in 67059s (95.0/s); Total size: 84318276123; 38453 retries
6537194 revisions in 64481s (101.3/s); Total size: 85084097888; 41694 retries
6679780 revisions in 60408s (110.5/s); Total size: 87759962590; 43422 retries
6008332 revisions in 50715s (118.4/s); Total size: 64537467290; 39078 retries
=========================================
648/s, 7.5MB/s

0.65% requests needed to be retried after timeout
441.12G total

After test:
85G on-disk for revisions table (19.3%)
2.2G on-disk for idx_revisions_by_revid table

With improved error reporting and -handling in the client these numbers should be more reliable than the first test. The secondary index adds another query for each eventually consistent batch action, which slows down the number of revision inserts per second slightly. The higher compaction throughput also performs more of the compaction work upfront during the test, and results in significantly smaller disk usage right after the test.

Write test 4-6: Heap size vs. timeouts, GC time and out-of-heap errors

I repeated the write tests a few more times and got one more out-of heap error on a node. Increasing the heap to 8G had the effect of increasing the number of timeouts to about 90k per writer. The Cassandra documentation mentions 8G as the upper limit for reasonable GC pause times, so it seems that the bulk of those timouts are related to GC.

In a follow-up test, I reset the heap to the default value (4.3G on these machines) and lowered memtable_total_space_in_mb from the 1/3 heap default to 1200M to avoid out-of-heap errors under heavy write load despite a relatively small heap. Cassandra will flush the largest memtable when this much memory is used.

memtable_total_space_in_mb = 1200M: out of heap space
memtable_total_space_in_mb = 800M: better, but still many timeouts
default memtable_total_space_in_mb, memtable_flush_writers = 4 and memtable_flush_queue_size = 10: fast, but not tested on full run
memtable_total_space_in_mb = 900, memtable_flush_writers = 4 and memtable_flush_queue_size = 10: out of heap space
memtable_total_space_in_mb = 800m, memtable_flush_writers = 4 and memtable_flush_queue_size = 10: no crash, but slowdown once db gets larger without much IO
memtable_total_space_in_mb = 700m, memtable_flush_writers = 4 and memtable_flush_queue_size = 14: larger flush queue seems to counteract slowdown, flushing more often should reduce heap pressure even further. Slowed down with heavy GC activity, very likely related to compaction and the pure-JS deflate implementation.
reduced concurrent_compactors from the default (number of cores) to 4: ran out of heap
increased heap to 10G, reduced concurrent_compactors to 2, disabled GC threshold so that collection starts not just when the heap is 75% full: Lowering the compactor thresholds is good, but removing the GC threshold actually made it worse. Full collection happens only close to max, which makes it more likely to run out of memory.

Flush memory tables often and quickly to avoid timeouts and memory pressure

use multiple flusher threads: memtable_flush_writers = 4; See [13]
increase the queue length to absorb burst: memtable_flush_queue_size = 14
- can play with queue length for load shedding
- seems to be sensitive to number of tables: more tables -> more likely to hit limit

GC tuning for short pauses and OOM error avoidance despite heavy write load

After a lot of trial and error, the following strategy seems to work well:

during a heavy write test that has been running for a few hours, follow the OU column in jstat -gc <cassandra pid> output. Check memory usage after a major collection. This is the base heap usage.
- in heavy write tests with small memtables (600M) between 1 and 2G
size MAX_HEAP_SIZE ~5x base heap usage for headroom, but not more to preserve ram for page cache
set CMSInitiatingOccupancyFraction to something around 45 (75% default is to close to the max -> OOM with heavy writes). This marks the high point where a full collection is triggered. It should be ~2-3x the size after a major collection. Starting too late makes it hard to really collect all garbage, as CMS is inexact in the name of low pause times and misses some garbage. Starting early keeps the heap to traverse small, which limits pause times.
size HEAP_NEWSIZE to ~2/3 the base heap use
for more thorough young generation collection, increase MaxTenuringThreshold from 1 to something between 1 and 15, for example 10. This means that something in the young generation (HEAP_NEWSIZE) needs to survive 10 minor collections to make it into the old generation.

All these settings are in /etc/cassandra/cassandra-env.sh.

Write test 12 (or so)

Cassandra write test 12. The blue dip is a node being down after running out of memory. Settings are adjusted on all nodes after that and a repair is running in the background despite the high load.

Hyperthreading enabled, so really fully saturated

Settings in /etc/cassandra/cassandra-env.sh:

MAX_HEAP_SIZE = "10G" # can be closer to 6
HEAP_NEWSIZE = "1000M"
# about 3.2G, needs to be adjusted to keep 3.2G point 
# similar when MAX_HEAP_SIZE changes
CMSInitiatingOccupancyFraction = 30
MaxTenuringThreshold = 10

Results:

5407948 revisions in 29838s (181.24/s); 72521789775 bytes; 263 retries
5666608 revisions in 31051s (182.49/s); 79431082810 bytes; 295 retries
6375272 revisions in 33578s (189.86/s); 84318196084 bytes; 281 retries
6537196 revisions in 34220s (191.03/s); 85084340595 bytes; 295 retries
6679784 revisions in 34597s (193.07/s); 87759936033 bytes; 284 retries
6008332 revisions in 30350s (197.96/s); 64537469847 bytes; 288 retries
============================
1133 revisions/s

This also includes a successful concurrent repair (nodetool repair) at high load after changing settings on all nodes not too long after the test start
The retries were all triggered by 'Socket reset' on the test client, there don't seem to be any server timeouts any more. Need to investigate the reason for these.

Write test 13-15

MAX_HEAP_SIZE="7G" (reduced from 10)
HEAP_NEWSIZE="1000M"

Overwrote the db from test 12 a few times. Similar results as in test 12, no issues. Compaction was a bit slow, possibly related to a compaction order bug fixed just after the 2.0.3 release. It eventually succeeded with a manual nodetool compact.

Write test 16

Updated to debs built from git cf83c81d8, and waited for a full compaction before restarting the writes. Mainly interested in compaction behavior.

Random reads

Goal: simulate read-heavy workload for revisions (similar to ExternalStore), and verify writes from the previous test.

Access random title, but most likely the newest revision
verify md5

The random read workload will be much IO-heavier. There should be noticeable differences between data on SSD vs. rotating disks.

Mix of a few writes and random reads

Perform about 50 revision writes / second, and see how many concurrent reads can still be sustained at acceptable latency. Closest approximation to actual production workload. Mainly looking for impact of writes on read latency.