RESTBase/Table storage backend options

This page was created in November 2014 to collect details on possible backends for the RESTBase table storage layer. As a result of the evaluation, Cassandra was deployed to the Wikimedia infrastructure. See wikitech:Cassandra for more information about Wikimedia's current deployment, and see Phab:T76157 for discussion about the evaluation.

Exclusion criteria

These criteria need to be satisfied by all candidates:

  • No single point of failure
  • Horizontal read scaling: can add nodes at any time; can increase the replication factor (see the sketch after this list)
  • Horizontal write scaling: can add nodes at any time
  • Has been hardened by production use at scale
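
As an illustration of the read-scaling criterion, this is roughly how the replication factor is raised for a keyspace in Cassandra, the backend that was eventually deployed. The snippet is a minimal sketch using the DataStax Python driver; the keyspace and datacenter names are placeholders, not anything from the actual deployment.

  # Sketch: raising the replication factor of a hypothetical keyspace so that
  # reads can be spread across more replicas (DataStax Python driver).
  from cassandra.cluster import Cluster

  cluster = Cluster(['127.0.0.1'])   # any reachable node works as a contact point
  session = cluster.connect()

  # 'restbase_example' and 'dc1' are made-up names for this sketch.
  session.execute("""
      ALTER KEYSPACE restbase_example
      WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
  """)
  cluster.shutdown()

After raising the factor, existing data still has to be streamed to the new replicas, for example with nodetool repair.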

Candidates

The candidates compared below are Cassandra, CouchBase, Riak, HBase and HyperDex.

Opinion
Pros
  • Cassandra (Gabriel): Mature, mainstream option with good support for large data sets; used in the first backend implementation
  • CouchBase (Gabriel): Strong read performance on in-memory data sets
  • Riak (Gabriel): Causal contexts more thorough than pure timestamps
  • HyperDex (Gabriel): Interesting indexing & consistency concepts

Cons
  • Cassandra (Gabriel): Timestamp-based conflict resolution not as thorough as Spanner or Riak
  • CouchBase (Gabriel): Makes compromises on durability, read scaling through replication, and support for large data sets
  • Riak (Gabriel): Cross-DC replication only in the enterprise version; no rack awareness
  • HBase (Gabriel): Complex operations, relatively weak performance
  • HyperDex (Gabriel): Has not really seen production use; fairly complex; no read scaling through replication

Champions for practical evaluation (please sign)
Legal
License
  • Cassandra: Apache, no CLA
  • CouchBase: Apache, requires CLA; binary distributions under a more restrictive license
  • Riak: Apache
  • HBase: Apache
  • HyperDex: BSD

Community
  • Cassandra: Datastax is the largest contributor, but there is a working open source community
  • HyperDex: 19 contributors; academic / open source

Technical
Architecture
  • Cassandra: Symmetric DHT
  • CouchBase: Sharding with a master per shard; automatic shard migration driven by a distributed cluster-manager state machine
  • Riak: Symmetric DHT
  • HBase: HDFS-based storage nodes; coordinator nodes with master/slave fail-over
  • HyperDex: Innovative hyperspace hashing scheme allows efficient queries on all attributes; a replicated state machine on dedicated coordinator nodes handles load balancing and assignment of data to nodes

CAP type
  • Cassandra: AP by default, CP per request
  • CouchBase: CP
  • Riak: AP by default
  • HBase: CP
  • HyperDex: CP

Storage backends / persistence
  • Cassandra: Memtables backed by immutable log-structured merge trees (SSTables); persistent journal for atomic batch operations; writes go to all replicas in parallel
  • CouchBase: Memory-based, with writes going to a single master per bucket; asynchronous persistence; chance of data loss if the master goes down before the write reaches disk
  • Riak: Bitcask, LevelDB (recommended for large data sets) or Memory; writes go to all replicas in parallel
  • HBase: HDFS
  • HyperDex: 'HyperDisk' log-structured storage with sequential scan; unclear whether there is compaction

Implementation
  • Cassandra: Java
  • CouchBase: Erlang, C++, C
  • Riak: Erlang
  • HBase: Java
  • HyperDex: C++

Primary keys
  • Cassandra: Composite hash key to distribute load around the cluster, plus hierarchical range keys
  • CouchBase: String hash key
  • Riak: String hash key
  • HBase: Hash & range keys
  • HyperDex: Range keys

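To make the Cassandra key model concrete, here is a minimal sketch of a table with a composite (hashed) partition key and hierarchical range (clustering) keys, using the DataStax Python driver. The keyspace, table and column names are invented for illustration and are not the actual RESTBase schema.

  # Sketch: composite partition key + clustering columns in Cassandra.
  from cassandra.cluster import Cluster

  cluster = Cluster(['127.0.0.1'])
  session = cluster.connect('restbase_example')   # hypothetical keyspace

  session.execute("""
      CREATE TABLE IF NOT EXISTS html_by_rev (
          domain text,        -- partition key component: hashed across the ring
          title  text,        -- partition key component
          rev    int,         -- clustering column: ordered within a partition
          tid    timeuuid,    -- clustering column: ordered within a revision
          value  blob,
          PRIMARY KEY ((domain, title), rev, tid)
      )
  """)

  # Range reads follow the clustering order within one partition:
  rows = session.execute(
      "SELECT rev, tid FROM html_by_rev "
      "WHERE domain = %s AND title = %s AND rev >= %s",
      ('en.wikipedia.org', 'Foo', 100))
  cluster.shutdown()
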
Schemas
  • Cassandra: Enforced schemas with fairly rich data types; column storage makes adding / removing columns cheap
  • CouchBase: Schemaless JSON objects
  • Riak: Schemaless JSON objects
  • HBase: Enforced
  • HyperDex: Enforced

Indexing support
  • Cassandra: Per-node secondary hash indexing (limited use cases); Apache Solr integration in the enterprise version
  • CouchBase: Per-node secondary hash and range indexing (limited use cases); Elasticsearch integration
  • Riak: Per-node secondary indexes (limited use cases); Apache Solr integration
  • HyperDex: Rich range query support on arbitrary combinations of attributes

Compression
  • Cassandra: deflate, lz4, snappy; configurable block size, so repetition between revisions can be exploited with the right data layout
  • CouchBase: snappy, per document
  • Riak: snappy with the LevelDB backend
  • HBase: deflate, lzo, lz4, snappy
  • HyperDex: snappy

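The Cassandra entry above refers to per-table compression parameters. A sketch of how the codec and block (chunk) size might be tuned on the hypothetical table from the earlier example, using the 2.x-era option names:

  # Sketch: per-table compression settings; a larger chunk size lets the
  # compressor exploit repetition between adjacent revisions on disk.
  from cassandra.cluster import Cluster

  cluster = Cluster(['127.0.0.1'])
  session = cluster.connect('restbase_example')   # hypothetical keyspace

  session.execute("""
      ALTER TABLE html_by_rev
      WITH compression = {'sstable_compression': 'DeflateCompressor',
                          'chunk_length_kb': 256}
  """)
  cluster.shutdown()
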
TTL / expiry
  • Cassandra: yes, per attribute
  • CouchBase: yes
  • Riak: likely
  • HyperDex: no

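For Cassandra, "per attribute" means the TTL attaches to the individual columns written by a statement rather than to the whole row. A sketch against the hypothetical table from above:

  # Sketch: the TTL applies to the columns written by this statement only;
  # other columns of the same row (or other statements) can use different TTLs.
  from cassandra.cluster import Cluster

  cluster = Cluster(['127.0.0.1'])
  session = cluster.connect('restbase_example')   # hypothetical keyspace

  session.execute(
      "INSERT INTO html_by_rev (domain, title, rev, tid, value) "
      "VALUES (%s, %s, %s, now(), %s) USING TTL 86400",   # expires after a day
      ('en.wikipedia.org', 'Foo', 100, b'<html>...</html>'))
  cluster.shutdown()
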
Causality / conflict resolution
  • Cassandra: Microsecond wall-clock timestamps, last write wins; relies on NTP for time synchronization
  • Riak: Causal contexts; sibling version resolution
  • HyperDex: Value-dependent chaining

Single-item CAS
  • Cassandra: Supported on any table using built-in Paxos
  • CouchBase: Supported
  • Riak: Supported when configured for the bucket
  • HyperDex: Guarantees linearizability for operations on keys

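A sketch of what single-item CAS looks like in Cassandra: a conditional write backed by the built-in Paxos implementation (a "lightweight transaction"). The renames table is invented purely for illustration.

  # Sketch: conditional insert; the write only succeeds if the row does not
  # exist yet, coordinated through Cassandra's built-in Paxos.
  from cassandra.cluster import Cluster

  cluster = Cluster(['127.0.0.1'])
  session = cluster.connect('restbase_example')   # hypothetical keyspace

  session.execute(
      "CREATE TABLE IF NOT EXISTS renames "
      "(old_title text PRIMARY KEY, new_title text)")

  result = session.execute(
      "INSERT INTO renames (old_title, new_title) VALUES (%s, %s) IF NOT EXISTS",
      ('Foo', 'Bar'))
  row = result.one()
  print('applied:', row[0])   # first result column reports whether the CAS won
  cluster.shutdown()
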
Multi-item transactions
  • Cassandra: Only manually via CAS
  • CouchBase: Only manually via CAS
  • Riak: Only manually via CAS
  • HBase: Only manually via CAS
  • HyperDex: In the commercial version

Multi-item journaled batches
  • Cassandra: Yes
  • CouchBase: No?

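"Journaled batches" in Cassandra's case refers to logged batches: all mutations are first written to a batch log, so they are eventually applied together (atomic, though not isolated). A sketch reusing the hypothetical html_by_rev table and an equally hypothetical latest_rev table:

  # Sketch: a logged (journaled) batch updating two tables together.
  from cassandra.cluster import Cluster
  from cassandra.query import BatchStatement, SimpleStatement

  cluster = Cluster(['127.0.0.1'])
  session = cluster.connect('restbase_example')   # hypothetical keyspace

  session.execute(
      "CREATE TABLE IF NOT EXISTS latest_rev "
      "(domain text, title text, rev int, PRIMARY KEY ((domain, title)))")

  batch = BatchStatement()   # LOGGED is the default batch type
  batch.add(SimpleStatement(
      "INSERT INTO html_by_rev (domain, title, rev, tid, value) "
      "VALUES (%s, %s, %s, now(), %s)"),
      ('en.wikipedia.org', 'Foo', 101, b'<html>new</html>'))
  batch.add(SimpleStatement(
      "INSERT INTO latest_rev (domain, title, rev) VALUES (%s, %s, %s)"),
      ('en.wikipedia.org', 'Foo', 101))
  session.execute(batch)
  cluster.shutdown()
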
Per-request consistency trade-offs
  • Cassandra: one, localQuorum, all, CAS
  • CouchBase: No
  • Riak: No

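The Cassandra levels listed above are chosen per statement. A sketch with the DataStax Python driver, again using the hypothetical table from the schema example:

  # Sketch: per-request consistency levels; a cheap read at ONE versus a
  # write that waits for a quorum of replicas in the local datacenter.
  from cassandra import ConsistencyLevel
  from cassandra.cluster import Cluster
  from cassandra.query import SimpleStatement

  cluster = Cluster(['127.0.0.1'])
  session = cluster.connect('restbase_example')   # hypothetical keyspace

  read = SimpleStatement(
      "SELECT value FROM html_by_rev WHERE domain = %s AND title = %s AND rev = %s",
      consistency_level=ConsistencyLevel.ONE)
  write = SimpleStatement(
      "INSERT INTO html_by_rev (domain, title, rev, tid, value) "
      "VALUES (%s, %s, %s, now(), %s)",
      consistency_level=ConsistencyLevel.LOCAL_QUORUM)

  session.execute(read, ('en.wikipedia.org', 'Foo', 100))
  session.execute(write, ('en.wikipedia.org', 'Foo', 102, b'<html>v102</html>'))
  cluster.shutdown()
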
Balancing / bootstrap load distribution
  • Cassandra: DHT + virtual nodes (default 256 per physical node, configurable per node to account for capacity differences); a new node streams data evenly from most other nodes and takes up key space proportional to its number of virtual nodes relative to the entire cluster
  • CouchBase: Virtual buckets (typically 1024 per cluster); semi-manual rebalancing by moving entire vBuckets, documented as a fairly expensive operation
  • Riak: DHT + virtual nodes
  • HBase: Coordinator nodes manage balancing
  • HyperDex: Coordinator nodes manage balancing

Rack-aware replica placement
  • Cassandra: Yes
  • CouchBase: In the enterprise edition or a custom build
  • Riak: No
  • HBase: Yes
  • HyperDex: No

Multi-cluster / DC replication
  • Cassandra: Mature and widely used in large deployments; TLS supported; choice of synchronous vs. asynchronous per query
  • CouchBase: In the enterprise edition or a custom build
  • Riak: Only in the enterprise version; possibly a custom build from source?
  • HyperDex: No

Read hot spot scaling
  • Cassandra: Increase the replication factor; reads are distributed across replicas
  • CouchBase: A single node is responsible for all reads for a vBucket
  • Riak: Increase the replication factor; reads are distributed across replicas
  • HBase: A single region server is responsible for all reads to a region

Performance characterization
  • Cassandra: Very high write throughput with low latency impact on reads; good average read performance; somewhat limited per-instance scaling due to JVM GC limits
  • CouchBase: Performs well on small in-memory data sets; seems to perform worse than competitors when configured for more frequent writes for durability; support for large data sets (>> RAM) is questionable
  • Riak: Most benchmarks show worse performance than competitors like Cassandra
  • HyperDex: Claims high performance [1]

Operations
Large-scale production use
  • Cassandra: Yes, since 2010
  • CouchBase: Yes
  • Riak: Yes
  • HBase: Yes, since at least 2010
  • HyperDex: No

Support / community
  • Cassandra: Fairly active community; Datastax offering help
  • CouchBase: Fairly small and young community; commercial fork

Setup complexity
  • Cassandra: Low (symmetric DHT, YAML configuration)
  • CouchBase: Low to medium: package-based install (community edition), REST configuration
  • Riak: Low
  • HBase: High (HDFS, designated coordinator nodes)
  • HyperDex: High: designated coordinator nodes, still much development churn

Packaging
  • Cassandra: Good Apache Debian packages; apt-get install gets you a fully working one-node cluster
  • CouchBase: Ubuntu packages under a restrictive license / with reduced functionality
  • Riak: Official packages with a restrictive license
  • HBase: Comes with Debian unstable
  • HyperDex: Debian packages

Monitoring
  • Cassandra: Fairly rich data via JMX; pluggable metrics reporter support (including Graphite and Ganglia)
  • CouchBase: Graphite collector
  • Riak: Graphite collector
  • HBase: JMX, Graphite collector

Backup & restore
  • Cassandra: Immutable SSTables can be snapshotted at a point in time (nodetool snapshot) and backed up as files (see documentation)
  • CouchBase: mbbackup tool, FS-level backups
  • Riak: FS-level backups
  • HBase: Recommended to use a second cluster
  • HyperDex: FS-level backups

Automatic rebalancing after node failure
  • Cassandra: Yes; fine-grained and implicit in the DHT
  • CouchBase: Manually triggered; moves entire vBuckets, documented as expensive
  • Riak: Yes; fine-grained and implicit in the DHT
  • HBase: Handled by the balancer background process; moves entire regions