RESTBase/Table storage backend options

This page was created in November 2014 to collect details on possible backends for the RESTBase table storage layer. As a result of the evaluation, Cassandra was deployed to the Wikimedia infrastructure. See wikitech:Cassandra for more information about Wikimedia's current deployment, and see Phab:T76157 for discussion about the evaluation.

Exclusion criteria

These criteria need to be satisfied by all candidates:

  • No single point of failure
  • Horizontal read scaling: can add nodes at any time; can increase the replication factor (see the sketch after this list)
  • Horizontal write scaling: can add nodes at any time
  • Has been hardened by production use at scale
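
As an illustration of the read-scaling criterion, this is roughly how the replication factor is raised for a keyspace in Cassandra, the backend that was eventually deployed. The snippet is a minimal sketch using the DataStax Python driver; the keyspace and datacenter names are placeholders, not anything from the actual deployment.

  # Sketch: raising the replication factor of a hypothetical keyspace so that
  # reads can be spread across more replicas (DataStax Python driver).
  from cassandra.cluster import Cluster

  cluster = Cluster(['127.0.0.1'])   # any reachable node works as a contact point
  session = cluster.connect()

  # 'restbase_example' and 'dc1' are made-up names for this sketch.
  session.execute("""
      ALTER KEYSPACE restbase_example
      WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
  """)
  cluster.shutdown()

After raising the factor, existing data still has to be streamed to the new replicas, for example with nodetool repair.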

Candidates

The candidates compared below are Cassandra, CouchBase, Riak, HBase and HyperDex.

Opinion
Pros
  • Cassandra (Gabriel): Mature, mainstream option with good support for large data sets; used in the first backend implementation
  • CouchBase (Gabriel): Strong read performance on in-memory data sets
  • Riak (Gabriel): Causal contexts more thorough than pure timestamps
  • HyperDex (Gabriel): Interesting indexing & consistency concepts

Cons
  • Cassandra (Gabriel): Timestamp-based conflict resolution not as thorough as Spanner or Riak
  • CouchBase (Gabriel): Makes compromises on durability, read scaling through replication, and support for large data sets
  • Riak (Gabriel): Cross-DC replication only in the enterprise version; no rack awareness
  • HBase (Gabriel): Complex operations, relatively weak performance
  • HyperDex (Gabriel): Has not really seen production use; fairly complex; no read scaling through replication

Champions for practical evaluation (please sign)
Legal
License
  • Cassandra: Apache, no CLA
  • CouchBase: Apache, requires CLA; binary distributions under a more restrictive license
  • Riak: Apache
  • HBase: Apache
  • HyperDex: BSD

Community
  • Cassandra: Datastax is the largest contributor, but there is a working open source community
  • HyperDex: 19 contributors; academic / open source

Technical
Architecture
  • Cassandra: Symmetric DHT
  • CouchBase: Sharding with a master per shard; automatic shard migration driven by a distributed cluster-manager state machine
  • Riak: Symmetric DHT
  • HBase: HDFS-based storage nodes; coordinator nodes with master/slave fail-over
  • HyperDex: Innovative hyperspace hashing scheme allows efficient queries on all attributes; a replicated state machine on dedicated coordinator nodes handles load balancing and assignment of data to nodes

CAP type
  • Cassandra: AP by default, CP per request
  • CouchBase: CP
  • Riak: AP by default
  • HBase: CP
  • HyperDex: CP

Storage backends / persistence
  • Cassandra: Memtables backed by immutable log-structured merge trees (SSTables); persistent journal for atomic batch operations; writes go to all replicas in parallel
  • CouchBase: Memory-based, with writes going to a single master per bucket; asynchronous persistence; chance of data loss if the master goes down before the write reaches disk
  • Riak: Bitcask, LevelDB (recommended for large data sets) or Memory; writes go to all replicas in parallel
  • HBase: HDFS
  • HyperDex: 'HyperDisk' log-structured storage with sequential scan; unclear whether there is compaction

Implementation
  • Cassandra: Java
  • CouchBase: Erlang, C++, C
  • Riak: Erlang
  • HBase: Java
  • HyperDex: C++

Primary keys
  • Cassandra: Composite hash key to distribute load around the cluster, plus hierarchical range keys
  • CouchBase: String hash key
  • Riak: String hash key
  • HBase: Hash & range keys
  • HyperDex: Range keys

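To make the Cassandra key model concrete, here is a minimal sketch of a table with a composite (hashed) partition key and hierarchical range (clustering) keys, using the DataStax Python driver. The keyspace, table and column names are invented for illustration and are not the actual RESTBase schema.

  # Sketch: composite partition key + clustering columns in Cassandra.
  from cassandra.cluster import Cluster

  cluster = Cluster(['127.0.0.1'])
  session = cluster.connect('restbase_example')   # hypothetical keyspace

  session.execute("""
      CREATE TABLE IF NOT EXISTS html_by_rev (
          domain text,        -- partition key component: hashed across the ring
          title  text,        -- partition key component
          rev    int,         -- clustering column: ordered within a partition
          tid    timeuuid,    -- clustering column: ordered within a revision
          value  blob,
          PRIMARY KEY ((domain, title), rev, tid)
      )
  """)

  # Range reads follow the clustering order within one partition:
  rows = session.execute(
      "SELECT rev, tid FROM html_by_rev "
      "WHERE domain = %s AND title = %s AND rev >= %s",
      ('en.wikipedia.org', 'Foo', 100))
  cluster.shutdown()
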
Schemas
  • Cassandra: Enforced schemas with fairly rich data types; column storage makes adding / removing columns cheap
  • CouchBase: Schemaless JSON objects
  • Riak: Schemaless JSON objects
  • HBase: Enforced
  • HyperDex: Enforced

Indexing support
  • Cassandra: Per-node secondary hash indexing (limited use cases); Apache Solr integration in the enterprise version
  • CouchBase: Per-node secondary hash and range indexing (limited use cases); Elasticsearch integration
  • Riak: Per-node secondary indexes (limited use cases); Apache Solr integration
  • HyperDex: Rich range query support on arbitrary combinations of attributes

Compression
  • Cassandra: deflate, lz4, snappy; configurable block size, so repetition between revisions can be exploited with the right data layout
  • CouchBase: snappy, per document
  • Riak: snappy with the LevelDB backend
  • HBase: deflate, lzo, lz4, snappy
  • HyperDex: snappy

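The Cassandra entry above refers to per-table compression parameters. A sketch of how the codec and block (chunk) size might be tuned on the hypothetical table from the earlier example, using the 2.x-era option names:

  # Sketch: per-table compression settings; a larger chunk size lets the
  # compressor exploit repetition between adjacent revisions on disk.
  from cassandra.cluster import Cluster

  cluster = Cluster(['127.0.0.1'])
  session = cluster.connect('restbase_example')   # hypothetical keyspace

  session.execute("""
      ALTER TABLE html_by_rev
      WITH compression = {'sstable_compression': 'DeflateCompressor',
                          'chunk_length_kb': 256}
  """)
  cluster.shutdown()
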
TTL / expiry
  • Cassandra: yes, per attribute
  • CouchBase: yes
  • Riak: likely
  • HyperDex: no

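For Cassandra, "per attribute" means the TTL attaches to the individual columns written by a statement rather than to the whole row. A sketch against the hypothetical table from above:

  # Sketch: the TTL applies to the columns written by this statement only;
  # other columns of the same row (or other statements) can use different TTLs.
  from cassandra.cluster import Cluster

  cluster = Cluster(['127.0.0.1'])
  session = cluster.connect('restbase_example')   # hypothetical keyspace

  session.execute(
      "INSERT INTO html_by_rev (domain, title, rev, tid, value) "
      "VALUES (%s, %s, %s, now(), %s) USING TTL 86400",   # expires after a day
      ('en.wikipedia.org', 'Foo', 100, b'<html>...</html>'))
  cluster.shutdown()
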
Causality / conflict resolution
  • Cassandra: Microsecond wall-clock timestamps, last write wins; relies on NTP for time synchronization
  • Riak: Causal contexts; sibling version resolution
  • HyperDex: Value-dependent chaining

Single-item CAS
  • Cassandra: Supported on any table using built-in Paxos
  • CouchBase: Supported
  • Riak: Supported when configured for the bucket
  • HyperDex: Guarantees linearizability for operations on keys

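A sketch of what single-item CAS looks like in Cassandra: a conditional write backed by the built-in Paxos implementation (a "lightweight transaction"). The renames table is invented purely for illustration.

  # Sketch: conditional insert; the write only succeeds if the row does not
  # exist yet, coordinated through Cassandra's built-in Paxos.
  from cassandra.cluster import Cluster

  cluster = Cluster(['127.0.0.1'])
  session = cluster.connect('restbase_example')   # hypothetical keyspace

  session.execute(
      "CREATE TABLE IF NOT EXISTS renames "
      "(old_title text PRIMARY KEY, new_title text)")

  result = session.execute(
      "INSERT INTO renames (old_title, new_title) VALUES (%s, %s) IF NOT EXISTS",
      ('Foo', 'Bar'))
  row = result.one()
  print('applied:', row[0])   # first result column reports whether the CAS won
  cluster.shutdown()
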
Multi-item transactions
  • Cassandra: Only manually via CAS
  • CouchBase: Only manually via CAS
  • Riak: Only manually via CAS
  • HBase: Only manually via CAS
  • HyperDex: In the commercial version

Multi-item journaled batches
  • Cassandra: Yes
  • CouchBase: No?

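"Journaled batches" in Cassandra's case refers to logged batches: all mutations are first written to a batch log, so they are eventually applied together (atomic, though not isolated). A sketch reusing the hypothetical html_by_rev table and an equally hypothetical latest_rev table:

  # Sketch: a logged (journaled) batch updating two tables together.
  from cassandra.cluster import Cluster
  from cassandra.query import BatchStatement, SimpleStatement

  cluster = Cluster(['127.0.0.1'])
  session = cluster.connect('restbase_example')   # hypothetical keyspace

  session.execute(
      "CREATE TABLE IF NOT EXISTS latest_rev "
      "(domain text, title text, rev int, PRIMARY KEY ((domain, title)))")

  batch = BatchStatement()   # LOGGED is the default batch type
  batch.add(SimpleStatement(
      "INSERT INTO html_by_rev (domain, title, rev, tid, value) "
      "VALUES (%s, %s, %s, now(), %s)"),
      ('en.wikipedia.org', 'Foo', 101, b'<html>new</html>'))
  batch.add(SimpleStatement(
      "INSERT INTO latest_rev (domain, title, rev) VALUES (%s, %s, %s)"),
      ('en.wikipedia.org', 'Foo', 101))
  session.execute(batch)
  cluster.shutdown()
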
Per-request consistency trade-offs
  • Cassandra: one, localQuorum, all, CAS
  • CouchBase: No
  • Riak: No

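The Cassandra levels listed above are chosen per statement. A sketch with the DataStax Python driver, again using the hypothetical table from the schema example:

  # Sketch: per-request consistency levels; a cheap read at ONE versus a
  # write that waits for a quorum of replicas in the local datacenter.
  from cassandra import ConsistencyLevel
  from cassandra.cluster import Cluster
  from cassandra.query import SimpleStatement

  cluster = Cluster(['127.0.0.1'])
  session = cluster.connect('restbase_example')   # hypothetical keyspace

  read = SimpleStatement(
      "SELECT value FROM html_by_rev WHERE domain = %s AND title = %s AND rev = %s",
      consistency_level=ConsistencyLevel.ONE)
  write = SimpleStatement(
      "INSERT INTO html_by_rev (domain, title, rev, tid, value) "
      "VALUES (%s, %s, %s, now(), %s)",
      consistency_level=ConsistencyLevel.LOCAL_QUORUM)

  session.execute(read, ('en.wikipedia.org', 'Foo', 100))
  session.execute(write, ('en.wikipedia.org', 'Foo', 102, b'<html>v102</html>'))
  cluster.shutdown()
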
Balancing / bootstrap load distribution
  • Cassandra: DHT + virtual nodes (default 256 per physical node, configurable per node to account for capacity differences); a new node streams data evenly from most other nodes and takes up key space proportional to its number of virtual nodes relative to the entire cluster
  • CouchBase: Virtual buckets (typically 1024 per cluster); semi-manual rebalancing by moving entire vBuckets, documented as a fairly expensive operation
  • Riak: DHT + virtual nodes
  • HBase: Coordinator nodes manage balancing
  • HyperDex: Coordinator nodes manage balancing

Rack-aware replica placement
  • Cassandra: Yes
  • CouchBase: In the enterprise edition or a custom build
  • Riak: No
  • HBase: Yes
  • HyperDex: No

Multi-cluster / DC replication
  • Cassandra: Mature and widely used in large deployments; TLS supported; choice of synchronous vs. asynchronous per query
  • CouchBase: In the enterprise edition or a custom build
  • Riak: Only in the enterprise version; possibly a custom build from source?
  • HyperDex: No

Read hot spot scaling
  • Cassandra: Increase the replication factor; reads are distributed across replicas
  • CouchBase: A single node is responsible for all reads for a vBucket
  • Riak: Increase the replication factor; reads are distributed across replicas
  • HBase: A single region server is responsible for all reads to a region

Performance characterization
  • Cassandra: Very high write throughput with low latency impact on reads; good average read performance; somewhat limited per-instance scaling due to JVM GC limits
  • CouchBase: Performs well on small in-memory data sets; seems to perform worse than competitors when configured for more frequent writes for durability; support for large data sets (>> RAM) is questionable
  • Riak: Most benchmarks show worse performance than competitors like Cassandra
  • HyperDex: Claims high performance [1]

Operations
Large-scale production use
  • Cassandra: Yes, since 2010
  • CouchBase: Yes
  • Riak: Yes
  • HBase: Yes, since at least 2010
  • HyperDex: No

Support / community
  • Cassandra: Fairly active community; Datastax offering help
  • CouchBase: Fairly small and young community; commercial fork

Setup complexity
  • Cassandra: Low (symmetric DHT, YAML configuration)
  • CouchBase: Low to medium: package-based install (community edition), REST configuration
  • Riak: Low
  • HBase: High (HDFS, designated coordinator nodes)
  • HyperDex: High: designated coordinator nodes, still much development churn

Packaging
  • Cassandra: Good Apache Debian packages; apt-get install gets you a fully working one-node cluster
  • CouchBase: Ubuntu packages under a restrictive license / with reduced functionality
  • Riak: Official packages with a restrictive license
  • HBase: Comes with Debian unstable
  • HyperDex: Debian packages

Monitoring
  • Cassandra: Fairly rich data via JMX; pluggable metrics reporter support (including Graphite and Ganglia)
  • CouchBase: Graphite collector
  • Riak: Graphite collector
  • HBase: JMX, Graphite collector

Backup & restore
  • Cassandra: Immutable SSTables can be snapshotted at a point in time (nodetool snapshot) and backed up as files (see documentation)
  • CouchBase: mbbackup tool, FS-level backups
  • Riak: FS-level backups
  • HBase: Recommended to use a second cluster
  • HyperDex: FS-level backups

Automatic rebalancing after node failure
  • Cassandra: Yes; fine-grained and implicit in the DHT
  • CouchBase: Manually triggered; moves entire vBuckets, documented as expensive
  • Riak: Yes; fine-grained and implicit in the DHT
  • HBase: Handled by the balancer background process; moves entire regions