RESTBase/Table storage backend options
This page was created in November 2014 to collect details on possible backends for the RESTBase table storage layer. As a result of the evaluation, Cassandra was deployed to the Wikimedia infrastructure. See wikitech:Cassandra for more information about Wikimedia's current deployment, and see Phab:T76157 for discussion about the evaluation.
Exclusion criteria
These criteria need to be satisfied by candidates:
- No single point of failure
- Horizontal read scaling: can add nodes at any time; can increase replication factor (see the sketch after this list)
- Horizontal write scaling: can add nodes at any time
- Has been hardened by production use at scale
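As a purely illustrative sketch of the replication-factor criterion, the snippet below uses the DataStax Python driver against Cassandra, the backend that was ultimately deployed; the keyspace name restbase_test and datacenter name dc1 are hypothetical, and the datacenter name must match what the cluster's snitch reports.

```python
# Minimal sketch (hypothetical names): create a Cassandra keyspace with
# replication factor 3, then raise it to 5 later to scale reads.
# Assumes the DataStax Python driver ('pip install cassandra-driver') and a
# cluster reachable on localhost; 'dc1' must match the datacenter name
# reported by the cluster's snitch.
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS restbase_test
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
""")

# Raising the replication factor is an online schema change; run
# 'nodetool repair' afterwards so the new replicas receive existing data.
session.execute("""
    ALTER KEYSPACE restbase_test
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 5}
""")

cluster.shutdown()
```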
Candidates
| | Cassandra | CouchBase | Riak | HBase | HyperDex |
|---|---|---|---|---|---|
Opinion | | | | | |
Pros | | | | | |
Cons | | | | | |
Champions for practical evaluation (please sign) | | | | | |
Legal | |||||
License | Apache, no CLA | Apache, requires CLA; binary distributions under more restrictive license | Apache | Apache | BSD |
Community | Datastax largest contributor, but working open source community | | | | 19 contributors; academic / open source |
Technical | |||||
Architecture | Symmetric DHT | Sharding with a master per shard; automatic shard migration driven by a distributed cluster manager state machine | Symmetric DHT | HDFS-based storage nodes; coordinator nodes with master/slave fail-over | Innovative hyperspace hashing scheme allows efficient queries on all attributes; a replicated state machine on dedicated coordinator nodes handles load balancing and assignment of data to storage nodes |
CAP type | AP default, CP per-request | CP | AP default | CP | CP |
Storage backends / persistence | Memtable backed by immutable log-structured merge trees (SSTables); persistent journal for atomic batch operations; writes to all replicas in parallel | Memory-based with writes to a single master per bucket; asynchronous persistence; chance of data loss if the master goes down before the write reaches disk | Bitcask, LevelDB (recommended for large data sets), Memory; writes to all replicas in parallel | HDFS | 'HyperDisk' log-structured storage with sequential scan; unclear whether compaction is performed |
Implementation | Java | Erlang, C++, C | Erlang | Java | C++ |
Primary keys | Composite hash key to distribute load around the cluster, hierarchical range keys (see the CQL sketch after the table) | String hash key | String hash key | Hash & Range keys | Range keys |
Schemas | Enforced schemas, fairly rich data types; column storage makes adding / removing columns cheap | schemaless JSON objects | schemaless JSON objects | Enforced | Enforced |
Indexing support | Per-node secondary hash indexing (limited use cases), Apache Solr integration in enterprise version | Per-node secondary hash and range indexing (limited use cases); Elasticsearch integration | Per-node secondary indexes (limited use cases); Apache Solr integration | | Rich range query support on arbitrary combinations of attributes |
Compression | deflate, lz4, snappy; configurable block size, so can exploit repetition between revisions with right data layout | snappy, per document | snappy with LevelDB backend | deflate, lzo, lz4, snappy | snappy |
TTL / expiry | yes, per attribute | yes | likely no | ||
Causality / conflict resolution | Nanosecond wall clock timestamps, last write wins; relies on ntp for time sync | | Causal contexts; sibling version resolution | | Value-dependent chaining |
Single-item CAS | Supported on any table using built-in Paxos (see the CAS sketch after the table) | Supported | Supported when configured for bucket | Guarantees linearizability for operations on keys | |
Multi-item transactions | Only manually via CAS | Only manually via CAS | Only manually via CAS | Only manually via CAS | In commercial version |
Multi-item journaled batches | Yes | No? | |||
Per-request consistency trade-offs | one, localQuorum, all, CAS | No | No | ||
Balancing / bootstrap load distribution | DHT + virtual nodes (default 256 per physical node, configurable per node to account for capacity differences); a new node streams data evenly from most other nodes & takes up key space proportional to its number of virtual nodes relative to the entire cluster | virtual buckets (typically 1024 per cluster); semi-manual rebalancing by moving entire vbuckets, documented as a fairly expensive operation | DHT + virtual nodes | Coordinator nodes manage balancing | Coordinator nodes manage balancing |
Rack aware replica placement | Yes | In enterprise edition or custom build | No | Yes | No |
Multi-cluster / DC replication | Mature & widely used in large deployments; TLS supported; choice of synchronous vs. async per query | In enterprise edition or custom build | Only in enterprise version; possibly custom build from source? | No | |
Read hot spot scaling | Increase replication factor, reads distributed across replicas | Single node responsible for all reads for a vBucket | Increase replication factor, reads distributed across replicas | Single region server responsible for all reads to a region | |
Performance characterization | Very high write throughput with low latency impact on reads. Good average read performance. Somewhat limited per-instance scaling due to JVM GC limits | Performs well on small in-memory data sets. Seems to perform worse than competitors when configured for more frequent writes for durability. Support for large datasets (>> RAM) questionable. | Most benchmarks show worse performance than competitors like Cassandra | Claims high performance [1] | |
Operations | |||||
Large-scale production use | Yes, since 2010. | Yes | Yes | Yes, since at least 2010. | No |
Support / community | fairly active community; Datastax offering help | Fairly small & young community; commercial fork | |||
Setup complexity | Low (symmetric DHT, yaml configuration) | Low to medium: package-based install (community edition), REST configuration | Low | High (HDFS, designated coordinator nodes) | High: designated coordinator nodes, still much development churn |
Packaging | Good Apache Debian packages; apt-get install gets you a fully working one-node cluster | Ubuntu packages under restrictive license / reduced functionality | Official packages with restrictive license | Comes with Debian unstable | Debian packages |
Monitoring | Fairly rich data via JMX; pluggable metrics reporter support (incl. graphite & ganglia) | Graphite collector | Graphite collector | JMX, Graphite collector | |
Backup & restore | Immutable SSTables can be snapshotted at a point in time (nodetool snapshot) & backed up as files; see the Cassandra backup documentation | mbbackup tool, FS-level backups | FS-level backups | Recommended to use a second cluster | FS-level backups |
Automatic rebalancing after node failure | Yes; fine-grained and implicit in DHT | Manually triggered, moves entire vBuckets, documented as expensive | Yes; fine-grained and implicit in DHT | In balancer background process; moves entire regions |
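For the backend that was eventually deployed, the Primary keys, Compression and TTL / expiry rows above can be made concrete with a minimal CQL sketch issued through the DataStax Python driver. All table, column and option values below are hypothetical illustrations (Cassandra 2.x compression option names are assumed), not the schema RESTBase actually uses.

```python
# Hypothetical sketch: composite partition (hash) key + hierarchical
# clustering (range) keys, block-level compression with a configurable
# chunk size, and a per-write TTL. Reuses the 'restbase_test' keyspace
# from the earlier sketch.
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('restbase_test')

session.execute("""
    CREATE TABLE IF NOT EXISTS revisions (
        domain text,         -- composite partition key, part 1
        title  text,         -- composite partition key, part 2
        rev    int,          -- clustering (range) key, level 1
        tid    timeuuid,     -- clustering (range) key, level 2
        body   blob,
        PRIMARY KEY ((domain, title), rev, tid)
    ) WITH CLUSTERING ORDER BY (rev DESC, tid DESC)
      AND compression = {'sstable_compression': 'LZ4Compressor',
                         'chunk_length_kb': 256}
""")

# Per-attribute TTL: the cells written here expire after 30 days.
session.execute(
    "INSERT INTO revisions (domain, title, rev, tid, body) "
    "VALUES (%s, %s, %s, now(), %s) USING TTL 2592000",
    ('en.wikipedia.org', 'Foo', 42, b'...'))

cluster.shutdown()
```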
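Likewise, the Single-item CAS and Per-request consistency trade-offs rows for Cassandra can be illustrated with a short sketch: a Paxos-backed lightweight transaction (compare-and-set) on a hypothetical locks table, and a read executed at an explicitly requested consistency level. All names are assumptions made for illustration.

```python
# Hypothetical sketch: single-item CAS via a lightweight transaction
# (IF NOT EXISTS), plus a query issued at LOCAL_QUORUM consistency.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('restbase_test')

session.execute(
    "CREATE TABLE IF NOT EXISTS locks (name text PRIMARY KEY, owner text)")

# Compare-and-set: the insert only takes effect if no row with this key
# exists yet; Cassandra runs a Paxos round among the replicas to decide.
rows = session.execute(
    "INSERT INTO locks (name, owner) VALUES (%s, %s) IF NOT EXISTS",
    ('render-enwiki-Foo', 'worker-1'))
print('lock acquired:', rows[0][0])   # first column is the [applied] flag

# Per-request consistency trade-off: require acknowledgement from a quorum
# of replicas in the local datacenter for this particular read.
stmt = SimpleStatement(
    "SELECT name, owner FROM locks WHERE name = %s",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM)
for row in session.execute(stmt, ('render-enwiki-Foo',)):
    print(row.name, row.owner)

cluster.shutdown()
```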