Platform Engineering Team/Personal Development Share Back/Distributed Storage Transactions

Idea edit

Distributed data stores can provide massive scalability, fault-tolerance, and replication semantics for robust geographic distribution — compelling features for an organization like Wikimedia. However, these systems have also sacrificed important properties for the sake of their distribution, such as joins, or ACID transactions. We are therefore required to evaluate these systems on the basis of the trade-offs between their unique capabilities, and what must be sacrificed to use of them.

To illustrate the problem, consider an application like MediaWiki. Users make edits, creating new revisions of pages. Users, pages, and revisions are all objects that require state to be persisted.  It makes sense to organize — or model — these objects grouped by like-entities (users with users, pages with other pages, etc), this is called normalization.  However, it must still be possible to maintain correctness of this disjoint data during updates, and to join linked objects on query.  MediaWiki utilizes an RDBMS, and such systems are well suited to models like this. Without multi-item transactions though, maintaining the referential integrity of such a data model is not possible, and de-normalization becomes necessary. De-normalization results in less flexibility, and duplication which can make maintaining correctness more challenging.

While it is unlikely that we'd ever both have our proverbial cake, and be able to eat it, there is a growing body of research that explores the idea of adding transactions to distributed databases. Even limited support for multi-item transactions could be a game-changer, opening the door to use-cases that would benefit from distribution, but might otherwise be considered intractable.

What is proposed here is a long-term, open-ended project to evaluate, research, and experiment with technologies and techniques to address some of these missing capabilities. This work will focus particularly on Apache Cassandra, since it is a system already in use at Wikimedia.

Outcomes/Goals edit

Phase 1 edit

An evaluation of the client-coordinated transaction commitment protocol presented in "Scalable Distributed Transactions across Heterogeneous Stores".

...we propose an approach that enables multi-item transactions with snapshot isolation across multiple heterogeneous data stores using only a minimal set of commonly implemented features such as single item consistency, conditional updates, and the ability to store additional meta-data. We define a client-coordinated transaction commitment protocol that does not rely on a central coordinating infrastructure. The application can take advantage of the scalability and fault-tolerance characteristics of modern key-value stores and access existing data in them, and also have multi-item transactional access guarantees with little performance impact.

This phase would result in a greenfield implementation of Cherry Garcia, an (unreleased) implementation outlined by the authors. Creation of this implementation is meant to serve as an opportunity to better understand the protocol, and to provide the basis with which to experiment (performance, correctness, etc). While it may be possible that such an implementation could form the basis for future production code, the scope of this phase is that of a prototype, or proof-of-concept, and nothing more.

Contributors edit

Clara Andrew-Wani and Eric Evans

Reviewers edit

Tim Starling and Daniel Kinzler