User:GWicke/Notes/Storage

See bug 48483 and the prototype implementation at https://github.com/gwicke/rashomon. Also see /Cassandra testing for rashomon / cassandra testing notes.

A lot of our quarterly goals depend on storage of metadata. Additionally, mobile and others threaten to actually use our HTML at a larger scale which makes it important to avoid cold caches.

Goals

Storage backend abstraction -- let storage specialists optimize storage, free others from having to deal with it.
External content API -- provide an efficient API to retrieve content from the mobile app, ESI, bots etc.
Caching -- important for performance and scaleability of a content API. No random query parameter URLs that cannot be purged.
Consistency -- use the same URL schema externally and internally. Return the same content internally and externally. Consistently use relative links.
Extensibility -- provide extension points for additional content-like entry points like listings.
Simple rewriting and handler registration -- use URL patterns that can be used for handler selection both in PHP and for directly proxying an entry point to a backend service in Varnish.

Rashomon REST web service

Create REST revision storage web service and use it (initially) for Parsoid HTML/metadata storage
Each revision has multiple timestamped parts
- html and JSON page properties ('meta'), one timestamped version per re-render. Lets us retrieve a page as it looked like at time X.
- single wikitext per revision
- JSON parsoid round-trip info
- arbitrary metadata added by extensions (blame maps, annotations etc)

Resource / URL layout considerations

Our page names have established URLs (/wiki/Main_Page). The content part of our API defines sub-resources, which should be reflected in the URL layout.
Relative links should work. Links are relative to the perceived path and page names can contain slashes, so relative links in a page called Foo/Bar/Baz will be prefixed with ../../. Relative links should also work both internally and externally (with differing path prefixes), and both in a sub-resource and the page itself. This basically means that we need to use a query string.
Query strings should be deterministic so we can purge URLs. This means that there should be exactly one query parameter. Options considered are:
Foo?rev=latest/html

Sounds odd, as the key does not really match the sub-resource on the right. Only the 'latest' part is really a revision, html is the part of it. ?path=rev/latest/html would avoid that, but is much longer.

Foo/?/rev/latest/html

Looks more path-y, but is longer and more noisy.

Foo?rev/latest/html

Short and does not induce strange meaning like key=value. The path is a bit more broken up than the second option, but looks natural and less noisy for people used to query strings.

Strawman Rashomon API

GET /v1/enwiki/page/Main_Page?rev/latest -- redirect to /enwiki/latest/html, cached
GET /v1/enwiki/page/Main_Page?rev/latest/html -- returns latest html, cached & purged
GET /v1/enwiki/page/Main_Page?rev/latest/ -- list properties of latest revision, cached & purged
GET /v1/enwiki/page/Main_Page?rev/ -- list revisions, latest first
GET /v1/enwiki/page/Main_Page?rev/12345/ -- list properties with timeuuids of specific revision
GET /v1/enwiki/page/Main_Page?rev/12345/html -- redirects to latest html timeuuid URL
GET /v1/enwiki/page/Main_Page?rev/2013-02-23T22:23:24Z/html
    find revision as it was at time X, not cacheable, redirects to
GET /v1/enwiki/page/Main_Page?rev/8f545ba0-2601-11e3-885c-4160918f0fb9/html
    stable revision snapshot identified by Type 1 UUID. Immutable apart from HTML spec updates.

Assumptions:

A separate table is used to record a mapping from key name to content-type, update policy etc. If a key is not defined there, conservative defaults like application/octet-stream will be used.

See this background on UUIDs. Type 1 UUIDs ('timeuuid') encode a timestamp plus some random bits. Cassandra stores them in timestamp order, which lets us query these UUIDs by time. Other backends have similar time-based UUID representations.

Editing:

POST /v1/enwiki/page/Main_Page?rev/

Atomically create a new revision with several properties.

Required post vars with example values:

_parent=Main_Page?rev/12344: The parent revision. Returned as x-parent header with regular GET requests, and part of JSON returned at /enwiki/page/Main_Page?rev/ history info.
_rev=12345: The new revision. Returned as x-rev: Main_Page?rev/12345 header with regular GET requests. Also part of history info.

Optional post vars:

_timestamp=2013-09-25T09:43:09Z: Timestamp to use in timeuuid generation. Needed to import old revisions. Should require special rights. Normal updates should use the current time.

Typical property post vars:

html, wikitext, parsoid: The html, wikitext or parsoid information of this revision. All entries that are passed in are stored atomically.
meta: The page metadata, JSON-encoded. Language links, categories etc. Divided into static (in page content) and dynamic parts (template-generated, can change on re-expansion).
Returns: JSON status with new timeuuid on success, JSON error message otherwise. Implicitly purges caches.

POST /v1/enwiki/page/Main_Page?rev/8f545ba0-2601-11e3-885c-4160918f0fb9/

Insert new (versions of) properties for the given timeuuid base revision. A new timeuuid will be generated.

Typical property post vars:

html, wikitext: The html and wikitext of this revision
meta: The page metadata, JSON-encoded. Language links, categories etc. Divided into static (in page content) and dynamic parts (template-generated, can change on re-expansion).
Returns: JSON status with new timeuuid on success or JSON error message otherwise. Implicitly purges caches.

POST /v1/enwiki/page/Main_Page?rev/12345/

Insert new (versions of) properties for the given revid base revision. Alternative form for the timeuuid-based update above. A new timeuuid will be generated.

Typical property post vars:

html, wikitext: The html and wikitext of this revision
meta: The page metadata, JSON-encoded. Language links, categories etc. Divided into static (in page content) and dynamic parts (template-generated, can change on re-expansion).
Returns: JSON status with new timeuuid on success, JSON error message otherwise. Implicitly purges caches.

PUT /v1/enwiki/page/Main_Page?rev/8f545ba0-2601-11e3-885c-4160918f0fb9/html

Destructively update a versioned property. This property needs to exist. Example use case: update HTML to the latest DOM spec. Requires elevated rights.

Returns: JSON status on success, JSON error message otherwise. Implicitly purges caches.

Strawman front-end API tunneled to Rashomon

Following the goal of using the same URL schema internally and externally, the content API can be made publicly available as:

GET /wiki/Main_Page?rev/latest/html -- returns latest html, purged on new revision / re-render

Relative links in JSON listings and HTML will work independent of the prefix, and existing page URLs are used as the base for subresources.

Key/value store without versioning

This addresses a similar use case as the DataStore RFC. It is not terribly important for the immediate Parsoid needs, but can easily be added in the storage front-end. Each bucket has a name so that each extension can have its own namespace.

Create a simple blob bucket:

PUT /v1/enwiki/math-png
Content-type: application/vnd.org.mediawiki.bucket.1.0+json

{'type': 'blob'}

Get Bucket properties

GET /v1/enwiki/math-png

Content-type: application/vnd.org.mediawiki.bucket.1.0+json

{'type': 'blob'}

Add an entry to a bucket:

PUT /v1/enwiki/math-png/96d719730559f4399cf1ddc2ba973bbd.png
Content-type: image/png

Fetch the image back:

GET /v1/enwiki/math-png/96d719730559f4399cf1ddc2ba973bbd.png

List bucket contents:

GET /v1/enwiki/math-png/ -- returns a JSON list of 50 or so entries in random order, plus a paging URL
=>
Content-type: application/vnd.org.mediawiki.bucketlisting.1.0+json

{ .. }

Similarly, other bucket types can be created. Example for a bucket that supports efficient range / inexact match queries on byte string keys and a counter:

// Create an ordered blob bucket
PUT /v1/enwiki/timeseries
Content-Type: application/vnd.org.mediawiki.bucket.1.0+json

{ 'type': 'ordered-blob' }

// Add an entry
PUT /v1/enwiki/timeseries/2012-03-12T22:30:23.56Z-something

// get a list of entries matching an inequality
GET /v1/enwiki/timeseries/?lt=2012-04&limit=1

// range query
GET /v1/enwiki/timeseries/?gt=2012-02&lt=2012-04&limit=50

Another example, this time using a counter bucket:

// Create an ordered blob bucket
PUT /v1/enwiki/timeseries
Content-Type: application/vnd.org.mediawiki.bucket.1.0+json

{ 'type': 'counter' }

// Read the current count
GET /v1/enwiki/views/pages

// Increment the counter, optionally with an increment parameter
POST /v1/enwiki/views/pages

Notes:

Access rights and content-types can be configured by bucket. Entries in public buckets are directly accessible to users able to read regular page content through the public web api: GET /wiki/?blob/math-png/96d719730559f4399cf1ddc2ba973bbd.png
Paging through all keys in a bucket is possible with most backends, but is not terribly efficient.
The ordered-blob type can be implemented with secondary indexes or backend-maintained extra index tables.

A generic web service PHP API can use URLs directly:

// $store is the generic web service store object that is dependency injected

// Add a new entry
$res = $store->run( 'PUT', '/math-png/96d719730559f4399cf1ddc2ba973bbd.png', $value );
// Also set some headers explicitly. These are stored along with the value, and returned for web requests.
$res = $store->run( 'PUT', '/math-png/96d719730559f4399cf1ddc2ba973bbd.png', $value, 
                    array( 
                        'headers' => array(
                            'Content-type' => 'image/png' 
                        )
                    ) 
               );

// Add several entries in a batch, returns array of results
$res = $store->run( 
        array(
            array( 'PUT', '/math-png/96d719730559f4399cf1ddc2ba973bbd.png', $value1 ),
            array( 'PUT', '/math-png/96d719730559f4399cf1ddc2ba973bbd.png', $value2 )
        )
    )
);

// Read one entry back
$res = $store->GET( '/math-png/96d719730559f4399cf1ddc2ba973bbd.png' );

// Read a batch, returns array of results
$res = $store->run( 
    array(
        array( 'GET', '/math-png/96d719730559f4399cf1ddc2ba973bbd.png' ),
        array( 'GET', '/math-png/96d719730559f4399cf1ddc2ba973bbd.png' )
    )
);

// Read a view counter and increment a click counter in one go
$res = $store->run( 
    array(
        array( 'GET', '/pageviews/Foo' ),
        array( 'POST', '/clicks/Foo/button1', array( 'increment' => 1 ) )
    )
);

The generic store has a registry of prefixes to storage backends. Matching happens longest-prefix first. Example:

// Specialized storage service for a specific bucket
$wgStoreBackends['/math-png'] = array (
    // Simple example: round-robin through these HTTP backends
    'https://10.1.1.246/v1/enwiki/math-png/',
    'https://10.1.1.247/v1/enwiki/math-png/',
    'https://10.1.1.248/v1/enwiki/math-png/'
);

// General Rashomon storage service for all remaining buckets
$wgStoreBackends['/'] = array (
    'https://10.1.1.146/v1/enwiki/',
    'https://10.1.1.147/v1/enwiki/',
    'https://10.1.1.148/v1/enwiki/'
);

Backends can be HTTP-based, but can also be transparently mapped to local PHP classes or other protocols. The path-based namespace presents a simple and unified API and avoids the need for many different classes. New backend features or backends can be added without a need to upgrade the client bindings. Bulk operations can be performed across several backends.

Post request encoding

application/x-www-form-urlencoded

default for forms
inefficient for binary data (30% base64 overhead)
no per-part headers apart from the value name

multipart/form-data

supported in all browsers
efficient for binary data
no per-part headers apart from the value name

multipart/related

not natively supported for form submissions
efficient for binary
can use content headers per part (especially Content-Type), full mime

Likely best: Support all three, and use default values when content-type etc headers are absent. We could also add support for a http-equiv entry for each form name. Example: form field html with corresponding HTTP-equivalent headers in a JSON object called _http-equiv.html.

Minor-version API versioning per entry point via Accept header

For now we are mainly interested in API versioning (application/vnd.org.mediawiki.bucketlisting.1.0+json or application/vnd.org.mediawiki.html.1.0+xml), as we only plan to support JSON for listings and other dynamic content. Other resources have a single content-type too. The issue with using the Accept header is caching. We'll have to vary on Accept, but would also like to avoid fragmentation for browser requests.

Possible solution:

strip / normalize Accept headers that don't contain a single type that matches vnd.org.mediawiki
- these will receive the default type and latest API version
Forward all other Accept values to the backend

Use cases:

The Parsoid DOM spec version is incremented whenever anything about the HTML output changes. Stored versions will transparently be updated to the latest version, but clients can request older versions of the DOM spec using the Accept header.

Storage backend

We are looking for a solution that

is scaleable and reliable: replicated, good load balancing
compresses consecutive revisions of the same data. Storing HTML after each re-render would be very expensive otherwise.
can answer simple queries like 'page X at time Y' efficiently (clustering, limited range queries)
is easy to work with and bind to
causes little work for ops

Cassandra

Distributed storage with support for indexes, CAS and clustering / transparent compression. Avoids hot spots for IO (problem in ExternalStore sharding scheme).

Idea: Use this for revision storage, with a simple node storage service front-end. Easier to implement than trying to build a frontend for ExternalStore, provides testing for possible wider use.

Prototype API implementation: https://github.com/gwicke/rashomon
Current schema: cassandra-revisions.cql
Ongoing testing: /Cassandra testing

History compression

It used to be more efficient when pages on Wikipedia were still smaller than the (typically 64k) compression algorithm window size: meta:History_compression.

# -100 is 100 concatenations of the single file.
# First a page larger than the typical 64k compression window. 
# Only lzma fully picks up the repetition with its large window.
-rw-r--r-- 1 gabriel gabriel 143K Sep 23 14:00 /tmp/Atheism.txt
-rw-r--r-- 1 gabriel gabriel  14M Sep 23 14:01 /tmp/Atheism-100.txt
-rw-r--r-- 1 gabriel gabriel 7.8M Sep 23 14:29 /tmp/Atheism-100.txt.lz4
-rw-r--r-- 1 gabriel gabriel 5.0M Sep 23 14:02 /tmp/Atheism-100.txt.gzip9
-rw-r--r-- 1 gabriel gabriel 1.3M Sep 23 14:01 /tmp/Atheism-100.txt.bz2
-rw-r--r-- 1 gabriel gabriel  49K Sep 23 14:05 /tmp/Atheism-100.txt.lzma

# Now a small (more typical) 7k page, this time as HTML. 
# Compression works well using all algorithms.
# LZ4 (fast and default in Cassandra) outperforms gzip -9.
-rw-r--r-- 1 gabriel gabriel 7.0K Sep 23 14:16 /tmp/Storage.html
-rw-r--r-- 1 gabriel gabriel 699K Sep 23 14:16 /tmp/Storage-100.html
-rw-r--r-- 1 gabriel gabriel 6.8K Sep 23 14:17 /tmp/Storage-100.html.gz
-rw-r--r-- 1 gabriel gabriel 5.7K Sep 23 14:29 /tmp/Storage-100.html.lz4
-rw-r--r-- 1 gabriel gabriel 4.9K Sep 23 14:16 /tmp/Storage-100.html.bz2
-rw-r--r-- 1 gabriel gabriel 2.2K Sep 23 14:18 /tmp/Storage-100.html.lzma

Size stats enwiki: 99.9% of all articles are < 64k
rzip implements fairly efficient large-window matching and -compression; Go implementation of the delta encoding phase only

Cassandra compression

Cassandra stores compound primary key data as so-called wide rows. With the right schema this means that the text column of a given article will be compressed and stored sequentially. I benchmarked different compression algorithms and compression input block sizes. Of the compression algorithms available in Cassandra, Deflate (gzip) does best as it recognizes repetitions within its 32k sliding window, which means that many copies of the same article compress really well as long as it is smaller than about 32k. LZ4 and Snappy both process fixed 64k blocks at a time, so don't find many repetitions for typical (19k in this sample) article sizes.

20 * 5k articles any size (93M*20), Deflate, 256k block: 488MB (26%)
20 * 5k articles < 30k (39M*20), Deflate, 256k block: 48MB (6.2%)
20 * 10k articles < 10k (23M*20), Deflate, 256k block: 26MB (5.6%)

Adding LZMA compression would result in much better results for large articles and HTML at the price of (~5x [1]) slower compression. For highly compressible content like article revisions, decompression will likely be about as fast or even faster than deflate as the compressed input is much smaller. Slower compression might also be fine as our write rates are low and the CPU load would be evenly spread across all Cassandra boxes. With deflate on my puny dual-core laptop insertion rates are routinely in the 500 rev/second range, mostly CPU-bound in the node client. We currently have less than 50 edits / re-renders per second in the cluster.

Tests were done by inserting the first 5k/10k articles from an enwiki dump into a cassandra table with this layout:

CREATE TABLE revisions (   
    name text,   
    prop text,   
    id timeuuid,   
    value blob,   
    PRIMARY KEY (name, prop, id) 
) WITH compression = { 'sstable_compression' : 'DeflateCompressor',
'chunk_length_kb' : 256 };

The Cassandra db file size was then summed up using du -sS.

See also User:GWicke/Notes/Cassandra.

Alternatives

The stateless storage API frontend will abstract the backend, so we are free to swap that out. Some backend contenders to consider:

HBase: DB layer on top of HDFS. Strongly consistent, less available. No per-request consistency selection. The underlying HDFS is hierarchical (naming nodes) rather than a flat DHT. Possibly the strongest alternative to Cassandra.
Swift: A bit hacky, not symmetrical. Lacks clustering / compression features. Had some issues in the thumb storage application.
Riak: Similar to Cassandra. Seems to optionally offer column compression with leveldb. Reportedly less mature and slower. Smaller community. No cross-datacenter replication in open source edition, and no integrated CAS / paxos.

Related REST storage interfaces

[2] [3] Amazon S3
Swift
couchDB - underscore prefix for private resources
Sun cloud API
Google Apps
- Google drive
Facebook APIs

Authentication

MediaWiki OAuth 1 implementation
OAuth2 / OpenID connect with JWT bearer tokens. Self-describing signed claims, cheap to verify per request without DB access.
- Google authentication -- also supports SAML
- PayPal
- Most of the OAuth2 providers in en:OAuth#List_of_OAuth_service_providers -- includes Microsoft, Facebook etc