Wikibase/Indexing

Task tracking: wikidata-query-service in Phabricator; watch the project to receive notifications

Goals / Requirements

edit

End of March

edit
  • Publicly downloadable prototype (query and import and update)
  • Labs install (details hazy, same prototype)
  • Hardware requested
  • 12 month roadmap

Phase 1: MVP (Before end of June)

edit
  • Support use-cases of WikiGrok, remain flexible in architecture to eventually support external requests/third parties.
Must have:
edit
  • query console for experimentation

As a user, I want to see a WikiGrok question immediately on the mobile site, as soon as I load an article that has any active WikGrok campaigns, so that I can respond quickly and keep getting more questions in real time, based on the ones I've already answered.

Requirements:

  • available through an API
  • available through server-side requests, connecting through PHP
  • for any query, result output in JSON (XML would be nice to have)
  • results may sometimes come in the form of lists (e.g., List of all possible occupations) and sometimes in the form of a single item
    • Simple (single-item) queries: (e.g., "is this item X and not Y?")
      • generate live and run quickly – as fast as possible – to serve immediately to users via WikiGrok (and potentially continue serving more results on the fly after user input)
    • Complex queries: (e.g., list of all possible suggested occupations)
      • pre-generate results and store in a table or cache, so these queries can run longer (but still within some reasonable timeframe, e.g. 1 hour)
      • regenerate as often as practical/possible

Phase 2: Support for public/external requests

edit
  • ideally, public web service
    • external requests return within a few seconds, use reasonable resources
      • how to enforce that constraint needs to be determined and influences the architecture
    • internal requests are allowed to use more resources & time
      • these need to not crash external requests and external cannot crush internal
  • needs to support continuous updates to reflect latest Wikidata state
    • Seconds or even a minute or two lag seems acceptable at this point but nothing beyond that.
  • handle high request volumes (horizontal scaling)
  • handle a large data set (sharding)
  • robust: automatic handling of node failures, cross-datacenter replication, proven in production
  • reasonable operational complexity

Candidate solutions

edit
  • Distributed graph database
  • Supports online modification (OLTP), so can reflect current state
  • Expressive query language (Gremlin); shared with other graph dbs like Neo4j
  • Implemented as a thin stateless layer on top of Cassandra or HBase: transparent sharding, replication and fail-over
    • async multi-cluster replication can be used for isolation of research clusters, DC fail-over
  • Supports relatively rich indexing, including complex indexes using ElasticSearch
    • Can gradually convert complex queries into simple(r) ones by propagating information on the graph & adding indexes
  • TinkerPop blueprints support, including Gremlin and the GraphSail RDF interface

Magnus' Wikidata Query service

edit
  • Custom in-memory graph database implemented in C++
  • Relatively expressive, custom query language
  • Limited to a single machine
    • Current memory usage: 5G RSS
    • Load balanced systems are available
  • The initial dump conversion is extremely slow, so if a server process crashes and its dump gets corrupt you face a prolonged outage. Even if we optimize it by an order of magnitude, it will still be slow.
  • Server startup time is not nice either, and will only grow with the growth of the dataset. While not critical e.g. for Redis that we keep running for months, the less mature nature of WDQ and the need to cater for development/future bugs will mean a lot of restarts, each being a PITA.
  • You can't run in production the DB query used to retrieve latest changes is, so this part will have to be redone completely.
    • And this update routine results in a possible race condition making it miss some changes.
  • Each entity's properties are retrieved with a separate uncacheable HTTP request to Special:EntityData which isn't very fast so as the rate of changes increases, WDQ will bump into it hard, not being able to cope with updates.
  • Thread synchronization model needs to be totally redone, as currently used spinlocks aren't scalable.
  • Some code paths in the update code are missing a synchronization which makes one wonder how often does it crash and how much often will it crash under a production load. And if you add synchronization there, locking is probably going to bring everything to a halt.
  • With a crapload of very small objects in the same heap, a long-running server process would have problems with heap fragmentation/memory management performance, requiring custom memory management, etc.
  • Scalability would be very primitive: as you add more machines they start making more requests, etc.
  • Would need some manual solutions for approximate clustering, failover, etc.
  • Would require in-house maintenance.
  • Apache 2 License
  • Distributed graph database
  • Supports online modification (OLTP)
  • Supports Gremlin, as well as it's own SQL-like query language with graph features and no JOINs
  • Drop-in Lucene plugin for geospatial indexes (also full-text, not that we would use that)
  • Replication is multi-master and works via Hazelcast, read/write quorums, and merging conflicts
    • Isolation in ACID drops a bit when distributed (intermediate results may show briefly)
    • Isolation is SERIALIZABLE for direct FS access case, READ COMMITTED for remote access (what we would use)
    • Since I'd like to input data from hub feeds (starting with dumps), async replication isn't really an issue; both DC would just use the same process to pull in updates
      • Actually OrientDB 2.0 supports async replication via config [1]
  • Supports automatic round-robin sharding as well as application directed sharding (specifying clusters for class item insertions and reads, reads default to checking all partitions)
  • Supports various indexes (SBTree,hash, both unique or not unique) and primitive as well as embedded data structures (sets/lists/map) that can be indexed
  • Queries on the JSON itself are also possible regardless of nesting levels
  • Supports Tinkerpop blueprints
  • Good query timeout support via TIMEOUT
  • Apache 2 License
  • Multi-model database: key/value, document and graph DB with a query language combining all three models
  • Complex queries including joins between multiple collections
  • ACID: transactions spanning multiple documents and collections
  • distributed: replication (asynchronous master-slave) and sharding
  • AQL (ArangoDB Query Language), a declarative query language similar in spirit to SQL, designed with multi-model in mind and allowing joins. Has also other querying options.
  • HTTP API for RESTful interfaces
  • API extensible by user-defined JavaScript code (V8 embedded) via the Foxx framework
  • Supports relatively rich indexing (skip-list, hash, geo, full-text)
  • Extensive reference documentation in gitbook format
  • Convenient web front end integrated
  • Packages for Debian, other Linux distributions, Mac OSX and Windows available in repository on website
  • Virtual machine images and docker container available
  • Very modern code base (C++11) with many unit tests
  • Good community and professional support
  • Few open issues on github, new ones are quickly dealt with
  • Regular major releases every two to three months
  • Concerns
    • Memory-only indexes
    • Only one full-text index per collection, but a river plugin for elastic search exists[1]
    • Complex clustering/sharding model which is still in development, integration with mesos is being worked on
    • No indexes on array properties yet[2], this is being worked on
  • GPL & AGPL
  • Some capabilities may be enterprise-only
  • Dynamic schema, supports fulltext indexes and probably geospatial with plugin, also looks like pluggable indexes are possible
  • Cypher SQL-like query language, TP3 support
  • Labels may be used to make efficient lookups for yes/no properties (see in the blog)
  • Has query planner/profiler
  • Good HA support, no sharding though
  • Lucene indexes
  • Concerns
    • Mixed indexes support - i.e. is it possible to use 2 indexes together?
    • Edge indexes only old-style - either manual-fill or pre-configured autoindexer which needs to be configured each time anew
    • Not clear how we handle geodata
    • No ability to store arbitrary JSON/blob for non-indexed secondary lookup
    • Multi-valued properties support - TP3 implementation uses node-per-value
    • What we're missing by running non-Enterprise version (if anything)
    • How sharding-less replication works in practice on our data size?
  • Questions
    • Are edge properties first-class citizens re: indexing?
    • Looking up property values - more efficient as edge properties or vertex properties? I.e. are claims edges or vertices?
    • Is Enterprise version just AGPL or has some other requirements?

Other possible candidates

edit
  • "Big Data" - GPLv2, Tinkerpop 2.5
  • Graphx - Apache License, Alpha, Uses Spark
  • Virtuoso cluster
  • 4store
  • Neo4j - GPL and AGPL - Tinkerpop 3.0 reference implementation
  • Build one directly against Elasticsearch - Horrible/provocative idea - We currently have the expertise but totally against most of our goals – OTOH we were resigned to having to hack on Titan's Elasticsearch integration anyway

Open questions

edit
  • Paging of large result sets
  • Handling of cycles in the graph
  • How to index the graph for efficient common query use cases
  • Efficient updates for materialized complex query results

Candidate Evaluation

edit
Titan OrientDB ArangoDB Neo4J
Future Proofing
Automatic sharding for large data set? Yes, fully automatic at Cassandra layer through DHT structure. Can add one node at a time. Manual shard setup and assignment, but automatic write balancing across shards. In the community edition, any changes to the cluster config including adding nodes or rebalancing shards require a complete cluster restart. [3] In Development. [4] Currently the sharding setup is manual, but the data distribution is automatic. Zero administration and integration with Apache mesos clusters is being worked on. Replication - yes, sharding - no
Multi-cluster replication? Yes No No ?
Single point of failure? No. Storage backend is Cassandra which doesn't have a single point of contention, much less failure. No No. Automatic failover is being worked on. Currently asynchronous replication is possible but requires manual setup. No
Large-scale production use? Yes - Not clear what large number of Titan servers is. Lots of documentation on very large Cassandra clusters. Netflix has blogger about ~285 node Cassandra clusters pushing a million writes a second. We won't get near that hardware. Yes - five or six server clusters Not clear what "large-scale" means, but customers use clusters in production. AFAIK Yes, not sure of the size but read about bllion-nodes-sized setups.
Active community? Mailing list had one half the emails of Elasticsearch in October. Mailing list had one half the emails of Elasticsearch in October. Yes, github, google group and stackoverflow. stackoverflow, mailing list, github
Usage
Do we like the query language? Gremlin (maybe SPARQL) Modified SQL (maybe Gremlin if we really want) AQL, graph traversal Cypher, TP3
Can we customize the query language? Monkey patching, front-end service front-end service Yes (Foxx) Plugins
Indexing capabilties Hash and Elasticsearch Hash, range and Lucene hash, skiplist, geo, fulltext (one per collection), river for elastic search
Parallel request scheduling (Are different parallel queries sufficiently well (fair?) scheduled so that resource intensive and low latency interactive ones can run on the same cluster?) You'd have to do this by running queries on different machines and sharing the same storage backends. Inside a single machine queries are just have threads doing lazy evaluation and Java threads. Incoming HTTP requests are scheduled in a fair way. Numbers of server threads can be configured. If not all threads are busy with long running requests, then scheduling is fair.
Query time or cpu limit Timeout added in very newest version. Its possible to write degenerate queries the circumvent the timeout. Good, via TIMEOUT Possible to cancel queries externally
Query memory limit Done using process/machine isolation. The storage layer is shared but that isn't memory issue. No No
Query support (use cases)
items by multiple property values Yes (has) Yes Yes
recursive collections (sub-taxons of x) Yes(loop, out) Yes ?
Follow edges in inverse direction Yes (in, out) Yes Yes
top n Yes (order+loop or []) Yes Yes
range With Elasticsearch Yes; includes SB-Trees Yes
2d range, geo With Elasticsearch Yes Yes
prefix match Yes (Text.PREFIX) Yes; SB-Trees Yes
query based on custom set (birth places of 5000 people) Yes (retain) Yes? Yes
Filtering/intersection a custom set Yes (retain) Yes? Yes
Matching against a transitive property (instance of human, including subclasses) Yes(loop, out) Yes; using TRAVERSE or Gremlin ?
"Countries sorted by GDP" Yes (but GDP doesn't seem to be in wikidata props) Yes; Composite SB-Tree and edge per prop statemet Yes
"People born in a city with more than 100k inhabitants" Yes Yes Yes
"largest 10 cities in Europe that have a female mayor" Yes Yes; subquery or JSON filtering in WHERE for example Yes
Distinct values of a property Yes (dedup) Yes; SB-Tree or hash index Yes
max/min/sum/count/avg Yes (sideEffect) Yes Yes
intersection/union/difference Union - yes (copySplit & merge), intersection/difference - probably yes (retain/except) but may have limitations Yes; via SQL UNION/WHERE/subqueries, maybe Gremlin too Yes
Join based on property value Trivial for edges, may not be efficient for properties without specific index Yes Yes
Support for subproperties: "people who have <subproperty of location> with Newcastle" There's no concept of subproperties as such, but properties can be hashmaps and filters can use code to match against them Yes; nested JSON can be queried all the way down Yes
Dev
Do we have the expertise to hack on it? Java and Groovy Java and maybe Groovy C++/JS
Is upstream excited to have us? Yes Yes!!!! Yes, definitely!
Does contributing require a CLA? Looks like no Yes, copied and pasted from Apache Yes, see [5]
OPS
How do we manage housekeeping? [please define]
Can we extract useful metrics for monitoring and alerting? Standard Cassandra instrumentation and basic metrics from Titan itself Yes, statistics are kept and available in the cluster dashboard.
How do we recover from a crash? Entire cluster: Automatic fail over to other cluster (DC); can restore from backups of immutable LSMs Whole cluster: Restart cluster and restore from dump.
What happens if a node goes down? Remove node from LVS pool. Coordinator node: nothing, DBserver node: Have to manually connect asynchronous replica as new node. Automatic failover is being worked on.
What happens if a backend storage node goes down? Automatic rebalancing between remaining nodes in cluster. No backend nodes Have to manually connect asynchronous replica as new node. Automatic failover is being worked on.
How do we back its data up? Normal Cassandra disk backup of immutable SSTable files after snapshot. Backup is not in community edition; use JSON dumps + hub updates (the primary data is in MySQL/ExternalStore anyway) arangodump and arangorestore for cluster are fully supported in the community edition. Asynchronous replication as well.
Source (for importing data, sample queries, etc) https://github.com/smalyshev/wikidata-gremlin/ branch titan_flat https://github.com/AaronSchulz/WikiDataQueryOrient https://github.com/triAGENS/ArangoDB-Data

See also

edit

WDQS Beta based on RDF/SPARQL

edit

Not candidates

edit
  • Neo4j (replication only in "Enterprise Edition")
  • Offline / batch (OLAP) systems like Giraph (we need the capability to keep the index up to date)
  • Wikidata Query service: single node, no sharding, no replication
  • Caley: no serious production use
  • ArangoDB: looks like it might work but they don't allow upstream contributions. Answer to this: This was never true and is not true, see [5] for the CLA procedure. When they get that problem solved and OrientDB and Titan fall through then we can promote them again.

References

edit