This page is obsolete. It is being retained for archival purposes. It may document extensions or features that are obsolete and/or no longer supported. Do not rely on the information here being up-to-date. The Wikibase RDF dump format is likely what you are looking for.
This page describes the proposed data model for representing Wikibase data in a graph database, such as a Gremlin-compatible database.
This is version 2 of the model; the previous version (v1) can be found in the page history.
Terminology
The data model is based on the Wikibase Data Model, and terms such as "item", "claim" and "property" are derived from that data model. Please keep the terminology compatible with the Wikibase glossary.
To distinguish Wikibase Properties and Entities from Titan properties and other constructs, references to the Wikibase terms Property, Item and Entity are capitalized.
Titan vertex names are listed in italics. Titan property (key) names are listed in bold.
Vertices
Each Wikibase Entity (both Q and P Entities) is represented as a vertex in the graph. Each Entity has a unique vertex, which has the property wikibaseId holding its Wikibase ID as is (e.g. Q42 or P31). The vertex also has properties named labelLNG, where LNG is the language code (such as labelEn for the English label), and may have more properties as necessary.
Descriptions are not stored, but the fact that the item has a description in a certain language is stored as a boolean value in descLNG. So if the item has a description in English, descEn is set to true.
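The vertex layout above can be sketched with a plain Python dictionary standing in for the Titan vertex. This is only an illustration, not the importer's code: the function name entity_vertex is hypothetical, and capitalizing just the first letter of the language code is an assumption based on the labelEn example (how multi-part codes are named is not stated here).

```python
def entity_vertex(wikibase_id, labels, description_languages):
    """Build the property map for one Entity vertex (illustrative only)."""
    props = {"wikibaseId": wikibase_id}
    for lang, text in labels.items():
        # labelLNG, e.g. labelEn for the English label.
        props["label" + lang.capitalize()] = text
    for lang in description_languages:
        # Only the presence of a description is recorded, not its text.
        props["desc" + lang.capitalize()] = True
    return props


usa = entity_vertex("Q30", {"en": "United States of America"}, ["en"])
```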
Claims
Claims on Entities are represented by edges. The edge leads either to the vertex representing another Entity or to the special vertex named property, depending on whether the value is a link to another Entity or a scalar. See below for the details of the representation.
Claim edges have a wikibaseId property matching the Claim ID in the data set. The Property name is also stored in the edge property named property.
Representing link between Entities
A link between two Entities is represented as an edge going from the owning Entity to the claimed Entity, with the edge label being the Property represented by the claim.
For example, if "USA" (Q30) is claimed to be "instance of" (P31) a "country" (Q6256), this produces an edge from vertex Q30 to vertex Q6256 labeled P31.
Additionally, the claiming vertex has a set (multi-value) property named after the Property with the suffix link (i.e. in the example above, P31link) which contains the IDs of all Entities linked to this vertex. This is done in order to speed up queries like "list of humans" or "list of countries". Note that this list does not distinguish between claims and ignores qualifiers; if you need this distinction, the edges should be used.
For additional indexing by ElasticSearch, the P??link property has a duplicate property P??link_ (note the underscore) which lists the IDs of the vertices as a space-separated string (e.g. "Q123 Q345 Q5678").
Additionally, the value property (see below) of the link edge is set to the ID of the linked Entity, so in the example above the edge P31 would also have a property P31value with the value "Q6256". This allows faster filtering of the edges without the need to load the target vertex data.
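As a rough sketch of how one Entity link touches both the edge and the claiming vertex, the following uses dictionaries in place of Titan elements. The function name add_link_claim, the "out"/"in" keys, the claim ID string and the sorted order of the space-separated string are all assumptions made for illustration.

```python
def add_link_claim(vertex, claim_id, prop, target_id):
    """Record an Entity-valued claim on `vertex` and return the claim edge."""
    # Multi-value set property, e.g. P31link, used to speed up list queries.
    links = vertex.setdefault(prop + "link", set())
    links.add(target_id)
    # Space-separated duplicate, e.g. P31link_, for ElasticSearch indexing.
    # Sorting is an assumption; it only keeps the string deterministic.
    vertex[prop + "link_"] = " ".join(sorted(links))
    return {
        "label": prop,               # edge label is the Property
        "out": vertex["wikibaseId"],  # owning Entity
        "in": target_id,             # claimed Entity
        "wikibaseId": claim_id,      # Claim ID from the data set
        "property": prop,
        prop + "value": target_id,   # lets filters avoid loading the target
    }


usa = {"wikibaseId": "Q30"}
edge = add_link_claim(usa, "Q30$hypothetical-claim-id", "P31", "Q6256")
```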
Representing Property value
A Property value, i.e. a claim having a non-Entity value such as a string, number, date or location, is represented by a property of the claim edge. The data value is stored in the edge property named after the claim Property with the suffix value. The target vertex of the claim edge is a special node named property.
For example, if "USA" (Q30) is claimed to have a population (P1082) of 318,697,314 people, this produces an edge from vertex Q30 to the vertex property, labeled P1082 and having a P1082value property of 318,697,314.
Giving the value properties separate names per Property makes it possible to create a typed index (such as a fulltext or geospatial index) on the values.
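The population example could be modeled roughly as follows, again with a dictionary standing in for the Titan edge. The helper name value_claim_edge, the "out"/"in" keys and the claim ID are hypothetical.

```python
PROPERTY_NODE = "property"  # the special target vertex for scalar values


def value_claim_edge(source_id, claim_id, prop, value):
    """Build the edge for a claim whose value is not an Entity."""
    return {
        "label": prop,
        "out": source_id,
        "in": PROPERTY_NODE,
        "wikibaseId": claim_id,
        "property": prop,
        prop + "value": value,  # e.g. P1082value, indexable per Property
    }


population = value_claim_edge("Q30", "Q30$hypothetical-claim-id", "P1082", 318697314)
```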
Representing data types
The data for the values is stored as presented in the imported data, except for the following types, which are processed:
- globe-coordinate - is represented in Titan as a Geoshape object. Note that this is a Titan-specific representation which may need to be changed for other backends.
- monolingualtext - is currently represented as a "language:text" string. We may want to seek a better representation.
- time - is represented as a long value specifying the number of seconds from 1970-01-01 00:00 UTC. The values for years 1 AD to 291999999 AD (inclusive) are represented with per-second precision; the values outside of this range are represented as whole years, where a year is defined as 31557600 seconds (365.25 days).
- quantity - only the amount is stored. TBD: currently stored as a double precision floating point value (Double); we may want to find a better representation.
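The conversions listed above can be sketched as follows. This is a minimal illustration under several assumptions that are not confirmed by this page: dates are interpreted in the proleptic Gregorian calendar, out-of-range years are counted relative to 1970, and all function names are hypothetical. Globe coordinates are left out because Geoshape is Titan-specific.

```python
SECONDS_PER_DAY = 86400
SECONDS_PER_YEAR = 31557600  # 365.25 days
MIN_EXACT_YEAR, MAX_EXACT_YEAR = 1, 291999999


def days_from_epoch(year, month, day):
    """Days between 1970-01-01 and a proleptic Gregorian date."""
    if month <= 2:
        year -= 1  # treat January and February as the end of the previous year
    era = year // 400
    year_of_era = year - era * 400
    day_of_year = (153 * (month + (-3 if month > 2 else 9)) + 2) // 5 + day - 1
    day_of_era = (year_of_era * 365 + year_of_era // 4
                  - year_of_era // 100 + day_of_year)
    return era * 146097 + day_of_era - 719468


def time_value(year, month=1, day=1, hour=0, minute=0, second=0):
    """Seconds since 1970-01-01 00:00 UTC, or whole years outside the range."""
    if MIN_EXACT_YEAR <= year <= MAX_EXACT_YEAR:
        days = days_from_epoch(year, month, day)
        return days * SECONDS_PER_DAY + hour * 3600 + minute * 60 + second
    # Assumption: whole years are measured from 1970.
    return (year - 1970) * SECONDS_PER_YEAR


def monolingual_text_value(language, text):
    """The current "language:text" string form."""
    return language + ":" + text


def quantity_value(amount):
    """Only the amount is kept, as a double precision float."""
    return float(amount)
```

For instance, time_value(2000) gives 946684800, the usual Unix timestamp for 2000-01-01 00:00 UTC.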
Along with the data value, for non-primitive types the accompanying data is stored in a separate property, suffixed with _all, e.g. for property P1082 the accompanying data is stored in P1082_all. The data is stored as a map, as it appears in the input[1].
If the data is not representable as described above (i.e. invalid date, value cannot be parsed, etc.) it is represented as "somevalue" (see below).
Representing pseudo-values
The pseudo-values "novalue" and "somevalue" are represented as edges to the special vertices novalue and unknown respectively. The value property of the edge (see above) is unset. This way the property can still have a typed index without actually mixing data values with placeholder values.
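A small sketch of the pseudo-value handling, using a dictionary in place of the Titan edge. The function name pseudo_value_edge, the "out"/"in" keys and the claim ID are hypothetical; the point is that no Pxxxvalue key is set on the edge.

```python
# Pseudo-value -> name of the special target vertex.
SPECIAL_VERTICES = {"novalue": "novalue", "somevalue": "unknown"}


def pseudo_value_edge(source_id, claim_id, prop, pseudo_value):
    """Build the edge for a "novalue" or "somevalue" claim."""
    return {
        "label": prop,
        "out": source_id,
        "in": SPECIAL_VERTICES[pseudo_value],
        "wikibaseId": claim_id,
        "property": prop,
        # Deliberately no prop + "value" key: the typed index on the
        # value property never sees a placeholder.
    }


edge = pseudo_value_edge("Q42", "Q42$hypothetical-claim-id", "P40", "somevalue")
```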
Representing ranks
The rank of the claim is represented by the property rank of the claim edge. Claims with rank "deprecated" are currently ignored on import and are not represented in the indexing data set. [2]
TBD: Although deprecated statements will probably not be queried that often, we should try to import and index all data.
Representing qualifiers
Qualifiers are modeled the same way as Property values, attached to the claim edge, but the value is stored with the suffix q and the accompanying data with the suffix q_all. For qualifiers linking to Entities, the Wikibase ID is stored. [3]
For example, the qualifier "point in time" (P585) attached to the claim about the US population would produce the property P585q containing the date value and P585q_all with the accompanying date data.
If a claim has multiple instances of the same qualifier, clones of the claim are created such that each clone has one value of the qualifier. [4]
The exception is the pair of Properties P580 (start time) and P582 (end time), which are treated as a pair, i.e. for each pair of start/end times one clone is created, not two. [5]
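The cloning rule can be sketched as a cross product over qualifier values. Several details here are assumptions rather than documented behavior: start and end times are paired by position, expansion beyond the limit of 50 is simply truncated, the q_all properties are omitted, and the name expand_qualifiers is hypothetical.

```python
from itertools import islice, product, zip_longest

MAX_CLONES = 50            # hard limit on expansion of one claim
PAIRED = ("P580", "P582")  # start time / end time, expanded together


def expand_qualifiers(qualifiers):
    """Map {property: [values]} to one edge-property dict per claim clone."""
    axes = []
    starts, ends = (qualifiers.get(p, []) for p in PAIRED)
    if starts or ends:
        pairs = []
        for start, end in zip_longest(starts, ends):
            pair = {}
            if start is not None:
                pair["P580q"] = start
            if end is not None:
                pair["P582q"] = end
            pairs.append(pair)
        axes.append(pairs)  # one axis for the pair, not two
    for prop, values in qualifiers.items():
        if prop not in PAIRED and values:
            axes.append([{prop + "q": value} for value in values])
    clones = []
    for combination in islice(product(*axes), MAX_CLONES):
        edge_props = {}
        for part in combination:
            edge_props.update(part)
        clones.append(edge_props)
    return clones
```

A claim with two "point in time" values yields two clones, while two start times matched with two end times also yield two clones rather than four.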
Representing sitelinks and badges
Sitelinks are represented as properties of the vertex, named after the linked site with an L prefix, so the link to the enwiki site is the property Lenwiki. The value of the property is the title of the link.
Badges are represented by a property of the sitelink vertex property mentioned above, named badge. This property has a "set" type, i.e. it can have multiple values. The values are the Item IDs of the respective badges. E.g., an enwiki article with a "good article" badge would have the badge sub-property of the Lenwiki property set to Q17437798. For better indexing by ElasticSearch, there is also a badge_ property which has the values of the set as a space-separated string.
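To illustrate the sitelink layout, the sketch below models the sitelink property and its badge sub-properties as a nested dictionary. The nesting, the "value" key, the function name add_sitelink and the sorted order of badge_ are assumptions for illustration; Titan stores these as properties on a vertex property.

```python
def add_sitelink(vertex, site, title, badges=()):
    """Attach a sitelink such as Lenwiki, with its badges, to a vertex."""
    badge_ids = set(badges)
    vertex["L" + site] = {
        "value": title,                          # title of the linked page
        "badge": badge_ids,                      # set of badge Item IDs
        "badge_": " ".join(sorted(badge_ids)),   # string copy for ElasticSearch
    }
    return vertex


usa = add_sitelink({"wikibaseId": "Q30"}, "enwiki", "United States", ["Q17437798"])
```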
Representing references
References are represented as vertices, with the wikibaseId of the vertex being the hash value of the reference. The Properties on the Reference are represented the same way as the Statements for an Item, with claim edges as described above. The reference vertex is connected to the item vertex by an edge with the label reference[6]. The wikibaseId property of the reference edge is the same as the ID of the claim it is attached to.
The data model allows multiple claims from the same or different items to refer to the same reference vertex.
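The sharing of reference vertices can be sketched as follows, with a dictionary of vertices keyed by wikibaseId and a list of edges. The edge direction (item to reference), the function name attach_reference and the IDs used are assumptions; the text above does not state the direction.

```python
def attach_reference(graph, item_id, claim_id, reference_hash):
    """Connect an item to a (possibly shared) reference vertex."""
    # The reference vertex is created once and reused for the same hash.
    graph["vertices"].setdefault(reference_hash, {"wikibaseId": reference_hash})
    edge = {
        "label": "reference",
        "out": item_id,
        "in": reference_hash,
        "wikibaseId": claim_id,  # same ID as the claim being referenced
    }
    graph["edges"].append(edge)
    return edge


graph = {"vertices": {}, "edges": []}
attach_reference(graph, "Q30", "Q30$hypothetical-claim-1", "hypothetical-hash")
attach_reference(graph, "Q42", "Q42$hypothetical-claim-2", "hypothetical-hash")
```

Both edges end at the same reference vertex, so the graph holds one reference vertex and two reference edges.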
Indexes
Indexes allow quick lookup of the data without a full scan. Titan supports three types of indexes: global (on all vertices/edges), vertex-centric (on edges connected to each vertex, stored together with the vertex) and mixed (external, such as ElasticSearch). Mixed indexes allow multi-key lookup and also support more lookup types, such as range, geopoint, fulltext, etc. Global and vertex-centric indexes allow lookup only by the full key composition, and may not support some data types and key combinations.
Global and mixed indexes help lookups starting on the whole graph, while vertex-centric indexes help lookups starting at a specific vertex. Incoming/outgoing edges are automatically indexed, so there is no need to add any index there.
Global indexes
Name | Data type | Element | Fields | Notes |
---|---|---|---|---|
by_wikibaseId | String | Vertex | wikibaseId | Unique index, looks up vertex by Q/P identifier |
by_wikibaseIdE | String | Edge | wikibaseId | Indexes edges by claim ID |
by_specialValueNode | String | Vertex | specialValueNode | Used to look up special nodes: unknown, novalue, property |
by_type | String | Vertex | type | Indexes by Entity type - item/property/reference vertex |
by_etype | String | Edge | edgeType | Indexes by edge type - claim/reference/etc. |
by_prop | String | Edge | property | Indexes by property expressed by the edge |
by_Ehash | String | Edge | contentHash | Indexes edge by claim content hash |
by_badge | String | Vertex | badge | Indexes badges by badge Entity ID |
by_Pxxx | varies | Edge | P123value, P123q | Indexes edges by claim values and qualifiers, e.g. P123value or P345q. The name and the type depend on the specific property indexed. |
by_Lxxx | String | Vertex | Lxxx | Indexes sitelink properties - e.g. Lenwiki |
by_Pxxxlink | Set[String] | Vertex | P123link | Indexes outgoing link IDs of the vertex, for speeding up in->out lookups on supernodes |
Vertex-centric indexes
Vertex-centric indexes always apply to edges with a certain label.
Name | Data type | Label | Fields | Notes |
---|---|---|---|---|
by_claimsHash | String | claim | contentHash | Indexes claims with the same content hash |
by_refsID | String | reference | wikibaseId | Indexes reference edges by id |
by_refsHash | String | reference | contentHash | Indexes reference edges by content hash, for updates |
by_rankPxxx | Boolean | Pxxx | rank | Indexes P123 edges by rank |
by_hashPxxx | String | Pxxx | contentHash | Indexes P123 edges by content hash, for updates |
by_typePxxx | String | Pxxx | edgeType | Indexes P123 edges by edge type |
by_Pxxx | varies | Pxxx | P123value, P123q | Indexes P123 edges by value/qualifier. |
Mixed indexes
Mixed indexes are backed by an ElasticSearch index and allow querying by any of the index elements. Thus, one index may contain many fields.
Name | Element | Fields | Notes |
---|---|---|---|
by_links | Vertex | wikibaseId, Pxxxlink_ | Indexes outgoing links for nodes, as string - i.e. "P123 P345" |
by_values | Edge | property, Pxxxvalue, Pxxxq | Indexes values & qualifiers on edges |
by_sitelinks | Vertex | Lxxx | Indexes sitelink contents, like Lenwiki |
by_badgesE | Vertex | badge | Indexes badges expressed as string - i.e. "Q123 Q456" |
Illustration
Here is an example (partial) representation of the vertex "USA" and its properties:
Implementation
The current code implementing the model can be found at https://github.com/smalyshev/wikidata-gremlin/tree/titan_flat
Footnotes
- ↑ Note that this means additional data is currently accessible in filters/transforms/queries but not indexable.
- ↑ It is still to be confirmed that we want to ignore deprecated data.
- ↑ Theoretically, it is possible to store the vertex instead, since Titan allows vertices or unidirectional links to be represented as values (effectively, it supports pointers to vertices as a value type), but for many queries having the ID may actually be better than having the vertex and then loading it to find out the ID. This may easily be changed in the future, or another value may be added to store the vertex if needed.
- ↑ In theory, this may be dangerous since it can explode one claim into a lot of claims. However, in practice, after analyzing most of the data, no claims that generate more than 15 clones were detected. There is a hard limit of 50 on the expansion of each claim.
- ↑ There might be more property pairs that need to be treated this way; we would need to figure out how to find them.
- ↑ See https://github.com/thinkaurelius/titan/issues/569 and https://groups.google.com/forum/#!topic/aureliusgraphs/bw_OAnGd9Uw for discussion of using unidirected edges on edges instead of parallel edges.