Extension talk:Wikibase QueryEngine/SQL schema proposal
Latest comment: 10 years ago by Duesentrieb in topic Some comments
Some comments edit
- wbq_entities
- Seems OK, description is a bit terse
- Are 'id' values thinks like Q<hex>? Can NUMERIC be used instead?
- Nope - they are not always numeric --Jeroen De Dauw (talk) 13:21, 11 November 2013 (UTC)
- Can constants be used for 'type' so it can be UNSIGNED INT?
- wbq_mainsnak_<type>
- Same stuff with 'id' field as above
- PRIMARY KEY is kind of long, especially for the expected table size, unless the index is well sorted for full-row lookups of the vast majority of queries, it might make sense to just make this UNIQUE and pick another index (or surrogate) to be the PRIMARY. The hash column could save some space as base36.
- wbq_qualifier_<type>
- (similar to above)
- wbq_<role>_string
- (similar to above)
- Also, how big can 'value' be? I care more if it will be indexed.
- wbq_<role>_entityid
- What is the PRIMARY KEY?
- wbq_<role>_time
- I guess that would need a surrogate key
- wbq_<role>_quantity
- I guess that would need a surrogate key
- I wonder if RTREE indexes might be useful, depending on the final column layout, this kind of table is tricky SQL
- wbq_<role>_geo
- I guess that would need a surrogate key
- Would be nice to see the proposed index statements (would some use RTREE?), this is another tricky one
- Why is 'alt' deprecated?
- wbq_valueless_snaks
- Blobs in the Primary Key are a little worrying, actually why are they not varbinary? Even still, it would make the PRIMARY kind of long
Aaron (talk) 00:07, 8 November 2013 (UTC)
Thanks for the comments Aaron! I'll reply section by section:
- wbq_entities
- Are 'id' values thinks like Q<hex>? Can NUMERIC be used instead? We will have non-numeric IDs for future types of items, like media meta-data, sub-entities for word senses, etc.
- Can constants be used for 'type' so it can be UNSIGNED INT? In theory yes, but that means a schema change on the live site when we introduce a new entity type. And since different entity types can be defined by different extensions, there's no single place to maintain the full list of possible entity types. It's configurable and depends on which extensions are present.
- Not following why adding a constant to the app layer would require a schema change? SPringle (WMF) (talk) 00:29, 12 November 2013 (UTC)
- Sorry, I was thinking of a MySQL ENUM. For just using constants: this would be inconsistent with the use of entity types elsewhere in wikibase, both in the database and the code. When we considered the same thing for the ContentHandler (i.e. page_content_model), a decision was made (mostly by Tim) to use human readable strings rather than more brittle/obscure numeric identifiers. The same reasoning holds here, I think, unless there are compelling reasons to use an integer here. -- Duesentrieb 16:55, 13 November 2013 (UTC)
- Not being able to predict entity types seems like a code smell or normalization issue. A field performing two roles should probably be two fields. SPringle (WMF) (talk) 00:29, 12 November 2013 (UTC)
- I don't see two roles, and I don't see what we could predict them by. An entity has an ID, and it has a type. Either might be used for selecting or filtering queries. -- Duesentrieb 16:55, 13 November 2013 (UTC)
- Not following why adding a constant to the app layer would require a schema change? SPringle (WMF) (talk) 00:29, 12 November 2013 (UTC)
- wbq_mainsnak_<type>
- Same stuff with 'id' field as above
- PRIMARY KEY is kind of long, especially for the expected table size, unless the index is well sorted for full-row lookups of the vast majority of queries, it might make sense to just make this UNIQUE and pick another index (or surrogate) to be the PRIMARY. The hash column could save some space as base36.
- the vast majority of queries would select by property_id and value_hash (aka value_identity), the subject_id is added to the key to make it unique and allow paging.
- we could use an auto-increment row ID as the primary, and make the above a "simple" unoique key. But how woudl that be better? Doesn't a unique key use just as much room as a primary key?
- InnoDB uses a clustered index. Each secondary index entry contains a copy of the relevant PK. A large primary key therefore increases the size of all indexes. A surrogate PK plus a UNIQUE key, will be smaller. SPringle (WMF) (talk) 00:29, 12 November 2013 (UTC)
- A, good to know! So, having an auto-increment integer as a surrogate PK would be preferred, I assume? -- Duesentrieb 16:55, 13 November 2013 (UTC)
- InnoDB uses a clustered index. Each secondary index entry contains a copy of the relevant PK. A large primary key therefore increases the size of all indexes. A surrogate PK plus a UNIQUE key, will be smaller. SPringle (WMF) (talk) 00:29, 12 November 2013 (UTC)
- wbq_<role>_string
- Also, how big can 'value' be? I care more if it will be indexed. I tend to say that we can limit it to a 255 byte prefix, and use the value_hash (aka value_identity) if we want to check fro equality of longer strings. The prefix would then be used only for sorting and prefix matches.
- Indexing up to 255 bytes would be worrysome in the long run. Probably could get away with less. SPringle (WMF) (talk) 00:29, 12 November 2013 (UTC)
- Considering we are doing this for all page titles in large tables like page_links, I don't think this is a problem. Also, from current experience, the vast majority will be <40 bytes. This field will mostly contain things like VIAF IDs, people's maiden names, or URLs. -- Duesentrieb 16:55, 13 November 2013 (UTC)
- Indexing up to 255 bytes would be worrysome in the long run. Probably could get away with less. SPringle (WMF) (talk) 00:29, 12 November 2013 (UTC)
- Also, how big can 'value' be? I care more if it will be indexed. I tend to say that we can limit it to a 255 byte prefix, and use the value_hash (aka value_identity) if we want to check fro equality of longer strings. The prefix would then be used only for sorting and prefix matches.
- wbq_<role>_entityid
- What is the PRIMARY KEY? See wbq_mainsnak_<type>. wbq_mainsnak_entityid would have the fields and indexes described for wbq_mainsnak_<type> as well as the ones described for wbq_<role>_entityid. The value_hash would be based on the entity id (or we would just use a value_identity column to contain the entity id - then we wouldn't need any additional column at all to represent entity ids).
- wbq_<role>_time
- I guess that would need a surrogate key. No, since what I specified for wbq_mainsnak_<type> applies.
- wbq_<role>_quantity
- I wonder if RTREE indexes might be useful, depending on the final column layout, this kind of table is tricky SQL. We should investigate this (as well as the possibility to just not have this in MySQL, and use Elastic for this kind of search). I don't know anything about RTREE.
- InnoDB doesn't support RTREE. It would have to be ARIA, the crash-safe MyISAM replacement. That still has limitations for concurrent writes though. We should definitely investigate potential future queries. SPringle (WMF) (talk) 00:29, 12 November 2013 (UTC)
- Does RTREE help with range queries? I thought RTREE would be for multi-dimensional exact match queries? -- Duesentrieb 16:55, 13 November 2013 (UTC)
- InnoDB doesn't support RTREE. It would have to be ARIA, the crash-safe MyISAM replacement. That still has limitations for concurrent writes though. We should definitely investigate potential future queries. SPringle (WMF) (talk) 00:29, 12 November 2013 (UTC)
- I wonder if RTREE indexes might be useful, depending on the final column layout, this kind of table is tricky SQL. We should investigate this (as well as the possibility to just not have this in MySQL, and use Elastic for this kind of search). I don't know anything about RTREE.
- wbq_<role>_geo
- Why is 'alt' deprecated? Because from the perspective of the data model, it's more useful to treat the altitude separately (different source, different method of measurement, different margin of error, different unit, etc).
- wbq_valueless_snaks
- Blobs in the Primary Key are a little worrying, actually why are they not varbinary? This is an oversight, they should indeed by varbinary.
I'll try to make the draft a bit more concise when I have a bit more feedback. I currently tend to replace value_hash with value_identity (which may contain a hash or, in the trivial case e.g. for entityid, the full value). -- Duesentrieb ⇌ 10:57, 9 November 2013 (UTC)