Talk:Wikibase/Indexing/Data Model

Best values heuristics


Criticism (by Daniel)

I see several issues with the heuristics for "best" values described here.

It doesn't match the spec. The Wikibase data model defines the semantics of ranks such that, for a query, only the "preferred" claims for a given property should be considered, if there are any. If there are no "preferred" claims, only the "normal" claims shall be considered. The claims thus defined to be relevant to queries are referred to as the "best" claims for that property.
It leads to surprises. The graph database is intended to be used for queries, not searches. Queries have a well-defined result set, which should be clearly predictable to the author of the query. Predictability is important: a search index may use heuristics to follow the actual content, but a query index should have clearly defined behavior, and allow content to be modeled accordingly.
Applying such heuristics takes away one of the main incentives to actually rank statements manually (or by bot). Explicit ranking is extremely valuable, and useful for using values in infoboxes etc. One reason we don't see many "preferred" ranks on Wikidata is that they don't have much effect yet. Once people see how ranking affects query results, this will hopefully be used a lot more. The heuristics suggested here would obscure this effect.
The heuristics may have adverse "political" consequences. When designing the Wikibase model, we took great care to allow for competing views and contradictions. Having e.g. census data ignored because it's a year older than information from another entity may lead to confusion and even animosity (yes, people get into fights about the population of China, or Israel, or India, because it very much depends on which regions you include as territory - this is highly political stuff).
One of the wiki principles is: avoid magic, let the community edit content. This means here: leave it to the community if, when, and where they want to apply heuristics like "the newest value is the best". They can write a bot that changes the rank accordingly, with a record in the history, discussions on the wiki, etc.
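
For illustration, the rank semantics described in the first point can be sketched in a few lines of Python. This is only a sketch: the dictionary layout (`rank`, `value`) and the function name `best_claims` are assumptions made for the example, not part of Wikibase or of the proposed index.

```python
def best_claims(claims):
    """Return the "best" claims for one property.

    Assumed input: a list of dicts with a "rank" key holding
    "preferred", "normal" or "deprecated" (hypothetical layout).
    """
    claims = list(claims)
    # If any claim is ranked "preferred", only those are considered.
    preferred = [c for c in claims if c["rank"] == "preferred"]
    if preferred:
        return preferred
    # Otherwise fall back to the "normal" claims; deprecated ones never qualify.
    return [c for c in claims if c["rank"] == "normal"]
```

Note that no qualifier is consulted here: the result depends on ranks alone, which is what makes it predictable for the author of a query.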

Stas's response

I think we still comply with the semantics, since if "preferred" is present we will just consider that value (or values), and only that. If it is not present, I think we should not ignore the fact that right now we have no way to know the US population, at least by query, and no good way to know the 10 most populous countries without scanning through every population figure of every country that exists in the database.
I'm not sure how having a "best" value makes it unpredictable. It's just a form of materialized view, or an index if you will, just a bit smarter than one the DB can provide natively, since the DB does not know our data but we do. Where does the predictability issue come from? You still have exactly the same data and get exactly the same result as if you wrote the "10 most populous" query yourself by manually sorting population data by qualifiers for each country. No difference in data will ever happen (and, of course, you can still do the manual query by completely ignoring the best values and going to the raw ones). I just propose to write part of this query for you and materialize the result, knowing the user will have to do it anyway.
Here I see your point, but I don't think having the engine help you would prevent people from improving the data. I think, on the contrary, that having an engine that is actually useful and easy to use would make more people use it and as such be driven to improve the data feeding it.
This is easily fixed by setting one of the values as preferred. The additional heuristic only kicks in if there is no human decision, so if any humans disagree, they can always override it. They can even have multiple preferred values, if desired. In any case, that'd be better than having the US population simultaneously being 50 mln, 150 mln and 300 mln - I don't see any context in which that would be of any practical use.
Well, I'm new here, but I'm not sure why having an index that helps optimize for the common case would be contrary to wiki values. Of course, if we expect the preferred issue to be fixed by the community before the system goes to any production use, then the whole issue is irrelevant and we don't need any heuristics - we can just consider the preferred values. But if we expect it to be useful on data that is not cleaned up yet, I think it can still be useful.

In any case, the "best value" part is not integral to the rest of the model, so I'll work on the rest and we'll see what we do with it and if we need it at all after we have the rest of it. --Smalyshev (WMF) (talk) 20:46, 5 December 2014 (UTC)

I think this is mostly a matter of perspective and priority: to you (I suppose) the most important thing is to have something that returns useful results asap, for use by Grok and others. For me it's more important to be consistent with our data model, and integrate community processes, even if it takes a couple of months longer that way. I think this needs discussion on the product level, it's not just an engineering decision. -- Daniel Kinzler (WMDE) (talk) 18:27, 8 December 2014 (UTC)

As far as I understood the queries that Wikigrok would need, they would not involve properties where such a heuristically best value would be possible (like place of birth, nationality). Additionally, most of the Wikigrok examples involve things that have no claim (of any rank) for a Property, but another Property of a specific value (rank preferred or normal). (Example: no alma mater and instance of human; for each of those humans make a pass over the linked Wikipedia articles to get the Wiki links; for each of those links check if they refer to a Wikidata.org item that is instance of University. See Extension:MobileFrontend/WikiGrok/Claim_suggestions.) --Jan Zerebecki 19:29, 8 December 2014 (UTC)

Btw, when I referred to the "wiki principle", I wasn't referring to Wikipedia values, but rather to the more general principle of wikis: everything is editable, nothing is automatic. Of course we could make Wikipedia's "featured article" on the frontpage update automatically by writing a MediaWiki extension to do it. Or implement a workflow for article deletion discussions in software. But we never did, for good reasons. -- Daniel Kinzler (WMDE) (talk) 18:27, 8 December 2014 (UTC)

And yea, the whole idea of Wikidata is kind of against the "nothing is automatic" thing. But we do try to avoid magic under the hood. -- Daniel Kinzler (WMDE) (talk) 18:29, 8 December 2014 (UTC)

So, for now I am implementing runtime preferred(), latest() and current() clauses that would allow the heuristic parts to be applied to the query at runtime. If this proves to be a big performance hurdle, we'll revisit optimizing those with e.g. additional edges on import. --Smalyshev (WMF) (talk) 21:26, 11 December 2014 (UTC)
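
A rough Python sketch of what such runtime clauses might do, applied to an in-memory list of claims. Only the clause names preferred(), latest() and current() come from the comment above; their exact semantics, the qualifier keys (`point_in_time`, `start`, `end`) and the dictionary layout are assumptions made for this example, and the actual clauses would be written for the graph query engine rather than in Python.

```python
from datetime import date


def preferred(claims):
    """Keep only claims explicitly ranked "preferred"."""
    return [c for c in claims if c["rank"] == "preferred"]


def latest(claims, qualifier="point_in_time"):
    """Keep the claim(s) whose date qualifier is the most recent (assumed meaning)."""
    dated = [c for c in claims if qualifier in c.get("qualifiers", {})]
    if not dated:
        return []
    newest = max(c["qualifiers"][qualifier] for c in dated)
    return [c for c in dated if c["qualifiers"][qualifier] == newest]


def current(claims, today):
    """Keep claims whose start/end qualifiers include `today` (assumed meaning)."""
    result = []
    for c in claims:
        q = c.get("qualifiers", {})
        start, end = q.get("start"), q.get("end")
        # A missing start or end is treated as an open interval.
        if (start is None or start <= today) and (end is None or end >= today):
            result.append(c)
    return result
```

Because each function is an explicit step chosen by the query author, the heuristic stays visible in the query instead of being baked into the index.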

I agree with everything Daniel said. The indexing should not make any assumptions on best values beyond ranks. Qualifiers should not play into this. This is fundamental to how we want Wikidata to work. --Lydia Pintscher (WMDE) (talk) 10:04, 15 December 2014 (UTC)

Another reason that just came to my mind: Deciding on best statements isn't just going to be used for this query tool. Other applications will also need them and they shouldn't need to have to reimplement the heuristics suggested here because ranks aren't set because of lack of incentive. --Lydia Pintscher (WMDE) (talk) 23:03, 17 December 2014 (UTC)

Markus's Reply

This is an important point and I am glad it was brought to my attention. I am deeply concerned about the proposal as it stands now. Daniel has made some important points, and I agree with Lydia and him that in particular the use of specific qualifiers as part of the standard search interface is problematic. Qualifiers are content and as such governed by the community. Hard-coding a specific use of qualifier properties into the query implementation moves control from the community to the Foundation. I understand that the proposal was made in good faith, without the intention to restrict the power of the community over the content of Wikidata. Yet, this is what this proposal in fact leads to: while the community remains free to use other qualifiers or to use qualifiers in different ways and meanings, the query feature would ignore this by preferring statements based on the judgement of developers, changeable only by their active development effort.

It is well known that on the Web, it is not enough to be able to say whatever you like -- what counts is what is found by others, and how often it is found. Tweaking the query to take content-specific criteria into account implements a form of software-mediated content restriction, and opens a gateway to further such restrictions in the future. Even if we would have a community-based decision on the exact form of the heuristics now, implementing this decision will bereave the community of the power to change the heuristics later on without the cooperation of WMF developers. I think this is highly problematic, and constitutes a violation of the contract between WMF and community, whereby the former should minimize its interference with content-level decisions as far as possible.

Fortunately, there is an easy solution to this problem that enables all the intended benefits without transferring content-governing powers to the Foundation. If the system uses ranks in a well-documented and stable way, then the community can use the ranks to express their preference for query results. The qualifier-based heuristics that have been suggested (and which I agree might be very useful) can easily be implemented using bots, which set the rank information accordingly. This was the one and only reason why ranks have been introduced at all: to provide the community with some simple flags they could set to influence query results. Ranks are the interface between technical backend (stable, well-documented use of ranks for filtering results) and community-controlled content (complex, domain-specific heuristics with any amount of special cases and exceptions). By keeping it like this, we can have the best of both worlds and show that we respect the unabridged autonomy of the community regarding the content of Wikidata. I am confident that, as soon as queries are halfway functional, the community will make sure that "preferred" rankings are set widely and appropriately, using heuristics that are much more elaborate than the ones proposed here.

Best wishes,

Markus Krötzsch (talk) 13:04, 16 December 2014 (UTC)

P.S. I am currently on parental leave and thus more unresponsive than usual. I don't think I can have an extensive discussion here, but I hope this argument can still be heard. From my point of view, it is a key point of the design discussions that have led to ranks in their current form being introduced to Wikidata at its inception over two years ago.

I see your point, Markus. I will avoid creating any coded-in preferences based on qualifiers, and instead will create query predicates that would allow heuristics to be attached to the queries. This may make the queries slower, but there are ways to address it. --Smalyshev (WMF) (talk) 01:47, 20 December 2014 (UTC)
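
The bot-based alternative discussed in this thread (a bot applies a heuristic such as "the newest value is the best" and records it as a rank) could be sketched as follows. This is a hedged illustration only: the claim layout, the `id` field, the `point_in_time` qualifier key and the function name are hypothetical, and the sketch only computes proposed edits without calling any real Wikidata or bot-framework API.

```python
def propose_rank_changes(claims, qualifier="point_in_time"):
    """Return a list of (claim_id, new_rank) edits for one property of one item.

    Heuristic (an assumption for this example): if nobody has set a
    "preferred" rank yet, propose preferring the normal-rank claim with
    the most recent date qualifier.
    """
    # Never override an explicit decision made by a human or another bot.
    if any(c["rank"] == "preferred" for c in claims):
        return []
    candidates = [
        c for c in claims
        if c["rank"] == "normal" and qualifier in c.get("qualifiers", {})
    ]
    if not candidates:
        return []
    newest = max(candidates, key=lambda c: c["qualifiers"][qualifier])
    return [(newest["id"], "preferred")]
```

Each proposed edit would then show up in the item history like any other change, so the community can review, revert or refine the heuristic, while the query backend keeps looking at ranks only.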

Multiple Qualifiers


Note that this assumes each qualifier will be present only once. Wikibase allows multiple qualifiers with the same property. We need a different solution, but since it is preferable to have these data indexable, we should not be using complex structures here.
- Could you provide an example with the same qualifier used more than once on the same claim? I'd like to see the semantics of it to figure out a different solution. There are a number of options that could still leave it indexable, but I'm not sure how that works, so I'd need some examples. --Smalyshev (WMF) (talk) 21:13, 5 December 2014 (UTC)
- Found such a case: https://www.wikidata.org/wiki/Q801 - "head of state" has multiple start/ends for some. Not sure yet how to handle it, as starts and ends should match, and that means they need to be kept in an ordered structure. Titan has multivalues, but it looks like only on vertices, not on edges. Will check further. --Smalyshev (WMF) (talk) 07:45, 6 December 2014 (UTC)
  - That would be better handled as separate statements, I think. But of course, we still need to decide how to handle such a thing when we encounter it. I think picking the first one, and listing statements with multi-value qualifiers, would be sensible. Except if we find a legit use case for this. I can't think of any, but the software doesn't make assumptions about it. -- Daniel Kinzler (WMDE) (talk) 18:18, 8 December 2014 (UTC)
    - The model handles multiple qualifiers as if they were separate claims now. --Smalyshev (WMF) (talk) 01:49, 20 December 2014 (UTC)
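
One way to picture "multiple qualifiers as separate claims" is the Python sketch below. It assumes that the qualifier values arrive in a stable order and that the i-th start belongs to the i-th end, which is exactly the matching problem raised above; the data layout and the function name `split_claim` are hypothetical and do not describe the actual import code.

```python
from itertools import zip_longest


def split_claim(claim):
    """Expand one claim with repeated start/end qualifiers into one claim per period.

    Assumption: starts and ends are ordered lists and pair up by position;
    a missing end (None) means the period is still open.
    """
    starts = claim.get("qualifiers", {}).get("start", [])
    ends = claim.get("qualifiers", {}).get("end", [])
    # A claim without these qualifiers still yields exactly one output claim.
    periods = list(zip_longest(starts, ends)) or [(None, None)]
    return [
        {"property": claim["property"], "value": claim["value"],
         "start": start, "end": end}
        for start, end in periods
    ]
```

After this step every resulting claim carries at most one value per qualifier, so the qualifiers can be stored as simple, indexable properties.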

Qualifiers as properties or edges


Qualifiers can reference other items. This should be modeled as an edge, but it's not possible to attach an edge to an edge. To allow this, qualifiers would need to be nodes in their own right.
- The query engine makes it very easy to go from a string to the vertex named after that string with the transform() clause, as far as I can see, so not having edges on edges won't be a problem for querying. Modeling a qualifier as a vertex might be a problem, though, since a qualifier is attached to a claim, which is now an edge. Given the multiple qualifiers issue above, we may have to convert claims to vertices too. If claims are vertices, qualifiers can be edges or vertices too. Not sure what is best, since this will produce a lot more edges and may impact performance. --Smalyshev (WMF) (talk) 21:13, 5 December 2014 (UTC)
- Actually, Titan can attach a special kind of edge to an edge - http://s3.thinkaurelius.com/docs/titan/0.5.2/advanced-schema.html so maybe qualifiers can be made to work this way. This is a Titan-specific feature, so we need to check how it influences querying in Gremlin. --Smalyshev (WMF) (talk) 21:28, 7 December 2014 (UTC)
- I think this is an issue in principle, but not much of one in practice. Can't think of a use case that wouldn't be covered by a string match off-hand. -- Daniel Kinzler (WMDE) (talk) 18:20, 8 December 2014 (UTC)
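
To make the "claims as vertices" option concrete, here is a toy in-memory graph in Python. It is only a sketch of the modeling idea: the `Graph` class, the `value` edge label and the placeholder IDs in the usage note are invented for the example and say nothing about how Titan or Gremlin would store or traverse the data.

```python
from collections import defaultdict
from itertools import count


class Graph:
    """Minimal labeled directed graph: source vertex -> [(label, target vertex)]."""

    def __init__(self):
        self.edges = defaultdict(list)
        self._claim_ids = count(1)

    def add_edge(self, source, label, target):
        self.edges[source].append((label, target))

    def add_claim(self, subject, prop, value, qualifiers=None):
        """Store a claim as its own vertex so qualifiers can be ordinary edges."""
        claim = f"claim{next(self._claim_ids)}"
        self.add_edge(subject, prop, claim)      # item -> claim vertex
        self.add_edge(claim, "value", value)     # claim vertex -> main value
        for q_prop, q_value in (qualifiers or {}).items():
            self.add_edge(claim, q_prop, q_value)  # claim vertex -> qualifier value
        return claim

    def out(self, source, label):
        """Follow all outgoing edges with the given label."""
        return [t for (l, t) in self.edges.get(source, []) if l == label]
```

With this layout, `g.add_claim("Q_person", "P_position", "Q_office", {"P_replaces": "Q_predecessor"})` (all placeholder IDs) lets a traversal go item, then claim, then qualifier target, at the cost of two edges plus one per qualifier for every claim, which is the performance concern mentioned above.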

Importing deprecated data


Although deprecated statements will probably not be queried that often, we should try to import and index all data.
- We can import deprecated data, but if we want to avoid putting an "and exclude deprecated data" condition on every clause of every query, we should probably store it somewhere separate - maybe with edges marked 'P31_deprecated' or something like that, so it won't be part of regular queries. Would that work? We can have DSL clauses that say "include deprecated" in the query language, but I think we don't want to make users explicitly exclude deprecated claims in default queries. Which is why I think that if we want to keep deprecated data, we should separate it from the regular data. --Smalyshev (WMF) (talk) 21:19, 5 December 2014 (UTC)
  - I agree these should not be seen by a query that does not explicitly select deprecated terms. So, extending this to the rest of the ranks, we then have P31_deprecated, P31_normal, P31_preferred, P31 (same as preferred with fallback to normal; the Wikibase source code uses the term best for this, and this is what is displayed by default in templates), P31_heuristically (what you described above as best value, if someone needs this). Does this fit with what you had in mind? --Jan Zerebecki 17:54, 8 December 2014 (UTC)
- Yeah we do want deprecated statements to be indexed but not used in most cases. --Lydia Pintscher (WMDE) (talk) 10:06, 15 December 2014 (UTC)
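
The labeling scheme suggested in this thread can be sketched in Python as below. The label pattern (`P31_deprecated`, `P31_normal`, `P31_preferred`, plain `P31` for best) follows the comment above; the function name, the claim layout and the decision to leave out `P31_heuristically` are choices made for this example only.

```python
def edge_labels(prop, claims):
    """Return (edge_label, value) pairs for all claims of one property on one item."""
    # "Best" means preferred if any preferred claim exists, otherwise normal.
    has_preferred = any(c["rank"] == "preferred" for c in claims)
    best_rank = "preferred" if has_preferred else "normal"
    edges = []
    for c in claims:
        # Every claim is indexed under a rank-specific label ...
        edges.append((f"{prop}_{c['rank']}", c["value"]))
        # ... and best claims additionally under the plain property label.
        if c["rank"] == best_rank:
            edges.append((prop, c["value"]))
    return edges
```

A default query that follows only the plain `P31` label then never touches deprecated data, while a query that wants it can ask for `P31_deprecated` explicitly.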

Representing inexact values


time and quantity are not exact values - both have a "main" value and an uncertainty interval. Without that interval, quantities would have to match to 127 decimal places, and times would have to match to the second. If we do not represent the uncertainty intervals, queries become impractical. For globe-coordinate this would be handled by a circular Geoshape with the diameter derived from the globe-coordinate's precision.
- I think we will import all the value parts as additional value parts, and this way they can be queried against or otherwise used too. I'll update the spec accordingly soon. --Smalyshev (WMF) (talk) 18:38, 8 December 2014 (UTC)
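
As a sketch of why the interval matters for queries, the Python snippet below models a quantity as a main value plus lower and upper bounds and matches against the interval instead of the exact amount. The class and method names are assumptions for this example; only the idea of a main value with an uncertainty interval comes from the comment above.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Quantity:
    amount: Decimal  # the "main" value
    lower: Decimal   # lower bound of the uncertainty interval
    upper: Decimal   # upper bound of the uncertainty interval

    def contains(self, x: Decimal) -> bool:
        """True if x lies inside the uncertainty interval."""
        return self.lower <= x <= self.upper

    def overlaps(self, other: "Quantity") -> bool:
        """True if the two uncertainty intervals share at least one point."""
        return self.lower <= other.upper and other.lower <= self.upper
```

For instance, `Quantity(Decimal("8850"), Decimal("8840"), Decimal("8860")).contains(Decimal("8848"))` holds, whereas an exact comparison of 8850 with 8848 would fail; storing the bounds as separate indexed value parts makes this kind of range query possible.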
