User:Henning (WMDE)/Wikibase/Concepts/Uncertainty

Essence: The WIKIBASE Data Model should be altered to allow properly modeling uncertainty.

The missing feature

The nature of technology is to impose classification, even if only through the use of generic data types. Everything that is to be captured by digital technology has to be classified in a particular way; there is no way around it. With that in mind, trying to classify reality is a somewhat naive undertaking. Although this statement may be disputed fiercely, classifying reality (as is the nature of digitizing) was the motor powering the idea of Wikidata. Consequently, the WIKIBASE Data Model was designed to be as liberal and unobtrusive as possible, so that the real world can be modeled as meaningfully and extensively as possible with, at the most basic level, key-value pairs.

Constraints do exist, though only on a rather low level. One of the most basic is that a Value Snak's value has to conform to the Data Type of the Property used in the Snak. Another is that a Value Snak may capture only a single value.

The consequence of the latter is that, with Statements essentially being Snaks backed by References, a constraint that may not be obvious is implicitly imposed on a Snak's source. Put provocatively: a primary source may specify only one dedicated value for a particular Property.

Reality is uncertain

Uncertainty is part of everyday life, and everyday life is captured and aggregated by media that may act as sources for Statements. Statistical estimates, measurements, and diverging eyewitness accounts are sources of uncertainty, just like unclear medieval handwriting or torn documents. A weather forecast may predict rain or snow; a mathematical formula with a variable may have multiple alternative outcomes. Whether all of that needs to be captured is not a question the WIKIBASE software should answer. Following the intention of offering a maximally flexible data model, WIKIBASE should not prevent any uncertainty from being captured, unless the principle is to not allow capturing uncertainty at all.

Examples

Uncertainty comes in several forms, most prominently illustrated by the following examples. These examples are used throughout the remainder of this text.

  • Range uncertainty: The distance between Berlin and Beijing is 7,353 km. The measurement's resolution is 1 km. Hence, the real distance is actually somewhere between 7,352.5 km and 7,353.5 km.
  • Alternative values: Abraham von Freising died either on 7 June 993 or 7 June 994. The alternatives may result from a torn historical document.
  • Likelihood: According to some sources, Robin of Locksley was Robin Hood, while other sources dispute that statement. Who, in fact, was Robin Hood, or whether he is just a legend, remains unclear.
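
To make the three forms concrete, the following is a minimal Python sketch of how each could be represented as a value type. All class and field names (`RangeValue`, `Alternatives`, `DisputedClaim`) are hypothetical illustrations and are not part of the WIKIBASE Data Model; dates are written as plain ISO-style strings to sidestep calendar questions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RangeValue:
    """Range uncertainty: a measured amount with a symmetric tolerance."""
    amount: float
    tolerance: float
    unit: str

    def bounds(self) -> Tuple[float, float]:
        # The real value lies somewhere between these two limits.
        return (self.amount - self.tolerance, self.amount + self.tolerance)


@dataclass(frozen=True)
class Alternatives:
    """Alternative values: one of the options applies, but it is unknown which."""
    options: Tuple[str, ...]


@dataclass(frozen=True)
class DisputedClaim:
    """Likelihood: a claim made by some sources and contested by others."""
    claim: str
    disputed: bool


# The three examples from the list above (hypothetical encoding).
distance = RangeValue(amount=7353.0, tolerance=0.5, unit="km")
death_date = Alternatives(options=("0993-06-07", "0994-06-07"))
identity = DisputedClaim(claim="Robin of Locksley was Robin Hood", disputed=True)

print(distance.bounds())  # (7352.5, 7353.5)
```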

Should uncertainty be captured at all?

The question of how to handle uncertainty may be answered radically: capture no uncertainty at all, on the argument that it cannot be modeled properly and maybe should not be modeled at all. Imposing such a standard, however, requires extensive measures and manual checking. Preventing uncertainty from slipping into the database is an unmanageable task. If uncertainty is simply not supported, uncertain values will still find their way into the database and, in the worst case, be treated like certain values:

  • The distance between Berlin and Beijing is exactly 7,353 km; the uncertainty deriving from the measurement's resolution is simply omitted.
  • 7 June 993 and 7 June 994 may be specified separately as Abraham von Freising's date of death. As a result, he shows up as having died in 993 as well as in 994 when querying for persons who died in those years. Alternatively, since in reality there is no such thing as two dates of death (which, put like that, is actually a daring statement), one date may simply be omitted, turning the remaining date into a certainty.
  • The assumption that Robin of Locksley might have been Robin Hood may be turned into a certainty by stating that Robin of Locksley was Robin Hood and backing it with the source the statement originates from, since a Statement merely reflects the source's statement. The Statement could be assigned the deprecated Rank. However, as the mystery of Robin Hood remains unsolved, there is no proof that the Statement is, in fact, deprecated.
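
The second case can be illustrated with a small in-memory sketch. It assumes a simplified store of (subject, property, value) tuples, which is not how WIKIBASE stores data, and a hypothetical helper `died_in_year`. Because both dates are stored as ordinary, certain-looking statements, the same person matches a query for either year, and nothing in the result hints at the uncertainty.

```python
from typing import List, Tuple

# Hypothetical simplified store: every tuple is treated as a certain fact.
statements: List[Tuple[str, str, str]] = [
    ("Abraham von Freising", "date of death", "0993-06-07"),
    ("Abraham von Freising", "date of death", "0994-06-07"),
]


def died_in_year(year: int) -> List[str]:
    """Return all subjects that have a date of death in the given year."""
    prefix = f"{year:04d}-"
    return sorted({
        subject
        for subject, prop, value in statements
        if prop == "date of death" and value.startswith(prefix)
    })


print(died_in_year(993))  # ['Abraham von Freising']
print(died_in_year(994))  # ['Abraham von Freising']
```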

Trying to specify uncertainty with the existing software

The ambivalence of the current situation in WIKIBASE is that some forms of uncertainty can indeed be specified, while others can only be specified using rather improper and inelegant methods:

  • Specifying range uncertainty is an integral part of the Quantity Data Type. It is already possible to specify the distance between Berlin and Beijing as 7,353 +/- 0.5 km.
  • Abraham von Freising's date of death may be set to 7 June 993 and be accompanied by a Qualifier specifying the alternate date of 7 June 994. A simple query for his date of death would return only the date used for the Main Snak, though, unless the caller of the query is aware of the specific Qualifier used to specify the alternate date. In addition, an extra Property like "alternate date" may be required, a concept that would not be easy for users to grasp.
  • Simply specifying both dates as date of death is also a way of dealing with the uncertainty. However, there is no way to instantly make the caller of a query aware that two dates are specified. Therefore, each query has to account for the fact that a result may contain multiple dates of death for each instance. Having the callers of a query mark all results featuring multiple dates of death as uncertain themselves is only half the story, though: although rather uncommon, a person may in fact have two dates of death after having been reanimated.
  • Likelihood may be worked around by specifying a dedicated Property in order to render: "Robin of Locksley might have been Robin Hood." This, however, creates a discrepancy with how other "virtual" persons or alternate identities whose real-world counterparts are known are specified.
  • Another solution to model alternatives and likelihood would be a generic Property like "is uncertain" that, featuring a boolean value, would act as a flag. Using this Property would be quite awkward, though: it would only make sense with the boolean value "true", as there is no point in rating the quality of sources.
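
The Qualifier workaround and its main drawback can be sketched as follows. The dictionary layout is a deliberately simplified stand-in, not the actual WIKIBASE serialization, and the "alternate date" Property as well as both lookup functions are hypothetical. The point is that a caller who does not know about the Qualifier never sees the second date.

```python
from typing import Dict, List

# Simplified stand-in for a Statement with a Main Snak and one Qualifier.
statement: Dict = {
    "mainsnak": {"property": "date of death", "value": "0993-06-07"},
    "qualifiers": {"alternate date": ["0994-06-07"]},  # hypothetical Property
}


def simple_lookup(stmt: Dict) -> List[str]:
    """What a naive query returns: only the Main Snak's value."""
    return [stmt["mainsnak"]["value"]]


def qualifier_aware_lookup(stmt: Dict, qualifier: str = "alternate date") -> List[str]:
    """What a caller gets who knows which Qualifier carries the alternatives."""
    return simple_lookup(stmt) + list(stmt.get("qualifiers", {}).get(qualifier, []))


print(simple_lookup(statement))           # ['0993-06-07']
print(qualifier_aware_lookup(statement))  # ['0993-06-07', '0994-06-07']
```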

The bottom line is that modeling alternatives or likelihood with the current software has one of two costs. Either callers of queries have to take a lot of circumstances into consideration to actually receive a clean result, or entering and specifying values becomes awkward and the standards imposed by the operators of a WIKIBASE Repository need to be validated permanently.

A proper data model

According to the current Data Model, a Value Snak may feature only one dedicated value. However, there is no hard constraint on how many and what kind of Statements may be specified on an Item. Hence, the natural solution would be to simply specify all alternatives as values of a Property, resulting in the following statements:

  • Abraham von Freising died on 7 June 993.
  • Abraham von Freising died on 7 June 994.

Indeed, there would be circumstances allowing this contradiction:

  • If Abraham von Freising, in fact, died twice.
  • If both statements originate from different sources. However, when the source is a single document whose content cannot be read clearly, there are no two sources contradicting each other. Since both values originate from the same source, they should be captured in one single Statement, in one single Snak.

Hence, the value of a Value Snak should not be limited to one particular value but should also accept a set or a list of values. While this would allow proper specification of alternatives and of ranges that cannot be captured using the Quantity Data Type (i.e., a series of dedicated values), the internal complexity of querying and evaluating the data does increase to some extent. However, without a proper data model that allows capturing uncertainty, the data is doomed to be flawed, as there is no way to detect uncertainty that has been turned into certainty.
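
A minimal sketch of such a multi-valued Snak, under stated assumptions: `MultiValueSnak`, its fields, and `died_in_year` are hypothetical names, a `frozenset` stands for an unordered set of alternatives, and a `tuple` stands for an ordered list. The query returns each hit together with an uncertainty marker, so the caller no longer has to guess whether multiple values mean uncertainty.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple, Union


@dataclass(frozen=True)
class MultiValueSnak:
    """Hypothetical Value Snak holding a set or an ordered list of values."""
    prop: str
    values: Union[FrozenSet[str], Tuple[str, ...]]

    def is_uncertain(self) -> bool:
        # More than one value in a single Snak means the source is ambiguous.
        return len(self.values) > 1


def died_in_year(snaks: Dict[str, MultiValueSnak], year: int) -> List[Tuple[str, bool]]:
    """Return (item, uncertain) pairs for items with a matching date of death."""
    prefix = f"{year:04d}-"
    hits = []
    for item, snak in sorted(snaks.items()):
        if snak.prop == "date of death" and any(v.startswith(prefix) for v in snak.values):
            hits.append((item, snak.is_uncertain()))
    return hits


snaks = {
    "Abraham von Freising": MultiValueSnak(
        prop="date of death",
        values=frozenset({"0993-06-07", "0994-06-07"}),
    ),
}

print(died_in_year(snaks, 993))  # [('Abraham von Freising', True)]
```

The added cost mentioned above shows up in `died_in_year`: every comparison has to iterate over the values of a Snak instead of reading a single value.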

Flag

Flagging values may be an alternative to increasing the Data Model's complexity. While this would not solve the lack of a representation of uncertainty in the Data Model, it may be used to work around the problem. Flags could be applied just like Badges on Site Links. A Statement may simply be marked as uncertain, and flags could additionally be used, for example, to mark disputed Statements. When querying for Statements, the caller may opt to analyze any flags and alter the query result accordingly.
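
How flag-aware querying might look is sketched below. The `Flag` enumeration, the `Statement` class, and the `query` function with its `exclude` parameter are assumptions made for illustration only; the source does not define a concrete flag vocabulary or API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Iterable, List


class Flag(Enum):
    """Hypothetical flag vocabulary."""
    UNCERTAIN = "uncertain"
    DISPUTED = "disputed"


@dataclass(frozen=True)
class Statement:
    subject: str
    prop: str
    value: str
    flags: FrozenSet[Flag] = frozenset()


def query(statements: Iterable[Statement], prop: str,
          exclude: FrozenSet[Flag] = frozenset()) -> List[Statement]:
    """Return Statements for a Property, skipping any that carry an excluded flag."""
    return [s for s in statements if s.prop == prop and not (s.flags & exclude)]


statements = [
    Statement("Abraham von Freising", "date of death", "0993-06-07",
              frozenset({Flag.UNCERTAIN})),
    Statement("Abraham von Freising", "date of death", "0994-06-07",
              frozenset({Flag.UNCERTAIN})),
    Statement("Robin of Locksley", "identity", "Robin Hood",
              frozenset({Flag.DISPUTED})),
]

# A caller who ignores flags sees both dates; one who excludes them sees none.
print(len(query(statements, "date of death")))                                    # 2
print(len(query(statements, "date of death", exclude=frozenset({Flag.UNCERTAIN}))))  # 0
```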