To build trust (I generally don't like the term, and prefer transparency) we need to be able to indicate if primary sources are still, umm, "valid". Some research turned out to be useful for some time, and still presents interesting studies, but its conclusions are now known to be false. Some research was based on fraudulent work and got retracted. Annotating primary sources in Wikidata with various forms of qualification is essential to moving science forward: ignoring mistakes and misconduct, and not showing we know how to handle them, will contribute to a further blurring of facts, fakes, and fiction. If we cannot indicate that a source was proven wrong, we will forget it, cite it again and again, and generally not learn from our mistakes.
Therefore, I think part of this proposal should be to make a start with developing models that adopt the various community proposals already at work in this area. I do not anticipate this will be solved in the first year, but doing something impactful over the period of the full project sounds quite achievable: the data and tools are around, but integration and awareness are missing, two actions already core to the proposal. The first year (this annual plan) could work out a plan to integrate the resources.
The first resources I would like to see interoperable are those that provide information about retractions. These include the RetractionWatch database and PubMed (CrossRef may also have retraction information). Interoperability would start with the creation of suitable properties and a model that describes how retractions show up in Wikidata (probably with a suitable ShEx schema). For the RetractionWatch database, a property for their database entries may be sufficient. The more provenance about the retraction, the better, however.
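To make that concrete, here is a minimal ShEx sketch of what such a model could look like. Note that wdt:P31 ("instance of") and wd:Q13442814 ("scholarly article") exist today, but the two retraction properties are placeholders for properties that would still need to be proposed:

    PREFIX wd:  <http://www.wikidata.org/entity/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    # Sketch: a scholarly article annotated with retraction information.
    start = @<ArticleWithRetractionInfo>

    <ArticleWithRetractionInfo> {
      wdt:P31 [ wd:Q13442814 ] ;            # instance of: scholarly article
      wdt:P_isRetractedBy IRI ? ;           # placeholder: item for the retraction notice
      wdt:P_retractionWatchID xsd:string ?  # placeholder: RetractionWatch entry identifier
    }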
The second source of information is citations. We already have a rich and growing citation network, but the current citations do not reflect the reason why the citation was made: this can include agreement and reuse of knowledge and data, but also disagreement, etc. The Citation Typing Ontology (CiTO) nicely captures the various reasons why an article is cited. Some articles are cited a lot, but not because the paper turned out solid (e.g. http://science.sciencemag.org/content/332/6034/1163).
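As a hedged sketch of what typed citations would enable on Wikidata, the query below counts the citations of one paper by citation type. P2860 ("cites work") exists; the target item wd:Q_targetPaper and the qualifier pq:P_citationType (whose values would map to CiTO classes) are placeholders, not existing identifiers:

    # Count citations of one target paper, grouped by citation type.
    SELECT ?citationType (COUNT(?citing) AS ?citations) WHERE {
      ?citing p:P2860 ?citationStatement .                  # P2860 = "cites work"
      ?citationStatement ps:P2860 wd:Q_targetPaper ;        # placeholder target item
                         pq:P_citationType ?citationType .  # placeholder qualifier
    }
    GROUP BY ?citationType
    ORDER BY DESC(?citations)

A breakdown like this would immediately show whether a highly cited paper is cited in agreement or in dispute.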
The importance is huge. The effort will be substantial, but the work can largely be done by the community, once the foundation and standards are laid out by Wikipedia/Wikidata. Repeatedly we find people citing retracted papers or citing papers with false information, and that is just among scholars who read a substantial part of the research in their field. The impact will be substantial and use cases are easy to envision: policy development (which research should our governance be based on, and which not), research funding (what is the long-term quality of research at some institute: boring but solid, versus exciting but risky), and doing research itself (does this paper still reflect our best knowledge).
Of course, without this foundation we keep running into questions of reliability in Wikipedia too: can we automate alerting editors of articles where a cited paper is now considered false? Or, regarding research, can the ratio of false to reliable sources be used to identify Wikipedia articles of a dubious nature?
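For example, assuming the placeholder retraction property from the sketch above, a query along these lines could feed a bot that flags retracted papers:

    # List scholarly articles marked as retracted, with their retraction
    # notices; wdt:P_isRetractedBy is again a placeholder property.
    SELECT ?paper ?paperLabel ?notice WHERE {
      ?paper wdt:P31 wd:Q13442814 ;       # instance of: scholarly article
             wdt:P_isRetractedBy ?notice .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }

A bot could then match that list against the references used in Wikipedia articles and notify the editors or talk pages involved.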
I fully understand that Wikimedia would go beyond the state of the art of the research community, but that community is not doing this itself. Just like it was not building open, domain-independent resources, which turned out to be of great use in and to science. If our goal is the collection of all knowledge, this collection is not a mere pile of more and more knowledge, but must be bound by carefully judging the quality of that knowledge. For this, tracking the above types of information (retractions, citation types) is essential, IMHO.