Topic on Talk:Wikimedia Technology/Annual Plans/FY2019/CDP3: Knowledge Integrity

Re: understand and accelerate the referencing and linking of knowledge statements to external sources, catalogs, metadata providers, and content repositories.

Kerry Raymond (talkcontribs)

One of the problems I encounter when trying to resolve a "citation needed" is that I can often find the information on a website (thanks Google) but, as most websites don't cite their sources and Wikipedia is soooo popular, it's not uncommon to find the same sentence/paragraph there, word-for-word or substantially similar. Is the Wikipedia text a copyright violation of this website? Or am I seeing an unattributed copy of Wikipedia material on this website? I can (with some effort) find the diff that added the information to the Wikipedia article, so I know the date the info appeared on Wikipedia, but most websites don't even date the webpage (beyond perhaps refreshing their annual copyright notice, © 2018), let alone provide a history. And even if the sentences are different (although how many ways can you say "Joe Bloggs was born in Sydney on 20 December 1880"?), there's still no guarantee that the information content didn't come from Wikipedia. The line between Wikipedia and external sources is now totally blurred. After 17 years of Wikipedia, even offline sources like books (once seen as "authoritative" and definitely distinct from Wikipedia) may now contain Wikipedia information content. Are pre-2001 sources the only safe haven?

Should we have a campaign to ask websites to stamp pages with a "Guaranteed: No Wikipedia inside" so we can use them with more confidence?

It's all very well to say Wikipedia is a tertiary source drawing on secondary sources, but if we can't distinguish between a secondary source and a quaternary source, maybe we need to revisit the role of primary sources in Wikipedia. Maybe we have to stop being crowdsourcers and start being scholars?

Ocaasi (WMF) (talkcontribs)

Hi Kerry, you've identified a "hard" problem: knowing which came first, Wikipedia or the publication. With many websites lacking datestamps, resolving this is far from trivial in many cases. The closest work on this that I'm aware of is based on a partnership we formed with the plagiarism detection company Turnitin (iThenticate), now live as CopyPatrol at https://tools.wmflabs.org/copypatrol/en. It's a great tool, but even with its very sophisticated algorithm and huge corpus of materials to check against, determining whether Wikipedia was first or second requires human review.

Do you think that this program should incorporate improvements to CopyPatrol or other plagiarism tools? At the moment that is handled by the Community Tech Team, although it's something we could look into developing further if you think new features could help address this conundrum. Cheers, Jake

Kerry Raymond (talkcontribs)

If the webpage in question can be found in the Internet Archive, we do have some way to timestamp that webpage. If we had a version of WikiBlame (http://wikipedia.ramselehof.de/wikiblame.php?lang=en) that worked over the Internet Archive for the same content, in parallel with the Wikipedia article, we might be able to get a range of time in which that information first appeared on that webpage (but nothing like the precision we get with our versioning). But it might be sufficient to establish that the Wikipedia content was definitely before or definitely after that webpage, which then allows us to decide if one is likely to be the source (or the copyvio) of the other. Of course, if the Wikipedia timestamp is in the middle of the range for the webpage, we are none the wiser. And it doesn't rule out that both have a common third source (which might be another Wikipedia article or another webpage on that same website, since both can get refactored).
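To make the comparison concrete, here is a minimal sketch of the date-range logic only. It assumes someone has already collected, for each archived capture of the webpage, the capture time and whether the passage is present, and has found the timestamp of the Wikipedia diff that added it. The names `first_appearance_window` and `compare_priority` are hypothetical, and fetching the captures from the Internet Archive is deliberately left out.

```python
from datetime import datetime
from typing import List, Optional, Tuple

# One archived capture: (capture time, is the passage present in it?)
Snapshot = Tuple[datetime, bool]


def first_appearance_window(
    snapshots: List[Snapshot],
) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Bracket the first appearance of the passage on the webpage.

    Returns (last capture without the passage, first capture with it).
    Either bound is None when the archive gives no evidence for it.
    """
    last_absent = None
    for captured_at, has_passage in sorted(snapshots, key=lambda s: s[0]):
        if has_passage:
            return last_absent, captured_at
        last_absent = captured_at
    return last_absent, None


def compare_priority(wiki_added_at: datetime, snapshots: List[Snapshot]) -> str:
    """Classify which side had the passage first, if the archive can tell."""
    last_absent, first_present = first_appearance_window(snapshots)
    # The webpage already had the passage when Wikipedia gained it.
    if first_present is not None and first_present <= wiki_added_at:
        return "webpage first"
    # The webpage still lacked the passage after Wikipedia gained it.
    if last_absent is not None and last_absent >= wiki_added_at:
        return "wikipedia first"
    # The Wikipedia timestamp falls inside the window, or there is no evidence.
    return "undetermined"


wiki_added_at = datetime(2010, 6, 1)
example = [(datetime(2009, 1, 1), False), (datetime(2011, 1, 1), True)]
print(compare_priority(wiki_added_at, example))
```

In the example the Wikipedia edit falls between the two captures, so the result is "undetermined", which is the "none the wiser" case. Even a clear result says nothing about a common third source.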

Ocaasi (WMF) (talkcontribs)

This is a neat idea, and I wonder what it would take to intersect Turnitin and Wayback Machine. It doesn't sound trivial, but I'm happy to talk to Mark Graham at IA about it (he runs Wayback). -Jake
