Wikimedia Technical Conference/2018/Session notes/Integrating data into our products

Theme: Integrating useful data into our products
Type: Evaluating Use Cases
Session Leader: Ramsey Isler
Facilitator: Joaquin Hernandez
Scribe: Irene

Description: Content is the key offering on Wikimedia projects, but it is also important to provide useful data about that content. Metadata, usage metrics, and content analysis are just a few areas where data can enhance our projects. This session will explore methods and motivations for using data of various types to expand and improve Wikimedia content and tools.

Questions discussed

Question: What are the use cases for structured data / metadata / semantic data on Wikipedia and other content wikis? How are these use cases served now? What data types need support for curation, and which do not? Specifically mention categories and infoboxes.

Significance: While we know we want to use more structured data on our content wikis, we haven’t clarified where and how we want to enable this. Understanding these use cases and the needs for curation will help us design ways to include data.

Answers:
  • Data tab on Wiki pages - a quick shortcut that shows structured data about/relating to the page. This was by far the most popular idea (with 4 votes for favorite idea)
  • Multi-language auto-generated descriptions from Wikidata statements
  • Lead Image + focus rectangle (to set the cropped area of the main image for an article, aka “The Michelangelo’s David” problem)
  • Open Graph Metadata - for social media
  • Related Articles/see also - suggestions/discovery driven by structured data and inference/derivation
  • Categories (with a focus on how MCR changes category use on pages) - especially relevant with MCR, since currently the entire content is duplicated every time the category changes
  • Content creation Metadata (e.g. number of contributors, page activity, most disputed content, blame tool)
  • Page Templates -> Ontology -> Semantic data
  • Structured page / section data + Semantic article content mark-up + Clear separation of in-article content from its presentation - these were three ideas from three separate participants, who all decided that they were actually talking about the same thing: using structured data to describe the content of the article.
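
As an illustration of the lead image + focus rectangle idea above, here is a minimal sketch of how a stored focus rectangle could steer thumbnail cropping. The `Rect` type and the `crop_for_thumbnail` function are hypothetical names made up for this example; nothing here reflects an existing MediaWiki or Commons API, and the coordinate convention (pixels, origin at top left) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in pixel coordinates (origin at top left)."""
    left: int
    top: int
    width: int
    height: int


def crop_for_thumbnail(image_w: int, image_h: int, focus: Rect, aspect: float) -> Rect:
    """Return the largest crop with the given width/height aspect ratio,
    centered on the focus rectangle and clamped to the image bounds."""
    # Largest crop of the requested aspect ratio that fits inside the image.
    crop_w = min(image_w, int(image_h * aspect))
    crop_h = min(image_h, int(crop_w / aspect))

    # Center of the curated focus rectangle.
    cx = focus.left + focus.width / 2
    cy = focus.top + focus.height / 2

    # Center the crop on the focus point, then push it back inside the image.
    left = int(round(cx - crop_w / 2))
    top = int(round(cy - crop_h / 2))
    left = max(0, min(left, image_w - crop_w))
    top = max(0, min(top, image_h - crop_h))
    return Rect(left, top, crop_w, crop_h)
```

For a tall 1000 x 3000 image with the focus rectangle near the top, a square crop stays at the top of the image instead of falling in the middle, which is the behavior the “Michelangelo’s David” problem asks for.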


Question: What type of semantic data can/do we want to attach to pages? What type of data do we need to attach to non-page entities like revisions, diffs, paragraphs, sentences, users, citations, etc.?

Significance: Given the use cases above, it should be obvious that we need to attach data to certain types of entities within MediaWiki. While most data may need to be associated with a page, sometimes we need to attach data to a revision (JADE) or another type of entity.

Answers: It seems like the following use cases need to be attached to pages:

  • Structured Data tab
  • Open Graph Metadata
  • Content creation metadata (applies to both page and non-page elements)
  • Post-MCR Category metadata
  • Structured page/section data, aka Semantic Article Content Markup, aka separation of content from presentation

And the following should be attached to non-page entities:

  • Multi-language auto-generated descriptions from Wikidata statements
  • Maybe Open Graph Metadata (as it could apply to Diffs)
  • Maybe categories (depending on implementation)
  • Content Creation Metadata (applies to both page and non-page elements)
  • Lead image + focus rect (applies to both page and non-page elements [focus rect is metadata, probably on Wikidata])
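
To make the “auto-generated descriptions from Wikidata statements” item more concrete, here is a rough sketch of deriving a short description in several languages from a couple of statements. `P31` (instance of) and `P17` (country) are real Wikidata property IDs, but the label table, the templates, and the `describe` function are hand-written assumptions for illustration; nothing is fetched from Wikidata, and a real implementation would need per-language grammar rather than a single template.

```python
# Toy, hand-written label data standing in for Wikidata labels (illustrative only).
LABELS = {
    "Q515": {"en": "city", "de": "Stadt"},
    "Q183": {"en": "Germany", "de": "Deutschland"},
}

# Hypothetical per-language templates; real languages need richer grammar handling.
TEMPLATES = {
    "en": "{kind} in {country}",
    "de": "{kind} in {country}",
}


def describe(statements, lang, fallback="en"):
    """Derive a short description from 'instance of' (P31) and 'country' (P17).

    Returns None when there is not enough data to say anything.
    """
    def label(item_id):
        names = LABELS.get(item_id, {})
        # Fall back to another language when the requested label is missing.
        return names.get(lang) or names.get(fallback)

    kind = label(statements.get("P31"))
    country = label(statements.get("P17"))
    if kind is None:
        return None
    if country is None:
        return kind
    template = TEMPLATES.get(lang, TEMPLATES[fallback])
    return template.format(kind=kind, country=country)


berlin = {"P31": "Q515", "P17": "Q183"}
print(describe(berlin, "en"))  # city in Germany
print(describe(berlin, "de"))  # Stadt in Deutschland
```

Because the result is fully derived from the statements, it fits the “inferred/derived, possibly only cached” category discussed in the session.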

Question: For which use cases should data be stored in a specific content wiki? For which use cases should data be stored on Wikidata and “imported” from there?

Significance: Some data types may only be needed within a specific project, but others may be central and should be stored in Wikidata. Identifying the rules for how we choose will guide our architecture and provide a best practice for product owners/engineers.

Answers: Some of these answers were very clear; others were a little uncertain.

Stored on the content Wiki:

  • Related articles / See also - the data on the page will obviously be stored in the content wiki
  • Content creation Metadata

Stored on Wikidata:

  • Some parts of ontology/semantics derived from templates (the ontology part in particular)
  • The focus rectangle part of Lead/Page Images (stored via structured data on Commons)
  • Multi-language auto-generated descriptions from Wikidata statements - [split vote] this is inferred/derived data that might be stored permanently, but also may only be cached.

Not stored at all (at most, simply cached):

  • Multi-language auto-generated descriptions from Wikidata statements - [split vote] this is inferred/derived data that might be stored permanently, but also may only be cached.
  • Related articles / See also - any auto-generated info from Wikidata may only be cached.
  • Open Graph metadata (generated/derived and cached)
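
The “generated/derived and cached” option for Open Graph metadata could look roughly like the sketch below. The `og:title`, `og:type`, `og:url`, `og:image` and `og:description` property names come from the Open Graph protocol; the `PAGES` dictionary, the example URLs, and the `open_graph_tags` function are hypothetical stand-ins, and an in-process `lru_cache` is only an assumption about where such a cache might live.

```python
from functools import lru_cache
from html import escape

# Hypothetical in-memory stand-in for page data; a real system would read this
# from the wiki and/or Wikidata.
PAGES = {
    "Example & page": {
        "description": "A short derived description",
        "image": "https://example.org/lead-image.jpg",
        "url": "https://example.org/wiki/Example_%26_page",
    }
}


@lru_cache(maxsize=1024)
def open_graph_tags(title: str) -> str:
    """Derive Open Graph <meta> tags for a page; results are cached, not stored."""
    page = PAGES[title]
    props = {
        "og:title": title,
        "og:type": "article",
        "og:url": page["url"],
        "og:image": page["image"],
        "og:description": page["description"],
    }
    # Escape values so titles containing &, < or quotes produce valid HTML.
    return "\n".join(
        f'<meta property="{name}" content="{escape(value, quote=True)}" />'
        for name, value in props.items()
    )
```

In this sketch the tags are recomputed only on a cache miss; after an edit to the page, calling `open_graph_tags.cache_clear()` (or a finer-grained invalidation) would force regeneration.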

Question: Is it necessary for this data to be curated separately on the Wikidata client wiki (like en.wp.o), or only within Wikidata, with affordances to curate Wikidata from within the client wiki? Do all Wikimedia wikis need the ability to consume and integrate data from Wikidata?

Significance: When using data from Wikidata in other wikis, how should we support curation of that data? Do we build in a standard way to curate Wikidata from client wikis? Do we support some sort of “forking” of the data, and do we need to support upstreaming changes from the client wiki? These questions help us understand the needs of Data Federation.

Answers: The question was partially answered, but with some complications.

Curated on-wiki:

  • Categories (post MCR)
  • Structured page / section data + Semantic article content mark-up + Clear separation of in-article content from its presentation
  • Related articles/See also
  • Content creation Metadata

Curated on Wikidata:

Undecided/it’s complicated:

  • Structured data tab - [split vote] some participants thought the curation should happen on Wikidata. One voter thought it should happen on-wiki (via editing interface), others thought curation could happen both on Wikidata and on-wiki
  • Templates -> Ontology -> Semantic data - [split vote] much debate about how curation could be done here
  • Lead image + focus rectangle - [split vote] general feeling was that the Lead image may or may not be curated on-wiki but the focus rectangle should be curated via the image’s Commons page with structured data

Question: What other sorts of data would be useful for end users to have? Perhaps info about how many times their content has been seen (or reused in the case of multimedia)? What kinds of data might be useful for volunteer devs?

Significance: We need to think beyond just structured/semantic data. What sorts of other data about our content and how it is used do we have? Is it in a usable format? If not, what would it take to make it so?

Answers: Participants felt that the topics brought up covered the bases pretty well.

Features and Goals

Given your discussion of the topic and answering questions for this session, list one or more user stories or user facing features that we should strive to deliver
1.  Adding a data tab alongside “read” and “history” that shows Wikidata/structured data for that page/topic
Why should we do this?

This was a popular option that was seen as a way to finally unify the two worlds of our Wikipedia content and structured data

What is blocking it?
  • No MCR on all wikis
  • Community decisions to actually enable structured data on everything
  • Design/spec for UI, etc.
  • Answering the question of
Who is responsible?

WMF, WMDE, and Community?


Important decisions to make

What are the most important decisions that need to be made regarding this topic?
1.  Clearly defining the model of where structured data elements should be stored and how/where they are curated.
Why is this important?

There is confusion, even within WMF, on some aspects about storage and curation processes, and we need to clear that up before we can begin to do even high-level strategy.

What is it blocking?

Taking next steps for deciding how/when/why to implement structured data across Wikis

Who is responsible?

WMF, WMDE, Community

Action items

What action items should be taken next for this topic? For any unanswered questions, be sure to include an action item to move the process forward.
1. Spend more time exploring the question of whether data should be stored on-wiki or on Wikidata, and explore how curation will work. Do both on a case-by-case basis, because there won't be a one-size-fits-all solution.
Why is this important?

Where to store things and how to curate them are key questions.

What is it blocking?

Moving forward with architecture design/spec for "structured templates"

Who is responsible?

WMF, Community?

New Questions

Do we add (potentially multiple) concept Wikidata IDs to page templates, OR do we have it all be done on Wikidata? This is a question of both storage and curation.
Why is this important?

This will determine the curation and storage model when doing "structured templates".

What is it blocking?

Comprehensive matching of templates to Wikidata items

Who is responsible?

Community

Detailed notes

Place detailed ongoing notes here. The secondary note-taker should focus on filling any [?] gaps the primary scribe misses, and writing the highlights into the structured sections above. This allows the topic-leader/facilitator to check on missing items/answers, and thus steer the discussion.

Voting legend for the session:

  • Pink post-it - non-page
  • Blue post-it - page element
  • Orange dot - storage on wiki
  • Red dot - storage on Wikidata
  • Purple dot - not stored at all
  • Green dot - curation on wiki
  • Yellow dot - curation on Wikidata

Goal: we want to know the big YES ideas when it comes to data. We don’t want the “meh” ideas; we want the passion projects.

Generated Ideas:

  • Adding a data tab alongside “read” and “history” that takes you to the Wikidata item for that page
    • Storage questions: one contributor’s long-term hope is that the wiki and Wikidata will one day be merged into a single entity.
    • Contention around where it’s curated - Sam, Lydia, Bryan, and Dmitry all thought that it should be curated through both
  • Multi-language auto-generated descriptions from Wikidata statements (ideally a short piece of code, especially for short page previews)
    • Some thought ideally not stored but will end up being cached
    • Not curated because not an edit
  • Lead image + focus rect
    • Example: thumbnails in search results. Right now we don’t have a good way of focusing on the important part of each image when generating thumbnails
    • Question of storage - complicated, coordinates could be stored on base on commons
    • Curation split votes - lead image is curated by the wiki but the region should be on structured data, but will probably be overwritten
    • Does it always have to be done on wiki or?
  • Open Graph metadata to interact with social media
    • Cached but not actually stored?
    • Could be an MCR slot
    • Not curated according to votes
  • Related articles / see also
    • Storage: inferred based on Elasticsearch; however, community members thought they should be able to manually override. JonKatz says all three are viable storage options
    • Lydia: makes sense to override that locally but generally inferred (after Lydia’s statement Jon got rid of the wikidata half of his sticker)
  • Categories (MCR)
    • Especially relevant with MCR, since currently the entire content is duplicated every time the category changes
    • Question about storage - presumably this could span across languages and could be stored on Wikidata
    • Contention
    • Not currently consistent
  • Content creation metadata
    • Examples: number of contributors, page activity, most disputed content, blame tool
  • Templates leading to ontology and semantics
    • Do we add (potentially multiple) concept Wikidata IDs to the template, OR do we have it all be done on Wikidata?
    • If you can map templates to ontologies no matter where they come from, that gives you a plug for it
    • Already possible?
    • Split vote concerning storage AND concerning curation - definitely a question to be answered
  • Structured page / section data + Semantic article content mark-up + Clear separation of in-article content from its presentation
    • Opens the door to things like mixing and matching data from different projects