Wikimedia Technical Conference/2018/Session notes/Integrating data into our products
Theme | Integrating useful data into our products |
Type | Evaluating Use Cases |
Session Leader | Ramsey Isler |
Facilitator | Joaquin Hernandez |
Scribe | Irene |
Description: Content is the key offering on Wikimedia projects, but it is also important to provide useful data about that content. Metadata, usage metrics, and content analysis are just a few areas where data can enhance our projects. This session will explore methods and motivations for using data of various types to expand and improve Wikimedia content and tools.
Questions discussed
Question | Significance | Answers |
---|---|---|
What are the use cases for structured data / metadata / semantic data on Wikipedia and other content Wikis? How are these use cases served now? What data types need support for curation, and what does not? Specifically mention categories and infoboxes. | While we know we want to use more structured data on our content wikis, we haven’t clarified where and how we want to enable this. Understanding these use cases and the needs for curation will help us design ways to include data. |
|
What type of semantic data can/do we want to attach to pages? What type of data do we need to attach to non-page entities like revisions, diffs, paragraphs, sentences, users, citations, etc? | Given the use cases above, it should be obvious that we need to attach data to certain types of entities within MediaWiki. While most data may need to be associated with a page, sometimes we need to attach data to a revision (JADE) or another type of entity. | It seems like the following use cases need to be attached to pages:
Structured Data tab; Open Graph metadata; Content creation metadata (applies to both page and non-page elements); Post-MCR category metadata; Structured page/section data, aka Semantic Article Content Markup, aka separation of content from presentation. And the following should be attached to non-page entities:
|
For which use cases should data be stored in a specific content wiki? For which use cases should data be stored on Wikidata and “imported” from there? | Some data types may only be needed within a specific project, but others may be central and should be stored in Wikidata. Identifying the rules for how we choose will guide our architecture and provide a best practice for product owners/engineers. |
Some of these answers were very clear, others were a little uncertain. Stored on the content wiki:
Not stored at all (at most, simply cached):
|
Is it necessary for this data to be curated separately on the Wikidata client wiki (like en.wp.o), or only within Wikidata, with affordances to curate Wikidata from within the client wiki? Do all Wikimedia wikis need the ability to consume and integrate data from Wikidata? | When using data from Wikidata in other wikis, how should we support curation of that data? Do we build in a standard way to curate Wikidata from client wikis? Do we support some sort of “forking” of the data, and do we need to support upstreaming changes from the client wiki? These questions help us understand the needs of Data Federation. | The question was partially answered, but with some complications.
Curated on-wiki:
Curated on Wikidata:
Undecided/it’s complicated:
|
What other sorts of data would be useful for end users to have? Perhaps info about how many times their content has been seen (or reused in the case of multimedia)? What kinds of data might be useful for volunteer devs? | We need to think beyond just structured/semantic data. What sorts of other data about our content and how it is used do we have? Is it in a usable format? If not, what would it take to make it so? | Participants felt like the topics brought up covered the bases pretty well. |
Features and Goals
Given your discussion of the topic and answering questions for this session, list one or more user stories or user facing features that we should strive to deliver | ||
1. Adding a data tab alongside “read” and “history” that shows Wikidata/structured data for that page/topic | ||
Why should we do this?
This was a popular option that was seen as a way to finally unify the two worlds of our Wikipedia content and structured data |
What is blocking it?
|
Who is responsible?
WMF, WMDE, and Community? |
Important decisions to make
What are the most important decisions that need to be made regarding this topic? | ||
1. Clearly defining the model of where structured data elements should be stored and how/where they are curated. | ||
Why is this important?
There is confusion, even within WMF, on some aspects about storage and curation processes, and we need to clear that up before we can begin to do even high-level strategy. |
What is it blocking?
Taking next steps for deciding how/when/why to implement structured data across Wikis |
Who is responsible?
WMF, WMDE, Community |
Action items
What action items should be taken next for this topic? For any unanswered questions, be sure to include an action item to move the process forward. | ||
1. Spend more time exploring the question of whether data should be stored on-wiki or on Wikidata, and explore how curation will work. Do both on a case-by-case basis, because there won't be a one-size-fits-all solution. | ||
Why is this important?
Where to store things and how to curate them are key questions. |
What is it blocking?
Moving forward with architecture design/spec for "structured templates" |
Who is responsible?
WMF, Community? |
New Questions
Do we add (potentially multiple) concept Wikidata IDs to page templates, or do we have it all done on Wikidata? This is a question of both storage and curation. | ||
Why is this important?
This will determine the curation and storage model when doing "structured templates". |
What is it blocking?
Comprehensively matching templates to Wikidata items |
Who is responsible?
Community |
Detailed notes
- Voting Legend for Session:
- Pink post it - non page
- Blue post it - page element
- Orange dot - storage on wiki
- Red dot - storage on wikidata
- Purple dot - not stored at all
- Green dot - curation on wiki
- Yellow dot - curation on wikidata
Goal: we want to know the big YES ideas when it comes to data. We don’t want the meh ideas; we want the passion projects.
Generated Ideas:
- Adding a data tab alongside “read” and “history” that takes you to the Wikidata item for that page
- Storage questions: One contributor's long-term hope is that the wiki and Wikidata will one day be merged into a single entity.
- Contention around where it's curated - Sam, Lydia, Bryan, Dmitry all thought that it should be curated through both
- Multi-language auto-generated descriptions from Wikidata statements (ideally a short piece of code, esp. for short page previews)
- Some thought ideally not stored but will end up being cached
- Not curated because not an edit
- Lead image + focus rect
- E.g., thumbnails in search results; right now we don’t have a good way of focusing on the center of each image to generate thumbnails
- Question of storage - complicated; coordinates could be stored on Commons
- Curation split votes - lead image is curated by the wiki but the region should be on structured data, but will probably be overwritten
- Does it always have to be done on wiki or?
- Open Graph metadata to interact with social media
- Cached but not actually stored?
- Could be an MCR slot
- Not curated according to votes
- Related articles / see also
- Storage: inferred based on Elasticsearch; however, community members thought they should be able to manually override. JonKatz says all three are viable storage options
- Lydia: makes sense to override that locally but generally inferred (after Lydia’s statement Jon got rid of the Wikidata half of his sticker)
- CATEGORIES (MCR)
- Esp. with MCR, since currently the entire content is duplicated every time the category changes
- Q about storage - presumably this could span languages and could be stored on Wikidata
- Contention
- Not currently consistent
- Content creation metadata
- Examples: number of contributors, page activity, most disputed content, blame tool
- Templates leading to ontology and semantics
- Do we add (potentially multiple) concept Wikidata IDs to the template, or do we have it all done on Wikidata?
- If you can map templates to ontologies no matter where they come from, that gives you a plug for it
- Already possible?
- Split vote concerning storage AND concerning curation - definitely a question to be answered
- Structured page / section data + Semantic article content mark-up + Clear separation of in-article content from its presentation
- Opens the door to things like mixing and matching data from different projects
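For the "Lead image + focus rect" idea above, the sketch below shows one way a stored focus rectangle could drive thumbnail cropping. It is only an illustration under stated assumptions, not an existing MediaWiki or Commons API: the function name `crop_box_for_focus`, the `(left, top, width, height)` rectangle format, and the sample dimensions are all hypothetical.

```python
def crop_box_for_focus(image_w, image_h, focus, target_aspect):
    """Return (left, top, width, height) of the largest crop with the
    requested aspect ratio (width / height), centered as closely as
    possible on the focus rectangle while staying inside the image.

    `focus` is a hypothetical (left, top, width, height) tuple in pixels.
    """
    fx, fy, fw, fh = focus

    # Largest crop of the requested aspect ratio that fits in the image.
    if image_w / image_h > target_aspect:
        crop_h = image_h
        crop_w = round(image_h * target_aspect)
    else:
        crop_w = image_w
        crop_h = round(image_w / target_aspect)

    # Center the crop on the middle of the focus rectangle.
    center_x = fx + fw / 2
    center_y = fy + fh / 2
    left = round(center_x - crop_w / 2)
    top = round(center_y - crop_h / 2)

    # Clamp so the crop never extends past the image edges.
    left = max(0, min(left, image_w - crop_w))
    top = max(0, min(top, image_h - crop_h))
    return left, top, crop_w, crop_h


# A wide 1000x500 image whose subject sits on the right-hand side,
# cropped to a square thumbnail.
print(crop_box_for_focus(1000, 500, (700, 100, 200, 200), 1.0))
# prints (500, 0, 500, 500)
```

The crop logic is the same wherever the rectangle ends up being stored or curated, so this sketch does not depend on the outcome of the split vote.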
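The "Related articles / see also" discussion describes suggestions that are generally inferred from search, with editors able to override them locally. Here is a minimal sketch of one possible merge rule, assuming manual entries take precedence and inferred entries fill the remaining slots. The function name `resolve_related`, the `limit` parameter, and the sample titles are hypothetical and do not reflect any existing extension's data model.

```python
def resolve_related(inferred, manual=(), limit=3):
    """Return the related-article titles to display.

    Manually curated titles come first; inferred titles fill any
    remaining slots. Duplicates are dropped, keeping first occurrence.
    """
    merged = []
    for title in list(manual) + list(inferred):
        if title not in merged:
            merged.append(title)
    return merged[:limit]


# Inferred suggestions (e.g. from a search backend) with one local override.
inferred = ["Alpha", "Beta", "Gamma", "Delta"]
print(resolve_related(inferred, manual=["Gamma"]))
# prints ['Gamma', 'Alpha', 'Beta']
```

With no manual entries the function returns the inferred list unchanged (up to `limit`), matching the "generally inferred, locally overridable" position recorded in the notes.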