Structured Data Across Wikimedia

Start:	2021-02-01
End:	2023-06-30
Team members:	Alexandra Ugolnikova - Product Manager; Carly Bogen - Program Manager; Cormac Parle - Software Engineer; Elena Tonkovidova - Test Engineer; Mark Holmquist - Engineering Manager; Matthias Mullie - Software Engineer; Marco Fossati - Software Engineer; Sneha Patel - UX Designer; Connie Chen - Data Scientist; Luca Martinelli - Movement Communications Specialist
Backlog:	#Structured-Data-Backlog

SDAW^[1] was a project designed to structure content on wikitext pages so that machines can recognize it and relate it to other content, making reading, editing, and searching easier and more accessible across projects and on the Internet.

It aimed to help users associate content between Wikimedia projects, help readers dive deeper into the Wikimedia knowledge ecosystem, and help contributors disseminate information across projects and beyond them in a Wikidata-like way. The project was also created as a venue for experimenting with computer-aided editing tools that make editing easier and more accessible to more editors around the world.

The project ran from February 1, 2021 to June 30, 2023.

Background

This project was a follow-up to similar development completed on Commons as part of the previous SDC^[2] grant, and was partially funded by a three-year grant from the Sloan Foundation. Work on SDC made us aware of the need for more advanced metadata for all content, and for APIs that provide better search results, which would in turn make content more accessible, discoverable, translatable and usable for other needs.

The project had three high-level goals:

To allow machines to recognize Wikimedia content and to suggest relationships to other Wikimedia content. We explored this first via the image suggestion project.
To design a way to structure articles and pages to enable new content formats – such as content served in smaller, easily digestible pieces that is more accessible for readers to use and share.
To give Wikimedia users a more inviting, more efficient way to search and find content, building on MediaSearch, and exploring new ways to improve search across Wikipedias using Structured Data.

What is changing

The goal of the project is to design and prototype a new system that aims to be flexible enough to serve all the kinds of metadata we might need to support in the near future.

We identified three main projects to develop as part of our work:

Image suggestion, a feature for experienced users to help illustrate Wikipedia articles;
Sectional metadata, also known as Section topics, in order to describe what a section of a Wikipedia article is about;
Search improvements, which will use structured content to give users a more inviting and efficient way to search and find content on the Wikipedias.

Image suggestion

The Image Suggestion UI aims to develop systems for structured data across all Wikimedia projects.

This work will build on the work already begun as part of the “Add an image” structured task project. However, its focus will shift towards improving the processes for experienced contributors. In particular, we will target users who have edited or watched a particular article or set of articles, since they are likely to be experts in the topic and to have an interest in seeing those articles improve.

Section topics

The Section Topics project will identify sections in an article and create topics accordingly for those sections, drawing on several elements, such as:

an algorithm that detects Wikidata items based on the section’s blue links (which will be developed in partnership with the Structured Data, Research, and Data Platform teams);
the ability to automatically identify sections in an article (which will be developed in partnership with the Structured Data and Data Platform teams).
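The two elements above can be illustrated with a minimal sketch. This is not the project's implementation: it assumes plain wikitext input, uses simple regular expressions rather than a full wikitext parser, and takes a hypothetical `title_to_qid` dictionary in place of a real lookup from page titles to Wikidata items. The function names `split_sections` and `section_topics` are likewise made up for this example.

```python
import re

# Wikitext headings look like "== Title ==" (levels 2 to 6).
HEADING_RE = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")
# Internal links look like [[Target]], [[Target|label]] or [[Target#Anchor|label]].
# Group 1 captures only the target page title.
WIKILINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def split_sections(wikitext):
    """Return (title, text) pairs; text before the first heading is filed under 'Lead'."""
    sections = [("Lead", [])]
    for line in wikitext.splitlines():
        match = HEADING_RE.match(line)
        if match:
            sections.append((match.group(2), []))
        else:
            sections[-1][1].append(line)
    return [(title, "\n".join(lines)) for title, lines in sections]


def section_topics(wikitext, title_to_qid):
    """Map each section title to the Wikidata IDs of the pages it links to.

    title_to_qid is a hypothetical stand-in for a real title-to-item lookup.
    Links whose target is not in the mapping (for example red links) are skipped.
    """
    topics = {}
    for title, text in split_sections(wikitext):
        qids = []
        for target in WIKILINK_RE.findall(text):
            target = target.strip()
            if ":" in target:
                # Skip namespaced links such as File: or Category:.
                continue
            qid = title_to_qid.get(target)
            if qid and qid not in qids:
                qids.append(qid)
        topics[title] = qids
    return topics


SAMPLE = """Paris is the capital of [[France]].

== Landmarks ==
The [[Eiffel Tower|tower]] is a landmark of [[Paris]].
[[File:Tour Eiffel.jpg|thumb]]

== History ==
See [[France#History|French history]] and [[Nonexistent page]].
"""

# Illustrative mapping only; a real system would query Wikidata sitelinks.
SAMPLE_MAPPING = {"France": "Q142", "Eiffel Tower": "Q243", "Paris": "Q90"}

if __name__ == "__main__":
    for section, qids in section_topics(SAMPLE, SAMPLE_MAPPING).items():
        print(section, qids)
```

In this sketch every linked item counts equally as a topic of its section. An actual algorithm would likely need to rank or filter the candidates, which is outside the scope of the example.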

One of the first use cases we envisioned for section topics will be section-level image suggestions, which will use the blue-links algorithm and section identification infrastructure above, and be delivered both via the newcomer experience and via notifications for experienced contributors. This will build upon the work done on image suggestions and will be developed in partnership with the Structured Data, Data Platform, Research, Search, Android, and Growth teams.

These elements will neither change nor affect the current editing experience for users. All these activities will be automatic and will not depend on any action from editors. Currently, this project is in its development phase, and there are still aspects that may require further investigation and/or feedback from users.

Search improvements

The Search Improvements project will use structured content to give users a more inviting and more efficient way to search and find content on the Wikipedias. By improving Special:Search, we want to help users find the information they are looking for, as well as information they may not have noticed or come across through existing search.

We aim to identify and define incremental “special search” improvements that use structured content, to assist users in finding the content they are looking for, especially in those language wikis that have fewer articles.

What do we not want to do?

Leave users out of the process
Overwhelm users with too much new content to moderate
Add any additional bias to Wikimedia projects
Add additional vectors for vandalism
Introduce too much complexity into our systems

Status updates

June 2023

Search Preview deployed on Catalan, Dutch, Hungarian, Norwegian and Ukrainian Wikipedia.
Section-level Image Suggestions deployed on Portuguese, Russian, Indonesian, Catalan, Finnish, Hungarian and Norwegian Wikipedia.

May 2023

The final report of the DPLA^[3] project, funded by SDAW^[1] to drive the reuse of described and attributed images, was published.

March 2023

Survey about Image Suggestions notifications run on Portuguese, Russian and Indonesian Wikipedia.

January 2023

New feature Search Preview deployed on Portuguese, Russian and Indonesian Wikipedia.
Started work on Section-level Image Suggestions, based on work done for Section Topics.

November 2022

Image Suggestions testing phase started on Catalan, Finnish, Hungarian and Norwegian Wikipedia.

September 2022

First round of Image Suggestions testing on Portuguese, Russian and Indonesian Wikipedia successfully concluded.
Project pages updated to reflect the new current status of the initiative.

June 2022

The second year report for Structured Data Across Wikimedia has been published.
DPLA^[3] was awarded SDAW^[1] grant funding to drive the reuse of described and attributed images. You can read more about it at DPLA's 2022 SDAW project announcement.
A general consultation about Search improvements was launched.

March 2022

Project pages updated to reflect the new current status of the initiative and the three main projects to be developed.
Indonesian Wikipedia joined as the third tester community.

February 2022

Established contact with the Portuguese and Russian Wikipedia communities as the first tester communities for Image Suggestions.

November 2021

The project is moving to a first test stage: experimenting with the use of notifications to alert users to potentially useful images for Wikipedia articles.

May-August 2021

Looking for feedback about the Image Suggestions project, through individual invitations and a month-long RfC specifically targeted at four Wikipedias and Commons.

February 2021

Looking for feedback about these ideas.
Working on rough wireframes and mockups to help explore these ideas.
Exploring infrastructure to support this work via the Technical Decision Making Forum process. See task T274181.

Second half of 2020

Building MediaSearch on Commons.
MediaSearch A/B test - conducted between 10 and 17 September 2020.

Feedback

Project feedback is and will always be welcome. We are especially interested in your ideas about the extent to which you want to keep the “human-in-the-loop” throughout the topical metadata creation process. We are looking forward to hearing from you about the following open questions:

Your expectations about the project
1. What do users expect from this project? What are the necessary actions to be addressed?
2. How do you envision this metadata being used? Can you think of ways it would aid in your workflows?
Metadata moderation
1. Is moderation necessary to avoid vandalism and/or bias?
2. If moderation is necessary, how can it be effectively managed?
Adding and confirming metadata
1. Do users want to be able to approve or reject metadata suggested by the automated system?
2. Do users want to be able to add additional metadata beyond what is suggested by the automated system?
3. Do you think it would be sufficient for users to be able to send feedback with suggestions on how to improve the machine-generated metadata, when necessary?
Privileges for visualising and editing
1. Do we want metadata to be visible for all users or only for certain classes of users?
2. Do we want metadata to be editable for all users or only for certain classes of users?

More specific feedback about related projects can generally be left on those projects' talk pages.

Funding

Partial funding for this work is provided by a follow-up restricted grant from the Alfred P. Sloan Foundation, to further the work done by the first round of funding to develop SDC^[2].

References

[1] SDAW: Structured Data Across Wikimedia
[2] SDC: Structured Data on Commons
[3] DPLA: Digital Public Library of America
