Structured Data Across Wikimedia/Image Suggestions/Data Pipeline

Tracked in Phabricator
Task T296814

General architecture

Who does what

Team	Pipeline components
Data Engineering	Data Lake
Data Platform	Pipeline infrastructure, Cassandra, Image Suggestions API
Search Platform	Search indices update
Structured Data	Airflow job

The Structured Data side

Our main duties are:

extract relevant content from Commons, Wikidata and Wikipedias;
transform it into datasets suitable for Commons and Wikipedias search indices;
load image suggestions into the database that will serve the API.

In terms of the architecture diagram above, we:

own the Airflow job;
feed specific Hive tables in the Data Lake;
provide content for Data Persistence, namely input for Elasticsearch updates and Cassandra.

The Airflow job breaks down into the following steps:

a set of sensors that give green lights as soon as fresh data is available in the Lake;
a Python Spark task that gathers Elasticsearch weighted tags for the Commons search index;
a Python Spark task that gathers actual image suggestions, filtering out those rejected by the user community;
a Python Spark task that gathers suggestion flags for Wikipedias search indices;
a set of Scala Spark tasks that feed Cassandra with suggestions.

How it works

Get good suggestions from Commons

First, we aim to retrieve appropriate image candidates for unillustrated Wikipedia articles. To do so, we use two different scores:

the confidence of an image being an actually good suggestion;
the relevance of an image against a query in the Commons search index.

Initially, we expected the latter to cover the former, i.e., that the relevance score would both place high-quality images at the top of search results and provide accurate recommendations. However, we found that the two signals are only weakly correlated, so we opted for two separate scores. See phab:T272710#7119669 and phab:T301687#7732953 for the rationale behind the confidence score calculation.

We leverage the following content properties:

two from Wikidata, namely image (P18) and Commons category (P373);
one from Wikipedias, namely article lead images;
one from Structured Data on Commons, namely Depicts statements.

Confidence score

We set a constant score that depends on the property used to retrieve the given image candidate:

Property	Score
image (P18)	90
Commons category (P373)	80
lead images	80
Depicts	70

If an image is suggested based on more than one property, then we use the highest score.

The scores are based on a dataset of human-annotated suggestions that we gathered for evaluation purposes. They mirror the accuracy percentage, rounded down to the nearest 10%. For instance, over 90% of image (P18) candidates were judged as accurate suggestions, thus resulting in a conservative confidence score of 90.
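As a minimal sketch of the rule above, assuming the table is held in a plain dictionary: the keys are illustrative labels rather than identifiers from the pipeline, and both function names are hypothetical.

```python
# Constant confidence scores from the table above, keyed by source property.
# The keys are illustrative labels, not identifiers used by the pipeline.
CONFIDENCE_BY_PROPERTY = {
    "image (P18)": 90,
    "Commons category (P373)": 80,
    "lead images": 80,
    "Depicts": 70,
}


def confidence_score(properties):
    """Return the highest confidence among the properties that suggested an image."""
    return max(CONFIDENCE_BY_PROPERTY[p] for p in properties)


def accuracy_to_confidence(accuracy_percent):
    """Round an accuracy percentage down to the nearest multiple of 10."""
    return int(accuracy_percent // 10) * 10
```

For example, an image suggested through both Depicts and image (P18) would get the image (P18) score, since that is the higher of the two.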

Relevance score

We build sets of (property, value, score) triples that serve as weights to rank images in the Commons search index. For each property except Depicts, we retrieve the corresponding Wikidata item and compute the score.

When available, we consider image (P18) a crucial property, so we assign its values a constant maximum score of 1,000.

For Commons category (P373), we implement the following simple intuition: a category with few members is more important than one with many members. As a result, given a Wikidata category item, its score is inversely proportional to the logarithm of the total number of images it holds.
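The text only states the proportionality, not the exact formula. The sketch below is therefore one possible shape, not the pipeline's actual computation: the function name, the scale of 1,000 (borrowed from the maximum score mentioned above), and the offset that keeps the logarithm at or above 1 are all assumptions.

```python
import math

# Assumption: reuse the maximum score of 1,000 as the scale.
MAX_SCORE = 1_000


def category_score(image_count):
    """Score a category so that fewer images yield a higher score.

    Assumption: the offset of e keeps log() >= 1, so an empty category
    gets MAX_SCORE and the score never exceeds it.
    """
    return MAX_SCORE / math.log(image_count + math.e)
```

Under these assumptions, a category holding 10 images scores higher than one holding 1,000, which matches the intuition that small categories are more specific.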

For article lead images, the score is based on the number of main namespace pages that link to articles with the given lead image, grouped by the Wikidata item of the article. For instance, let Commons_File_A be a lead image of xxwiki/Article_X and yywiki/Article_Y, which map to the Wikidata item Q123. The Q123 score is proportional to the sum of incoming links for Article_X and Article_Y. Based on empirical evidence, we pick a scaling factor of 0.2 and a threshold of 5,000 for incoming links. Hence, if the sum of incoming links is:

less than the threshold, then we set the score to incoming links * scaling factor;
greater than or equal to the threshold, then we set the score to its maximum value, i.e., 1,000.
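The two cases above can be sketched as follows. The scaling factor, threshold, and maximum come from the text; the function name and the list-of-counts input are hypothetical.

```python
SCALING_FACTOR = 0.2    # from the text
LINK_THRESHOLD = 5_000  # from the text
MAX_SCORE = 1_000       # from the text


def lead_image_score(incoming_links_per_article):
    """Score a Wikidata item from the incoming links of its articles.

    `incoming_links_per_article` holds one count per Wikipedia article
    that maps to the item and uses the given lead image.
    """
    total = sum(incoming_links_per_article)
    if total >= LINK_THRESHOLD:
        return MAX_SCORE
    return total * SCALING_FACTOR
```

Note that the two constants line up: 5,000 incoming links times 0.2 equals the maximum of 1,000, so the score is continuous at the threshold.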

Find suggestions for Wikipedia articles

The second major step involves mining images that serve as suitable candidates for unillustrated Wikipedia articles. We consider an article to be unillustrated if ALL of the following are true:

the article has no local images;
the article has either no Commons images OR every Commons image linked to the article is used so widely across Wiki projects that it is likely to be an icon or placeholder;
the Wikidata item corresponding to the article is not an instance of a list, a year, a number, or a name.
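The three conditions above can be read as a single predicate. This is a sketch under stated assumptions: the `Article` structure and its field names are hypothetical, and so is the usage cut-off, since the text does not say how many uses make an image "widely used".

```python
from dataclasses import dataclass, field

# Hypothetical cut-off: the source does not give the actual number.
WIDELY_USED_THRESHOLD = 1000
# Illustrative labels for the excluded Wikidata instance types.
EXCLUDED_INSTANCE_TYPES = {"list", "year", "number", "name"}


@dataclass
class Article:
    local_images: list = field(default_factory=list)
    # Commons file name -> number of uses across Wiki projects
    commons_image_usage: dict = field(default_factory=dict)
    instance_of: set = field(default_factory=set)


def is_unillustrated(article):
    """Return True only if ALL three conditions hold."""
    no_local_images = not article.local_images
    # all() is True for an empty dict, which covers "no Commons images".
    only_placeholders = all(
        uses >= WIDELY_USED_THRESHOLD
        for uses in article.commons_image_usage.values()
    )
    allowed_type = not (article.instance_of & EXCLUDED_INSTANCE_TYPES)
    return no_local_images and only_placeholders and allowed_type
```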

We filter images matching ANY of the following out of the list of suggestions:

the image is used so widely across Wiki projects that it is likely to be an icon or placeholder;
the file name contains substrings that indicate icons or placeholders, such as flag or replace_this_image;
the image is in the placeholder category.
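A candidate is dropped as soon as any one of these conditions matches. In the sketch below, the usage cut-off and the category name are hypothetical, the substring list contains only the two examples given in the text, and case-insensitive matching is an assumption.

```python
# Hypothetical cut-off: the source does not give the actual number.
WIDELY_USED_THRESHOLD = 1000
# Only the two examples from the text; the full list is not given.
PLACEHOLDER_SUBSTRINGS = ("flag", "replace_this_image")
# Hypothetical category name.
PLACEHOLDER_CATEGORY = "Image placeholders"


def is_filtered_out(file_name, usage_count, categories):
    """Return True if ANY of the filter conditions matches the image."""
    name = file_name.lower()  # assumption: matching ignores case
    return (
        usage_count >= WIDELY_USED_THRESHOLD
        or any(substring in name for substring in PLACEHOLDER_SUBSTRINGS)
        or PLACEHOLDER_CATEGORY in categories
    )
```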

Tell whether an article has suggestions

The final stage consists of adding markers to Wikipedia articles for which we have suggestion candidates. We do this by injecting boolean flags into the respective search indices.

Monitoring

Some outputs of the data pipeline are monitored, and alerts are sent when something looks wrong. For more details, see https://phabricator.wikimedia.org/T312235