Wikimedia Apps/Team/Android/Add an image MVP

Objective

The Android, Structured Data, and Growth teams aim to offer "Add an Image" as a “structured task”. More about the motivations for pursuing this project can be found on the main page created by the Growth team. In order to roll out Add an Image and have the output of the task show up on wiki, a "minimum viable product" (MVP) for the Wikipedia Android app will be created. The MVP will improve the algorithm provided by the research team and answer questions about user behavior, to further explore the concerns raised by the community.

The most important thing about this MVP is that it will not save any edits to Wikipedia. Rather, it will only be used to gather data, improve our algorithm, and improve our design.

The Android app is where "suggested edits" originated, and our team already has a framework to build new task types easily. The main pieces include:

  • The app will have a new task type that users know is only for helping us improve our algorithms and designs.
  • It will show users image matches, and they will select "Yes", "No", or "Skip".
  • We'll record the data on their selections to improve the algorithm, determine how to improve the interface, and think about what might be appropriate for the Growth team to build for the web platform later on.
  • No edits will happen to Wikipedia, making this a very low-risk project.

The Android team will be working on this in February and March 2021. Our hope is that the Growth team will learn enough to deploy the feature on mobile web. Based on the success and lessons of the Growth team's deployment, the Android team will refine the MVP and turn it into a feature that produces edits to Wikipedia.

Product Requirements

As a first step in the implementation of this project, the Android team will develop an MVP with the purpose of:

  1. Improving the Image Matching Algorithm developed by the research team by answering the question "how accurate is the algorithm?" We want to set confidence levels for the sources in the algorithm, so that we can say that suggestions from Wikidata are X% accurate, suggestions from Commons categories are Y% accurate, and suggestions from other Wikipedias are Z% accurate.
  2. Learning about our users by evaluating:
    • The stickiness of Add an Image across editing tenure, Commons familiarity, and language
    • The difficulty of Add an Image as a task, and whether we can determine if certain matches are harder than others
    • The implications of language preference on the ability to complete the task
    • The accuracy of users’ judgments of the matches; because we’re not sure how accurate users are, we want to receive multiple ratings on each image match (i.e. “voting”)
    • The optimal design and user workflow to encourage accurate matches and task retention
    • What, if any, measures need to be in place to discourage bad matches

How to Follow Along

We have created T272872 as our Phabricator Epic to track the work of the MVP. We encourage your collaboration there or on our Talk Page.

There will also be periodic updates to this page as we make progress on the MVP.

Updates

2021 Jun 25 - Final Report and Next Steps

The Android team completed the Train Image Algorithm experiment. The findings can be found below. There were enough favorable insights from the experiment that the Growth team decided to proceed with the next phase of this work. You can read more about the Growth team building a Mobile Web feature to place images in articles on their project page. In the interim, the Android team will sunset the Train Image Algorithm task, and will add an Image Recommendations task to Suggested Edits based on the work from the Growth team.


The two most important questions to answer in making a decision to proceed with image recommendations work for newcomers are around engagement and efficacy. Each of those has more detailed questions underneath.


Engagement: do users like this task and want to do it?

  • Edits per session: do users do many of these edits in a row?
  • Retention: do users return on multiple days to do the task again?
  • Algorithm experience: is the algorithm accurate enough that users feel productive, but not so accurate that they feel superfluous?
  • Qualitative: is there anything we can see about the task in Play Store comments?

Efficacy: will resulting edits be of sufficient quality?

  • Accuracy of algorithm: what is the baseline accuracy before users are involved?
  • Algorithm improvement: what did we learn about the algorithm’s weak points?
  • Judgment: can newcomers identify the good matches from the bad, thereby improving the overall accuracy of the feature placing images on articles?
  • Effort: do newcomers seem to spend adequate time and care evaluating each match?

Engagement: do users like this task and want to do it?

Edits per session: we want to see users do many of these edits in a row, indicating that they like the task enough to keep on going.

  • On average, users complete about 9 annotations each and about 10 annotations per session.
  • We want to compare to the other Android tasks, using a 30-day sample of data from logged-in Suggested Edits editors only.
  • We want to look at these numbers for English and Non-English users, if possible.
  • Note on positive reinforcement: the experience recommends that users do 10 per day as their “daily goal”.  Perhaps the fact that this number is close to 10 is an indication that the daily goal is influencing users. 
Average Edits per Unique User:
Task | All users | English | Non-English
Image rec | 11 | 11 | 11
Desc add | 20 | 9 | 20
Desc change | 8 | 3 | 8
Img caption add | 6 | 6 | 6
Img tag add* | 7 | NA | NA
Desc translate | 18 | 11 | 19
Img caption translate | 7 | 4 | 7

*Image tag edits are on Commonswiki; we don’t track language for those edits

Retention: we want to see users return on multiple days to do the task again.

  • The most recent methodology is in this Phab comment, on how to make an apples-to-apples comparison between the various Android tasks.
  • Using a 30-day sample of data from logged-in Suggested Edits editors only.
  • We want to compare to the other Android tasks.
  • We want to look at these numbers for English and Non-English users, if possible. (A retention-calculation sketch follows the tables below.)
All users
Task | 1 day | 3 day | 7 day | 14 day
Image rec | 8.7% | 6% | 3.8% | 1.7%
Desc add | 39.2% | 34.2% | 26.7% | 18.3%
Desc change | 32.6% | 28.0% | 22.8% | 15.9%
Img caption add | 20.7% | 16.5% | 11.8% | 7%
Img tag add | 17.8% | 13.2% | 8.8% | 4.3%
Desc translate | 30% | 23% | 16.1% | 6.1%
Img caption translate | 20.8% | 14.6% | 9.7% | 2.8%

English

Task | 1 day | 3 day | 7 day | 14 day
Image rec | 8.7% | 5.9% | 3.7% | 1.6%
Desc add | 30% | 23.3% | 18.9% | 11.1%
Desc change | 26.9% | 19.2% | 15.4% | 7.7%
Img caption add | 19.1% | 15% | 10.8% | 5.9%
Desc translate | 17.65% | 11.8% | 5.9% | 0%
Img caption translate | 7.7% | 3.8% | 3.8% | 0%

Non-English

Task | 1 day | 3 day | 7 day | 14 day
Image rec | 8.1% | 5.7% | 3.6% | 2%
Desc add | 40.1% | 33.9% | 27% | 18.3%
Desc change | 34.5% | 28.7% | 23.2% | 16.7%
Img caption add | 19.4% | 16.2% | 11.6% | 7.7%
Desc translate | 27.7% | 20.9% | 13.1% | 3.9%
Img caption translate | 21.6% | 15.1% | 10.1% | 2.9%
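
The retention figures above can be reproduced from raw event data once a precise definition is fixed (the exact methodology lives in the Phab comment linked above). The sketch below is only an illustration: it assumes a hypothetical list of (user, task, date) rows and treats a user as retained at N days if they do the same task again N or more days after first doing it.

```python
from collections import defaultdict
from datetime import date

# Hypothetical event records: (user_id, task_type, activity_date).
events = [
    ("u1", "image_rec", date(2021, 5, 3)),
    ("u1", "image_rec", date(2021, 5, 10)),
    ("u2", "image_rec", date(2021, 5, 4)),
]

def retention(events, task_type, windows=(1, 3, 7, 14)):
    """Share of users who return N or more days after first doing the task."""
    days_by_user = defaultdict(set)
    for user, task, day in events:
        if task == task_type:
            days_by_user[user].add(day)
    results = {}
    for n in windows:
        retained = sum(
            1
            for days in days_by_user.values()
            if any((d - min(days)).days >= n for d in days)
        )
        results[n] = retained / len(days_by_user) if days_by_user else 0.0
    return results

print(retention(events, "image_rec"))
# With the toy data above: {1: 0.5, 3: 0.5, 7: 0.5, 14: 0.0}
```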

Algorithm experience: is the algorithm accurate enough that users feel productive, but not so accurate that they feel unnecessary?

  • If users were saying “yes” or “no” over 90% of the time, we might worry that they’re bored.  If they say “unsure” more than a third of the time, we might worry that they’re frustrated.
  • Users say “yes” 65% of the time, “no” 20% of the time, and “not sure” 15% of the time.  In other words, they perceive the algorithm to be correct about two-thirds of the time, and they’re only unsure rarely. 
  • It would be helpful to find research from industry or academia on how to think about and tune this ratio.
Response | All users | English | Non-English
Yes | 65% | 65% | 64%
No | 20% | 19% | 22%
Not sure | 15% | 16% | 14%

Efficacy: will resulting edits be of sufficient quality?

Accuracy of algorithm: what is the baseline accuracy before users are involved?

  • Our best estimate comes from the SDAW test, which tested in six languages, and ranges from 65-80% accurate depending on whether you count “Good” or “Good+Okay”, and depending on the wiki/evaluator (source).
  • The three sources in the algorithm have substantially different accuracy (source) and make up different shares of the coverage (source):
Source | Accuracy (good) | Accuracy (good+okay) | Share of coverage
Wikidata | 85% | 93% | 7%
Cross-wiki | 56% | 76% | 80%
Commons category | 51% | 76% | 13%
All | 63% | 80% | 100%
  • Through the Android MVP, experts evaluated 2,397 matches. On average, experts assessed 76% of the matches to be correct. This is in line with the results above.
  • WMF staff also manually evaluated 230 image matches which were marked as “correct” by newcomer editors (<50 edits). We found that 80% of these matches are actually correct, which is in line with the numbers above.
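
One of the MVP's stated goals is to attach confidence levels to each suggestion source. As a rough, hypothetical sketch only (the record format and the choice of a Wilson score interval are assumptions, not the team's actual pipeline), per-source accuracy plus a 95% confidence interval could be computed from expert judgments like this:

```python
import math
from collections import defaultdict

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a proportion."""
    if n == 0:
        return (0.0, 0.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)

# Hypothetical annotation records: (suggestion_source, expert_said_correct).
annotations = [
    ("wikidata", True), ("wikidata", True), ("wikidata", False),
    ("cross-wiki", True), ("cross-wiki", False),
    ("commons-category", True),
]

counts = defaultdict(lambda: [0, 0])  # source -> [correct, total]
for source, correct in annotations:
    counts[source][0] += int(correct)
    counts[source][1] += 1

for source, (correct, total) in counts.items():
    low, high = wilson_interval(correct, total)
    print(f"{source}: {correct}/{total} correct, 95% CI [{low:.0%}, {high:.0%}]")
```

With real volumes (thousands of expert judgments per source), the intervals become narrow enough to quote "suggestions from source X are Y% accurate" with stated confidence.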

Algorithm improvement: what did we learn about the algorithm’s weak points?

  • What is the distribution of responses for the follow-up questions for “no” and “not sure”?
  • We want to look at these numbers for English and Non-English users, if possible.

“No” responses

Response | All users | English | Non-English
Not relevant | 5,094 (43%) | 4,034 (45%) | 1,060 (37%)
Not enough information | 2,014 (17%) | 1,465 (16%) | 549 (19%)
Offensive | 159 (1%) | 108 (1%) | 51 (2%)
Low quality | 969 (8%) | 734 (8%) | 235 (8%)
Don’t know this subject | 1,132 (10%) | 807 (9%) | 325 (11%)
Cannot read the language | 752 (6%) | 554 (6%) | 198 (7%)
Other | 1,674 (14%) | 1,210 (14%) | 464 (16%)

“Not sure” responses

Response | All users | English | Non-English
Not enough information | 2,147 (24%) | 1,742 (24%) | 405 (23%)
Can’t see image | 267 (3%) | 213 (3%) | 54 (3%)
Don’t know this subject | 4,178 (47%) | 3,325 (46%) | 853 (48%)
Don’t understand the task | 284 (3%) | 215 (3%) | 69 (4%)
Cannot read the language | 1,095 (12%) | 895 (12%) | 200 (11%)
Other | 996 (11%) | 785 (11%) | 211 (12%)

Judgment: can newcomers identify the good matches from the bad, thereby improving the overall accuracy of the feature placing images on articles?

  • Comparison with WMF staff annotations
    • 80% of the matches for which newcomers said "yes" are actually good matches.
    • This number goes up to 82-83% when we remove newcomers who have a very low median time for evaluations.
    • The algorithm alone is 65-80% accurate, and algorithm + newcomers is about 80% accurate; since we think we can boost that further by screening out the worst newcomers (those who go too fast and those who say yes too often), newcomer + algorithm could perhaps reach 85% or more (see the screening sketch after this list).
    • 85% of the matches for which Avg/Expert users said "yes" are actually good matches.
  • Comparison with expert users (users with 1000+ Wikipedia edits)

For images labeled as "good matches" by newcomers | Experts agree 74.9% of the time
For images labeled as "bad matches" by newcomers | Experts agree 51.8% of the time

Label | Description | % positive responses | % users with "all yes" | % users with “all yes” and 5+ annotations
new | <50 edits | 76.90% | 40.40% | 30.06%
expert | >=1000 edits | 76.15% | 22.45% | 12.82%
avg | otherwise | 73.63% | 17.12% | 14.21%

  • There was agreement amongst users.
  • Newcomers are more likely to select yes than experienced users.
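
The screening idea above (dropping annotators with a very low median evaluation time or who say "yes" almost always) can be expressed as a small filter over the annotation data. This is a hypothetical sketch with made-up thresholds and record layout, not the team's actual analysis:

```python
from statistics import median

# Hypothetical per-annotation records: user, their response, seconds spent,
# and whether WMF staff later judged the underlying match to be correct.
annotations = [
    {"user": "u1", "response": "yes", "seconds": 2.1, "match_is_good": False},
    {"user": "u2", "response": "yes", "seconds": 9.4, "match_is_good": True},
    {"user": "u2", "response": "no",  "seconds": 7.8, "match_is_good": False},
]

def screened_users(annotations, min_median_seconds=5.0, max_yes_rate=0.95):
    """Return users to exclude: too fast, or saying 'yes' nearly always.
    Thresholds are illustrative only."""
    excluded = set()
    for user in {a["user"] for a in annotations}:
        mine = [a for a in annotations if a["user"] == user]
        yes_rate = sum(a["response"] == "yes" for a in mine) / len(mine)
        if median(a["seconds"] for a in mine) < min_median_seconds or yes_rate > max_yes_rate:
            excluded.add(user)
    return excluded

def yes_precision(annotations, excluded=frozenset()):
    """Share of kept 'yes' judgments that point at genuinely good matches."""
    kept = [a for a in annotations
            if a["user"] not in excluded and a["response"] == "yes"]
    return sum(a["match_is_good"] for a in kept) / len(kept) if kept else None

print(yes_precision(annotations))                               # before screening
print(yes_precision(annotations, screened_users(annotations)))  # after screening
```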

Effort: do newcomers seem to spend adequate time and care evaluating each match?

  • What percent of users have a mean response time of less than five seconds? (A sketch of this user-level calculation follows the first table below.)

All users

User | Mean (s) | Median (s) | % of users with <5s response time
Newcomer (<50 edits) | 9.6 | 7.6 | 31.7%
Medium (>=50 and <1000 edits) | 10.2 | 8.5 | 11.2%
Expert (>=1000 edits) | 11.6 | 9.6 | 13.2%

This table is at the task level (not the user level).

  • The more experienced someone is, the more time they spend evaluating
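
Since these tables are at the task level while the "% of users with <5s response time" column is a user-level quantity, here is a minimal sketch contrasting the two aggregations, again over an assumed per-task record format:

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical per-task records: (user_group, user_id, seconds spent on the task).
tasks = [
    ("newcomer", "u1", 3.2),
    ("newcomer", "u1", 4.1),
    ("newcomer", "u2", 12.0),
    ("expert",   "u3", 11.5),
]

# Task-level summary, as in the tables above: aggregate over every task.
by_group = defaultdict(list)
for group, _user, seconds in tasks:
    by_group[group].append(seconds)
for group, secs in by_group.items():
    print(group, "mean", round(mean(secs), 1), "median", round(median(secs), 1))

# User-level summary: average within each user first, then ask what share of
# users in a group has a mean response time under five seconds.
per_user = defaultdict(list)
for group, user, seconds in tasks:
    per_user[(group, user)].append(seconds)

fast_users = defaultdict(lambda: [0, 0])  # group -> [fast users, total users]
for (group, _user), secs in per_user.items():
    fast_users[group][0] += int(mean(secs) < 5.0)
    fast_users[group][1] += 1
for group, (fast, total) in fast_users.items():
    print(group, f"{fast / total:.0%} of users respond in under 5s on average")
```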

English

User | Mean (s) | Median (s)
Newcomer (<50 edits) | 9.5 | 7.4
Medium (>=50 and <1000 edits) | 10.1 | 8.2
Expert (>=1000 edits) | 11.2 | 9.3

This table is at the task level (not the user level).

Non-English

User | Mean (s) | Median (s)
Newcomer (<50 edits) | 9.8 | 8.2
Medium (>=50 and <1000 edits) | 10.8 | 9.2
Expert (>=1000 edits) | 13.1 | 12.0

This table is at the task level (not the user level).

  • How often do users open the article to read more, and open the image to see details?

All users

User | Percent open article | Percent open image
Newcomer (<50 edits) | 1.4% | 16.4%
Medium (>=50 and <1000 edits) | 1.2% | 20.5%
Expert (>=1000 edits) | 0.9% | 15.7%

This table is at the task level (not the user level).

English

User | Percent open article | Percent open image
Newcomer (<50 edits) | 1.3% | 16.2%
Medium (>=50 and <1000 edits) | 1.1% | 21.4%
Expert (>=1000 edits) | 1.0% | 17.4%

This table is at the task level (not the user level).

Non-English

User | Percent open article | Percent open image
Newcomer (<50 edits) | 1.7% | 17.3%
Medium (>=50 and <1000 edits) | 1.9% | 16.5%
Expert (>=1000 edits) | 0.5% | 6.4%

This table is at the task level (not the user level).

2021 May 25 - Initial Data Insights

The Android team met with members of the Growth, Platform Engineering and Research teams for a high-level review of our data thus far and to determine what adjustments we should make now for the MVP, as opposed to in later phases of this project.

With the experiment officially running for two weeks, the Train Image Algorithm task has received engagement from over 2,800 unique users on over 20,000 image titles across several language wikis. Below are the language wikis with at least 200 completed tasks, in order by the number of tasks completed:

  • English
  • German
  • Turkish
  • French
  • Portuguese
  • Spanish
  • Persian
  • Arabic
  • Russian
  • Italian
  • Hebrew
  • Ukrainian
  • Czech
  • Vietnamese

The average number of Train Image Algorithm tasks completed per day by a user is 10, which is consistent with the daily goal set in the feature by the team. This tells us that participants in this task are motivated by the daily goal, a positive reinforcement element unique to Suggested Edits.

The Train Image Algorithm feature appears to be popular with both new users and power editors.

47.85% of contributors to this task downloaded the app 30 or fewer days ago, while 20.86% of users completing the Train Image Algorithm task have more than 50 edits across platforms.

2021 May 7 - Production Release

The team incorporated minor tweaks to the Beta version and released the Train Image Algorithm task to the production version of the Wikipedia app. In two weeks we will do a check to ensure data is coming in the way it should and share a few initial insights. We will also monitor our android-support@wikimedia.org email, the Play Store and our Phabricator board for any bugs that may arise.

2021 April 27 - Release to Beta and FAQ page

The team incorporated user testing feedback and released the feature to Beta. Our QA Analyst will review the feature in Beta for the rest of the week, and if there are no major blockers, the feature will become available in the production version of the app. We also created an FAQ page which is accessible in the app. We encourage feedback on this project's talk page.

 
Screenshot: Train Image Algorithm task on Beta

2021 April 5 - User Testing Prioritization

Based on our analysis of the user testing feedback, the team is making updates to the prototype ahead of the release of the MVP at the end of the month. The tweaks we are making, which are captured in T272872, include:

Required

  • T278455 The bottom sheet for image suggestions needs to be draggable in order to reveal the article contents below it. Also, participants tried to interact with the handle bar at the top of the bottom sheet.
    • If draggable sheet is not feasible: Consider a max height of the bottom sheet in order to not cover the article completely.
  • T278490 Optimize tooltip positioning and handling, as tooltips are cut off on smaller screens.
  • T278493 Ensure words are not cut off and that text overflows gracefully
  • T278526 Create more suitable 'Train image algorithm' onboarding illustrations for all different themes.
  • T278527 The checkbox items in the 'No' and 'Not sure' dialogs have issues in the dark/black theme and need to be optimized.
  • T278528 The positive reinforcement element/counter has display issues in the dark/black theme and needs to be optimized.
  • T278529 Provide an easy way to access the entire article from the feed, e.g. by incorporating a 'Read more' link, tappable article title or showing the entire article right from the beginning.
  • T278494 Optimize copy 'Suggestion reason' meta information as the current copy ('Found in the following Wiki: trwiki') is not clear enough.
  • T278530 It might be worth exploring making the 'Suggestion reason' more prominent, as participants rated its usefulness the lowest (likely due to low discoverability)
  • T278532 Optimize the 'No' and 'Not sure' dialog copy to reflect that multiple options can be selected. Some participants weren’t aware that multiple reasons can be selected.
  • T278496 Optimize copy of the 'opt-in' onboarding screen, as there’s an unnecessary word at the moment ('We would you like (...)').
  • T278497 Suppress “Sync reading list” dialog within Suggested edits as it’s distracting from the task at hand.
  • T278501 Incorporate gesture to swipe back and forth between image suggestions in the feed, as participants were intuitively applying the gestures.
  • T278533 Optimize design of the positive reinforcement element/counter on the Suggested edits home screen, as it was positioned too close to the task’s title.
  • T275613 Write FAQ page
  • T278534 Make it clear that reviewing the image metadata is a core part of the task. We can potentially do that by increasing the visual prominence and/or increase the affordance to promote always opening the metadata screen.
  • T278535 Optimize the discoverability of 'info' button at the top right as 2/5 participants had issues finding it.
  • T278555 Save previous answer state: Given users are able to go back, the selection made in the previous image or images should be retained
  • T278556 Reduce the font-size of the fields of the More details screen
  • T278545 Change the goal count to 10/10

Nice to Have

  • T278546 Add "Cannot read the language" as a reason for rejection and unsure
  • T278557 Show the full image instead of a cropped image
  • T278548 Include the same metadata as in the card, notably the suggestion reason (in addition to filename, image description and caption), on the More details screen as well.
  • T278549 Show success screen (see designs on Zeplin) when users complete daily goal (10/10 image suggestions)
  • T278550 Explore tooltip "Got it" button
  • T278552 Incorporate pinch to zoom functionality, as participants tried to zoom the image directly from the image suggestions feed.
  • T278558 Remove full screen overlay when transitioning to next image suggestion. This allows users to orient better and keep context after submitting an answer.
  • T278561 Provide clear information that images come from Commons, or some more overt message about the image source and access to more metadata

2021 March 25 - User Testing Analysis

The team released an update to production that included minor bug fixes for TalkPage and Watchlist. We also show non-main namespace pages in-app through a mobile web treatment.

The Android team leveraged usertesting.com to gain a better understanding of what aspects of the Image Recommendations MVP worked well and what things should be improved prior to release in English, German, French, Portuguese, Russian, Persian, Turkish, Ukrainian, Arabic, Vietnamese, Cebuano, Hebrew, Hungarian, Swedish, Polish, Czech, Basque, Korean, Serbian, Armenian, Bangla and Spanish.

We completed the analysis in partnership with the Growth team. Below is the Android team analysis.

Analysis of tasks T277861

🥰 = Good — Participant had no issues
😡 = Bad — Participant had issues
🤔 = Not sure if good or bad — Participant might have had difficulties understanding the question, did not explicitly interact with it, or ignored the task completely
Participant | #6 | #9 | #10 | #11 | #13 | #15 | #16 | #17 | #18 | #19 | #20 | #21 | #22 | #23 | #25 | #26 | #27 | #28 | #30
Battybrit | 🤔 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🤔 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🤔 | 🥰 | 🥰 | 🥰
brad.s | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰
147qb | 🤔 | 🥰 | 🥰 | 😡 | 🥰 | 🤔 | 🤔 | 🥰 | 😡 | 🥰 | 🤔 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 😡 | 🥰 | 🥰
TestMaster888 | 🤔 | 🥰 | 🥰 | 🥰 | 🤔 | 🥰 | 🤔 | 🥰 | 🥰 | 🥰 | 🤔 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰
Cherry 928 | 😡 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🤔 | 🥰 | 🥰
Overall evaluation | 😡 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🤔 | 🥰 | 🥰 | 🤔 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🥰 | 🤔 | 🥰 | 🥰

Onboarding and understanding of Suggested edits

Task #6 Do participants understand the tooltip? 😡

  • 2/5 discovered the tooltip but had issues understanding it.
  • 2/5 did not see the tooltip since it disappeared too quickly.
  • 1/5 discovered and understood the tooltip completely.

Task #9 Can participants explain the difference between tasks? 🥰

  • 5/5 were able to explain their understanding of the tasks in a sufficient way.

Task #10 Do participants understand what the 'Train image algorithm' task is all about? 🥰

  • 5/5 were able to describe the task in their own words well.

Task #11 What do participants associate with the robot icon? 🥰

  • 4/5 associated the robot icon with an algorithm, artificial intelligence (AI) or computer program
  • 1/5 didn’t know what it means

Train AI task - Onboarding and understanding

Task #13 Do participants understand the two onboarding screens? 🥰

  • 4/5 understand both onboarding screens.
  • 1/5 wasn’t reacting to the second onboarding screen (opt-in).

Task #15 How do participants interact with onboarding tooltips? 🥰

  • 3/5 understand the task due to the tooltips.
    • 1/5 mentioned that the tooltips are very helpful to understand the task.
  • 1/5 understands the task but did not pay attention to the tooltips.
  • 1/5 probably did not see or understand the tooltips.

Task #16 Is the tooltip copy clear enough? How’s the timing and positioning of the tooltips on various devices / screen sizes? 🤔

  • 3/5 read and understand the tooltip copy.
  • 2/5 did not interact with the tooltips.
  • 2/5 had tooltip display issues on a smaller phone.
  • 1/5 likes that the tooltip mentions the impact (help readers understand a topic)

Task #17 Do participants know what to do after all these onboarding measures? 🥰

  • 5/5 understand what to do now.

Train images task

Task #18 Do participants interact with the prototype naturally? 🥰

  • 4/5 are mostly comfortable interacting with the UI and make educated decisions.
  • 3/5 do not navigate to the file page without being prompted.
  • 2/5 navigate between the article and file page intuitively and without issues.
  • 1/5 is intimidated about making decisions that affect Wikipedia articles, doesn’t know how to interact with the article (RS: possibly due to small screen size) and doesn’t use the file detail page intuitively.

Task #19 Do participants know how to navigate to the file detail page? 🥰

  • 5/5 successfully navigated to the file detail page after being prompted.
  • 1/5 tapped the 'info i' icon in the feed view first.

Task #20 How helpful is the meta information on the file detail page? 🥰

  • 3/5 consider the information on the file page as helpful.
  • 2/5 mention that the author is helpful.
  • 2/5 mention that the date is helpful.
  • 1/5 mentions that licensing info is helpful.
  • 1/5 mentions that the image description is helpful.

Task #21 Do participants know how to enlarge / zoom an image? 🥰

  • 5/5 tapped the image and used a pinch to zoom gesture to zoom the image.
  • 2/5 tried to zoom the image directly from the feed experience.

Task #22 Do participants know how to go back and forth between image suggestions? 🥰

  • 5/5 use swipe gestures to navigate back and forth between image suggestions.
  • 2/5 tapped the back button at the top left before using the swipe gesture.
  • 1/5 tapped the 'info i' button at the top right before using the swipe gesture.

Task #23 Do participants understand the 'Not sure' options? 🥰

  • 5/5 understand the 'Not sure' options.
  • 3/5 were selecting multiple reasons at once.

Task #25 Do participants understand the 'No' options? 🥰

  • 5/5 understand the 'No' options.

Task #26 Do participants scroll or know how to reveal more of the article contents? 🥰

  • 4/5 were successful in scrolling the article to reveal more information
  • 2/5 wanted to use the pull indicator at the top of the image suggestion to reveal the article below before they scrolled the article
  • 2/5 tried to tap the article title (1/5 scrolled afterwards)
  • 1/5 looked for a 'More' button to reveal more of the article’s content, then tapped the 'info i' button at the top right

Task #27 Do participants know how to access the FAQ? 🤔

  • 3/5 tap the 'info i' button at the top right to reveal the FAQ.
  • 1/5 explained that she would tap the back button and look for an FAQ there (RS: a possible path to success as there’s an FAQ section in the SE home screen)
  • 1/5 did not notice the 'info i' button at the top right

Task #28 How do participants interpret the element of positive reinforcement? 🥰

  • 5/5 understand what it is and identified the element as motivational, encouraging and/or daily goal
  • 1/5 wasn’t 100% sure about it but then identified it as a motivational element.

Task #30 Do participants notice the element of positive reinforcement that has been added to the card? 🥰

  • 5/5 participants identified the added progress indication in the card

3. Analysis of rating scale

1 = Not at all useful information; 5 = Very useful information

Participant | First paragraph (#36) | Description (#33) | Filename (#32) | Caption (#34) | Suggestion reason (#35)
Battybrit | 5 | 5 | 5 | 5 | 4
brad.s | 5 | 5 | 4 | 5 | 3
147qb | 5 | 4 | 3 | 4 | 3
TestMaster888 | 5 | 4 | 3 | 4 | 2
Cherry 928 | 5 | 5 | 5 | 2 | 2
Overall rating | 5 | 4.6 | 4 | 4 | 2.8

4. Analysis of follow-up questions

1. How do you think the suggested images for articles are being found? And how would you rate the overall quality of the suggestions?

  • 5/5 mentioned that the images presented were relevant.
  • 4/5 associated the image suggestions with an algorithm or computer program.
  • 2/5 mentioned that the suggestions are associated with keywords.
  • 1/5 mentioned these are random suggestions.

2. Was there anything that you found frustrating or confusing, that you would like to change about the way this tool works?

  • 3/5 replied that it’s easy to use.
  • 1/5 replied that it’s tedious and cumbersome.
  • 1/5 suggested to show more than 1 image choice per article.

3. How easy or hard did you find this task of reviewing whether images suggested were a good match for articles?

  • 4/5 find it very easy to evaluate if it’s a good match for the article.
  • 1/5 think it’s hard and time consuming but well worth it.

4. Would you be interested in adding images to Wikipedia articles this way? Please explain why or why not.

  • 4/5 are interested in such a feature
  • 1/5 mentions he would not be interested
  • 1/5 mentions that she wants to know how accurate she is when reviewing images

2021 February 23 - Finalizing Designs ahead of Usability Testing

The Android team has created designs that are currently being turned into a prototype for usability testing prior to deployment.

Once the prototype is created for user testing we will update this page with a link that anyone following along with this project can use and provide us feedback on our talk page.

2021 February 1 - Designs, Product Decisions and APIs

This week the Platform Engineering Team began building the API needed for this project, with completion projected for early March, which is when we hope to deploy the MVP.

There were open product questions, which the team's new Product Manager answered in T273055.

Initial Product Decisions

  • We will have one suggested image per article instead of multiple images.
  • This iteration of the MVP will not include image captions.
  • There are no language constraints for this task: as long as there is an article available in the language, we will surface it. We want to be deliberate in ensuring this task is completed in a variety of languages. For this MVP to be considered a success, we want the task completed in at least five different languages, including English, an Indic language and a Latin-script language.
  • We will have a checkpoint two weeks after the launch of the feature to check whether the feature is working properly and whether modifications need to be made to ensure we are getting answers to our core questions. The checkpoint is not intended to introduce scope creep.
  • We aren't able to filter by article categories in this iteration of the MVP, but it could be a possibility in the future through the PET API.
  • We will surface a survey each time a user says No to a match, and sparingly surface a survey when a user clicks Not Sure or Skip.
  • We need three annotations from 3000 different users on 3000 different matches. By having these three annotations, the tasks will self-grade (a minimal aggregation sketch follows this list).
  • We will know people like the task if they return to complete it on three distinct dates. We will compare frequency of return by date across user types to understand whether this task is stickier for more experienced users.
  • Once we pull the data we will be able to compare the habits of English vs. non-English users. We cannot, and do not need to, show the same image to both non-English and English users; non-English users will have different articles and images. We will know whether a task was hard because of language from the survey responses when users click No or Not Sure. We will check task retention to see how popular the task is by language.
  • In order to know whether the task is easy or hard, we would like to be able to see how long it takes users to complete it. Note: this only works if we can see whether someone backgrounds the app. Of the people that got it right, how long did it take them?
  • In order to know whether the task is easy or hard, we should also track whether users click to see more information about the task before making a decision.
  • We determined that it is not worth adding extra clicks just to learn which metadata users find helpful. Perhaps we allow people to swipe up for more information, which generally provides the metadata; we will need to see designs to compare this.
  • It is too hard, at least for this MVP, to track whether experienced users see suggestions in this tool and then add the images to articles manually outside of it, so we aren't going to track that.
  • In the designs we want to track whether someone skips or presses No on an image because the image is offensive, in order to learn how often NSFW or offensive material appears.
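
The decisions above do not specify how the three annotations per match are combined; as an illustration only, a simple majority vote (with matches held as "pending" until three responses arrive) could look like the following sketch, where the record format is hypothetical:

```python
from collections import Counter, defaultdict

# Hypothetical annotation stream: (match_id, user_id, response).
annotations = [
    ("match-1", "u1", "yes"), ("match-1", "u2", "yes"), ("match-1", "u3", "no"),
    ("match-2", "u4", "no"),  ("match-2", "u5", "not_sure"), ("match-2", "u6", "no"),
]

def consensus(annotations, required=3):
    """Collapse per-match responses into a single label once enough arrive."""
    votes = defaultdict(list)
    for match_id, _user, response in annotations:
        votes[match_id].append(response)
    results = {}
    for match_id, responses in votes.items():
        if len(responses) < required:
            results[match_id] = "pending"  # keep serving until 3 votes arrive
            continue
        label, count = Counter(responses).most_common(1)[0]
        results[match_id] = label if count > len(responses) / 2 else "no consensus"
    return results

print(consensus(annotations))
# With the toy data above: {'match-1': 'yes', 'match-2': 'no'}
```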

The Android Designer began work on mockups for the MVP and has started to receive feedback at T269594. The user stories the designer is creating mockups for include:

2.1. Discovery

When I am using the Wikipedia Android app, am logged in,

and discover a tooltip about a new edit feature,

I want to be educated about the task,

so I can consider trying it out.

2.2. Education

When I want to try out the image recommendations feature,

I want to be educated about the task,

so my expectations are set correctly.

2.3. Adding images

When I use the image recommendations feature,

I want to see articles without an image,

I want to be presented with a suitable image,

so I can select images to add to multiple articles in a row.

2.4. Positive reinforcement

When I use the image recommendations feature,

I want feedback/encouragement that what I am doing is right/helping,

so that I am motivated to do more.