Wikimedia Technology/Annual Plans/FY2019/TEC5: Scoring Platform

The Scoring Platform team is an experimental, research-focused, community-supported, AI-as-a-service team. Our work focuses on balancing the efficiency that machine classification strategies bring to wiki processes with transparency, ethics, and fairness. Our primary platform is ORES, an AI service that supports wiki processes such as vandal fighting, gap detection, and new page patrolling. The current set of ORES-supported products is loved by our communities, and our team's work of relieving overloaded community processes with AI has shown great potential to enable conversations about growing our community (Knowledge Equity). In this proposal, we'll describe what we think we can accomplish given our current, minimal staffing. We'll also propose to fully staff the team along the lines of the original FY2018 Scoring Platform proposal so that we can expand our capacity in the critical area of bias detection and mitigation.

Overview of FY2018

Last year, we invested in the Scoring Platform team by giving Aaron a budget and staffing the team with Adam Wight as a senior engineer (80%) and Amir Sarabadani as a junior engineer (50%). We also retained a contracting budget for hiring experts to develop new AIs and evaluation strategies. In total, we have a staff of 2.55 FTEs.

Despite this minimal staffing, the team has been quite successful.

  • Delivered many more models to many more wikis (targeting emerging communities, increasing capacity for knowledge equity)
  • Deployed ORES on a dedicated cluster and refactored the ORES extension (more uptime, evolving infrastructure)
  • Collaborated with Community Tech on a study of new page review issues, training and testing a critical technology for mitigating the issue (evolving our infrastructure and experimenting with new strategies for supporting newcomers)
  • Published papers about why people cite what they cite and the dynamics of bot governance (increasing our understanding of wiki processes)
  • Performed a community consultation and system design process for JADE, our proposed auditing support infrastructure

Contingency planning for FY2019

To deal with funding realities, we've prepared two annual plans for our department. The first presents our ideal plan, which includes a reasonable amount of growth. The second describes what we can accomplish if staffing levels cannot be improved.

Staffing increased as requested

Ask

Bring the team up to a higher level of capacity and robustness by:

  • Promoting Amir to a full-time requisition
  • Hiring an engineering manager/tech lead to take the management and technical leadership burden off of Aaron. This was proposed in our original plan for FY2018.

Benefits

  • Support more languages and wikis: we have a large backlog of requests for ORES support.
  • Bring our new auditing system, mw:JADE, online. Start tracking algorithmic bias, the kind of problem that keeps some potential contributors out, much more effectively (Knowledge Equity).
  • More robust ORES service.
  • Develop new prediction models more quickly (Knowledge as a Service). Many of the models we will target are intended to provide fertile ground for experimentation around the balance between efficient quality control and better newcomer support (Knowledge Equity).
  • Once Aaron is wearing fewer hats, he'll be less of a bottleneck for the team and will have more time to participate in thought leadership and outreach.

Staffing unchanged from FY2018 levels

Ask

  • Continue funding the Scoring Platform Team at FY2018 levels.

Benefits

In the next fiscal year, we will continue our work of making ORES more robust and expanding our prediction models to new wiki processes and under-served wiki communities.

  • Slowly increase model support to more wikis, prioritizing emerging communities.
  • Experiment with the new article routing models and expand them to more communities.
  • Publish datasets and papers about wiki processes and machine-based process augmentation.

Risks and challenges

  • While the Scoring Platform team has been able to collaborate effectively with volunteers in order to supplement its minimal resourcing, the fact is that the development of ORES (useful AIs) and JADE (our auditing system) has been slowed substantially by understaffing. We have the chance to help lead the industry on this front, but that opportunity may escape us.
  • Our bus factor is still far too low. Were we to lose the one full-time engineer on the team, development and deployments would nearly come to a halt. Or worse, if we were to lose Aaron, most of the knowledge of the team's infrastructure would leave with him.

Program outline

The following program outline describes the expanded set of goals that we think we can achieve with a fully staffed team. If we don't expand the team this year, then we'll need to limit our goals to outcomes 1 and 2.

Teams contributing to the program

Scoring Platform

Annual Plan priorities

Primary Goal: 1. Knowledge Equity: Grow new contributors and content

How does your program affect the annual plan priority?

Knowledge as a service
Basic machine learning support is essential to keep Wikipedia and other Wikimedia projects open at scale. Without machine learning support, quality control work and other curation activities are too cumbersome to maintain, and the urge to control access and limit newcomers becomes overwhelming. We've seen this in English Wikipedia with restrictions on anonymous editors and registered newcomers. Collaborations between the Scoring Platform team and product teams in Audiences represent the best hope we have of re-opening English Wikipedia and of ensuring that other communities don't close themselves off to new participants.
Knowledge equity
While these machine learning technologies are critical to continuing our mission, they also come with a great potential cost with regard to bias. This proposal includes a request for growth so that we can focus more effectively on the development of effective auditing technologies. These technologies ensure that the AIs we use to curate content are effectively accountable to our volunteer community. If we do not invest in the development of auditing technologies, we risk furthering our current inequities by encoding them in our prediction models. This is a great risk when it comes to quality control, because these algorithms are part of the decision system that controls who gets to contribute and who does not.

Through investments in this program, we hope to (1) boost the capacity of our communities to effectively curate content at the scale of human knowledge and (2) ensure that the systems we build serve as a force for good in Wikimedia.

Program Goal

Improve the efficiency of wiki processes and mitigate the effects of algorithmic biases that are introduced.

Outcome 1
More wiki communities benefit from semi-automated curation support
Output 1
ORES supports edit quality prediction models for more wikis/languages (an example scoring request is sketched after this outline)
Output 2
ORES supports draft quality and draft topic prediction models for more wikis/languages
Outcome 2
Grow the community of wiki decision process modelers and tool builders (staff, volunteers, academics)
Output 3
Published posts about ORES, AI, wiki processes, etc. on the Wikimedia blog
Output 4
Workshops run, papers published, datasets published, tutorials published, hackathons co-organized
Outcome 3
Users of ORES-based tools can build a repository of human judgments to contrast with model predictions
Output 5
JADE (our auditing system) accepts and stores human judgments
Output 6
JADE supports basic curation activities (reverts, suppression, watchlists) via MediaWiki integration
Outcome 4
Developers and volunteer analysts will be able to analyze trends in ORES bias.
Output 7
JADE data appears in mw:Quarry and public dumps
Output 8
Reports about ORES bias are published.
Outcome 5
Tool developers and product teams will be able to use JADE to help patrollers collaborate by providing a central location for noting which items have been reviewed and what the outcome of that review was.
Output 9
A stream of judgments is available for consumption by tools/products.
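
For illustration, here is a minimal sketch of how a tool might request the edit quality scores named in Output 1 from ORES. The v3 scores endpoint and the "damaging"/"goodfaith" model names follow the public ORES documentation; the exact response layout, the example wiki, and the example revision ID are assumptions made for this sketch.

    # Minimal sketch: ask ORES for edit quality scores on a single revision.
    # The endpoint and model names follow the public ORES documentation; the
    # response layout and the example revision ID are assumptions.
    import requests

    ORES_URL = "https://ores.wikimedia.org/v3/scores/{context}/"

    def score_revision(context, rev_id, models=("damaging", "goodfaith")):
        """Return ORES edit quality scores for one revision on one wiki (context)."""
        params = {"models": "|".join(models), "revids": rev_id}
        response = requests.get(ORES_URL.format(context=context), params=params, timeout=10)
        response.raise_for_status()
        scores = response.json()[context]["scores"][str(rev_id)]
        # Each model entry is expected to carry a prediction and class probabilities.
        return {model: scores[model]["score"] for model in models}

    # Hypothetical usage: the probability that an enwiki edit is damaging.
    print(score_revision("enwiki", 123456)["damaging"]["probability"])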

Resources

We can meet outcomes 1 and 2 with current resourcing. Outcomes 3, 4, and 5 will require the additional resourcing highlighted below.

People (OpEx)
  FY2017–18:
  • Principal research scientist
  • Senior engineer
  • 0.5 ✕ Junior engineer (contract budget)
  • Contracting budget for expert modelers/interns
  • 0.1 ✕ Tech Writer
  FY2018–19:
  • Principal research scientist (no change)
  • Senior engineer (no change)
  • Junior engineer (FTE conversion)
  • Contracting budget for expert modelers/interns (no change)
  • 0.5 ✕ Product manager (new hire, shared from WMCS)
  • 0.25 ✕ Tech Writer (new hire, shared from WMCS)
Stuff (CapEx)
  FY2017–18: New ORES cluster servers (Kubernetes nodes)
  FY2018–19: No substantial new hardware needed
Travel & Other
  FY2017–18:
  • 2 ✕ Wikimania
  • 2 ✕ Wikimedia Hackathon
  • 2 ✕ professional conference
  FY2018–19:
  • Wikimania: n/a (centralized)
  • 3 ✕ Wikimedia Hackathon (+1 for new hire)
  • 3 ✕ professional conference (+1 for new hire)

Targets

Outcome 1

More wiki communities benefit from semi-automated curation support
Target
2 new wikis with advanced edit quality models each quarter
At least one semi-automated tool adopts the draftquality and drafttopic prediction models (e.g., PageCuration)
Measurement method
  1. Count of wikis supported after models are deployed (see the counting sketch below).
  2. Tool developers self-report use of ORES prediction models.
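
A minimal counting sketch for measurement method 1, assuming the ORES root scores endpoint lists every supported wiki ("context") together with its deployed models; the response shape is an assumption based on the public API.

    # Count the wikis that have a given model deployed (e.g. "damaging").
    # Assumes https://ores.wikimedia.org/v3/scores/ maps each wiki context to
    # {"models": {...}}; treat that shape as an assumption.
    import requests

    def wikis_with_model(model_name="damaging"):
        response = requests.get("https://ores.wikimedia.org/v3/scores/", timeout=10)
        response.raise_for_status()
        return sorted(
            context
            for context, info in response.json().items()
            if model_name in info.get("models", {})
        )

    print(len(wikis_with_model("damaging")), "wikis with a damaging model")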

Outcome 2

Grow the community of wiki decision process modelers and tool builders (staff, volunteers, academics)

Target
Publish two papers in peer-reviewed journals about wiki process support with AI models
Publish two datasets used in modeling wiki processes
Publish two Wikimedia blog posts about modeling, auditing, and problems of scale
Recruit and train new developers at the hackathon events (Wikimedia Hackathon and Wikimania)
Measurement method
  1. Papers, datasets, and blog posts published by the team
  2. Wiki data science workshops organized
  3. Papers published that cite papers and datasets we release
  4. Count of collaborators recruited at hackathon & how many remain active (retained) on the mailing list (ai@lists) or the IRC channel (#wikimedia-ai)

Outcome 3

Users of ORES-based tools can build a repository of human judgments to contrast with model predictions

Target
A test JADE service is deployed in WMF Cloud
JADE is ready to be deployed into production wikis
JADE has revert/suppression/watchlist integrations in MediaWiki
Measurement method
  1. Track deployments
  2. The number of third-party tools that build JADE integrations (self-reported & discovered)

Outcome 4

Developers and volunteer analysts will be able to analyze trends in ORES bias.

Target
JADE data appears in mw:Quarry and in public database dumps
Publish at least one report about bias/non-bias in ORES using JADE data (a toy disagreement check is sketched below)
Measurement method
  1. Demo query in Quarry & inclusion in dumps.wikimedia.org
  2. Number of bias report publications
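
To make the bias reports concrete, here is a toy sketch of the kind of disagreement check such a report might run: comparing ORES false positive rates across editor groups using paired model predictions and human judgments. The records, group names, and field layout are fabricated placeholders for illustration, not real JADE data.

    # Toy bias check: compare ORES "damaging" false positive rates by editor group.
    # All records below are fabricated placeholders, not real JADE data.
    from collections import defaultdict

    # (editor_group, ores_predicted_damaging, human_judged_damaging)
    records = [
        ("anonymous", True, False),
        ("anonymous", False, False),
        ("newcomer", True, False),
        ("registered", False, False),
    ]

    def false_positive_rate(rows):
        """Share of edits humans judged good that ORES flagged as damaging anyway."""
        good = [row for row in rows if not row[2]]
        if not good:
            return None
        return sum(1 for row in good if row[1]) / len(good)

    by_group = defaultdict(list)
    for row in records:
        by_group[row[0]].append(row)

    for group, rows in sorted(by_group.items()):
        print(group, false_positive_rate(rows))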

Outcome 5

Tool developers and product teams will be able to use JADE to help patrollers collaborate by providing a central location for noting which items have been reviewed and what the outcome of that review was.
Target
JADE judgments appear in mw:EventStreams
JADE judgments appear alongside predictions in ORES
At least one tool adopts JADE for distributed coordination between patrollers
Measurement method
  1. Inclusion in EventStreams (see the consumption sketch after this list)
  2. Deployment of JADE data to ORES
  3. Developers report the usage of JADE data in curation tools
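
As a sketch of measurement method 1, a patrolling tool could consume judgments over the existing EventStreams service once JADE events are published there. The SSE client pattern below mirrors the documented EventStreams examples; the "jade.judgment" stream name is a hypothetical placeholder, since no such stream exists yet.

    # Sketch: consume a (hypothetical) JADE judgment stream via EventStreams (SSE).
    # The client pattern mirrors documented EventStreams usage; the stream name
    # "jade.judgment" is a placeholder and does not exist yet.
    import json
    from sseclient import SSEClient as EventSource

    STREAM_URL = "https://stream.wikimedia.org/v2/stream/jade.judgment"  # hypothetical

    for event in EventSource(STREAM_URL):
        if event.event != "message" or not event.data:
            continue
        judgment = json.loads(event.data)
        # A tool could mark the judged revision as already reviewed here, so that
        # patrollers don't duplicate each other's work.
        print(judgment)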

Dependencies

We rely on Operations for help with basic hardware and service support. We'll also need minimal hardware resources for bringing JADE to production, should we decide to do so this year. We expect that JADE's primary systems resource usage will be in production MediaWiki, but we may want some minimal services producing novel event streams (judgments) and data dumps.

We'll rely on WMCS for support in hosting JADE-related datasets in public analytics infrastructure like PAWS and Quarry. This will help us make sure that JADE data is open for analysis by our volunteer communities.

We expect that Research will depend on us for datasets and for productionizing some of their experimental models.

We expect that Wikimedia Product teams in the Contributors department will depend on us to support their use of our prediction models (ORES) and auditing/distributed-curation support (JADE) in their tools.
