Wikimedia Developer Summit/2018/Research Analytics and Machine Learning

Research, Analytics, and Machine Learning

DevSummit event

  • Day & Time: Tuesday, 2:00 pm – 4:00 pm
  • Room: Tamalpais
  • Facilitator: Dan
  • Notetaker(s): Leila, who else please?

Session Notes: 15+ people in the room.

Part 1

  • Welcomes, outline, goals [15 mins]
    • Introductions
    • Set the Goal: The output of this session will feed into phase 2 of the 2030 Strategic Direction
    • Here is how we plan to do it:
    • Introduce our teams, the work and mandate of each team, capabilities our teams have, capabilities we're planning, and the things "we" see missing
    • Brainstorm what "you" think we should be doing
    • Break out and talk about risks, needs, opportunities, and what to avoid, stop, or re-resource
  • How do Research, Analytics, and Machine Learning relate to knowledge equity and knowledge as a service? What are some examples of features, services, processes, commitments, products, etc that further the strategic direction? Which are planned? Interactive: what other capabilities should be considered? [45 mins]

Slides at https://docs.google.com/presentation/d/1-pINN6SoiqBJDusTYrXk7mYV7eXlxMOOA-MqQtC52HM/edit#slide=id.g326cc9e2b0_0_1118

    • Aaron: Machine Learning
      • Aaron goes over his slides.
      • Leila/dan: thoughts on what Aaron said?
      • Moritz: You have applied ML to content, but have you considered applying it to code as well, especially to refactoring source code that could be replaced by code elsewhere, to make the code base more readable? And what new technologies do you introduce with this research that need to be maintained in the long term?
      • Aaron: the first question sounded like: what about code instead of content. This is not an area that I focus on, but there are other people who are interested in it. We can share notes with other teams who are interested.
      • CScott: You have ClueBot, Huggle, etc., which are kind of doing the same thing. So maybe we can look at consolidating ORES and these services. Back to your work: back in the darker days when VE was just deployed, we heard a lot that we were getting new editors who were less experienced, creating a lot of extra work for the patrollers, who were drowning in the RecentChanges feed. That specific problem didn't happen, perhaps because VE wasn't broadly embraced and the editor increase didn't happen that much either. A similar thing happened on Commons with the mobile uploader app, which had to be turned off because it was *too* successful in getting uploads to Commons (most of which were, alas, of low quality). So another way of contextualizing your work is helping unburden the vandal fighters, which helps keep Wikimedia open to new contributions. ORES could be extended beyond text to images, to allow us to re-enable some of the more accessible media upload mechanisms.
      • Aaron: yes, one of the talks I've not yet written is "how ORES kept Wikimedia open".
      • Ariel: I've often looked for a simple image of a flower and been unsuccessful; something like this could help with that.
      • Aaron: Let's talk about this in the breakout time because the systems that can support such a thing can be quite complicated.
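The patrolling use case CScott describes can be sketched in a few lines. This is a toy illustration only: the scores below are hard-coded and the 0.8 threshold is an assumed value, whereas in production the probabilities would come from the ORES scoring service.

```python
# Sketch: triaging a RecentChanges feed with ORES-style "damaging" scores.
# Scores and threshold are hypothetical, not real ORES output.

def triage(revisions, scores, threshold=0.8):
    """Split revisions into those needing human review and those auto-cleared.

    revisions: iterable of revision ids
    scores: mapping of revision id -> probability the edit is damaging
    threshold: review anything at or above this probability (assumed value)
    """
    needs_review, auto_cleared = [], []
    for rev in revisions:
        bucket = needs_review if scores.get(rev, 0.0) >= threshold else auto_cleared
        bucket.append(rev)
    return needs_review, auto_cleared

scores = {101: 0.95, 102: 0.03, 103: 0.81, 104: 0.40}
review, cleared = triage([101, 102, 103, 104], scores)
print(review)   # revisions flagged for patrollers
print(cleared)  # revisions patrollers can safely skip
```

The point of the sketch is the "unburdening" effect: patrollers only see the flagged subset, while the rest of the feed stays open to new contributors.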
    • Dan: Analytics
      • We collect all the data and people get lost at sea with all of the complexity -- that's the data warehouse idea. Instead we're going for the metaphor of a data lake: you can see the shore from any part of it. If some of that data needs to be public, it evaporates into the cloud. Streams of data feed into it. It's a calming metaphor and technology -- one that helps you define questions and answer them.
      • Something that has impact beyond our team -- we have to build a really good streaming data platform. E.g. pageviews are writes into the analytics database. The industry handles this with scale-able log transport & kafka. Events as a first-class citizen. Since we're building this anyway, we started to look at what else in the MW ecosystem could use this. (Job queue, ChangeProp, ORES, JADE, etc.) All of these systems have the same basic needs to transfer data and allow for remixing. E.g. you can fork a stream and make changes and re-publish that stream.
      • In 2030, how can we start sharing technology and working together. E.g. Search folks are using this platform to do ML and scale-able transport. More concretely, I want to get you thinking about the "data lake". If it feels like an ocean, we have failed. It should feel like a happy little lake. If it doesn't, talk to us.
      • CScott: How our goals are resourced -- current guidance from product is that we don't just blindly trust developers, but rather make people prove the products via data. Analytics is a crucial point of how we get our dev summit proposals funded. E.g. platform team says we need (say) MCR, but our product process seems to require that at some point we have metrics to back up the effectiveness of the project in order to get our action items on a roadmap.
      • Dan A: Analytics infrastructure is the team.
      • CScott: A lot of developers don't know how to work with the data, and could use assistance understanding what data is available, how to run a trial, etc.
      • Dan A: It may seem like we're the gatekeepers of the capabilities to do what you need to do. There's no magic bullet. We have a backlog miles long like everyone else, so we need to talk.
    • Does everyone have what they want out of analytics?
      • Aaron: One of the problems I run into with JADE: I need tight MW integration. I want things coming out of JADE to be reviewable or suppressible by people using MW. I'm about to pull the plug and make this a MW-only thing; working with the streams is difficult. There needs to be buy-in from the MW platform team to work on integration with streaming systems [for this to work].
      • Dan: There seems to be consensus... assembly with the Redis-injected thing in the job queue. We don't want that. So far, all we have done is turn everything outside of MediaWiki into a second-class citizen. I feel that we should decouple -- stop thinking of storage first. If you pivot on that and make the focus the event, the storage is a natural conclusion of that. The source of truth is a natural history. You can put it in ?? stores so you can ask questions. You only use one stack.
      • SJ: I think people sometimes have a query in mind, but self-censor until it only covers data they already know exists. If there were a way where people could build from both-sides into the middle. E.g. some people might know the use case and others can know how to gather the data.
      • Dan A: [Quarry is a nice example of this happening today. People who might not know how to formulate a query come to the talk page there and start a discussion about what they want, and we come help craft queries / get results to them.] I don't know how many people end up on that Quarry talk page asking questions. Probably a tiny subset of the people who have questions.
      • SJ: Is there a way that people on that talk page can flag missing things? (e.g., reflect back that currently their query can't be answered, but record their desire - so that over time you build a priority queue of 'wanted queries')
      • Dan A.: Yeah. File a task or we'll file it for you. (But probably) more can be done for prioritizing that.
      • CScott: To Aaron's earlier question: The issue of decoupling products from MediaWiki so that they can be used by a wider community has come up before with VE and some other things. If you are going to invest in the effort to de-couple it, check out the audience first and recruit a specific non-WMF user. Otherwise you risk spending extra effort building something that's not actually useful outside the WMF despite your efforts. Or build it small and publicize it and wait for people to start asking to use it for other things, then plan to build v2 which is more general and incorporate all the other lessons learned.
      • James F: When I think about data that I'm interested in, I'm thinking about editing data. My concerns around analytics -- are we ?? stuff right now? Should we be throwing this data away or should we keep it? "Hey! How are the edits per capita from the population in the UK vs. the US vs. elsewhere?" Well, I can remember a study off the top of my head from 11 years ago, but the numbers have probably changed. Then there are questions like "At what threshold of # of edits in the past does the average new user start adding references to the majority of their edits?". Building that from scratch in the system might be ???. How do we decide how to ? vs. the specific research cases. I don't know where that middle point is.
      • Dan: that's why people went from the data ocean to the lake. Oceans help you answer complex questions; but it will drive your ops people inside and cost a lot. We think it's sane to force people to ask questions up front, to draw a line: think about product and strategy and pivotal questions, and we then figure out how to track them and save relevant data. Nothing more at first. With experience doing this we can build intuition about other things we might want to [gather].
      • It's funny you say that you have the benefit of going back to look at the old history for every query. It took us 2–3 years to clean up the mess of that old history. Right now we store the current state of page and user; you have to look things up in the user table to find user age; there might be registration dates or not, there might be user blobs you have to look up. People use the logging table, [implicitly] designed for a certain use case but not all current use cases.
      • [Moritz, not spoken] An interesting short-term objective might be to look at how other stream-processing services use the Wikipedia edit stream. For example, a Flink quickstart aggregates recent edits: https://ci.apache.org/projects/flink/flink-docs-release-1.4/quickstart/run_example_quickstart.html
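Dan's "fork a stream and re-publish" idea can be sketched without any real infrastructure. This uses plain Python generators in place of a Kafka/EventStreams pipeline, and the event fields are illustrative, not the actual Wikimedia event schema.

```python
# Sketch: two independent consumers fork the same source stream and remix it,
# without the source or each other being affected. Generators stand in for
# real stream topics; event fields are invented for illustration.

def edit_stream():
    """A stand-in source stream of edit events."""
    yield {"title": "Flower", "bytes_changed": 120, "bot": False}
    yield {"title": "Kafka", "bytes_changed": -4, "bot": True}
    yield {"title": "Lake", "bytes_changed": 33, "bot": False}

def fork(stream, transform):
    """Consume an upstream stream and republish a transformed copy."""
    for event in stream:
        yield transform(event)

def humans_only(stream):
    """A second, independent fork: drop bot edits without touching the source."""
    return (e for e in stream if not e["bot"])

# One fork annotates edits by size; another filters to human edits.
annotated = fork(edit_stream(), lambda e: {**e, "large": abs(e["bytes_changed"]) > 100})
print([e["large"] for e in annotated])
print([e["title"] for e in humans_only(edit_stream())])
```

The design point is the one Dan makes: events are the first-class citizen, and each derived view (annotated, filtered) is just another stream rather than a special-cased store.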


    • Leila: Research
      • Aside: of course ML is a part of research and vice versa. The Research team focuses on 4 primary areas.
      • 1) Increasing WP [volume] across projects and languages,
      • 2) Increasing contributor diversity, demographically and otherwise (topically?)
      • 3) Community health - detecting harassment by going through user/talk pages and building models to detect harassment; moving backwards, can we detect scenarios that predict harassment and intervene before harassment happens? Highly specialized people now do hours and hours of work, some of which can be reduced by machines.
      • 4) Improving citations across Wikimedia projects. How current citations are being used, can we detect who's using what citation, can we bring more citations to Wikimedia projects.
      • We're a 6 person team, w 30+ collaborators in industry and academia. Much work happens through these voluntary collaborators. This is a capability we think about as a [community, research] asset. We think about how to cultivate and make the best use of their efforts.
      • Example: Knowledge Equity. Problem we observed in 2015: map of the world showing a lot of areas of the world with no wikipedia articles associated by geography. The bright spots are largely on the coasts of the United States, Europe, Japan, and smatterings in India and a few other places. English Wikipedia is dire enough, but, for example, Russian wikipedia covers even less. Even Spanish shows large gaps in coverage in South America, which is largely spanish-speaking. Arabic wikipedia covers almost nothing, just small parts around the Arabic-speaking world and dim coverage in the US.
      • One thing we started thinking about in 2013: can we build systems that help editors identify such gaps? Geo-coord gaps are only one of many, but are nice to convey the message. Find articles missing by geography, prioritize them, recommend them to those who are interested in [the region / closing such gaps]. We built an end-to-end system -- comparing WP language pairs with one another. E.g. we compared EN and FR and found articles available in one but not the other. Find, Rank, Recommend. We took the top 300K articles by pageviews and tried to solve the following matching problem:
        • We know what editors have edited before ==> inferred what topics they're interested in
        • Divided the articles into groups of 100K each. Looked at 12K editors who contributed to both EN and FR in the months of the experiment.
        • Separated the editors: some got no intervention, some got a random recommendation, some got personalized recommendations.
        • We found you could triple new article creation w/ personalized recommendations. We started GapFinder, a project with APIs that does this in a productionized way
        • Vertical Expansion: In the best case, you have a big language project like enwiki, and the goal is to tell the editor how they can expand an article, on a section-basis. This problem was inspired by the problems in sub-Saharan Africa. There are very few but very active editors. They need to create templates in their local languages to tell newcomers how/what to expand. We designed the system, which you can try at http://gapfinder.wmflabs.org/en.wikipedia.org/...
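The find/rank/recommend pipeline described above can be sketched with toy data. Everything below is invented for illustration (article titles, topics, pageview counts); the real system compares language pairs at scale and infers editor interests from edit history.

```python
# Toy sketch of GapFinder-style matching: find articles present in one
# language edition but missing in another, then rank them so articles in
# topics the editor has edited before come first, pageviews breaking ties.
# All data is invented.

en_articles = {  # title -> (topic, pageviews)
    "Mount Kenya": ("geography", 9000),
    "Jazz": ("music", 7000),
    "Sorghum": ("agriculture", 2500),
}
fr_articles = {"Jazz"}  # already exists in French

def recommend(editor_topics, top_n=2):
    """Personalized recommendations of EN articles missing from FR."""
    missing = [(title, topic, views)
               for title, (topic, views) in en_articles.items()
               if title not in fr_articles]
    # Prefer topics the editor has edited before, then higher pageviews.
    missing.sort(key=lambda m: (m[1] in editor_topics, m[2]), reverse=True)
    return [title for title, _, _ in missing[:top_n]]

print(recommend({"agriculture"}))  # personalized: agriculture first
print(recommend(set()))            # no interests known: rank by pageviews
```

This mirrors the experiment's contrast between random and personalized recommendations: with no interest signal, the ranking falls back to pure popularity.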


Part 2

  • How do Research, Analytics, and Machine Learning relate to the broad strategic goals, namely knowledge equity and knowledge as a service?
  • Which capabilities (features, services, processes, commitments, products, etc.) do we already have? Which are planned? Which should be considered?
  • Risks, needs, opportunities, avoid, stop, resources
  • Breakout questions, ask people to switch to breakout mode. [5-mins]


  • Breakout sessions: [30 mins]
    • What major risks do you see to our ability to provide these capabilities from the technical side, and how can they be mitigated? For existing capabilities, this includes risks to sustaining and scaling it.
    • What technological needs do you see (beyond addressing the risks) for providing the respective capabilities?
    • Which technological opportunities do you see for providing the respective capabilities? Which methods or technologies should we explore?
    • What should we avoid doing with respect to the relevant capabilities?
    • What should we stop doing with respect to the relevant capabilities?
    • What amount of resources should be committed to providing each of the capabilities? Consider horizons of 1 year, 3 years, and 5 years.
  • Topic: Machine Translation
  • Risks:
    • Have to figure out what is "good enough" as a translation. Wikipedia has a high quality standard.
    • diluting wp authoritative voice w/ poor quality translated content
    • amplifying inequities by creating more content for those already rich in content, while neglecting those with little content to begin with
  • Needs:
    • properly aligned multilingual text
    • clearly distinguish automated generated text vs human created text (UI)
    • mediawiki needs to more carefully tag the source language of created content, esp in multilingual contexts and foreign-language citations
    • possibly citations-in-wikidata also need to track translated versions of references
  • Opportunities:
    • with a trained model, cross-project discussion (talk pages, chat) can be facilitated
    • existing parallel text in (underserved) languages
      • transliteration
      • using wikidata to fill in translations of proper names and low-frequency noun phrases, and conversely leveraging human-created translations to fill in gaps in wikidata language links
    • identifying gaps in content and filling it (broadening viewpoint of all wikis)
    • OCR (using the language models to allow OCR of small languages)
    • big community in traditionally underserved languages
    • translation between variants of a language (including script differences)
    • shared language models allow building good models for related languages which individually do not have much training data
    • multilingual readers can get a broader view on a topic by comparing wikis; monolingual readers can do the same w/ the aid of machine translation
    • trained model can be used to surface additional parallel texts and improve alignments, creating a virtuous cycle
  • Avoid:
    • concentrating machine translation on majority languages & low-hanging fruit
    • indirectly encouraging strict parallelism among wikis; ensure that wikis have autonomy and can express distinct viewpoints
  • Stop:
    • Separate databases for each wiki
    • LanguageConverter (replace w/ improved translation engines)
    • Content Translation tool might be able to be folded into core instead of maintained as a separate service
  • Resources:
    • Beefy servers (can haz GPUs)
    • ML team (shared with other ML needs in organization)
    • Language/domain consultants (shared with other language needs in organization)
    • Community liaisons to help communities newly able to work together across language barriers
    • Partnership with academic MT researchers?
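The Wikidata opportunity listed above (filling in translations of proper names) can be sketched as a label lookup. The item IDs and labels below are illustrative stand-ins; a real implementation would query Wikidata rather than a hard-coded table, and would need proper entity linking instead of naive string matching.

```python
# Sketch: before machine-translating, substitute proper names with their
# target-language Wikidata labels instead of translating them word-by-word.
# The label table here is a mocked-up stand-in for Wikidata.

labels = {  # Wikidata-style item id -> per-language labels
    "Q90": {"en": "Paris", "ar": "باريس"},
    "Q30": {"en": "United States", "ar": "الولايات المتحدة"},
}

def fill_proper_nouns(text, target_lang):
    """Replace known English labels with their target-language labels."""
    for item in labels.values():
        en = item.get("en")
        if en and en in text and target_lang in item:
            text = text.replace(en, item[target_lang])
    return text

print(fill_proper_nouns("Paris is a city.", "ar"))
```

The same table works in reverse for the converse opportunity: human-created translations can surface label pairs that are missing from Wikidata.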



  • Topic: Persistent experience across multiple sessions
    • What major risks do you see to our ability to provide these capabilities from the technical side, and how can they be mitigated? For existing capabilities, this includes risks to sustaining and scaling it.
      • Is this like evercookies? What are the privacy & user-identification risks involved, can they be mitigated? [NB: if research requires it we can do user fingerprinting after the fact]: combination of IP address, agent, referrer.
      • Longer-term we'll lose track of people as they change IP/UA version/UA/etc.
    • What technological needs do you see (beyond addressing the risks) for providing the respective capabilities?
      • Letting researchers and community see what is happening to banners and experiments
      • Letting potential researchers see the technical & social mechanism for privacy preservation, so they can work out whether they can make experiments they have in mind fit the accepted framework
      • Legal need: updating legal-tech policies that currently prevent such observations / data gathering
    • Which technological opportunities do you see for providing the respective capabilities? Which methods or technologies should we explore?
      • Search Surveys, for instance: localstore a token. Name, timestamp. Search can do this to collect labels, track relevance of labels.
      • See Phab ticket about generalized survey pipeline. [can we ask users re: reconstructing connections across different surveys?]
      • Invite opt-in to longer-term collection for personalization! Want enough people in this network to vary tests across them. Compare what the iOS app asks for an opt-in; Android app is opt-out.
        • a) find out what % of people will opt in.
        • b) consider opt-out for a sample cross section of users
      • Invite people to opt in to enough persistent data to ensure they stop seeing donation banners after they give x
      • Survey trees that ask one portion at a time, and customize later portions based on earlier responses. E.g. starting w/ demographics... the resulting surveys can be much shorter for each respondent while gathering locally precise information.
    • What should we avoid doing with respect to the relevant capabilities?
      • Be careful to debias data for the sorts of people who choose to opt-in or opt-out (but need population data to know how to de-bias)
      • Not become an ad-tech company. :-) In particular, major uses of this data should not be fundraising but improvements to reading and editing and collaboration experience
      • Opaqueness about what data exists. E.g.: At any point, show me everything you have about me. And the user have a choice to say "forget" or "fix".
      • Action w/o community buy-in.
    • What should we stop doing with respect to the relevant capabilities?
      • Not doing much now. Stop the misalignment.
      • Right now opt-out for central notice has to be manually pasted into wikitext each time (the 'CLOSE' button for 'stop showing this banner'). That's why some don't have it. Integrate that into the system altogether!
    • What amount of resources should be committed to providing each of the capabilities? Consider horizons of 1 year, 3 years, and 5 years.
      • Research and other vision: how this can transform experience, 1 or 2 central goals & ideal outcomes for user experience if we had optimal data
      • Design support to push out & iterate surveys.
      • Include advisors / guidance from libraries that are also exploring privacy-conscious ways to improve use & experience of libs
      • Legal support and clearance. Community support in discussion.
      • Executive support and coordination
  • Topic: Qualitative Methods
    • actual metrics on quality of articles
    • getting input on the how and why of people's work, concepts that numbers can't help us measure. → https://www.youtube.com/watch?v=MFpEksCJKqY "Qualitative Surveys"
    • Connect datasets of articles with representative samples.
    • Capabilities
      • Microsurveys++ - 1 to 5 scales, as well as free-form text and the analysis of that type of qualitative input, along with the team of people that analyze and summarize this input.
        • risks: privacy, data not believed because not quantitative
        • needs: expose bias of the research team/reflexivity
        • opportunities: finding out more about traditionally quieter segments of our users (eg newcomers)
        • avoid: don't combine with quantitative data, acknowledge the difference in paradigms; always treating such surveys as big projects over a long timeframe (small quick projects with few resources are possible)
        • stop: Stop doing surveys in an ad-hoc way
        • resource: Basic infrastructure for surveys on-wiki
      • Dashboards around article quality over time, percent stubs, average quality, percent of good or higher quality, percent of articles that are spam/vandalism (for less active projects)
        • risks: People feel labeled ("Welsh Wikipedia is bad" or something like that), mismeasuring (for example, wiktionary is hard to judge)
        • needs: Compile the data and create visualizations/dashboards; need reliable quality indicators
        • opportunities: give community new goals to achieve, like "50% good or higher quality"
        • avoid: Making it too complicated, subjective assessments of quality (use ORES); Making decisions based only on quantitative data
        • stop:
        • resource: ORES wp1.0 and draftquality models
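The dashboard idea above (percent stubs, average quality, percent good-or-higher) can be sketched over per-article quality classes. The assessments below are hypothetical; real inputs would come from the ORES wp1.0 model, and the numeric ranking of classes is an assumption for illustration.

```python
# Sketch: compute the summary numbers a per-wiki quality dashboard would show,
# given a list of (hypothetical) article assessments on the enwiki-style
# Stub/Start/C/B/GA/FA scale.
from collections import Counter

QUALITY_ORDER = ["Stub", "Start", "C", "B", "GA", "FA"]  # worst to best

def dashboard(assessments):
    counts = Counter(assessments)
    total = len(assessments)
    ranks = {q: i for i, q in enumerate(QUALITY_ORDER)}
    good_or_higher = QUALITY_ORDER[QUALITY_ORDER.index("GA"):]
    return {
        "percent_stubs": 100 * counts["Stub"] / total,
        "percent_good_or_higher": 100 * sum(counts[q] for q in good_or_higher) / total,
        "average_rank": sum(ranks[a] for a in assessments) / total,
    }

print(dashboard(["Stub", "Stub", "Start", "GA", "FA"]))
```

Numbers like "percent_good_or_higher" map directly onto the community-goal idea above ("50% good or higher quality").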


  • Topic: Measuring across demographics, beliefs, etc. (assessing knowledge equity)
    • Equity means everyone has a fair chance to contribute -- but how do you measure that? You can look at data like page hits and #s of users. If you don't know anything about your users, you don't know if you're getting enough from a group.
    • Geolocation could be inaccurate/misleading.
    • Clarification: Look at everybody? Or look at certain segments of users. Survivor bias... E.g. we look at people who come to the site and are really new vs. heavy users of 7 years.
    • Most people aren't going to want to give up their [socio-economic status].
    • Risks
      • Privacy risks
      • Poor sampling (survivor bias) If you don't have a smart phone, you are already excluded/at a disadvantage. Who has cell phones? We can use contextual inquiries to re-weight proportions by how people are or are not excluded from getting online.
      • Social backlash
      • Scary governments & scared contributors
    • Needs
      • Measure success overall and to have interventions
      • Need to understand where the gaps are.
      • Would need to guarantee privacy. Some engineering and sampling solutions.
    • Opportunities
      • Know where to direct our efforts
      • Epidemiological threat detection
      • Correlation with other key metrics (theory)
      • We could measure stuff in our own environment -- we'd have to sample. Maybe 1:10000 people.
    • Avoid
      • Long storage of sensitive data. Make sure that we have a context for it. "Your data is going to be used for X." Have an initial project with clear goals.
    • Stop
    • Resources
      • (maybe) paying people for their cooperation.
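The re-weighting idea raised in the breakout (correcting for who opts in, and for survivor bias) can be sketched with inverse-probability weighting. The per-group opt-in rates below are invented; in practice they would have to be estimated from population data, which is exactly the dependency noted above.

```python
# Sketch: if opt-in rates differ across groups, weight each respondent by the
# inverse of their group's opt-in probability so the sample estimate reflects
# the full population. The rates here are assumed for illustration.

opt_in_rate = {"mobile": 0.02, "desktop": 0.10}  # hypothetical per-group rates

def weighted_mean(responses):
    """responses: list of (group, value) pairs from opted-in users."""
    num = sum(value / opt_in_rate[group] for group, value in responses)
    den = sum(1 / opt_in_rate[group] for group, _ in responses)
    return num / den

# Desktop users opt in 5x as often, so they are over-represented among
# respondents; weighting pulls the estimate back toward mobile users.
sample = [("mobile", 1.0), ("desktop", 0.0), ("desktop", 0.0)]
print(weighted_mean(sample))
```

An unweighted mean of this sample would be 1/3; the weighted estimate is higher because the single mobile respondent stands in for many more non-respondents.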


  • Reconvene and discuss, each breakout group presenting the result of their discussion as a response to the above questions [15 mins]
  • Wrap up [5 mins]
    • SKIPPED.

Action items, next steps, and owners:

  1. Deliver the above to C-levels for annual planning process & mysterious "phase 2"