Wikimedia Discovery/Meetings/Search retrospective 2016-03-24

Discovery Search Retrospective

2016-03-24

What has happened?

Covering whatever has happened related to the team since the last retro (2016-02-22)

Lila left; Katherine interim ED
Released completion suggester
Renamed Relevance Forge
- Added automatic engine param optimization to RelForge
Oliver left the team
Quarterly planning
Annual planning
Ran phrase boosting A/B test

Review action items from before

Chris? Follow up to Yuri’s idea on status updates for Discovery (ala Wikidata - https://www.wikidata.org/wiki/Wikidata:Status_updates)
- DONE
Chris: we should blog post and announce this (textcat)
- Updating docs first
Chris: we should blog post and announce this +1 (popularity score)
- Looking into it
Chris: Should we publicize usage numbers via blog post or other? (WDQS)
- Did in weekly update. If it should be in other channels, let Chris know
Chris: Let’s ask Rachel how useful it is for other teams ATM +1 (office hours)
- Has been scheduled!
Chris: This should be mentioned on the public mailing list (Nik’s post)
- Was in weekly update.
Dan: Consider adding team information to the Discovery wiki page, but need a maintenance plan (who works on each project)
- DONE - https://www.mediawiki.org/wiki/Wikimedia_Discovery#The_team
Kevin: Consider other note mechanisms (but maybe after we redo retros?)
- Experimenting with etherpad
Tomasz? Should we add more “document” tasks to onboarding docs?
- Had discussions. No further action required right now

What went well?

Completion suggester roll out!
- Dashboard Graphs moved as a result!
Guillaume did not crash Wikipedia (yet) despite having the chance to do it multiple times
Ran an A/B test using metrics other than the zero result rate (Yay!)
Added autocomplete information to satisfaction schema
- Is the increase in the augmented clickthrough correlated with the completion suggester?
  - Unknown at this time, see "Limited analysis throughput" below.
Better ways of evaluating changes to search before deploying them have been developed
Upgraded Elasticsearch to 1.7.5 (and it was super straightforward because we have Guillaume to chip away at and own these issues!)
Talking to Chris about TextCat & generating a public-facing page
- https://www.mediawiki.org/wiki/TextCat
Dashboards were down, Bryan D. fixed the Vagrant issue, so yay for folks outside the department helping us :)
- bd808 is pretty great at appearing out of nowhere and fixing things

What could we improve?

Velocity? Many of our planning meetings lately are basically "we've seen these tickets and they are still there"
- but maybe the problem is more related to having other tasks that keep getting added/finished and other tasks remaining for multiple weeks?
  - (Dan:) I think this overall concern is correct, but there's also truth to the above bullet point. In one notable example, I added a bunch of tasks to the sprint which were fixed and deployed between two planning meetings!
- do we have a measurement of velocity somewhere?
  - (Post-meeting note: No, but Phlogiston might help)
- This is partly an effect of kanban vs. scrum, where we are allowing tasks to enter the flow at any time
- If tasks are taking longer than expected, that may be a warning that tasks should be split into smaller tasks
- As long as Dan has visibility and agrees with the priorities of what's being done, this doesn't appear to be a problem
Limited analysis throughput, due to losing half the team. Solving this is already in progress though.
- It's possible some other team members could step in and perform less rigorous analysis to help
- We are back-filling this position, so the issue should just be temporary
- Reminder to Mikhail to speak up if he feels overwhelmed or like there are unreasonable deadlines
- Mikhail: We're ok for now
Guillaume needs a better understanding of our procurement process
- RobH seems to be the contact. Guillaume will try to find that documented somewhere
A/B test sizing / length of time. We just kind of wing this, but I've been told there are more rigorous ways to decide the sample size needed to measure an expected change
- Relatedly, with more rigorous sizing we could perhaps run multiple simple tests at the same time, like measuring swapping results and measuring the effect of slower results
- Mikhail: We're unsure about the size of the entire population, so picking a rate is difficult
- Can we use number of requests as our population (or a proxy for it)?
  - Depends on the test. Who is actually affected by this feature? Requests vs. sessions can be an issue.
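One of the "more rigorous ways" mentioned above is a standard power calculation for comparing two proportions (for example, clickthrough rate in the control and test buckets). The sketch below is only an illustration, not the team's method: the function name, the default significance level and power, and the example rates are all assumptions, and it presumes the unit being counted (requests or sessions) has already been chosen for the test.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_baseline, p_expected, alpha=0.05, power=0.8):
    """Approximate units needed in EACH bucket of a two-sided, two-proportion z-test.

    p_baseline: current rate (e.g. clickthrough) in the control bucket
    p_expected: rate we hope to see in the test bucket
    alpha:      accepted false-positive rate (hypothetical default)
    power:      chance of detecting the change if it is real (hypothetical default)
    """
    if p_baseline == p_expected:
        raise ValueError("the two rates must differ to size a test")
    if not (0 < p_baseline < 1 and 0 < p_expected < 1):
        raise ValueError("rates must be strictly between 0 and 1")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power

    p_bar = (p_baseline + p_expected) / 2          # pooled rate under "no change"
    spread_null = math.sqrt(2 * p_bar * (1 - p_bar))
    spread_alt = math.sqrt(
        p_baseline * (1 - p_baseline) + p_expected * (1 - p_expected)
    )

    n = (z_alpha * spread_null + z_beta * spread_alt) ** 2
    n /= (p_baseline - p_expected) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Illustrative numbers only: a 30% baseline and a hoped-for 32%.
    print(sample_size_per_group(0.30, 0.32))
```

Dividing the per-bucket size by the expected daily traffic in the chosen unit gives a rough test length, and the calculation does not require knowing the size of the entire population, only the baseline rate and the smallest change worth detecting.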
gEdit returning NSFW results [disputed (that this is a "what could we improve")] +1 to the disputed
- This seems to be a much bigger issue than we can handle in this team... (depends on the definition of "the problem")
- (Dan) I basically think this entire issue has kind of always existed, NSFW results and things. We've just brought it to the forefront with the spelling correction.
- (Dan) I'm more concerned about the fact that someone who types "gedit" almost certainly does not want anything to do with "genitals"
- Risk of being accused of censorship if we focus too much on not returning NSFW results
- Dan has a list of queries which have improved; this is one of the few that got worse
- We were able to provide a workaround solution to the downstream customer very quickly, which was good
- Two separate issues:
  - NSFW images being presented unexpectedly (in search results, and elsewhere)
  - "Bad" (for some definition of "bad") typo suggestions
https://meta.wikimedia.org/wiki/Discovery/Testing looks like it needs a refresh with currently planned/running/done tests
- Mikhail: Check on this
We didn't announce the latest A/B test
- Should probably have an automatic task to announce each test
- Timeliness is not critical for most of these tests, so putting them in the weekly status report

What else should be noted?

I've read much less controversy around Discovery on our mailing lists (are we improving in communication, or is the subject just getting old?)
- IMHO, the controversy was a stick people were using to poke Lila / The board into action. Mission accomplished, so it's died down, so to speak?
- Don't get me started here. :)

Retro of retro

This was the first-ever search-team retrospective, rather than a full Discovery department retrospective. It was an experiment. How was it?

Got more in depth in particular topics; more relevant to everyone here
- Agree
- but I miss hearing more about what everyone else is doing
Mikhail (who was in 3 team retros): I really like these focused retros and ability to go more in depth
Compared to previous places: These retros go much less in depth than what I'm used to (e.g. they might go 1 hour deeply into one topic)
On other teams at the WMF, we've picked 2-3 issues by voting, and then 15 minutes per topic (not sure which way is better...this kind of works)
We could video record these and circulate within the team (allow viewing at 2x)

Action Items

Mikhail: Check whether meta Discovery/Testing page is up-to-date
Dan? Should probably have an automatic task to announce each test
Guillaume: Talk to RobH to understand how procurement works
Kevin: Think more about velocity question: Hire more? Change process? Is it OK as is? Start doing guesstimations?
Mikhail: Announce the past test(s)