Wikimedia Discovery/Meetings/Search retrospective 2016-03-24

Discovery Search Retrospective

2016-03-24

What has happened?

Covering whatever has happened related to the team since the last retro (2016-02-22)

  • Lila left; Katherine interim ED
  • Released completion suggester
  • Renamed Relevance Forge
    • Added automatic engine param optimization to RelForge
  • Oliver left the team
  • Quarterly planning
  • Annual planning
  • Ran phrase boosting A/B test

Review action items from before

  • Chris? Follow up on Yuri’s idea of status updates for Discovery (à la Wikidata - https://www.wikidata.org/wiki/Wikidata:Status_updates)
    • DONE
  • Chris: we should blog post and announce this (textcat)
    • Updating docs first
  • Chris: we should blog post and announce this +1 (popularity score)
    • Looking into it
  • Chris: Should we publicize usage numbers via blog post or other? (WDQS)
    • Did in weekly update. If it should be in other channels, let Chris know
  • Chris: Let’s ask Rachel how useful it is for other teams ATM +1 (office hours)
    • Has been scheduled!
  • Chris: This should be mentioned on the public mailing list (Nik’s post)
    • Was in weekly update.
  • Dan: Consider adding team information to the Discovery wiki page, but need a maintenance plan (who works on each project)
  • Kevin: Consider other note mechanisms (but maybe after we redo retros?)
    • Experimenting with etherpad
  • Tomasz? Should we add more “document” tasks to onboarding docs?
    • Had discussions. No further action required right now

What went well?

  • Completion suggester rollout!
    • Dashboard Graphs moved as a result!
  • Guillaume did not crash Wikipedia (yet) despite having the chance to do it multiple times
  • Ran an A/B test using metrics other than the zero result rate (Yay!)
  • Added autocomplete information to satisfaction schema
    • Is the increase in the augmented clickthrough correlated with the completion suggester?
      • Unknown at this time, see "Limited analysis throughput" below.
  • Better ways of evaluating changes to search before deploying them have been developed
  • Upgraded Elasticsearch to 1.7.5 (and it was super straightforward because we have Guillaume to chip away at and own these issues!)
  • Talking to Chris about TextCat & generating a public-facing page
  • Dashboards were down, Bryan D. fixed the Vagrant issue, so yay for folks outside the department helping us :)
    • bd808 is pretty great at appearing out of nowhere and fixing things

What could we improve?

  • Velocity? Many of our planning meetings lately are basically "we've seen these tickets and they are still there"
    • but maybe the problem is more related to having other tasks that keep getting added/finished and other tasks remaining for multiple weeks?
      • (Dan:) I think this overall concern is correct, but there's also truth to the above bullet point. In one notable example, I added a bunch of tasks to the sprint which were fixed and deployed between two planning meetings!
    • do we have a measurement of velocity somewhere?
    • This is partly an effect of kanban vs. scrum, where we are allowing tasks to enter the flow at any time
    • If tasks are taking longer than expected, that may be a warning that tasks should be split into smaller tasks
    • As long as Dan has visibility and agrees with the priorities of what's being done, this doesn't appear to be a problem
  • Limited analysis throughput, due to losing half the team. Solving this is already in progress though.
    • It's possible some other team members could step in and perform less rigorous analysis to help
    • We are back-filling this position, so the issue should just be temporary
    • Reminder to Mikhail to speak up if he feels overwhelmed or like there are unreasonable deadlines
    • Mikhail: We're ok for now
  • Guillaume needs a better understanding of our procurement process
    • RobH seems to be the contact. Guillaume will try to find that documented somewhere
  • A/B test sizing / length of time. We just kind of wing this, but I've been told there are more rigorous ways to decide sample sizes necessary to measure an expected change
    • Relatedly, with more rigorous sizing we could perhaps run multiple simple tests at the same time, like measuring swapping results and measuring the effect of slower results
    • Mikhail: We're unsure about the size of the entire population, so picking a rate is difficult
    • Can we use number of requests as our population (or a proxy for it)?
      • Depends on the test. Who is actually affected by this feature? Requests vs. sessions can be an issue.
  • gEdit returning NSFW results [disputed (that this is a "what could we improve")] +1 to the disputed
    • This seems to be a much bigger issue than we can handle in this team... (depends on the definition of "the problem")
    • (Dan) I basically think this entire issue has kind of always existed, NSFW results and things. We've just brought it to the forefront with the spelling correction.
    • (Dan) I'm more concerned about the fact that someone who types "gedit" almost certainly does not want anything to do with "genitals"
    • Risk of being accused of censorship if we focus too much on not returning NSFW results
    • Dan has a list of queries which have improved; this is one of the few that got worse
    • We were able to provide a workaround solution to the downstream customer very quickly, which was good
    • Two separate issues:
      • NSFW images being presented unexpectedly (in search results, and elsewhere)
      • "Bad" (for some definition of "bad") typo suggestions
  • https://meta.wikimedia.org/wiki/Discovery/Testing looks like it needs a refresh with currently planned/running/done tests
    • Mikhail: Check on this
  • We didn't announce the latest A/B test
    • Should probably have an automatic task to announce each test
    • Timeliness is not critical for most of these tests, so putting them in the weekly status report
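
A rough sketch for the A/B test sizing item above: one of the "more rigorous ways" is a standard power calculation for comparing two proportions (for example, clickthrough rates). This is only an illustration, not something the team has adopted. The function name, the default significance level and power, and the example rates are all assumptions. The formula needs a baseline rate and the smallest change worth detecting, not the size of the whole population, which may help with the concern that the population size is unknown.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control, p_test, alpha=0.05, power=0.8):
    """Approximate users (or sessions) needed in EACH bucket of a two-sided
    test comparing two proportions, using the normal approximation.

    p_control: expected rate in the control bucket (assumed baseline)
    p_test:    rate we want to be able to detect in the test bucket
    alpha:     acceptable false-positive rate (assumed default)
    power:     chance of detecting the change if it is real (assumed default)
    """
    if p_control == p_test:
        raise ValueError("the two rates must differ to size a test")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # critical value for power

    p_bar = (p_control + p_test) / 2               # pooled rate under "no change"
    spread_null = math.sqrt(2 * p_bar * (1 - p_bar))
    spread_alt = math.sqrt(p_control * (1 - p_control) + p_test * (1 - p_test))

    n = (z_alpha * spread_null + z_beta * spread_alt) ** 2 / (p_control - p_test) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical rates: a 30% baseline and a 2-point improvement to detect.
    print(sample_size_per_group(0.30, 0.32))
```

Smaller expected changes need much larger buckets, so the choice of unit (requests vs. sessions) still matters: the count returned is in whatever unit the rates were measured in.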

What else should be noted?

  • I've read much less controversy around Discovery on our mailing lists (are we improving in communication, or is the subject just getting old?)
    • IMHO, the controversy was a stick people were using to poke Lila / the Board into action. Mission accomplished, so it's died down, so to speak?
    • Don't get me started here. :)

Retro of retro

This was the first-ever search-team retrospective, rather than a full Discovery department retrospective. It was an experiment. How was it?

  • Got more in depth on particular topics; more relevant to everyone here
    • Agree
    • but I miss hearing more about what everyone else is doing
  • Mikhail (who was in 3 team retros): I really like these focused retros and ability to go more in depth
  • Compared to previous places: These retros go much less in depth than what I'm used to (e.g. they might go 1 hour deeply into one topic)
  • On other teams at the WMF, we've picked 2-3 issues by voting, and then 15 minutes per topic (not sure which way is better...this kind of works)
  • We could video record these and circulate within the team (allow viewing at 2x)

Action Items

  • Mikhail: Check whether meta Discovery/Testing page is up-to-date
  • Dan? Should probably have an automatic task to announce each test
  • Guillaume: Talk to RobH to understand how procurement works
  • Kevin: Think more about velocity question: Hire more? Change process? Is it OK as is? Start doing guesstimations?
  • Mikhail: Announce the past test(s)