Wikimedia Discovery/Meetings/Search retrospective 2016-03-24
Discovery Search Retrospective
2016-03-24
What has happened?
Covering whatever has happened related to the team since the last retro (2016-02-22)
- Lila left; Katherine interim ED
- Released completion suggester
- Renamed Relevance Forge
- Added automatic engine param optimization to RelForge
- Oliver left the team
- Quarterly planning
- Annual planning
- Ran phrase boosting A/B test
Review action items from before
- Chris? Follow up to Yuri’s idea on status updates for Discovery (ala Wikidata - https://www.wikidata.org/wiki/Wikidata:Status_updates)
- DONE
- Chris: we should blog post and announce this (textcat)
- Updating docs first
- Chris: we should blog post and announce this +1 (popularity score)
- Looking into it
- Chris: Should we publicize usage numbers via blog post or other? (WDQS)
- Did in weekly update. If it should go out in other channels, let Chris know
- Chris: Let’s ask Rachel how useful it is for other teams ATM +1 (office hours)
- Has been scheduled!
- Chris: This should be mentioned on the public mailing list (Nik’s post)
- Was in weekly update.
- Dan: Consider adding team information to the Discovery wiki page, but need a maintenance plan (who works on each project)
- Kevin: Consider other note mechanisms (but maybe after we redo retros?)
- Experimenting with etherpad
- Tomasz? Should we add more “document” tasks to onboarding docs?
- Had discussions. No further action required right now
What went well?
- Completion suggester roll out!
- Dashboard Graphs moved as a result!
- Guillaume did not crash Wikipedia (yet) despite having the chance to do it multiple times
- Ran an A/B test using metrics other than the zero result rate (Yay!)
- Added autocomplete information to satisfaction schema
- Is the increase in the augmented clickthrough correlated with the completion suggester?
- Unknown at this time, see "Limited analysis throughput" below.
- Better ways of evaluating changes to search before deploying them have been developed
- Upgraded Elasticsearch to 1.7.5 (and it was super straightforward because we have Guillaume to chip away at and own these issues!)
- Talking to Chris about TextCat & generating a public-facing page
- Dashboards were down, Bryan D. fixed the Vagrant issue, so yay for folks outside the department helping us :)
- bd808 is pretty great at appearing out of nowhere and fixing things
What could we improve?
- Velocity? Many of our planning meetings lately are basically "we've seen these tickets and they are still there"
- but maybe the problem is more related to having other tasks that keep getting added/finished and other tasks remaining for multiple weeks?
- (Dan:) I think this overall concern is correct, but there's also truth to the above bullet point. In one notable example, I added a bunch of tasks to the sprint which were fixed and deployed between two planning meetings!
- do we have a measurement of velocity somewhere?
- (Post-meeting note: No, but Phlogiston might help)
- This is partly an effect of kanban vs. scrum, where we are allowing tasks to enter the flow at any time
- If tasks are taking longer than expected, that may be a warning that tasks should be split into smaller tasks
- As long as Dan has visibility and agrees with the priorities of what's being done, this doesn't appear to be a problem
- Limited analysis throughput, due to losing half the team. Solving this is already in progress though.
- It's possible some other team members could step in and perform less rigorous analysis to help
- We are back-filling this position, so the issue should just be temporary
- Reminder to Mikhail to speak up if he feels overwhelmed or like there are unreasonable deadlines
- Mikhail: We're ok for now
- Guillaume needs a better understanding of our procurement process
- RobH seems to be the contact. Guillaume will try to find that documented somewhere
- A/B test sizing / length of time. We just kind of wing this, but I've been told there are more rigorous ways to decide sample sizes necessary to measure an expected change
- Relatedly, with more rigorous sizing we could perhaps run multiple simple tests at the same time, like measuring swapping results and measuring the effect of slower results
- Mikhail: We're unsure about the size of the entire population, so picking a rate is difficult
- Can we use number of requests as our population (or a proxy for it)?
- Depends on the test. Who is actually affected by this feature? Requests vs. sessions can be an issue.
- gEdit returning NSFW results [disputed (that this is a "what could we improve")] +1 to the disputed
- This seems to be a much bigger issue than we can handle in this team... (depends on the definition of "the problem")
- (Dan) I basically think this entire issue has kind of always existed, NSFW results and things. We've just brought it to the forefront with the spelling correction.
- (Dan) I'm more concerned about the fact that someone who types "gedit" almost certainly does not want anything to do with "genitals"
- Risk of being accused of censorship if we focus too much on not returning NSFW results
- Dan has a list of queries which have improved; this is one of the few that got worse
- We were able to provide a workaround solution to the downstream customer very quickly, which was good
- Two separate issues:
- NSFW images being presented unexpectedly (in search results, and elsewhere)
- "Bad" (for some definition of "bad") typo suggestions
- https://meta.wikimedia.org/wiki/Discovery/Testing looks like it needs a refresh with currently planned/running/done tests
- Mikhail: Check on this
- We didn't announce the latest A/B test
- Should probably have an automatic task to announce each test
- Timeliness is not critical for most of these tests, so putting them in the weekly status report
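On the A/B test sizing point above, one common "more rigorous" approach is a power calculation for comparing two proportions (for example, clickthrough rate in control vs. test). The following is only a minimal sketch under stated assumptions: the function name, the default significance level and power, and the example rates are all hypothetical and are not the team's numbers or method. It uses the standard normal-approximation formula and returns the number of observations needed per group, which still leaves open the question raised above of whether an "observation" should be a request or a session.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_baseline, p_expected, alpha=0.05, power=0.8):
    """Approximate observations needed per group for a two-sided
    two-proportion z-test (normal approximation).

    p_baseline: current rate of the metric (hypothetical input)
    p_expected: rate we hope to see in the test group (hypothetical input)
    alpha:      accepted false-positive rate
    power:      desired probability of detecting the change if it is real
    """
    effect = p_expected - p_baseline
    if effect == 0:
        raise ValueError("expected rate must differ from baseline rate")

    # Critical values of the standard normal distribution.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)

    # Sum of the binomial variances of the two groups.
    variance = p_baseline * (1 - p_baseline) + p_expected * (1 - p_expected)

    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


if __name__ == "__main__":
    # Illustrative rates only: detect a move from 30% to 32%.
    print(sample_size_per_group(0.30, 0.32))
```

Because the result is an absolute count per group rather than a sampling rate, it can be compared against however many sessions or requests a test bucket is expected to collect per day to estimate how long a test needs to run.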
What else should be noted?
- I've read much less controversy around discovery on our mailing lists (are we improving in communication, or is just the subject getting old?)
- IMHO, the controversy was a stick people were using to poke Lila / The board into action. Mission accomplished, so it's died down, so to speak?
- Don't get me started here. :)
Retro of retro
This was the first-ever search-team retrospective, rather than a full Discovery department retrospective. It was an experiment. How was it?
- Got more in depth in particular topics; more relevant to everyone here
- Agree
- but I miss hearing more about what everyone else is doing
- Mikhail (who was in 3 team retros): I really like these focused retros and ability to go more in depth
- Compared to previous places: These retros go much less in depth than what I'm used to (e.g. they might go 1 hour deeply into one topic)
- On other teams at the WMF, we've picked 2-3 issues by voting, and then 15 minutes per topic (not sure which way is better...this kind of works)
- We could video record these and circulate within the team (allow viewing at 2x)
Action Items
- Mikhail: Check whether meta Discovery/Testing page is up-to-date
- Dan? Should probably have an automatic task to announce each test
- Guillaume: Talk to RobH to understand how procurement works
- Kevin: Think more about velocity question: Hire more? Change process? Is it OK as is? Start doing guesstimations?
- Mikhail: Announce the past test(s)