Discovery/Retrospective 2015-11-30
Review action items from previous retrospective:
- Erik: Brainstorm on language-related goal
- DONE. Chose to move forward with Accept-Language headers (Erik) and training a language detector (Trey)
- Kevin to take showcase feedback to Adam
- DONE
- Oliver to continue email thread about user satisfaction survey
- DONE
- Kevin to email about wiki page "categories"
- Stas added categories. We thought there was an email conversation, but I [Kevin] can't find it right now.
What has happened since the last retro? (2015-11-02)
- Portal shift to Gerrit; event logging
- Progress on relevance lab
- Ongoing hiring processes
- Ran multiple cirrus A/B tests
- Worked out issues with avro schemas and analytics pipeline
- Found a nasty bug in Blazegraph causing data corruption and developed a workaround (so it should stop now)
- Improved WDQS GUI significantly (with WMDE team help)
- Have monitoring dashboard for WDQS now: https://grafana.wikimedia.org/dashboard/db/wikidata-query-service
- Maps are now available for ruwiki's Geohack (GPS links) and Wikivoyage (en & ru)
- Dashboard for portal http://discovery.wmflabs.org/portal/
What went well?
- Cirrus A/B testing goes from strength to strength (and we now have analysis redundancy!)
- We are 99% of the way to achieving our primary search goal (https://www.mediawiki.org/wiki/Wikimedia_Engineering/2015-16_Q2_Goals#Search) ++
- David's work with analytics to get the avro pipeline running has been much appreciated
- Picking up the completion suggester work again; it was incredibly promising when we last ran tests on it!
- Product manager hiring seems to be going extremely well! We have had a lot of really good candidates and interviews so far.
- Progress on relevance lab
- Maps are live in ru-geohack (thanks to in-person meetings at a conference) and on Wikivoyage
- Graphs are getting closer to being interactive
- ruwiki reported significantly better satisfaction with tech side of WMF - possibly due to substantial participation in the community by Max and Yuri
What could have gone better?
- We didn't get the survey out, and won't be able to do so usably until next quarter; I don't trust data from late December. ++
- We didn't get the Portal A/B test out (and won't be able to do so usably until next quarter) +++
- NOTE: Follow-up conversations raised the possibility that we might still be able to test this month
- Unfortunately, the test we ran for our Q2 goal did not show significant user impact. We are still not showing significant user impact as a result of our search work. ++++++++++
- The common terms query A/B test ended up in limbo
- Canceled to focus on quarterly goal tests. It was initially reverted because of performance issues, and once those were worked out we needed to move on.
- We should try to pick it up again in January if we can; it was promising ++++
- Language detection is hard to do well on short strings. Data gathering for retraining a model is hard. Progress is slow. +
- Hard to show impact on inter-language search, the number of queries is just too small (per initial analysis by Trey, and backed up by our prod tests).
- All the features for inter-language search have been implemented, but we should review Trey's analysis and fine-tune
- ops hiring has moved forward, but no signature on the dotted line yet
- Realized we can't analyze the "did you mean" test results; the test wasn't collecting data properly because of changed CSS classes +
Discussion
TOPIC: "Unfortunately, the test we ran for our Q2 goal did not show significant user impact. We are still not showing significant user impact as a result of our search work."
- Should we adjust analysis to capture effects within a small subset of all of the searches?
- Probably, but this doesn't explain why we haven't had more impact
- Inability to measure quality of results has hurt us
- Long tail: each change won't have a big impact
- Measure the impact of a change against the population of possibly affected searches
- Measuring a change that affects a very small number of searches is hard and expensive
- Would it make sense to identify "obvious bots"?
- We didn't see substantial improvement even when we did exclude bots
- Could shift focus away from ZRR (Zero Results Rate) and toward relevance
- Should we split up into microteams, to make progress on more small changes at once?
- Creates team and process problems
- Biggest problem is not search results that fail--it is when search results are presented, but the user doesn't click
- UX issues. Maybe split front-end engineers between portal and search results
- We would like to keep people searching within our system, rather than bouncing out to other search engines from our content
- Are we running A/B tests too soon? Should we do more internal analysis first?
- For language, we knew the effects would probably be small, but it was our quarterly goal so we moved ahead
- For portal page, tests are known to be small (common sense), but getting them out this quarter should be good
- Not a lot of internal discussion was needed, but next quarter would probably make sense
- Should we have validation process to make sure we are collecting the data we wanted?
- We do actually have that. The CSS issue was older code (before our validation process).
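
The measurement points above (measure against the population of possibly affected searches; small subsets are hard and expensive to measure) can be illustrated with a rough sketch. Everything in it is an assumption made for illustration: the function names, the 30% baseline clickthrough rate, the 2% affected share, and the 10-point lift are hypothetical, not figures from our tests. It uses the standard normal-approximation sample size formula for comparing two proportions.

```python
import math
from statistics import NormalDist


def diluted_lift(affected_share: float, lift_in_affected: float) -> float:
    """Absolute lift seen over ALL searches when only a share of them is affected."""
    return affected_share * lift_in_affected


def sample_size_per_group(p_base: float, abs_lift: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate searches needed per test group for a two-sided
    two-proportion z-test to detect an absolute lift of `abs_lift`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_new = p_base + abs_lift
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_power) ** 2 * variance / abs_lift ** 2)


# Hypothetical inputs, chosen only to show the shape of the problem.
BASELINE_CTR = 0.30      # assumed clickthrough rate before the change
AFFECTED_SHARE = 0.02    # assumed share of searches the change can touch
LIFT_IN_AFFECTED = 0.10  # assumed absolute lift within those searches

lift_overall = diluted_lift(AFFECTED_SHARE, LIFT_IN_AFFECTED)

# Analysis over all searches: the lift is diluted by unaffected traffic.
n_all = sample_size_per_group(BASELINE_CTR, lift_overall)

# Analysis restricted to possibly affected searches: undiluted lift,
# but only AFFECTED_SHARE of incoming traffic counts toward the sample.
n_affected = sample_size_per_group(BASELINE_CTR, LIFT_IN_AFFECTED)
traffic_needed = math.ceil(n_affected / AFFECTED_SHARE)

print("searches per group, all-traffic analysis:", n_all)
print("affected searches per group, subset analysis:", n_affected)
print("total searches per group to collect that subset:", traffic_needed)
```

With these made-up inputs, the all-traffic analysis needs far more searches per group than the subset analysis, even after accounting for the traffic required to collect enough affected searches. The subset analysis still depends on being able to identify which searches were possibly affected.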
Action Items
- Dan: write a goal for improving the UX of the search page on-wiki
- Dan: Discussion of improving the relevance/sorting of results rather than just zero results rate
- Moiz: Talk about whether we really can run A/B tests on the portal, since it's not subject to a deployment freeze
- Dan: Follow up on the common terms query A/B test
- Mikhail: Look into listing features that affected the results set for a query (sister project to 'query categorizer UDF')