Discovery/Retrospective 2015-11-30

Review action items from previous retrospective:

  • Erik: Brainstorm on language-related goal
    • DONE. Chose to move forward with Accept-Language headers (Erik) and training a language detector (Trey); a sketch of the Accept-Language approach follows this list
  • Kevin to take showcase feedback to Adam
    • DONE
  • Oliver to continue email thread about user satisfaction survey
    • DONE
  • Kevin to email about wiki page "categories"
    • Stas added categories. We thought there was an email conversation, but I [Kevin] can't find it right now.
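
A minimal sketch of the Accept-Language idea (a hypothetical helper, not the production CirrusSearch code): the header lists the user's preferred languages with q-weights, which could drive a fallback language when a search on the local wiki comes up empty.

    # Hypothetical sketch: rank fallback search languages from an
    # Accept-Language header. Not the production CirrusSearch code.
    def parse_accept_language(header):
        """Return language codes sorted by their q-weights, highest first."""
        langs = []
        for part in header.split(","):
            piece = part.strip()
            if not piece:
                continue
            if ";q=" in piece:
                code, q = piece.split(";q=", 1)
                try:
                    weight = float(q)
                except ValueError:
                    weight = 0.0
            else:
                code, weight = piece, 1.0
            langs.append((code.strip().lower(), weight))
        return [code for code, _ in sorted(langs, key=lambda x: -x[1])]

    # Example: a Russian-speaking user searching enwiki
    print(parse_accept_language("ru,en-US;q=0.7,en;q=0.3"))
    # -> ['ru', 'en-us', 'en']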

What has happened since the last retro? (2015-11-02)

  • Portal shifted to Gerrit; event logging
  • Progress on relevance lab
  • Ongoing hiring processes
  • Ran multiple cirrus A/B tests
  • Worked out issues with avro schemas and analytics pipeline
  • Found a nasty bug in Blazegraph causing data corruption and developed a workaround (so it should stop now)
  • Improved WDQS GUI significantly (with WMDE team help)
  • Have monitoring dashboard for WDQS now: https://grafana.wikimedia.org/dashboard/db/wikidata-query-service
  • Maps are now available for ruwiki's Geohack (GPS links) and Wikivoyage (en & ru)
  • Dashboard for the portal: http://discovery.wmflabs.org/portal/

What went well?

  • Cirrus A/B testing goes from strength to strength (and we now have analysis redundancy!)
  • We are 99% of the way there to achieving our primary search goal (https://www.mediawiki.org/wiki/Wikimedia_Engineering/2015-16_Q2_Goals#Search) ++
  • David's work with analytics to get the avro pipeline running has been much appreciated
  • Picking up the completion suggester work again; it was incredibly promising when we last ran tests on it!
  • Product manager hiring seems to be going extremely well! We have had a lot of really good candidates and interviews so far.
  • Progress on relevance lab
  • Maps are live in ruwiki's Geohack (thanks to in-person meetings at a conference) and Wikivoyage
  • Graphs are getting closer to being interactive
  • ruwiki reported significantly better satisfaction with tech side of WMF - possibly due to substantial participation in the community by Max and Yuri

What could have gone better?

  • We didn't get the Survey out (and won't be able to do so usably until next quarter; I don't trust data from late December, simple as that) ++
  • We didn't get the Portal A/B test out (and won't be able to do so usably until next quarter) +++
    • NOTE: Follow-up conversations raised the possibility that we might still be able to test this month
  • Unfortunately, the test we ran for our Q2 goal did not show significant user impact. We are still not showing significant user impact as a result of our search work. ++++++++++
  • The common terms query A/B test ended up in limbo
    • Canceled to focus on the quarterly goal tests; it was initially reverted because of performance issues, and once those were worked out we needed to move on.
    • We should try to pick it up again in January if we can; it was promising++++
  • Language detection is hard to do well on short strings (see the sketch after this list). Data gathering for retraining a model is hard. Progress is slow.+
  • Hard to show impact on inter-language search; the number of queries is just too small (per initial analysis by Trey, backed up by our prod tests).
  • All the features for inter-language search have been implemented, but we should review Trey's analysis and fine-tune
  • Ops hiring has moved forward, but no signature on the dotted line yet
  • Realized we can't analyze the "did you mean" test results; the test wasn't collecting data properly due to changed CSS classes+
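
To illustrate why short strings are hard (per the language-detection item above), here is a toy character-trigram scorer. The training data is made up and this is not Trey's actual detector; the point is only that a short query contributes so few trigrams that the two models barely differ.

    # Toy character-trigram language scorer, to show why short strings
    # are hard. Training data is made up; not Trey's actual detector.
    from collections import Counter

    def trigrams(text):
        padded = f"  {text.lower()} "
        return [padded[i:i + 3] for i in range(len(padded) - 2)]

    def train(samples):
        counts = Counter()
        for s in samples:
            counts.update(trigrams(s))
        total = sum(counts.values())
        return {g: c / total for g, c in counts.items()}

    def score(text, model):
        # Sum of trigram probabilities; real detectors use log-likelihoods
        # and smoothing, omitted here for brevity.
        return sum(model.get(g, 1e-6) for g in trigrams(text))

    en = train(["the quick brown fox", "searching the wiki", "what is this"])
    fr = train(["le renard brun rapide", "chercher le wiki", "qu'est-ce que c'est"])

    for query in ["searching for something", "wiki", "le fox"]:
        guess = "en" if score(query, en) > score(query, fr) else "fr"
        print(f"{query!r}: {guess}")
    # Long queries give many trigrams and a clear margin; a short query
    # like "wiki" contributes so few that the decision is basically noise.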


Discussion


TOPIC: "Unfortunately, the test we ran for our Q2 goal did not show significant user impact . We are still not showing significant user impact as a result of our search work."

  • Should we adjust analysis to capture effects within a small subset of all of the searches?
    • Probably, but this doesn't explain why we haven't had more impact
    • Inability to measure quality of results has hurt us
    • Long tail: each change won't have a big impact
    • Measure the impact of a change against the population of possibly affected searches
      • Measuring a change that affects a very small number of searches is hard and expensive (see the sketch after this list)
  • Would it make sense to identify "obvious bots"?
    • We didn't see substantial improvement even when we did exclude bots
  • Could shift focus away from ZRR (Zero Results Rate) and toward relevance
  • Should we split up into microteams, to make progress on more small changes at once?
    • Creates team and process problems
  • Biggest problem is not search results that fail; it is when search results are presented but the user doesn't click
    • UX issues. Maybe split front-end engineers between portal and search results
  • We would like to keep people searching within our system, rather than bouncing out to other search engines from our content
  • Are we running A/B tests too soon? Should we do more internal analysis first?
    • For language, we knew the effects would probably be small, but it was our Q goal so we moved ahead
  • For the portal page, test effects are expected to be small (common sense), but getting them out this quarter should be good
    • Not a lot of internal discussion was needed, but next quarter would probably make sense
  • Should we have validation process to make sure we are collecting the data we wanted?
    • We do actually have that. The CSS issue was older code (before our validation process).
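
To make "hard and expensive" concrete, here is a back-of-the-envelope power calculation. The numbers are illustrative, not taken from our dashboards: it assumes a two-proportion z-test at alpha = 0.05 with 80% power, and shows how the required sample size explodes as the affected slice of searches shrinks.

    # Back-of-the-envelope power calculation: why a change that touches
    # only a small slice of searches needs a huge sample to detect.
    # All numbers are made up for illustration, not from our dashboards.
    from math import sqrt

    Z_ALPHA = 1.96   # two-sided alpha = 0.05
    Z_BETA = 0.84    # power = 0.80

    def n_per_group(p1, p2):
        """Sample size per arm for a two-proportion z-test."""
        p_bar = (p1 + p2) / 2
        num = (Z_ALPHA * sqrt(2 * p_bar * (1 - p_bar))
               + Z_BETA * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
        return num / (p1 - p2) ** 2

    baseline_zrr = 0.25   # hypothetical zero results rate
    improvement = 0.05    # 5-point drop among *affected* searches

    for affected_fraction in (1.0, 0.10, 0.01):
        # Dilution: only this fraction of all searches can move at all.
        diluted = baseline_zrr - improvement * affected_fraction
        print(f"{affected_fraction:>5.0%} affected -> "
              f"{n_per_group(baseline_zrr, diluted):,.0f} searches per arm")
    # Shrinking the affected slice from 100% to 1% inflates the required
    # sample by roughly 1/f^2, i.e. about 10,000x here.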



Action Items

  • Dan: write a goal for improving the UX of the search page on-wiki
  • Dan: Discussion of improving the relevance/sorting of results rather than just zero results rate
  • Moiz: Talk about whether we really can run A/B tests on the portal, since it's not subject to a deployment freeze
  • Dan: Follow up on the common terms query A/B test
  • Mikhail: Look into listing features that affected the results set for a query (sister project to 'query categorizer UDF')