Wikimedia Product/Analytics Infrastructure/Experiment Platform

Wishlist

  • Access to metadata about ongoing experiments across the Foundation to avoid conflicts
    • Access to previous experiments and their outcomes (including decisions made based on the results) for setting baselines and expectations about possible changes
  • A global queue of experiments to avoid clashes in which users receive multiple interventions/treatments, which would affect the results
  • Experiment targeting (specific users satisfying conditions/criteria)
    • E.g. users who have registered with a verified email address
  • Power analysis tools for determining sample sizes (see the sample-size sketch at the end of this list)
    • Automatic determination of experiment duration based on data on new user registrations, how many existing registered editors there are, how many anonymous editors, and how many readers or active users (in the mobile apps sense)
  • Predictive models for determining who is most likely to benefit from an intervention, to inform targeting (e.g. identifying a new editor on a “likely to drop out (stop editing)” trajectory and placing them in the treatment group)
  • Defining QA metrics and tools for implementing them (e.g. detecting sudden sampling biases, such as IE users not being represented)
  • Server-side sampling based on editing activity (for example) and then recording whether someone is in an experiment
  • Different randomization strategies (e.g. unequal group weights, such as a treatment group larger than the control group)
  • Bandit optimization
  • Bayesian optimization
  • Dashboards for monitoring test results
  • Automated report generation, leaving it to the data/product analyst to interpret the results for the product manager
  • Library of success metrics, their definitions, and implementations
    • Teams have KPIs, so when they deploy experiments they can specify which KPI will be impacted
  • POTENTIALLY we may want to use third-party tools like Optimizely
  • Require an experiment design document/form that includes questions on measurement and targeting/sampling
  • Cohorts of users (by week, by wiki)
  • Built-in multilevel modeling for cross-wiki, cross-cohort experiments
  • Sequential testing (cf. New Stats Engine whitepaper from Optimizely)
    • Why you won’t need to set a sample size in advance
    • How Stats Engine enables you to confidently check experiments as often as you want
    • How you can test as many variations and goals as you want without worrying about hidden sources of error
  • Exporting a dataset of group assignments
    • Also IMPORTING a dataset of group assignments
  • Feed output of one experiment into another experiment???
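
A minimal sketch of the power-analysis item above, assuming a binary success metric (e.g. retention) and a classical two-proportion z-test; all rates and thresholds are illustrative, not proposed defaults:

# Minimal sketch of a power-analysis helper: per-group sample size for a
# two-sided two-proportion z-test. Baseline and target rates are illustrative.
from scipy.stats import norm

def sample_size_two_proportions(p_control, p_treatment, alpha=0.05, power=0.8):
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the significance level
    z_beta = norm.ppf(power)           # critical value for the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = abs(p_treatment - p_control)
    return int((z_alpha + z_beta) ** 2 * variance / effect ** 2) + 1

# Example: baseline retention of 10%, hoping to detect an increase to 12%.
n_per_group = sample_size_two_proportions(0.10, 0.12)
# Experiment duration would then follow from how quickly the targeted
# population (new registrations, active editors, readers) accrues.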

Next up

Interview analysts to see what more could be included in this list

Notes

This effectively describes the Sampling Controller. When a client queries the configuration API for its sampling rate, it should send some minimally identifying information about itself (and the user) to receive a determination of enrollment in an online experiment (A/B test). On the backend, Product/Program Managers and Analysts define a target population to sample and the target platforms. Additionally, the experiment design may require that an editor receive the same treatment on multiple platforms (any device they're logged in on).
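
One common way to satisfy the cross-platform requirement (offered here only as an illustration; the notes below instead describe persisting assignments server-side) is to bucket deterministically by a stable user identifier:

# Illustrative only: deterministic bucketing by experiment name and user ID,
# so a logged-in editor receives the same treatment on any device.
# Experiment and group names are placeholders.
import hashlib

def assign_group(experiment: str, user_id: int, groups=("control", "treatment")) -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return groups[int(digest, 16) % len(groups)]  # same user -> same group everywhere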

The following are possible use cases (illustrative targeting sketches for the first two appear after the list):

  • Cross-platform intervention on a cohort of newly registered editors who have made fewer than 3 edits in the first month since creating their account
  • "Advertising" existence of Hindi & Marathi Wikipedias to Hindi & Marathi speakers (based on system/browser language settings) in India who are browsing English Wikipedia
  • Assessing how a change to the display of Chinese characters in a mobile app affects reading length/depth
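
Purely as an illustration, the targeting criteria in the first two use cases might reduce to predicates like the following; the field names and thresholds are assumptions, not agreed criteria:

# Hypothetical targeting predicates for the first two use cases above.
# Field names and thresholds are placeholders.
def targets_new_editor_cohort(user: dict) -> bool:
    # Newly registered editors with fewer than 3 edits in their first month.
    return user["days_since_registration"] <= 30 and user["edit_count"] < 3

def targets_hi_mr_reader_on_enwiki(request: dict) -> bool:
    # Hindi/Marathi speakers (per system/browser language) in India browsing English Wikipedia.
    return (request["region"].startswith("IN")
            and any(lang in ("hi", "mr") for lang in request["langs"])
            and request["wiki"] == "enwiki")  # wiki being browsed (placeholder field)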

To that end, we should consider having clients include the following information when requesting configuration settings, in addition to the "platform" & "client ID" that the client would send anyway (an example payload is sketched after this list):

  • USER: If logged in: {wiki, user ID} or username
  • REGION: Geographic region (at state level?)
  • LANGS: Ordered languages (top N?)
  • STREAMS (maybe?): List of streams the client can send events to (that Stream Manager has registered at initialization)
  • CONFIG (maybe?): Configuration hash for versioning
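
For illustration, a configuration request carrying these fields might look like the sketch below; the payload shape and values are assumptions, not a defined schema:

# Hypothetical configuration request payload; field names and values are
# illustrative only.
payload = {
    "platform": "android",
    "client_id": "a1b2c3d4-0000-0000-0000-000000000000",  # placeholder client ID
    "user": {"wiki": "hiwiki", "user_id": 12345},          # USER: only if logged in
    "region": "IN-MH",                                     # REGION: state-level
    "langs": ["mr", "hi", "en"],                           # LANGS: ordered, top N
    "streams": ["stream1", "stream2", "stream4"],          # STREAMS: registered with Stream Manager
    "config": "c0ffee42",                                  # CONFIG: configuration hash for versioning
}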

The backend uses this data to check whether the client is/should be enrolled in any experiments. Potential ideas for responses:

  • A single sampling rate (expressed as a probability, e.g. {1.0} or {0.0}) which is used for every stream in STREAMS
  • A list of per-stream sampling rates (expressed as probabilities); e.g. { "STREAMS": [{ "stream1": 0.1 }, { "stream2": 0.5, "stream3": 1.0 }, { "stream4": 0.0 }] }
  • Optionally, the sampling controller responds with the group assignment. If the user is already enrolled in, say, a cross-platform experiment, the tag is recalled from the database. If the user is not yet in an experiment but is on a platform with an active experiment, the sampling controller performs an enrollment random roll and, if it enrolls the client, follows up with a group-assignment random roll. It then saves the {client_id, experiment_tag} pair in the database. If the client is enrolled, the backend responds with the following experiment information (a sketch of this flow follows the example):
{
  "tag": "growth/some_experiment/some_group",
  "start_dt": "2019-05-01T00:00:01Z",
  "end_dt": "2019-05-14T23:59:59Z",
  "streams": ["stream1", "stream4"]
}
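
A minimal sketch of that enrollment flow, with an in-memory dict standing in for the database of {client_id, experiment_tag} pairs; the experiment name, enrollment rate, and group weights are placeholders:

# Minimal sketch of the enrollment flow described above. The dict stands in
# for the database of {client_id, experiment_tag} pairs.
import random

assignments = {}  # {client_id: experiment_tag}

def enroll(client_id, experiment="growth/some_experiment", enrollment_rate=0.5,
           groups=(("control", 0.5), ("treatment", 0.5))):
    # Already enrolled (e.g. in a cross-platform experiment): recall the tag.
    if client_id in assignments:
        return assignments[client_id]
    # Enrollment random roll.
    if random.random() >= enrollment_rate:
        return None  # not enrolled; respond without experiment information
    # Group-assignment random roll (weights need not be equal).
    names, weights = zip(*groups)
    group = random.choices(names, weights=weights, k=1)[0]
    tag = f"{experiment}/{group}"  # e.g. "growth/some_experiment/some_group"
    assignments[client_id] = tag   # persist the {client_id, experiment_tag} pair
    return tag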

This is remembered by the client (to the extent possible), especially the experiment's expiration date. Affected streams are tagged while the experiment is active, and the UI/UX is configured appropriately for the duration of the experiment. Once end_dt is reached, the relevant UI/UX should return to the default configuration, and events sent to the affected streams should cease to be tagged as belonging to an experimental group.
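
For example, a client that has stored the response above might decide per event whether to tag it roughly as follows; the function and field names are placeholders:

# Illustrative client-side check against the stored experiment object;
# function and field names are placeholders.
from datetime import datetime, timezone

def experiment_tag_for(stream: str, experiment: dict):
    """Return the tag to attach to an event on `stream`, or None."""
    end_dt = datetime.fromisoformat(experiment["end_dt"].replace("Z", "+00:00"))
    if datetime.now(timezone.utc) > end_dt:
        return None  # expired: revert to the default UI/UX and stop tagging events
    if stream not in experiment["streams"]:
        return None  # stream not affected by this experiment
    return experiment["tag"]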

References

Bradley, A. (2019). Building our Centralized Experimental Platform. MultiThreaded. StitchFix.