Growth/Analytics updates/Personalized first day experiment plan

Our Personalized first day intervention will deploy a short survey to users who register on our target wikis. In preparation for that going live, we are considering whether to run any experiments with that survey, and if we do, what type of experiments to run. We started out by discussing this in T206380, but then turned it into a larger plan that doesn't fit Phabricator well. This page describes and explains our current plans. Feel free to follow up on the talk page with any questions or comments you might have. This is a work in progress and we welcome feedback!

Experiment plan

The goal of the intervention is to learn more about those who create accounts on our target wikis, and we plan to use a short survey to do so. That changes the signup process by introducing another step and asking the user to spend some time answering the questions in the survey. This could change user behavior, for instance by distracting the user from what they were doing prior to creating an account, or changing their motivation to continue spending time on Wikipedia.

The Growth Team's overarching goal is to increase editor retention, and the first part of becoming a retained editor is to become an editor. Our primary concern regarding the survey is therefore whether it leads to fewer new users making their first edit. This has led us to propose a two-step experiment plan for this survey:

A one-month A/B test with the survey in target wikis to gather data and measure its impact on editor activation.
If the survey does not significantly affect editor activation rates, deploy it to a larger proportion of new users in order to gather more responses.

In the A/B test, we show the survey to 50% of new registrations. The other 50% become our control group. Even though we are concerned that the survey might reduce editor activation, we choose a 50/50 split because it allows us to determine whether there is an effect as quickly as possible.[note 1] We also discuss several other measurements below that might give us early indications that the survey is not functioning the way we intended.
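To illustrate why an even split reaches a conclusion fastest, the sketch below computes the standard error of the difference between two group proportions for several allocations. The activation rate and the number of signups are hypothetical values chosen for illustration and do not come from this plan. Under those assumptions, the standard error is smallest when the two groups are the same size.

```python
import math

def se_of_difference(rate, total_users, survey_share):
    """Standard error of the difference between two group proportions.

    Assumes both groups share the same underlying activation rate `rate`
    (the null hypothesis) and that `survey_share` is strictly between 0 and 1.
    """
    n_survey = total_users * survey_share
    n_control = total_users * (1 - survey_share)
    return math.sqrt(rate * (1 - rate) * (1 / n_survey + 1 / n_control))

# Hypothetical values for illustration only; neither comes from the plan.
ASSUMED_ACTIVATION_RATE = 0.30
ASSUMED_TOTAL_SIGNUPS = 10_000

for share in (0.10, 0.25, 0.50, 0.75, 0.90):
    se = se_of_difference(ASSUMED_ACTIVATION_RATE, ASSUMED_TOTAL_SIGNUPS, share)
    print(f"survey share {share:.0%}: standard error {se:.5f}")
```

A smaller standard error means the same true difference stands out from noise with fewer registrations, which is the sense in which 50/50 is the quickest allocation.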

Once the A/B test concludes, we will analyze the data to see whether the survey significantly impacts editor activation. Given the rate of signups on our target wikis, we expect to be able to detect a difference of 10% or more. If we find a difference of that magnitude in the negative direction, we'll turn the survey off and look for ideas on how to mitigate the problem.
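One common way to check whether two activation rates differ significantly is a two-proportion z-test. The sketch below is an illustration under stated assumptions, not necessarily the analysis we will run: the function name and the counts are hypothetical, and the test relies on the normal approximation, which is reasonable for large groups.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(activated_control, n_control, activated_survey, n_survey):
    """Two-sided z-test for a difference between two proportions.

    Returns (difference, z, p_value), where difference is the survey group's
    activation rate minus the control group's activation rate.
    """
    rate_control = activated_control / n_control
    rate_survey = activated_survey / n_survey
    # Pooled rate under the null hypothesis that both groups behave the same.
    pooled = (activated_control + activated_survey) / (n_control + n_survey)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_survey))
    z = (rate_survey - rate_control) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return rate_survey - rate_control, z, p_value

# Hypothetical counts for illustration only.
diff, z, p_value = two_proportion_z_test(1500, 5000, 1350, 5000)
print(f"difference: {diff:+.3f}, z: {z:.2f}, p-value: {p_value:.4f}")
```

A negative difference with a small p-value would correspond to the case where the survey lowers editor activation.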

Survey measurements

As discussed above, the main focus of the A/B test will be to see whether the survey has an overall effect on editor activation. We would also like to investigate this by dividing the population into various subgroups, and foresee making the following categorizations:

Whether the respondent completed, partially completed, or skipped the survey.
The respondent's answers to the various survey questions.
Whether the respondent chose to follow the "tutorial" or "help desk" links found in the survey.
The context the account was created from (editing, reading the main page, or reading something else).
The device the account was created on (mobile, desktop).
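As a rough illustration of the subgroup breakdown listed above, the sketch below computes the activation rate for each value of one categorization. The record layout and the field names (`survey_status`, `device`, `activated`) are hypothetical and do not reflect our actual data schema.

```python
from collections import defaultdict

def activation_rate_by(records, key):
    """Activation rate for each distinct value of `key` in the records."""
    totals = defaultdict(int)
    activated = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        activated[group] += int(record["activated"])
    return {group: activated[group] / totals[group] for group in totals}

# Hypothetical records for illustration only.
sample_records = [
    {"survey_status": "completed", "device": "desktop", "activated": True},
    {"survey_status": "completed", "device": "mobile", "activated": False},
    {"survey_status": "skipped", "device": "mobile", "activated": False},
    {"survey_status": "skipped", "device": "desktop", "activated": True},
    {"survey_status": "partial", "device": "desktop", "activated": True},
]

print(activation_rate_by(sample_records, "survey_status"))
print(activation_rate_by(sample_records, "device"))
```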

Additional measurements

In addition to studying the effect on editor activation, we will be interested in knowing more about the responses to the survey. The initial list of measurements looks as follows:

Frequency of responses to the survey questions.
Frequency of…
- Completing the survey.
- Partially completing the survey.
- Skipping the survey.
- Clicking away from the survey without skipping.
- Leaving the site.
Qualitative analysis of free text responses.

We would also want to know if there are meaningful differences between these depending on what context the account was created from (editing, reading the main page, or reading something else), and depending on the device the account was created on (mobile, desktop).

Leading indicators and plans of action

The purpose of the A/B test is to determine if the survey has a detrimental effect on editor activation. As discussed above, we should be able to detect a difference at or above 10% after a month. Waiting a whole month to know if something is seriously wrong is too long, so we have sketched out a set of scenarios. For each scenario, we have a plan of action so that it is clear prior to deployment of the survey what we will do if we see certain things happening.
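To make the "detectable difference" idea concrete, the sketch below uses a standard normal-approximation formula to estimate how many users each group needs in order to detect a given absolute difference in activation rate. The significance level (0.05), the power (0.80), and the baseline activation rate are assumptions chosen for illustration. The example also treats the 10% as a relative change, which is an interpretation and not something this plan specifies.

```python
import math
from statistics import NormalDist

def required_n_per_group(baseline_rate, abs_difference, alpha=0.05, power=0.80):
    """Approximate users needed per group for a two-sided test with equal groups.

    Uses the baseline rate's variance for both groups, which is a
    simplification that works best when the difference is small.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = 2 * baseline_rate * (1 - baseline_rate)
    return math.ceil((z_alpha + z_power) ** 2 * variance / abs_difference ** 2)

def minimum_detectable_difference(baseline_rate, n_per_group, alpha=0.05, power=0.80):
    """Smallest absolute difference detectable with the given group size."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * math.sqrt(
        2 * baseline_rate * (1 - baseline_rate) / n_per_group
    )

# Hypothetical baseline; a 10% relative drop from 0.30 is 0.03 in absolute terms.
ASSUMED_BASELINE_RATE = 0.30
needed = required_n_per_group(ASSUMED_BASELINE_RATE, 0.10 * ASSUMED_BASELINE_RATE)
print(f"users needed per group: {needed}")
```

With a lower baseline activation rate or a smaller target difference, the required group size grows quickly, which is why the leading indicators below matter during the first weeks.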

These scenarios have measurements connected with them, and we see these measurements as leading indicators: early signals that suggest a negative impact on editor activation. Because we have not deployed this survey before, it is difficult to know for certain what a meaningful value for these leading indicators will be, so the thresholds below are our best guesses.


Indicators	Plans of action
Leaving the site upon seeing the survey	If 15% of respondents leave the site, we leave the survey running until we have enough data in our A/B test to see the activation rate of the people who stick around. Then we turn it off and analyze. If 25% of respondents leave the site, we try out a much shorter form (or wait for the improved user experience of Variant C). If the number is very high (>50%?), we turn the survey off.
Skipping the survey	If 70% of respondents skip the survey, we try out a much shorter form (or wait for the improved user experience of Variant C). If 100% of respondents skip the survey, something is broken about the form. We investigate how to fix it.
Partially completing the survey	If 80% of respondents who submit the survey only partially complete it, and if their partial responses are from the beginning of the survey, then we try out a much shorter survey (or wait for the improved user experience of Variant C). If partially completed surveys have earlier questions completed at lower rates than later questions, we revisit the translation of the questions. If the translations are okay, we think about reordering or shortening the survey, perhaps switching to a couple of shorter surveys.
Negative responses in the free-text fields	If 5% or more of free-text responses complain about the survey or data privacy, or express confusion about the sign-up process, then we re-evaluate the text and the length of the survey.
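The thresholds in the table can be read as a simple set of rules. The sketch below encodes three of the indicators (leaving, skipping, and free-text complaints) as a hypothetical helper function. Treating each threshold as "at or above" is an assumption, since the table does not say how to handle values between two thresholds, and the 50% cutoff is marked as tentative in the table itself.

```python
def planned_actions(leave_rate, skip_rate, complaint_rate):
    """Return the planned actions triggered by the observed indicator rates.

    All rates are fractions between 0 and 1. An empty list means that no
    threshold from the table was reached.
    """
    actions = []

    # Leaving the site upon seeing the survey.
    if leave_rate > 0.50:  # tentative cutoff, written as ">50%?" in the table
        actions.append("leaving: turn the survey off")
    elif leave_rate >= 0.25:
        actions.append("leaving: try a much shorter form or wait for Variant C")
    elif leave_rate >= 0.15:
        actions.append("leaving: keep running until the A/B test has enough data, then turn off and analyze")

    # Skipping the survey.
    if skip_rate >= 1.0:
        actions.append("skipping: the form is probably broken, investigate")
    elif skip_rate >= 0.70:
        actions.append("skipping: try a much shorter form or wait for Variant C")

    # Negative responses in the free-text fields.
    if complaint_rate >= 0.05:
        actions.append("free text: re-evaluate the text and length of the survey")

    return actions

# Hypothetical observed rates for illustration only.
for action in planned_actions(leave_rate=0.18, skip_rate=0.72, complaint_rate=0.02):
    print(action)
```

The partial-completion indicator is left out because it also depends on which questions were answered, not only on a single rate.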

Footnotes

[note 1] If you would like to learn more about why, see this answer on Cross Validated. While that answer looks at sample means and we are concerned with proportions, a proportion is the mean of a 0/1 outcome and is approximately normally distributed for large samples, so the same reasoning applies here too.
