Growth/Analytics updates/Work log/2018-08-22
The focus for today is to understand new user retention in Czech and Korean Wikipedia, which are the two target Wikipedias for the team's potential interventions. To do so, we'll grab data for both wikis and look at trends in account creation, thresholds for retention calculations, and trends in retention, and finally do a power analysis.
Account creation trends
First, let's look at historic trends in overall account creation for both wikis, in order to understand what the range of our data is, and what has been going on.
We have data going back to 2006 for both Wikipedias. While Wikipedia has been around since before then and there are plenty of accounts registered prior to 2006, that is the time when logging of account creations first started. We are using that information to distinguish accounts based on how they were created, something we will come back to later.
There are some trends to note in Figures 1 and 2. First of all, we have some days with a very high number of registered accounts, and some of these happened recently (within the last year). We will look into how these accounts were created in case they affect our calculations. Secondly, we can see that the number of new accounts has been fairly stable since about 2011 for both wikis. There is some fluctuation, which we will come back to. Lastly, we can see the effect of the SUL finalization project during 2014 and 2015, which led to a significant increase in the number of account creations during that time period.
Due to the increase in registrations during SUL finalization, we'll focus on the period from 2016 onwards. That gives us about 2.5 years of data, during which the number of accounts created appears to be relatively stable overall. Here is what the daily number of account creations looks like for both wikis over that period:
In Figures 3 and 4, we again see the spikes in account creations, which we need to look into. We can also start to see some seasonal variation in registrations on Czech Wikipedia: there is a dip during the summer months in both 2017 and 2018.
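To make the restriction to 2016 onwards and the seasonal view concrete, here is a minimal sketch of how daily creation counts could be computed and rolled up by month. It is not the code used for the figures. The function names and the sample dates are made up for illustration, and it assumes registration dates are already available as Python `date` objects.

```python
from collections import Counter
from datetime import date


def daily_counts(registration_dates, start=date(2016, 1, 1)):
    """Count account creations per day, keeping only dates on or after `start`."""
    return Counter(d for d in registration_dates if d >= start)


def monthly_totals(daily):
    """Sum daily counts into (year, month) buckets to make seasonal dips visible."""
    totals = Counter()
    for day, n in daily.items():
        totals[(day.year, day.month)] += n
    return totals


# Made-up example dates, not real registration data.
sample = [
    date(2015, 12, 31),  # before the 2016 cutoff, so it is dropped
    date(2017, 3, 1),
    date(2017, 3, 1),
    date(2017, 7, 15),
]
daily = daily_counts(sample)
monthly = monthly_totals(daily)
```

Comparing the monthly totals for the summer months against the rest of the year is one simple way to check for a dip like the one seen on Czech Wikipedia.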
Next, we split the graph based on how the accounts were created, to see if the spikes affect all types of creations or just a single type:
In Figures 5 and 6, "origin_type" is how the account was created. "autocreated" means the account was created automatically by the system when a user who already has an account on another wiki visited the given wiki. "create" is the normal type of account creation, where someone without an account creates one for themselves. "create2" means that an existing user created the account for someone else, which might happen during events, for example. "other" means that we do not know how the account was created because there is no entry for that account in the logging table.
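The labelling described above can be sketched as a simple lookup. This is an illustration, not the actual query behind the figures. The action names on the left side of the mapping are an assumption about how creations are recorded in the logging table, and `logged_actions` is a hypothetical dictionary from user ID to logged action. Only the four origin_type labels come from the text above.

```python
# Assumed mapping from a logged creation action to the origin_type label.
# The action names (keys) are an assumption; the labels (values) are the
# ones used in Figures 5 and 6.
ACTION_TO_ORIGIN = {
    "autocreate": "autocreated",
    "create": "create",
    "create2": "create2",
}


def origin_type(user_id, logged_actions):
    """Return the origin_type for an account.

    Accounts with no entry in `logged_actions` are labelled "other". As a
    further assumption, so is any action name not in the mapping.
    """
    action = logged_actions.get(user_id)
    return ACTION_TO_ORIGIN.get(action, "other")


# Made-up example: user 3 has no logging entry at all.
example_log = {1: "create", 2: "autocreate", 4: "create2"}
labels = {uid: origin_type(uid, example_log) for uid in (1, 2, 3, 4)}
```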
There are two key things to note in Figures 5 and 6. First of all, the spikes only occur for autocreated accounts. These are users who already have accounts on other wikis, and we are not primarily interested in them when it comes to newcomer retention, for two reasons: their behaviour might differ from that of a "typical newcomer" because they already have experience from another wiki, and they are arguably not newcomers at all. Removing them from our analysis also leaves us with no outlier days in our dataset.
Secondly, we see an increase in "other" towards the end of the dataset. This might be due to how the datasets in the Data Lake are processed, with the result that some entries in the user table have no matching entry in the logging table. Because of this, we will restrict our retention analysis so that it ends on 2018-07-01.
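Taken together, the decisions so far (start in 2016, drop autocreated accounts, end on 2018-07-01) amount to a filter on registration date and origin_type. The following is a minimal sketch of that filter with hypothetical names. Treating the end date as inclusive is an assumption, since the text above does not say which it is.

```python
from datetime import date

START = date(2016, 1, 1)  # after SUL finalization
END = date(2018, 7, 1)    # assumption: treated as inclusive here


def in_retention_sample(registration_date, origin):
    """True if an account falls inside the window and is not autocreated."""
    return origin != "autocreated" and START <= registration_date <= END


# Made-up example accounts as (registration date, origin_type) pairs.
accounts = [
    (date(2015, 6, 1), "create"),       # too early
    (date(2017, 2, 3), "autocreated"),  # excluded: already has an account elsewhere
    (date(2017, 2, 3), "create"),       # kept
    (date(2018, 8, 1), "other"),        # after the cutoff
]
kept = [a for a in accounts if in_retention_sample(*a)]
```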