Growth/Analytics updates/Work log/2018-10-05
Account creations from mobile
editIn T205760, I was asked to figure out what percentage of accounts were created from mobile devices vs. desktop. This should exclude auto-created accounts, and it would be preferable if it could be split further by device (e.g. iOS and Android).
I found the ServerSideAccountCreation schema, which should enable me to answer these questions. From what I can tell by exploring the data, if the isApi
flag is set the account was most likely created using a mobile app. Access to account creation through the API is not restricted to the mobile apps, so we might encounter non-app creations. While this binary heuristic is crude, it did appear to flag the vast majority of cases I looked at, and this project did not allow for developing a more sophisticated heuristic.
If the isApi
flag is not set, we can distinguish between the mobile and desktop sites using the displayMobile
flag. EventLogging also has the webhost
field, but that turns out to never be set to the mobile site in this case (perhaps why the displayMobile
flag is there?) These two flags combined allows us to split account creations into three categories: apps, mobile, and desktop.
Exploring the data further I found that I can use the os_family
property of the event to identify Android and iOS devices. I chose to not investigate further splits (e.g. identifying Windows phones), and instead put everything that is neither Android nor iOS in "other".
I was interested in getting data from 2016-01-01 onwards and plot a monthly time series, but this was hindered by the Data Lake only having consistent data starting in early February 2018. More data is available, but the tight deadline did not allow for merging of these data sources. Instead, I chose to use 6 months of data, covering February through July 2018. The data for February is not complete, but I am assuming that this affects every category equally since it's a time-based exclusion. Since the task is to get proportions and not specific counts, this should not substantially affect the results.
I ended up writing the HQL query below to grab data for this from the Data Lake. The name of the resulting table and the wiki it grabs data from is changed for each of the six wikis we are grabbing data from.
DROP TABLE IF EXISTS nettrom_growth.cs_creations_by_site;
CREATE TABLE nettrom_growth.cs_creations_by_site
AS SELECT SUBSTRING(u.user_registration_timestamp, 1, 7) AS reg_month,
SUM(IF(e.is_api = FALSE
AND e.is_mobile = FALSE
AND (e.os_family NOT IN ('Android', 'iOS')), 1, 0)) AS n_desktop_other,
SUM(IF(e.is_api = FALSE
AND e.is_mobile = FALSE
AND e.os_family = 'Android', 1, 0)) AS n_desktop_android,
SUM(IF(e.is_api = FALSE
AND e.is_mobile = FALSE
AND e.os_family = 'iOS', 1, 0)) AS n_desktop_ios,
SUM(IF(e.is_api = FALSE
AND e.is_mobile = TRUE
AND (e.os_family NOT IN ('Android', 'iOS')), 1, 0)) AS n_mobile_other,
SUM(IF(e.is_api = FALSE
AND e.is_mobile = TRUE
AND e.os_family = 'Android', 1, 0)) AS n_mobile_android,
SUM(IF(e.is_api = FALSE
AND e.is_mobile = TRUE
AND e.os_family = 'iOS', 1, 0)) AS n_mobile_ios,
SUM(IF(e.is_api = TRUE
AND (e.os_family NOT IN ('Android', 'iOS')), 1, 0)) AS n_app_other,
SUM(IF(e.is_api = TRUE
AND e.os_family = 'Android', 1, 0)) AS n_app_ios,
SUM(IF(e.is_api = TRUE
AND e.os_family = 'iOS', 1, 0)) AS n_app_android
FROM wmf.mediawiki_user_history u
JOIN (SELECT event.userId AS user_id,
event.displayMobile AS is_mobile,
event.isApi AS is_api,
useragent.os_family AS os_family
FROM event.serversideaccountcreation
WHERE year >= 2018
AND wiki = 'cswiki') AS e
ON u.user_id = e.user_id
WHERE u.wiki_db = 'cswiki'
AND u.snapshot = '2018-08'
AND u.caused_by_event_type = 'create'
AND TO_DATE(u.user_registration_timestamp) >= '2018-02-01'
AND TO_DATE(u.user_registration_timestamp) < '2018-08-01'
AND u.created_by_self = TRUE
AND u.created_by_system = FALSE
AND u.created_by_peer = FALSE
AND u.is_bot_by_name = FALSE
GROUP BY SUBSTRING(u.user_registration_timestamp, 1, 7);
Proportion of desktop, mobile, and apps
editLet's first look at the top-level split, between the different sites/methods: apps, desktop, and mobile. First, I use the whole dataset and calculate the overall proportion of registrations for each. This results in six bar charts, seen below.
-
Proportion of Czech Wikipedia registrations for apps, desktop, and mobile.
-
Proportion of Korean Wikipedia registrations for apps, desktop, and mobile.
-
Proportion of English Wikipedia registrations for apps, desktop, and mobile.
-
Proportion of German Wikipedia registrations for apps, desktop, and mobile.
-
Proportion of Arabic Wikipedia registrations for apps, desktop, and mobile.
-
Proportion of Ukrainian Wikipedia registrations for apps, desktop, and mobile.
What we can see is that for five out of the six wikis, the majority of registrations come through the desktop site. The Arabic Wikipedia is the exception, where 57.5% of the registrations are on the mobile site, and about one third of the registrations are on desktop. The proportion of mobile registrations on the five others appears to lie in the 20–30% range.
Next, I want to know if these proportions have changed over time. We therefore calculate the same proportions on a monthly basis for the six months in our datasets. The time series graphs are shown below.
-
Proportion of Czech Wikipedia registrations for apps, desktop, and mobile, plotted monthly from Feb through July.
-
Proportion of Korean Wikipedia registrations for apps, desktop, and mobile, plotted monthly from Feb through July.
-
Proportion of English Wikipedia registrations for apps, desktop, and mobile, plotted monthly from Feb through July.
-
Proportion of German Wikipedia registrations for apps, desktop, and mobile, plotted monthly from Feb through July.
-
Proportion of Arabic Wikipedia registrations for apps, desktop, and mobile, plotted monthly from Feb through July.
-
Proportion of Ukrainian Wikipedia registrations for apps, desktop, and mobile, plotted monthly from Feb through July.
The general trend in all these plots is that the proportion of registrations through the mobile site is stable, and there's a reduction in desktop registrations in favour of registrations through the mobile app. This means that our previous plots of the overall proportions is slightly overestimating the desktop registrations, while underestimating the app registrations, compared to the more recent months.
Proportion of Android and Apple
editOur datasets also allow us to split each site by whether the registration happened on an Android, iOS, or other type of device. Because the overarching question is registrations on apps/mobile/desktop, I decided to focus this on splitting each of those sites, rather than answer the question "what's the overall proportion of Android/Apple/other?"
Splitting up the dataset and calculating proportions for each site results in the six bar charts seen below.
-
Proportion of Czech Wikipedia registrations for apps, desktop, and mobile, each split into Android/iOS/other.
-
Proportion of Korean Wikipedia registrations for apps, desktop, and mobile, each split into Android/iOS/other.
-
Proportion of English Wikipedia registrations for apps, desktop, and mobile, each split into Android/iOS/other.
-
Proportion of German Wikipedia registrations for apps, desktop, and mobile, each split into Android/iOS/other.
-
Proportion of Arabic Wikipedia registrations for apps, desktop, and mobile, each split into Android/iOS/other.
-
Proportion of Ukrainian Wikipedia registrations for apps, desktop, and mobile, each split into Android/iOS/other.
In these graphs, one overall trend is that iOS dominates the apps registrations. German Wikipedia is an exception, where there is a large number of registrations falling into the "other" category. We use the isApi
flag to categorize account creations into the "apps" category, and while that might be correct in the vast majority cases, account creation through the API is not restricted to the apps. In this case it did not seem worthwhile to spend time on developing a more fine-tuned set of heuristics in order to filter out non-app account creations, which means that we end up with cases like this.
Registrations on the desktop site is typically "other", with Arabic and Korean showing a fair proportion of registrations from Android and iOS devices. Lastly, the trend of Android vs iOS is switched for the mobile site, typically 80/20 in Android's favor. The German Wikipedia is the exception here where it is more even.