Wikimedia Technology/Annual Plans/FY2019/TEC14: Smart Tools for Better Data/Goals

Program Goals and Status for FY18/19

  • Goal Owner: Nuria Ruiz
  • Program Goals for FY18/19: We will maintain and increase public access to past, present and real time data for Wikimedia projects. We will provide the infrastructure to measure the impact and reach of projects and features for editors, communities and WMF.
  • Annual Plan: TEC14: Smart Tools for Better Data
    • Primary Goal is Knowledge as a Service: Evolve our systems and structures
    • Tech Goal: Supporting our Community of contributors

Outcome / Output

Wikimedia Cloud Services users have easy access to high quality analytics data to answer questions about content and contributors.

Provision a cluster for public Data Lake access in Cloud Services

Goals

  • Order Data Lake hardware task T198424   In progress
  • Provide Rationale for SQL engine used to make data accessible in labs task T204537   Partially done


Outcome 3 / Output 1

Foundation staff and community have better visual tools to access data about content, contributors and readers.

Wikistats 2.0 - Users (and programmatic tools) have access to most reports that community consultation found of importance

Goals

  • Build most prolific contributors report task T189882   Done
  • Include metrics about total article count (pages to date) in Wikistats 2 task T198425   Done


Outcome 3 / Output 2

Foundation staff and community have better visual tools to access data about content, contributors and readers.

Wikistats 2.0 - Beta (carry on items from last quarter)

Goals


Outcome 3 / Output 3

Foundation staff and community have better visual tools to access data about content, contributors and readers.

Support for more data sources and programming languages for WMF Jupyter Notebook users.

Goals

  • Better integration of Jupyter with spark task T190443   Done


Outcome 4 / Output 1

Foundation staff and community have better visual tools to access data about content, contributors and readers.

Users see improvements on data computing and data quality.

Goals

  • Data Sanitization backend for hadoop that includes ability to salt & hash. task T198426   Done
  • STRETCH GOAL: POC More efficient Bot filtering on pageview data. task T211359   In progress
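
The "salt & hash" capability named in the first goal above can be illustrated with a minimal Python sketch. This is not the Hadoop sanitization backend itself: the function names `new_salt` and `salt_and_hash`, the choice of HMAC-SHA256, and the example IP address are all assumptions made for illustration only.

```python
import hashlib
import hmac
import secrets


def new_salt(num_bytes: int = 32) -> bytes:
    """Generate a random salt (hypothetical helper, not from the source)."""
    return secrets.token_bytes(num_bytes)


def salt_and_hash(value: str, salt: bytes) -> str:
    """Return a keyed SHA-256 digest of `value` using `salt` as the key.

    The same value hashed with the same salt always yields the same digest,
    so records stay joinable; once the salt is rotated or discarded, old
    digests can no longer be linked to new ones.
    """
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Example with an assumed sensitive field (a documentation-range IP address).
salt = new_salt()
a = salt_and_hash("203.0.113.7", salt)
b = salt_and_hash("203.0.113.7", salt)        # same salt, same digest
c = salt_and_hash("203.0.113.7", new_salt())  # new salt, different digest
```

Using a keyed hash rather than plain concatenation of salt and value is a design choice of this sketch, not something the goal description specifies.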


Outcome 4 / Output 2

Foundation staff and community have better visual tools to access data about content, contributors and readers.

MediaWiki content is available on cluster on recurrent schedule

Goals

  • STRETCH GOAL: Productionize MediaWiki content processing. Ingest and process text on every Wikipedia page to use later for analytics-style computations task T186559   In progress


Outcome 5 / Output 1

We have scalable, performant and reliable software for data transport

Software maintenance on analytics stack to maintain current level of service

Goals

  • Spin out a tiny EventLogging RL module for lightweight logging task T187207

Status

  Note: September 18, 2018

  Partially done. Work continues with the performance team; work was completed by the end of Q4.

Outcome / Output

Foundation staff and community have better visual tools to access data about content, contributors and readers.

Wikistats 2.0 - Users (and programmatic tools) have access to most reports that community consultation found of importance

Goals

  • Create report for "articles with most contributors" in Wikistats2 task T204965   Not done
  • Create report for Active editor metrics per project family task T188265   Not done
  • Provide easier mapping between Wikistats1 metrics and Wikistats2 metrics (example: "active editors") task T187806   Not done
  • Provide ability to query metrics per project family (*.wikipedia.org) in Wikistats UI task T205665   Done

Status

  Note: December 2018

Changes to display project family data (new registrations for all Wikipedias) deployed.


Outcome / Output

Wikimedia Cloud Services users have easy access to high quality analytics data to answer questions about content and contributors.

In this iteration (spanning several quarters) the Data Lake will include historical data about editing (revisions, pages, users) for all Wikimedia projects since the beginning. Data is optimized to be queried in an analytics-friendly way that allows for simple and fast queries.

Goals

  • Presto cluster online and usable with test data pushed from analytics prod infrastructure accessible by Cloud (labs) users task T204951   Done
  • Edit Data Lake Quality: Resolve known issues (ongoing goal) task T204953   Not done

Status

  Note: November 14, 2018

Presto setup on labs has started and is in progress; we are discussing the flow of data with SRE.

  To do December 2018

Missed the rest of the goals due to issues with sqooping MediaWiki data from labs; those issues are being worked on in task T210749 and task T210693.


Outcome / Output

Foundation staff and community have better visual tools to access data about content, contributors and readers.

Users see improvements on data computing and data quality.

Goals

  • STRETCH GOAL: POC More efficient Bot filtering on pageview data. task T211359   Done

Status

  Note: December 2018

Finished initial phase of POC, running additional tests and doing write up


Outcome 4 / Output 2

Foundation staff and community have better visual tools to access data about content, contributors and readers.

MediaWiki content is available on cluster on recurrent schedule

Goals

  • STRETCH GOAL: Productionize MediaWiki content processing. Ingest and process XML dumps to use later for analytics-style computations task T186559   Done

Outcome / Output

Foundation staff and community have better visual tools to access data about content, contributors and readers.

Wikistats 2.0 - Users (and programmatic tools) have access to most reports that community consultation found of importance

Goals

  • Create report for "Articles With Most Contributors" in Wikistats2 task T204965
  • Create report for "Active Editors" metrics per project family in Wikistats2 task T188265
  • Provide easier mapping between Wikistats1 metrics and Wikistats2 metrics (example: "active editors") task T187806   Done
  • Import Edit Data Lake dataset into turnilo (WMF data exploratory tool) task T211173
  • Create staging environment to test upgrades of superset: task T212243   Done

Status

  Note: February 14, 2019

  • The comment and actor refactor on MediaWiki and performance problems on labs db replicas have delayed much of this work; those issues are being tracked in T210749 and T210693
  • We plan to deploy superset's latest release to a staging environment by end of quarter

  Note: March 14, 2019

We have a staging environment in which we are testing the upgrade to superset. Both the "Articles With Most Contributors" and "Active Editors" metrics will be worked on next quarter. The import of edit data into druid is under way, with a first import expected to happen this quarter.

Outcome / Output

Wikimedia Cloud Services users have easy access to high quality analytics data to answer questions about content and contributors.

In this iteration (spanning several quarters) the Data Lake will include historical data about editing (revisions, pages, users) for all Wikimedia projects since the beginning. Data is optimized to be queried in an analytics-friendly way that allows for simple and fast queries.

Goals

  • Edit Data Lake Quality: Resolve known issues (ongoing goal) task T204953   Partially done
  • Sunset wikimetrics. It is being replaced by the event-metrics tool: task T211835   Done

Status

  Note: February 2019

Efforts to improve the quality of Data Lake data are on track to be completed this quarter.

The effort of sunsetting wikimetrics is scheduled to start by the beginning of March.

  Note: March 2019

Wikimetrics is now deprecated and offline. Users are using event-metrics now. The main work on quality issues in Data Lake data is set to be done this quarter, though some of it will spill over to next quarter.

Outcome / Output

Foundation staff and community have better visual tools to access data about content, contributors and readers.

Wikistats 2.0 - Users (and programmatic tools) have access to most reports that community consultation found of importance

Goals

  • Edit Data Lake Quality: Resolve known issues (ongoing goal) task T204953   In progress
  • Create report for "articles with most contributors" in Wikistats2 task T204965   Postponed
  • Create report for "active editor metric" per project family (like "editors for wikisource") task T188265   Postponed
  • Wikistats UI timeselector allows for selection of arbitrary time ranges task T219112   Done
  • Import Edit Data Lake dataset into turnilo (WMF data exploratory tool) task T211173   Done

Status

  To do May 2019

We will be focusing on data quality and postponing the metric computations to next quarter or possibly the quarter after next.

  To do June 2019

Discussed...

Outcome / Output

Users see improvements on data computing

Foundations for ML: Initial deployment Pipeline

Dependencies: SRE

Goals

  • STRETCH GOAL: Develop a workflow to move computed data from hadoop to production services T213976   In progress

Status

  To do May 2019

  In progress. We are now testing an oozie workflow that pushes to swift using the auth system that swift supports.

  To do June 2019

Oozie workflow finished; we will be testing it with some users.