Analytics/Epics/Pageview API

For the documentation of the current pageview API, see: wikitech:Analytics/PageviewAPI.

Goals

Wikipedians need a reliable and accurate API for querying page views for articles. This epic describes the steps needed to build such an API. Initially, it focuses on the underlying infrastructure (e.g. Kafka/Hadoop) that must be built for this purpose. The epic is not yet finished and will be expanded with front-end requirements as the back-end work progresses.

Detailed Tracking Links

TBD

Users

User | Description
Product Managers | The people who research, design, and iterate on the page view metrics
Researchers/Analytics Developers | The people who define the various page view metrics
Analytics Developers | The people who write the software that produces the metrics
Analytics Operators | The people who ensure the software is running and the data is updated
Management | WMF staff who make decisions based on the results of the data
Community | The Wikipedians who look at the data to assess their success and the health of the community and their pages
Readers | The people who read Wikipedia

Prioritized Use Cases

High Priority

  1. As a Wikipedian, I need an API that allows me to query various page view stats
  2. As a Reader, I want any PII (IP address, UA, etc.) to be removed from my page view information
  3. As a Product Owner, I want page views to be geo-coded at the country level
  4. As a Product Owner (and a lot of other stakeholders), I want raw logs to be deleted within 90 days
  5. As a Product Owner, I want page views to conform to a community-reviewed definition
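
Use cases 2 to 4 above can be illustrated with a minimal sketch. It assumes a raw log record is a plain dictionary and that geo-coding has already filled in a country field upstream. The field names (`ip`, `user_agent`, `country`, `page`) and both helper functions are hypothetical; this epic does not define a log schema or an implementation.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical field names; the epic does not specify a log schema.
PII_FIELDS = {"ip", "user_agent"}

# Raw logs must be gone within 90 days (use case 4).
RAW_LOG_RETENTION = timedelta(days=90)


def sanitize(record: dict) -> dict:
    """Return a copy of a raw record with the PII fields dropped.

    Non-PII fields, such as an already geo-coded 'country', are kept.
    """
    return {key: value for key, value in record.items() if key not in PII_FIELDS}


def is_expired(logged_at: datetime, now: datetime) -> bool:
    """Return True once a raw log entry has reached the 90-day limit."""
    return now - logged_at >= RAW_LOG_RETENTION


raw = {"page": "Main_Page", "country": "NL", "ip": "192.0.2.1", "user_agent": "ExampleBrowser"}
print(sanitize(raw))
print(is_expired(datetime(2015, 1, 1, tzinfo=timezone.utc), datetime(2015, 6, 1, tzinfo=timezone.utc)))
```

The sanitized record keeps only the page and the country, so country-level reporting stays possible after the identifying fields are gone.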

Later

Non-functional requirements

  1. Data should be updated daily, with hourly granularity
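
As a rough illustration of hourly granularity, the sketch below buckets individual page views into per-page, per-hour counts. The input shape (pairs of page title and timestamp) and the function name `hourly_counts` are assumptions made for this example, not part of the planned pipeline.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(views):
    """Count views per (page, hour) from an iterable of (page, timestamp) pairs."""
    counts = Counter()
    for page, timestamp in views:
        # Truncate the timestamp to the start of its hour.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        counts[(page, hour)] += 1
    return counts


views = [
    ("Main_Page", datetime(2015, 3, 1, 10, 5)),
    ("Main_Page", datetime(2015, 3, 1, 10, 55)),
    ("Main_Page", datetime(2015, 3, 1, 11, 0)),
]
for (page, hour), count in sorted(hourly_counts(views).items()):
    print(page, hour.isoformat(), count)
```

A daily update would then publish the 24 hourly buckets of the previous day in one batch.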

Additional information

We've done some planning with tech-ops, documented here: List of tasks for backend work