Analytics/Epics/Pageview API

For the documentation of the current pageview API, see: wikitech:Analytics/PageviewAPI.

Goals

Wikipedians need a reliable and accurate API for querying article page views. This epic describes the steps needed to build such an API. Initially, it focuses on the underlying infrastructure (e.g. Kafka/Hadoop) that must be built for this purpose. The epic is not yet complete and will be expanded with front-end requirements as the back-end work progresses.

Detailed Tracking Links

TBD

Users

User | Description
Product Managers | The people who are researching, designing and iterating on the page view metrics
Researchers/Analytics Developers | The people who define the various page view metrics
Analytics Developers | The people who write the software that produces the metrics
Analytics Operators | The people who ensure the software is running and the data is updated
Management | WMF who make decisions based on the results of the data
Community | The Wikipedians who look at the data to assess their success and the health of the community and their pages
Readers | The people who read Wikipedia

Prioritized Use Cases

High Priority

  1. As a Wikipedian, I need an API that allows me to query various page view stats
  2. As a Reader, I want any PII (IP address, user agent, etc.) to be removed from my page view information
  3. As a Product Owner, I want page views to be geo-coded at a country level
  4. As a Product Owner (and on behalf of many other stakeholders), I want raw logs to be deleted within 90 days
  5. As a Product Owner, I want page views to conform to a community-reviewed definition
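
Use cases 2 through 4 can be pictured as a sanitization step applied to each raw log record before it is aggregated. The following is a minimal sketch under stated assumptions, not the pipeline's actual implementation: the field names (`ip`, `user_agent`, `country`, `page`, `timestamp`), the helper names, and the idea that a country code has already been resolved upstream are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical field names; the real log schema is not specified in this epic.
PII_FIELDS = {"ip", "user_agent"}
RAW_LOG_RETENTION = timedelta(days=90)


def sanitize(record):
    """Return a copy of the record with the PII fields removed."""
    return {k: v for k, v in record.items() if k not in PII_FIELDS}


def is_expired(record, now):
    """True once a raw record has reached the 90-day retention limit."""
    return now - record["timestamp"] >= RAW_LOG_RETENTION


raw = {
    "ip": "192.0.2.1",                    # documentation-only address
    "user_agent": "ExampleBrowser/1.0",   # made-up user agent
    "country": "NL",                      # assumed to be geo-coded upstream
    "page": "Main_Page",
    "timestamp": datetime(2015, 1, 1, 12, 30, tzinfo=timezone.utc),
}
clean = sanitize(raw)
```

In this sketch the sanitized record keeps only the country-level location, the page, and the timestamp, while the raw record stays eligible for deletion based on its age alone.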

Later

Non-functional requirements

  1. Data should be updated daily, with hourly granularity
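
As an illustration of hourly granularity, page views can be bucketed by truncating each timestamp to the start of its hour. This is only a sketch: the `(page, timestamp)` input shape and the `hourly_counts` name are assumptions made for the example, not part of the planned system.

```python
from collections import Counter
from datetime import datetime, timezone


def hourly_counts(views):
    """Count views per (page, hour); each view is a (page, timestamp) pair."""
    counts = Counter()
    for page, ts in views:
        # Truncate the timestamp to the start of its hour.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        counts[(page, hour)] += 1
    return counts


views = [
    ("Main_Page", datetime(2015, 1, 1, 12, 5, tzinfo=timezone.utc)),
    ("Main_Page", datetime(2015, 1, 1, 12, 55, tzinfo=timezone.utc)),
    ("Main_Page", datetime(2015, 1, 1, 13, 1, tzinfo=timezone.utc)),
]
counts = hourly_counts(views)
```

A daily update job could run such an aggregation once per day over the previous day's records and publish the 24 hourly buckets together.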

Additional information

We've done some planning with tech-ops, documented here: List of tasks for backend work