Analytics/Roadmap/PlanningMeetings/2012 Sept 20
Notes for the Team Analytics roadmap planning meeting for 20 Sept 2012, taken by Dave Schoonover.
Need to create a set of Data Release Processes
- Search log data release contained unacceptable data. How can we prevent this in the future?
- Diederik on point
- Conversation with: Team Analytics, robla, Moeller, Legal, Dario, Chris MzMcBride?, Chris Steipp?
- Who else might have a valued/paranoid perspective? MZMcBride??
- Process for NEW datasets, as well as Smoke-check each data upload prior to public notice
- Concrete threads:
- For any new datastream: "What's the Attack Surface?"
- Need to spend more time thinking about what sort of privacy exploits are possible
- Strip all (no matter if we take other steps):
- URLs? Spam, SEO links, etc
- Email addresses
- IP addresses
- What criterion for k-anonymity are we going to use (if any)? --> Publish behavioral/request data only as aggregates
- Followup on this release:
- Disclosure requirements
- Legal is looking into the impact and our obligations
- Need to convey clearly to the community what happened and what we're doing
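As a rough illustration of the "strip all" and "publish only as aggregates" threads above, a smoke-check pass over search queries might look like the sketch below. Everything in it is an assumption made for illustration: the function names `scrub` and `aggregate`, the regular expressions (which are deliberately simple and ignore IPv6), and the threshold `k=5`. The notes do not settle on any criterion, and a count threshold alone is weaker than formal k-anonymity.

```python
import re
from collections import Counter

# Illustrative patterns only; a real release process would need a reviewed list.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def scrub(text):
    """Replace URLs, email addresses, and IPv4 addresses with placeholders."""
    text = URL_RE.sub("<url>", text)
    text = EMAIL_RE.sub("<email>", text)
    text = IPV4_RE.sub("<ip>", text)
    return text


def aggregate(queries, k=5):
    """Count scrubbed, normalized queries and keep only those seen at least k times.

    k is a hypothetical threshold, not a value chosen in the meeting.
    """
    counts = Counter(scrub(q).strip().lower() for q in queries)
    return {query: n for query, n in counts.items() if n >= k}
```

Scrubbing runs before counting so that rare identifiers collapse into shared placeholders, and the threshold then drops anything still too rare to publish.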
Milestone Planning
By Project
Kraken
- Set up Cassandra cluster, get it working with Hadoop. [otto + dsc] (Sept)
- Load in sample data sets. [otto] (Sept)
- Tee the udp2log stream into Kraken. [otto + dsc] (Sept)
- First-pass at Hive/Pig Jobs [dsc + otto] (Sept)
- Set up Maven / Sonatype Artifact Repository (continuous integration) [dsc] (Oct)
- WMF Maven parent pom [dsc] (Oct)
- Puppetize Kraken [otto] (Ongoing)
- Set up JMX monitoring -- needs to be on our LAN [otto + dsc] (Oct)
- Get Storm set up [dsc + otto] (Oct)
- Start work on ETL topology [dsc] (Oct)
- Hardware reinstallation -- Depends on Ops [otto] (Oct)
- Get to consensus with Ops regarding logging of the firehose [dsc + otto] (Oct)
- Research needed: test that running CLI JVM producers does not cause extra load [otto] (Oct)
Legacy Log Collection
- Add support for new domain names in webstatscollector (blog, etc) [diederik] (Sept)
- udp2log filters
- Update filters for Wikipedia Zero [otto] (Ongoing)
- Filter by X-Carrier headers. [otto + asher + diederik] (Oct)
- udp-filter to filter by http status. [otto] (Oct)
WikiStats
- Reduce backlog regarding Wikistats traffic (squid etc) scripts [stefan] (Oct)
- Repair data errors in wikistats, and add process for checking data integrity [ezachte] (Sept)
- Make wikistats more robust (MoM validations) [ezachte] (Oct)
- Add Blackbox testing to WikiStats [diederik + ezachte] (Oct)
Ops & Maintenance
- Access/support requests for stat1, stat1001 [otto] (Ongoing)
- Migrate Reportcard off Labs onto stat1001 -- reportcard.wikimedia.org [otto + dsc] (Oct)
- Maintenance of oxygen/emery/locke [otto] (Ongoing)
Data
- Publish Monthly Report Card -- deal with monthly data processing irregularities, perform correction/validation [ezachte + diederik + dsc] (Ongoing)
- Create Data Release Practices Task Force [diederik] (Sept)
- Start pushing datasets to AWS [diederik] (Oct)
- Finalize scripts to massively compact dammit.lt data [erik] (Oct)
- Blogpost about what awesome stuff you can do with this [diederik + ?] (Oct)
Limn
- Bootstrap Dan [dan + dsc] (Sept) [DONE]
- Refactor charting to use d3 [dan + dsc]
- Initial Prototype with Options UI (Sept)
- Feature Parity with Dygraphs (plus bugfixes, etc) (Oct)
- Bugfixes (like Save-As, UI Error Notifications, ...) [dan] (Oct)
- Mirror GitHub to Gerrit [dsc] (Sept)
- Improve Limn wiki, docs, & guides (esp those shameful screenshots) [dan] (Oct)
- Coke (make for Coco) task to create symlinks into dataDir from an existing data repository (such as, say, analytics/reportcard/data) [dsc] (Sept)
- Coke task to download and set up dummy testing data for ease of development [dsc] (Sept)
- UI support for remote datasets via proxy [dsc + dan] (Oct)
- Migrate Dario's dashboards to Limn [dsc] (Sept)
- Support the Global Dev dashboard [evan] (Ongoing)
- Support the Gerrit Stats dashboard [diederik] (Ongoing)
- Deploy reportcard / gerrit-stats on stat1001 (aka, "the Debian packaging discussion") [otto + dsc] (Oct)
By Month
September
- (Kraken) Set up Cassandra cluster, get it working with Hadoop. [otto + dsc] (Sept)
- Load in sample data sets. [otto] (Sept)
- Tee the udp2log stream into Kraken. [otto + dsc] (Sept)
- First-pass at Hive/Pig Jobs [dsc + otto] (Sept)
- (Kraken) Puppetize Kraken [otto] (Ongoing)
- (Legacy Log Collection) Add support for new domain names in webstatscollector (blog, etc) [diederik] (Sept)
- (Data) Create Data Release Practices Task Force [diederik] (Sept)
- (Limn) Bootstrap Dan [dan + dsc] (Sept) [DONE]
- (Limn) Refactor charting to use d3 [dan + dsc]
- Initial Prototype with Options UI (Sept)
- (Limn) Mirror GitHub to Gerrit [dsc] (Sept)
- (Limn) Coke (make for Coco) task to create symlinks into dataDir from an existing data repository (such as, say, analytics/reportcard/data) [dsc] (Sept)
- Coke task to download and set up dummy testing data for ease of development [dsc] (Sept)
- (Limn) Migrate Dario's dashboards to Limn [dsc] (Sept)
- (Limn) Support the Global Dev dashboard [evan] (Ongoing)
- (Limn) Support the Gerrit Stats dashboard [diederik] (Ongoing)
October
- (Kraken) Set up Maven / Sonatype Artifact Repository (continuous integration) [dsc] (Oct)
- WMF Maven parent pom [dsc] (Oct)
- (Kraken) Puppetize Kraken [otto] (Ongoing)
- (Kraken) Set up JMX monitoring -- needs to be on our LAN [otto + dsc] (Oct)
- (Kraken) Get Storm set up [dsc + otto] (Oct)
- Start work on ETL topology [dsc] (Oct)
- (Kraken) Hardware reinstallation -- Depends on Ops [otto] (Oct)
- (Kraken) Get to consensus with Ops regarding logging of the firehose [dsc + otto] (Oct)
- Research needed: test that running CLI JVM producers does not cause extra load [otto] (Oct)
- (Legacy Log Collection) udp2log filters
- Update filters for Wikipedia Zero [otto] (Ongoing)
- Filter by X-Carrier headers. [otto + asher + diederik] (Oct)
- udp-filter to filter by http status. [otto] (Oct)
- (WikiStats) Reduce backlog regarding Wikistats traffic (squid etc) scripts [stefan] (Oct)
- (WikiStats) Make wikistats more robust (MoM validations) [ezachte] (Oct)
- (WikiStats) Add Blackbox testing to WikiStats [diederik + ezachte] (Oct)
- (Ops & Maintenance) Access/support requests for stat1, stat1001 [otto] (Ongoing)
- (Ops & Maintenance) Migrate Reportcard off Labs onto stat1001 -- reportcard.wikimedia.org [otto + dsc] (Oct)
- (Ops & Maintenance) Maintenance of oxygen/emery/locke [otto] (Ongoing)
- (Data) Publish Monthly Report Card -- deal with monthly data processing irregularities, perform correction/validation [ezachte + diederik + dsc] (Ongoing)
- (Data) Start pushing datasets to AWS [diederik] (Oct)
- (Data) Finalize scripts to massively compact dammit.lt data [erik] (Oct)
- Blogpost about what awesome stuff you can do with this [diederik + ?] (Oct)
- (Limn) Refactor charting to use d3 [dan + dsc]
- Feature Parity with Dygraphs (plus bugfixes, etc) (Oct)
- (Limn) Bugfixes (like Save-As, UI Error Notifications, ...) [dan] (Oct)
- (Limn) Improve Limn wiki, docs, & guides (esp those shameful screenshots) [dan] (Oct)
- (Limn) UI support for remote datasets via proxy [dsc + dan] (Oct)
- (Limn) Support the Global Dev dashboard [evan] (Ongoing)
- (Limn) Support the Gerrit Stats dashboard [diederik] (Ongoing)
- (Limn) Deploy reportcard / gerrit-stats on stat1001 (aka, "the Debian packaging discussion") [otto + dsc] (Oct)
Followups
- [dsc] Update wiki with project pages for everything on the Roadmap page
- Each project owner will then update their Project Status for Sept
- [dsc] Update the Engineering Roadmap wiki page: https://www.mediawiki.org/wiki/Roadmap
- [dsc] Fill in week-by-week team roadmap without breakout by project