Wikimedia Technology/Goals/2019-20 Q2
Technology Department Team Goals and Status for Q2 FY19/20 in support of the Medium Term Plan (MTP) Priorities and Annual Plan for FY19/20
Team Manager: Nuria Ruiz
- Reduce platform Complexity. Modern Event Platform
- Build a reliable, scalable, and comprehensive platform for creating services, tools and user facing features that produce and consume event data
- Resolve Kafka Connect HDFS Licensing issue and decide if we will use Kafka Connect task T223626 Postponed
- Initial (Stream) Config Service implementation in vagrant task T233634 Done
- Smart Tools for Better Data. Make it easier to understand the history of all Wikimedia projects
- Release Mediawiki History in JSON/CSV or mysql dump format (the best dataset to date to measure content and contributors) Blocked
- Deploy hadoop client to dump hosts so the mediawiki history public dataset can get to dumps in a reasonable timeframe task T234229 In progress
- Smart Tools for Better Data. Make it easier to understand how Commons media is used across our projects.
- Announce the deployment of the mediarequests API: task T231589 Done
- Add mediarequests metrics to Wikistats UI task T234589 Done
- Smart Tools for Better Data. Increase Data Quality, Privacy and Security
- Deploy Entropy-based alarms for data issues that could indicate bugs, traffic drops due to censorship, or inconsistencies task T215863, this work continues from Q1 In progress
- Productionize Kerberos Service Done
- Create test Kerberos identities/accounts for some selected users from Analytics Team in test cluster task T212258 Done
- Core. Operational Excellence. Increase Resilience of Systems
- New zookeeper cluster for tier-2 task T217057 Done
- Core. Operational Excellence. Reduce Operational Load by Phasing Out Legacy Systems/Technologies
- Sunset MySQL data store for eventlogging. task T159170, this work continues from Q1 Done
- Migrate eventlogging to python3 task T234593 Done
Dependencies on:
Status
- October 28, 2019 status:
- Finalize productionizing kerberos service, and then possibly enabling it Done
- Set up a generic workflow to create Kerberos accounts In progress
- Create test Kerberos identities/accounts for some selected users from Analytics Team in test cluster Done
- Deprecate eventlogging-service-eventbus Done
- Bot Detection “Remove automated traffic not identified as such from readers data” In progress
- December 12, 2019 status:
- Done
- Make kerberos infra prod ready
- New zookeeper cluster for tier-2
- Sunset MySQL data store for eventlogging
- Allow all Analytics tools to work with Kerberos auth
- Superset upgrade
- Release of editors per country dataset
- Finish Swift workflow to transfer binaries from Hadoop to production
- Enable GPU infrastructure on stats machines with purely OS components
- Schema Repository CI for convention and backwards compatibility enforcement
- Continue moving events from eventbus to eventgate-main
- Start planning work on Stream Configuration Service and Product use of Event Platform
- Set up of Mediarequests API public endpoint. Phase 1. Infra.
- Bot Detection Code Prototype: “Remove automated traffic not identified as such from readers data”.
- Develop mediarequests API to get view statistics for individual Wikimedia images
- In progress
- Enqueue eventlogging requests for better performance
- Release Mediawiki History in JSON/CSV or mysql dump format (the best dataset to measure content and contributors)
- Entropy-based alarms for data issues
- Presto experiments, interaction with HDFS/superset.
- Blocked
- Describe statement of work (and task) for upcoming designer for wikistats
Team Manager: Corey Floyd
- Reduce platform Complexity
- Migrate Service – changeprop
- Modernizing front end project planning (from Front End Working Group)
- Add API Integration tests and decouple components
- Initial librarization of MediaWiki
- Frontend Architecture Group Planning for Desktop Improvements
- Tech and Product Partnerships
- Implement MediaWiki REST APIs for MVP
- Integrate OAuth 2.0 into API
- Prototype Documentation Portal
Dependencies on:
Status
- October 28, 2019 status:
- Modernizing front end project planning (from Front End Working Group) In progress
- Implement MediaWiki REST APIs for MVP In progress
- Prototype Documentation Portal In progress
- December 12, 2019 status:
- Done
- Integrate Session Service
- Migrate Mainstash
- API Integration testing infrastructure
- Kick off Front End Working Group to explore recommendations from the Q4 research and identify a project to begin working on in Q2
- Schema Registry CI
- Stream Config Planning and Design
- REST API for Parsoid
- In progress
- OAuth 2.0 Initial implementation
Team Manager: Erika Bjune
- Core Work
- Support high revenue/high risk campaigns
- Extra attention paid to security and privacy during highest revenue campaigns
Dependencies on:
Status
- October 28, 2019 status:
- all goals In progress
- December 12, 2019 status:
- Done
- Support Advancement in testing and planned Q1 campaigns
- Get India form to first 1 hour test and continue further development
- Get recurring up-sell to first 1 hour test and continue further development
- Proactively update current systems with latest security patches and respond to compliance or regulation changes.
- Complete required security and compliance scans
Team Manager: Gilles Dubuc
- Core Work
- In progress - Provide performance expertise to FAWG outcome
- To do - Hold 3 or more workshops and training sessions with 1 engineering team
- In progress - Hire and onboard Systems Performance Engineer
- To do - Publish 2 blog posts about performance
- In progress - Organise and run the Web Performance devroom at FOSDEM 2020
- Reduce Complexity of the Platform
- To do - Create performance alerts for 12 different wikis
- To do - Create synthetic tests for backend editing with XHGui profile comparison
- To do - Expand coverage of metrics from synthetic testing (introducing user journeys). Add 5 new user journeys and a minimum of 7 new metrics
- To do - Add a new Graphite instance for synthetic metrics. It needs to be connected with our current Grafana instance and documented.
- To do - Migrate ResourceLoader dependency tracking off the RDBMS
Dependencies on:
Status
- October 28, 2019 status:
- Hire Systems Performance Engineer and create onboarding material, ensuring that this new hire has a shared understanding of the team’s performance culture. In progress
- Organise and run the Web Performance devroom at FOSDEM 2020 In progress
- Add a new Graphite instance for synthetic metrics. It needs to be connected with our current Grafana instance and documented. In progress
- MachineVision extension performance review In progress
- December 12, 2019 status:
- Done
- Support AbuseFilterCachingParser deployment
- Create Grafana dashboard for WANObjectCache statistics
- Support Parsing Team with performance insights on Parsoid-php roll out
- Line up interested speakers for a FOSDEM Web Performance devroom proposal
- Audit use of CSS image-embedding (improve page-load time by reducing the size of stylesheets)
- In progress
- Improve the filtering of obsolete domains in GTIDs to avoid timeouts on GTID_WAIT. (get reviewed and merged)
- Reduce reliance on master-DB writes for RL file-dependency tracking (Multi-DC prep)
- Figure out the right store to use for the main stash
- Publish 8 blog posts about performance
- Support and maintenance of MediaWiki's object caching and data access components.
- Support and maintenance of WebPageTest and synthetic testing infrastructure
- Support and maintenance of MediaWiki's ResourceLoader
- Support and maintenance of Fresnel
- Provide performance expertise to FAWG outcome
- Blocked
- Swift cleanup + WebP ramp up
Team Manager: JR Branaa
- Core Work
- A clear set of unit, integration, and system testing tools is available for all supported engineering languages.
- Update WebdriverIO from version 4 to 5 for Core.
- Core Work
- Actionable code health metrics are provided for code stewards
- Add all applicable repos to the Code Health pipeline (Code Health Metrics).
- Solicit feedback from current users of CHM POC and define phase 2 enhancements.
- Improve Code Review experience
- Interview engineering teams to understand their current code review practices - To do
- Relaunch the Code Review Office Hours - In progress
- Put in place Code Review performance metrics - In progress
- Reduce complexity of the platform to make it easier for new developers to contribute
- Make CI warn about slow tests, and publish a collated list of slow tests
Dependencies on:
Status
- October 28, 2019 status:
- Solicit feedback from current users of CHM POC and define phase 2 enhancements In progress
- Relaunch the Code Review Office Hours In progress
- Put in place Code Review performance metrics In progress (We've defined things, but need to implement)
- December 12, 2019 status:
- Done
- Update existing Selenium documentation (https://www.mediawiki.org/wiki/Selenium/Node.js)
- Expand set of repositories covered by code health metrics (via sonarqube)
- In progress
- Team inception, formalization, and assessment of current organizational practices
- Blocked
- Scope out requirements for a self-hosted version of SonarQube for our use.
Team Manager: Tyler Cipriani
- Reduce Complexity of Platform
- Build and support a fully automated and continuous Code Health and Deployment Infrastructure
- Update weekly branchcut script for MediaWiki to allow for automation
- Production configuration is compiled into static files on deployment servers
- Seakeeper (New CI) proposal for a dedicated CI cluster submitted for feedback
- A demonstration MediaWiki development environment hosts the full TimedMediaHandler front-end and back-end workflow
- Other service deployment pipeline migrations as prioritized between SRE/RelEng and relevant teams.
- Core Work
- Improve and maintain the Wikimedia code review system
- Migrate Gerrit master from Cobalt to Gerrit1001
- Migrate from Gerrit version 2.15 to 2.16
- Continuation of Phabricator and Gerrit improvement (in conjunction with SRE)
Dependencies on:
Status
- October 28, 2019 status:
- Migrate Gerrit master from Cobalt to Gerrit1001 Done (Completed on 2019-10-22; needed to be done early in the quarter to ensure we could also jump Gerrit versions this quarter)
- Update weekly branchcut script for MediaWiki to allow for automation In progress
- Production configuration is compiled into static files on deployment servers In progress
- Seakeeper (New CI) proposal for a dedicated CI cluster submitted for feedback In progress
- December 12, 2019 status:
- Done
- Streamline the Kibana -> Phab error reporting workflow (using client-side code, at first)
- Work with SRE to identify and implement needs of Phabricator and Gerrit
- Determine path forward with current CI infrastructure given Jan 1, 2020 python2 EOL
- Document an implementable architecture for what we want in new CI
- POCs of GitLab, Argo, and Zuul3 systems (as possible); evaluate options
- Migrate restrouter
- (Stretch): Preparatory MediaWiki config clean-up & static loading work
- Preliminary work on a CLI for setup/management (local charts)
- Instantiate testing and linting of helm charts
- In progress
- (Stretch): MobileContentService
- Blocked
- Migrate local-charts to deployment-charts
- Postponed
- Scope updated CI/testing KPIs
Team Manager: Leila Zia
- Content Integrity
- In progress - A comprehensive literature review of disinformation published in arxiv and meta (completing the work started in Q1).
- To do - Build a prioritized list of actions to take (tools to build, datasets to release, etc.) for combating disinformation (through discussions with the community of editors and developers, internal consultation, and maybe with external researchers)
- To do - Build one formal collaboration in the disinformation space to start the research for building solutions starting Q3.
- Foundational
- To do - Prepare the Research Internship proposal.
- In progress - Finalize the research brief for crosslingual topical model laying out the work that will be done in this space starting Q3.
- To do - Literature review of reuse. task T235780
- To do - Review of the different types of re-use and what we know about their effect on traffic to Wikimedia. task T235781
- To do - Review of what data is available to us and what data is not. What questions we can currently answer. What questions we can't. task T235784
- To do - Initiate monthly or quarterly office hours for the community. (trial for 6 months if monthly and 12 months if quarterly)
- Done - Wiki Workshop 2020 proposal submission. task T236066
- To do - Plan for a challenge: come up with an initial format, put a committee together, choose a venue for presentations.
- Address Knowledge Gaps
- To do - Finalize the taxonomy of readership gaps
- In progress - Make significant progress towards building the taxonomy of search (usage gaps). (We expect the research part of this work to conclude in Q3, as a stretch in Q2).
- To do - Literature review of identified content gaps in Wikipedia
- To do - Taxonomy of the causes of content gaps in Wikipedia
- To do - Build a series of hypotheses for the possible causes of skewed demographic representation of Wikipedia readers (specific to gender). Identify possible formal collaborations for research and testing starting Q3 if relevant based on the learnings from the list of hypotheses.
- Done - Submit the citation usage paper to TheWebConf 2020. task T236067
- In progress - (via mentoring an Outreachy intern) start work on the development of the data-set for statements in need of citation. task T233707
- In progress - Supervise a student evaluating methods to recommend images to Wikipedia pages. task T236142
- To do - Train from scratch and evaluate an end-to-end (simple) classification model using Wikimedia Commons categories, optimized for GPU usage. task T221761
- To do - Conduct a literature review, plan and set up collaborations for projects about understanding engagement with Wikimedia images around the world.
- Core Work
- In progress - Complete two 30-60-90 day plans.
- To do - Finalize a proposal for changes in Research based on learnings about Research's audience, what they expect from the team, our positioning within WMF, Movement, and the Research community, and the opportunities for impact.
- To do - Document and communicate with the team: expectations of the Research Scientist role and trajectory in the IC track.
- To do - Research Showcase feedback collection, assessment, and proposal for changes if relevant.
- To do - A half-yearly newsletter for Research with the goal of making it quarterly if bandwidth allows and/or project is successful.
Dependencies on:
Status
- October 28, 2019 status:
- Finalize the research brief for crosslingual topical model laying out the work that will be done in this space starting Q3. In progress
- Finalize the taxonomy of readership gaps In progress
- Make significant progress towards building the taxonomy of search (usage gaps). (We expect the research part of this work to conclude in Q3, as a stretch in Q2). In progress
- Literature review of identified content gaps in Wikipedia In progress
- Build a series of hypotheses for the possible causes of skewed demographic representation of Wikipedia readers (specific to gender). Identify possible formal collaborations for research and testing starting Q3 if relevant based on the learnings from the list of hypotheses. In progress
- Submit the citation usage paper to TheWebConf 2020. Done
- (via mentoring an Outreachy intern) start work on the development of the data-set for statements in need of citation. In progress
- Supervise a student evaluating methods to recommend images to Wikipedia pages. In progress
- Build a prioritized list of actions to take (tools to build, datasets to release, etc.) for combating disinformation (through discussions with the community of editors and developers, internal consultation, and maybe with external researchers) In progress
- A comprehensive literature review of disinformation published in arxiv and meta (completing the work started in Q1) Partially done
- Prepare the Research Internship proposal. In progress
- Literature review of reuse. In progress
- Initiate monthly or quarterly office hours for the community. (trial for 6 months if monthly and 12 months if quarterly) In progress
- Wiki Workshop 2020 proposal submission. Done
- Complete two 30-60-90 day plans. In progress
- December 12, 2019 status:
- Done
- Determine important features of articles w/r/t level of reader interest across different demographic groups (as motivation for what aspects a general article category model should capture)
- Conduct the analysis on reader surveys to understand the relation between demographics and the consumption of content on Wikipedia across languages. (Why We Read Wikipedia + Demographics). This research will be concluded in Q2 and we expect substantial progress in Q1
- Wrap up editor gender work
- Complete the research on characterizing Wikipedia citation usage. (Why We Leave Wikipedia). This goal will continue in Q2 and depending on the submission results potentially in Q3.
- Make substantial progress towards a comprehensive literature review about automatic detection of misinformation and disinformation on the Web. We expect this work to be completed in Q2 and inform the work in this direction in Q3+
- Understand patrolling on Wikipedia. A write-up describing how patrolling is being done on Wikipedia across the languages. This work may be extended further by understanding the patrolling on Wikipedia in the context of Wikipedia's interaction with other projects such as Wikidata, Wikimedia Commons, ...
- Run a series of interviews, office hours, or surveys to gather volunteer editor community's input on citation needed template recommendations. The result of this work will inform the specifications of an API (to be developed) to surface citation needed recommendations as well as future directions for this research.
- A comprehensive literature review of disinformation published in arxiv and meta (completing the work started in Q1)
- Hiring and onboarding. We expect 1-2 scientists to join the team in Q1 and the onboarding work will need to happen. We also expect to open an engineering position in the team.
- Computer vision consultation as part of Structured Data on Commons
- Postponed
- Building a pipeline for image classification based on Commons categories.
Team Manager: Aaron Halfaker
- Core Work
- Hire ML Engineer
- Machine Learning Infrastructure
- Jade use, maintenance, and user-research
- Deployment of session-based models
- Jade Entity Page UI
- Newcomer quality session models
- Expansion of Topic Model to ar, ko, and cswiki
Dependencies on:
Status
- October 28, 2019 status:
- (no updates available)
- December 12, 2019 status:
- Done
- Build out the Jade API to support user-actions
- Build/improve models in response to community demand (ongoing every quarter)
- In progress
- Hire ML Engineering Manager
Team Manager: Guillaume Lederrey
- Address Knowledge Gaps
- Any new data retention requirements are implemented
- Core Work
- New query parser is used in production by the end of Q2
- WDQS storage expansion
- CirrusSearch writes are split into per cluster kafka partitions to isolate clusters from each other by end of Q2
- Get "explore similar" running again, with whatever has changed since we last looked at it
- Increase understanding of our work outside our team, and outside the Foundation
- Improve search quality, especially for non-English wikis by prioritizing community requests – Positive feedback from speakers/community on changes made
- CirrusSearch writes can be paused during cluster operations without causing excessive stress on change propagation infrastructure by end of Q2
- Rerun "explore similar" A/B test with rigorous analysis of results
- Enable cross-wiki searching for 3+ new languages/projects (stretch)
- Machine Learning Infrastructure
- Glent method 0 (session reformulation) A/B tested and deployed by end of Q2
- Learning to Rank (LTR) applied to additional languages and projects to improve ranking (needs experimentation, might not work at all)
- Glent method 1 (comparison to other users' queries) offline tested, tuned, A/B tested and possibly deployed end of Q2
- Structured Data
- Proof of Concept SPARQL endpoint for SDoC is available on WMCS and updated weekly. (stretch)
Dependencies on:
Status
- October 28, 2019 status:
- WDQS storage expansion In progress (Quote requested, waiting for feedback from vendor)
- Glent method 0 (session reformulation) A/B tested and deployed by end of Q2 In progress (A/B test running, still need to evaluate results and activate in production (provided the results are positive))
- Glent method 1 (comparison to other users' queries) offline tested, tuned, A/B tested and possibly deployed end of Q2 In progress (Some quality issues are identified in offline tests and need to be addressed before we can move forward. The biggest problems are that we are looking at edit distance per-string rather than per-token (probably because we thought too much about single word queries, where per-string and per-token are the same thing), and that Method 1 is too ready to add spaces or change the first letter of a word, all of which can make the "semantic distance" between a query and a suggestion much bigger.)
- Proof of Concept SPARQL endpoint for SDoC is available on WMCS and updated weekly. In progress (SPARQL endpoint for SDC (Commons Query Service – CQS) is blocked on having dumps from SDC that we can load on the endpoint.)
- December 12, 2019 status:
- Done
- Refactor query highlighting
- RDF export
- Address the indexing issues of MediaInfo (labels vs descriptions)
- Full data reimport for WDQS to enable optimizations
- Start the hiring process for a new WDQS Engineer
- In progress
- Refactor Mjolnir jobs into separate smaller jobs
- 2.1. Hardware renewal: replace elastic1017-1031
- 3.1. "Did you mean" suggestions: deploy method0 to production by end of Q2
- Improve WDQS updater performance
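The October 28 note on Glent method 1 contrasts per-string and per-token edit distance. The sketch below illustrates that difference under stated assumptions: it is not the Glent implementation, the helper names (`levenshtein`, `per_token_distances`, `worst_token_change`) are hypothetical, tokens are assumed to be whitespace-separated, and "per-token" is read here as "character edit distance computed separately for each aligned token".

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (0 if equal)
            ))
        prev = curr
    return prev[-1]


def per_token_distances(query: str, suggestion: str):
    """Edit distance for each aligned token pair.

    Returns None when the token counts differ (for example when a
    space was added), since the tokens can no longer be aligned 1:1.
    """
    q_tokens, s_tokens = query.split(), suggestion.split()
    if len(q_tokens) != len(s_tokens):
        return None
    return [levenshtein(q, s) for q, s in zip(q_tokens, s_tokens)]


def worst_token_change(query: str, suggestion: str):
    """Largest fraction of any single query token that was edited."""
    dists = per_token_distances(query, suggestion)
    if dists is None:
        return None
    ratios = [d / len(q) for d, q in zip(dists, query.split())]
    return max(ratios, default=0.0)


# Per-string, only 2 of 12 characters change ...
print(levenshtein("new york cat", "new york bar"))
# ... but per-token, two thirds of the last word was rewritten.
print(per_token_distances("new york cat", "new york bar"))
print(worst_token_change("new york cat", "new york bar"))
# An added space is a single per-string edit but changes the token count.
print(levenshtein("notebook", "note book"))
print(per_token_distances("notebook", "note book"))
```

For a single-word query the two measures coincide, which matches the note's remark about single word queries; they diverge only once a query has several tokens.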
Team Manager: John Bennett
- Core Work
- Security Engineering and Governance
- Create initial version of PHP security toolkit
- Deploy StopForumSpam
- Create privacy engineering charter
- Incident response Table Top and updates to security after action reports and improvement plans
- Release of Phan 2.x
- Update and publish data classification policy
- Create initial set of security measurements and metrics
- Publish data protection and retention guidelines (goal is being refined)
- Bug Bounty SOP
- Draft new employee security awareness content
- Publication of privacy review template
- Finalize and publish Security services catalog
- Vulnerability Management
- ERM implementation
- Supplier assessments
- Draft 3 new Security Incident Response playbooks Q2
- Draft 3 new security policies Q2
- Security release Q2
- Assess, produce, and socialize Security documentation
- Create or improve language-based best security practices documentation
- Perform 2 phishing campaigns and provide awareness content
- Assess / Refine Phab Usage and Workflows
- Facilitate Agile / Scrum adoption
- Develop Security PM Best Practices
Dependencies on:
Status
- October 28, 2019 status (all are In progress)
- Create initial version of PHP security toolkit
- Deploy StopForumSpam
- Create privacy engineering charter
- Incident response Table Top and updates to security after action reports and improvement plans
- Release of Phan 2.x
- Update and publish data classification policy
- Create initial set of security measurements and metrics
- Publish data protection and retention guidelines (goal is being refined)
- Draft new employee security awareness content
- Publication of privacy review template
- Finalize and publish Security services catalog
- ERM implementation
- Draft 3 new Security Incident Response playbooks Q2
- Draft 3 new security policies Q2
- Security release Q2
- Assess, produce, and socialize Security documentation
- Create or improve language-based best security practices documentation
- Perform 2 phishing campaigns and provide awareness content
- Assess / Refine Phab Usage and Workflows
- Facilitate Agile / Scrum adoption
- Develop Security PM Best Practices
- December 12, 2019 status:
- Done
- Team retro, implement agile ceremonies for appsec related projects
- Draft 3 new security policies
- Create team learning circles
- Socialize and Formalize Corrective Action plan for Security Incidents
- Publication of security team roadmap
- Phishing Security Awareness, at least 2 completed Phishing campaigns
- Security release Q1
- Discovery ticket for ElastAlert detection and alerting
Directors: Mark Bergsma and Faidon Liambotis
- Cross-cutting
- Begin hiring for the SRE Engineering Manager positions and ensure at least 4 candidates are interviewed by the end of Q2, to position ourselves to fill our remaining IC positions
- Deliver 80% of the asks set by the System of Performance project by EOQ
Service Operations
Team Manager: Mark Bergsma
- Core Work
- Finish what we started: Cleanup remnants of HHVM from our infrastructure by end of Q2
- Migrate core software components of the Deployment Pipeline to current major releases
Data Persistence
Team Manager: Mark Bergsma
- Core Work
- Ensure general backup service is migrated to new hardware infrastructure by end of Q2 and general backup runs are monitored for basic success/failure criteria
Traffic
Team Manager: Brandon Black
- Core Work
Infrastructure Foundations
Team Manager: Faidon Liambotis
- Core Work
- Integrate with Netbox for device selection and topology data gathering
- Assist with adoption of at least 2 additional services into the Deployment Pipeline by service owners by end of Q2
- Develop a new alert notification, escalation and paging capability to accommodate the increased needs of the team and department.
- Enable opt-in 2FA for web services SSO
- Extend security vulnerability tracking for container images
- Upgrade the Elastic/Logstash version to >= 7.2
- Replace/renew the internal Certificate Authority (expires Jun 2020)
- Reduce the number of service clusters running a soon-to-be unsupported Debian release by 8
- Reduce the number of manual steps involved in the provisioning and decommissioning of new services by 1
- Drive the configuration of the networking infrastructure via automated means & ensure multiple team members are able to deploy new configuration
Observability
Team Manager: Faidon Liambotis
- Core Work
Data Center Operations
Team Manager: Willy Pao
- Core Work
- Deliver 80% of new installs by their requested need-by date.
- Complete decommission of at least 50% (currently 48 tasks) of existing decommission tasks in eqiad, with servers completely unracked, to make room for new installs.
- Grant root access for Papaul, to take over remote portion of decommissioning servers in eqiad.
- Complete the rebuild/refresh of the esams caching facility in/near Amsterdam by end of October.
- Upgrade all PDUs in eqiad to new Servertech models (15 racks total) by end of November.
- Return all servers back to Cisco from previous server donations by end of Q2.
- Identify at least 3 new vendors as potential options for future disposition and sale of goods/services.
- Order and upgrade all PDUs in eqsin by end of quarter.
- Provide proper training for dc-ops team for receiving equipment in Coupa.
- Partner with Finance and determine point person for submitting orders in Coupa.
- Utilize bi-weekly meetings with Finance to target and resolve all issues within Coupa that may impede our current hardware procurement process.
Dependencies on:
Status
- October 28, 2019 status: all goals below are In progress
- Deliver 80% of new installs by their requested need-by date.
- Complete decommission of at least 50% (currently 48 tasks) of existing decommission tasks in eqiad, with servers completely unracked, to make room for new installs.
- Grant root access for Papaul, to take over remote portion of decommissioning servers in eqiad.
- Complete the rebuild/refresh of the esams caching facility in/near Amsterdam by end of October.
- Upgrade all PDUs in eqiad to new Servertech models (15 racks total) by end of November.
- Return all servers back to Cisco from previous server donations by end of Q2.
- Identify at least 3 new vendors as potential options for future disposition and sale of goods/services.
- Order and upgrade all PDUs in eqsin by end of quarter.
- Utilize bi-weekly meetings with Finance to target and resolve all issues within Coupa that may impede our current hardware procurement process.
- Ensure general backup service is migrated to new hardware infrastructure by end of Q2 and general backup runs are monitored for basic success/failure criteria
- Finish what we started: Cleanup remnants of HHVM from our infrastructure by end of Q2
- December 12, 2019 status:
- Done
- Clear out existing decommissioned hardware in ulsfo and codfw
- Implement a new hardware repair template & refine existing triaging processes
- Implement general template form for service owners to fill in
- Improve average end-to-end turnaround time from hardware request to hardware delivery
- Clean up existing backlog of Netbox inconsistencies and data errors
- Maintain zero error reports going forward (catch up and get to close to zero)
- Determine alternative disposition company for Juniper equipment
- Hire and on-board a contractor for additional support in eqiad
- Keep all Netbox reports in a "passed" state
- Identify 3rd party contractor to take care of straightforward tasks at remote caching sites
- Tighten up procurement cycle by implementing regularly scheduled deadlines for quotes, approvals, and purchase orders
- [stretch] Deploy codfw non-Mediawiki database proxies
- Failover all codfw masters
- Transfer ownership and knowledge of Bacula backup infrastructure
- Deploy new Bacula hardware
- Failover eqiad masters to new hosts and decommission old masters
- Order, rack and set up 10 new hosts in codfw
- [stretch] Migrate general backup service from old to new host(s)
- Ensure general backup service is migrated to new hardware infrastructure by end of Q2 and general backup runs are monitored for basic success/failure criteria
- Build a production prototype of an Apereo CAS identity provider
- [stretch] Evaluate Netbox to store network secrets
- Switch (at least) one service to authenticate against the identity provider
- Import existing management interfaces IPs into Netbox
- Move all application server & API traffic to PHP 7
- Support migration of services RESTrouter, wikifeeds by service owners
- Move jobrunners to PHP 7
- Move maintenance scripts to PHP 7
- Begin testing a small fraction of live cache_text traffic through ATS backends
- Finish evaluating current running implementation under live test
- Switch most production hosts to using anycast recdns @ 10.3.0.1
- Implement any minor improvements we need (anycast, etc)
- Decide on Prometheus vs Webrequest
- In progress
- Gradually migrate all MediaWiki instances to read the database configuration from etcd
- Set up MediaWiki to optionally read the database configuration from etcd
- Productionize dbctl (deploy, import data, set up alerts)
- Automate the assignment of new hosts' management interface IPs
- Establish periodic alerts reviews, complete one by EOQ
- Produce and circulate an alerting infrastructure roadmap
- Reduce Icinga alert noise
- [stretch] Remove HHVM from production
- Switch production edge TLS termination to ATS
- Implement basic TLS termination for cache_text services (may not be the final solution with real PKI)
- Design new dynamic response architecture for future needs
- Continuation of the previous quarter's goal – Finish TLS deployment via ATS
- Postponed or Blocked
- Iterate on a process for running the incident documentation review board; review 90% of incident documents written this quarter
- [stretch] Research possible implementations for synchronizing team contact information to everyone's phone
- Produce a standardized template for a status document for ongoing major incidents
- Automate the generation of management interface DNS records
- Add safe push method for the configuration: interactive and sequential
- Upgrade production PuppetDB to 6.2 in both data centers
- Productionize existing configuration management software (jnt)
- Upgrade all production Puppetmasters to Puppet 5.5
- Define and document the process for service owners to deploy a new service onto the pipeline
- Done
Team Manager: Birgit Müller
- Core Work
- Done - [IaaS] All out of warranty hardware used for offsite backups of Cloud Services data in the codfw datacenter is replaced
- In progress - [IaaS] 60% of the remaining Debian Jessie systems in the hardware layer underlying Cloud VPS are upgraded to Debian Buster or Stretch
- In progress - [IaaS] All Debian Jessie instances are removed/replaced in 95% of Cloud VPS hosted projects
- Done - [IaaS] Deploy a minimum viable Ceph cluster in eqiad and convert 1+ cloudvirt servers to use it for instance storage
- To do - [IaaS] Measure IOPS as seen at the instance level, IOPS as seen at the Ceph cluster level, and network activity generated in delivering IOPS at the backbone network level to produce a forecast for impact of full conversion of cloudvirt servers to Ceph instance storage.
- In progress - [IaaS] Create a shared understanding of systems and service continuity and availability constraints in the current Cloud VPS product which can be used to design follow-on projects to reduce single points of failure and establish practices for testing and maintaining continuity and availability of Cloud VPS core services.
- Done - [IaaS] OpenStack APIs and services are upgraded to the "Ocata" release
- Done - [PaaS] Deploy a Kubernetes 1.15.2+ cluster in Toolforge which will be used to provide a more modern, secure, and performant PaaS baseline to Tool maintainers.
- In progress - [PaaS] Migrate 5+ early adopter/beta tester tools from legacy Kubernetes cluster to new Kubernetes cluster to validate integration with ingress proxy layer and sandboxing/isolation of new Kubernetes cluster deployment.
- In progress - [PaaS] Create timeline and operational plan for migrating all Kubernetes workloads in Toolforge to the new Kubernetes cluster and decommissioning the legacy cluster by the end of FY19/20.
- Done - [Docs] Create a functional template and content checklist for Help pages in the Toolforge and Cloud VPS technical content collections.
- Done - [Docs] Establish a technical content review process with developers on WMCS team.
- Done - [Docs] Noticeably improve readability for 5 instances of Toolforge and Cloud VPS "Help" documentation on Wikitech.
- Reduce Complexity of the Platform, Movement Diversity
- Increased visibility & knowledge of technical contributions, services and consumers across the Wikimedia ecosystem
- In progress - Create a blog by and for technical audiences where members of the technical community can post about their technical work
- Postponed - Publish 6 (min) technical blog posts
- In progress - Coordinate Tech Talks and increase views on tech talks by 10%/quarter
- In progress - Prepare release of the 2nd edition of the Tech Community Newsletter (publishing date: Jan 2020)
- Done - A dashboard for Wikimedia Cloud Services edit data is available to the Wikimedia movement
- To do - Provide a “showroom” by Q3 that introduces newcomers to a variety of tools and shows what developers can do in Toolforge
- To do - Find out what is needed to get data on all technical contributions/contributors
- In progress - Coordinate with Bitergia and get data on "Avg. Time Open (Days)" for Gerrit patchsets per affiliation and "time to first review" data for patches (by end of Q4).
- To do - Gather and publish current numbers on technical contributions provided by Bitergia in the Quarterly Tech Community newsletter (by Jan 2020)
- Reduce Complexity of the Platform, Movement Diversity
- Support Wikimedia's diverse technical communities
- To do - Develop workshop concept with partner community for technical workshops in Q3
- To do - Conduct workshop and document the technical challenges small wikis face in North America
- In progress - Coordinate GCI. In Q2/Q3, more than 35 mentors volunteer in Google Code-in to provide tasks and mentor students in more than 70 task instances
- In progress - Coordinate Outreachy round 19. At least 5 featured projects are accepted for Outreachy round 19 by Oct 1st (done). At least 5 projects are successfully completed by Outreachy interns by end of Q3.
- In progress - Prepare and hold session on Wikimedia's Tech internships at WikiCon North-America
Dependencies for core work are on: SRE/Data Center Operations team
Status
- October 28, 2019 status:
- All Debian Jessie instances are removed/replaced in 95% of Cloud VPS hosted projects (Annual unused project/instance purge) In progress
- 60% of the remaining Debian Jessie systems in the hardware layer underlying Cloud VPS are upgraded to Debian Buster or Stretch (Cloud VPS Domain name(s) migration) In progress
- Create a shared understanding of systems and service continuity and availability constraints in the current Cloud VPS product which can be used to design follow-on projects to reduce single points of failure and establish practices for testing and maintaining continuity and availability of Cloud VPS core services. In progress
- Deploy a Kubernetes 1.15.2+ cluster in Toolforge which will be used to provide a more modern, secure, and performant PaaS baseline to Tool maintainers. In progress
- Technical internships + mentoring – Q2 In progress
- Coordinate new rounds, GCI In progress
- Create a blog by and for technical audiences where members of the technical community can post about their technical work. In progress
- Publish 6 (min) technical blog posts In progress
- Coordinate Tech Talks and increase views on tech talks by 10%/quarter In progress
- A dashboard for Wikimedia Cloud Services edit data is available to the Wikimedia movement In progress
- Coordinate with Bitergia and get data on "Avg. Time Open (Days)" for Gerrit patchsets per affiliation and "time to first review" data for patches (by end of Q4). In progress
- Coordinate GCI. In Q2/Q3, more than 35 mentors volunteer in Google Code-in to provide tasks and mentor students in more than 70 task instances In progress
- Coordinate Outreachy round 19. At least 5 featured projects are accepted for Outreachy round 19 by Oct 1st. Done
- At least five projects are successfully completed by Outreachy interns by end of Q3. In progress
- Prepare and hold session on Wikimedia's Tech internships at WikiCon North-America In progress
- December 12, 2019 status:
- Done
- Develop Technical Engagement narrative and shared understanding in the team
- Technical internships and mentoring: Mentor 3 students in GSOD, GSOC, Outreachy
- Blog posts on Small Wiki Toolkits & Coolest Tool Award
- Conduct Coolest Tool Award 2019
- Design & publish the 1st edition of the Tech Engagement quarterly newsletter
- Continue Tech Talks
- Develop support format: Coordinate Small Wiki Toolkits focus area, create toolkits & experiment, evaluate, iterate, document
- Provide continuous bug management support in Phabricator (ongoing)
- Publish Technical Contributors Map
- Develop visualization tool for WMCS edit data/integrate WMCS edit data in existing tools
- Advocate for better processes to support developer productivity (ongoing)
- HA for OpenStack API endpoints (keystone, glance, nova, designate)
- Improve Toolforge documentation (ongoing every quarter)
- Jessie deprecation (infra + Cloud VPS)
- OpenStack version upgrade(s)
- Toolforge Kubernetes redesign/upgrade
- Improve Cloud VPS documentation (ongoing)
- In progress
- Hire Developer Advocate
- Done