Wikimedia Technology/Annual Plans/FY2019/TEC6: Address Infrastructure Gaps
Program outline
Teams contributing to the program
Site Reliability Engineering
Annual Plan priorities
Primary Goal: 3. Knowledge as a Service - evolve our systems and structures
How does your program affect annual plan priority?
The wiki projects are Wikimedia’s primary tool for advancing its mission, and the underlying infrastructure is core to its work. By evolving this infrastructure and strengthening the teams, processes and structures that support it, we are putting Wikimedia in a better position to execute its mid-term strategy.
Program Goal
Long-standing gaps in the resiliency, reliability and maintainability of Wikimedia’s technical infrastructure and resourcing of supporting teams are addressed.
- Outcome 1
- Technical staff are able to deploy their changes to Production with confidence that their improvements have been tested in a credible staging environment and work as expected.
- Output 1
- Create a staging cluster comparable to production infrastructure
- Output 2
- Migrate (micro)services to our Streamlined Service Delivery platform with integrated CI/CD
- Outcome 2
- Technical staff have increased visibility into the operation of our services and infrastructure.
- Output 3
- Modernize logging, alerting and metrics monitoring infrastructure
- Outcome 3
- Wikimedia projects and content are protected against major disasters that threaten availability.
- Output 4
- Strengthen backups with reliable and redundant backup infrastructure
- Output 5
- Serve projects and services out of multiple data centers
- Outcome 4
- Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.
- Output 6
- Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management
- Output 7
- Provision centralized, self-service identity and access management for privileged staff and volunteer accounts
- Outcome 5
- The Site Reliability team is able to perform its duties with adequate resourcing and a more reasonable division of responsibilities
- Output 8
- Continue the FY17-18 efforts to build a management support structure to support the SRE team's growth and process duties
- Output 9
- Address under-resourcing and reduce the bus factor in several key areas by adding engineering capacity/staffing
Resources
| | FY2017–18 | FY2018–19 |
|---|---|---|
| People (OpEx) | | |
| Stuff (CapEx) | TBD | TBD |
| Travel & Other | | |
Targets
Outcome 1
Outcome 2
- Target 2
- 20% increase of services having adopted the modern metrics stack
- 100% of services involved in page views are using centralized logging
- Measurement method
- Percentage of modern metrics stack adoption
- Percentage of production services using centralized logging
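As a rough illustration of how the two Outcome 2 measurements could be computed, the sketch below works from a hypothetical service inventory. The `Service` record, its field names, and the helper functions are assumptions made for illustration and do not reflect any actual Wikimedia inventory or tooling. It also assumes that the 20% increase is measured relative to the adoption share at program start, which the plan does not specify.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical inventory record; all fields are illustrative."""
    name: str
    serves_page_views: bool    # involved in serving page views
    modern_metrics: bool       # has adopted the modern metrics stack
    centralized_logging: bool  # ships logs to centralized logging


def percentage(part: int, whole: int) -> float:
    """Return part/whole as a percentage, or 0.0 for an empty whole."""
    return 100.0 * part / whole if whole else 0.0


def metrics_adoption(services: list[Service]) -> float:
    """Share of all services that use the modern metrics stack."""
    return percentage(sum(s.modern_metrics for s in services), len(services))


def page_view_logging_coverage(services: list[Service]) -> float:
    """Share of page-view services that use centralized logging."""
    relevant = [s for s in services if s.serves_page_views]
    return percentage(sum(s.centralized_logging for s in relevant), len(relevant))


def relative_increase(baseline: float, current: float) -> float:
    """Relative change from baseline to current, in percent (assumed reading)."""
    if baseline == 0:
        raise ValueError("baseline adoption must be non-zero")
    return 100.0 * (current - baseline) / baseline
```

Under these assumptions, the targets would be met when `relative_increase(baseline, metrics_adoption(services))` is at least 20 and `page_view_logging_coverage(services)` equals 100.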
Outcome 3
- Targets
- > 90% of backup generation jobs succeed
- < 5% of services that are important and relevant to the wider public are served out of a single data center
- Measurement method
- Ratio of successful/failed backup generation jobs
- Number of services that are served out of a single data center
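The Outcome 3 measurements reduce to two simple ratios. The sketch below shows one way they could be checked against the targets; the function names and any counts passed to them are hypothetical and are not taken from actual backup or service-catalog tooling. Because the targets are stated as percentages, the sketch converts the raw job and service counts accordingly.

```python
def backup_success_rate(succeeded: int, failed: int) -> float:
    """Percentage of backup generation jobs that succeeded."""
    total = succeeded + failed
    if total == 0:
        raise ValueError("no backup jobs recorded")
    return 100.0 * succeeded / total


def single_dc_share(single_dc_services: int, total_services: int) -> float:
    """Percentage of public-facing services served from one data center."""
    if total_services == 0:
        raise ValueError("no services recorded")
    return 100.0 * single_dc_services / total_services


def meets_outcome_3_targets(succeeded: int, failed: int,
                            single_dc_services: int,
                            total_services: int) -> bool:
    """True when both targets hold: > 90% success and < 5% single-DC."""
    return (backup_success_rate(succeeded, failed) > 90.0
            and single_dc_share(single_dc_services, total_services) < 5.0)
```

Both comparisons are strict, matching the "> 90%" and "< 5%" wording of the targets, so a success rate of exactly 90% would not count as meeting the first target.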
Outcome 4
- Measurement method
- Amount of time spent on manual, non-automated tasks ("toil") in common workflows as indicated in repeated surveys of SREs
Outcome 5
- Target
- 50% improvement between status quo at program start and 3-year "healthy" goal
- Measurement method
- Progression on the SRE team's "get healthy" staff responsibilities diagram
Dependencies
n/a