Wikimedia Technology/Annual Plans/FY2019/TEC3: Deployment Pipeline
NOTE: This is a continuation of the FY18 program titled "Streamlined Services Delivery"
We will build a new production platform for integrated development, testing, deployment, and hosting of applications. This will greatly reduce the complexity, and increase the speed, of delivering a service and maintaining it throughout its lifecycle, with fewer dependencies between teams and greater automation and integration. The platform will offer more flexibility through support for automatic high availability and scaling, abstraction from hardware, and a streamlined path from development through testing to deployment. Services will be isolated from each other for increased reliability and security.
A big focus for this year is reducing the risk of each deployment we make, addressed by 3 of the work streams:
- Unifying our Continuous Integration infrastructure and tooling with production (with the added benefit of speeding up developer feedback),
- Giving our Release Engineers the tools they need to assess and reduce risk of any given deployment, and
- Deploying updates to our users through percentage-based stages, to catch issues early before all users have a bad experience.
NOTE: This program only covers the "build" through "deploy" stages in the above image. Support for the "dev" stage (and more) is covered in the Developer productivity proposal. If funded, that program would be merged into this one for manageability.
Program outline
Teams contributing to the program
Release Engineering, Site Reliability Engineering, and Services
Annual Plan priorities
Primary Goal: 3. Knowledge as a Service - evolve our systems and structures
How does your program affect annual plan priority?
By letting our developers see their code in production quickly, we will enable faster and more efficient product development, aiding all who create and consume the sum of all human knowledge.
Program Goal
We will streamline and integrate the delivery of services by building a new production platform for integrated development, testing, deployment, and hosting of applications.
Wikimedia developers experience tooling parity between our Continuous Integration (CI) and production environments, which enables them to release code more frequently by continuously reducing risk.
Outcomes
Outcome 1: Continuous Integration is unified with production tooling and developer feedback is faster
- Output 1.1
- Convert current CI builds to use the new tooling (Blubber).
- Output 1.2
- Set up test execution time profiling with a report, and make a prioritized list of improvements to how tests are run.
- Output 1.3
- Research and share a report of our options for implementing delta-only/code path aware testing.
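To make the idea behind Output 1.3 concrete, here is a minimal sketch of delta-only test selection: only the suites that cover the changed files are run, with a fallback to the full run when a change cannot be mapped. The `COVERAGE_MAP` contents, the directory names, and the `select_suites` helper are hypothetical and for illustration only; they do not describe an option the report has chosen.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from source directories to the test suites covering them.
COVERAGE_MAP = {
    "includes/api": {"tests/api"},
    "includes/parser": {"tests/parser"},
    "resources": {"tests/qunit"},
}
ALL_SUITES = set().union(*COVERAGE_MAP.values())


def select_suites(changed_files):
    """Return the test suites to run for a list of changed file paths."""
    selected = set()
    for name in changed_files:
        path = PurePosixPath(name)
        matched = False
        for prefix, suites in COVERAGE_MAP.items():
            # A file is covered when the mapped directory is one of its parents.
            if PurePosixPath(prefix) in path.parents:
                selected |= suites
                matched = True
        if not matched:
            # Unknown code path: be conservative and run everything.
            return set(ALL_SUITES)
    return selected


print(select_suites(["includes/api/ApiMain.php"]))
```

The conservative fallback trades speed for safety: a change outside the known mapping never skips tests.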
Outcome 2: Deployers have a better assessment of risk with each deploy
- Output 2.1
- Create a deployments report with metrics from the Code Health Group.
- Output 2.2
- Stretch: Create a dashboard for real-time insight into the deployment report.
- Output 2.3
- Improve our incident response, post-mortem, and follow-up management tooling.
Outcome 3: Deployments happen through percentage based stages (eg: canaries, 10%, 100%)
- Output 3.1
- Migrate services currently on our "shared service cluster" into Kubernetes deployments with staged rollout.
- Output 3.2
- Make preparations for moving MediaWiki into the Kubernetes system by defining a set of broad service level tests.
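The percentage-based stages named in Outcome 3 can be sketched generically as follows. This is an illustration of the concept, not of how Kubernetes or the Wikimedia pipeline implements staged rollout: the 1% canary size, the hash-based bucketing, and the `healthy` callback are all assumptions made for the example.

```python
import hashlib

# Stage sizes in percent. The 1% canary size is an assumption; the plan
# only names "canaries, 10%, 100%".
STAGES = [1, 10, 100]


def bucket(target: str) -> int:
    """Map a target (host, pod, or user ID) to a stable bucket in 0..99."""
    digest = hashlib.sha256(target.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def rollout(targets, healthy):
    """Deploy stage by stage; stop at the first stage that looks unhealthy.

    `healthy` is a hypothetical callback that returns True when a target
    is fine after receiving the update. Returns (completed, deployed).
    """
    deployed = set()
    for percent in STAGES:
        # Each stage adds the targets whose bucket falls below the new threshold.
        batch = [t for t in targets if bucket(t) < percent and t not in deployed]
        deployed.update(batch)
        if not all(healthy(t) for t in batch):
            return False, deployed  # halt before reaching later stages
    return True, deployed


hosts = [f"host{i}" for i in range(200)]
completed, deployed = rollout(hosts, lambda t: True)
print(completed, len(deployed))
```

Because the bucket is derived from a hash of the target name, a target that received the update at an earlier stage stays included at every later stage.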
Outcome 4: Developers are able to create services that achieve production level standards with minimal overhead
- Output 4.1
- Developers get a service-creation experience that is on par with production-level standards with regard to logging, monitoring, security, and configuration.
Outcome 5: Services and the deployment pipeline are hosted on production-level infrastructure
- Output 5.1
- Adequately maintain the service infrastructure according to production standards. Upgrade the platform to new upstream versions to benefit from bug fixes and new features.
Outcome 6: Developers and deployers are aware of the platform, its benefits and how to make use of it
- Output 6.1
- Create a developer portal for the Deployment Pipeline platform with documentation and instructions
- Output 6.2
- Promote the platform's adoption
Resources
People | FY2017–18 | FY2018–19 |
---|---|---|
Release Engineering | | |
SRE | | |
Stuff (CapEx) | | |
Travel & Other | | |
(Consolidated with Code Health and Reliability, Performance and Maintenance) | | |
Targets
Outcome 1: Continuous Integration is unified with production tooling and developer feedback is faster
- Target 1.1
- All Continuous Integration jobs are migrated to use production deployment tooling (e.g., Helm, Minikube, Docker, and Blubber).
- Measurement method
- This is measured by the number of Jenkins jobs migrated to our production deployment tooling (e.g., Blubber).
Outcome 2: Deployers have a better assessment of risk with each deploy
- Target 2.1
- We reduce the number of MediaWiki deployment incidents by 10%.
- Measurement method
- This is measured by the number of rollback-inducing deployments, whether through the weekly release train or SWAT deploys.
Outcome 3: Deployments happen through percentage based stages (eg: canaries, 10%, 100%)
- Target 3.1
- All services currently on our "shared service cluster" are deployed through percentage based stages.
- Measurement method
- This is measured by identifying which services are deployed on Kubernetes through a percentage based rollout method.
Outcome 4: Developers are able to create services that achieve production level standards with minimal overhead
- Targets
- 100% of new services that follow our own coding standards will have their logs collected and their metrics exposed and monitored, and will use encryption.
- Measurement method
- Number of Phab tasks under https://phabricator.wikimedia.org/tag/service-deployment-requests/
Outcome 5: Services and the deployment pipeline are hosted on production-level infrastructure
- Targets
- 99% availability for the deployment pipeline
- Measurement method
- CI availability metrics
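As a worked illustration of what the 99% target implies, the sketch below converts an availability percentage into a downtime budget and checks measured downtime against it. The 30-day reporting window and the function names are assumptions; the plan does not state the period over which availability is measured.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(target_percent, period_days):
    """Downtime budget implied by an availability target over a period."""
    period_seconds = period_days * SECONDS_PER_DAY
    return period_seconds * (100 - target_percent) / 100


def availability_percent(downtime_seconds, period_days):
    """Availability achieved, given total downtime in the period."""
    period_seconds = period_days * SECONDS_PER_DAY
    return 100 * (period_seconds - downtime_seconds) / period_seconds


# Assumed 30-day window: a 99% target allows 25920 seconds (7.2 hours) of downtime.
budget = allowed_downtime_seconds(99, 30)
print(budget, budget / 3600)

# Three hours of measured downtime in the same window stays within the target.
print(availability_percent(3 * 3600, 30) >= 99)
```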
Outcome 6: Developers and deployers are aware of the platform, its benefits and how to make use of it
- Targets
- Measurement method
- Survey among the target audience
Dependencies
- MediaWiki Platform: This program requires cross-team collaboration and planning for deploying MediaWiki and Services on a Kubernetes cluster.