Wikimedia Technology/Annual Plans/FY2019/TEC3: Deployment Pipeline

NOTE: This is a continuation of the FY18 program titled "Streamlined Services Delivery"

We will build a new production platform for integrated development, testing, deployment, and hosting of applications. This will greatly reduce the complexity, and increase the speed, of delivering a service and maintaining it throughout its lifecycle, with fewer dependencies between teams and greater automation and integration. The platform will offer more flexibility through support for automatic high availability and scaling, abstraction from hardware, and a streamlined path from development through testing to deployment. Services will be isolated from each other for increased reliability and security.

A big focus for this year is reducing the risk of each deployment we make, addressed by 3 of the work streams:

Unifying our Continuous Integration infrastructure and tooling with production (with the added benefit of speeding up developer feedback),
Giving our Release Engineers the tools they need to assess and reduce risk of any given deployment, and
Deploying new updates to our users through percentage-based stages to catch issues early, before all users have a bad experience.

NOTE: This program covers only the "build" through "deploy" stages in the above image. Support for the "dev" stage (and more) is covered in the Developer productivity proposal. If funded, that program would be merged into this one for manageability.

Goals

Program outline

Teams contributing to the program

Release Engineering, Site Reliability Engineering, and Services

Annual Plan priorities

Primary Goal: 3. Knowledge as a Service - evolve our systems and structures

How does your program affect annual plan priority?

By enabling our developers to quickly see their code in production, we will enable faster and more efficient product development, aiding all who create and consume the sum of all human knowledge.

Program Goal

We will streamline and integrate the delivery of services by building a new production platform for integrated development, testing, deployment, and hosting of applications.

Wikimedia developers experience tooling parity between our Continuous Integration (CI) and production environments, which enables them to release code more frequently by continuously reducing risk.

Outcomes

Outcome 1: Continuous Integration is unified with production tooling and developer feedback is faster

Output 1.1: Convert current CI builds to use the new tooling (Blubber).

Output 1.2: Set up test execution time profiling with a report, and make a prioritized list of improvements to how tests are run.

Output 1.3: Research and share a report of our options for implementing delta-only/code path aware testing.

Outcome 2: Deployers have a better assessment of risk with each deploy

Output 2.1: Create a deployments report with metrics from the Code Health Group.

Output 2.2: Stretch: Create a dashboard for real-time insight into the deployment report.

Output 2.3: Improve our incident response, post-mortem, and follow-up management tooling.

Outcome 3: Deployments happen through percentage-based stages (e.g., canaries, 10%, 100%)

Output 3.1: Migrate services currently on our "shared service cluster" into Kubernetes deployments with staged rollout.

Output 3.2: Make preparations for moving MediaWiki into the Kubernetes system by defining a set of broad service-level tests.
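
To illustrate the idea behind percentage-based stages, here is a minimal sketch of one common approach: hashing each target into a stable bucket and widening the rollout stage by stage, stopping when a health check fails. It is not a description of the actual Deployment Pipeline implementation. The stage names, the 1% canary share, and the functions `bucket`, `in_stage`, and `rollout` are all hypothetical and chosen only for illustration.

```python
import hashlib

# Hypothetical stages modelled on the "canaries, 10%, 100%" example.
# The 1% canary share is an assumption, not a figure from the plan.
STAGES = [("canary", 1), ("partial", 10), ("full", 100)]


def bucket(key: str) -> int:
    """Map a key (for example a host name) to a stable bucket in 0..99."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_stage(key: str, percent: int) -> bool:
    """A key belongs to a stage if its bucket falls below the stage percentage."""
    return bucket(key) < percent


def rollout(keys, healthy):
    """Widen the rollout stage by stage.

    `healthy(stage_name, updated_keys)` is a caller-supplied check.
    Returns (name of the last stage that passed, or None; set of updated keys).
    """
    updated = set()
    completed = None
    for name, percent in STAGES:
        # Only touch keys that were not already updated in an earlier stage.
        batch = {k for k in keys if in_stage(k, percent)} - updated
        updated |= batch
        if not healthy(name, updated):
            return completed, updated  # stop before exposing more targets
        completed = name
    return completed, updated
```

Because the bucket depends only on the key, every target that was in the canary stage is also in each later stage, so an issue caught early affects only the small share of targets updated so far.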

Outcome 4: Developers are able to create services that achieve production-level standards with minimal overhead

Output 4.1: Developers get a service-creation experience that is on par with production-level standards with regard to logging, monitoring, security, and configuration.

Outcome 5: Services and the deployment pipeline are hosted on production-level infrastructure

Output 5.1: Adequately maintain the service infrastructure according to production standards. Upgrade the platform to new upstream versions to benefit from bug fixes and new features.

Outcome 6: Developers and deployers are aware of the platform, its benefits and how to make use of it

Output 6.1: Create a developer portal for the Deployment Pipeline platform with documentation and instructions.

Output 6.2: Promote the platform's adoption.

Resources

People	FY2017–18	FY2018–19
Release Engineering	~2 Engineers	0.25 ✕ QA Engineer (contractor, reallocated); 0.75 ✕ Software Engineer (contractor, reallocated); Software Engineer (reallocated); Software Engineer (reallocated); 0.5 ✕ Software Engineer (reallocated); 0.75 ✕ Sr Software Engineer (reallocated); 0.5 ✕ Engineering Manager (reallocated)
SRE	~0.5 Site Reliability Engineers	Site Reliability Engineer (new hire); 0.5 ✕ Site Reliability Engineer
Stuff (CapEx)
		Kubernetes cluster for Continuous Integration use (should be in SRE's CapEx already).
Travel & Other
		(Consolidated with Code Health and Reliability, Performance and Maintenance): 8 ✕ Developer Summit & Team offsite; 8 ✕ Hackathon & Team offsite

Targets