Program Goals and Status for FY18/19

TEC3 Deployment Pipeline

Goal Owner: Greg Grossmeier
Program Goals for FY18/19: We will streamline and integrate the delivery of services, by building a new production platform for integrated development, testing, deployment and hosting of applications. Wikimedia developers experience a tooling parity between our Continuous Integration (CI) and production environments which enables them to release code more frequently by continuously reducing risk.
Annual Plan: TEC3 Deployment Pipeline
- Primary Goal is Knowledge as a Service: Evolve our systems and structures
- Tech Goal: Sustaining

Q1 Goals

Outcome 1 / Output 1.1

Continuous Integration is unified with production tooling and developer feedback is faster

Convert current CI builds to use the new tooling (Blubber).

Dependencies on: SRE team

Goal(s)

Move verify stage from Minikube to CI k8s namespace in production context

Status

Note: July 2018

In progress

Note: August 10, 2018

In progress. Discussed that work on a patch is still ongoing; the pipeline job needs to be refactored for the new namespace. This will be a change to the existing service, but it will need to be refactored when we get to the shared library.

Note: September 14, 2018

This is now Done!

Q2 Goals

Outcome 1 / Output 1.2

Continuous Integration is unified with production tooling and developer feedback is faster

Set up test execution time profiling with a report, and make a prioritized list of improvements to how tests are run.

Goal

Formalize the collection of CI infrastructure and tooling metrics - task T205923

Status

Note: October 2, 2018

This is now In progress

Note: November 7, 2018

In progress

dduvall gave a presentation Monday looking at CI performance percentiles
Work continues on automating the collection of these metrics.
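As an illustration of the kind of percentile summary mentioned above, here is a minimal sketch. It assumes build durations are already available as a plain list of seconds; the function name, the sample data, and the chosen percentiles are hypothetical and are not taken from the actual metrics tooling.

```python
from statistics import quantiles


def duration_percentiles(durations_s, points=(50, 75, 95)):
    """Return {percentile: seconds} for a list of CI build durations.

    Hypothetical helper: the real metrics collection may work differently.
    """
    if len(durations_s) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points p1..p99;
    # "inclusive" treats the data as the whole population.
    cuts = quantiles(durations_s, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in points}


if __name__ == "__main__":
    # Made-up build durations in seconds, for illustration only.
    sample = [60, 75, 90, 120, 150, 180, 240, 300, 600, 1200]
    for pct, seconds in duration_percentiles(sample).items():
        print(f"p{pct}: {seconds:.0f}s")
```

Looking at high percentiles rather than the mean keeps a few very slow builds from being hidden by many fast ones.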

Note: December 6, 2018

This goal is Partially done, but we need to expose the interface of the metrics that we're collecting.

Outcome 2 / Output 2.3

Deployers have a better assessment of risk with each deploy

Improve our incident response, post-mortem, and follow-up management tooling.

Goal

Develop set of metrics to assess incident reports/post mortems.

Status

Note: October 2, 2018

This work has not yet been started (To do).

Note: November 7, 2018

This is now In progress with Zeljko's analysis of the past year's worth of incident reports.

Note: December 6, 2018

We can now determine how associated commits, repos, etc. are connected to the incident reports, and consider this goal Done. There is more work that we can do to further refine the metrics.

Outcome 3 / Output 3.1

Deployments happen through percentage based stages (eg: canaries, 10%, 100%)

Migration of services currently on our "shared service cluster" into Kubernetes (k8s) deployments with staged rollout

Primary teams: Service Operations, Release Engineering
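To make the idea of percentage-based stages concrete, the following is a minimal sketch of how a staged rollout plan could be computed for a service with a fixed number of replicas. The 1% canary size, the function names, and the replica count are assumptions for illustration; the source only names the stages as canaries, 10% and 100%, and the real pipeline may drive rollouts quite differently.

```python
# Hypothetical stage percentages: the source lists "canaries, 10%, 100%";
# treating the canary stage as 1% is an assumption made for this sketch.
STAGES = (1, 10, 100)


def replicas_for_stage(total, percent):
    """Replicas running the new version at a stage (at least one)."""
    # Integer ceiling division avoids floating point rounding surprises.
    return max(1, -(-total * percent // 100))


def rollout_plan(total, stages=STAGES):
    """Return [(percent, new_replicas, old_replicas)] for each stage."""
    plan = []
    for percent in stages:
        new = replicas_for_stage(total, percent)
        plan.append((percent, new, total - new))
    return plan


if __name__ == "__main__":
    for percent, new, old in rollout_plan(20):
        print(f"{percent:>3}%: {new} new / {old} old")
```

Each stage gives deployers a chance to check error rates before the new version receives more traffic.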

Goal(s)

Adopt more services into the Deployment pipeline
- Migrate graphoid to the Deployment pipeline: Postponed
- Deploy zotero v2 to the Deployment pipeline: Done
Deploy blubberoid: Done
Reprise the work on the logging infrastructure - task T207200

Status

Note: October 2, 2018

This is now Done

Note: November 7, 2018

Deploy zotero v2 to the Deployment pipeline: Done
- Currently living in k8s staging
- Plan to go live next week
Deploy blubberoid: Done
- liw is working on changes to internal data structuring as a prerequisite to creating the OpenAPI spec required for the pipeline; on track.

Note: December 6, 2018

Zotero is Done. Graphoid will be recommended for stewardship review and is Postponed. Blubber will be Done in the next week. Reprising the work on the logging infrastructure is marked Done.

Q3 Goals

Outcome 1 / Output 1.2 (RelEng)

Continuous Integration is unified with production tooling and developer feedback is faster

Set up test execution time profiling with a report, and make a prioritized list of improvements to how tests are run.

Goal

Instrument Quibble for data collection
Create a graph of where time is spent and make a prioritized list of improvements.
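The step from collected timings to a prioritized list can be sketched as follows. This is only an illustration: it assumes the instrumentation yields (stage, seconds) records, and the stage names and numbers below are hypothetical rather than actual Quibble output.

```python
from collections import defaultdict


def prioritize(timings):
    """Rank stages by total time spent.

    timings: iterable of (stage_name, seconds) records.
    Returns [(stage_name, total_seconds, share_of_total)], largest first.
    """
    totals = defaultdict(float)
    for stage, seconds in timings:
        totals[stage] += seconds
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [
        (stage, total, total / grand_total if grand_total else 0.0)
        for stage, total in ranked
    ]


if __name__ == "__main__":
    # Made-up records; stage names are placeholders for illustration.
    records = [
        ("phpunit", 300),
        ("clone", 60),
        ("phpunit", 100),
        ("selenium", 240),
    ]
    for stage, total, share in prioritize(records):
        print(f"{stage:<10} {total:>6.0f}s  {share:.0%}")
```

The stages at the top of the ranking are the ones where an improvement would save the most wall-clock time per build.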

Status

Note: January 10, 2019

Discussed that as we've just gotten back from our vacations, this work is ramping up and is In progress

Note: February 5, 2019

These tasks were documented in Phab during last week's All Hands meetings.

Note: March 12, 2019

This goal is Stalled: it is in danger of not being finished this quarter and will become part of the Q4 goals.

Outcome 2 / Output 2.1 (RelEng)

Deployers have a better assessment of risk with each deploy

Create a deployments report with metrics from the Code Health Group.

Goal(s)

Select and integrate a code health metric solution into our tooling.

Status

Note: January 10, 2019

Discussed that as we've just gotten back from our vacations, this work is ramping up and is In progress

Note: February 2019

Discussed that this is contingent on some other program work in TEC13, but we should be able to get fully started on it soon.

Note: March 12, 2019

We've selected SonarQube as our metric solution, but we still need to create and finalize its integration (with SonarCloud) for self-hosting and get it integrated into CI (still In progress and moved into Q4 work).

Outcome 3 / Output 3.1 (SRE / ServiceOps & RelEng)

Deployments happen through percentage based stages (eg: canaries, 10%, 100%)

Migration of services currently on our "shared service cluster" into Kubernetes (k8s) deployments with staged rollout

Primary teams: Service Operations, Release Engineering, Core Platform

Goal(s)

Adopt more services in the pipeline

cxserver, ORES (partially), citoid, changeprop, cpjobqueue (stretch)
Deploy eventgate

Status

Note: January 10, 2019

Discussed that as we've just gotten back from our vacations, this work is ramping up and is In progress

Note: February 5, 2019

Discussed that one service (eventgate) is currently going through the pipeline; still In progress

Note: March 2019

- cxserver: all prerequisites are ready; it just needs to be deployed
- ORES: still In progress, with blockers identified
- citoid: Done
- changeprop: still In progress
- cpjobqueue: Postponed to Q4
- eventgate: deployed and Done

Note: April 8, 2019

cxserver is now Done

Outcome 4 / Output 4.1 (SRE / ServiceOps)

Developers are able to create services that achieve production level standards with minimal overhead

Developers get a service creating experience that is on par with production level standards with regard to logging, monitoring, security and configuration

Goal(s)

Evaluate helm charts management solutions

Status

Note: January 10, 2019

Discussed that as we've just gotten back from our vacations, this work is ramping up and is In progress

Note: February 5, 2019

Discussed this at All Hands, but we're working with SRE on it, still In progress

To do March 2019

Discussed...

Outcome 5 / Output 5.1 (SRE / ServiceOps)

Services and the deployment pipeline are hosted on production-level infrastructure

Adequately maintain the service infrastructure according to production standards. Upgrade the platform to new upstream versions to benefit from bug fixes and new features.

Goal(s)

Aim for a more resilient and scalable Kubernetes cluster service that is easier to manage and upgrade
- Upgrade cluster components to a newer version
- Improve docker registry architecture

Status

Note: January 10, 2019

Discussed that as we've just gotten back from our vacations, this work is ramping up and is In progress

To do February 2019

Discussed...

To do March 2019

Discussed...

Outcome 6 / Output 6.1 (SRE / ServiceOps)

Developers and deployers are aware of the platform, its benefits and how to make use of it

Create a developer portal for the Deployment Pipeline platform with documentation and instructions

Goal(s)

Create a developer portal with Deployment Pipeline documentation

Status

Note: January 10, 2019

Discussed that as we've just gotten back from our vacations, this work is ramping up and is In progress

To do February 2019

Discussed...

To do March 2019

Discussed...

Outcome 6 / Output 6.2 (SRE / ServiceOps)

Developers and deployers are aware of the platform, its benefits and how to make use of it

Promote the platform's adoption

Goal(s)

Conduct Promotion and Training events for Wikimedia developers

Status

Note: January 10, 2019

Discussed that as we've just gotten back from our vacations, this work is ramping up and is In progress

To do February 2019

Discussed...

To do March 2019

Discussed...

Q4 Goals

Outcome 1 / Output 1.2 (RelEng)

Continuous Integration is unified with production tooling and developer feedback is faster

Set up test execution time profiling with a report, and make a prioritized list of improvements to how tests are run.

Goal

Instrument Quibble for data collection
Create a graph of where time is spent and make a prioritized list of improvements.
Prepare the Deployment Pipeline for changes to our CI tooling.

Status

Note: April 8, 2019

This is now In progress, but the instrumenting is Blocked right now.

Note: May 7, 2019

This is still Blocked, as we are waiting on other teams.

Note: June 4, 2019

This is still Blocked, as we are waiting for our team to finish other, more urgent work; it will probably go into next FY.

Outcome 3 / Output 3.1 (RelEng + SRE)

Deployments happen through percentage based stages (eg: canaries, 10%, 100%)

Migration of services currently on our "shared service cluster" into Kubernetes deployments with staged rollout

Goal(s)

Create a .pipeline/config.yaml standard to give users more control over how their tests are run in the pipeline and allow the easy saving of artifacts at pipeline completion. (RelEng)
Migration of more services to the pipeline (RelEng + SRE) - task T212801:
- Wikidata Termbox SSR
- Kask for Session Storage Service
- cpjobqueue (stretch)
- ORES (stretch)

Status

Note: April 8, 2019

All goals are In progress

Note: May 7, 2019

All goals are In progress

Note: June 4, 2019

All goals are In progress. Termbox is now producing images but is not yet in production; kask is also Partially done. Stretch goals will probably move to next FY. Pipeline config work is Done.

Note: June 13, 2019

Migration should be done this quarter, but the overall program will remain In progress into next FY; stretch goals will be done next FY. (SRE update)

Outcome 5 / Output 5.1 (SRE)

Services and the deployment pipeline are hosted on production-level infrastructure

Adequately maintain the service infrastructure according to production standards. Upgrade the platform to new upstream versions to benefit from bug fixes and new features.

Goal(s)

Upgrade the infrastructure to recent/current software versions
Add dedicated security sensitive nodes to the Kubernetes clusters
Stretch: Implementation of a Helm chart management solution

Status

Note: June 13, 2019

Upgrade is still In progress and will be done by end of quarter. The addition of the dedicated nodes is Done and the stretch goal was just recently started and will be done by the end of the quarter too.

Outcome 6 / Output 6.1 + 6.2 (SRE)

Developers and deployers are aware of the platform, its benefits and how to make use of it

Create a developer portal for the Deployment Pipeline platform with documentation and instructions

Promote the platform's adoption

Goal(s)

Conduct at least N trainings for new pipeline users
Increase documentation quality

Status

Note: June 13, 2019

Trainings have been ongoing and are still In progress.
Work on documentation quality has been slow and will probably go into next quarter.
