Wikimedia Release Engineering Team/Quarterly review, August 2013

Date: August 21st, 2013

Time: 1:30pm Pacific (20:30 UTC)

Slides: gslides

Who:

Leads: Greg G, Chris M
Virtual team: Greg G, Chris M, Antoine, Sam, Chris S, Chad, Zeljko, Michelle G, Andre, Ariel
Other review participants (invited): Robla, Sumana, Quim, Maryana, James F, Ryan Lane, Ken, Terry, Tomasz, Alolita

Topics: Deploy process/pipeline, release process, bug fixing, code review, code management, security deploy/release, automation prioritization

Big picture

Release Engineering and QA are where our efforts in Platform can be amplified. When we do things well, we start to see more responsive development with higher quality code. That is our focus.

What we want to accomplish:

More appreciation of, response to, and creation of tests in development
Better monitoring and reporting out of our development and deployment processes
Reduce time between code being finished and being deployed, while finding issues with code earlier and with more certainty
Provide information about software quality in a way that informs release decisions
Help WMF Engineering learn and adapt from experience

All of this is in an effort to pave the way to a more reliable continuous deployment environment.

Team roles

Many people outside of the virtual team play an important role in releases, but this review will focus on the work of the following people in the following roles:

Release engineering: Greg G, Sam, Chris S (security)
QA and Test Automation: Chris M, Zeljko, Michelle G
Bug escalation: Andre, Greg G, Chris M, Chris S (security)
Beta cluster development/maintenance: Antoine, Ariel, Sam
Development tools (e.g. Gerrit, Jenkins): Chad, Antoine

What we've done

Built the Beta Cluster into something instrumental to the quality of the code we produce
- all platform and extension code merged to master is deployed to beta labs automatically
- automated database updates are greatly improved, though still under discussion
Provided embedded QA support to important feature teams (Language and Mobile)
Successfully transitioned to a one-week deploy cycle
Community growth through e.g. OPW, live and online training sessions, and the QA mailing list
Virtual team creation
Testing and automated browser tests across WMF development teams and projects

Still in progress

Proper support for all extensions in beta cluster https://bugzilla.wikimedia.org/show_bug.cgi?id=49846
Break browser tests out of the catch-all /qa/browsertests and into per-feature builds, following the Mobile model: CirrusSearch, ULS, VE. https://bugzilla.wikimedia.org/show_bug.cgi?id=52890 https://bugzilla.wikimedia.org/show_bug.cgi?id=52120

Goals for the next quarter

We have a lot; see also the list of sprints with associated tracking tickets.

Better align QA effort with high profile features
- see: QA testing levels describing test events
- Apply model of Language/Mobile embedded QA to a new feature team (specifically VisualEditor)
- Include more testing of user-contributed code (e.g. Gadgets)
- Increase capacity through community training for browser tests
Improve our deployment process
- automate as much as possible
- improve monitoring
- improve tooling (e.g. atomic updates/rollbacks and cache invalidation)
Take the Beta Cluster to the next level
- monitoring of fatals, errors, performance
- add more automated tests, e.g. for the API
- feed experience and knowledge gained from Beta Cluster automation into production automation

Stretch activities as time allows

Provide hermetic test environments for developers/testers/community. Vagrant shows the way.
Use Vagrant for targeted tests within the WMF Jenkins workflow

ACTIONS!

ACTION RL/CM/JF: Put together an RFP for an experienced tester for VisualEditor, with "experience writing automated tests" as a plus rather than a core requirement (Quim already has ~3 CVs from past QA events).
ACTION JF: The VE team have a hack idea, a JS splice-out proxy, that they will share so that others can use it (it only allows local testing against production where the code is in JS and executed client-side).
ACTION CM: Put browser tests in the repos of the features they test; this will allow tests to run more frequently than the current twice a day.
ACTION GG: We need test discoverability for Selenium and similar tests. Add to core's backlog a system for QA tests similar to how unit tests work in MW core right now?
ACTION GG: Outline the options for testing infrastructure and document where we want to go, what we're missing, and the pain points.
ACTION GG/CM/RL: Document the process for ideal test/deployment steps. Re-run the ThoughtWorks process we used two years ago to examine it and help us start to iterate?
ACTION GG: Add atomicity to the success metrics for the deploy-related goal.
ACTION GG/KS: Do retrospectives ("post-mortem" isn't a nice word).

Measures of success

Successfully integrate QA support in one more feature team (as defined by: more regular/predictable testing and more test coverage)
Automation provides the bulk of what is needed now to deploy code
The Beta Cluster has the same level of monitoring as production (just without the paging to Ops). https://bugzilla.wikimedia.org/show_bug.cgi?id=51497
Atomicity in deploys (see DevOps Sprint 2013)

Questions

What does communication between Product Management and QA look like?
There's a lot to do: where should we prioritize? Where should we build capacity?
Sign-off for bigger feature deploys/enablings?
Is how we plan on measuring success sufficient for your needs?

Worries

Our goals are wide-ranging and need support from multiple teams, maybe more so than the average goal list