Deployment tooling/Future

As of June 2015, there are two competing deployment tools:

A diagram describing Wikimedia's production deployment process, used to deploy MediaWiki and extensions to the production web cluster.

Trebuchet: a Salt- and git-based deployment mechanism
Scap: a Python-based deployment mechanism that deploys MediaWiki and handles many MediaWiki-specific tasks

The future state is to create one deploy tool to rule them all.

The broader picture is that RelEng, Services, and some Opsen (toward the end of the quarter) have been meeting about the future of all deployment tooling, taking inventory of the current state and discussing possible improvements.

I can hopefully distill the findings of the quarter below:

High-level deployments

In order to evaluate deployment tooling, we first had to take stock
Each deployment process was discussed in its current form
We then attempted to break deployment down into abstract steps (see abstract deployment diagram)

Pre-deployment: Prepare code to be deployed; Add necessary branches and tags; Update submodules and dependencies; Build other necessary dependencies (e.g., l10n updates); May be manual or automated
Deployment: Begin command and control interface (git-deploy or scap); Content distribution (git-deploy == git; scap == rsync)
Post-transfer: Orchestration on each appserver (git-deploy == Salt; scap == SSH); Perform service restart actions; Rudimentary checks (is the service up? is it responsive to basic requests?); Provide feedback to the command and control interface
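
The three phases above can be sketched as a small pipeline. This is only an illustration of the abstract breakdown, not how Scap or Trebuchet is implemented; `DeployPlan`, `run_deploy`, and the callables they take are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical type: a per-host step takes a host name and returns True on success.
Step = Callable[[str], bool]


@dataclass
class DeployPlan:
    prepare: Callable[[], None]  # pre-deployment: branches, tags, submodules, builds
    transfer: Step               # content distribution to one host (e.g. git or rsync)
    post_transfer: List[Step] = field(default_factory=list)  # restarts, basic checks


def run_deploy(plan: DeployPlan, hosts: List[str]) -> Dict[str, bool]:
    """Run the phases in order and return per-host success as feedback."""
    plan.prepare()
    results: Dict[str, bool] = {}
    for host in hosts:
        ok = plan.transfer(host)
        # Post-transfer steps run only if the transfer succeeded;
        # all() stops at the first failing step.
        ok = ok and all(step(host) for step in plan.post_transfer)
        results[host] = ok
    return results
```

The returned dictionary stands in for the feedback to the command and control interface: the caller can see exactly which hosts failed instead of having errors swallowed.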

Much of the discussion took place on Phabricator ticket T97068.

Current problems

Scap does not perform follow-up actions on hosts (restart service, check service health, etc.), although this may no longer be true in a week
Trebuchet has had issues (like T63882), mostly Salt-related, that have fragmented the git-deploy deployment process (read: Services kind of does its own thing, cf. this and this)
Trebuchet has also had issues with unclear feedback for deployers
Troubleshooting Trebuchet without root is difficult
Having two deployment systems means more to maintain and more to know

Future requirements

There were many discussions about requirements for an ideal deployment system, beyond simply addressing current problems:

Code distribution mechanism that is fast enough and scalable enough to work for large deployments (read: MediaWiki)
Orchestration mechanism that is fast enough and scalable enough to work for large deployments (read: MediaWiki)
Command and control mechanism that allows for easy aborts and provides needed feedback without swallowing errors
Modular system that allows some customization of specific deployment phases per repo
Rolling deploys and the ability to configure the size of the initial deploy pool
Post-deploy host health checks
Rollback mechanism for failed health checks
Minimum privilege escalation for deployment
No further fragmentation of the deployment system
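
As a rough illustration of how three of these requirements fit together (a configurable initial deploy pool, post-deploy health checks, and rollback on failure), here is a minimal sketch. The function names and the `deploy`, `healthy`, and `rollback` callables are assumptions made for the example and do not correspond to any existing tool.

```python
from typing import Callable, List, Sequence


def batches(hosts: Sequence[str], initial: int, batch: int) -> List[List[str]]:
    """Split hosts into an initial pool followed by fixed-size batches."""
    if initial < 1 or batch < 1:
        raise ValueError("initial and batch must be positive")
    groups = [list(hosts[:initial])]
    for i in range(initial, len(hosts), batch):
        groups.append(list(hosts[i:i + batch]))
    return [g for g in groups if g]  # drop the empty group when hosts is empty


def rolling_deploy(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    initial: int = 1,
    batch: int = 10,
) -> bool:
    """Deploy group by group; roll back every deployed host if a check fails."""
    done: List[str] = []
    for group in batches(hosts, initial, batch):
        for host in group:
            deploy(host)
            done.append(host)
        if not all(healthy(h) for h in group):
            # Undo in reverse order so the most recent hosts are reverted first.
            for host in reversed(done):
                rollback(host)
            return False
    return True
```

Keeping the initial pool small limits the damage of a bad deploy: if the first host fails its health check, no other host is touched.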

Ideas/discussion that led us here

Ideally, deployment tooling would be combined and modernized. To that end, many graphs were created demonstrating that our deployment processes are not too dissimilar.
Trebuchet has a mechanism to perform follow-up tasks on each deployment target by running a custom Salt execution module post-fetch and post-checkout. If you specify a service name in the `repo_config` pillar, it can also restart a service, which is close to what we want for MediaWiki deploys.
Deploying MediaWiki via Trebuchet would require significant work on both Trebuchet and infrastructure: Trebuchet relies on each node pulling code from a central git server (tin). Deploying MediaWiki via Trebuchet has evidently been tried in the past; doing so would need a fan-out of the git repo to proxy git servers.
While Trebuchet seemingly has all the features required of a modern deployment system, in practice Salt has had some issues (T102808) that make us reluctant to move forward with it.
None of the deployment systems is perfect. Trebuchet is pretty close to what we want, but the problems with Salt have made it a difficult system to work with. Scap doesn't do 100% of what we want it to do, but it's reliable and works at MediaWiki's scale.
Further ideas and discussion are on Phabricator at T101023
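
To make the fan-out idea above concrete: instead of every target pulling from the central git server, each target would pull from one of a few proxies, and only the proxies would pull from the central server. The sketch below only computes such an assignment. The round-robin strategy, the function name, and the host names in the usage note are assumptions for illustration, not a description of any existing setup.

```python
from typing import Dict, List


def assign_proxies(targets: List[str], proxies: List[str]) -> Dict[str, List[str]]:
    """Spread deployment targets round-robin across proxy git servers."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    plan: Dict[str, List[str]] = {p: [] for p in proxies}
    for i, target in enumerate(targets):
        plan[proxies[i % len(proxies)]].append(target)
    return plan
```

With hypothetical hosts, `assign_proxies(["mw1", "mw2", "mw3", "mw4", "mw5"], ["p1", "p2"])` maps `p1` to `mw1`, `mw3`, `mw5` and `p2` to `mw2`, `mw4`, so the central server serves two clients rather than five.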

Next quarter (July-Sept 2015)

Let's build a deployment system!

Instead of trying to build a deployment system that is perfect and works for everything (a seemingly impossible task), let's build a deployment system that is modular and test it initially with a single use-case (see the task "Create new RESTBase deploy method (tracking)"). The initial narrow focus lets work progress more quickly because it tightens the testing feedback loop.

To avoid falling into the trap of many competing standards (https://xkcd.com/927/), we have done or are doing the following:

Done: Gather requirements for most deployment use-cases of Trebuchet and Scap (cf. T97068)
Evaluate approaches to meeting requirements based on adherence to future requirement criteria
Flesh out the best approach into a more holistic system
Test the system against RESTBase and evaluate
If initial rollout to RESTBase is a success, work quickly to expand the system to cover all current deployment uses