Deployment tooling/Future

As of June 2015, there are two competing deployment tools:

  1. Trebuchet: a Salt- and git-based deployment mechanism
  2. Scap: a Python-based deployment mechanism that handles many MediaWiki-specific tasks and also deploys MediaWiki

A diagram describing Wikimedia's production deployment process, used to deploy MediaWiki and extensions to the production web cluster.

The future state is to create one deploy tool to rule them all.

The broader picture is that RelEng, Services, and some Opsen (toward the end of the quarter) have been meeting about the future of all deployment tooling, taking inventory of the current state and discussing possible improvements.

I can hopefully distill the findings of the quarter below:

High-level deployments

  • In order to evaluate deployment tooling, we first had to take stock
  • Each deployment process was discussed in its current form
  • We then attempted to break apart the steps of deployment, in the abstract (see abstract deployment diagram)
      • Pre-deployment
          • Prepare code to be deployed
          • Add necessary branches and tags
          • Update submodules and dependencies
          • Build other necessary dependencies (e.g., l10n updates)
          • May be manual or automated
      • Deployment
          • Begin command and control interface (git-deploy or scap)
          • Content distribution (git-deploy == git; scap == rsync)
      • Post-transfer
          • Orchestration on each appserver (git-deploy == Salt; scap == SSH)
          • Perform service restart actions
          • Rudimentary checks: is service up? Is service responsive to basics?
          • Provides feedback to command and control interface
  • Much discussion took place on Phabricator on ticket T97068
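
The abstract phases above can be sketched as a small pipeline. This is only an illustration: `Phase`, `run_deploy`, and the step functions are hypothetical names, not part of Scap or Trebuchet, and the steps are stubs rather than real git, rsync, Salt, or SSH calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Phase:
    """One abstract deployment phase and its ordered steps (hypothetical)."""
    name: str
    steps: List[Callable[[str], bool]] = field(default_factory=list)


def run_deploy(phases: List[Phase], host: str) -> List[Tuple[str, str, bool]]:
    """Run every step of every phase in order for one host.

    Stops at the first failing step and returns (phase, step, ok) tuples,
    standing in for feedback to the command and control interface.
    """
    feedback = []
    for phase in phases:
        for step in phase.steps:
            ok = step(host)
            feedback.append((phase.name, step.__name__, ok))
            if not ok:
                return feedback
    return feedback


# Stub steps; real ones would shell out to git, rsync, Salt, or SSH.
def update_submodules(host: str) -> bool:
    return True


def transfer_code(host: str) -> bool:
    return True


def restart_service(host: str) -> bool:
    return True


def check_service_up(host: str) -> bool:
    # Pretend one host fails its rudimentary check.
    return host != "bad-host"


PHASES = [
    Phase("pre-deployment", [update_submodules]),
    Phase("deployment", [transfer_code]),
    Phase("post-transfer", [restart_service, check_service_up]),
]
```

Calling `run_deploy(PHASES, "bad-host")` stops at `check_service_up` and reports the failure as the last feedback entry instead of swallowing it.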

Current problems

  • Scap does not perform follow-up actions on hosts (restart service, check service health, etc.), although this may no longer be true in a week
  • Trebuchet has had issues (like T63882), mostly Salt-related, that have led to fragmentation of the deployment process via git-deploy (read: Services kinda does its own thing, cf. this and this)
  • Trebuchet has also had issues with unclear feedback for deployers
  • Troubleshooting Trebuchet without root is difficult
  • Having two deployment systems means more to maintain and more to learn.

Future requirements


There were many discussions about requirements for an ideal deployment system, beyond simply addressing current problems:

  1. Code distribution mechanism that is fast enough and scalable enough to work for large deployments (read: MediaWiki)
  2. Orchestration mechanism that is fast enough and scalable enough to work for large deployments (read: MediaWiki)
  3. Command and control mechanism that allows for easy aborts and provides needed feedback without swallowing errors
  4. Modular system that allows some customization of specific deployment-phases per repo
  5. Rolling deploys and the ability to configure the size of the initial deploy pool
  6. Post-deploy host health checks
  7. Rollback mechanism for failed health checks
  8. Minimum privilege escalation for deployment
  9. Don't fragment the deployment system further
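
Requirements 5 through 7 (rolling deploys with a configurable initial pool, post-deploy health checks, and rollback on failure) fit together roughly as in the sketch below. All function and parameter names here are hypothetical, and the `deploy`, `healthy`, and `rollback` callables are assumed to be supplied by the caller; this is not how Scap or Trebuchet is implemented.

```python
from typing import Callable, List, Sequence, Tuple


def batches(hosts: Sequence[str], initial_size: int, batch_size: int) -> List[List[str]]:
    """Split hosts into a small initial pool followed by fixed-size batches."""
    if initial_size < 1 or batch_size < 1:
        raise ValueError("pool sizes must be positive")
    hosts = list(hosts)
    groups = [hosts[:initial_size]]
    rest = hosts[initial_size:]
    groups += [rest[i:i + batch_size] for i in range(0, len(rest), batch_size)]
    return [g for g in groups if g]


def rolling_deploy(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    initial_size: int = 1,
    batch_size: int = 2,
) -> Tuple[bool, List[str]]:
    """Deploy batch by batch, checking health after each batch.

    If any host in a batch fails its health check, every host deployed so
    far is rolled back (most recent first) and the deploy is aborted.
    Returns (succeeded, hosts that were deployed to).
    """
    done: List[str] = []
    for group in batches(hosts, initial_size, batch_size):
        for host in group:
            deploy(host)
            done.append(host)
        if any(not healthy(host) for host in group):
            for host in reversed(done):
                rollback(host)
            return False, done
    return True, done
```

With hosts `["a", "b", "c", "d"]`, an initial pool of 1, a batch size of 2, and a health check that fails on `"c"`, the deploy touches `a`, then `b` and `c`, rolls back `c`, `b`, `a`, and never reaches `d`.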

Ideas/discussion that led us here

  • Ideally, deployment tooling would be combined and modernized. To that end, many graphs were created demonstrating that our deployment processes were not too dissimilar.
  • Trebuchet has a mechanism to perform follow-up tasks on each deployment target by running a custom Salt execution module post-fetch and post-checkout. If you specify a service name in the `repo_config` pillar, it can also restart a service, which is close to what we want for MediaWiki deploys
  • Deploying MediaWiki via Trebuchet would require significant work on both Trebuchet and infrastructure, because Trebuchet relies on each node pulling code from a central git server (tin). Past attempts to deploy MediaWiki via Trebuchet evidently showed that a fan-out of the git repo to proxy git servers would be needed.
  • While Trebuchet seemingly has all the features required of a modern deployment system, in practice Salt has had some issues (T102808) that make us reluctant to move forward with it.
  • Neither deployment system is perfect. Trebuchet is pretty close to what we want, but the problems with Salt have made it a difficult system to work with. Scap doesn't do 100% of what we want, but it's reliable and works at MediaWiki's scale.
  • Further ideas and discussion are on Phabricator at T101023
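
The fan-out idea mentioned above can be illustrated with a simple assignment plan: the central git server only serves a few proxies, and each deployment target pulls from one proxy. The sketch below uses round-robin assignment, which is an assumption made for illustration; `plan_fanout` and the host names are hypothetical and do not describe an existing Trebuchet feature.

```python
from typing import Dict, List, Sequence


def plan_fanout(targets: Sequence[str], proxies: Sequence[str]) -> Dict[str, List[str]]:
    """Assign each deployment target to one proxy git server, round-robin.

    The central server then only has to serve len(proxies) clients,
    and each proxy serves roughly len(targets) / len(proxies) targets.
    """
    if not proxies:
        raise ValueError("at least one proxy is required")
    plan: Dict[str, List[str]] = {proxy: [] for proxy in proxies}
    for i, target in enumerate(targets):
        plan[proxies[i % len(proxies)]].append(target)
    return plan


# Hypothetical host names, for illustration only.
plan = plan_fanout(["mw1", "mw2", "mw3", "mw4", "mw5"], ["proxy-a", "proxy-b"])
```

A real fan-out would also need the proxies to finish fetching from the central server before their targets start pulling, which this sketch does not model.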

Next quarter (July-Sept 2015)


Let's build a deployment system!

Instead of trying to build a deployment system that is perfect and works for everything (a seemingly impossible task), let's build a deployment system that is modular and test it initially with a single use-case: RESTBase (see the "Create new RESTBase deploy method (tracking)" task). The initial narrow focus allows work to progress more quickly since it tightens the testing feedback loop.

To avoid falling into the trap of many competing standards (https://xkcd.com/927/), we've attempted, or are attempting, to do the following:

  1. Done: Gather requirements for most deployment use-cases of Trebuchet and Scap (cf. T97068)
  2. Evaluate approaches to meeting requirements based on adherence to future requirement criteria
  3. Flesh out the best approach into a more holistic system
  4. Test system against RESTBase, evaluate
  5. If initial rollout to RESTBase is a success, work quickly to expand the system to cover all current deployment uses