Wikimedia Release Engineering Team/Deployment pipeline/2019-01-29
This discussion took place at the 2019 WMF All-Hands at the Bently Reserve.
Last Time
Current Quarter Goals
- TEC3:O6:O:6.1:Q3: Deployment Pipeline Documentation
- TEC3:O3:O3.1:Q3: Move cxserver, citoid, changeprop, eventgate (new service) and ORES (partially) through the production CD Pipeline
General
- ideal flow through the pipeline from the developer perspective and/or mapping the current flow from project inception to production.
- More concrete idea: talk about what's missing for CDep Blubberoid to production
- Alex: the problem with option B is that it changes things drastically
- Joe: MVP for developers so we need to provide a decent experience, focus on what that is in 2 months time
- Greg: coming up with an ideal workflow depends on what the endpoint is, i.e., a CDep is going to be a different experience
- Joe: How many interactions do people need to have in order to get something through the pipeline? Right now it's coming to us to figure out what to do.
- Greg: Avoid the cargo culting.
- Tyler: This dovetails with conversation at the RelEng offsite. How many points of contact and how can we reduce that? Bringing in a bigger view would be useful.
Summarizing RelEng exercise:
- Ideal next step: Like toolforge but for production.
- Dev requests a project that goes into a proper namespace on Gerrit.
- Sets up CI, etc.
- On repo creation adds a dotfile that configures pipeline.
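A minimal sketch of the dotfile opt-in idea above, assuming the dotfile lives at `.pipeline/config.yaml` (the path named later in these notes). The function names are hypothetical and this is not how the existing CI jobs are wired up; it only illustrates "the presence of the file is the request for a pipeline".

```python
from pathlib import Path

# Assumed location of the pipeline dotfile; the notes refer to it both as
# ".pipeline/config" and ".pipeline/config.yaml".
PIPELINE_CONFIG = Path(".pipeline") / "config.yaml"


def has_pipeline_config(repo_root):
    """Return True if the repository checkout contains the pipeline dotfile."""
    return (Path(repo_root) / PIPELINE_CONFIG).is_file()


def repos_needing_pipeline(repo_roots):
    """Filter repository checkouts down to those that opted in (hypothetical helper)."""
    return [str(root) for root in repo_roots if has_pipeline_config(root)]
```

A seed job could run something like `repos_needing_pipeline(...)` over known checkouts and create pipeline jobs only for the repositories it returns.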
Discussion:
- Dan: What's the correct form of feedback for a developer?
- Alex: Gerrit is the thing that developers interact with, so that is where feedback should appear; we shouldn't make developers click through to several sites
- Joe: This is a common problem with how we report feedback to Gerrit, but the amount of indirection means this is a bigger problem than it actually is. There's more interaction and it's a more complicated set of jobs
- Mukunda: Deciphering console output is a mess.
- Joe: Summary: creation of a pipeline should be automatic as soon as someone puts a .pipeline/config into their repository. Feedback from the pipeline needs to be better. The link should not be buried many pages down in Gerrit.
- James: the "standard" is GitHub: you get a comment from a bot, you click that, you see Travis, you see a red X, you click that, you read that.
- Travis output isn't that great either, basically.
- Alex: in the failure scenario it is fine to send people down a deep path, but in a success scenario we need something simple
- Joe: Do we publish an image for each successful merge? (Yes)
- Alex: We publish for each successfully merged commit.
- Joe: for whatever we merge we should get back the url of the artifact for the image
- Lars: if you change the interface so that the link to the artifact is in the metadata area (???)
- Dan: What's the MVP for a feedback mechanism in the short term?
- Alex: docker image plus version, also nice to have a link to the entire pipeline state so you know it's step 1 vs step 2.
- Alex: In Gerrit you know where you are in the process of getting it deployed. "I'm in step 2 of 5 steps to production"
- Dan: adding another label to Gerrit would be simple, like an "image built" label with links to the Docker registry URL
- Tyler: Summary of the ideal workflow now:
- How do you request a project currently? A task. For now keep that for the MVP.
- Somehow get the URL of the image into the Gerrit UI on successful build, and a link to the successful run
- QUESTION: do we want to change the image creation process?
- sidenote: no image per patchset :)
- Joe: retention of our (old?) images needs an answer.
- Joe: if we move to CDep it'll be impossible to store everything; for most things moving in that direction, keep the latest N versions
- Joe: questions of the workflow in the CI pipeline
- a developer wants to build a nodejs project in the pipeline: are there things I need to do that are different from what I used to do?
- Mukunda: not much, just the .pipeline config
- Dan: the blubber config has the entry point
- Lars: we give a number of options to choose from, otherwise we end up with 100 projects copy/pasting but there turns out to be an issue so we have to upgrade them all
- Joe: ... less free blubber templates...
- Dan: blubber has proven to be flexible, which is good, without much modification at all. Explicitness matters, as does the tie-in of entrypoints/dependencies. Hesitant to make it more constrained than it is.
- James: we have CI entry points across 2000 repos, we have bots to sync them together, not too worried about it being c/p and fix it later.
- Lars: I'm convinced ^
- Joe: there is a value in using containers so that developers are contained :)
- Joe: would it be possible for someone to build their blubber image starting from an image not in our registry
- Tyler/Dan: yes
- Tyler: however we have a policy file that was built for this scenario
- Joe: let's make it clear so that CI uses that policy file
- Dan: the pipeline job references it, and it is centrally located (and away from developers ;)). And you can get really specific with it.
- Lars: we can make it do what we need. We should allow our developers to do something useful without constraining them too much.
- Lars: For an MVP of CDep, we need to get it started and then iterate.
- Joe: we just want to build images that start from ones we (bless)
- Lars: we need to know what versions of what is in each image
- Joe: that will be a part of debmonitor (as planned)
- Fabian: sometimes updates have to be done; what happens when we update Debian? We need to figure out the underlying services
- Joe: we will know because when we build an image through the pipeline we submit it to a thing that analyzes it with debmonitor. How do we update those images after building? TBD.
- Tyler: we have a task about mass rebuilding all the images
- Tyler: to answer your nodejs developer question:
- Joe: James needs to convince audiences to migrate to it
- Joe: from SRE's side, what does a developer need to do..
- Alex: you get your image, you're happy, the pipeline deployed it to CI staging...
- ssh deploy1001, scap-helm, 100 lines of bash, give it an image version, it deploys, to eqiad, or staging
- the user interface includes setting things via ENV variables
- moriel has already used it herself
- it's ugly UI
- currently evaluating replacing it with helmfile ( https://github.com/roboll/helmfile )
- TODO pipeline should use helmfile
- things devs can't do: LVS, DNS, etc
- Lars: there are a few review points, e.g.: does this project make any sense to Wikimedia? Does it need a security review?
- Joe: how it's done now ^
- Lars: not just security but also SRE, design an implementation that's suitable for production
- James: will the helmfile configs be in the repo itself or somewhere else? pros and cons...
- Joe: <argues for centralization>
- Alex: operations/deployment-charts
- Joe: to get into production
- create a helm chart via scaffold script in the deploy-charts
- review from SRE, setting up DNS, load balance it
- Greg's arms are getting tight... slowing down with note taking
- thcipriani picks up the baton!
- Dan: We could make this part of our setup scaffold project, i.e., filing a task; what the scaffold script creates is probably confusing for newcomers
- James: how often are people going to do this?
- Joe: if we make the process good enough then we would probably see more services being created, but I think making a few requests is a Good Thing
- Lars: in 1996 someone wrote a packaging helper and we went from a very small amount of packages to 600 packages
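Joe's retention suggestion earlier in this discussion (keep the latest N versions) could look roughly like the sketch below. It assumes the tags for one image are already sorted oldest to newest; the function name and the tag values in the usage example are hypothetical, and nothing here talks to a real registry.

```python
def split_for_retention(tags_oldest_first, keep):
    """Split image tags into (kept, prunable), keeping the newest `keep` tags.

    `tags_oldest_first` is assumed to be ordered oldest to newest.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    # Everything before the cutoff index is old enough to prune.
    cutoff = max(len(tags_oldest_first) - keep, 0)
    return tags_oldest_first[cutoff:], tags_oldest_first[:cutoff]


kept, prunable = split_for_retention(["v1", "v2", "v3", "v4"], keep=2)
```

Here `kept` holds the two newest tags and `prunable` the two oldest. What N should be, and whether pruning is per service or global, is still the open question from the discussion.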
Beta
- Joe: has a solution, running an image in docker
- Antoine: devs want to test stuff in beta with an updated service image in staging, and ask the backend to do their thing
- Joe: open staging to public internet
- Alex is sad about that idea
- Dan: deploy to the service namespace automatically as part of the pipeline
- Tyler: Can we have a k8s in labs for labs use
- Alex: BGP, LVS, Calico -- none of these things exist in labs
- Exposing staging to beta cluster would require staging to be open to the public internet
- TODO we'll need some way to update this automagically in beta...restart and pull
next steps
- Lars: If Dan and Lars want to do this, what do we do next?
- Joe: we're a bit behind on SRE side
- Lars: not general person, me and dan :)
- Joe: oh ok :)
- Alex: missing a token in production
- Dan: try to include the reporting back to gerrit (image uri etc)
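A rough sketch of what the report back to Gerrit might contain, based on the MVP described in the discussion (image plus version, position in the pipeline, link to the run). The function name, message layout, and every URL below are placeholders, not an existing Gerrit or Jenkins integration.

```python
def format_pipeline_feedback(image, tag, step, total_steps, run_url):
    """Build a short review comment summarizing a successful pipeline run."""
    if not 1 <= step <= total_steps:
        raise ValueError("step must be between 1 and total_steps")
    return "\n".join([
        f"Image built: {image}:{tag}",
        f"Pipeline: step {step} of {total_steps}",
        f"Run: {run_url}",
    ])


# Placeholder values for illustration only.
message = format_pipeline_feedback(
    image="registry.example.org/my-service",
    tag="2019-01-29-abc123",
    step=2,
    total_steps=5,
    run_url="https://ci.example.org/job/42",
)
```

Keeping the success message to three lines follows Alex's point that the success path should be simple, with the deep link reserved for people who need to dig into a failure.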
Other questions
- How will you know what is deployed?
- How will you troubleshoot logs?
- How do you troubleshoot deployments?
- How do we rollback?
TODOs
- TODO: write blubber policy to ensure that we're using only wmf base images
- TODO: file task to automagically create job from seed job
- TODO: file task about automagic setup of pipeline on .pipeline/config.yaml creation
- TODO: continuous deployment, what's missing? a k8s api token on contint1001
- TODO: support documentation like the one Tyler did for the portal and pipeline/helmfile and deployment
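For the base image TODO above, the underlying check is small. The sketch below is not Blubber's policy file format; it only illustrates "accept a base image only if it comes from a registry we bless". The registry hostname and the image names are assumptions used for illustration.

```python
# Assumed allowlist of blessed registries; adjust to whatever the policy names.
ALLOWED_REGISTRIES = ("docker-registry.wikimedia.org",)


def is_blessed_base(image_ref, allowed=ALLOWED_REGISTRIES):
    """Return True if image_ref is of the form <allowed registry>/<name>.

    References without an explicit registry (e.g. "debian:stretch") are
    rejected, since they would resolve to a default public registry.
    """
    registry, sep, _rest = image_ref.partition("/")
    return bool(sep) and registry in allowed
```

Comparing the exact registry component, rather than checking a string prefix, avoids accepting look-alike hosts such as `docker-registry.wikimedia.org.example.com/...`.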