Wikimedia Release Engineering Team/Deployment pipeline/2019-01-29

This discussion took place at the 2019 WMF All-Hands at the Bently Reserve.

Last Time

Current Quarter Goals

General

  • ideal flow through the pipeline from the developer perspective and/or mapping the current flow from project inception to production.
  • More concrete idea: talk about what's missing for CDep Blubberoid to production
    • Alex: the problem with option B is that it changes things drastically
    • Joe: MVP for developers so we need to provide a decent experience, focus on what that is in 2 months' time
    • Greg: coming up with an ideal workflow depends on what the endpoint is, i.e., a CDep is going to be a different experience
    • Joe: How many interactions do people need to have in order to get something through the pipeline? Right now it's coming to us to figure out what to do.
    • Greg: Avoid the cargo culting.
    • Tyler: This dovetails with conversation at the RelEng offsite.  How many points of contact and how can we reduce that?  Bringing in a bigger view would be useful.

Summarizing RelEng exercise:

  • Ideal next step:  Like toolforge but for production.
  • Dev requests a project that goes into a proper namespace on Gerrit.
  • Sets up CI, etc.
  • On repo creation adds a dotfile that configures pipeline.
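
As a rough illustration of the last step, a hook could check whether a change adds the pipeline dotfile and only then set up the pipeline jobs. This is a minimal sketch, not the actual CI logic: the path `.pipeline/config.yaml` is taken from the TODOs in these notes, while the function name and the shape of the input (a list of added file paths) are assumptions made for the example.

```python
import posixpath
from typing import Iterable

# Path taken from the TODO list in these notes; treat it as an assumption
# about where the pipeline configuration lives in a repository.
PIPELINE_CONFIG_PATH = ".pipeline/config.yaml"


def adds_pipeline_config(added_paths: Iterable[str]) -> bool:
    """Return True if any newly added path is the pipeline config file.

    `added_paths` is a hypothetical input: repository-relative paths of
    the files added by a change.
    """
    # normpath collapses forms such as "./.pipeline/config.yaml" so that
    # equivalent spellings of the same path compare equal.
    return any(
        posixpath.normpath(path) == PIPELINE_CONFIG_PATH for path in added_paths
    )


if __name__ == "__main__":
    print(adds_pipeline_config(["README.md", ".pipeline/config.yaml"]))
```

Only the repository-root location matches here; a file at `docs/.pipeline/config.yaml` would not trigger anything, which is a design choice of this sketch rather than something decided in the meeting.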

Discussion:

  • Dan: What's correct form of feedback for a developer?
  • Alex: Gerrit is the thing that developers interact with, so that should be the thing that users interact with, we shouldn't make developers click through to several sites
  • Joe: This is a common problem with how we report feedback to gerrit, but the amount of indirection means this is a bigger problem than it actually is. There's more interaction and it's a more complicated set of jobs
  • Mukunda: Deciphering console output is a mess.
  • Joe: Summary: creation of a pipeline should be automatic as soon as someone puts a .pipeline/config into their repository. Feedback from the pipeline needs to be better. Not have the link many pages down in gerrit.
  • James: the "standard" is github, you get a comment from a bot, you click that, you see travis, you see red X, you click that, you read that.
  • Travis output isn't that great either, basically.
  • Alex: for the failure scenario is fine to send people down a deep path, but in a success scenario we need something simple
  • Joe: Do we publish an image for each successful merge? (Yes)
  • Alex:  We publish for each successfully merged commit.
  • Joe: for whatever we merge we should get back the url of the artifact for the image
  • Lars: if you change the interface so that the link to the artifact is in the metadata area (???)
  • Dan: What's the MVP for a feedback mechanism in the short term?
  • Alex: docker image plus version, also nice to have a link to the entire pipeline state so you know it's step 1 vs step 2.
  • Alex: In Gerrit you know where you are in the process of getting it deployed. "I'm in step 2 of 5 steps to production"
  • Dan: adding another label to Gerrit would be simple, like an "image built" label with links to the docker registry url
  • Tyler: Summary of ideal workflow now:
    • How do you request a project currently? A task. For now keep that for the MVP.
    • Somehow get the url of the image into the Gerrit UI on successful build, and a link to the successful run
    • QUESTION: do we want to change the image creation process?
    • sidenote: no image per patchset :)
  • Joe: retention of our (old?) images needs an answer.
  • Joe: if we move to CDep it'll be impossible to store, for most things moving that direction, keep the latest N versions
  • Joe: questions of the workflow in the CI pipeline
    • a developer wants to build a nodejs project in the pipeline, are there things I need to do that are different here than what I used to do?
  • Mukunda: not much, just the .pipeline config
  • Dan: the blubber config has the entry point
  • Lars: we give a number of options to choose from, otherwise we end up with 100 projects copy/pasting but there turns out to be an issue so we have to upgrade them all
  • Joe: ... less free blubber templates...
  • Dan: blubber has proven to be flexible, which is good, without much modification at all. the importance of explicitness and tie in entrypoints/dependencies. Hesitant to make it more constrained than it is.
  • James: we have CI entry points across 2000 repos, we have bots to sync them together, not too worried about it being c/p and fix it later.
  • Lars: I'm convinced ^
  • Joe: there is a value in using containers so that developers are contained :)
  • Joe: would it be possible for someone to build their blubber image starting from an image not in our registry
  • Tyler/Dan: yes
  • Tyler: however we have a policy file that was built for this scenario
  • Joe: let's make it clear so that CI uses that policy file
  • Dan: the pipeline job references it which is centrally located (and away from developers ;) And you can get really specific with it.
  • Lars: we can make it do what we need. We should allow our developers to do something useful without constraining them too much.
  • Lars: For an MVP of CDep, we need to get it started and then iterate.
  • Joe: we just want to build images that start from ones we (bless)
  • Lars: we need to know what versions of what is in each image
  • Joe: that will be a part of debmonitor (as planned)
  • Fabian: sometimes updates have to be done, what happens when we update Debian, we need to figure out the underlying services
  • Joe: we will know because when we build an image through the pipeline we submit it to a thing that analyzes it with debmonitor. How do we update those images after building? TBD.
  • Tyler: we have a task about mass rebuilding all the images
  • Tyler: to answer your nodejs developer question:
  • Joe: James needs to convince audiences to migrate to it
  • Joe: from SRE's side, what does a developer need to do..
  • Alex: you get your image, you're happy, the pipeline deployed it CI staging...
    • ssh deploy1001, scap-helm, 100 lines of bash, give it an image version, it deploys, to eqiad, or staging
    • the user interface includes setting things via ENV variables
    • moriel has already used it herself
    • it's ugly UI
  • currently evaluating replacing it with helmfile ( https://github.com/roboll/helmfile )
    • TODO pipeline should use helmfile
  • things devs can't do: LVS, DNS, etc
  • Lars: there are few review points: eg: does this project make any sense to Wikimedia? needs a security review?
  • Joe: how it's done now ^
  • Lars: not just security but also SRE, design an implementation that's suitable for production
  • James: will the helmfile configs be in the repo itself or somewhere else? pros and cons...
  • Joe: <argues for centralization>
  • Alex: operations/deployment-charts
  • Joe: to get into production 
  • create a helm chart via scaffold script in the deploy-charts
  • review from SRE, setting up DNS, load balance it
  • Greg's arms are getting tight... slowing down with note taking
  • thcipriani picks up the baton!
  • Dan: We could make this part of our setup skaffold project, i.e., filing a task, what the skaffold script creates is probably confusing for newcomers
  • James: how often are people going to do this?
  • Joe: if we make the process good enough then probably we would see more services being created, but I think making a few requests is a Good Thing
  • Lars: in 1996 someone wrote a packaging helper and we went from a very small amount of packages to 600 packages
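
Joe's retention suggestion above (keep the latest N versions of an image) can be sketched as a small selection step. This is only an illustration under stated assumptions: the `ImageVersion` type, the `split_by_retention` function, and the idea of ordering by build time are hypothetical, and nothing here talks to a real registry.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class ImageVersion:
    """Hypothetical record of one published image version."""

    tag: str
    built_at: datetime


def split_by_retention(
    versions: Iterable[ImageVersion], keep: int
) -> Tuple[List[ImageVersion], List[ImageVersion]]:
    """Split versions into (kept, expired), keeping the `keep` newest builds."""
    if keep < 0:
        raise ValueError("keep must be zero or greater")
    # Newest builds first, so the first `keep` entries are the ones to retain.
    newest_first = sorted(versions, key=lambda v: v.built_at, reverse=True)
    return newest_first[:keep], newest_first[keep:]


if __name__ == "__main__":
    history = [
        ImageVersion("build-1", datetime(2019, 1, 1)),
        ImageVersion("build-2", datetime(2019, 1, 15)),
        ImageVersion("build-3", datetime(2019, 1, 29)),
    ]
    kept, expired = split_by_retention(history, keep=2)
    print([v.tag for v in kept], [v.tag for v in expired])
```

Returning the expired list instead of deleting anything keeps the decision separate from the action, so a real cleanup job could log or review the candidates first. What N should be, and whether currently deployed versions need extra protection, was left open in the discussion.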

Beta

  • Joe: has a solution, running an image in docker
  • Antoine: devs want to test stuff in beta with an updated service image in staging ask the backend to their thing
  • Joe: open staging to public internet
  • Alex is sad about that idea
  • Dan: deploy to the service namespace in automatically as part of the pipeline
  • Tyler: Can we have a k8s in labs for labs use?
  • Alex: BGP, LVS, Calico -- none of these things exist in labs
  • Exposing staging to beta cluster would require staging to be open to the public internet
  • TODO we'll need some way to update this automagically in beta...restart and pull

next steps

  • Lars: If Dan and Lars want to do this, what do we do next?
  • Joe: we're a bit behind on SRE side
  • Lars: not general person, me and dan :)
  • Joe: oh ok :)
  • Alex: missing a token in production
  • Dan: try to include the reporting back to gerrit (image uri etc)

Other questions

  • How will you know what is deployed?
  • How will you troubleshoot logs?
  • How do you troubleshoot deployments?
  • How do we rollback?

TODOs

  • TODO: write blubber policy to ensure that we're using only wmf base images
  • TODO: file task to automagically create job from seed job
  • TODO: file task about automagic setup of pipeline on .pipeline/config.yaml creation
  • TODO: continuous deployment, what's missing? a k8s api token on contint1001
  • TODO: support documentation like the one tyler did for the portal and pipeline/helmfile and deployment
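
The first TODO (only allow WMF base images) boils down to a check on the registry part of a base image reference. The sketch below shows that check in plain Python. It is not Blubber's policy file format or its enforcement code, and the registry host `registry.example.org` and the function name are placeholders assumed for the example.

```python
# Placeholder host, assumed for illustration; substitute the registry that
# holds the blessed base images.
ALLOWED_REGISTRY = "registry.example.org"


def is_allowed_base_image(
    image_ref: str, allowed_registry: str = ALLOWED_REGISTRY
) -> bool:
    """Return True if `image_ref` names an image on the allowed registry.

    References without an explicit registry host (for example
    "debian:stretch") are rejected, because they would be resolved
    against some default registry outside our control.
    """
    registry, separator, remainder = image_ref.partition("/")
    # Require an exact host match so that look-alike hosts such as
    # "registry.example.org.evil.test" do not pass.
    return bool(separator) and bool(remainder) and registry == allowed_registry


if __name__ == "__main__":
    for ref in ("registry.example.org/base:latest", "debian:stretch"):
        print(ref, is_allowed_base_image(ref))
```

Matching on the exact host rather than a string prefix is deliberate: a prefix test would accept any host that merely starts with the allowed name.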

RelEng

Serviceops

Services

As Always