Wikimedia Release Engineering Team/CI Futures WG/Meetings/2019-07-18
Purpose
Let's chat about where we are and what we can/need to do in the near term.
This meeting was prompted by the Zuul v2 / Python 2 EOL and the need to determine where we are on the relevant timelines.
See also: https://docs.google.com/spreadsheets/d/1TrkGTfPLR0C74va3XyY6faYplSh6UggGiPdmxIVm1uo/edit#gid=0
Notes
GitLab update
- gitlab and one runner, with ansible
- 3 components to address CI Arch
- VCS worker fetches components
- artifact store, outside of gitlab due to not being able to figure out gitlab's release storage
- test environment, just publishes the artifacts
- missing 4th component: a controller that's triggered by something, tells the VCS worker to fetch, and tells the deployment worker to pull from the artifact store and push to the test env.
- to hide gitlab behind it
- Process:
- Pushing to gitlab triggers commit stage
- commit stage builds binaries/runs unit tests, uploads to an artifact store (ick's)
- deployment worker gets artifacts and puts them into a test env
- hoping (at least some of) the written components can be shared across different POC options
- timing: either Friday or Monday Lars should have something for others to play with
- currently using a hello world C program
- Building blubber might be a good first test
- Worker can be a VM, a container in k8s, bare-metal machine
- Register workers for GitLab by running a script and passing a secret from the master
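
The flow described in the notes above (push, commit stage, artifact store, deployment to a test env, plus the missing controller) could be sketched roughly as follows. This is only an illustrative in-memory model, not Lars's actual PoC: every class and method name here (`ArtifactStore`, `TargetEnv`, `VcsWorker`, `DeploymentWorker`, `Controller`, `on_push`) is a hypothetical stand-in, and the "build" step is faked.

```python
from dataclasses import dataclass, field


@dataclass
class ArtifactStore:
    """Hypothetical in-memory stand-in for the external artifact store."""
    blobs: dict = field(default_factory=dict)

    def put(self, name, data):
        self.blobs[name] = data

    def get(self, name):
        return self.blobs[name]


@dataclass
class TargetEnv:
    """Hypothetical stand-in for the test environment, which just publishes artifacts."""
    published: dict = field(default_factory=dict)


class VcsWorker:
    """Fetches the source for a commit and runs the commit stage."""

    def __init__(self, store):
        self.store = store

    def commit_stage(self, repo, commit):
        # A real worker would check out the code, build binaries and run
        # unit tests here; this sketch only fabricates a placeholder artifact.
        artifact_name = f"{repo}-{commit}.bin"
        self.store.put(artifact_name, f"built from {repo}@{commit}".encode())
        return artifact_name


class DeploymentWorker:
    """Pulls an artifact from the store and pushes it to the test env."""

    def __init__(self, store, env):
        self.store = store
        self.env = env

    def deploy(self, artifact_name):
        self.env.published[artifact_name] = self.store.get(artifact_name)


class Controller:
    """The 'missing 4th component': reacts to a trigger and drives the workers."""

    def __init__(self, vcs_worker, deployment_worker):
        self.vcs_worker = vcs_worker
        self.deployment_worker = deployment_worker

    def on_push(self, repo, commit):
        artifact_name = self.vcs_worker.commit_stage(repo, commit)
        self.deployment_worker.deploy(artifact_name)
        return artifact_name


store = ArtifactStore()
env = TargetEnv()
controller = Controller(VcsWorker(store), DeploymentWorker(store, env))
artifact = controller.on_push("hello-world", "abc123")
```

Because the controller is the only piece that knows about the trigger, the system sending that trigger (GitLab in this PoC) stays hidden behind it, which is presumably what would let the other components be shared across different PoC options.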
Migration plans
- Tyler: Wanted to get to actual plan of attack for Zuul v2. 2 potential approaches:
- Migrate to v3 as interim step
- Do proof of concepts and do the work of migrating and building out new solution all in one step
- Python 2 reaches end of life at the end of 2019
- we *may* need to be off of python2 by the end of 2019
- TODO: Ask SRE if Python 2 is really going away.
- Tyler: Python 2 aside, we're already past EOL for Zuul v2. We've already had problems with this (Gearman isn't getting patched).
- Antoine: oh yeah the way Zuul talks with Jenkins is ... outdated and prone to breakage anytime Jenkins breaks something
- Jenkins plugin maintained by openstack (4 years ago) -- that person isn't involved in the openstack project -- and openstack doesn't even use jenkins
- Jenkins is going to make a lot of backwards-incompatible changes
- This is already the case for the pipeline jobs (Deployment Pipeline) -- they don't register in gearman -- so we cannot trigger the jobs from zuul -- we have a hack in place
- Greg: we're already hacking around zuul with Pipeline. Question: Dan will be back next week to move forward with the PoC; can we have him focus on argo?
- liw: I want to talk with Dan about argo
- Tyler: I feel like we ought to focus on Zuul v3 in the near term. We're going to get bitten (security issues)
- Greg: Other than switch to Ansible, what does that mean for infra? How involved is that really?
- Antoine: We could migrate Jenkins jobs to Ansible, have v2 run those, then migrate those to v3 - turns out to be way more complicated.
- Tyler: Is it possible to run 2 versions of Zuul server with different configurations to watch for different events?
- Antoine: Yes.
- The work that's already happened to migrate to docker makes most jenkins jobs very trivial
- docker run + jenkins to store output
- all of the logic is inside of shell scripts inside of docker containers
- Greg: infra changes: what does that mean hardware-wise? WMCS?
- Antoine: New CI solution cannot use WMCS.
- k8s + duplication of effort?
- Lars: We need somewhere to run GitLab, but GL can use k8s containers as runners...
- Tyler: I disagree with the idea that Zuul v3 is an equivalent amount of effort to full new CI. Production code isn't gonna stop moving. Fewer unknown-unknowns with Zuul v3. Unknowns for GitLab are bigger. We know Zuul v3 can handle the complex use cases.
- Željko: Other solutions might be equivalent work for unknown payoff? tl;dr: Move to v3 then...
- Tyler: Summary is there will be a big effort at the start of all projects; v3 will not have a long tail, others might.
- Lars: Even though v3 is entirely different?
- Tyler / Željko: Concepts / features are similar, implementation differs.
- Tyler: Is this right? Will it be easy once things are set up?
- Antoine: Still has concepts of pipelines.
- zookeeper + nodepool ... this stuff is new
- Does gearman still exist?
- PoC still needed
- Migration should be straight-forward after some initial PoC
- Where are artifacts stored? Dunno
- This requires a whole new infrastructure, just like all of our other options (argo, gitlab), but once the infra is set up, migrating the jobs should be straightforward
- Side note: Brennen's notes on Zuul v3 from earlier: https://phabricator.wikimedia.org/T218138
- Greg: Maybe we do v3 in a way that's aimed at speed / efficiency and not necessarily how we'd do it if we were staying with v3 long-term.
- Do we see any infra issues that we'd be blocked on?
- Tyler: Zookeeper and Nodepool will be contentious.
- Antoine: And Ansible.
- Lars: Suggested next action: Discuss with SRE, make decisions based on feedback.
- Hopefully temporary Zuul v3 migration.
- TODO set up a meeting with SRE about this
- TODO thcipriani to start google doc:
- What goes in doc:
- Background of problem (Tyler)
- Current POCs (eg: gitlab) (Lars)
- Constraints with EOL Zuul v2 (Antoine)
- Estimates of implementations for each option (collaborative)
- 1) zuulv3 now, migration to $something later
- 2) zuulv2 for a while past deprecation
- Doc should be ready to share early next week (2019-07-22)
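
Side sketch for Antoine's point in the migration discussion that most Jenkins jobs are now just "docker run + jenkins to store output", with the logic living in shell scripts inside the containers. A job then reduces to roughly one command line, which is part of why porting jobs between CI systems should be cheap. The image name, script path and mount points below are hypothetical placeholders, not the actual Wikimedia CI values.

```python
import shlex


def docker_run_command(image, script, workspace, log_dir):
    """Build the argv for a job whose logic lives in a script inside the image."""
    return [
        "docker", "run", "--rm",
        "--volume", f"{workspace}:/src",   # hypothetical mount point for the checkout
        "--volume", f"{log_dir}:/log",     # hypothetical mount point for job output
        image,
        script,
    ]


cmd = docker_run_command(
    image="example/ci-image:latest",  # hypothetical image name
    script="/run.sh",                 # hypothetical entry script inside the image
    workspace="/tmp/workspace",
    log_dir="/tmp/log",
)
print(shlex.join(cmd))
```

Whatever runs this command (Jenkins today, something else later) only needs to collect the log directory afterwards.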