Wikimedia Release Engineering Team/Deployment pipeline/2019-01-17
Last Time
Current Quarter Goals
- TEC3:O6:O:6.1:Q3: Deployment Pipeline Documentation
- TEC3:O3:O3.1:Q3: Move cxserver, citoid, changeprop, eventgate (new service) and ORES (partially) through the production CD Pipeline
General
- Pipeline cabal meetup as part of All-Hands?
- do we have any current projects (2 weeks from now) that would benefit from in-person, high-bandwidth collaboration?
- Hacking on ORES maybe?
- Productivity aside, it was fun at the previous Hackathon. :)
- Joe: We have some time Monday evening (maybe) but ideally we'd work at the hackathon that no one goes to :(
- Lars: is anyone opposed to hanging out?
- Alex: maybe a small group since that's more productive
- Joe: we should come up with some ideas for stuff to work on
TODO: start an email thread
- Lars's email > I'm looking for feedback on whether the vision I'm describing is where we want to end up.
- fselles: sounds like a comprehensive plan. re:deployment velocity we need existing metrics for this
- thcipriani: as an aside, RelEng is thinking about this
- alex: don't we have statsd counters?
- thcipriani: we do, but we can't trace individual scap commands to patches or windows etc.
- jeena: don't we care about how long it takes an individual patch to hit production?
- dan: I think elasticsearch might be better than statsd since we need to tie this together using logged metadata (repo deployed, scap commands, patchsets deployed, etc.) -- see the sketch at the end of this section
- Joe: one thing I didn't see was self-servicing, i.e., create a repo and everything is set up for a developer -- how much toil is needed for this?
- Alex: SRE/serviceops, RelEng, 4 different commits in 4 different projects -- there is quite a lot of friction here
- Joe: something more is needed in terms of UI from the point of view of the developer, we should think about setup from all points of view, maybe when someone creates a .pipeline file it sets up the pipeline for them
- Lars: I agree that the developer experience should be massively simpler than it currently is; as I was thinking about this I hadn't gotten as far as the UI yet
- Joe: we want to take it further!
- Lars: there is a proposal to start continuous deployment with the Blubberoid service, i.e., not load balancers and k8s; is there any objection to having a continuously deployed Blubberoid?
- Joe: what you are proposing is more than CDep; it is total ownership of a service -- Icinga needs work to allow this -- but I think we can experiment with CDep with Blubberoid
- Lars: the proposal picks Blubberoid because it is a nice, safe, small, and friendly service -- it has no dependencies or databases; input over http and output over http -- it can't get much simpler
- fselles: icinga does need work, but we need metrics
- Lars: RelEng will start thinking about what we need to make this happen
- Joe: we need to work on permissions for k8s
- Antoine: or we get an Icinga container in the pod that runs the service and deploy it ourselves via helm?
- Joe: Pearson does a namespace per project including a Jenkins instance, but let's not do this :) Sadly CDep may not be possible for some services since there are many interdependent services, so it's best to start with something simple
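- (sketch for the metrics thread above) a hypothetical structured log record that would tie one scap run to what was deployed -- field names are illustrative only, not an existing schema:
  deploy_log_entry:
    repo: mediawiki/services/citoid      # hypothetical example repo
    scap_command: deploy
    started: 2019-01-17T21:00:00Z
    duration_seconds: 312                # would feed deployment-velocity metrics
    patchsets_deployed:
      - Iabc123                          # gerrit Change-Ids included in this deploy
    deploy_window: SWAT                  # or train, etc.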
RelEng
- Dan working on "Manually defining artifacts results in default copy of all project files"
- Code proposal: .pipeline/blubber.yaml:
development:
  copies:
    - from: build
      source: /bin/foo
      destination: /bin/foo
    - from: local
      source: ./config.dev.yaml
      destination: ./config.yaml
- and a short-hand format/structure that expands (see the sketch below)
development:
  copies: [build, local]
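- presumably the short-hand expands to copying everything from each named source with default paths -- a rough guess (the defaults here are assumptions, not current Blubber behavior):
  development:
    copies:
      - from: build   # all artifacts from the build variant, default source/destination
      - from: local   # the whole local build context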
- The continuous release pipeline should support more than one service per repo
- Implicit assumption that every repo is one service -- Counterexample: MediaWiki! (good point :))
- Implicit assumption that there is one test entrypoint per repo
- Code proposal: .pipeline/config.yaml
pipelines:
serviceOne:
blubberfile: serviceOne/blubber.yaml # could be the default based on service name for the dir
helmConfig: serviceOne/helm.yaml # ditto
directory: src/serviceOne
variants:
test: [phpunit, mocha] # defaults to ["test"]
production: foo # defaults to "production", also supports false for test-only runs
serviceTwo:
directory: src/serviceTwo
- Joe: let's keep everything that the developer needs to control in the repo; what Dan is proposing seems sane to me
- Dan: this is the inverse pattern of mediawiki
- Added Wikimedia Portals to the TBD list on the migration-to-k8s task https://phabricator.wikimedia.org/T198901#4881831
- seems self-contained
- gets it out of the mediawiki deployment tree (/srv/mediawiki-staging)
- no more portals deploy in SWAT
Minor update things
- Blubber docs sparkle: https://wikitech.wikimedia.org/wiki/Blubber
- Blubber binary downloads on releases: https://releases.wikimedia.org/blubber/
- Thanks Alex for the review!
- ASIDE: Moving scap back to gerrit -- going to use the test portion of the pipeline to run tests -- it was really nice and simple to set up (for a person who has contributed to blubber, anyway); rough sketch below: https://phabricator.wikimedia.org/D1138
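- a rough sketch of what a .pipeline/blubber.yaml test variant for scap might look like (version string, base image, and packages are guesses; see D1138 for the real change):
  version: v3                                                    # Blubber config version at the time (assumption)
  base: docker-registry.wikimedia.org/wikimedia-stretch:latest   # illustrative base image
  apt:
    packages: [python3, python3-pip, tox]
  variants:
    test:
      entrypoint: [tox]                                          # the pipeline's test stage runs this variant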
Serviceops
- Zotero has been handed over to Marrielle today \o/
- Managed to deploy, rollback and get changes through the pipeline
- One issue that did come up was the difficulty of finding out the version/tag of the image.
- Should jenkins-bot comment on the change and say "Here's the newly created image: <version>"? +1 +1 (even better if it's not in a comment but somewhere more visible) +1
- Joe: main technical points are there, but we need to polish the ui of the pipeline
- Jeena: is there no visual indication in jenkins?
- Dan: Kinda sorta -- we have the Blue Ocean dashboard, but it's not the default and it needs work -- feedback needs to be addressed sooner rather than later
- fselles: I ran a patch through the pipeline and it failed, which is fine, but I had no way to rerun it
- thcipriani: currently you can comment "recheck" on a patch, but that is totally not discoverable, I want a gerrit plugin for this