Wikimedia Release Engineering Team/Deployment pipeline/2019-03-14
2019-03-14
Last Time
Current Quarter Goals
- Roughly 2 weeks left!
- TEC3:O3:O3.1:Q3: Move cxserver, citoid, changeprop, eventgate (new service) and ORES (partially) through the production CD Pipeline
- In progress cxserver
- Images built via deployment pipeline
- Namespaces created for k8s eqiad/codfw
- Serving 8% of traffic (evidently as of 3 hrs ago :)
- Plan is to finish moving the remainder of traffic on Tuesday!
- Done citoid
- changeprop
- Should we bump this?
- marko: we have to fix the kafka driver, which depends on both the node version and the kafka version: how will we handle different versions of different things?
- alex: side-step the problem and build the image with node6 (see the sketch after this list)
- Done eventgate
- Done (for this quarter I'd guess?) ORES
- cf: Dan's comments
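To make the node6 side-step mentioned above concrete, here is a minimal sketch of what pinning the base image in a blubber.yaml could look like. The image name, variant layout, and field values are illustrative assumptions, not the actual changeprop configuration:

```yaml
# Hypothetical blubber.yaml excerpt: pin the service to a node6 base image so the
# existing kafka driver keeps working, instead of fixing it for a newer node first.
# Image names and values are placeholders, not changeprop's real config.
version: v4
base: docker-registry.wikimedia.org/nodejs-slim   # assumed here to be a node6-based image
variants:
  production:
    node:
      requirements: [package.json, package-lock.json]
      env: production
    entrypoint: [node, server.js]
```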
Next Quarter Goals
editServices to migrate
- cpjobqueue
- marko: can use the node6 image, but scaling is still a problem: sometimes it uses a lot of resources, sometimes it does nothing. I worry about scaling. How do we determine the resources needed so that the service doesn't starve?
- jeena: are we against autoscaling?
- alex: autoscaler is not yet deployed. Could we work from current scb capacity?
- marko: will have to continue the conversation about the number of workers per pod—we don't want 100 pods, nor do we want 1 pod that is massive, so we'll have to find a balance (see the sizing sketch after this list)
- liw: are there means to perform benchmarks and capacity tests?
- marko: we know current resource usage
- alex: we have ways to perform benchmarks (jeena used that for blubberoid), but in this case we have prod services already
- marko: the most important thing is to get everything correct for when surges happen
- alex: I think we can accommodate, we're adding more capacity next quarter, we can also add more pods as needed. Provides more flexibility than the current environment
- marko: still manual
- alex: yes, manual, but we don't have any way to scale currently, so this is an improvement
- marko: worst-case scenario is that cpjobqueue would "just" begin to lag
- ORES
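As a reference point for the cpjobqueue sizing discussion above, a hedged sketch of the kind of Helm values that would encode the "a few workers per pod, scale out with more pods" balance. The key names (replicaCount, workersPerPod) and all numbers are hypothetical placeholders, not figures derived from current scb capacity:

```yaml
# Hypothetical values.yaml excerpt for a cpjobqueue-style chart: keep each pod small
# and predictable, then add pods (manually, for now) when surges happen.
replicaCount: 6          # placeholder; would be sized from current scb capacity
workersPerPod: 2         # hypothetical knob: a few workers per pod, not one giant pod
resources:
  requests:
    cpu: 500m            # placeholder figures, not measured usage
    memory: 400Mi
  limits:
    cpu: "1"
    memory: 800Mi
```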
New Services
- mobrovac: RESTBase?
- marko: the current thinking for next quarter (Q4) is to split RESTBase into 2 services: an api routing layer and a storage layer -- storage on the cassandra nodes (where RESTBase is now) -- api routing on k8s
- alex: termbox (wmde) -- renders javascript for wikidata; session storage for CPT—moves sessions into cassandra; Discourse for Quim
General
- Install heapdump and gc-stats when env is production
- tl;dr: installing node deps into images is hard and we're trying to figure out where to do that (see the sketch at the end of this section)
- QUESTION: what is the plan with "evaluation environments"?
- https://docs.google.com/document/d/1QU_6Svn4iduK0TPLSOghYP4g1lK-byCv-0ZKoHfIAVY/edit#heading=h.6gq2j7lm5pz8
- Tyler: Is this something happening in the near term / something RelEng should be involved in?
- Alex: What is that?
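For the heapdump/gc-stats item above, a minimal sketch of one place those deps could be installed: both are native addons, so whichever variant installs them needs a node-gyp toolchain. Field names follow our reading of Blubber v4, and the package and image names are assumptions, not a decided approach:

```yaml
# Hypothetical blubber.yaml excerpt: install the profiling modules as part of the
# normal npm install, with the toolchain needed to compile native addons.
version: v4
base: docker-registry.wikimedia.org/nodejs-slim
variants:
  build:
    apt:
      packages: [build-essential, python]   # toolchain for node-gyp / native addons
    node:
      requirements: [package.json, package-lock.json]   # heapdump + gc-stats listed here
  production:
    includes: [build]        # reuses the build variant's configuration
    node:
      env: production        # install with NODE_ENV=production
    entrypoint: [node, server.js]
```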
TODOs from last time
- In progress TODO: start a document on various attack vectors
- antoine and I started to talk about it
- thcipriani to more thoroughly noodle
- In progress TODO: support documentation like the one tyler did for the portal and for pipeline/helmfile and deployment
- https://wikitech.wikimedia.org/wiki/Deployment_pipeline now exists, https://wikitech.wikimedia.org/wiki/Continuous_Delivery has been deleted.
- TODO: Joe & James_F to work on eventual 2019-04-01 email
- Beware: announcements on 04/01 can be mistaken for an April Fools' joke
- Done TODO: improve feedback from pipeline—link to actual failing job, show images, and tags as applicable
- still no feedback for git tags https://phabricator.wikimedia.org/T177868#4984766
- tags also currently "failing", i.e., the run of test-and-publish fails (due to not being able to comment), but test and publish actually succeed
- image names point to the internal registry
- might be nice to vote on a label
- failure feedback is much improved IMO
RelEng
- Dan is starting work on .pipeline/config.yaml
- The pipeline should provide a way to save artifacts from a stage
- .pipeline/config.yaml Proposal The Latest™
- marko: how do services relate to the blubber.yaml?
- dan: you could use the same blubber file if you want, or you could specify a separate file if that makes sense. I want to have sensible defaults for these things, but if you do have special requirements you should be able to specify those and control the execution and steps in the pipeline. You can specify variants that are built and run in parallel, in addition to the sequential steps of the pipeline.
- marko: if I have one service and I want to use this to tell jenkins what to do, could that also be done?
- dan: yep. This has come up since we have people who want to run helm test, but don't want to deploy to k8s. There are other use-cases that want test variants but do not run helm test. This allows folks to specify which parts of the pipeline execute and in which order (a sketch of what such a config might look like is at the end of these notes)
- brennen: what happens if we wind up with CI tooling that conflicts with this?
- dan: what we have now is written in groovy so we'll have to refactor unless we move to jenkins x -- it's possible that this could be a benefit—perhaps there could be a translation layer
- hashar: the groovy is very minimal at this point so should be easy to refactor—let's migrate every year to ensure that we keep our code to a minimum! Point taken on potential of creating the next tech debt though.
- Tyler: We're migrating stuff to v4 of Blubber.
- Jeena: Talking with Greg about the local dev environment: we're working on the mediawiki part whereas pipeline is working on services. However, it seems like it's not really useful for developers if they can't run services in the local env. We've been adding services like RESTBase and parsoid; Greg also mentioned Zotero. These aren't classified as a priority to move to the pipeline for various reasons. For example, RESTBase.
- marko: you can use SQLite for RESTBase.
- Jeena: So there's not going to be an image built in near future...
- marko: Shouldn't be too much of an issue. The task just becomes repetitive.
- Jeena: My thought was: we're not officially putting them into k8s / the prod pipeline... Is it ok if we build images in the pipeline that aren't going to production?
- marko: We do serve the images... Well, we could.
- Maybe we could use different tags and create them semi-manually for the transition period (similar to https://github.com/wikimedia/mediawiki-containers )
- alex: depends on the service we're talking about. RESTBase and parsoid? moving to the pipeline is a saner approach than building manually
- dan: we could still use the same process
- it's going to depend on the service whether or not we put things through the pipeline. I.e. some services (MediaWiki) are not going to fit through the pipeline currently and in those instances we'll have to build manually (e.g., with docker-pkg)
- Antoine: tracking which versions of Debian packages are in which container image -- weren't we talking about a system to track this? This is going to be an issue soon. How do we know which images we need to rebuild for an update?
- alex: we're adding support for this to debmonitor, but it is not resourced. We want to do *exactly that*. We're writing an image lifecycle document inside serviceops
- hashar: if you have documents I'd be happy to read them
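Returning to the .pipeline/config.yaml discussion above, here is a hedged sketch of the kind of file the proposal describes: per-repo pipelines that say which blubber variants to build and run, what to publish, and whether to run helm test. The keys and values are illustrative guesses at the proposal's shape, not a finalized schema:

```yaml
# Hypothetical .pipeline/config.yaml sketch (shape only; see the proposal doc for the
# authoritative schema). Stages run in the order listed; saving artifacts from a stage
# is discussed in the proposal, but the key for that is not settled, so it is omitted.
pipelines:
  test:
    blubberfile: blubber.yaml          # could point at a service-specific file instead
    stages:
      - name: unit
        build: test                    # build the "test" blubber variant
        run: true                      # ...and run it
  publish:
    blubberfile: blubber.yaml
    stages:
      - name: production
        build: production
        publish:
          image: true                  # push the built image to the registry
      - name: staging
        deploy:
          chart: example-service       # hypothetical chart name
          test: true                   # only services that want `helm test` set this
```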