Wikimedia Release Engineering Team/Deployment pipeline/2018-12-20

Last Time

General

  • "I survived another meeting that could have been an email"
    • Strive for this not to be true
    • Sometimes it is
    • Let's be bold about skipping (but let's have an email version instead)
  • topic: discuss Beta aka deployment-prep and k8s
    • (couldn't find task that tracks this)
    • but we have a patch instead: https://gerrit.wikimedia.org/r/c/operations/puppet/+/478637
    • Marko: is beta important? If so something should be done. Have run into this since the last meeting
    • Joe: I would like to move to a proper staging. Do things have to be in beta? Probably not, but sometimes they are needed
    • Marko: A higher percentage of the puppet code we use in production will become obsolete or maybe won't be in puppet
    • Joe: whatever is needed to test a mediawiki extension is probably needed there (for services)
    • Joe: hiera to run this image, use this config, etc. Want to avoid setting up a k8s cluster to run in beta that is different than production.
    • Marko: try this next quarter for eventgate
    • Joe: I want to try with mathoid Soon™
  • Track and install additional npm packages for all service container images
    • SRE nodeX base image in the operations base image repo
    • Joe: gc-stats?
    • Marko: used for sending stats
    • Dan: There's another way to do this with Blubber that doesn't involve relying on a "custom" docker-pkg base image
      • Plus-side: more ability to make changes by services
      • Downside: lots of blubber file duplication
  • Allow access to blubberoid.discovery.wmnet:8748
    • Summary so far:
      • Use Cases: local development, CI, Pipeline building prod images
      • Dan: single deployment for developers and CI and prod unifies environments (due to things like policy files [not currently in use, but is useful])
      • Alex: WMCS can't talk to wmnet, so opening to WMCS == opening to everyone
      • Alex: Blubber as a Service (BaaS) works counter to unified tooling because it neglects offline/low-bandwidth use-case
        • Dan: I don't see how the service model works counter to unifying but perhaps it works counter to an offline dev-env requirement that we haven't named. That's fine but we shouldn't conflate the requirements
      • Joe: people download and install so much untrusted binary garbage from github, we can distribute binaries for linux/windows/osx quite easily I think?
        • Thcipriani: FWIW, we currently do have garbage binaries, built via the `make release` target in the repo and posted on my people page, unfortunately: https://people.wikimedia.org/~thcipriani/blubber/
        • Lars: it would be good to avoid perpetuating bad security practices? Sure, that wasn't my point :)
      • Joe: I wouldn't point developers to BaaS, but it could be exposed publicly -- low potential for abuse
      • Dan: I don't see much potential for abuse either
      • liw: provides means to overload CPU, but maybe k8s policies can prevent this
      • Alex: we have policies already (1800 millicores is blubber's limit -- max found via testing with Jeena)
      • Alex: I worry that BaaS becomes critical to the tooling due to networking problems for developers -- out-of-date policy files, out-of-date blubber
      • fselles: could commit output from blubberoid into some repo
      • Joe: could generate lots of variants from one blubber file; I think we could tell folks to download the binary from gerrit
      • Joe: I worry about a tool that creates images for the k8s cluster being dependent on the k8s cluster -- maybe we should use blubberoid in its own container -- I need to think this through
      • fselles: we have 2 clusters; also, we should trust blubberoid
      • Joe: redundancy probably means this is OK
      • compromise: use blubber for local development, and blubberoid for CI
      • TASK: releases should be updated automagically
      • EPIC TASK: for developer tooling to keep track of this discussion
        • Alex: we know the components of the developer tooling, but we don't know how those will fit together yet
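
The compromise above (blubber for local development, blubberoid for CI) could be sketched roughly as follows. This is only an illustration under stated assumptions: the host and port come from these notes, but the URL scheme, the `/v1/<variant>` path, the `CI` environment variable, and the `BuildPlan` and `plan_dockerfile_build` names are hypothetical and not something the meeting agreed on. Nothing here contacts the network; it only decides which route to take.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple

# Host and port as named in the notes; the scheme and "/v1" path are assumptions.
BLUBBEROID_URL = "http://blubberoid.discovery.wmnet:8748/v1"


@dataclass(frozen=True)
class BuildPlan:
    """Describes how a Dockerfile would be generated (hypothetical helper type)."""
    mode: str                    # "local" or "service"
    argv: Tuple[str, ...] = ()   # command to run when mode == "local"
    url: str = ""                # endpoint to POST the config to when mode == "service"


def plan_dockerfile_build(config_path: str, variant: str,
                          env: Optional[Mapping[str, str]] = None) -> BuildPlan:
    """Pick the local blubber binary for developers and blubberoid for CI."""
    env = os.environ if env is None else env
    # Assumption: CI jobs export a non-empty CI variable.
    if env.get("CI"):
        return BuildPlan(mode="service", url=f"{BLUBBEROID_URL}/{variant}")
    # Local development works offline with the released binary.
    return BuildPlan(mode="local", argv=("blubber", config_path, variant))
```

With this split, a developer on a slow or absent connection never depends on the service, which is the offline concern Alex raised, while CI keeps a single centrally updated blubber.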


RelEng

Serviceops

  • TEC3 goal posted by mark
    • Lots of services for next quarter
    • ORES is going to consume some time
  • changeprop/cpjobqueue at least a month apart?
    • Marko: need some clarification; I don't think that's doable. We need the same version of the kafka driver, and since these share a repo I'm not sure how to use node6 and node10 with the same driver version
    • Joe: cpjobqueue is scary to move (we can only handle a few minutes of outage for that service). If we need to stagger these repos we could maybe use the same deploy repo
    • Joe: we could maybe use git branches, or something, for a short period: we shouldn't migrate both at the same time
    • Marko: heuristics in terms of resource allocation for these services
    • Alex: both of these are hard to benchmark
    • fselles: try to assign similar resources and adjust using monitoring
    • Joe: I think we're not limited to 1 process per pod
    • Alex: we do not want to use ncpu
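
The notes do not record why ncpu is unwanted. One common reason, offered here only as a possible reading, is that a CPU count detected inside a container reflects the host rather than the pod's CPU limit. A minimal sketch of sizing workers from the limit instead, where `parse_cpu_millicores` and `workers_for_limit` are hypothetical names and the one-worker-per-full-core rule is an illustrative heuristic, not a decision from the meeting:

```python
def parse_cpu_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity ("1800m", "2", "0.5") to millicores."""
    q = quantity.strip()
    if q.endswith("m"):
        # "1800m" means 1800 millicores, i.e. 1.8 CPUs.
        return int(q[:-1])
    # Plain numbers are whole or fractional CPUs.
    return round(float(q) * 1000)


def workers_for_limit(cpu_limit: str) -> int:
    """Hypothetical heuristic: one worker per full core of the limit, minimum one."""
    return max(1, parse_cpu_millicores(cpu_limit) // 1000)
```

Under this heuristic a pod limited to `1800m` would run a single worker, and the starting value would then be adjusted from monitoring, as fselles suggested.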

Services

As Always