Wikimedia Release Engineering Team/Deployment pipeline/2020-05-27

2020-05-27

Always

TODOs from last time

  • TODO make mw/core task
  •   Done TODO Alex to reach out to Mate re: Wikia k8s
    • he replied, will reach out to folks
    • Their CTO is already talking to Grant (likely)
    • It'll follow that work
  •   Done TODO thcipriani to reach out to ORES folks
    • Welcome Aaron!
  •   Done TODO thcipriani find status of kask integration tests
    • https://phabricator.wikimedia.org/T209106
      • Needs Cassandra on every run (doable via helm test? see the sketch below)
      • Imported for kask -- has a requirements file that includes Cassandra
      • Kubernetes chart incubator -- uses a Docker Hub image -- will have to check with serviceops about using it in production
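
A rough, hypothetical sketch of what "doable via helm test" could look like: a Helm test-hook pod that exercises the kask service once the release (and its Cassandra dependency) is up. The pod name, service name, port, and health endpoint below are illustrative assumptions, not taken from the actual chart.

```yaml
# Hypothetical templates/tests/integration-test.yaml in a kask chart.
# Service name, port, and endpoint are assumptions for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: "{{ .Release.Name }}-kask-integration-test"
  annotations:
    "helm.sh/hook": test   # Helm 3 test hook; Helm 2 uses "test-success"
spec:
  restartPolicy: Never
  containers:
    - name: smoke-test
      image: docker-registry.wikimedia.org/buster:latest   # assumed utility image with curl
      command: ["curl", "--fail", "--silent", "http://{{ .Release.Name }}-kask:8081/healthz"]
```

Cassandra itself would still have to come up for each run, e.g. as a chart dependency such as the incubator chart mentioned above, with the Docker Hub image caveat that goes with it.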


General

  • Questions from Build blubber file for ORES
    • Does the base image need to come from the wmf docker registry? If so, then it might make sense for us to create an optimized base image that has the scipy + enchant binaries ready to go.
      • Background -- the enchant libraries provide dictionary lookups (i.e., features like what proportion of the words show up in a dictionary)
      • Might be better to pull these in manually instead of as apt packages
      • TODO: reply on task that we can install apt packages via Blubber (see the Blubberfile sketch below)
    • For the production image, we need to start a container to run the uwsgi service and also separate containers to run the celery workers; can we specify that in the Blubberfile? Or do we need to create a deployment template for the helm chart somewhere?
      • Questions answered in task. See PipelineLib. Some of the functionality might not be documented. Send specific questions to thcipriani via the task (async). A hedged variants sketch follows below.
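
For reference, installing the scipy/enchant system dependencies is expressible directly in a Blubberfile via the apt stanza. The fragment below is a hedged, minimal sketch; the base image and package names are assumptions, not the actual ORES configuration.

```yaml
# Hypothetical Blubberfile fragment for ORES; base image and package
# names are illustrative assumptions.
version: v4
base: docker-registry.wikimedia.org/python3-build-buster
apt:
  packages:
    - libenchant-dev     # enchant spell-checking library
    - aspell-en          # example dictionary package
    - build-essential    # only needed if scipy wheels must be built from source
python:
  version: python3
  requirements: [requirements.txt]
```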
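
On the uwsgi-plus-celery question, one plausible arrangement (hedged, not confirmed on the task) is to define separate Blubber variants with different entrypoints, so the pipeline can publish a web image and a worker image from the same repo, and let the helm chart decide how those containers are composed into pods. Entrypoints, module names, and the assumed "build" variant below are illustrative.

```yaml
# Hypothetical variants section: one image for the uwsgi service, one for
# the celery workers; assumes a "build" variant defined elsewhere in the file.
variants:
  production-web:
    includes: [build]
    entrypoint: ["uwsgi", "--http", ":8080", "--module", "ores.wsgi"]
  production-worker:
    includes: [build]
    entrypoint: ["celery", "worker", "--app", "ores.celery"]
```

The runtime topology (how many of each, which containers share a pod, resource limits) would then live in the deployment-charts helm templates rather than in the Blubberfile.
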
  • Question from Deploy ORES in kubernetes cluster
    • Also wondering if there are any sort of RAM limits per pod. Right now our production servers need ~56GB (and growing) to run all the models, although some models are much more memory-intensive than others. We should eventually split out the different model types into separate containers so we can manage memory better, but for now everything is hosted together
    • Worker process loads all potential models into memory -- that gets big -- how should we think about memory usage in this context: should we prioritize splitting things up? I don't think we can reduce overall memory, but memory footprint *per container* could be reduced.
    • Alex: CoW complicates things a lot; containers may actually increase the 56GB use -- I'm not sure we're able to accommodate 56GB of memory per container because that'd be 1 container per node -- it might make sense to lower memory per container (see the resources sketch below)
    • Aaron: Models per wiki are related -- we can group models around "contexts" i.e., wikis -- the biggest single model would be under a GB by itself
    • Alex: we would still need to know how many containers we need for each wiki
    • Aaron: open question: how would we fit that into a scaling strategy
    • Alex: autoscaling -- we're working on that -- the idea is that we could automatically scale workloads based on demand (see the HPA sketch below)
    • Aaron: timeline?
    • Alex: maybe October, maybe worse :(
    • Aaron: is this finishing within a fiscal year? Or is this a maybe-someday?
    • Alex: we have a person working on it, maybe by end of June, we'll know more then
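
On the per-pod RAM question above: limits are declared per container in the chart, so splitting models by context translates directly into smaller limits per container. Purely as a hedged illustration of where that knob lives (the image name and the numbers are placeholders, not a sizing recommendation for ORES):

```yaml
# Hypothetical container spec fragment from a deployment template;
# image name and numbers are placeholders, not ORES sizing guidance.
containers:
  - name: ores-worker
    image: docker-registry.wikimedia.org/wikimedia/ores:latest
    resources:
      requests:
        cpu: "1"
        memory: 2Gi
      limits:
        cpu: "2"
        memory: 4Gi
```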
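
And on autoscaling: what Alex describes would most likely surface as something like a HorizontalPodAutoscaler per deployment. The sketch below is only an illustration of "scale workloads based on demand"; the API version, metric, and thresholds are assumptions, not the plan of record.

```yaml
# Hypothetical HPA for an ORES worker deployment; names and thresholds
# are illustrative only.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: ores-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ores-worker
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```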

RelEng

  • deploy.sh
    • Automating steps for blubberoid deployment: https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/597653/
    • Trouble finding the service IP address in the environment -- the intent is that people can check their changes have been deployed to a specific environment
    • staging.svc.eqiad.wmnet or ${service}.svc.codfw.wmnet address a specific DC; ${service}.deployment.wmnet is multi-DC aware and will point you to the active DC
    • We'll have to change the directory structure at some point -- eventually we'll have to rip off the bandaid

Serviceops

  • Resuming deployment of mobileapps, chromium renderer, recommendation api
  • Pipeline-deployed images serve 20K rps
  • Interest in Wikia's porting of MediaWiki -- what pitfalls did they find?

As Always