Wikimedia Release Engineering Team/CI Futures WG/Meetings/2019-12-11-RelEng+SRE sync

2019-12-11

Attendees: thcipriani, liw, greg, mark, giuseppe, effie, alexandros, dan

Documentation

Agenda

  • Since last meeting
  • This meeting
    • surface any additional requirements from SRE
      • Additional questions about security requirements
    • agree on where this system should live (there was talk about cloud hosting)
      • Agree whether or not we can deploy artifacts from a 3rd party hosted solution
    • project next steps


third-party discussion

Effie: in general we try to avoid 3rd party solutions; however, we have 2.5 people (joe: 1.5, alex: 0.5) working on k8s -- we have to be realistic before relying on k8s more than we do -- even with hiring someone that puts us at [some number of people] who will need to be onboarded

Mark: this is not k8s specific, CI is a fair amount of work to set up and maintain -- we are not well staffed between the two teams -- this is definitely part of plans for the deployment pipeline

Joe: deploying artifacts from external sources: building artifacts not hosted on our premises. Going with an external k8s, we could have a k8s that is not logically external, but physically external. Does external hosting have the security we need to do Continuous Delivery? I think so. There are some general concerns for clouds: CPU vulnerabilities, there is some risk involved in running things on infras where other people run things; however, building artifacts should be safe. I'm worried about exposing things to the internet; e.g., a web interface + allowing anyone to submit code

Greg: Is this generally a concern for building artifacts? or for having something talk to production k8s?

Joe: I don't like the idea of continuous deployment for our current environment. Gerrit is a concern. We have many exposed parts of our CI system. What slows us down is not the lack of continuous deployment; it's setting up CI and moving to production. I am dubious of our ability to keep such a system secure; e.g., we decided against having our Jenkins build security releases. My concern is that the same system that builds arbitrary patches from anyone is to be trusted. We're not ready for Continuous Deployment in general.

thcipriani: summary: we could use k8s for building artifacts and potentially could use them for continuous delivery, in an ideal world could we ever do continuous deployment from a hosted k8s?

effie: it's about amazon having access to our stuff (paraphrased), not sure how far it'll go and cause issues. We don't have cloud experience, generally. Do we have people who are able to manage aws/gce consoles?

mark: that's not only infra AAS, but also platform AAS (eg a full fledged solution). We have multiple levels of what we do here. Build on a k8s cloud service, or go buy something like gitlab.

Tyler: do we want to even explore non self-hostable solutions?

Dan: this question goes beyond us and SRE, the board would have to sign-off on it. This might not be a productive path of conversation since it's not an option.

Mark: from the c-team it's something we should strongly be considering

Dan: until we hear directly from them we can't move forward working under some vague assumption.

Mark: it's something they've asked us to look into

Effie: at the end of the day, are we *ready* to go off premise? We operate under the assumption that we're operating on our own infra

Alex: Not just security, are we really cloud native at this point? Not a lot of SREs are familiar with managing AWS/GKE/whatever. There's a cost to go to a cloud, but we need to determine which level and who pays for those costs. Even if we adopt gke, it's not the same thing as doing that internally, so there's a cost associated with it. We need to balance all those things.

Mark: we should evaluate that option.

Tyler: this goes back to the "what are the requirements if we move to a third-party?" question. Let's limit the discussion to "hosted k8s": what do we need for us to be able to say "yes, we can do that"? We're not discussing "could we use github?"; that's another conversation.

Lars: One thing that would help me think about this is a written threat model that is shared and that people could reason about. I can start working on something, but I'm not an SRE, so either SRE could make one or work with me to make one.

Dan: from comments made earlier: 1) re maintenance (mark's point) for CI in general and currently. Valid concern, but the thing is we're going to have maintenance of something in some way unless we go full hosted (code review, ci, cd). Jenkins is deteriorating quickly, this proposal was moving us forward, so shooting it down on those grounds doesn't give much recourse/response options, other than fully hosted solutions. 2) we're short staffed on k8s experts in house; that's confusing because for the last 3 years we've focused on Dep pipeline, which has k8s at its heart. Does that preclude us from running something on an internal k8s?

Mark: the first question, I don't think we need to start from scratch, we just need to take a step back and clarify assumptions in the proposal. Perhaps, hopefully, we'll be able to gain resources for more pipeline work. This depends on whether or not we could save time. There are security concerns. Argo needs additional network solutions so there's reason to look at IaaS solutions. Re second, are we investing in the Dep Pipeline: yes, we are, we didn't get our head count requests this year, sadly. Let's focus on the things we can't have others do. Mark would very much like to focus on the deployment pipeline.

Dan: yes, thanks. The response we've gotten from the proposal hasn't been encouraging to move forward. It'd be good to figure out whether the proposal is a good starting point. When we bring up the possibility of completely third-party hosted stuff, or why aren't we using gitlab (well, we did evaluate that), it makes the reception to the proposal unclear, or makes us feel it's not worth pursuing [paraphrased badly].

Joe: more context, whatever we're going to build -- if it's a k8s cluster -- this is going to have different needs than the k8s in production. e.g., Argo may change k8s versions or we may change k8s versions; i.e., needs may diverge between the two k8s. We need to think of networks for isolating workloads, as this will be a more untrusted environment. CI systems only work well if they are flexible; ipso facto, this creates security concerns, as it creates a larger surface to secure. Re second part: it's not clear how CI is separate from CDep/CDel, it needs to be evaluated wrt eg security. I'm not 100% sure I got what the UI we would offer people would be in either of these systems.

Greg: how nitty gritty of an implementation detail is this?

Joe: I disagree

Tyler: UI can be a security concern and that's on-topic for the discussion

Lars: when the blog posts re CI went out, I got lots of private emails, all making contradictory demands (MUST github, MUST NOT github, etc etc). Getting feedback on an open question like "what should we use" is not useful.

Mark: I agree, by opening it up completely we may end up in an endless loop. I have been reading docs, and I did not see how the final decision was made between the final 3 things: zuulv3, gitlab, and argoCI. I could not see how that decision was made internally

Joe: going back to point re design doc: it's not clear/not specific re UI, it ties to the point that Tyler made re security concerns of the UI. If I don't know what the UI will do and what kind of access it requires, I don't know what recommendations to give regarding infra choices.

Tyler: what questions need to be answered so we can get started with something?

Mark: I like the proposal from Lars re a threat model, develop that together. One valid one was raised that if we go IaaS knowledge of working with public clouds is light in this org, so that's a consideration.

Effie: Think about the time we might need to get people acquainted with cloud stuff; we could use that time to get people acquainted with k8s, or CI similarly.

Joe: the idea was always to use... a managed k8s solution with the full support of the... [sorry, metal detector bday present interrupted note taking]

Effie: in gke the operators are responsible for upgrading

Alex: in GKE to upgrade you build a new cluster and migrate to it from the old one

Mark: biggest win would be if we could buy a service off the shelf. One option was gitlab. Could we use it with the current gerrit or not?

Lars: on whether we should spend time learning services rather than k8s: if you're going to use k8s at all, the whole tech dept will have to get comfortable with it in some way or form.

Tyler: Is getting k8s training for RelEng in some formal way an option, maybe some SRE/RelEng pairing there? Re whole hosted gitlab, I don't think our criteria would have to be adjusted too much to evaluate it in the next quarter, unless there are strong/meaningful objections. It would potentially surface some unanswered questions.

Dan: re gitlab, whether we looked into it: a hosted option necessitates us moving our code hosting/review. We can't integrate hosted gitlab CI with Gerrit. If we tried to just integrate self-hosted gitlab-ci (only ci part), the integration isn't clean. It's a conversation that will open a much broader/bigger can of worms re code hosting.

Effie: If we go gitlab.com as paying customers (with our "Wikipedia uses gitlab"), isn't integration with Gerrit an ask we could make of them?

Greg: could be a hard thing to negotiate

Mark: could be part of the contract we have with them ("build this then we can use it")

Lars: the eval of gitlab was based on the assumption we stay with Gerrit, I spec'd out the bits needed to do CI on gitlab. The gitlab API provides everything we need. So hosted gitlab-ci shouldn't be a major blocker. Most of the work needed to do the POC took about a day, but it's undoubtedly missing things.

Effie: ... if we get a hosted gitlab we could consider running gitlab on our own infra with enterprise support.

Mark: that would not alleviate the security/upgrade concerns, but it's something we could consider

Greg: summarize: doing a list of requirements/threat model for different levels: self-hosted, IaaS, SaaS

  • TODO: Lars will take lead on the threat model and the initial list of levels and requirements. Joe/Alex as first points of contact/review for this work.
  • TODO: eval what externally hosted solutions are available? Refine list from https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/CI_Futures_WG/Candidates
  • TODO: expose how the last cut happened: from 3 solutions (zuulv3, Argo, GitLab) down to 1 (Argo)
  • TODO: Greg and/or Tyler to email group next steps and timelines for completion/next meeting.