User talk:LarsWirzenius/NewCI/threats

Roles, environments and isolation

3 comments • 16:11, 24 January 2020 4 years ago


GLavagetto (WMF)

I think we need to rework the definition of roles.

As far as the security context is concerned, we have three (maybe two) roles:

The Developer, or anyone who can submit a patchset for CI consumption.
The Deployer, or anyone who can merge a patch and deploy the resulting artifact to production. Potentially, they should be able to submit a job manually.
The Administrator, or anyone who can, in addition to the Deployer rights, also modify the setup of a job.

Any job generated by actions from the first role must only run in an "untrusted" environment. Jobs generated by actions of people in the remaining roles may run in a trusted environment.

This brings me to the environments definitions.

I think we need to simplify this to two major areas, each of which can be further divided logically but shares the same level of isolation internally:

Trusted environment: anything that is built here is considered safe to run in production. Jobs that run in this environment (builds, tests, artifact upload) must only originate from actions of Deployers or Administrators. Artifacts built here can be used in trusted environments like the Kubernetes staging cluster or production.
Untrusted environment: anything that is built or run here is considered unsafe for production, and should be properly isolated (see below) from the trusted environment. Jobs that run in this environment can originate from any Developer. Artifacts built here cannot be used in the trusted environment, but can be used for testing within the untrusted environment or in other environments that are sufficiently isolated from production (e.g. deployment-prep).
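
To make the role and environment rules above concrete, here is a minimal Python sketch. It is only an illustration, not part of the proposal: all names (`Role`, `Environment`, `allowed_environments`, `artifact_usable_in`) are hypothetical. It also assumes two things the text does not state explicitly: that Deployers and Administrators may trigger untrusted jobs as well, and that artifacts built in the trusted environment may be used anywhere.

```python
from enum import Enum


class Role(Enum):
    DEVELOPER = "developer"
    DEPLOYER = "deployer"
    ADMINISTRATOR = "administrator"


class Environment(Enum):
    TRUSTED = "trusted"
    UNTRUSTED = "untrusted"


# Roles whose actions may produce jobs in the trusted environment.
TRUSTED_ROLES = {Role.DEPLOYER, Role.ADMINISTRATOR}


def allowed_environments(role):
    """Return the environments in which a job triggered by `role` may run."""
    if role in TRUSTED_ROLES:
        # Assumption: trusted roles can still trigger untrusted jobs.
        return {Environment.TRUSTED, Environment.UNTRUSTED}
    return {Environment.UNTRUSTED}


def artifact_usable_in(built_in, target):
    """Return True if an artifact built in `built_in` may be used in `target`."""
    if built_in is Environment.UNTRUSTED:
        # Untrusted artifacts never cross into the trusted environment.
        return target is Environment.UNTRUSTED
    # Assumption: trusted artifacts may be used in either environment.
    return True
```

For example, `allowed_environments(Role.DEVELOPER)` contains only `Environment.UNTRUSTED`, and `artifact_usable_in(Environment.UNTRUSTED, Environment.TRUSTED)` is `False`.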

And finally, we get to isolation, which is the hardest question. What level of isolation do we want to achieve between the trusted and untrusted environments?

I think we need to ensure that nothing that runs in the untrusted environment can influence things running in the trusted environment. Thus we should ensure that:

Trusted and untrusted jobs never run on the same host (physical or VM)
Hosts and jobs in the untrusted environment cannot connect to the trusted environments outside of a predefined whitelist of connections (as an example: connecting to the puppetmaster to get configuration).
Hosts and jobs in the untrusted environment cannot accept connections from any external source, with the exception of a read-only user interface for the CI system itself, and other connections in a whitelist.
No confidential or secret information is accessible from the untrusted environment.
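
The first three isolation rules above could be expressed as checks along the following lines. This is a rough sketch under stated assumptions, not an actual configuration: the service names in the two whitelists (`puppetmaster`, `ci-ui-readonly`) are placeholders, the `Connection` type and function names are hypothetical, and "external source" is read here as anything outside the untrusted environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    source_env: str    # "trusted", "untrusted" or "external"
    dest_env: str      # "trusted" or "untrusted"
    dest_service: str  # name of the service being contacted


# Placeholder whitelists; the real entries would have to be agreed on.
UNTRUSTED_TO_TRUSTED = {"puppetmaster"}
INBOUND_TO_UNTRUSTED = {"ci-ui-readonly"}


def connection_allowed(conn):
    """Apply the two connection rules to a single connection."""
    if conn.source_env == "untrusted" and conn.dest_env == "trusted":
        # Outbound from untrusted to trusted: whitelist only.
        return conn.dest_service in UNTRUSTED_TO_TRUSTED
    if conn.dest_env == "untrusted" and conn.source_env != "untrusted":
        # Inbound to untrusted from anywhere else: whitelist only.
        return conn.dest_service in INBOUND_TO_UNTRUSTED
    # Other flows are outside the scope of this sketch.
    return True


def hosts_are_segregated(jobs_by_host):
    """Check that no host (physical or VM) runs jobs of both environments.

    `jobs_by_host` maps a host name to the environments of its jobs.
    """
    return all(len(set(envs)) <= 1 for envs in jobs_by_host.values())
```

With these definitions, `connection_allowed(Connection("untrusted", "trusted", "puppetmaster"))` passes, while the same connection to any service not in the whitelist is rejected.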

I will upload a proposal for an updated diagram soon that should make some of these points clearer.

10:33, 22 January 2020

GLavagetto (WMF)

The diagram for what I described above is available at people.wikimedia.org

14:59, 22 January 2020

AKosiaris (WMF)

> I think we need to ensure that nothing that runs in the untrusted environment can influence things running in the trusted environment

And, as far as possible, the rest of production. That is, the untrusted environment should be as unable as possible to reach either the trusted environment or production. And, following the principle of least privilege, the trusted environment's access should probably also be granted only on an as-needed basis?

16:11, 24 January 2020


A couple of points

One comment • 11:00, 17 December 2019 5 years ago


AKosiaris (WMF)

I have a couple of points to make. Most stem from the diagram but relate to the threat model.

There are no fine-grained distinctions for developers (e.g. volunteer vs trusted volunteer/staff vs repo owner). Each of those should have different rights and be able to do different things. For example, we certainly don't want every single change to be able to create a test environment.
Tangential to the above, both code fetched from the internet and code uploaded to Gerrit are inherently unsafe. We should be very clear about communicating that, because otherwise people might make unsafe assumptions. For example, it means that an LGTM in Gerrit isn't enough. It might very well be that some 3rd party repository is compromised and we end up with cryptominers (in either CI or production).
There will need to be a way for code to enter that cycle in an embargoed way. Whether that is Gerrit private repos or some way to inject code directly into production, bypassing the entire CI/CD, is debatable, but we definitely need something. Depending on how we do it we might have to make artifacts non-public, which in principle is already a leak.
It's quite possible that we need the CI web UI and Gerrit to interact somehow, either directly or via an agent/bot/something that provides useful information and links to developers. This adds attack vectors, of course, as that agent/bot/something could be compromised.
The distinction between the persistent and the temporary object store is unclear. I get the feeling that they are distinct just because of attack vectors on the temporary one, but it's unclear what those are. We should add them.
It's unclear whether the deployment node is automated or user accessible/triggered. Depending on the answer, it might mean different attack vectors and perhaps a need to split it into multiple deployment nodes. For example, if the deployment node that creates the test environments is user accessible, it could be compromised, but that would have no adverse effects on production. Of course, in that paradigm, the promotion should be done by a different deployment node.
The list is nice as a start, but it should probably be unified a bit. For example, the following are currently essentially the same thing:
- Elevate privilege by impersonating SRE/admin on Gerrit, over ssh.
- Elevate privilege by impersonating SRE/admin on Gerrit, over HTTP.
We need severities. I can start by saying that most denial-of-service attacks are of low severity (excluding the production node capacity one).
I find the labels on the arrows not very explanatory. Perhaps action verbs would be more useful?
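
As a rough illustration of the two list-related points above (unifying near-duplicate threats and attaching severities), here is a small Python sketch. The `Threat` and `Severity` types and the `unify` function are hypothetical, and the severity values used in the example are placeholders, not an assessment made in this thread.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Threat:
    action: str      # what the attacker does
    target: str      # the component being attacked
    transport: str   # e.g. "ssh" or "HTTP"
    severity: Severity


def unify(threats):
    """Merge threats that differ only by transport.

    Returns a dict keyed by (action, target); each entry lists the
    transports seen and keeps the highest severity among the merged items.
    """
    merged = {}
    for threat in threats:
        key = (threat.action, threat.target)
        entry = merged.setdefault(
            key, {"transports": [], "severity": threat.severity}
        )
        entry["transports"].append(threat.transport)
        if threat.severity.value > entry["severity"].value:
            entry["severity"] = threat.severity
    return merged


# The two Gerrit entries from the list; HIGH is a placeholder severity.
threats = [
    Threat("Elevate privilege by impersonating SRE/admin", "Gerrit", "ssh", Severity.HIGH),
    Threat("Elevate privilege by impersonating SRE/admin", "Gerrit", "HTTP", Severity.HIGH),
]
unified = unify(threats)
```

Here `unified` holds a single entry for the Gerrit impersonation threat, with both transports recorded under it.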

11:00, 17 December 2019
