Wikimedia Release Engineering Team/MW-in-Containers thoughts
This page is currently a draft.
What we want
- Objective: MediaWiki* is automatically packaged into (one or more) OCI container images**, which are semi-automatically deployed*** into Wikimedia production as Kubernetes pods (the starred terms are expanded in the quoted sections below)
"MediaWiki"
- All of what we consider the current appserver layer running on mw\d\d\d\d servers (plus the Parsoid sub-cluster), as currently deployed through scap.
- Included:
- the mediawiki/* repos, including the code running on the appservers and the general and specialist jobrunners
- the core site configuration in operations/mediawiki-config's CommonSettings.php and related files like static assets
- the appserver layer of Apache and PHP-FPM and related code (like etcd clients) and APCu
- specialist code like ghostscript (for PDF rendering), lilypond (for sheet-music LaTeX rendering) and ffmpeg (for video transcoding and scaling)
- built artefacts that change based on the code but are invariant based on request; currently the l10n CDB/PHP files, perhaps also base ResourceLoader bundles?
- the
- Excluded:
- the persistent data-heavy layers (primary and replica MySQL instances, external store MySQL instances, logstash, kafka, Swift, …)
- the caching layers, which vary on content more than code (memcached, Varnish, and Apache Traffic Server)
- the independent services running in their own k8s pools (mathoid, page content service, kask, …) or independently on bare metal (ElasticSearch, Kafka transit, Thumbor, …)
- site-variant configuration, as currently specified in operations/mediawiki-config's InitialiseSettings.php
- Included:
- TBDs:
- What is missing from or wrong in this list?
"automatically packaged into (one or more) OCI container images"
- When triggered by base image updates or code being merged, an automated process selects the correct versions of each of the components and assembles them into a warmed-up container image or set of images that is tested, verified, and considered ready for production
- TBDs:
- Do we run Apache in the same container as PHP-FPM, or isolated from it? Do jobs run in the same container or a different one? Etc.
- How do we handle local security patches in a way that applies reliably? And how do we avoid disclosing their existence / contents?
- How do we handle upstream-disclosed security patches in a way that applies reliably? And how do we avoid disclosing their existence / contents?
- How do we debounce/group changes to build? Currently the scap init i18n build and sync process alone takes ~45 minutes, and we land roughly 600 patches a week into production repos (i.e. one per ~17 minutes); see the debouncing sketch at the end of this section.
- Initial starting point: build automatically every 24 hours (perhaps at 04:00 UTC, the current global commit trough) [chat]
- How can we make this build as fast as possible (re-using layers?) without loss of generality (MediaWiki is monolithic, so a change in one extension could break all the others)?
- How do we audit changes if the built artefacts are private (due to local and upstream mitigations and pre-release security patches)?
- How do we apply emergency instant fixes (other than the static site config)? Do we need that still?
- How do we change our current build & test pipeline such that we have enough confidence in its results to deploy code?
- Are we still just building from master, or do we want to move to manual feature-branch-based picks?
- TODOs:
- Provide a mechanism to build and inject the site-variant configuration in a static form (see T223602)
- Provide a mechanism to warm up the APCu cache ahead of an image going live (replay the last N production GET requests or similar? see the warm-up sketch below)
- Provide a mechanism to pre-build the common ResourceLoader requests and inject them into the Varnish caches.
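A possible shape for the debouncing question above, as a minimal sketch: collect merge events and only trigger an image build once merges have gone quiet for a while, or once the oldest unbuilt change has waited too long, so that a burst of patches yields one build rather than many. The quiet-period and maximum-wait values, and the on_merge / trigger_build hooks, are illustrative assumptions rather than decisions.

<syntaxhighlight lang="python">
import time
import threading

# Illustrative values only: the real cadence (e.g. a fixed daily build at
# 04:00 UTC) is still an open question in the list above.
QUIET_PERIOD = 30 * 60    # seconds with no new merges before we build
MAX_WAIT = 24 * 60 * 60   # never let merged code wait longer than this


class BuildDebouncer:
    """Group merged changes into batched image builds."""

    def __init__(self, trigger_build):
        self.trigger_build = trigger_build  # callable(list_of_changes) that starts a build
        self.lock = threading.Lock()
        self.pending = []                   # changes merged since the last build
        self.first_pending = None           # time of the oldest unbuilt merge
        self.last_merge = None              # time of the newest merge

    def on_merge(self, change_id):
        # Called by whatever watches the code review system for merged changes
        # (a hypothetical integration point, not an existing hook).
        with self.lock:
            now = time.time()
            if not self.pending:
                self.first_pending = now
            self.pending.append(change_id)
            self.last_merge = now

    def maybe_build(self):
        # Called periodically (say, once a minute) by a scheduler.
        with self.lock:
            if not self.pending:
                return False
            now = time.time()
            quiet = now - self.last_merge >= QUIET_PERIOD
            overdue = now - self.first_pending >= MAX_WAIT
            if not (quiet or overdue):
                return False
            batch, self.pending = self.pending, []
        self.trigger_build(batch)
        return True
</syntaxhighlight>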
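And a minimal sketch of the APCu warm-up TODO, assuming the candidate image is reachable over plain HTTP at some staging address and that a dump of recent production GET paths is available as a text file (both names below are hypothetical): replay the requests against the new pods so APCu and other in-process caches are populated before the image takes live traffic.

<syntaxhighlight lang="python">
import urllib.request

# Hypothetical inputs: where the candidate pods answer, and a dump of
# recent production GET paths (one per line, e.g. "/wiki/Foo?action=raw").
STAGING_BASE = "http://mw-canary.example.invalid"
REQUEST_LOG = "recent-get-requests.txt"


def warm_up(base_url, log_path, limit=10000):
    """Replay up to `limit` recorded GET requests against a candidate image."""
    hits = errors = 0
    with open(log_path) as log:
        for i, line in enumerate(log):
            if i >= limit:
                break
            path = line.strip()
            if not path.startswith("/"):
                continue
            try:
                # We only care about the side effect of filling the caches,
                # so the response body is read and discarded.
                with urllib.request.urlopen(base_url + path, timeout=10) as resp:
                    resp.read()
                hits += 1
            except OSError:
                errors += 1
    return hits, errors


if __name__ == "__main__":
    ok, failed = warm_up(STAGING_BASE, REQUEST_LOG)
    print(f"warmed with {ok} requests, {failed} failures")
</syntaxhighlight>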
"semi-automatically deployed"
- On some trigger, the k8s production deployment state is updated to add the new version at a low percentage of traffic, which is slowly scaled out to answer user requests until it is the only version running, or is rolled back and removed if things go wrong (see the roll-out sketch at the end of this section).
- TBDs:
- Will we have a staging environment for manual final verification before deployment?
- How do we trigger deployments? Just manually? Automatic every hour during "business hours"?
- Who can trigger deployments? All current deployers? All mergers (no)? Just SRE Service Ops and RelEng? Etc.
- How do we tell the controller (?) which deployment state to route a given request to?
- How does usage of the new version scale out? By wiki, in risk-led phases (like the current train)? By use case?
- What metrics are we going to use to judge the pool scale-out?
- Reliability: Logstash is quite noisy (and misses some things)?
- Performance: ?!?
- Features: ?!?
- Do we plan generally to run one (and at most two) flavours at once, or would we ever run more than that?
- How long a window of "old" pods would we keep around to roll back to?
- How do we handle content-variant cache purges in a way that scales? Can we make this less manual and more reliable?
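A very rough sketch of the scale-out / roll-back loop described at the top of this section. Everything in it is assumed for illustration: the ramp steps, the soak time, the error-rate gate, and the set_traffic_weight / error_rate helpers, which stand in for whatever the real deployment controller and metrics source turn out to be.

<syntaxhighlight lang="python">
import time

# Illustrative ramp: percentage of traffic sent to the new version at each step.
RAMP_STEPS = [1, 5, 25, 50, 100]
ERROR_RATE_THRESHOLD = 0.01   # assumed gate: abort if >1% of sampled requests fail
SOAK_SECONDS = 15 * 60        # how long to watch each step before widening it


def roll_out(new_version, set_traffic_weight, error_rate):
    """Gradually shift traffic to `new_version`, rolling back on bad metrics.

    `set_traffic_weight(version, percent)` and `error_rate(version)` are
    hypothetical hooks onto the deployment controller and the metrics
    pipeline; they are not existing APIs.
    """
    for percent in RAMP_STEPS:
        set_traffic_weight(new_version, percent)
        time.sleep(SOAK_SECONDS)
        rate = error_rate(new_version)
        if rate > ERROR_RATE_THRESHOLD:
            # Something looks wrong: send all traffic back to the old version.
            set_traffic_weight(new_version, 0)
            return False
        print(f"{new_version}: healthy at {percent}% (error rate {rate:.4f})")
    return True
</syntaxhighlight>

The open questions above (which metrics, how long each soak, who triggers the ramp) all live in the constants and hooks of a sketch like this; the loop itself is the easy part.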
What changes
- Deployments will now be:
- atomic;
- more isolated; and
- scaled out and rolled back without manual intervention
- …
How we get there
- …