Wikimedia Release Engineering Team/MW-in-Containers thoughts
This page is currently a draft.
What we want
- Objective: MediaWiki* is automatically packaged into (one or more) OCI container images**, which are semi-automatically deployed*** into Wikimedia production as Kubernetes pods (the starred terms are expanded in the quoted sections below)
"MediaWiki"
- All of what we consider the current appserver layer running on mw\d\d\d\d servers (plus the Parsoid sub-cluster), as currently deployed through scap.
- Included:
- the mediawiki/* repos, including the code running on the appservers and the general and specialist jobrunners
- the core site configuration in operations/mediawiki-config's CommonSettings.php and related files like static assets
- the appserver layer of Apache and PHP-FPM and related code (like etcd clients) and APCu
- specialist code like ghostscript (for PDF rendering), lilypond (for sheet-music LaTeX rendering) and ffmpeg (for video transcoding and scaling)
- built artefacts that change based on the code but are invariant based on request; currently the l10n CDB/PHP files, perhaps also base ResourceLoader bundles?
- the
- Excluded:
- the persistent data-heavy layers (primary and replica MySQL instances, external store MySQL instances, logstash, kafka, Swift, …)
- the caching layers, which vary on content more than code (memcached, Varnish, and Apache Traffic Server)
- the independent services running in their own k8s pools (mathoid, page content service, kask, …) or independently on bare metal (ElasticSearch, Kafka transit, Thumbor, …)
- site-variant configuration, as currently specified in operations/mediawiki-config's InitialiseSettings.php
- Included:
- TBDs:
- What is missing from or wrong in this list?
"automatically packaged into (one or more) OCI container images"
- When triggered by base image updates or code being merged, an automated process selects the correct versions of each of the components and assembles them into a warmed-up container image or set of images that is tested, verified, and considered ready for production
- TBDs:
- Do we run Apache in the same container as PHP-FPM, or isolated from it? Do jobs run in the same container or a different one? Etc.
- How do we handle local security patches in a way that applies reliably? And how do we avoid disclosing their existence / contents?
- How do we handle upstream-disclosed security patches in a way that applies reliably? And how do we avoid disclosing their existence / contents?
- How do we debounce/group changes to build? Currently the scap init i18n build and sync process alone takes ~45 minutes, and we land roughly 600 patches a week into production repos (i.e. one per ~17 minutes); see the debouncing sketch at the end of this section.
- Initial starting point: build automatically every 24 hours (perhaps at 04:00 UTC, the current global commit trough) [chat]
- How can we make this build as fast as possible (re-using layers?) without loss of generality (MediaWiki is monolithic, so a change in one extension could break all the others)?
- How do we audit changes if the built artefacts are private (due to local and upstream mitigations and pre-release security patches)?
- How do we apply emergency instant fixes (other than the static site config)? Do we need that still?
- How do we change our current build & test pipeline such that we have enough confidence in its results to deploy code?
- Are we still just building from master, or do we want to move to manual feature-branch-based picks?
- TODOs:
- Provide a mechanism to build and inject the site-variant configuration in a static form (see T223602)
- Provide a mechanism to warm up the APCu cache ahead of an image going live (replay the last N production GET requests or similar? see the warm-up sketch below)
- Provide a mechanism to pre-build the common ResourceLoader requests and inject them into the Varnish caches.
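A possible shape for the debouncing question above, as a minimal sketch: collect merge events and only trigger an image build once merges have gone quiet for a while, or once the oldest unbuilt change has waited too long, so that a burst of patches yields one build rather than many. The quiet-period and maximum-wait values, and the on_merge / trigger_build hooks, are illustrative assumptions rather than decisions.

<syntaxhighlight lang="python">
import time
import threading

# Illustrative values only: the real cadence (e.g. a fixed daily build at
# 04:00 UTC) is still an open question in the list above.
QUIET_PERIOD = 30 * 60    # seconds with no new merges before we build
MAX_WAIT = 24 * 60 * 60   # never let merged code wait longer than this


class BuildDebouncer:
    """Group merged changes into batched image builds."""

    def __init__(self, trigger_build):
        self.trigger_build = trigger_build  # callable(list_of_changes) that starts a build
        self.lock = threading.Lock()
        self.pending = []                   # changes merged since the last build
        self.first_pending = None           # time of the oldest unbuilt merge
        self.last_merge = None              # time of the newest merge

    def on_merge(self, change_id):
        # Called by whatever watches the code review system for merged changes
        # (a hypothetical integration point, not an existing hook).
        with self.lock:
            now = time.time()
            if not self.pending:
                self.first_pending = now
            self.pending.append(change_id)
            self.last_merge = now

    def maybe_build(self):
        # Called periodically (say, once a minute) by a scheduler.
        with self.lock:
            if not self.pending:
                return False
            now = time.time()
            quiet = now - self.last_merge >= QUIET_PERIOD
            overdue = now - self.first_pending >= MAX_WAIT
            if not (quiet or overdue):
                return False
            batch, self.pending = self.pending, []
        self.trigger_build(batch)
        return True
</syntaxhighlight>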
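And a minimal sketch of the APCu warm-up TODO, assuming the candidate image is reachable over plain HTTP at some staging address and that a dump of recent production GET paths is available as a text file (both names below are hypothetical): replay the requests against the new pods so APCu and other in-process caches are populated before the image takes live traffic.

<syntaxhighlight lang="python">
import urllib.request

# Hypothetical inputs: where the candidate pods answer, and a dump of
# recent production GET paths (one per line, e.g. "/wiki/Foo?action=raw").
STAGING_BASE = "http://mw-canary.example.invalid"
REQUEST_LOG = "recent-get-requests.txt"


def warm_up(base_url, log_path, limit=10000):
    """Replay up to `limit` recorded GET requests against a candidate image."""
    hits = errors = 0
    with open(log_path) as log:
        for i, line in enumerate(log):
            if i >= limit:
                break
            path = line.strip()
            if not path.startswith("/"):
                continue
            try:
                # We only care about the side effect of filling the caches,
                # so the response body is read and discarded.
                with urllib.request.urlopen(base_url + path, timeout=10) as resp:
                    resp.read()
                hits += 1
            except OSError:
                errors += 1
    return hits, errors


if __name__ == "__main__":
    ok, failed = warm_up(STAGING_BASE, REQUEST_LOG)
    print(f"warmed with {ok} requests, {failed} failures")
</syntaxhighlight>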
"semi-automatically deployed"
- On some trigger, the k8s production deployment state is updated to add the new version at a low percentage of traffic, which is slowly scaled out to answer user requests until it is the only version running, or is rolled back and removed if things go wrong (see the roll-out sketch at the end of this section).
- TBDs:
- Will we have a staging environment for manual final verification before deployment?
- How do we trigger deployments? Just manually? Automatic every hour during "business hours"?
- Who can trigger deployments? All current deployers? All mergers (no)? Just SRE Service Ops and RelEng? Etc.
- How do we tell the controller (?) which deployment state to route a given request to?
- How does usage of the new version scale out? By wiki, in risk-led phases (like the current train)? By use case?
- What metrics are we going to use to judge the pool scale-out?
- Reliability: Logstash is quite noisy (and misses some things)?
- Performance: ?!?
- Features: ?!?
- Do we plan generally to run one (and at most two) flavours at once, or would we ever run more than that?
- How long a window of "old" pods would we keep around to roll back to?
- How do we handle content-variant cache purges in a way that scales? Can we make this less manual and more reliable?
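A very rough sketch of the scale-out / roll-back loop described at the top of this section. Everything in it is assumed for illustration: the ramp steps, the soak time, the error-rate gate, and the set_traffic_weight / error_rate helpers, which stand in for whatever the real deployment controller and metrics source turn out to be.

<syntaxhighlight lang="python">
import time

# Illustrative ramp: percentage of traffic sent to the new version at each step.
RAMP_STEPS = [1, 5, 25, 50, 100]
ERROR_RATE_THRESHOLD = 0.01   # assumed gate: abort if >1% of sampled requests fail
SOAK_SECONDS = 15 * 60        # how long to watch each step before widening it


def roll_out(new_version, set_traffic_weight, error_rate):
    """Gradually shift traffic to `new_version`, rolling back on bad metrics.

    `set_traffic_weight(version, percent)` and `error_rate(version)` are
    hypothetical hooks onto the deployment controller and the metrics
    pipeline; they are not existing APIs.
    """
    for percent in RAMP_STEPS:
        set_traffic_weight(new_version, percent)
        time.sleep(SOAK_SECONDS)
        rate = error_rate(new_version)
        if rate > ERROR_RATE_THRESHOLD:
            # Something looks wrong: send all traffic back to the old version.
            set_traffic_weight(new_version, 0)
            return False
        print(f"{new_version}: healthy at {percent}% (error rate {rate:.4f})")
    return True
</syntaxhighlight>

The open questions above (which metrics, how long each soak, who triggers the ramp) all live in the constants and hooks of a sketch like this; the loop itself is the easy part.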
What changes
- Deployments will now be:
- atomic;
- more isolated; and
- scaled out and rolled back without manual intervention
- …
How we get there
- …