Wikimedia Release Engineering Team/MW-in-Containers thoughts

Wikimedia production Web request flow diagram as of August 2022

What we want

  • Objective: MediaWiki* is automatically packaged into (one or more) OCI container images**, which are semi-automatically deployed*** into Wikimedia production as kubernetes pods

"MediaWiki"

  • All of what we consider the current appserver layer running on mw\d\d\d\d servers (plus the Parsoid sub-cluster), as currently deployed through scap; a sketch of this in/out split as a build manifest follows this list.
    • Included:
      • the mediawiki/* repos, including the code running on the appservers and the general and specialist jobrunners
      • the core site configuration in operations/mediawiki-config's CommonSettings.php and related files like static assets
      • the appserver layer of Apache and PHP-FPM and related code (like etcd clients) and APCu
      • specialist code like ghostscript (for PDF rendering), lilypond (for sheet music rendering) and ffmpeg (for video transcoding and scaling)
      • built artefacts that change with the code but are invariant across requests; currently the l10n CDB/PHP files, and perhaps also base ResourceLoader bundles?
    • Excluded:
      • the persistent data-heavy layers (primary and replica MySQL instances, external store MySQL instances, logstash, kafka, Swift, …)
      • the caching layers, which vary on content more than code (memcached, Varnish, and Apache Traffic Server)
      • the independent services running in their own k8s pools (mathoid, page content service, kask, …) or independently on bare metal (ElasticSearch, Kafka transit, Thumbor, …)
      • site-variant configuration, as currently specified in operations/mediawiki-config's InitialiseSettings.php
  • TBDs:
    • What is missing from or wrong in this list?
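
As a purely illustrative reading of the split above, here is a minimal sketch that expresses it as a build manifest a packaging pipeline could validate. Every name in it is a hypothetical placeholder, not an agreed component list.

```python
# Hypothetical manifest of what is baked into the image versus what stays outside it.
IMAGE_MANIFEST = {
    "included": [
        "mediawiki/core", "mediawiki/extensions/*", "mediawiki/skins/*",
        "operations/mediawiki-config:CommonSettings.php",
        "apache2", "php-fpm", "php-apcu",
        "ghostscript", "lilypond", "ffmpeg",
        "l10n-cache",  # built artefacts that are invariant across requests
    ],
    "excluded": [
        "mysql", "external-store", "logstash", "kafka", "swift",
        "memcached", "varnish", "trafficserver",
        "operations/mediawiki-config:InitialiseSettings.php",  # site-variant, injected at runtime
    ],
}

def find_leaks(installed: set[str]) -> list[str]:
    """Flag anything from the excluded side that accidentally ended up in the image."""
    return [name for name in IMAGE_MANIFEST["excluded"] if name in installed]
```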

"automatically packaged into (one or more) OCI container images"

  • When triggered by base image updates or code being merged, an automated process selects the correct versions of each of the components and assembles them into a warmed-up container image or set of images that is tested, verified, and considered ready for production
  • TBDs:
    • Do we run Apache in the same container as PHP-FPM or isolated from it? Do jobs run in the same container or a different one? Etc.
    • How do we handle local security patches in a way that applies reliably? And how do we avoid disclosing their existence / contents?
    • How do we handle upstream-disclosed security patches in a way that applies reliably? And how do we avoid disclosing their existence / contents?
    • How do we debounce/group changes to build? Currently the scap init i18n build and sync process alone takes ~45 minutes and we land roughly 600 patches a week into production repos (aka one per ~17 minutes)
      • Initial starting point: build automatically every 24 hours (perhaps at 04:00 UTC, the current global commit trough) [chat]; see the build-trigger sketch after this list
    • How can we make this build as fast as possible (re-using layers?) without loss of generality (MediaWiki is monolithic, so a change in one extension could break all the others)?
    • How do we audit changes if the built artefacts are private (due to local and upstream mitigations and pre-release security patches)?
    • How do we apply emergency instant fixes (other than the static site config)? Do we need that still?
    • How do we change our current build & test pipeline such that we have confidence in its results enough to deploy code?
    • Are we still just building from master, or do we want to move to manual feature-branch-based picks?
  • TODOs:
    • Provide a mechanism to build and inject the site-variant configuration in a static form (see T223602)
    • Provide a mechanism to warm up the APCu cache ahead of an image going live (replay the last N production GET requests or similar? see the warm-up sketch after this list)
    • Provide a mechanism to pre-build the common ResourceLoader requests and inject them into the Varnish caches.
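
To make the debounce question above more concrete, here is a minimal sketch of a batched build trigger, assuming a single daily build window at 04:00 UTC and a hypothetical merge webhook; none of the function names correspond to existing Wikimedia tooling.

```python
import datetime

PENDING: list[str] = []   # change IDs merged since the last image build
BUILD_HOUR_UTC = 4        # assumed daily window at the current global commit trough

def on_change_merged(change_id: str) -> None:
    """Hypothetical merge-webhook handler: record the change, do not build yet."""
    PENDING.append(change_id)

def maybe_build(now: datetime.datetime) -> None:
    """Run (for example) once an hour: fire at most one build covering all pending changes."""
    if now.hour == BUILD_HOUR_UTC and PENDING:
        batch = PENDING[:]
        PENDING.clear()
        start_image_build(batch)

def start_image_build(changes: list[str]) -> None:
    """Stand-in for the real (~45 minute) image build, including the i18n step."""
    print(f"building one image covering {len(changes)} merged changes")
```

This batches the roughly 600 weekly patches into one image per day instead of one per merge; an emergency fix would presumably still need a separate, manually triggered build.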
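
For the APCu warm-up TODO, here is a minimal sketch of the "replay the last N production GET requests" idea, assuming we can obtain a recent sample of GET paths and reach the unpooled pod directly; the addresses and the sampling helper are hypothetical.

```python
import urllib.request

def warm_up(pod_url: str, recent_get_paths: list[str], limit: int = 5000) -> int:
    """Replay up to `limit` recent GET paths against a not-yet-pooled pod; return the success count."""
    ok = 0
    for path in recent_get_paths[-limit:]:
        try:
            with urllib.request.urlopen(pod_url.rstrip("/") + path, timeout=5) as resp:
                ok += int(resp.status == 200)
        except OSError:
            pass  # best-effort: a failed replay just means a slightly colder cache
    return ok

# Hypothetical usage, before the pod is added to the pool:
# warm_up("http://10.64.0.1:8080", sample_recent_get_paths_from_webrequest_logs())
```

The same loop could also request the common ResourceLoader bundle URLs, which covers part of the last TODO above; whether those responses can then be injected into the Varnish caches is a separate question.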

"semi-automatically deployed"

  • On some trigger, the k8s production deployment state is updated to add the new version at a low percentage of traffic, which is slowly scaled out to answer user requests until it is the only version running, or is rolled back and removed if things go wrong (a rollout sketch follows this list).
  • TBDs:
    • Will we have a staging environment for manual final verification before deployment?
    • How do we trigger deployments? Just manually? Automatic every hour during "business hours"?
    • Who can trigger deployments? All current deployers? All mergers (no)? Just SRE Service Ops and RelEng? Etc.
    • How do we tell the controller (?) which deployment state to route a given request to?
    • How does the rollout scale out? By wiki, in risk-led phases (like the current train)? By use case?
    • What metrics are we going to use to judge the pool scale-out?
      • Reliability: Logstash is quite noisy (and misses some things)?
      • Performance: ?!?
      • Features: ?!?
    • Do we plan to generally run one (at most two) flavours at once, or would we ever run more than that?
    • How long a window of "old" pods would we keep around to roll back to?
    • How do we handle content-variant cache purges in a way that scales? Can we make this less manual and more reliable?
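
A minimal sketch of the scale-out/rollback loop being discussed, assuming a traffic-weighting control (service mesh or ingress) and an error-rate metric both exist; the two stand-in functions, the step sizes, the error budget, and the soak time are all placeholders rather than proposals.

```python
import time

STEPS = [1, 5, 25, 50, 100]   # % of traffic on the new release at each step (placeholder)
ERROR_BUDGET = 0.002          # assumed acceptable error-rate threshold (placeholder)
SOAK_SECONDS = 600            # how long to watch each step before widening (placeholder)

def set_traffic_weight(release: str, percent: int) -> None:
    """Stand-in for whatever control-plane (mesh/ingress) call routes traffic by release."""
    print(f"routing {percent}% of traffic to {release}")

def error_rate(release: str) -> float:
    """Stand-in for a metrics query (Logstash, Prometheus, or whatever we settle on)."""
    return 0.0

def progressive_rollout(new_release: str, old_release: str) -> bool:
    """Widen traffic to new_release step by step; roll back to old_release on regression."""
    for percent in STEPS:
        set_traffic_weight(new_release, percent)
        time.sleep(SOAK_SECONDS)
        if error_rate(new_release) > ERROR_BUDGET:
            set_traffic_weight(new_release, 0)  # all traffic back to the old release
            return False                        # the old pods were kept around for exactly this
    return True
```

The open questions above (who triggers, which metrics to trust, how long old pods are kept) are exactly the parameters of this loop.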

What changes

  • Deployments will now be:
    • atomic;
    • more isolated; and
    • scaled out and rolled back without manual intervention

How we get there
