Wikimedia Release Engineering Team/Pretrain

Pretrain (formerly known as Group -1) is a multi-phase project that aims to create an environment for both manual and automated integration and regression testing of MediaWiki changes before those changes progress through the Deployment train process to the Wikimedia movement's content wikis, such as Wikidata and English Wikipedia.

The project has its roots in discussions about T215217: deployment-prep (beta cluster): Code stewardship request and the needs of the Wikimedia Quality Services & Wikimedia Testing Platform teams.

The work is currently associated with the 2025-2026 Annual Plan's Wiki Experiences (WE) WE6 objective's WE6.1 key result:

By the end of Q4, the number of train-blocking bugs that make it beyond test wikis is reduced by 10%

The work was associated with the 2024-2025 Annual Plan's Wiki Experiences (WE) WE6 objective's WE6.2 key result:

By the end of Q4, enhance an existing project and perform at least two experiments aimed at providing maintainable, targeted environments moving us towards safe, semi-continuous delivery.

Hypotheses

If we deploy MediaWiki multiple times per day into a contained area of production, we will create the most "production-like" environment for QTE staff to use for manual exploratory testing and automated regression checks prior to train deployment to major content wikis.

Implementation will progress via a series of smaller hypotheses which will eventually connect to realize the overarching hypothesis. This approach is being used to avoid attempting to track progress against a single "boil the ocean" hypothesis that could take a year or more to reach an easily measurable state.

[WE6.2.1] Publish pre-train single version containers

If we publish a versioned build of MediaWiki, extensions, skins, and Wikimedia configuration at least once per day, we will uncover new constraints and establish a baseline of wallclock time needed to perform a build.

Step 0 towards the long-term goal of being capable of continuous delivery (CD) into production is being able to deliver faster than the current weekly train process. A daily process would be approximately 3-4 times faster than our current production delivery cadence. We currently envision our eventual capability goal as being able to deliver every 15 minutes. Setting the initial goal at a cadence two orders of magnitude slower than that (once per 1440 minutes vs once per 15 minutes) will still expose us to a number of real-world constraints that are not addressed by current workflows. We expect to uncover more detail about the challenges of continuing to accelerate the pace of delivery without tipping so quickly into extreme difficulty that we endanger our ability to use an iterative development model.

We are explicitly not constraining where this publishing workflow will be measured at this time. We expect the SRE groups who will need to be involved in deploying into a wikikube environment to be occupied by other goals in the initial months of FY24/25. We are not currently certain what new capabilities would need to be produced to target the current beta cluster shared environment, or that doing so would be of long-term benefit to the goal, team, or projects. We do expect to deliver beyond a single-user development environment and will be able to provide more detail as design progresses and the cone of uncertainty for the overall project narrows.

This hypothesis has been declared successfully completed. See our 2024-10-31 progress report for the writeup of what we accomplished and the lessons learned along the way.

[WE6.2.6] Create design document for Pretrain (née Group -1) deployment

If we gather feedback from QTE, SRE, and individuals with domain-specific knowledge and use it to write a design document for deploying and using the wmf/next OCI container, then we will reduce friction when we start deploying that container.

We have a build process that produces a wmf/next branch and a related OCI image daily; however, that image is not yet being used anywhere. The next major implementation milestone is to deploy the image somewhere in production, with config to serve one or more wikis and edge routing to bring traffic to the deployment. This work will need input from a number of teams and individuals, who will have various concerns and constraints that must be addressed. Finding consensus on known technical and social questions before attempting to implement the deployment process and its related config should reduce conflicts and confusion for everyone.

This hypothesis has been declared successfully completed. See our 2025-04-04 progress report for the writeup of what we accomplished and the lessons learned along the way.

[WE6.1.1] Move image build to deployment server and update for backports

If we rehome daily image builds to the deployment server and add image updates triggered by select deployment actions, we will uncover constraints and establish a baseline for the time needed to perform more continuous deployments.

This is a follow-up to the work from [WE6.2.1] Publish pre-train single version containers, informed by the agreements made during [WE6.2.6] Create design document for Pretrain (née Group -1) deployment. We will migrate the daily image build process from the https://releases-jenkins.wikimedia.org/ service to a systemd-timer-triggered scap process running on the active deployment server. We will also add new support in scap for updating the latest wmf/next image with configuration changes and security patches as they are deployed. These actions will prepare us for the "final" Pretrain phase, covered by a future hypothesis, of actually putting the container to work serving content for a set of testing wikis.
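
As a rough sketch of what the timer-driven build could look like, the Python below shows a thin wrapper that a systemd timer might invoke on the active deployment server. The scap build-next-image subcommand and the timing-log path are hypothetical placeholders rather than existing scap interfaces; the real entry point will be whatever scap support this hypothesis produces.

```python
#!/usr/bin/env python3
"""Illustrative sketch only: a wrapper that a systemd timer could invoke on
the active deployment server to run the daily wmf/next image build.  The
scap subcommand name and the timing-log path are hypothetical."""

import subprocess
import time
from datetime import datetime, timezone

BUILD_CMD = ["scap", "build-next-image"]            # hypothetical subcommand
TIMING_LOG = "/var/log/pretrain/build-timings.log"  # hypothetical location


def main() -> int:
    started = time.monotonic()
    result = subprocess.run(BUILD_CMD)
    elapsed = time.monotonic() - started
    # Record wallclock time for each run so we can establish the baseline
    # this hypothesis calls for and watch it trend as the cadence increases.
    with open(TIMING_LOG, "a") as log:
        log.write(f"{datetime.now(timezone.utc).isoformat()} "
                  f"exit={result.returncode} seconds={elapsed:.0f}\n")
    return result.returncode


if __name__ == "__main__":
    raise SystemExit(main())
```

A timer unit with, for example, OnCalendar=*-*-* 01:00:00 UTC would match the daily build time described in the design agreements below; the exact trigger is one of the details this hypothesis is meant to settle.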

Goals

Workflow goals

  • Enable testing of pre-train ("next") branches of MediaWiki, skins, and extensions in a stable environment where newly discovered defects are more likely to be the result of the next branches than of problems with support services or configuration in the environment itself.

Technical goals

  •   Done Automate MediaWiki OCI image creation based on a timer or similar trigger.
  • Create an environment in the production network for running pre-train MediaWiki versions.
  • Automate MediaWiki OCI container deployment into the pre-train environment.
  •   Done Enable overriding any staged wikiversions.json when scap is determining which MediaWiki versions need to be included in a container (allow, but do not require, single version builds).
  •   Done Enable overriding an in-container wikiversions.json with a hard-coded MediaWiki version inside the container (a sketch of this override follows this list).
  •   Done Enable creation of MediaWiki containers from arbitrary staging directories so that a single deployment or CI server can be used to build as many variant containers as we find need for.
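
The two wikiversions.json overrides above boil down to rewriting the map of wiki dbnames to MediaWiki versions so that every wiki points at one pinned branch. Below is a minimal sketch of that override, assuming the standard wikiversions.json layout (dbname keys, "php-…" version values); the function name, file paths, and version string are illustrative and do not reflect scap's actual implementation.

```python
"""Minimal sketch of the single-version wikiversions.json override described
in the goals above.  Paths, function name, and the pinned version string are
examples only; the real behaviour lives in scap."""

import json


def pin_single_version(src: str, dest: str, version: str) -> None:
    # wikiversions.json maps each wiki dbname to a "php-<branch>" string;
    # a single-version build simply pins every wiki to the same value.
    with open(src) as f:
        wikiversions = json.load(f)
    pinned = {dbname: version for dbname in wikiversions}
    with open(dest, "w") as f:
        json.dump(pinned, f, indent=4, sort_keys=True)


# Example invocation (values are illustrative):
# pin_single_version("wikiversions.json", "wikiversions-next.json",
#                    "php-1.45.0-wmf.1")
```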

Out of scope

Workflow out of scope

  • Enabling testing of pre-release services and configuration not managed by scap is out of scope.

Technical out of scope

  • Replacing the weekly train progression with continuous delivery to all wikis is out of scope.
  • Building an image for every commit to a Train deployed repo/submodule is out of scope.
  • Keeping deployment-prep working in the face of production's migration to Kubernetes and containers as the MediaWiki deployment and runtime solution is out of scope.
  • Building a chain of images starting from public files only to produce images that can be used outside of production is out of scope.
  • Reimagining how configuration is delivered into MediaWiki containers is out of scope.
  • Runtime support for single version images beyond minimum functionality needed to support building and operating Pretrain directly is out of scope.

Intersecting projects

There are a number of active, planned, and imagined projects that intersect with the Pretrain concept and implementation. When possible, we should avoid becoming a blocker to these projects. We should also avoid making systemic changes that will cause future headaches we can already foresee today.

  • WE6.2.5 Move multiversion routing outside of the MediaWiki containers to unblock single version containers
  • WE5.4.2 PHP runtime upgrade process in a containerized world

Design agreements

This section started as a straw-dog proposal and now reflects collaborative design and discussion of the project's implementation details. It is a living document, but it attempts to record consensus at any given point in time. The initial proposal recorded here was the work product of T379683.

The Pretrain environment will be created by adding new releases to the staging Kubernetes cluster. These releases will be either multiple role-specific services under a new mw-pretrain namespace or additional "pretrain" services in existing role-related namespaces such as mw-{cron,jobrunner,mwscript,web}. A "release" in this parlance is an instance of a Helm chart running in a Kubernetes cluster. These releases will generally be known as pretrain in documentation, including this proposal.
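
For illustration only, the two candidate layouts can be written out as data; the namespace and release names below are taken from the paragraph above, and nothing here is a committed Helm configuration.

```python
# Illustrative only: the two candidate layouts for pretrain releases.

# Option A: role-specific releases inside a new, dedicated namespace.
LAYOUT_DEDICATED_NAMESPACE = {
    "mw-pretrain": ["web", "jobrunner", "cron", "mwscript"],
}

# Option B: an additional "pretrain" release in each existing role namespace.
LAYOUT_EXISTING_NAMESPACES = {
    "mw-web": ["pretrain"],
    "mw-jobrunner": ["pretrain"],
    "mw-cron": ["pretrain"],
    "mw-mwscript": ["pretrain"],
}
```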

Pretrain will use OCI container images containing a single wmf/next MediaWiki version. A new wmf/next image will be created once per day starting at 01:00 UTC. This image will be deployed via an automated process triggered by a scheduled task running from the active deployment server Monday through Thursday at 02:00 UTC. Additional workflows for rollback and ad hoc deployments will be available to authorized parties via scap and SpiderPig as appropriate.

Deployments of MediaWiki configuration (operations/mediawiki-config.git) and of security patches done via scap from a deployment server will also trigger new pretrain deployments. These deployments will not happen in-process with the other wiki updates; instead, they will trigger an asynchronous build and deployment workflow. We may choose to de-duplicate triggering events so that we only deploy pretrain once as the result of a typical backport or security window.
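
One possible shape for that de-duplication, sketched below, is a settle window: every scap-originated event restarts a timer, and only the last event in a burst actually schedules the asynchronous pretrain build. The class, method names, and window length are illustrative assumptions, not an agreed design.

```python
"""Sketch of one way to de-duplicate pretrain build triggers: collapse
scap-originated events that land within a settle window into a single
asynchronous build.  Names and the window length are assumptions."""

import threading

SETTLE_SECONDS = 300  # e.g. long enough to cover a typical backport window


class PretrainTrigger:
    def __init__(self, build_and_deploy):
        self._build_and_deploy = build_and_deploy  # callable doing the real work
        self._timer = None
        self._lock = threading.Lock()

    def notify(self, reason: str) -> None:
        """Called after each config sync or security-patch deployment."""
        with self._lock:
            # Restart the settle timer; only the last event in a burst
            # actually fires the build-and-deploy workflow.
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(SETTLE_SECONDS, self._build_and_deploy)
            self._timer.daemon = True
            self._timer.start()
```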

The initial pretrain wiki will be testwiki. We expect to add all of the test*wikis except test2wiki to the environment within the first 1-2 months of operation. Additional wikis are slated to be added later, pending the establishment of compensating controls that satisfy concerns about code integrity in a fully automated deployment environment. Once completed, this will give the environment a mix of wikis, with the intent that there is enough organic traffic to the "real" wikis to expose serious defects before code is automatically promoted into the weekly train release process.

A new testing wiki using a right-to-left (RTL) UI language such as Hebrew or Arabic will be added to the pretrain environment following the initial deployment. This RTL wiki will assist test engineers in validating RTL-specific functionality and patrolling for regressions. Further discussion is needed to determine the exact UI language and wiki name.

The single version MediaWiki container for pretrain will not require WE6.2.5 routing. Instead, traffic can be routed to the deployment based on domain. The MediaWiki JobQueue will also need special configuration to route jobs to the appropriate wmf/next containers.

All logstash events emitted from MediaWiki Kubernetes containers will be annotated with the OCI container's image and tag (for example "restricted/mediawiki-multiversion:2024-11-07-141556-publish") in the dict of "kubernetes" values attached to the event. Generally we should seek to make log events as self-documenting as possible so that folks do not need deep knowledge of train or pretrain operations to figure out what code and config were active when the log event occurred.
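
As a concrete illustration, an annotated event might look roughly like the following; only the image and tag annotation is taken from the proposal above, and the remaining field names are placeholder assumptions.

```python
# Hypothetical shape of an annotated logstash event.
annotated_event = {
    "channel": "exception",
    "message": "...",
    "kubernetes": {
        "namespace": "mw-web",
        "image": "restricted/mediawiki-multiversion",
        "image_tag": "2024-11-07-141556-publish",
    },
}
```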

Unknowns / open questions

  • If we move existing wikis (testwiki, test2wiki, officewiki, mediawikiwiki, wikitech, etc.) to Pretrain, will mwscript whatever --wiki=movedwiki work transparently, or will these wikis need to be addressed from a special place?
  • What compensating controls should we implement to increase confidence in the safety of automated deployments?

Reports
