Topic "Allowing time for QA" on Talk:Wikimedia Release Engineering Team/Trainsperiment week

MusikAnimal (talkcontribs)

A multi-train week sounds like an exciting experiment! My only concern is that if this becomes the new norm, how will development teams ensure there's enough time to QA changes before they are deployed? Currently, my team often quite deliberately waits until early Tuesday UTC, just after the branch is cut, to +2 a patch, so that QA has a full week to test it. With our QA resources shared with other teams, this system is already pushing it. Many changes don't get proper QA until after they've already landed in production.

Perhaps teams should consider doing more QA on Patch Demo, before changes are merged? This works in many cases but not all, because Patch Demo still has a ways to go before it reaches feature parity with the Beta Cluster. Are there any other recommendations for testing in a prod-like environment before a patch is auto-deployed? Or perhaps we're aiming for a release-early, release-often cycle where fixes/reverts come as quickly as the bugs appear?

Warm regards

TCipriani (WMF) (talkcontribs)

Thanks for the thoughtful question!

My only concern is that if this becomes the new norm, how will development teams ensure there's enough time to QA changes before they are deployed?

One hope I have is that a smaller number of changes moving to production at once (which won't be the case with the first deploy of this experiment, but will be the case if we move towards more frequent releases in general) will mean proportionally less QA is needed per release.

Coordinating merge, release, and QA will hopefully reach the same kind of equilibrium you're describing in a world where we do multiple trains in a week (although in the short term, I recognize, this will be difficult).

RelEng is meeting with QTE tomorrow and we'll try to coordinate with QA folks as much as possible during the week.

Perhaps teams should consider doing more QA on Patch Demo, before changes are merged? This works in many cases but not all, because Patch Demo still has a ways to go before it reaches feature parity with the Beta Cluster.

It would be great if we had a better understanding of exactly the value of a slow roll-out like the train. What kinds of problems does the train catch that are not currently caught in other (less costly) environments like patch demo, beta, or another local environment?

I have data about bugs in the train, but I don't know if those bugs could have been caught earlier in some other environment (even in a hypothetical environment we don't have).

Intuitively, I see the value of slowly exposing code to a larger pool of traffic — but is there any other way to gain that confidence? I'm not sure.

KHarlan (WMF) (talkcontribs)

I'm concerned about this issue as well. For example, we recently had a breakage to a GrowthExperiments feature from a change made in core. Even though GrowthExperiments is in the gate-and-submit jobs for mediawiki/core, the issue wasn't caught there, because only the api-testing tests in GrowthExperiments exercised the broken behaviour, and those don't run in core's gate-and-submit (task T303255 has the follow-up to consider running api-testing tests for all extensions/skins when the core api-testing job runs).
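To make the kind of coverage concrete: an api-testing job drives the wiki over HTTP, so core and the extension get exercised together end to end. Below is only a rough Python sketch of that idea, not our actual suite (the real api-testing framework is Node-based), and the endpoint and URL are made-up placeholders.

    # Rough illustration only: the real api-testing framework is a Node.js
    # package; this endpoint and URL are placeholders, not the actual
    # GrowthExperiments API. The point is that an end-to-end HTTP request
    # exercises core + extension together, so a core change that breaks the
    # extension shows up as a failing request before it reaches production.
    import requests

    BASE = "http://localhost:8080/w/rest.php"  # assumed local/CI test wiki

    def test_newcomer_tasks_endpoint_still_responds():
        # Hypothetical GrowthExperiments REST route, used purely as an example.
        resp = requests.get(f"{BASE}/growthexperiments/v0/newcomertasks",
                            params={"limit": 5}, timeout=10)
        assert resp.status_code == 200
        assert "tasks" in resp.json()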

We caught the problem because we were pushing patches to GrowthExperiments and noticed our builds were failing, but I am sure there are any number of cases where things could break due to 1) tests not covering the affected code or 2) there not being enough time to fix the problematic code before it ends up in front of users in group2. Having the breathing room between groups in the current train schedule has been pretty useful for avoiding production issues and reducing stress, IMO.

Cscott (talkcontribs)

Parsoid doesn't get branched magically for the train; because it is developed as a library we have to manually tag the library each week and commit a patch to mediawiki-vendor naming that new version before it makes the train. That actually works pretty well for us: we have an extensive regression testing framework that runs against ~100k pages of user content and takes a good fraction of a day to complete. So we usually kick off a run of that test suite after any particularly "meaty" change, and then also on Friday/Monday, and then discuss the results during our Monday team meeting to pick a version to tag and release for the train.
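To make the mechanics concrete, the weekly release looks roughly like the following. This is a loose Python sketch, not our actual tooling, and the version number and paths are placeholders.

    # Loose sketch of the weekly "tag Parsoid and bump mediawiki-vendor" steps;
    # not the team's actual release script. Version and paths are placeholders.
    import subprocess

    VERSION = "0.15.0-a1"             # placeholder weekly version
    PARSOID = "/srv/src/parsoid"      # local clone of the Parsoid library
    VENDOR = "/srv/src/mediawiki-vendor"

    def run(*cmd, cwd):
        subprocess.run(cmd, cwd=cwd, check=True)

    # 1. Tag the library at the commit picked in the Monday team meeting.
    run("git", "tag", "-s", f"v{VERSION}", "-m", f"Release v{VERSION}", cwd=PARSOID)
    run("git", "push", "origin", f"v{VERSION}", cwd=PARSOID)

    # 2. Point mediawiki-vendor at the new version and push the change for
    #    review so it lands before the Tuesday branch cut.
    run("composer", "require", f"wikimedia/parsoid:{VERSION}", cwd=VENDOR)
    run("git", "commit", "-am", f"Bump wikimedia/parsoid to {VERSION}", cwd=VENDOR)
    run("git", "push", "origin", "HEAD:refs/for/master", cwd=VENDOR)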

For changes that are likely to affect other teams (changes to APIs, etc) we will sometimes tag and push a new version to mediawiki-vendor on Tuesday right *after* the branch, to give the new code a full week to "age" on beta and expose any cross-cutting issues.

So more frequent trains wouldn't automatically affect us, nor would they make code appear in production more quickly, although they would give us more flexibility to "do a mid-week release", which would be helpful when there are ordered dependencies between changes in Parsoid and changes in core. On the other hand, they would undercut the "tag on Tuesday" model because there wouldn't be any "beta-only" testing time.

Cscott (talkcontribs)

Talking to the Editing team today, they have a similar "QA on beta" process that would be undercut by more frequent trains. They have also set up a "patchdemo" server, however, which runs from an unmerged branch (not master) and so allows a longer QA process for their team. It may be that if we "trainsperiment" permanently, we will need to build more robust "patchdemo"-like servers to allow "beta style" QA without actually merging the patches-to-be-tested into master (since anything on master will get put into production the next day).

Santhosh.thottingal (talkcontribs)

If there were a way for projects/extensions to mark revisions for the train somewhere, it could solve some of these valid issues, I guess. For example, if an extension defines a production-ready revision somewhere and the train automatically picks up changes up to that revision, the extension team has full control over QA and deployment readiness. Frequent trains would help pick up these revisions soon after they are marked as production-ready. Is this a practical idea?

MusikAnimal (talkcontribs)

That's basically how we do it at Community Tech for our external tools such as XTools. The main branch gets deployed to the staging server regularly (just like the Beta Cluster for MW), but nothing gets deployed to production until we add a tag in the git history, which is then automatically deployed via cron. In theory we could do something similar at WMF, perhaps once we move to GitLab, since (I assume?) it's easy to add git tags in the UI. Using branches probably makes more sense, though, and as you suggest this workflow shouldn't be imposed on developers/teams but rather be opt-in. It's a fun idea, since we could test on the more robust Beta Cluster for longer. Release Engineering can of course give a better answer to the question of practicality.
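The cron side of that is conceptually tiny. Here's a minimal sketch of the idea, assuming a plain git checkout on the server; it's not our actual deploy script, and the repo path is a placeholder.

    # Minimal sketch of a "deploy on new tag" cron job; not Community Tech's
    # actual script. The repo path is a placeholder.
    import subprocess

    REPO = "/srv/deployment/xtools"  # placeholder: the production checkout

    def git(*args):
        return subprocess.run(("git", *args), cwd=REPO, check=True,
                              capture_output=True, text=True).stdout.strip()

    git("fetch", "--tags", "origin")
    latest = git("describe", "--tags", "--abbrev=0", "origin/main")  # newest release tag
    current = git("describe", "--tags", "--abbrev=0", "HEAD")        # currently deployed tag

    if latest != current:
        git("checkout", latest)
        # ...then run whatever build/cache-warming steps the tool needs.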

Jdforrester (WMF) (talkcontribs)

If there were a way for projects/extensions to mark revisions for the train somewhere, it could solve some of these valid issues, I guess. For example, if an extension defines a production-ready revision somewhere and the train automatically picks up changes up to that revision, the extension team has full control over QA and deployment readiness. Frequent trains would help pick up these revisions soon after they are marked as production-ready. Is this a practical idea?

The Wikimedia development policy is that the development branch is always meant to be good to deploy (so every commit in master/main should be considered 'production-ready' before it's merged). In practice, that's obviously not true all the time; pushing more for pre-merge QA processes like Editing's excellent PatchDemo system would be a great way to simplify deployments for small patches like i18n updates etc.
