Extension:MobileFrontend/Deployments/2012-05-07/Post mortem

Overview of the deployment issues, 7 May 2012

This was a problematic deployment with a handful of issues that ultimately took about 24 hours to resolve in production. The biggest issue was a change that had been deployed that prevented all infoboxes/tables from being displayed. Jon put together a fix for the issue and contacted one of the platform engineers to help with deployment, since the rest of us were asleep. Rather than just cherry-picking and syncing the fix itself, they synchronized the production branch to master and deployed the fix along with a few other changes that had been merged to master. After they deployed, testing turned up a number of issues, particularly with section toggling and image display on large articles.

Initially, I assumed the problems I was seeing were the result of syncing production to master rather than just cherry-picking the fix. But this proved to be a red herring: while the new changes included updated JavaScript files, the JavaScript running on production was exactly the same as what we had been running since the previous day's deployment, because the minified versions of our JS files had never been updated with the latest changes. This led me to believe the problems we were seeing had likely been introduced with the previous day's deployment and we simply hadn't noticed them in our testing the day before. So I rolled back production to what I believed was the last stable point, 30 April 2012, according to the MF deployments page. This proved to be a bit challenging, as I was not quite sure how to roll back using git and submodules, but I eventually got it sorted out with some help from Roan and Patrick.

However, it turned out that a handful of small fixes had been deployed after 30 April and were missing from my rollback. I had no idea they had even been deployed, as there was no record of them on the deployments page. Since these issues were discovered as we were preparing to roll forward, we just went ahead with the roll-forward.

We rolled forward to the known broken state of the 7 May deployment because we were unable to reproduce the issues in local/testing environments. Since Jon had come back online, we figured he'd be able to debug quickly with the issues live in production. Within minutes he came up with a quick fix, which was tested live and then pushed out to production shortly afterwards, along with a handful of other small bug fixes.

What went well

  • Having changelogs from previous deployments to refer to for rolling back
  • Once Jon was back online and the issues had been fully identified, he was able to produce a fix very quickly after we pushed the broken code back to production for live testing
  • Having test.m.wikipedia.org working for staging
  • Patrick's git know-how helped me work through the initial rollback quickly

What we can do to avoid these kinds of issues in the future

  • Until someone else is managing our approach to QA, I think we should put together a checklist of things to look out for after a deployment (e.g. is search working, are tables appearing, are sections toggling). This could potentially turn into a list of things to automate via Selenium (or something similar) in the future; a rough sketch follows this list.
  • We should automate running unit tests when someone submits something for review in Gerrit, and automate minification of JS when changes to JS files get merged into master, probably via Jenkins
  • We should ensure that changes being deployed (even when deploying only a revision or two) get recorded on the deployments page
  • We should automate the generation and posting of deployment changelogs
  • Institute a team-wide policy about what to do in similar situations so that a deployer does not need to spend critical time justifying their actions to the rest of the team
  • Add a section to deployment documentation about what to do when things go wrong (RobLa and co. have already been contacted about this)
  • Have additional deployers on the team in timezones other than Pacific/Mountain (eg Max)
  • Have a handful of pages on our labs instance that mirror some of the biggest/most complex pages on other wikis that we can test against
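
As a starting point for the Selenium idea above, here is a minimal post-deployment smoke-check sketch using Selenium's Python bindings. The article URL, CSS selectors, and search input name are assumptions for illustration and would need to be adjusted to the actual mobile skin markup before being used.

  # Minimal post-deployment smoke-check sketch using Selenium's Python bindings.
  # The article URL, CSS selectors, and search input name are assumptions for
  # illustration; adjust them to match the real mobile skin markup.
  from selenium import webdriver
  from selenium.webdriver.common.by import By

  ARTICLE_URL = "https://en.m.wikipedia.org/wiki/San_Francisco"  # assumed large test article

  driver = webdriver.Firefox()
  try:
      driver.get(ARTICLE_URL)

      # Are tables/infoboxes appearing?
      assert driver.find_elements(By.CSS_SELECTOR, "table"), "no tables rendered"

      # Do sections toggle? Click the first section heading (selector is an assumption).
      heading = driver.find_element(By.CSS_SELECTOR, "h2.section-heading")
      heading.click()

      # Is search present and usable? (input name is an assumption)
      driver.find_element(By.NAME, "search").send_keys("Wikipedia")

      print("Smoke check passed")
  finally:
      driver.quit()

A handful of checks like these, run against test.m.wikipedia.org or the labs instance right after a sync, would cover the checklist items above.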