Weeknotes: 7th October 2019
This week I unblocked our deployment pipeline.
Unblocking the pipeline wouldn’t ordinarily be a big highlight of my week, but this week we really blocked it and getting out of that was a big deal. I’ll talk about that more below.
We also ended up with a queue of changes stuck in the pipeline, including bug fixes, so Friday wound up being a ceremonious day of finally shipping to production a boatload of fixes that we’d written during the week.
Ultimately, blocking our releases for four days wasn’t that big a deal: six months ago we would only release once or twice a week, so it shows how far we’ve come that this was even an issue. But I’m glad to be in a team where we’re striving to get better every day and not falling back on excuses like that.
Pipeline blockage post-mortem
With apologies to readers who don’t use web development lingo for an otherwise horrible subtitle.
I wanted to go through what caused our blockage this week, why it was particularly bad, how we solved it, and what I’ve learnt.
We have a “preproduction” environment that all code is deployed to before release. Sometimes we pause a release at this point to test something with realistic data (our test environments just have a bunch of fake, sanitised data). But because everything has to go through preproduction before release, all subsequent changes get blocked too: until the change we’re testing is approved, nothing can be released.
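To make that blocking behaviour concrete, here is a minimal sketch in Python. It is a toy model rather than our actual pipeline: the `Pipeline` class, its method names and the change names are all made up for illustration.

```python
from collections import deque


class Pipeline:
    """Toy model: every change passes through preproduction, in order."""

    def __init__(self):
        self.preproduction = deque()  # changes deployed but not yet released
        self.production = []          # changes released, in release order
        self.paused = False

    def deploy(self, change):
        # Every change lands in preproduction first.
        self.preproduction.append(change)
        self._release_ready()

    def pause(self):
        # Hold releases, e.g. to test a change against realistic data.
        self.paused = True

    def resume(self):
        self.paused = False
        self._release_ready()

    def _release_ready(self):
        # While paused, nothing moves: later changes queue up behind
        # whatever is being tested.
        while self.preproduction and not self.paused:
            self.production.append(self.preproduction.popleft())


pipeline = Pipeline()
pipeline.pause()
pipeline.deploy("change-under-test")
pipeline.deploy("urgent-bug-fix")  # stuck behind the change under test
pipeline.resume()                  # both changes are released, in order
```

The point of the model is that the bug fix has no route to production that skips the queue, so a long pause delays everything behind it.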
On Monday we made such a call, pausing releases and queuing up a few changes until testing was complete. But testing identified serious issues and we couldn’t release with them in place. Initially we tried to create a fix and send that into preproduction too, but it was more complex than expected and taking too long. Meanwhile, critical bugs were stacking up.
On Thursday we made the decision to pull the initial change out, freeing the pipeline and allowing us to release everything. However, this couldn’t be done automatically because the queue of other changes had confused matters.
Instead, we manually disabled the functionality (by commenting out some key lines), but left everything else in place. This meant that our final release ultimately looked like a behind-the-scenes refactor, and when we reintroduce the change next week (with the bug fixed), we’ll have a really small changeset that’s easier to manage.
I came up with a couple of rules I’d suggest following when blocking a deployment pipeline:
- Try to make the blocker as small as possible: release behind-the-scenes code first and add nice-to-have improvements in a subsequent change later.
- Schedule testing: agree to, for example, block on Monday at 9AM, test until midday, and either release or pull out at midday. Make sure everyone who’s needed to contribute to that can subscribe to those timescales.
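The second rule could be sketched as a tiny decision function. This is only an illustration under assumptions: the `decide` function, the `Decision` names and the example times are hypothetical, and I’ve assumed an approved change is released straight away rather than waiting for midday.

```python
from datetime import datetime
from enum import Enum


class Decision(Enum):
    KEEP_TESTING = "keep testing"
    RELEASE = "release"
    PULL_OUT = "pull out"


def decide(now, deadline, approved):
    """Decide what to do with a change that is blocking the pipeline.

    now, deadline: datetime objects; approved: True once testing has passed.
    """
    if approved:
        # Assumption: release as soon as testing passes.
        return Decision.RELEASE
    if now >= deadline:
        # Time is up and the change isn't approved: unblock everyone else.
        return Decision.PULL_OUT
    return Decision.KEEP_TESTING


# Hypothetical schedule: block at 9AM, decide by midday.
deadline = datetime(2019, 10, 7, 12, 0)
mid_morning = decide(datetime(2019, 10, 7, 10, 30), deadline, approved=False)
at_midday = decide(datetime(2019, 10, 7, 12, 0), deadline, approved=False)
```

Here `mid_morning` is `Decision.KEEP_TESTING` and `at_midday` is `Decision.PULL_OUT`. The useful property is that the deadline is agreed before testing starts, so pulling out is the default and not a judgment call made under pressure.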
Also this week
- Wrote various bug fixes for issues with Complete the deputy report
- Identified a new way to handle “competing deputies” issues
- Moved our automated test scripts into a separate container
- Submitted my first PR to the MoJ Design System
- Started talking across OPG product teams about how we’ll use the MoJ Design System