“We got lucky”: it’s one of those phrases I listen out for during post-incident or near-miss reviews. It’s an invitation to dig deeper, to understand what led to our luck. Was it pure happenstance? Or have we been doing things that increased or decreased our luck?
There’s a saying of apparently disputed origin: “Luck is when preparation meets opportunity”. There will always be opportunity for things to go wrong in production. What does the observation “we got lucky” tell us about our preparation?
How have we been decreasing our luck?
What unsafe behaviour have we been normalising? It can be the absence of things that increase safety. What could we start doing to increase our chances of repeating our luck in a similar incident? What will we make time for?
“We were lucky that Amanda was online, she’s the only person who knows this system. It would have taken hours to diagnose without her”
How can we improve collective understanding and ownership?
Post-incident reviews are a good opportunity for more of the team to understand how a system works, but we don’t need to wait for something to go wrong. Maybe we should dedicate a few hours a week to understanding one of our systems together? What about trying pair programming? Chaos engineering?
How can we make our systems easier to diagnose without relying on those who already have a good mental model of how they work? Without even relying on collaboration? How will we make time to make our systems observable? What would be the cost of “bad luck” here? Maybe we should invest some of it in tooling?
If “we got lucky” implies that we’d be unhappy with the unlucky outcome, then what do we need to stop doing to make more time for things that can improve safety?
How have we been increasing our luck?
Let’s seek to understand what preparation led to the lucky escape, and think how we can turn up the dials.
“Sam spotted the problem on our SLIs dashboard”
Are we measuring what matters on all of our services? Or was part of “we got lucky” that the problem hit one of the few services where we happen to be measuring the things that matter to our users?
“Liz did a developer exchange with the SRE team last month and learned how this worked”
Should we make more time for such exchanges and personal learning opportunities?
“Emily remembered she was pairing with David last week and made a change in this area”
Do we often pair? What if we did more of it?
How frequently do we try our luck?
If you’re having enough production incidents to be able to evaluate your preparation, you’re probably either unlucky or unprepared ;)
If you have infrequent incidents you may be well prepared but it’s hard to tell. Chaos engineering experiments are a great way to test your preparation, and practice incident response in a less stressful context. It may seem like a huge leap from your current level of preparation to running automated chaos monkeys in production, but you don’t need to go straight there.
Why not start with practice drills? You could have a game host who comes up with a failure scenario. You can work up to chaos in production.
Dig deeper: what are the incentives behind your luck?
Is learning incentivised in your team, or is there pressure to get stuff shipped?
What gets celebrated in your team? Shipping things? Heroics when production falls over? Or time spent thinking, learning, working together?
Service Level Objectives (SLOs) are often used to incentivise (enough) reliability work vs feature work: if the SLO is under threat, we need to prioritise reliability.
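To make that concrete, here’s a minimal sketch of the kind of error-budget check that usually sits behind this. It assumes a simple availability SLO measured as good requests over total requests; the function name, the 99.9% target and the request counts are all made up for illustration.

```python
def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means fully spent, negative means the SLO is blown.
    Assumes an availability-style SLO: good requests / total requests.
    """
    if total == 0:
        return 1.0  # no traffic in the window, so nothing has been spent

    allowed_failures = (1 - slo_target) * total
    actual_failures = total - good

    if allowed_failures == 0:
        # A 100% target leaves no budget at all: any failure blows it.
        return 1.0 if actual_failures == 0 else float("-inf")

    return 1 - actual_failures / allowed_failures


# Hypothetical numbers: a 99.9% target over one million requests allows
# roughly 1,000 failures. With 600 failures, about 40% of the budget is left.
remaining = error_budget_remaining(0.999, good=999_400, total=1_000_000)
print(f"{remaining:.0%} of the error budget remaining")
```

A team might agree to switch from feature work to reliability work once this number drops below some threshold. The threshold is a team decision, not something the calculation gives you.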
I like SLOs, but by the time the SLO is at risk it’s rather late. Adding incentives to counter incentives risks escalation and stress.
What if we instead removed (or reduced) the existing incentives to rush and sacrifice safety? Remove them, rather than try to counter them with extra incentives for safety? 🤔