Failing Forward — How We Grow from Incidents

submitted by
Style Pass
2022-05-13 12:30:08

This past year, a few of us who focus on the reliability of Spotify for Artists (S4A) reviewed the incidents that directly affected S4A in 2021. We formed some hypotheses, then analyzed each incident to build a set of quantitative and qualitative metrics. We learned a few things, and we hope those insights will help you as well.

Every time something doesn’t go according to plan, there are costs and benefits. Our minds typically jump to the costs we incur when users are impacted, but there is another cost: productivity.

We found that most incidents have a moderate to high productivity impact. This means that when an incident happens, 55% of the time at least one responder spends the better part of a day addressing the problem. Quite often, 23% of the time in fact, the blast radius for productivity impact is even bigger.
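For readers who want to run a similar analysis on their own incidents, here is a minimal sketch of how such shares could be computed. It is not our actual tooling: the `Incident` record, the `low`/`moderate`/`high` labels, and the `impact_shares` helper are all assumptions made for illustration, and the sample records are made up purely to exercise the function.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical impact levels. The article speaks of "moderate to high"
# productivity impact but does not publish the exact rubric.
LEVELS = ("low", "moderate", "high")


@dataclass
class Incident:
    incident_id: str
    productivity_impact: str  # expected to be one of LEVELS


def impact_shares(incidents):
    """Return the fraction of incidents at each impact level."""
    if not incidents:
        return {level: 0.0 for level in LEVELS}
    counts = Counter(i.productivity_impact for i in incidents)
    total = len(incidents)
    return {level: counts.get(level, 0) / total for level in LEVELS}


# Made-up records for illustration only; these are not Spotify data.
sample = [
    Incident("INC-1", "low"),
    Incident("INC-2", "moderate"),
    Incident("INC-3", "moderate"),
    Incident("INC-4", "high"),
]
shares = impact_shares(sample)
for level in LEVELS:
    print(f"{level}: {shares[level]:.0%}")
```

Tagging each incident with an impact level during the review keeps this kind of tally cheap to repeat year over year.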

Now, you might say, “Oh wow, you waste a lot of time fixing stuff!” But, as mentioned earlier, this cost has a benefit, when properly managed. When we take the time to run an incident review (we used to call this a postmortem), we have an opportunity to investigate and internalize how and where the system broke down. While we can read code and docs and hypothesize where the issue stemmed from, analyzing a failure is an opportunity to correct our understanding and bring it closer to reality. Leveling up our engineers in this way leads to better outcomes. This is how we buy productivity later, by sacrificing some of it today.
