“Mean time to X” is a common term used to describe how long, on average, a particular milestone takes to achieve in incident response. There’s mean time to detect, acknowledge, mitigate, etc. And then there’s the elusive “mean time to recover,” also known as “MTTR.”
MTTR, a hotly debated acronym and concept, measures how long it takes to resolve an incident on average. The problem with MTTR, though, is that it doesn’t matter.
MTTR is an excruciatingly flawed metric that relies on comparing fundamentally different incidents with different contributing factors to measure whether a team improves its incident response. It’s like putting a Volkswagen Jetta, a BMW M5, and a Lamborghini on a race track and saying the average speed of the cars is 80 mph. The brutal reality is that incidents will always vary in how long they take to resolve, so averaging incidents of varying severities across a complex system will yield a number with a high degree of variance as well.
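To make the variance point concrete, here is a minimal sketch using only Python's standard library. The resolution times are made up for illustration (they are not real incident data), and `summarize` is a hypothetical helper, not part of any incident tooling. It shows how one long incident can drag the mean far away from what a typical incident looks like.

```python
from statistics import mean, median, stdev

# Hypothetical resolution times in minutes, invented for illustration only.
# Five routine incidents and one long, severe outage.
resolution_minutes = [12, 18, 25, 30, 45, 600]


def summarize(durations):
    """Return the mean, median, and sample standard deviation of durations."""
    return {
        "mean": mean(durations),
        "median": median(durations),
        "stdev": stdev(durations),
    }


# Summary across all six incidents, including the severe outage.
summary = summarize(resolution_minutes)

# The same summary with the single long incident left out.
without_outlier = summarize(resolution_minutes[:-1])

print(f"MTTR, all incidents:      {summary['mean']:.1f} min")
print(f"Median, all incidents:    {summary['median']:.1f} min")
print(f"Std deviation:            {summary['stdev']:.1f} min")
print(f"MTTR, outage excluded:    {without_outlier['mean']:.1f} min")
```

With this invented data set, the spread is larger than the mean itself, and dropping a single incident changes the "average" several times over. The number says more about which incidents happened to land in the window than about how well the team responded.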
There are a few MTT*s I believe are valuable in incident response, many of which can be quickly identified and acted on. I have an affinity for measuring MTTD, or mean time to detect, which can reveal gaps in monitoring or thresholds in service level objectives that may need adjusting. But there’s one particular MTT* that can have a material impact on your team: MTTR.