MTTR: When sample means and power laws combine, trouble follows

Submitted by Style Pass on 2024-12-23 18:00:03

Think back on all of the availability-impacting incidents that have occurred in your organization over some decent-sized period, maybe a year or more. Is the majority of the overall availability impact due to (1) a large number of shorter-duration incidents, or (2) a small number of longer-duration incidents?

If you answered (2), then this suggests that the time-to-resolve (TTR) incident metric in your organization exhibits a power-law distribution. This fact has implications for how well the sample mean of a collection of observed incidents approximates the population (true) mean of the underlying TTR process. This sample mean is commonly referred to as the infamous MTTR (mean-time-to-resolve) metric.
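As a rough illustration of why the distribution matters, here is a minimal sketch that repeatedly computes a "sample MTTR" from simulated incidents. It compares a power-law (Pareto) process with an exponential process that has the same true mean. The shape parameter `ALPHA = 1.2`, the 100 incidents per sample, the 1000 repetitions, and the helper names `sample_means` and `relative_spread` are all assumptions chosen for illustration, not values fitted to real incident data.

```python
import random
import statistics

ALPHA = 1.2          # assumed Pareto shape; smaller values mean a heavier tail
N_INCIDENTS = 100    # assumed number of incidents per observation window
N_TRIALS = 1000      # number of simulated observation windows

# True mean of a Pareto distribution with minimum 1 and shape ALPHA > 1.
TRUE_MEAN = ALPHA / (ALPHA - 1)


def sample_means(draw, n_incidents, n_trials):
    """Compute the sample mean of n_incidents draws, repeated n_trials times."""
    return [
        statistics.fmean([draw() for _ in range(n_incidents)])
        for _ in range(n_trials)
    ]


def relative_spread(values):
    """Range of the values divided by their median (a crude spread measure)."""
    return (max(values) - min(values)) / statistics.median(values)


rng = random.Random(42)  # fixed seed so the run is repeatable

# Power-law TTRs versus exponential TTRs with the same true mean.
pareto_means = sample_means(lambda: rng.paretovariate(ALPHA), N_INCIDENTS, N_TRIALS)
expo_means = sample_means(lambda: rng.expovariate(1 / TRUE_MEAN), N_INCIDENTS, N_TRIALS)

print(f"true mean for both processes: {TRUE_MEAN:.2f}")
for label, means in (("pareto", pareto_means), ("exponential", expo_means)):
    print(
        f"{label:>12}: min={min(means):.2f} "
        f"median={statistics.median(means):.2f} "
        f"max={max(means):.2f} "
        f"relative spread={relative_spread(means):.2f}"
    )
```

Under these assumptions, the exponential sample means should cluster fairly tightly around the true mean, while the Pareto sample means should range much more widely, because a single very long incident can dominate an entire sample.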

Now, I personally believe that incident durations are power-law distributed and, consequently, that observed MTTR trends convey no useful information at all.

But rather than simply asserting that power-law distributions render MTTR useless, I wanted to play with some power-law-distributed data to validate what I had read about power laws. And, to be honest, I wanted an excuse to play with Jupyter notebooks.