There are so many ways to measure application reliability today, with hundreds of key performance indicators (KPIs) covering availability, error rates, user experience, and quality of service (QoS). Yet every organization I speak with struggles to use these metrics effectively. Some applications and services require custom reliability metrics, while others can be measured with just uptime vs. downtime.
In my role at Causely, I work every day with companies that are trying to improve the reliability, resiliency, and agility of their applications. One method of measuring reliability that I keep a close eye on is MTT(XYZ). Yes, I made that up, but it’s meant to capture all the different variations of mean time to “X” out there. We have MTTR, MTTI, MTTF, MTTA, MTBF, MTTD, and the list keeps going. In fact, some of these acronyms have multiple definitions. The one whose meaning I want to discuss today is MTTR.
Before cloud-native applications, MTTR meant one thing – Mean Time to Repair. It’s a metric focused on how quickly an organization can respond to and fix problems that cause downtime or performance degradation. It’s simple to calculate too:
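As a rough illustration, here is a minimal Python sketch of that calculation. It assumes the conventional definition of Mean Time to Repair: total time spent on repairs divided by the number of repairs. The `Incident` type, the `mean_time_to_repair` helper, and the timestamps are hypothetical, made up only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record: when the problem started and when the fix was in place.
    started: datetime
    repaired: datetime


def mean_time_to_repair(incidents: list[Incident]) -> timedelta:
    """Total repair time divided by the number of repairs."""
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    # Sum the per-incident repair durations, starting from a zero timedelta.
    total = sum((i.repaired - i.started for i in incidents), timedelta())
    return total / len(incidents)


# Made-up data: one 30-minute repair and one 90-minute repair.
incidents = [
    Incident(datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 30)),
    Incident(datetime(2024, 1, 8, 14, 0), datetime(2024, 1, 8, 15, 30)),
]
mttr = mean_time_to_repair(incidents)
print(mttr)  # 1:00:00
```

Where the clock starts (when the problem began, when it was detected, or when work on the fix started) is a choice each team has to make; this sketch simply measures from `started` to `repaired`.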