Responding to and Learning from Distributed Incidents

Ideally we would have received an automated alert, something like "Number of started jobs over the last 5 to 10 minutes is 0". This is an example of a symptom-based alert: it says nothing about why jobs may not be starting. There are two main reasons to alert on symptoms instead of causes. Firstly, symptom-based alerts [1] are directly related to what users expect from your platform. They want to run their jobs in a timely fashion, and we need to know quickly if user needs are not being met. Secondly, in complex systems there are so many possible causes for problems that finding them all ahead of time and configuring alerting rules for all of them is impossible. We would inevitably have gaps in our alerting. When we alert on symptoms we can configure one alert which covers all potential causes for those symptoms. Furthermore, it means we don’t have to try to determine alert queries and thresholds for all potential causes — this normally reduces alert noise and makes for a more maintainable alerting configuration. 
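
To make the idea concrete, here is a minimal sketch of what such a symptom-based check could look like. The metric source, names, and thresholds are illustrative assumptions, not the platform's actual alerting setup; in practice this rule would live in whatever monitoring system evaluates your metrics.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical in-memory record of job start timestamps. In a real system
# this would be a query against the platform's metrics store.
JOB_START_TIMES: list[datetime] = []

def jobs_started_since(cutoff: datetime) -> int:
    """Count job starts after the cutoff, regardless of why any gap exists."""
    return sum(1 for t in JOB_START_TIMES if t >= cutoff)

def symptom_alert(window_minutes: int = 10) -> bool:
    """Symptom-based rule: fire when no jobs have started in the window.

    One rule covers every potential cause (scheduler crash, backlogged
    queue, network partition, ...) because it checks what users see."""
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=window_minutes)
    return jobs_started_since(cutoff) == 0

if __name__ == "__main__":
    if symptom_alert():
        print("ALERT: number of started jobs over the last 10 minutes is 0")
```

Note that the rule never mentions a cause: it only encodes the user-visible expectation that jobs keep starting, which is exactly what makes it cover failure modes nobody anticipated.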

The only downside of symptom-based alerting is that once you receive an alert, you have to get from the symptom — your alert — to the cause. This is also true when users report problems. Let's return to our case study: users were unable to run their jobs. We had no idea yet what had caused this symptom. At the start of the incident, the impact was not known either. Only a couple of big consumers had raised the issue, so during the initial investigation we assumed some part of the fleet was still operational and just backlogged.
