The SRE Incident Response game

submitted by
Style Pass
2021-07-05 06:00:04

A sustainable on-call duty is critical for any incident response, providing the human filter for the reliability and availability of the services under the watchful care of the engineering team. The key word is sustainable. Sleep peppered with PagerDuty alerts and complex, adrenaline-pumping incidents take their toll and can lead to burnout. An on-call rotation where an engineer feels underprepared, stressed and ready to quit is something we want to avoid, not only because it harms your engineers but also because it undermines the availability of your services. So let’s set the scene.

It’s 3 am, and you and your family are fast asleep. Then BAM! You are woken by a PagerDuty alarm, ripping you from your dreams into what could potentially be a major incident. Your adrenaline is spiking, your eyes are still focusing and you haven’t even had a cup of coffee yet. Are you ready to jump in and triage the page? Do you know where to start? Have you seen this error before? Are you even familiar with that particular service?

So how do we become comfortable with the uncomfortable? It takes practice and experience responding to incidents until it becomes muscle memory. However, repetition alone isn’t going to build this capacity or improve the team’s performance in responding to incidents. K. Anders Ericsson, a leading psychologist in the study of expert performance, describes how elite performers practice differently from everyone else. They engage in deliberate practice, which has specific goals and a defined focus. Critical to this practice is immediate feedback, which allows lessons to be learnt from mistakes made in a safe environment.
