Auto-remediation is important

submitted by
Style Pass
2025-01-18 12:30:07

Building software at scale is hard. Maintenance is even harder. No company can succeed without a solid approach to handling operational problems. Failures will happen. As fleet sizes cross 10,000 servers, services need to be increasingly operations-friendly [1]. Each service should run with as little human intervention as possible. Unfortunately, operations teams such as DevOps and SRE still find themselves intervening regularly.

There are several considerations when building an ops-friendly product. Some examples are designing for scale, providing redundancy, and building automatic failure recovery. As an experienced SRE, I have a vested interest in the last item, failure recovery without human intervention. How do we get there?

In 2018, a payment outage left millions of Visa customers in Europe unable to make payments with their cards. This caused widespread panic, and rightly so. People want assurance that their money is safe and accessible. The postmortem revealed that a faulty datacenter switch caused a number of transaction failures. Considering it took ~10 hours for the issue to be resolved, I think it’s safe to say that a lot of people worked together for a while to get the fix out. Without intervention, it would’ve taken even longer.

Before fixing problems through intervention, engineering teams need to first know when they happen. This is why Monitoring and Alerting are fundamental to preventing high-impact incidents. An alert queries a metrics system for a known problematic condition and notifies a responsible stakeholder when there’s a match. The alert priority then determines the level of intervention required.
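To make the alert flow above concrete, here is a minimal Python sketch. It is not tied to any real monitoring product: the `Alert` fields, the metric names, the thresholds, and the "page" and "ticket" priorities are all hypothetical, and the metrics system is stood in for by a plain dictionary.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Alert:
    """A hypothetical alert rule: a known bad condition plus who to tell."""
    name: str
    metric: str        # key to look up in the metrics snapshot
    threshold: float   # the condition matches when the value exceeds this
    priority: str      # assumed levels: "page" (act now) or "ticket" (act later)
    owner: str         # the responsible stakeholder


def evaluate(alert: Alert, metrics: Dict[str, float]) -> Optional[str]:
    """Return a notification message if the condition matches, else None."""
    value = metrics.get(alert.metric)
    if value is None or value <= alert.threshold:
        return None
    # The priority decides how much human intervention is requested.
    action = "page on-call now" if alert.priority == "page" else "file a ticket"
    return (
        f"[{alert.priority}] {alert.name}: {alert.metric}={value} "
        f"exceeds {alert.threshold}; {action} for {alert.owner}"
    )


# Stand-in for a query against a metrics system (values are made up).
metrics = {"txn_failure_ratio": 0.12, "disk_used_ratio": 0.70}

alerts = [
    Alert("HighTransactionFailures", "txn_failure_ratio", 0.05, "page", "payments-oncall"),
    Alert("DiskFillingUp", "disk_used_ratio", 0.85, "ticket", "storage-team"),
]

notifications = []
for alert in alerts:
    message = evaluate(alert, metrics)
    if message is not None:
        notifications.append(message)

for message in notifications:
    print(message)
```

With these made-up values, only the transaction-failure rule matches, so one stakeholder is notified and the disk rule stays silent. A real system would evaluate rules on a schedule and route notifications through a paging or ticketing service rather than printing them.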
