Failure Mitigation for Microservices: An Intro to Aperture

Submitted by
Style Pass
2024-10-22 13:00:06

Localized mitigation mechanisms such as load shedding and circuit breakers have long been used to deal with failures in microservice systems, but they may not be as effective as a more global approach. As a systematic study on the subject published at SoCC 2022 demonstrates, these localized mechanisms are useful for preventing individual services from being overloaded, but they are not very effective against complex failures that involve interactions between services, which are characteristic of microservice failures.

A novel way to deal with these complex failures takes a global view of the system: when an issue arises, a global mitigation plan is automatically activated that coordinates mitigation actions across services. In this post, we evaluate the open-source project Aperture and how it enables a global failure mitigation plan for our services. We first describe the common types of failures we have experienced at DoorDash. We then dive into the existing mechanisms that have helped us weather those failures, explain why localized mechanisms may not be the most effective solution, and argue in favor of a globally aware failure mitigation approach. Finally, we share our initial experiences using Aperture, which offers a global approach to these challenges.

Before we explain what we have done to deal with failures, let's explore the types of microservice failures that organizations experience. We will discuss four types that DoorDash and other enterprises have encountered.
