The Netflix Simian Army

submited by

Style Pass

2024-11-19 12:30:07

We’ve talked a bit in the past about our move to the cloud, and John shared some of our lessons learned in going through that transition in a previous post. Recently, we’ve been focusing on ways to improve availability and reliability and wanted to share some of our progress and thinking.

The cloud is all about redundancy and fault-tolerance. Since no single component can guarantee 100% uptime (and even the most expensive hardware eventually fails), we have to design a cloud architecture where individual components can fail without affecting the availability of the entire system. In effect, we have to be stronger than our weakest link. We can use techniques like graceful degradation on dependency failures, as well as node-, rack-, datacenter-/availability-zone-, and even regionally-redundant deployments. But just designing a fault tolerant architecture is not enough. We have to constantly test our ability to actually survive these “once in a blue moon” failures.

Imagine getting a flat tire. Even if you have a spare tire in your trunk, do you know if it is inflated? Do you have the tools to change it? And, most importantly, do you remember how to do it right? One way to make sure you can deal with a flat tire on the freeway, in the rain, in the middle of the night is to poke a hole in your tire once a week in your driveway on a Sunday afternoon and go through the drill of replacing it. This is expensive and time-consuming in the real world, but can be (almost) free and automated in the cloud.