Metastability and Distributed Systems - Marc's Blog

submited by

Style Pass

2021-05-25 07:30:03

There's no more time-honored way to get things working again, from toasters to global-scale distributed systems, than turning them off and on again. The reasons that works so well are varied, but one reason is especially important for the developers and operators of distributed systems: metastability.

Metastable failures occur in open systems with an uncontrolled source of load where a trigger causes the system to enter a bad state that persists even when the trigger is removed.

What they're identifying here is a kind of stable down state, where the system is stable but not doing useful work, even though it's only being offered a load that it successfully handled sometime in the past.

One classic version of this problem involves queues. A system is ticking along nicely, and something happens. Could be a short failure, a spike of load, a deployment, or one of many other things. This causes queues to back up in the system, causing an increase in latency. That increased latency causes clients to time out before the system responds to them. Clients continue to send work, and the system continues to complete that work. Throughput is great. None of the work is useful, though, because clients aren't waiting for the results, so goodput is zero. The system is mostly stable in this state, and without an external kick, could continue going along that way indefinitely. Up, but down. Working, but broken.

In Metastable Failures in Distributed Systems, Bronson et al correctly observe that these types of failure modes are well-known1 to the builders of large-scale systems: