In safety systems engineering (SSE) people talk about “safety type 1” and “safety type 2”. The first one is about prevention, the second is about resilience. Type 1 has been around for a long time, and type 2 is fairly new.
Software systems (and software companies) are complex. And for complex systems, type 1 safety (prevention) will not save us for three key reasons:
Reason 1a: It’s impossible to prevent all failures from happening in a complex system. Many of them are unknowns: we can’t know them beforehand, so we can’t prevent them from ever happening. You can’t write unit tests for error scenarios you don’t know about.
Reason 1b: The context around a system is never static. There is an assumption that all failure is introduced by an operator, but this is not always the case. For example, load changes over time, new users register, auto-scaling might kick in, third-party providers might be unavailable, or the application or its database might run out of memory. A system is dynamic along many dimensions, and many things can happen.
Reason 2: As long as an operator is making changes to a system, mistakes will occasionally happen. They can’t fully be prevented. For example, every new deployment of some software runs the risk of breaking it. However, you can reduce the likelihood of mistakes happening. This is what type 1 has focused on.