If you’re an Elixir, Gleam, or Erlang developer, you’ve probably heard about the capabilities of the BEAM virtual machine, such as concurrency, distribution, and fault tolerance. Fault tolerance was one of the biggest concerns of Tandem Computers. They created their Tandem Non-Stop architecture for high availability in their systems, which included ATMs and mainframes.
In this post, I’ll be sharing the fundamentals of the NonStop architecture design with you. Their approach to achieving high availability in the presence of failures is similar to some implementations in the Erlang Virtual Machine, as both rely on concepts of processes and modularity.
Why do systems fail? This question should probably be asked more often, considering all the factors it involves. It was central to the NonStop architecture because achieving high availability depends on understanding system failures.
For tandem systems, any system has critical components that could potentially cause failures. How often do you ask yourself how long can your system operate before a failure? There is a metric known as MTBF (mean time between failures), which is calculated by dividing the total operating hours of the system by the number of failures. The result represents the hours of uninterrupted operation.