It should be possible to focus almost entirely on the business logic and the complexity inherent in the domain. Instead, you need to review the code line by line and ask: “what if the service crashes here?”, “what if the API we call here is temporarily unavailable?”, “what if a concurrent invocation overtakes this one at this point?”, or “what if this process becomes a zombie while executing this function, how do I prevent it from corrupting the state?”.
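To make this concrete, here is a rough sketch of the kind of handler such a review targets. All of the names below (Order, chargePayment, reserveInventory, sendConfirmation) are hypothetical placeholders, purely to illustrate where those questions come up:

```typescript
// A hypothetical order-processing handler. The types and functions here are
// illustrative stand-ins, not a real API.
type Order = { id: string; amount: number };

declare function chargePayment(order: Order): Promise<void>;
declare function reserveInventory(order: Order): Promise<void>;
declare function sendConfirmation(order: Order): Promise<void>;

async function processOrder(order: Order): Promise<void> {
  // What if the payment API is temporarily unavailable? Do we retry, and for how long?
  await chargePayment(order);

  // What if the service crashes right here? The customer has been charged,
  // but no inventory was reserved and no confirmation was sent.
  await reserveInventory(order);

  // What if a concurrent invocation for the same order overtakes this one,
  // or a zombie copy of this process is still running and writes stale state?
  await sendConfirmation(order);

  // And a naive retry of the whole handler would charge the customer twice.
}
```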
As a result, you spend a huge amount of time worrying about failover strategies, retries, race conditions, locking and fencing, the ordering of operations, the order in which changes become visible, decoupling for availability, and so on. You add queues, key-value stores, locking services, schedulers, and workflow orchestrators, and then try to get them all to play nicely together. And the hard truth is that many applications don’t get this right: they are not correct under failures, or even under load.
How can we radically simplify this? In this article, we walk through a core idea that addresses many of these issues by avoiding distributed coordination. Much of it goes back to lessons we learned while building Apache Flink.