A large majority of computer systems have some state and are likely to depend on a storage system. My knowledge of databases accumulated over time, but along the way, our design mistakes caused data loss and outages. In data-heavy systems, databases are at the core of system design goals and tradeoffs. Even though it is impossible to ignore how databases work, the problems that application developers foresee and experience are often just the tip of the iceberg. In this series, I’m sharing a few insights I found particularly useful for developers who are not specialized in this domain.
It’s an open debate how reliable today’s networking is and how often systems experience downtime because of network outages. The available research is limited and is often dominated by large organizations that have dedicated networks with custom hardware, as well as specialized staff.
Spanner (Google’s globally distributed database) offers 99.999% service availability, and Google cites that only 7.6% of Spanner’s issues are caused by networking, even though it keeps crediting its dedicated networking as a core reason behind that availability. Bailis and Kingsbury’s 2014 survey challenges one of the Fallacies of Distributed Computing coined by Peter Deutsch in 1994. Is the network really reliable?
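To put the 99.999% figure in perspective, here is a rough sketch of the yearly downtime budget that an availability target implies. The function name and the 365-day year are my own assumptions for illustration; they are not taken from Google’s reporting.

```python
# Yearly downtime budget implied by an availability target.
# Assumption: a 365-day year (leap years are ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the maximum downtime, in minutes per year, allowed by a target."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Compare three common availability targets.
    for target in (99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% -> {minutes:.2f} minutes/year")
```

At five nines, the budget is a little over five minutes of downtime per year, which leaves very little room for any single class of failure, networking included.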