Partial Network Partitioning

Submitted by Style Pass, 2021-06-14 03:00:09

While conducting a comprehensive study of network failures in 25 modern systems, we identified partial network partitioning, an unusual network fault that causes catastrophic failures.

Partial network partitioning is a network fault that disrupts communication among some but not all nodes in a cluster. The figure shows how a partial network partition divides a cluster into three groups of nodes so that two groups (Group 1 and Group 2) are disconnected, but Group 3 can communicate with Groups 1 and 2.
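The definition above can be made concrete with a small connectivity check. The following is a minimal sketch, not taken from the study: the function name `classify_partition`, the node names, and the model of the network as a set of working bidirectional links are all assumptions made for illustration. It reports a partial partition when two nodes cannot talk directly but some third node can still reach both of them.

```python
from itertools import combinations


def classify_partition(nodes, links):
    """Classify connectivity as 'none', 'partial', or 'complete'.

    nodes: iterable of node names (hypothetical identifiers).
    links: iterable of (a, b) pairs that can currently communicate;
           links are assumed to be bidirectional.
    """
    nodes = list(nodes)
    up = {frozenset(link) for link in links}

    def linked(a, b):
        return frozenset((a, b)) in up

    # Pairs of nodes that cannot communicate directly.
    broken = [(a, b) for a, b in combinations(nodes, 2) if not linked(a, b)]
    if not broken:
        return "none"

    # Partial partition: a disconnected pair shares a reachable third node.
    for a, b in broken:
        if any(linked(a, c) and linked(b, c) for c in nodes if c not in (a, b)):
            return "partial"

    # Every disconnected pair is fully cut off from each other.
    return "complete"


# Group 1 = {n1}, Group 2 = {n2}, Group 3 = {n3}: n1 and n2 are
# disconnected, but n3 can still communicate with both.
print(classify_partition(["n1", "n2", "n3"], [("n1", "n3"), ("n2", "n3")]))
```

In this sketch a single node stands in for each group in the figure; a real cluster would need per-direction link state and failure-detector timeouts, which are deliberately left out.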

We were curious to understand how these faults impact systems. Is the community aware of this fault? Are there established fault-tolerance techniques?

To answer these questions, we conducted a comprehensive study of system failures caused by partial network partitions. We analyzed 51 failure reports, each caused by a partial network partitioning fault, from 12 modern systems. We found that this fault leads to catastrophic failures such as data loss and complete system shutdown, that these failures manifest easily once a partial partition occurs, and that the majority of failures are due to design flaws. Finally, we found that all failures are reproducible using only 5 nodes.

While we did not find any discussion of this fault in the literature, we found eight popular systems (VoltDB, MapReduce, HBase, MongoDB, Elasticsearch, Mesos, LogCabin, and RabbitMQ) that implemented fault-tolerance techniques specifically to tolerate partial partitioning faults. We dissected the design of these eight systems and identified four principled approaches for tolerating partial partitions. Unfortunately, our analysis shows that the implemented fault-tolerance techniques are inadequate for modern systems: they either patch a particular mechanism, or they may lead to a complete cluster shutdown, even when alternative network paths exist.