This paper appeared in CIDR 2023 recently. It is a thought-provoking big-picture paper from a great team. The paper draws attention to the divide between the conventional wisdom on building scalable OLTP databases (shared-nothing architecture) and how they are actually built and deployed on the cloud (shared-storage architecture). There are shared-nothing systems like CockroachDB, Yugabyte, and Spanner, but the overwhelming trend/volume in cloud OLTP is shared storage, often even with a single writer, as in AWS Aurora, Azure SQL Hyperscale, PolarDB, and AlloyDB.

The paper doesn't analyze this divide in detail, and jumps off to discuss shared storage with multiple writers as a more scalable way to build cloud OLTP databases. It would have been nice to go deep on this divide. I think the big benefit of shared storage is the disaggregation we get between compute and storage, which allows us to scale/optimize compute and storage resources independently based on demand. Shared storage also allows scaling reads, which is where most workloads skew.

As the paper promotes multi-writer shared storage, it glorifies the promises of coherent caching for improving throughput and scalability. The paper does mention the complexity and coordination cost of coherent caching, but fails to talk about other downsides of caching with respect to metastable failures and modal behavior. The paper doesn't have a good answer for solving the complexity and coordination challenges of caching, and I am not sold on this part.

The second half of the paper is an overview of ScaleStore, a multi-writer shared-storage database the authors presented last year at SIGMOD. ScaleStore provides sequential/prefix consistency and lacks transaction support, as it doesn't implement isolation yet. Implementing that in a scalable way will be a challenge, but the paper cites a couple of directions for implementing coherent OLTP cache support more efficiently.
Now that you have gotten the TL;DR, read on if you are interested in diving deeper.
In this design, the database is split into partitions, each of which can be updated by only one RW-node. For splitting the database, modern systems use consistent hashing or fine-grained range partitioning. This design provides elasticity to scale, and can deliver good performance for mostly-local workloads. A big drawback of this design is that it couples storage and compute together: when one becomes the bottleneck, you need to split off a new partition, even though the other resource is under-utilized.
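To make the routing concrete, here is a minimal sketch (not from the paper) of consistent hashing: keys and RW-nodes are hashed onto the same ring, and a key is owned by the first node clockwise from it. The node names and key formats below are invented for illustration; real systems use virtual nodes, as sketched here, to spread load evenly.

```python
import bisect
import hashlib

def h(s: str) -> int:
    """Hash a string to a point on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        # Each node gets `vnodes` positions on the ring for load balance.
        self.ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.keys = [k for k, _ in self.ring]

    def owner(self, key: str) -> str:
        # First vnode clockwise from the key's hash; wrap around the ring.
        idx = bisect.bisect(self.keys, h(key)) % len(self.ring)
        return self.ring[idx][1]

    def add_node(self, node, vnodes=100):
        # Elasticity: adding a node remaps only roughly 1/N of the keys,
        # instead of reshuffling everything as naive modulo hashing would.
        for i in range(vnodes):
            pos = h(f"{node}#{i}")
            j = bisect.bisect(self.keys, pos)
            self.keys.insert(j, pos)
            self.ring.insert(j, (pos, node))

ring = ConsistentHashRing(["rw-node-1", "rw-node-2", "rw-node-3"])
```

The point of the sketch is the coupling critique above: `owner()` binds a key range to one RW-node, so relieving a hot node means calling something like `add_node()` and moving both the compute responsibility and the data of the remapped keys, even if only one of the two resources was saturated.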