At CloudKitchens, our software systems interact with humans in kitchens around the world in real time. From robotic conveyance to reliable messaging infrastructure, these systems need to “just work” so that they never impede food order fulfillment. Yet at scale, it has traditionally been difficult to design and build them well enough to satisfy our latency, correctness, and reliability requirements.
There are no simple cookie-cutter options. On one hand, stateless designs that rely on databases tend to be too slow. On the other hand, stateful designs that coordinate using etcd, Redis, or internal consensus protocols do achieve low latency, but tend to be too complex and, ultimately, incorrect or unreliable. Older versions of our systems ran into both problems.
To overcome them, we built an opinionated sharding service, the Work Distribution Service (WDS), designed around exclusive in-memory ownership with dynamic explicit assignments, load balancing, and client-controlled routing logic. Today, our most critical, high-performance kitchen systems rely on WDS under the hood.
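
To make those terms concrete, here is a minimal toy sketch in Python of what exclusive ownership, explicit assignments, load balancing, and client-side routing can look like together. It is not the WDS API: `AssignmentTable`, `shard_for`, `route`, and `rebalance` are hypothetical names invented for illustration, and the round-robin balancing and hash-based key-to-shard mapping are assumptions rather than a description of how WDS actually works.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AssignmentTable:
    """Toy table of explicit shard -> owner assignments (hypothetical, not WDS)."""
    num_shards: int
    owners: Dict[int, str] = field(default_factory=dict)

    def assign(self, shard: int, worker: str) -> None:
        # Exclusive ownership: a shard may have at most one owner at a time.
        current = self.owners.get(shard)
        if current is not None and current != worker:
            raise ValueError(f"shard {shard} is already owned by {current}")
        self.owners[shard] = worker

    def revoke(self, shard: int) -> None:
        # An assignment must be revoked before the shard can move elsewhere.
        self.owners.pop(shard, None)

    def load(self) -> Dict[str, int]:
        # Number of shards currently held by each worker.
        counts: Dict[str, int] = {}
        for worker in self.owners.values():
            counts[worker] = counts.get(worker, 0) + 1
        return counts


def shard_for(key: str, num_shards: int) -> int:
    # Assumed key -> shard mapping: a stable hash, so every client agrees.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def route(table: AssignmentTable, key: str) -> Optional[str]:
    # Client-side routing: the caller resolves the owner from the table itself.
    return table.owners.get(shard_for(key, table.num_shards))


def rebalance(table: AssignmentTable, workers: List[str]) -> None:
    # Naive load balancing (an assumption): spread shards round-robin.
    for shard in range(table.num_shards):
        table.revoke(shard)
        table.assign(shard, workers[shard % len(workers)])
```

In this sketch, a client that wants to reach the state for a given key calls `route(table, key)` and talks directly to the returned worker, which holds that shard's state in memory. A real system would also need to handle worker failure and in-flight requests during reassignment, which this toy omits.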