Complex systems are difficult to reason about at scale; we often can’t accurately extrapolate system behavior and performance, so we need to derive that data empirically. We use load testing to do just that: find the limits of our systems and weed out bugs at scale in a controlled environment. Slack is a pretty complex system: whether you’re triggering a workflow for thousands of members or uploading a file into a thread, everything is interconnected! The technology that lets a user send a message and have it instantly appear in potentially millions of clients is very challenging to build and to test at scale. To adequately load test our systems, we needed to build a tool that mirrored actual user traffic and behavior both realistically and cost-efficiently.
As an example, a seemingly simple message goes through quite the journey when it is sent. First, a connected client calls the Web API method chat.postMessage. The request goes to our backend, which queries our real-time services stack for a timestamp and writes the message to the messages table in Vitess. The backend also kicks off an asynchronous processing job (which handles things like expanding previews for links and attachments) and finally sends an event through our real-time services stack to all the connected clients in the channel via their active websocket connections.
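
To make that journey concrete, here is a minimal in-memory sketch of the same steps in Python. It is an illustration only, not our actual implementation: the class names (RealTimeStack, Backend), the counter-based timestamp, the job tuple, and the event shape are all hypothetical stand-ins, and plain lists take the place of Vitess, the job queue, and websocket connections.

```python
import itertools
from collections import defaultdict, deque


class RealTimeStack:
    """Hypothetical stand-in for the real-time services stack."""

    def __init__(self):
        self._counter = itertools.count(1)    # assumption: a simple counter as the timestamp
        self.connections = defaultdict(list)  # channel -> inboxes of connected clients

    def next_timestamp(self):
        return next(self._counter)

    def connect(self, channel):
        """Simulate a client opening a websocket for a channel; returns its inbox."""
        inbox = []
        self.connections[channel].append(inbox)
        return inbox

    def broadcast(self, channel, event):
        """Deliver an event to every client connected to the channel."""
        for inbox in self.connections[channel]:
            inbox.append(event)


class Backend:
    """Hypothetical stand-in for the backend that serves chat.postMessage."""

    def __init__(self, realtime):
        self.realtime = realtime
        self.messages_table = []  # stand-in for the messages table in Vitess
        self.job_queue = deque()  # stand-in for the asynchronous job queue

    def post_message(self, channel, user, text):
        ts = self.realtime.next_timestamp()                      # 1. ask for a timestamp
        row = {"channel": channel, "user": user, "text": text, "ts": ts}
        self.messages_table.append(row)                          # 2. persist the message
        self.job_queue.append(("process_message", channel, ts))  # 3. enqueue async processing
        self.realtime.broadcast(channel, {"type": "message", **row})  # 4. fan out to clients
        return row


realtime = RealTimeStack()
backend = Backend(realtime)
alice_inbox = realtime.connect("C1")
bob_inbox = realtime.connect("C1")
other_inbox = realtime.connect("C2")  # connected to a different channel
row = backend.post_message("C1", "alice", "hello")
```

Even in this toy version, one API call touches three separate subsystems and fans out to every connected client in the channel, which is why extrapolating behavior at scale is hard and why a load test has to exercise the whole path.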