We use ClickHouse, a popular columnar database to back our Mux Data Monitoring dashboard. ClickHouse is optimized to handle ingesting and aggregating

Latency and throughput tradeoffs of ClickHouse Kafka Table Engine

submited by

Style Pass

2024-11-29 23:00:03

We use ClickHouse, a popular columnar database to back our Mux Data Monitoring dashboard. ClickHouse is optimized to handle ingesting and aggregating large amounts of data for analytics workloads. But even with its strengths, our team encountered significant performance bottlenecks in our ClickHouse cluster, particularly during the ingestion of high-volume real-time data through Kafka.

These performance issues manifested during periods of increased query load, as one might expect. Our Monitoring Dashboard became visibly delayed and response times climbed from 100ms to upwards of 10s as the number of queries grew. What baffled our team, however, was that this performance degradation occurred despite the cluster showing only 60% CPU utilization. With on-call engineers stressed and worn down by what seemed like weekly incidents, we had an urgent problem on our hands - one that throwing more hardware at would not fix.

This blog post outlines how we identified and fixed this bottleneck, which turned out to be a classic trade-off between latency and throughput. For those working with ClickHouse Kafka Table Engine or facing similar challenges, we describe a method to measure the latency of ingestion —a metric that, at the time of writing, ClickHouse does not expose out of the box. Measuring the latency in this way was crucial for us to elucidate the problem. We hope it helps you too.