Logging at Zerodha

At Zerodha, we run a multitude of internal and public-facing services that generate copious amounts of logs. While developers use these logs to debug or troubleshoot incidents, some services also emit logs that must be persisted for prolonged periods to comply with numerous regulatory requirements. In this post, I will delve into our experiences with the ELK stack, why it no longer fit our needs, and our migration to ClickHouse.

In 2018, we adopted the ELK stack as our de facto stack for storing application logs. Using Filebeat, we shipped logs from EC2 instances to a central Logstash server. Logstash served as the aggregation layer, where we applied multiple filters to add metadata, scrub sensitive data, and parse the logs before dumping them into various indices in Elasticsearch. Kibana was the default query layer for writing queries and fetching results.
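To make the shape of that pipeline concrete, here is a minimal Logstash configuration along those lines. It is an illustrative sketch, not our actual config: the grok pattern, field names, and index naming scheme are assumptions for the example.

```
# Hypothetical, simplified pipeline of the kind described above.

input {
  beats {
    port => 5044                                 # Filebeat on EC2 instances ships logs here
  }
}

filter {
  grok {
    # Parse a plain-text application log line into structured fields (pattern is illustrative)
    match => { "message" => "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}" }
  }
  mutate {
    add_field    => { "environment" => "production" }   # attach metadata
    remove_field => ["auth_token"]                       # scrub sensitive fields (example field name)
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"           # daily indices, queried via Kibana
  }
}
```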

Back in 2018, we handled less than 1/10th of our current traffic, so a single-node Elasticsearch worked well for us. It still required a fairly large EC2 instance because of the JVM's heavy memory requirements, but we hosted it without facing many challenges. Post-2020, however, as our traffic volumes spiked and our user base grew much faster, the underlying logging system started to choke.
