As a fintech organization, we move billions of dollars every month. Our customers and merchants have entrusted us with a paramount responsibility. To support our ever-growing business, a robust observability stack is not just a “nice to have” but absolutely essential. And all of this starts with better monitoring and metrics.
Over the past few years, Prometheus has become the de facto platform for metric collection. Razorpay was perhaps one of the first companies in the country to run a production-grade Kubernetes infrastructure (around late 2016). We wanted applications and their associated infrastructure to be able to emit and consume metrics with very little friction. As our traffic grew, we had to revisit some of our fundamental assumptions and find ways to improve our metrics and observability platform. This article lays out some of the challenges we faced and our journey towards building a robust metric collection and alerting platform, and offers insights into how we are constantly improving it.
Older technologies such as Nagios and Icinga have traditionally been used to monitor applications and infrastructure. Unfortunately, none of them is container-aware, and the dynamic nature of Kubernetes requires a monitoring system that can detect changes in the infrastructure. We also had a strong requirement to run the monitoring infrastructure in-house. By 2018, we had more than 30 services in production on AWS.