Scaling Prometheus to 10 million metrics per second

2021-05-17 12:50:42

During my time at Hotstar, we wanted to improve our monitoring systems ahead of the 2021 IPL season (the Super Bowl of India).

The existing setup had Prometheus pods running across our k8s clusters, and each of these Prometheus setups had its own Grafana.

Our existing Prometheus setup was already using more than 60 CPU cores in just one cluster (more than 200 CPU cores across all clusters), and scaling that up to an even bigger node wouldn’t make sense, as we wanted a solution that would last us for the next two seasons.

So we looked at other open-source solutions and found VictoriaMetrics, a drop-in replacement for Prometheus, which meant we didn’t need to change our internal tooling or existing Prometheus rules.

We ran internal benchmarks using tsbs and found that, to ingest 10M metrics/s, one of the most important factors was distributing writes across multiple vmstorage pods. In our case, 11 pods did the job. You also need to provision additional resources for vmstorage, since queries consume resources there as well.
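
To make the idea of spreading writes concrete, here is a minimal Python sketch of hash-based sharding across 11 storage pods. It is an illustration only, not VictoriaMetrics' actual routing code: the function names (series_key, pick_pod, per_pod_rate), the SHA-256 hash, and the sample metric and labels are all assumptions made for this example. The only figures taken from the text above are the 11 pods and the 10M metrics/s target.

```python
import hashlib
from collections import Counter

NUM_STORAGE_PODS = 11       # pod count mentioned in the post
TARGET_RATE = 10_000_000    # target ingest rate in metrics per second


def series_key(name, labels):
    """Build a canonical string for a series; labels are sorted so order does not matter."""
    parts = [name] + [f"{k}={v}" for k, v in sorted(labels.items())]
    return ",".join(parts)


def pick_pod(name, labels, num_pods=NUM_STORAGE_PODS):
    """Map a series to a pod index by hashing its key (hypothetical routing rule)."""
    digest = hashlib.sha256(series_key(name, labels).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_pods


def per_pod_rate(total_rate=TARGET_RATE, num_pods=NUM_STORAGE_PODS):
    """Average write rate each pod sees if the load is spread evenly."""
    return total_rate / num_pods


if __name__ == "__main__":
    # Simulate many distinct series and count how many land on each pod.
    counts = Counter(
        pick_pod("http_requests_total", {"pod": f"app-{i}", "cluster": "c1"})
        for i in range(110_000)
    )
    for pod in sorted(counts):
        print(f"pod {pod:2d}: {counts[pod]} series")
    print(f"average per-pod rate: {per_pod_rate():,.0f} metrics/s")
```

With an even spread, each of the 11 pods handles roughly 909,000 metrics/s instead of one node absorbing the full 10M, which is why adding storage pods scales ingestion. Query load comes on top of that per-pod write load, so the headroom mentioned above still has to be budgeted separately.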