You Will (Not) Scale Prometheus | probes

submitted by
Style Pass
2024-09-28 09:00:07

To aid a memory-obese Prometheus, I spent the last couple of months helping to slowly shift a cluster over to Prometheus agent mode, sending data to Thanos Receive. I have now personally given up on this goal for a variety of reasons, and this post explores why.

The 3 components replacing Prometheus (agent, receive, ruler) should, in theory, each have better scaling characteristics on their own, with cleaner and more delineated areas of responsibility.

Why chase better scaling characteristics? A single Prometheus grows in size and request load with the number of time series it scrapes, and it can only grow as long as you have enough RAM available. Eventually you run out of super-sized cloud nodes to run it on. I have personally had to provision a 300GB memory node during a cardinality explosion, and would like to not deal with this ticking time bomb in the future.
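To make the ceiling concrete, here is a back-of-the-envelope sketch. It assumes memory grows roughly linearly with the number of active series, which is a simplification. The per-series cost used below (`ASSUMED_BYTES_PER_SERIES`) is a made-up placeholder, not a measured Prometheus figure, and the function names are hypothetical. Plug in whatever your own instance actually shows.

```python
# Rough capacity arithmetic for a single, vertically scaled Prometheus.
# ASSUMPTION: memory is proportional to active series. Real usage also depends
# on label sizes, churn, queries and scrape interval, so treat this as a sketch.

GIB = 2**30

# Hypothetical placeholder, NOT a measured value. Replace with your own number.
ASSUMED_BYTES_PER_SERIES = 8 * 1024


def estimate_memory_gib(active_series: int, bytes_per_series: float) -> float:
    """Linear estimate of memory (GiB) needed for a given number of active series."""
    return active_series * bytes_per_series / GIB


def max_series_for_node(node_memory_gib: float,
                        bytes_per_series: float,
                        headroom: float = 0.25) -> int:
    """Series that fit on a node while keeping a fraction of RAM free as headroom."""
    usable_bytes = node_memory_gib * (1 - headroom) * GIB
    return int(usable_bytes // bytes_per_series)


if __name__ == "__main__":
    # How far does a 300 GiB node get you under the assumed per-series cost?
    ceiling = max_series_for_node(300, ASSUMED_BYTES_PER_SERIES)
    print(f"series ceiling on a 300 GiB node: {ceiling:,}")

    # A cardinality explosion that doubles the series count doubles the bill.
    print(f"memory at 2x that ceiling: "
          f"{estimate_memory_gib(2 * ceiling, ASSUMED_BYTES_PER_SERIES):.0f} GiB")
```

Whatever the real per-series number is, the shape stays the same: the ceiling is set by the largest node you can buy, and a cardinality explosion moves you toward it in one step.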

While the original setup can hardly be considered trivial, splitting one component into 3 sounds like a simple addition in theory.
