InfoQ Homepage News Stripe Rearchitects Its Observability Platform with Managed Prometheus

Stripe Rearchitects Its Observability Platform with Managed Prometheus and Grafana on AWS

submitted by

Style Pass

2024-11-27 23:30:02

Stripe replaced its observability platform, which used a third-party vendor solution, with a new architecture utilizing managed services on AWS. The company made the move due to scalability limits, reliability issues, and increasing costs while transitioning to microservices. The migration involved dual-writing metrics, translating assets, validation, and user training.

After adopting microservices, Stripe’s architecture generated around 300 million metrics and 40,000 alerts, along with 100,000 dashboard queries issued by seven thousand employees. At that scale, the preexisting observability platform began to struggle, leading to scalability and reliability problems and rising costs.

The company decided to adopt a solution offering higher capacity and better cost efficiency, and chose Amazon Managed Service for Prometheus (AMP) and Grafana. During the migration, the transitional metrics architecture consisted of several components. Metrics collected from compute hosts and scraped from Kubernetes clusters were delivered to the aggregation layer. The Egress Proxy ingested metrics from the aggregation layer and Amazon CloudWatch into AMP. Additionally, aggregated and unaggregated metrics were sent to the legacy time-series database to support the migration.
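
The dual-write flow described above can be illustrated with a minimal sketch. It is not Stripe's implementation: the `Sample`, `InMemorySink`, `aggregate`, and `dual_write` names are hypothetical, the sinks are in-memory stand-ins for AMP and the legacy time-series database, and aggregating by dropping a `host` label is an assumption made only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Sample:
    """One metric data point (hypothetical shape, not a real wire format)."""
    name: str
    labels: tuple  # (key, value) pairs kept as a tuple so samples are hashable
    value: float


@dataclass
class InMemorySink:
    """Stand-in for a metrics backend; a real sink would send data over the network."""
    name: str
    received: list = field(default_factory=list)

    def write(self, samples):
        self.received.extend(samples)


def aggregate(samples, drop_label="host"):
    """Sum samples that share a name and labels once `drop_label` is removed.

    Dropping a per-host label is an assumed aggregation rule for this sketch.
    """
    totals = defaultdict(float)
    for s in samples:
        kept = tuple((k, v) for k, v in s.labels if k != drop_label)
        totals[(s.name, kept)] += s.value
    return [Sample(name, labels, value) for (name, labels), value in totals.items()]


def dual_write(raw, new_backend, legacy_backend):
    """Send aggregated metrics to the new backend, and both forms to the legacy one."""
    aggregated = aggregate(raw)
    new_backend.write(aggregated)      # new platform receives aggregated metrics
    legacy_backend.write(aggregated)   # legacy store receives aggregated metrics...
    legacy_backend.write(raw)          # ...and the unaggregated ones as well
    return aggregated


raw = [
    Sample("http_requests", (("host", "a"), ("service", "api")), 3.0),
    Sample("http_requests", (("host", "b"), ("service", "api")), 5.0),
]
new_sink, legacy_sink = InMemorySink("new"), InMemorySink("legacy")
dual_write(raw, new_sink, legacy_sink)
```

Writing to both backends during the transition lets the same data be queried from each side, which is what makes the validation step mentioned earlier possible before the legacy store is retired.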