It's always DNS . . . except when it's not: A deep dive through gRPC, Kubernetes, and AWS networking

submitted by
Style Pass
2024-04-25 20:00:38

This story began when a routine update to one of our critical services caused a rise in errors. It looked like a simple issue: logs pointed to DNS, and our metrics indicated that the impact on users was very low. But weeks later, our engineers were still puzzling over dropped packets, looking for clues in kernel code, and exploring the complexities of Kubernetes networking and gRPC client reconnect algorithms. No single team was able to fully understand the issue from its own vantage point.

In this post, we’ll tell the story of how we investigated and ultimately resolved this incident, and we’ll share some of the knowledge we gained along the way.

Beginning in September 2021, we saw increased errors when we rolled out updates to our metrics query service, which is responsible for retrieving real-time and historical metric data from our data stores. It makes these metrics available to multiple clients, including our frontend web application and monitor evaluation clients, which use that data to determine whether to alert.

Clients automatically retry failed queries, so user-facing errors were minimized. But the retries introduced latency that affected the performance of dashboards and monitors.
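To make the trade-off concrete, here is a minimal sketch of how client-side retries can hide errors while adding latency. It is not the actual client code: the names `query_with_retries`, `flaky`, and `QueryError`, the three-attempt limit, the fixed 100 ms backoff, and the 50 ms cost per attempt are all assumptions chosen for illustration, and time is simulated rather than measured.

```python
from dataclasses import dataclass
from typing import Callable


class QueryError(Exception):
    """Raised by the hypothetical query function when an attempt fails."""


@dataclass
class Outcome:
    value: str
    attempts: int
    latency_ms: float  # simulated end-to-end latency seen by the caller


def query_with_retries(
    attempt_fn: Callable[[], str],
    attempt_cost_ms: float = 50.0,  # assumed time spent on each attempt
    backoff_ms: float = 100.0,      # assumed fixed wait between attempts
    max_attempts: int = 3,          # assumed retry budget
) -> Outcome:
    """Retry a failing query and track the latency the retries add."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    latency_ms = 0.0
    for attempt in range(1, max_attempts + 1):
        latency_ms += attempt_cost_ms
        try:
            return Outcome(attempt_fn(), attempt, latency_ms)
        except QueryError:
            if attempt == max_attempts:
                raise  # retry budget exhausted: the error reaches the user
            latency_ms += backoff_ms  # wait before the next attempt
    raise AssertionError("unreachable")


def flaky(failures: int) -> Callable[[], str]:
    """Build a fake query that fails `failures` times, then succeeds."""
    remaining = [failures]

    def attempt() -> str:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise QueryError("connection reset")
        return "ok"

    return attempt


healthy = query_with_retries(flaky(0))
degraded = query_with_retries(flaky(2))
print(healthy.latency_ms, degraded.latency_ms)  # prints: 50.0 350.0
```

In this sketch both calls succeed from the caller's point of view, but the one that needed two retries takes seven times as long, which is the kind of slowdown a dashboard or monitor evaluation would notice even when no error is surfaced.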
