On May 6th, 2022, between 1:30 AM PDT and 12:17 PM PDT, Sentry experienced a large-scale incident that left the majority of our services inaccessible for 6+ hours. This had the following potential impact for customers:
The root cause of this incident was an issue within our primary Google Cloud Platform compute region that affected persistent volumes attached to our compute infrastructure. In this post we’ll share more details about how our SRE team handled the incident and our plans to increase our resilience to this class of issues.
At 1:38 AM PDT on May 6th, our Site Reliability Engineering (SRE) team was paged for multiple distinct alerts across various services and infrastructure. The team quickly identified that numerous persistent volumes (PVs) across our Google Cloud Platform (GCP) compute infrastructure were experiencing abnormal levels of IO wait, with IO latencies exceeding several minutes in some cases.
By 3:00 AM PDT, much of Sentry’s core infrastructure in GCP was either heavily degraded due to IO performance or inaccessible due to the expanding scope of the GCP incident. These issues affected everything from our remote bastion access to our Kubernetes control plane, and greatly hampered the SRE team’s attempts to mitigate the overall incident impact.