Here’s an error that no developer wants to see in their Postgres server logs: server process was terminated by signal 11: Segmentation fault. When our team was paged about one of our Postgres clusters crashing with this error, we were quite surprised, and quickly activated our incident response. The failing cluster was a new workload freshly deployed on our reliable internal Postgres-on-Kubernetes platform. What could possibly be causing this new and scary failure mode? Why was Postgres crashing with segmentation faults?
The problem was manifesting on multiple EC2 nodes, so we quickly ruled out hardware issues. We had core dumps from the crashes that appeared corrupted but provided little hint as to why. Looking more broadly, we correlated our query logs and pinpointed a query pattern that was occurring right before each crash. Following this lead, we jumped into a psql terminal to test the suspicious query and indeed found that running it triggered a segfault. We were taken aback to discover that we had a Postgres cluster—up to date with the latest minor version—that was crashing after running an innocuous-looking “query of death”!
Ultimately, by successfully isolating and debugging the crashes down to the assembly level, we were able to identify an issue in Postgres JIT compilation, and more specifically a bug in LLVM for Arm64 architectures. This post describes our investigation into the root cause of these crashes, which took us on a deep dive into JIT compilation and led us to an upstream fix resolution.