Amazon’s Aurora MySQL is Plaid’s primary relational database. It's proven to be a reliable and scalable storage layer that backs some of our

Exploring performance differences between Amazon Aurora and vanilla MySQL

submited by

Style Pass

2021-06-17 15:30:04

Amazon’s Aurora MySQL is Plaid’s primary relational database. It's proven to be a reliable and scalable storage layer that backs some of our most critical systems.

Back in 2019, however, we noticed severe performance degradation on the primary instance in one of our Aurora clusters. Our oncall runbooks suggested checking load on the primary instance using our Prometheus metrics and RDS performance insights, yet everything looked normal. We kept digging, and eventually found a clue: an errant long-running transaction had been open on one of our read replicas for weeks.

Surprisingly, killing the transaction mitigated the performance degradation—but we didn’t understand why. Soon after, we began to notice echoes of the same issue. Every time an Airflow job would ETL data from one of our read replicas to our data warehouse, we would see a small performance hit. This behavior seemed to violate one of our core assumptions about our system based on our knowledge of non-Aurora MySQL setups: it should not be possible, we thought, for read replica load to have any impact on writer performance.

It was clear that we needed to read the manual: we dug into the Aurora DB design paper, Aurora documentation, MySQL documentation, and several articles on databases. We were eventually able to identify the cause of the degradation and a handful of best practices to avoid it. But before we get to the solution, let's get back to databases 101.