An overview of how we investigated and solved the issue of some Kubernetes Pods running MySQL starting up and shutting down slower than other similar Pods with the same data set.
KateSQL is Shopify’s custom-built Database-as-a-Service platform. It runs on top of Google Cloud’s Kubernetes Engine (GKE) and currently manages several hundred production MySQL instances across different Google Cloud regions and many GKE clusters.
Earlier this year, we found a performance-related issue with KateSQL: some Kubernetes Pods running MySQL would start up and shut down slower than other similar Pods with the same data set. This partially impaired our ability to replace MySQL instances quickly when executing maintenance tasks like config changes or upgrades. While investigating, we found several factors that could be contributing to this slowness.
The root cause was a bug in the Linux kernel memory cgroup controller. This post provides an overview of how we investigated the root cause and leveraged Shopify’s partnership with Google Cloud Platform to help us mitigate it.