In distributed systems, there’s plenty of occasions for things to go wrong. This is why resiliency and redundancy are important. But no matter the systems you put in place, no matter whether you did or didn’t touch your deployments, issues might arise. It makes it critical to acknowledge the near misses: the situations where something could have gone wrong and the situations where something did, but it could have been worse. When was the last time it happened to you? For us, at Shopify, it was on September 30th, 2021, when the expiration of Let’s Encrypt’s (old) root certificate almost led to a global outage of our platform.
In April 2021, Let’s Encrypt announced that the former root certificate was expiring. As we use Let’s Encrypt as our public certificate provider since we became a sponsor in 2016, we made sure that Shopify’s edge infrastructure was up to date with the different requirements, so we wouldn’t stop serving traffic to all of (y)our beloved shops. As always, Let’s Encrypt did their due diligence with communications and by providing a cross-signing of their new root certificate by the old one. What this means is that while clients didn’t trust the new root certificate yet, because that new root certificate was signed by the old one, they trusted the old one and would transfer their trust to the new one. Also, the period of time between the announcement and the expiration was sufficient for any Let’s Encrypt-emitted certificates, which expire after three months, to be signed by the new cross-signed root certificate and considered valid using any of the old or new root certificates. We didn’t expect anything bad to happen on September 30th, 2021, when the root certificate was set to expire at 10:00 a.m. Eastern Standard Time.
At 10:15 am that same day, our monitors started complaining about certificate errors, but not at Shopify’s edge—that is, between the public and Shopify—but between our services. As a member of the traffic team of Shopify that handles a number of matters related to bringing traffic safely and reliably into Shopify’s platform (including the provisioning and handling of certificates), I joined the incident response to try and help figure out what was happening. Our first response was to lock the deployments of the Shopify monolith (using spy, our chatops) while some of us connected to containers to try and figure out what was happening in there. In the meantime, we started looking at the deployments that were happening when this issue started manifesting. It didn’t make any sense as those changes didn’t have anything to do with the way services interconnected, nor with certificates. This is when the Let’s Encrypt root certificate expiry started clicking in our minds. An incident showing certificate validity errors happening right after the expiry date couldn’t be a coincidence, but we couldn’t reproduce the error in our browsers or even using curl. Using openssl, we could, however, observe the certificate expiry for the old root certificate: