At Runway, we've been using Kubernetes to run our production workloads since day one. Over nearly five years, we've operated, (re)created, and upgraded our K8s clusters at least a dozen times. Our last upgrade was the hardest, as we had to modernize some of our oldest resources and third-party software deployed in the cluster, without causing application downtime. This post presents a few tips and strategies we learned during the process.
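One typical example of this kind of modernization (illustrative, not necessarily the exact resources we had to update): Kubernetes releases periodically remove deprecated API versions, so old manifests must be migrated before the control plane is upgraded. A Deployment written years ago against `extensions/v1beta1`, an API group removed in Kubernetes 1.16, has to be rewritten against the stable `apps/v1` API, which also makes `spec.selector` a required field:

```yaml
# Old manifest: extensions/v1beta1 Deployments were removed in Kubernetes 1.16.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: example-app   # hypothetical workload
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: example/app:latest
---
# Updated manifest: same workload, served by the stable apps/v1 API.
# apps/v1 requires an explicit spec.selector matching the pod template labels.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: example/app:latest
```

Finding and migrating every manifest like this ahead of time is what makes it possible to upgrade without application downtime.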
We hope this one goes without saying, but you should perform your upgrades in a staging cluster first. If you don't have a safe pre-production cluster to prepare and test your upgrade in, you run a greater risk of breaking something in production.
The approach we've taken at Runway is to maintain one staging cluster for each production cluster (e.g. stage-aws1, prod-aws1, stage-aws2, etc.). The production and staging cluster environments are intended to be as identical as is reasonably and economically possible. Our staging clusters give us a safe place both to test changes before they reach production and to rehearse risky cluster operations like upgrades.
There is no shame in finding yourself without a non-production cluster to try an upgrade on first, but we strongly suggest you change that. We've found that managing our clusters via a Terraform module simplifies the process of creating new clusters and keeping existing ones in sync over time.
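As a sketch of what that can look like (the module name, path, and variables below are illustrative, not Runway's actual configuration), a shared Terraform module defines the cluster once, and each environment instantiates it with its own settings, so staging and production stay structurally identical and a version bump can be applied to staging first:

```hcl
# Hypothetical per-environment instantiation of a shared cluster module.
module "cluster" {
  source = "../modules/k8s-cluster"  # illustrative path to the shared module

  name               = "stage-aws1"
  kubernetes_version = "1.27"        # bump here, apply to staging, then repeat for prod
  node_instance_type = "m5.large"
  node_count         = 3
}
```

Because every cluster is an instance of the same module, drift between staging and production shows up as a diff in code review rather than as a surprise during the production upgrade.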