This is one of those topics that hit experienced Postgres DBAs badly. So badly that hypothetical Socrates of 21th century could be one of those Postgr

Zero-downtime Postgres schema migrations need this: lock_timeout and retries

submited by
Style Pass
2021-09-27 10:00:07

This is one of those topics that hit experienced Postgres DBAs badly. So badly that hypothetical Socrates of 21th century could be one of those Postgres DBAs and the words "I know that I know nothing" would sound natural. I've seen dozens of projects with busy Postgres setups serving lots of TPS, in mission-critical systems in zillion-dollar companies, where database engineers were thinking that they are experienced and know what they are doing – and then suddenly this, quite a basic topic, made them say "ouch" and quickly fix their DDL deploy systems. Well, let's be frank: I was in this position myself, learning this after more than 10 (!) years of Postgres experience.

I'm sure some of you know this very well – if so, scroll down to see some bits of advanced material on the matter. However, I'm 100% sure that many of my readers will be really surprised right now.

We won't talk about "how to change a column's data type" or "how to add a foreign key" – those questions are all interesting too, and there are efforts to document and automate each of such steps for heavily-loaded systems with strict uptime requirements (a great example here is GitLab's "Migration Style Guide"). Instead, we will discuss something that affects any Postgres setup where schema needs to be changed from time to time, where downtime is considered as a huge problem, where DDLs are automated in one way or another (I mean DB migration tools such as Flyway, Sqitch, Liquibase, Ruby on Rails AR DB Migrations, and so on), but where DDLs are deployed without certain trick, so downtime can happen suddenly and unpredictably. Even if TPS numbers are not big. Without that trick in place, anyone using Postgres can (and will) hit that wall one day. Therefore, any engineer working with Postgres should know this trick and, perhaps, implement it in all systems to prevent downtime.

Leave a Comment