Recently, Vijay Chidambaram (a CS professor at UT Austin) asked me, “Why do so many outages involve configuration changes?” I didn’t

Why do config changes keep coming up in major incidents?

submited by

Style Pass

2021-05-31 07:00:07

Recently, Vijay Chidambaram (a CS professor at UT Austin) asked me, “Why do so many outages involve configuration changes?”

I didn’t have a good explanation for him, and I still don’t. I’m using this post as an exercise of thinking out loud about possible explanations for this phenomenon.

It might be that config changes are not somehow more dangerous, it just seems like they are. Perhaps we only notice the writeups where a config change is mentioned, but we don’t remember the writeups that don’t involve a config change. Or perhaps it’s a base rate illusion, where config changes tend to be involved in incidents more often than code changes simply because config changes are more common than code changes.

For many of Salesforce’s systems, the deployment pipelines have built-in stagger and canary requirements that are automated. For Salesforce’s DNS systems, the automation and enforcement of staggering through technology is still being built. For this configuration change and script, the stagger process was still manual.