How long does it take to read the time? How would you time time? These strange questions came to the fore back in 2014 when Netflix was switching services from CentOS Linux to Ubuntu, and I helped debug several weird performance issues including one I'll describe here. While you're unlikely to run into this specific issue anymore, what is interesting is this type of issue and the simple method of debugging it: a pragmatic mix of observability and experimentation tools. I've shared many posts about superpower observability tools, but often humble hacking is just as effective.
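To make the opening question concrete: one humble way to time time is to read the clock in a tight loop and divide the elapsed time by the number of reads. The following is a minimal Python sketch of that idea, not the method used in this investigation; the function name `avg_clock_read_ns` and the iteration count are illustrative assumptions.

```python
import time

def avg_clock_read_ns(iterations=1_000_000):
    """Estimate the average cost of one clock read, in nanoseconds.

    Hypothetical helper for illustration. The result includes Python's
    loop and call overhead, so treat it as an upper bound.
    """
    read_clock = time.time  # local alias avoids an attribute lookup per iteration
    start = time.perf_counter_ns()
    for _ in range(iterations):
        read_clock()
    elapsed = time.perf_counter_ns() - start
    return elapsed / iterations

if __name__ == "__main__":
    print(f"~{avg_clock_read_ns():.0f} ns per time.time() call (including loop overhead)")
```

Because interpreter overhead is baked into the number, a sketch like this is most useful for comparing the same script across two systems rather than for reading an absolute cost.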
After a Cassandra database cluster switched to Ubuntu, its write latency increased by over 30%. A quick check of basic performance statistics showed over 30% higher CPU consumption. What on Earth is Ubuntu doing that results in 30% higher CPU time!?
The Cassandra systems were EC2 virtual machine (Xen) instances. I logged into one and went through some basic CLI tools to get started (my 60-second checklist).