
Go Production Performance Gotcha - GOMAXPROCS

2024-11-26 12:30:04

Recently we had an unexpected performance degradation in one of our Go workloads, to the tune of nearly 2x the expected CPU usage. It turned out the issue was that we didn't explicitly set the GOMAXPROCS Go runtime parameter. We're going to run through the investigation, what we found, and how we fixed it.

Metoro is an observability platform for systems running in Kubernetes. In order to collect telemetry about the cluster and the workloads running in it, we deploy a daemonset to monitored clusters. This daemonset creates a pod on each host, called the node-agent, which collects information about workloads and sends it out of the cluster to be stored.

The node-agent instruments a number of kernel operations through eBPF to automatically generate distributed traces and other telemetry. This means the CPU usage of the node-agent scales with the number of requests being made to and from pods on a node. Typically a node-agent can expect to use around 1 second of CPU time (on up-to-date EC2 hosts) to process 12,000 HTTP requests.

We were deploying to a new customer cluster when we noticed that some of our node-agents were using much more CPU than we expected. Hosts in this cluster were processing up to 200,000 requests per minute, so in a single minute we should expect the node-agent to use around 17 seconds of CPU time (28% of one core) to process these requests. Instead, we noticed that the node-agent was using 30 seconds of CPU time (50% of one core), nearly 2x what we would expect for that workload.
