Efficient and Reliable Compute Cluster Management at Scale | Uber Engineering Blog

Submitted by Style Pass, 2021-07-12 06:30:06

Uber relies on a containerized microservice architecture. Our need for computational resources has grown significantly over the years as a consequence of the business's growth, so increasing the efficiency of our computing resources is now an important goal. Broadly speaking, the efficiency efforts in compute cluster management involve scheduling more workloads on the same number of machines. This approach is based on the observation that the average CPU utilization of a typical cluster is far lower than the CPU resources allocated to it. The approach we have adopted is to overcommit CPU resources without compromising the reliability of the platform, which we achieve by maintaining a safe headroom at all times. A complementary approach, which we also pursue, is to reduce the allocations of overprovisioned services. The benefit of overcommitment is that it frees up machines to run non-critical, preemptible workloads without purchasing extra machines.
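To make the idea of overcommitting with a safe headroom concrete, here is a minimal sketch of the arithmetic. It is not our production logic: the `HostCpu` type, the function names, and the 20% headroom default are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class HostCpu:
    """Hypothetical per-host CPU summary; field names are illustrative."""
    cores: float      # physical cores on the host
    allocated: float  # cores reserved by containers on the host
    peak_used: float  # observed peak usage, in cores


def allocation_slack(host: HostCpu) -> float:
    """Cores that are reserved but were not used even at peak."""
    return max(0.0, host.allocated - host.peak_used)


def overcommittable_cores(host: HostCpu, headroom_fraction: float = 0.2) -> float:
    """Cores that could be offered to preemptible work.

    The headroom fraction (an assumed value here) is kept free at all
    times, so only capacity below the resulting ceiling is offered.
    """
    safe_ceiling = host.cores * (1.0 - headroom_fraction)
    return max(0.0, safe_ceiling - host.peak_used)
```

For example, a 32-core host with 30 cores allocated and a peak usage of 12 cores has 18 cores of allocation slack. With a 20% headroom, roughly 13.6 of those cores could be offered to preemptible workloads.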

To achieve this, we need a system that provides a real-time view of CPU utilization for all hosts and all containers across all clusters. This system runs in production across all of our clusters and is internally referred to as cQoS (Container Quality of Service). cQoS enables the scheduler to make telemetry-aware scheduling decisions, such as load-aware placement of tasks, proactive elimination of hotspots in the cluster, and load-aware scaling of the cluster size. In addition to helping with efficient resource utilization, such a system also helps with container performance analysis: the per-container metrics help identify performance issues related to uneven load balancing and container right-sizing.
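As a rough illustration of what telemetry-aware decisions can look like, the sketch below picks the least-loaded host for a new task and flags hotspots from a utilization snapshot. This is not the cQoS interface, which is not described here. The function names, the dictionary-based snapshot, and the 0.8 and 0.9 thresholds are assumptions for illustration only.

```python
from typing import Dict, List, Optional


def pick_host(
    utilization: Dict[str, float],
    task_demand: float,
    ceiling: float = 0.8,
) -> Optional[str]:
    """Load-aware placement over a utilization snapshot.

    `utilization` maps a host name to its CPU utilization as a fraction
    of host capacity. Returns the least-loaded host that would stay at or
    below `ceiling` after taking the task, or None if no host fits.
    """
    candidates = {
        host: used
        for host, used in utilization.items()
        if used + task_demand <= ceiling
    }
    if not candidates:
        return None
    # Break ties by host name so the choice is deterministic.
    return min(candidates, key=lambda host: (candidates[host], host))


def hotspots(utilization: Dict[str, float], threshold: float = 0.9) -> List[str]:
    """Hosts whose utilization exceeds the (assumed) hotspot threshold."""
    return sorted(host for host, used in utilization.items() if used > threshold)
```

Given a snapshot such as `{"a": 0.5, "b": 0.3, "c": 0.95}`, a task needing 0.2 of a host would be placed on `b`, and `c` would be reported as a hotspot whose load should be reduced.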
