Observing containers with the three pillars of observability – Increment

Containers, Kubernetes, and microservices have enabled operations teams to move ever faster, producing more applications, packaging them up, and scaling them to meet users’ needs. But as infrastructure scales, it generates more data, and developers have to work harder to cut through the noise. 

The three pillars of observability—metrics, logs, and tracing—are a great place to start. Let’s explore the benefits, challenges, and use cases of each to come away with a stronger understanding of how to make the most of our container data.

Metrics allow you to evaluate, at a high level, whether a system is behaving as expected. As such, they’re typically used for alerting. There are hundreds of metrics to describe each container, from memory to network activity to CPU, so to keep your alerts targeted and useful, focus on the ones that describe the end-user experience: server error rates, latency, saturation, and so on. Metrics that describe causes of a service failure rather than its symptoms, such as a specific error, high CPU usage, or increased memory consumption, are less helpful: Knowing CPU usage is at 90 percent, for example, doesn’t tell you anything about how users experience the service; knowing 99.9 percent of requests were successfully processed within 100 milliseconds, on the other hand, tells you your service is reliable and fast enough to provide the desired user experience.
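
To make this concrete, here is a minimal sketch of what user-centric instrumentation might look like, assuming a Python service that exposes metrics via the prometheus_client library. The handler and metric names are illustrative, not taken from the article.

```python
# A minimal sketch of user-centric instrumentation, assuming a Python service
# that exposes metrics via the prometheus_client library. Metric and handler
# names here are illustrative examples.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests", ["status"]
)
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds",
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0),
)

def handle_request():
    """Simulated request handler that records symptom-focused metrics."""
    start = time.monotonic()
    try:
        time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work
        REQUESTS.labels(status="2xx").inc()
    except Exception:
        REQUESTS.labels(status="5xx").inc()
        raise
    finally:
        LATENCY.observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(8000)  # exposes a scrape endpoint at :8000/metrics
    while True:
        handle_request()
```

Alerts can then be written against these symptom metrics (error rate and latency percentiles) rather than against per-container CPU or memory.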

Engineering organizations typically formalize their reliability goals as service-level objectives (SLOs) and assign metrics to them. This is an effective means of focusing the information generated across an entire infrastructure, providing concise, high-level insights as your organization deploys a growing number of containers. SLOs remind us that software, just like the humans who create it, is imperfect; defining SLOs—for example, requiring that 99.9 percent of all requests over a 30-day window be processed successfully—helps establish how imperfect a system can be while still providing a positive user experience. Metrics such as queue length, number of container crashes, and dropped network packets can also be useful for debugging, but they shine brightest when leveraged to inform user-centric SLOs.
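
The arithmetic behind such an objective is simple, and the sketch below shows one way to track it, assuming you already have counts of good and total requests for the 30-day window (for instance, derived from the metrics above). The function name and the numbers are illustrative.

```python
# A small sketch of the arithmetic behind a 99.9 percent, 30-day SLO.
# The counts are illustrative; in practice they would come from your
# metrics backend (e.g., requests served successfully within 100 ms
# versus all requests in the window).
def error_budget_remaining(good: int, total: int, objective: float = 0.999) -> float:
    """Return the fraction of the error budget still unspent (negative if blown)."""
    allowed_failures = (1.0 - objective) * total
    actual_failures = total - good
    if allowed_failures == 0:
        return 0.0
    return 1.0 - actual_failures / allowed_failures

# Example: 10,000,000 requests this window, 9,993,500 met the target.
remaining = error_budget_remaining(good=9_993_500, total=10_000_000)
print(f"Error budget remaining: {remaining:.1%}")  # prints 35.0%
```

Tracking the remaining error budget this way turns an abstract reliability goal into a single number a team can watch as container counts grow.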
