
Q&A with Alex Hidalgo on SLOs

Submitted by Style Pass on 2024-10-26 23:30:38

Alex Hidalgo is a Site Reliability Engineer at Squarespace, and he’s currently writing a book called Implementing Service Level Objectives for O’Reilly Media. The first three chapters of the book are available now through O’Reilly’s early access program. I had a chance to read those chapters and ask Alex some questions about service level objectives and reliability. Thanks, Alex, for sharing your knowledge.

When people talk about service level objectives, they tend to use the term SLO to encompass the entire process, but really there are three primary components at play. You have service level indicators, or SLIs, which are measurements of the performance of your service from your users’ point of view. SLIs inform service level objectives, or SLOs, which are essentially just targets for how often your SLIs should be in a good state. SLOs in turn power error budgets, which are a way of measuring how you’ve performed against your target over a window of time.
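To make the relationship between the three components concrete, here is a minimal Python sketch. It assumes an event-based SLI (good events divided by total events over one window), which is only one of several ways to define an SLI. The class name `ReliabilityWindow`, its fields, and the numbers in the usage lines are hypothetical illustrations, not something taken from Alex's book.

```python
from dataclasses import dataclass


@dataclass
class ReliabilityWindow:
    """Hypothetical record of one measurement window for a single service."""

    good_events: int   # events that users would consider good (assumed definition)
    total_events: int  # all valid events observed in the window
    slo_target: float  # fraction of events that should be good, e.g. 0.999

    def sli(self) -> float:
        """SLI: the measured fraction of good events."""
        if self.total_events == 0:
            return 1.0  # assumption: no traffic counts as fully reliable
        return self.good_events / self.total_events

    def meets_slo(self) -> bool:
        """SLO: is the SLI at or above the target for this window?"""
        return self.sli() >= self.slo_target

    def error_budget(self) -> float:
        """Error budget: how many bad events the SLO tolerates in this window."""
        return (1 - self.slo_target) * self.total_events

    def budget_remaining(self) -> float:
        """Budget left after subtracting the bad events actually observed."""
        bad_events = self.total_events - self.good_events
        return self.error_budget() - bad_events


# Usage with made-up numbers: one million requests, 600 of them bad,
# measured against a 99.9% target.
window = ReliabilityWindow(good_events=999_400, total_events=1_000_000, slo_target=0.999)
print(f"SLI: {window.sli():.4%}")
print(f"Meets SLO: {window.meets_slo()}")
print(f"Error budget remaining: {window.budget_remaining():.0f} events")
```

Each layer is computed from the one beneath it: the SLI is a raw measurement, the SLO compares that measurement to a target, and the error budget turns the target into an allowance that can be spent over the window.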

When developing slides for a talk at DevOpsDays NYC 2019, I came up with a little graphic showing this, with SLIs on the bottom, SLOs in the middle, and error budgets at the top. I found myself staring at this diagram for a bit, thinking that we can’t just keep referring to all of this as “doing SLOs,” since SLOs are really just one component. So I came up with the term The Reliability Stack to refer to how these three components interact.