This is the first post in a series demystifying the metrics engine that underpins Shoreline’s real-time data. Each post will be a deep dive into one aspect of the engine and how it benefits SREs, DevOps engineers, and sysadmins while they're on call. We’ll showcase our most interesting applications in machine learning, compression, compilers, and distributed systems.
During an operational event, as an SRE, I run ad hoc queries to debug the system. I want these to be real-time no matter how complex the computation or how large the volume of data. Shoreline’s metrics team has leveraged two machine learning technologies from Google, JAX and XLA, to accelerate metric queries and data analysis. Within Shoreline, queries are automatically vectorized using JAX and compiled using XLA, delivering high performance without any extra work from the user. This allows Shoreline to compute complex, ad hoc aggregates across hundreds of thousands of data points in less than one second.
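To make this concrete, here is a minimal sketch of the vectorize-and-compile pattern described above, using the public JAX API. The function name, data shapes, and the percentile aggregate are illustrative assumptions, not Shoreline's actual query code.

```python
import jax
import jax.numpy as jnp

@jax.jit  # XLA traces and compiles the whole aggregate into fused kernels
def p99_by_host(samples):
    # samples: a (hosts, points) matrix of metric values;
    # the percentile runs vectorized across all hosts at once
    return jnp.percentile(samples, 99, axis=1)

# 400,000 synthetic data points: 4 hosts x 100,000 samples each
samples = jnp.arange(400_000, dtype=jnp.float32).reshape(4, 100_000)
result = p99_by_host(samples)  # one p99 value per host, shape (4,)
```

The first call triggers XLA compilation; subsequent calls with the same shapes reuse the compiled program, which is what makes repeated ad hoc queries fast.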
While metrics help SREs understand the health of their infrastructure and applications, crafting a metric query can be a complex task involving derivatives and vectorized mathematics. For example, computing CPU usage for a host from node exporter counters involves a derivative, an average, and other vector arithmetic. JAX is a Python frontend to XLA, the linear algebra compiler that backs TensorFlow, Google’s famous machine learning framework. Within Shoreline, we map metric queries to tensor operations and compile them, leveraging the same stack that powers these contemporary machine learning systems.
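The CPU-usage example above can be sketched as tensor operations in JAX. This is a simplified illustration under assumed inputs (made-up counter values, idle mode only), not Shoreline's query implementation: node exporter exposes cumulative CPU seconds per core, so usage is one minus the rate of change of the idle counter, averaged over cores.

```python
import jax
import jax.numpy as jnp

@jax.jit
def cpu_usage(idle_seconds, timestamps):
    # idle_seconds: (cores, samples) cumulative idle CPU seconds per core
    # timestamps:   (samples,) scrape times in seconds
    dt = jnp.diff(timestamps)                        # the "derivative" step
    idle_rate = jnp.diff(idle_seconds, axis=1) / dt  # idle fraction per core
    return 1.0 - jnp.mean(idle_rate, axis=0)         # average across cores

# Two cores scraped every 15s: core 0 is 80% idle, core 1 is 60% idle
ts = jnp.array([0.0, 15.0, 30.0])
idle = jnp.array([[0.0, 12.0, 24.0],
                  [0.0,  9.0, 18.0]])
usage = cpu_usage(idle, ts)  # ~30% usage in each interval
```

Expressed this way, the derivative, division, and mean all become array operations that XLA can fuse and vectorize, rather than per-point loops.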