Minimizing Mean Time to Detect: Real Time Alarms with IREE

Submitted by
Style Pass
2021-08-24 00:00:13

This is our second post in a series to demystify the metrics engine that underpins Shoreline’s real-time data. Each of these posts will be a deep dive into one aspect of the engine and how it benefits SREs, DevOps engineers, and sysadmins while they're on call. We’ll showcase our most interesting applications in machine learning, compression, compilers, and distributed systems. Check out the first post in our series: Shoreline Accelerates Ops with JAX & XLA.

As operators, we need to know when our systems are broken: any delay in alarming increases mean time to detect (MTTD) and reduces availability. Longer delays raise the odds of broken SLAs and of the correction-of-error and root-cause-analysis meetings that follow.

To address this, Shoreline has developed a real-time alarm mechanism: Shoreline executes thousands of alarms on the box itself, with 1 second of delay, so you’ll know right away when systems are down. Shoreline’s incident automation plugs into real-time alarms, allowing mitigation to begin automatically within 1 second of the alarm firing. Shoreline can detect and repair issues before other monitoring systems have even noticed that something is wrong.

To do this, we extended our metric query compiler, detailed in the first blog post in this series, so that we can export compiled alarms from our backend to execute natively on the agent. Alarms execute as highly optimized machine code. We leveraged Google’s IREE project: a toolchain that represents tensor transformations in a unified way and provides an export mechanism, so computations can easily be shipped to other machines.
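
To make the "compile on the backend, execute on the agent" idea concrete, here is a minimal sketch in JAX. It is not Shoreline's alarm code and it stops short of the IREE step. The alarm definition, the names `cpu_alarm`, `WINDOW`, and `THRESHOLD`, and the sample values are all hypothetical. It only shows that an alarm can be written as a small tensor computation and lowered to a textual compiler IR, which is the kind of artifact a toolchain such as IREE could then compile for another machine.

```python
import jax
import jax.numpy as jnp

# Hypothetical alarm parameters, chosen only for illustration.
WINDOW = 5          # number of 1-second samples examined per evaluation
THRESHOLD = 90.0    # CPU utilization, in percent


def cpu_alarm(samples):
    """Fire when every sample in the window exceeds the threshold."""
    return jnp.all(samples > THRESHOLD)


# JIT-compile the alarm so it runs as optimized machine code.
compiled_alarm = jax.jit(cpu_alarm)

# Lowering against an example input shape produces a compiler IR module.
# Its text form is what an export step could ship elsewhere.
example = jnp.zeros((WINDOW,), dtype=jnp.float32)
ir_text = compiled_alarm.lower(example).as_text()

# Evaluate the alarm on two hypothetical windows of samples.
healthy = jnp.array([35.0, 40.0, 92.0, 38.0, 41.0], dtype=jnp.float32)
saturated = jnp.array([95.0, 97.0, 99.0, 96.0, 98.0], dtype=jnp.float32)

fired_healthy = bool(compiled_alarm(healthy))
fired_saturated = bool(compiled_alarm(saturated))
```

In this sketch the alarm ignores the single spike in `healthy` and fires only for `saturated`, where the whole window is above the threshold. The exact IR dialect in `ir_text` depends on the installed JAX version, so the sketch treats it as opaque text.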
