Friday, May 3 • 2:45pm - 3:05pm
Conversation: Defending SLOs - Error Budget Burn Rate Alerting using Datadog

While Datadog is not an ideal medium for creating error budget burn rate alerts, it is possible to approximate the SRE Workbook Chapter 5 recommended "Multiwindow, Multi-Burn-Rate Alerts". If you're using Datadog, come by and see the setup.

This talk walks through my setup of Multiwindow, Multi-Burn-Rate Alerts on Datadog (see section 6 of [Chapter 5](https://landing.google.com/sre/workbook/chapters/alerting-on-slos/)). Datadog has a lot of gotchas, so the setup is not very straightforward. I'll show an example Datadog alert (most likely as screenshots/slideware) and explain what the various bits do. Leading up to the slideware demo will be an explanation of Multiwindow, Multi-Burn-Rate Alerts, which is likely to be a regurgitation of what's in the SRE Workbook. The main takeaway is the details of what to put in the Datadog monitor fields to get what we want out of it. I'm not affiliated with Datadog, but we use it at work and I assume a lot of the audience may too.
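
For readers who have not seen the technique, here is a minimal sketch of the multiwindow condition as the SRE Workbook describes it: an alert fires only when the burn rate exceeds the threshold over both a long window and a short window, so it stops firing soon after the problem stops. This is not the Datadog monitor configuration from the talk. The names `BurnRateRule` and `should_alert` are hypothetical, and the 14.4 threshold with 1h/5m windows is the Workbook's paging example for a 30-day SLO, used here only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    """One burn-rate threshold checked over a long and a short window (hypothetical type)."""
    threshold: float
    long_window: str   # label only, e.g. "1h"
    short_window: str  # label only, e.g. "5m"


def should_alert(rule: BurnRateRule, long_burn_rate: float, short_burn_rate: float) -> bool:
    # Both windows must exceed the threshold: the long window shows that a
    # meaningful share of the budget was spent, the short window shows that
    # the burn is still happening right now.
    return long_burn_rate > rule.threshold and short_burn_rate > rule.threshold


# Illustrative values, assumed from the Workbook's 30-day SLO example.
page_rule = BurnRateRule(threshold=14.4, long_window="1h", short_window="5m")

print(should_alert(page_rule, long_burn_rate=20.0, short_burn_rate=15.0))
print(should_alert(page_rule, long_burn_rate=20.0, short_burn_rate=1.0))
```

The second call returns `False` because the short window has already recovered, which is the case where a single-window alert would keep paging.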

While learning how to do this, I didn't find burn-rate alerts intuitive at all and had to spend a few days understanding what the formulas mean. For example, what are the units of "burn rate"? Why do we monitor for a burn rate greater than 7 if we want to be notified about running out of error budget within 48 hours on a two-week SLO measurement period? The plan is to share all those details in Open Space as various questions come up.
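
The arithmetic behind those two questions can be sketched as follows, assuming the SRE Workbook definition of burn rate as the observed error ratio divided by the error ratio the SLO allows. Under that definition burn rate is a unitless multiplier: a burn rate of 1 spends exactly the whole budget over the SLO period, so a budget that runs out in 48 hours of a 14-day period is being spent 14 × 24 / 48 = 7 times too fast. The function names and the 99.9% target are hypothetical, chosen only for illustration.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the allowed error ratio (1 - SLO target)."""
    allowed_error_ratio = 1.0 - slo_target
    return error_ratio / allowed_error_ratio


def burn_rate_threshold(slo_period_hours: float, exhaustion_hours: float) -> float:
    """Burn rate at which the whole budget is spent in `exhaustion_hours`."""
    return slo_period_hours / exhaustion_hours


def hours_to_exhaustion(slo_period_hours: float, rate: float) -> float:
    """Time for a full budget to run out at a constant burn rate."""
    return slo_period_hours / rate


two_weeks = 14 * 24  # 336 hours

print(burn_rate_threshold(two_weeks, 48))   # 7.0
print(hours_to_exhaustion(two_weeks, 7.0))  # 48.0

# With an assumed 99.9% SLO, a 0.7% error ratio is a burn rate of about 7.
print(round(burn_rate(0.007, 0.999), 6))
```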

Speakers
Tristan Slominski

Tristan Slominski is interested in design, development and operation of autonomous self-directed teams and decentralized distributed systems.


Friday May 3, 2019 2:45pm - 3:05pm
Touchdown Club Left
