The Ticking Time Bomb of Observability Expectations
In some ways, monitoring is the lifeblood of Site Reliability Engineering practices, providing critical insights into system performance and reliability, and driving the SLI/SLO framework so many of us work within. In recent years, many of us in SRE have observed a disturbing trend in the role vendors are playing in this market.
It’s very easy to convince engineers and managers to “monitor everything”. Who doesn’t want as much information as they can possibly have about what’s happening in their system? On the surface, this sounds like a great plan, and it has become the dominant approach among engineering teams: simply install an agent, sidecar, or SDK, and everything 🌈 will be monitored for you. Want to know how your Kubernetes cluster is doing? Here’s 10k “turnkey” metrics! The numbers become gigantic as architectures continue to fragment from monoliths to rocks (SOA) to pebbles (microservices) to…a gaseous cloud of lambdas? Doesn’t matter, add this line and ship dozens of metrics from every single lambda execution. Ship it all, monitor everything, and sort it out later. After all, we can’t possibly know what the cause of a future incident might be! Imagine the chaos, the crisis, of experiencing an incident for which you don’t have data! We’re a data-driven organization after all! Every second of outage costs us money! For every single interaction we capture scientific levels of data, constantly vigilant, expecting that at any moment we might need to comb through it to understand a complex outage.
The trouble is, this is extraordinarily expensive computationally, cognitively, and financially. The financial and computational costs have been subsidized in the past by VC investment, which was in turn subsidized by the historically low interest rates of the 2010s. As you’ve probably noticed, that party is over. The cognitive costs are still subsidized by simply putting on the confident “Serious Senior Engineer” face and pretending we know what all this stuff means.
Recently, the consequences have become even more pronounced, with bills reaching astronomical figures, such as Coinbase’s reported $65 million bill from Datadog. The confluence of “ephemeral”, atomic architectures and vendor billing practices has created a crisis that’s quietly happening in SRE teams everywhere. It costs money to store logs, metrics, and traces. A hard reality that is seldom discussed is that storing this data often costs much more than the data is worth. In fact, it costs so much that even monitoring companies running billions in annual sales are still losing hundreds of millions of dollars. After all, these companies still have to pay the bills for bandwidth, compute, and storage. It’s starting to look like this model might not work. And when concerns about costs land on your desk, resenting your accounting team or quitting the job isn’t the right way to respond.
Still, the end results are dazzling: incredible amounts of data, expanding our horizons in terms of both big data backend systems and UX. These platforms are engineering marvels. They’re also the world’s most expensive spam generation platforms. Flapping alarms, auto-resolving before you can review them. Bogus anomaly detection spamming into a Slack channel because “some day” you’re gonna get that tuned right. Thousands of metrics you’re paying for, often easier to add than to remove! Alerts that nobody does anything about, but nobody feels safe turning off. Hundreds of dashboards and still no “Single Pane of Glass”.
This can’t go on forever. As monitoring bills eat an increasingly large share of engineering budgets (often 10–25% of your also-huge cloud bill, where you’re already paying for many of the same metrics to be captured!), it’s time to start thinking about how we might get out of this spot. The cost/value equation here is not looking as hot as it did 5–10 years ago. It’s still often the case, just as it was 10+ years ago, that dev teams don’t know which metrics are the most valuable or important in understanding the health of their applications, but now there are sometimes millions of possible answers to that question.
In future posts, I’ll explore some ideas for how this might shake out, and what you might need to think about as an SRE in a market where saving money is likely to become a top priority. My mission? Stop letting vendors control the conversation about observability.