Delusion Soup: How Observability Got Here, and What We Can Do About It

David Caudill
12 min read · May 27, 2023


In this article we’ll talk about some of the unspoken, bad assumptions that got us into the situation I described in my recent article, The Ticking Time Bomb of Observability Expectations. We’ll also talk about a handful of simple suggestions I can offer you for getting out of the high-dollar vendor trap you have likely found yourself in and, most importantly, for getting actual value out of these systems. This is a very integrative problem, so we’re going all over the map to build a broad perspective of it. Elegant technical solutions are often not optimal for a given real-world scenario, but your desire for that to be true will be exploited by salespeople at every turn. You’ll notice that most of these assumptions benefit the believer’s ego. It’s almost like that’s the common weakness among engineers and their leadership.

Assumption #1: You have massive scale and you need the tools for that.

This works on engineers who want to work on web-scale projects just as well as it works on delusional startup founders who are certain they’ll scale up any second now.

The “influencers” of tech do have massive scale. The entirety of SRE as we know it came from environments with very particular circumstances: global scale, high-traffic websites, users who are the product rather than the customer, failures that are not directly tied to revenue, and a near-completely dominant market position. Your product or company may have absolutely nothing to do with these circumstances. Are you sure that’s where you should begin? It can seem like a good idea to mirror these strategies, because it adds words to your resume. You have a balancing act to think about here — keep your resume current without over-engineering for your actual problems. I want to say that “nothing will make you enjoy your work more than solving your problems well”, and I believe that to be true. Unfortunately, the groupthink in tech is stronger than ever. Just be open to the idea that a simple solution might be the right one for you today.

Assumption #2: People are capable of consuming almost unlimited amounts of information.

This works on engineers and their leaders because people are not very good at assessing their own capabilities. It both pads our ego and confirms our brilliance to ourselves. This is an exploit of engineers’ insecurity.

Humans can only hold a tiny, tiny bit of information in working memory at a time before we have to do what’s called “chunking”. Hierarchy is the default tool for this, which is why you see hierarchical namespaces for metrics in well-designed products. Another strategy the brain uses is “schematizing”: mapping concepts into groupings that relate to each other. When you present information, you have to keep these limits in mind. Your end user cannot construct meaning from more than a couple of metrics if they do not understand the system already; the more you can display metrics grouped together by meaning, overlaid on the same timeframe, and summarized, the better results you will see. Research has indicated, repeatedly, that even the smallest amount of unnecessary information has deleterious effects on problem solving. This should make you rethink having millions of metrics. We’ll talk later about how to do that smartly.
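To make the chunking point concrete, here is a tiny sketch of how hierarchical metric names let a dashboard (or a tired human) reason about a handful of subsystems instead of a flat pile of series. The metric names below are hypothetical placeholders, not anything from a real product.

```python
# A minimal sketch: group metrics by namespace prefix so people can
# "chunk" them into a few subsystems. Names are made up for illustration.
from collections import defaultdict

metrics = [
    "checkout.payment.latency_ms",
    "checkout.payment.errors_total",
    "checkout.cart.items_added_total",
    "search.query.latency_ms",
    "search.query.errors_total",
]

groups = defaultdict(list)
for name in metrics:
    namespace = ".".join(name.split(".")[:2])  # e.g. "checkout.payment"
    groups[namespace].append(name)

for namespace, members in groups.items():
    print(namespace, "->", members)
```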

Assumption #3: Your company’s developers care about customer experience, reliability and performance.

You can lead a horse to water, but if that horse does not have a clear incentive to monitor her software…uh, you get what I mean. Management culture has to align with this — and fixing that without buy-in from those around you is impossible. This is ground-floor stuff; at the very least, make sure you understand where your organization’s culture stands on this. Developers carrying a pager has been the norm in the industry for almost ten years now, but that does not mean it’s the case for you. And carrying a pager doesn’t mean giving a damn. They may just not care because the job has embittered them. A lot of things have to fall into place for this one to be a valid assumption. You do not need them to care in order to make observability better, but you do need an accurate assessment of this.

Assumption #4: During an incident, monitoring data will be the rate-limiting step in your quest to reach resolution.

This is a big, ugly one. All of us have been on an incident with ten people, furiously reviewing dashboards of our prized millions of metrics. Then one engineer arrives and just knows what to do. Because they understand the system (or DNS). Do not underestimate the value of training and understanding, of troubleshooting documentation, of breaking up silos, and of building fluency in addressing a few “likely suspects”. Knowing the subsystems of your product and applying deductive reasoning can be a very efficient path to remediation. Often your problem is obvious — knowing how to fix your problem isn’t. Paying $50k a month for metrics to definitively tell you “US-East-1 is having a problem” isn’t terribly satisfying, is it? My point here is just that observability is only a part of resolving your problems as quickly as possible. Humans are the other part, but we don’t talk about them on the conference circuit so much, because they’re a lot harder to monetize. That is a side effect of letting vendors and stealth-recruiting talks control these conversations, not a reflection of reality. Vendors really want you to believe that most of your incidents are incredibly novel, complex problems. You’ll be the hero, pulling a needle out of a haystack and saying “look, anomaly detection has flagged that kubernetes_state.replicaset.replicas_desired is abnormal!”. But seriously, you’re a lot more likely to round people up on a call and have one of them admit they deployed a bad config.

Assumption #5: You’ll be the same person at 2AM that you were at 2PM while designing your observability strategy.

Or worse yet, the assumption that you’re the same person during this sales pitch that you will be at 2AM!

The fact is, your capacity to understand and process information degrades with stress. You will be much dumber, angrier, and less patient than you are while reading this. Don’t kid yourself. Keep it simple and focused. Having too much information is a liability in these scenarios. You are more likely to trip over the query language in this state than you are to pull exactly the perfect signal out of a hat. Ditch the CSI fantasies — vendors reliably appeal to your desire to believe you are a genius. You might be during this sales call with Splunk, but you’re gonna be as dumb as the rest of us when you get paged on New Year’s Eve.

Outside of an incident, it’s easy to imagine you’ll have all the time in the world to review your metrics.

A Way Forward:

This is truly a wicked problem, and it is not entirely solvable. Here are some alternatives to the “monitor everything” approach you’ve been sold. Coincidentally, all of them can potentially land you in a cheaper, better posture than the one you are in today. The following guidance will also help you build vendor-neutral, future-proof skills for your resume, and take you beyond buzzwords.

Know your monitoring solution.

If you’re working with a SaaS product, understand the billing scheme inside and out. These are full of dark patterns, expensive defaults, and perverse incentives. Generally, in telemetry systems, cardinality is billable. In Prometheus or InfluxDB, and likely other TSDBs, cardinality is costly in terms of storage and computation (each unique combination of tags is stored as an entirely new timeseries). For logs, individual “rows” are often what costs you, not the actual amount of data. Understanding the quirks of either cost or technical efficiency is critical to making good decisions and driving down your COGS while getting maximum value from your tools. Most tracing tools are outside of my realm of experience, but they almost certainly have their own novel billing/efficiency model. For the most part, established vendors will bill you for what costs them the most to do. So it’s unlikely that you will be able to lift and shift your monitoring burden into a similar OSS product without translating your billing problem into a capacity problem. (Unlikely, but not totally impossible.) If you understand distributed systems really well, you can usually find points of leverage where it’s cheaper to do things one way than another. This is a risky business — your vendor also knows distributed systems quite well, and their billing model almost always catches up. The core problem — wanting to store more data than you can afford — is unavoidable.
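As a back-of-the-envelope illustration of the cardinality point: every unique combination of label values becomes its own time series, so the series count is the product of the distinct values per label. The label counts below are made up, but the multiplication is how Prometheus-style TSDBs behave.

```python
# Rough sketch of cardinality math. The distinct-value counts are
# hypothetical; the multiplication is the point.
labels = {
    "region": 3,        # us-east-1, us-west-2, eu-west-1
    "status_code": 5,   # 200, 301, 400, 404, 500
    "endpoint": 40,     # one per route
}

series = 1
for name, distinct_values in labels.items():
    series *= distinct_values
print(f"time series for one metric: {series}")  # 600 -- manageable

# Add a single high-cardinality label and the same metric explodes.
series *= 50_000  # e.g. a per-user label
print(f"with a per-user label: {series}")       # 30,000,000 series
```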

Work backwards from a vision.

Your observability strategy is, ultimately, a product. I’ve often found that putting pen to paper and writing down what you want the user experience to feel like is a good exercise for making sense of what is missing from your current approach. This works because humans can process stories and feelings much more easily than they can process a gigantic bulleted list of features. Most of us know what the experience should be like, but we stop ourselves from confidently expressing it when our inner critic begins to think about all the reasons it’s not possible. This is a “write drunk, edit sober” activity — encourage your participants to set aside practicality and have fun. In my experience, engineers will really struggle with this. First drafts aim low, or echo vendor verbiage. Keep pushing. It’s worth doing, I promise. A good way to approach this is literally writing it as narrative fiction — “Sarah is a back-end engineer and she’s just been notified of an outage in a service she owns. She looks down at her phone, where a notification has buzzed…”

After writing these, have your participants think through what would need to be true in order for this story to play out. How close can we get? What’s compelling about one story or another?

Vision is a powerful tool.

Start with what is important, not what is easy.

Your feature developers know what their product should do. Monitor that — how many times does it happen? How long does that take? Was it successful? I call these “work metrics”, but I’m sure you can think of a smarter name. Instrument your code purposefully, with the goal of understanding whether it is meeting its obligations to its customers (if you have an SLO or an SLA, aim to measure it directly!). The only alternative to this idea is to construct meaning by interpreting a bunch of semi-related signals. You do NOT need high-cardinality data for this to be meaningful; see the next section. This metric will not come “out of the box”, and it never will. There is no substitute for doing the work. If you cannot get this prioritized, simply imagine how many metrics or logs you’re paying to store that you could safely jettison if you could clearly, simply understand what’s important. That’s the cost of not doing this, when it comes time to sell the idea. The alternative is a familiar horror: SREs and ops teams reading the tea leaves of HTTP codes and infrastructure metrics. This is an extremely expensive dead end.
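Here is a minimal sketch of what a work metric could look like, using the Python prometheus_client library; the metric names, the "outcome" label, and the process_invoice function are hypothetical placeholders for whatever your product actually does.

```python
# A sketch of a "work metric": how often did the thing happen, how long
# did it take, and did it succeed? Names here are illustrative only.
import time
from prometheus_client import Counter, Histogram

INVOICES_PROCESSED = Counter(
    "invoices_processed_total",
    "Invoices the billing service attempted to process",
    ["outcome"],  # keep label values low-cardinality: "success" / "failure"
)
INVOICE_LATENCY = Histogram(
    "invoice_processing_seconds",
    "How long it takes to process one invoice",
)

def process_invoice(invoice):
    start = time.monotonic()
    try:
        # ... the actual business logic lives here ...
        INVOICES_PROCESSED.labels(outcome="success").inc()
    except Exception:
        INVOICES_PROCESSED.labels(outcome="failure").inc()
        raise
    finally:
        INVOICE_LATENCY.observe(time.monotonic() - start)
```

A counter plus a histogram like this answers all three questions (how often, how long, did it succeed) with two low-cardinality metric families, and it maps directly onto an SLO if you have one.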

Separate status from diagnostic information.

This is the key to your desired “Single Pane of Glass”. The ONLY thing your single pane of glass should express is the current status. This can and should be as simple as a set of red-yellow-green indicators about sections of your system. Not line graphs, not logs, not traces, just boring blocks of color. Amazing things happen when executives can see how things are going.

When your car has a problem, it doesn’t show you the OBD code. It shows “CHECK ENGINE”. Then, later, you connect an OBD code reader. Same principle.
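A status rollup like this can be almost embarrassingly simple. The sketch below is a hypothetical illustration, not a real product: each section of the system reports red, yellow, or green, and the single pane of glass just shows the worst of them.

```python
# A minimal sketch of a red/yellow/green rollup. Component names and
# statuses are made up; in practice they'd come from your healthchecks.
STATUS_ORDER = {"green": 0, "yellow": 1, "red": 2}

def worst(statuses):
    """Overall status is the worst status reported by any component."""
    return max(statuses, key=lambda s: STATUS_ORDER[s])

component_status = {
    "checkout": "green",
    "search": "yellow",   # e.g. error rate above a warning threshold
    "payments": "green",
}

print(worst(component_status.values()))  # -> "yellow"
```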

Once you’ve solved for understanding how things are going right now, everything else is diagnostic info. The other zillion metrics really just help you construct a hypothesis about where a possible fault might be. It’s up to you to decide where that cost/value tradeoff stops making sense — maybe you just need to know which region is having problems so you can update DNS and investigate. Maybe you only care at the level of the microservice. Maybe you really do need all that fine-grained stuff you’re paying for. Most of the time, you aren’t gonna need it. Maybe you absolutely need to know how garbage collection is performing on etcd or whatever. You probably don’t.

Design for Observability.

When planning, think about how you separate functionality in your system, then think about how you can make those failure domains obvious. A good design will have a relatively consistent level of modularity, and a healthcheck built into the software at that level. For example — a pattern that I’ve learned to love over the years is building a /healthcheck endpoint into every web service. When that endpoint is hit, it returns a tiny JSON object that indicates basic stuff like the status of its connections to its dependencies, an error count for the last n minutes, etc. Working backwards from this requirement, you can instrument your code to push the right data into this response (without causing performance issues). You can use this endpoint as your load balancer healthcheck, and you can scrape it with a monitoring synthetic or a lambda and push it into a monitoring system with only the tags you wanted, instead of the massive, high-cardinality collection of tags that would normally be shipped via an agent or integration. Defaulting to obscene cardinality is one of those “dark patterns” that runs bills up for no good reason. This method uses boring old HTTP, making it extremely versatile and neutral. You can inspect it directly on a given instance, if it comes to that.
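Here is a minimal sketch of that endpoint in Flask; the framework choice, the field names, and the dependency checks are all assumptions, so adapt them to your own stack.

```python
# A sketch of a /healthcheck endpoint. The checks and fields are
# placeholders; real checks should be cheap and have short timeouts.
from flask import Flask, jsonify

app = Flask(__name__)
recent_errors = 0  # in practice, track this with a sliding window

def database_reachable():
    # Replace with a real, cheap check (e.g. SELECT 1 with a short timeout).
    return True

@app.route("/healthcheck")
def healthcheck():
    checks = {
        "database": "ok" if database_reachable() else "failing",
        "recent_errors": recent_errors,
    }
    healthy = checks["database"] == "ok" and recent_errors < 10
    status_code = 200 if healthy else 503
    # Load balancers key off the HTTP status; humans and scrapers read the body.
    return jsonify(status="ok" if healthy else "degraded", checks=checks), status_code
```

You can curl it on a single instance during an incident, point a load balancer at it, or have a small scraper turn it into one low-cardinality status metric per service.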

Another simple pattern here is to surface status via metrics, but drop logs in logical chunks into S3/block storage. That way, when you experience a problem, you know which chunk you’ll need to inspect. There is no rule that you have to log at any given level or in any particular way — you’re the boss here. Instant access to your logs is expensive, and there is a very high probability you won’t need it. S3 is dirt cheap and has built-in archiving features, making it the ideal solution for some types of problems. Smart, intuitive design is key with this one, or you’ll make yourself a nasty pile of TBs of text files in random buckets. But in many simple scenarios? If you know with certainty that things went south at 9PM, you pull the (clearly named) log file for 9:00–9:05PM, and you’ll be surprised how much you can figure out, and how fast. An intermediate (but sluggish) solution for this is logging into a bucket with a set schema and having AWS Athena pre-configured to consume those logs. Definitely not suitable for all situations, but insanely cheap for what you get. Error aggregators, like Rollbar, tend to use similar back ends, and they are often a tremendous value when compared to expensive log aggregators.
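A rough sketch of the chunked-logs idea with boto3 follows; the bucket name, key layout, and five-minute window are all hypothetical choices. The important part is that the key alone tells you which chunk covers the incident.

```python
# A sketch of shipping logs in time-based chunks to S3. Bucket name and
# key scheme are placeholders; pick one you can navigate by hand at 2AM.
import datetime
import gzip
import boto3

s3 = boto3.client("s3")

def flush_log_chunk(service: str, lines: list[str]) -> None:
    now = datetime.datetime.now(datetime.timezone.utc)
    # Round down to a five-minute window so the key names the chunk.
    window = now.replace(minute=now.minute - now.minute % 5,
                         second=0, microsecond=0)
    key = f"{service}/{window:%Y/%m/%d/%H%M}.log.gz"
    body = gzip.compress("\n".join(lines).encode("utf-8"))
    s3.put_object(Bucket="my-app-logs", Key=key, Body=body)
```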

Conclusion:

Much has been written on this topic — in order to think for yourself and deprogram the vendor-speak, you’ll need to read it. I would strongly recommend that you consider going one level deeper than the tools and exploring the basic sciences of cognitive psychology, elementary statistics, graphic design, resilience engineering, and ergonomics. These fields will benefit your SRE practice tremendously, and they will allow you to move forward across that crucial gap between “doing that thing you saw at Re:Invent” and “reasoning about your problems from first principles.” I hope this has been a fun read for you, and that it has motivated you to get more involved and creative as an engineer. You do not have to choose between having a good product and a bad one; you have to pick and choose which data you put into the best solution.

Resources:

I was not born knowing any of this — my previous life and education as a school teacher taught me a ton about cog psych that has been very useful as an SRE. I spent a long, meandering seven and a half years in my undergrad studying education, behavioral science, communication disorders, ophthalmology, audiology, and more. Everything else I learned on the job (by doing it wrong, and crying). These are some of the topics and resources that really stuck out to me over the years and helped me build the kind of understanding I used to write this.

Cognitive Psychology:

Statistics:

Design:

Safety Science and Resilience Engineering:

Ergonomics:
