The Ever-Changing World of Cloud Native Observability with Ian Smith

Episode Summary

Corey interviews Ian Smith, Field CTO at Chronosphere, and the two dive into the world of observability software and how it differs from legacy monitoring solutions. Ian covers the three pillars of observability and how the right data solution can make engineering teams more effective. Corey and Ian then discuss how the move to SaaS has impacted the observability industry, leading to unexpectedly high bills and “the dreaded platform play”. Ian even reveals how you can gain more control over your data and costs using Chronosphere.

Episode Show Notes & Transcript

About Ian

Ian Smith is Field CTO at Chronosphere where he works across sales, marketing, engineering and product to deliver better insights and outcomes to observability teams supporting high-scale cloud-native environments. Previously, he worked with observability teams across the software industry in pre-sales roles at New Relic, Wavefront, PagerDuty and Lightstep.



Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. Every once in a while, I find that something I’m working on aligns perfectly with a person that I wind up basically convincing to appear on this show. Today’s promoted guest is Ian Smith, who’s Field CTO at Chronosphere. Ian, thank you for joining me.

Ian: Thanks, Corey. Great to be here.

Corey: So, the coincidental aspect of what I’m referring to is that Chronosphere is, despite the name, not something that works on bending time, but rather an observability company. Is that directionally accurate?

Ian: That’s true. Although you could argue it probably bends a little bit of engineering time. But we can talk about that later.

Corey: [laugh]. So, observability is one of those areas that I think is suffering from too many definitions, if that makes sense. And at first, I couldn’t make sense of what it was that people actually meant when they said observability. This sort of clarified for me, at least, when I realized that there were an awful lot of, well, let’s be direct and call them ‘legacy monitoring companies’ that just chose to take what they were already doing and define that as, “Oh, this is observability.” I don’t know that I necessarily agree with that. I know a lot of folks in the industry vehemently disagree.

You’ve been in a lot of places that have positioned you reasonably well to have opinions on this sort of question. To my understanding, you were at interesting places, such as Lightstep, New Relic, Wavefront, and PagerDuty, which I guess technically might count as observability in a very strange way. How do you view observability and what it is?

Ian: Yeah. Well, a lot of definitions, as you said, common ones, they talk about the three pillars, they talk really about data types. For me, it’s about outcomes. I think observability is really this transition from the yesteryear of monitoring where things were much simpler and you, sort of, knew all of the questions, you were able to define your dashboards, you were able to define your alerts and that was really the gist of it. And going into this brave new world where there’s a lot of unknown things, you’re having to ask a lot of sort of unique questions, particularly during a particular instance, and so being able to ask those questions in an ad hoc fashion layers on top of what we’ve traditionally done with monitoring. So, observability is sort of that more flexible, more dynamic kind of environment that you have to deal with.

Corey: This has always been something that, for me, has been relatively academic. Back when I was running production environments, things tended to be a lot more static, where, “Oh, there’s a problem with the database. I will SSH into the database server.” Or, “Hmm, we’re having a weird problem with the web tier. Well, there are ten or 20 or 200 web servers. Great, I can aggregate all of their logs to Syslog, and worst case, I can log in and poke around.”

Now, with a more ephemeral style of environment where you have Kubernetes or whatnot scheduling containers into place that have problems, you can’t attach to a running container very easily, and by the time you see an error, that container hasn’t existed for three hours. And that becomes a problem. Then you’ve got the Lambda universe, which is a whole ‘nother world of pain, where it becomes very challenging, at least for me, in order to reason using the old style approaches about what’s actually going on in your environment.

Ian: Yeah, I think there’s that and there’s also the added complexity of oftentimes you’ll see performance or behavioral changes based on even more narrow pathways, right? One particular user is having a problem and the traffic is spread across many containers. Is it making all of these containers perform badly? Not necessarily, but their user experience is being affected. It’s very common in say, like, B2B scenarios for you to want to understand the experience of one particular user or the aggregate experience of users at a particular company, particular customer, for example.

There’s just more complexity. There’s more complexity of the infrastructure and just the technical layer that you’re talking about, but there’s also more complexity in just the way that we’re handling use cases and trying to provide value with all of this software to the myriad of customers in different industries that software now serves.
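
Illustrative sketch (not from the episode): the point about one customer having a bad experience while every container looks healthy can be shown with a small aggregation. The request records, container names, and customer names below are all invented for illustration. The same data is averaged two ways, by container and by customer.

```python
from collections import defaultdict

# Hypothetical request records: (container, customer, latency in ms).
REQUESTS = [
    ("web-1", "acme", 900), ("web-2", "acme", 950), ("web-3", "acme", 870),
    ("web-1", "globex", 40), ("web-2", "globex", 45), ("web-3", "globex", 50),
    ("web-1", "initech", 42), ("web-2", "initech", 38), ("web-3", "initech", 41),
]


def mean_latency_by(records, key_index):
    """Average latency grouped by the field at key_index (0=container, 1=customer)."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record[2])
    return {key: sum(values) / len(values) for key, values in groups.items()}


by_container = mean_latency_by(REQUESTS, 0)
by_customer = mean_latency_by(REQUESTS, 1)

print("by container:", by_container)
print("by customer:", by_customer)
```

In this made-up data the per-container averages are nearly identical, so no single container stands out, while the per-customer view makes the one affected customer obvious.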

Corey: For where I sit, I tend to have a little bit of trouble disambiguating, I guess, the three baseline data types that I see talked about again and again in observability. You have logs, which I think I can mostly wrap my head around. That seems to be the baseline story of, “Oh, great. Your application puts out logs. Of course, it’s in its own unique, beautiful format. Why wouldn’t it be?” In an ideal scenario, they’re structured. Things are never ideal, so great. You’re basically tailing log files in some cases. Great. I can reason about those.

Metrics always seem to be a little bit of a step beyond that. It’s okay, I have a whole bunch of log lines that are spitting out every 500 error that my app is throwing—and given my terrible code, it throws a lot—but I can then ideally count the number of times that appears and then that winds up incrementing a counter, similar to the way that we used to see with StatsD, for example, and collectd. Is that directionally correct? As far as the way I reason about, well so far, logs and metrics?
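
Illustrative sketch (not from the episode): the idea of turning log lines into a counter can be written in a few lines. The log format and the sample lines are assumptions made up for this example; a real application's log format would need its own parsing.

```python
from collections import Counter

# Hypothetical access-log lines; the format is an assumption for illustration.
LOG_LINES = [
    "2022-09-01T12:00:01Z GET /thread 200",
    "2022-09-01T12:00:02Z POST /tweet 500",
    "2022-09-01T12:00:03Z GET /thread 200",
    "2022-09-01T12:00:04Z POST /tweet 500",
    "2022-09-01T12:00:05Z GET /login 302",
]


def count_status_codes(lines):
    """Reduce raw log lines to one counter per HTTP status code."""
    counts = Counter()
    for line in lines:
        status = line.rsplit(" ", 1)[-1]  # last space-separated field
        counts[status] += 1
    return counts


status_counts = count_status_codes(LOG_LINES)
print("500 errors:", status_counts["500"])
```

The counter throws away everything about each line except the status code, which is the trade a metric makes: far less detail, but a number a person can read at a glance.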

Ian: I think at a really basic level, yes. I think that, as we’ve been talking about, sort of greater complexity starts coming in when you have—particularly metrics in today’s world of containers—Prometheus—you mentioned StatsD—Prometheus has become sort of like the standard for expressing those things, so you get situations where you have incredibly high cardinality, so cardinality being the interplay between all the different dimensions. So, you might have my container as a label, but also the type of endpoint that’s running on that container as a label, then maybe I want to track my customer organizations and maybe I have 5000 of those. I have 3000 containers, and so on and so forth. And you get this massive explosion, almost multiplicatively.

For those in the audience who really live and breathe cardinality, there’s probably someone screaming about well, it’s not truly multiplicative in every sense of the word, but, you know, it’s close enough from an approximation standpoint. And you get this massive explosion of data, which obviously has a cost implication but also has, I think, a really big implication on the core reason why you have metrics in the first place you alluded to, which is, so a human being can reason about it, right? You don’t want to go and look at 5000 log lines; you want to know, out of those 5000 log lines, I have 4000 errors and I have 1000 OKs. It’s very easy for human beings to reason about that from a numbers perspective. When your metrics start to re-explode out into thousands, millions of data points, and unique, sort of, time series, more numbers for you to track, then you’re sort of losing that original goal of metrics.
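
Illustrative sketch (not from the episode): the "almost multiplicative" growth can be made concrete with a little arithmetic. The 3000 containers and 5000 customer organizations come from the conversation; the endpoint count of 20 is an assumed figure. The product is only an upper bound, since not every combination of label values actually occurs, which is the caveat raised above.

```python
from math import prod

# Distinct values per label. 3000 containers and 5000 customer organizations
# come from the conversation; the endpoint count is an assumed figure.
LABEL_VALUES = {
    "container": 3000,
    "customer_org": 5000,
    "endpoint": 20,
}


def max_series(label_values):
    """Upper bound on time series for one metric: product of label value counts."""
    return prod(label_values.values())


upper_bound = max_series(LABEL_VALUES)
without_customer = max_series(
    {name: count for name, count in LABEL_VALUES.items() if name != "customer_org"}
)

print("upper bound with customer label:", upper_bound)
print("upper bound without customer label:", without_customer)
```

Under these assumptions, adding the customer label raises the ceiling from 60,000 series to 300,000,000 for a single metric.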

Corey: I think I mostly have wrapped my head around the concept. But then that brings us to traces, and that tends to be I think one of the hardest things for me to grasp, just because most of the apps I build, for obvious reasons—namely, I’m bad at programming and most of these are proof of concept type of things rather than anything that’s large scale running in production—the difference between a trace and logs tends to get very muddled for me. But the idea being that as you have a customer session or a request that talks to different microservices, how do you collate across different systems all of the outputs of that request into a single place so you can see timing information, understand the flow that user took through your application? Is that again, directionally correct? Have I completely missed the plot here? Which is again, eminently possible. You are the expert.

Ian: No, I think that’s sort of the fundamental premise or expected value of tracing, for sure. We have something that’s akin to a set of logs; they have a common identifier, a trace ID, that tells us that all of these logs essentially belong to the same request. But importantly, there’s relationship information. And this is the difference between just having traces—sorry, logs—with just a trace ID attached to them. So, for example, if you have Service A calling Service B and Service C, the relatively simple thing, you could use time to try to figure this out.

But what if there are things happening in Service B at the same time there are things happening in Service C and D, and so on and so forth? So, one of the things that tracing brings to the table is it tells you what is currently happening, what called that. So oh, I know that I’m Service D. I was actually called by Service B and I’m not just relying on timestamps to try and figure out that connection. So, you have that information and ultimately, the data model allows you to fully sort of reflect what’s happening with the request, particularly in complex environments.
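
Illustrative sketch (not from the episode): the difference between logs that merely share a trace ID and spans that also record their caller can be shown with a tiny data model. The `Span` class, the IDs, and the helper functions below are hypothetical and are not the API of any tracing library. The services mirror the example in the conversation: A calls B and C, and B calls D.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root of the request
    service: str


# Hypothetical spans for one request: A calls B and C, and B calls D.
SPANS = [
    Span("t1", "s1", None, "Service A"),
    Span("t1", "s2", "s1", "Service B"),
    Span("t1", "s3", "s1", "Service C"),
    Span("t1", "s4", "s2", "Service D"),
]


def caller_of(spans, service):
    """Return the service that called `service`, using parent IDs, not timestamps."""
    by_id = {span.span_id: span for span in spans}
    for span in spans:
        if span.service == service and span.parent_id is not None:
            return by_id[span.parent_id].service
    return None


def render_tree(spans, parent_id=None, depth=0):
    """Return indented lines showing the call hierarchy of the request."""
    lines = []
    for span in spans:
        if span.parent_id == parent_id:
            lines.append("  " * depth + span.service)
            lines.extend(render_tree(spans, span.span_id, depth + 1))
    return lines


print("\n".join(render_tree(SPANS)))
print("Service D was called by:", caller_of(SPANS, "Service D"))
```

With only the shared `trace_id`, all four records would belong to the same request but the question of who called Service D would have to be guessed from timing. The `parent_id` field answers it directly.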

And I think this is where, you know, tracing needs to be sort of looked at as not a tool for—just because I’m operating in a modern environment, I’m using some Kubernetes, or I’m using Lambda; it needs to be used in a scenario where you really have trouble grasping, from a conceptual standpoint, what is happening with the request because you need to actually fully document it. As opposed to, I have a few—let’s say three Lambda functions. I maybe have some key metrics about them; I have a little bit of logging. You probably do not need to use tracing to solve, sort of, basic performance problems with those. So, you can get yourself into a place where you’re over-engineering, you’re spending a lot of time with tracing instrumentation and tracing tooling, and I think that’s the core of observability is, like, using the right tool, the right data for the job.

But that’s also what makes it really difficult because you essentially need to have this, you know, huge set of experience or knowledge about the different data, the different tooling, your architecture, and the data you have available to be able to reason about that and make confident decisions, particularly when you’re under a time crunch, where everyone is familiar with a, sort of like, you know, PagerDuty-style experience of my phone is going off and I have a customer-facing incident. Where is my problem? What do I need to do? Which dashboard do I need to look at? Which tool do I need to investigate? And that’s where I think the observability industry has become not serving the outcomes of the customers.

Corey: I had a, well, I wouldn’t say it’s a genius plan, but it was a passing fancy that I’ve built this online, freely available Twitter client for authoring Twitter threads—because that’s what I do instead of having a social life—and it’s available at lasttweetinaws.com. I’ve used that as a testbed for a few things. It’s now deployed to roughly 20 AWS regions simultaneously, and this means that I have a bit of a problem as far as how to figure out not even what’s wrong or what’s broken with this, but who’s even using it?

Because I know people are. I see invocations all over the planet that are not me. And sometimes it appears to just be random things crawling the internet—fine, whatever—but then I see people logging in and doing stuff with it. I’d kind of like to log and see who’s using it just so I can get information like, is there anyone I should talk to about what it could be doing differently? I love getting user experience reports on this stuff.

And I figured, ah, this is a perfect little toy application. It runs in a single Lambda function so it’s not that complicated. I could instrument this with OpenTelemetry, which then, at least according to the instructions on the tin, I could then send different types of data to different observability tools without having to re-instrument this thing every time I want to kick the tires on something else. That was the promise.

And this led to three weeks of pain because it appears that for all of the promise that it has, OpenTelemetry, particularly in a Lambda environment, is nowhere near ready for being able to carry a workload like this. Am I just foolish on this? Am I stating an unfortunate reality that you’ve noticed in the OpenTelemetry space? Or, let’s be clear here, you do work for a company with opinions on these things. Is OpenTelemetry the wrong approach?

Ian: I think OpenTelemetry is absolutely the right approach. To me, the promise of OpenTelemetry for the individual is, “Hey, I can go and instrument this thing, as you said and I can go and send the data, wherever I want.” The sort of larger view of that is, “Well, I’m no longer beholden to a vendor,”—including the ones that I’ve worked for, including the one that I work for now—“For the definition of the data. I am able to control that, I’m able to choose that, I’m able to enhance that, and any effort I put into it, it’s mine. I own that.”

Whereas previously, if you picked, say, for example, an APM vendor, you said, “Oh, I want to have some additional aspects of my information provided, I want to track my customer, or I want to track a particular new metric of how many dollars am I transacting,” that effort is really going to support the value of that individual solution; it’s not going to support your outcomes. Which is I want to be able to use this data wherever I want, wherever it’s most valuable. So, the core premise of OpenTelemetry, I think, is great. I think it’s a massive undertaking to be able to do this for at least three different data types, right? Defining an API across a whole bunch of different languages, across three different data types, and then creating implementations for those.

Because the implementations are the thing that people want, right? You are hoping for the ability to, say, drop in something. Maybe one line of code or preferably just, like, attach a dependency, let’s say in Java-land at runtime, and be able to have the information flow through and have it complete. And this is the premise of, you know, vendors I’ve worked with in the past, like New Relic. That was what New Relic built on: the ability to drop in an agent and get visibility immediately.

So, having that out-of-the-box visibility is obviously a goal of OpenTelemetry where it makes sense—Go, it’s very difficult to attach things at runtime, for example—but then saying, well, whatever is provided—let’s say your gRPC connections, database, all these things—well, now I want to go and instrument; I want to add some additional value. As you said, maybe you want to track something like I want to have in my traces the email address of whoever it is or the Twitter handle of whoever it is so I can then go and analyze that stuff later. You want to be able to inject that piece of information or that instrumentation and then decide, well, where is it best utilized? Is it best utilized in some tooling from AWS? Is it best utilized in something that you’ve built yourself? Is it best utilized in an open-source project? Is it best utilized in one of the many observability vendors, or, as is even becoming more common, I want to shove everything in a data lake and run, sort of, analysis asynchronously, overlay observability data for essentially business purposes.

All of those things are served by having a very robust, open-source standard, and simple-to-implement way of collecting a really good baseline of data and then making it easy for you to then enhance that while still owning—essentially, it’s your IP, right? It’s like, the instrumentation is your IP, whereas in the old world of proprietary agents, proprietary APIs, you were basically building that IP, but it was tied to that other vendor that you were investing in.

Corey: One thing that I was consistently annoyed by in my days of running production infrastructures at places, like, you know, large banks, for example, one of the problems I kept running into is that there’s this idea that, “Oh, you want to use our tool. Just instrument your applications with our libraries or our instrumentation standards.” And it felt like I was constantly doing and redoing a lot of instrumentation for different aspects. It’s not that we were replacing one vendor with another; it’s that in an observability toolchain, there are remarkably few one-size-fits-all stories. It feels increasingly like everyone’s trying to sell me a multifunction printer, which does one thing well, and a few other things just well enough to technically say they do them, but badly enough that I get irritated every single time.

And having 15 different instrumentation packages in an application, that’s got security ramifications, for one, see large bank, and for another it became this increasingly irritating and obnoxious process where it felt like I was spending more time seeing to the care and feeding of the instrumentation than I was the application itself. That’s the goal—that’s I guess the ideal light at the end of the tunnel for me in what OpenTelemetry is promising. Instrument once, and then you’re just adjusting configuration as far as where to send it.

Ian: That’s correct. The organizations, and you know, I keep in touch with a lot of companies that I’ve worked with, companies that have in the last two years really invested heavily in OpenTelemetry, they’re definitely getting to the point now where they’re generating the data once, they’re using, say, pieces of the OpenTelemetry pipeline, they’re extending it themselves, and then they’re able to shove that data in a bunch of different places. Maybe they’re putting it in a data lake for, as I said, business analysis purposes or forecasting. They may be putting the data into two different systems, even for incident and analysis purposes, but you’re not having that duplication of effort. Also, potentially that performance impact, right, of having two different instrumentation packages lined up with each other.
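
Illustrative sketch (not from the episode): the "instrument once, then adjust configuration for where to send it" idea can be pictured as a fan-out. The `Pipeline` class and the two destinations below are hypothetical stand-ins written for this example; they are not the OpenTelemetry API or its collector, only the general shape of the pattern.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Exporter = Callable[[Record], None]


class Pipeline:
    """Hypothetical fan-out: application code emits once, config picks destinations."""

    def __init__(self, exporters: List[Exporter]):
        self.exporters = exporters

    def emit(self, record: Record) -> None:
        for export in self.exporters:
            export(record)


# Stand-ins for real backends; each one just collects what it receives.
data_lake: List[Record] = []
incident_tool: List[Record] = []

# The only place that knows about destinations. Adding or removing a backend
# is a change here, not in the application code below.
pipeline = Pipeline([data_lake.append, incident_tool.append])


def handle_request(user: str) -> None:
    """Application code: instrumented once, unaware of where the data goes."""
    pipeline.emit({"name": "handle_request", "user": user})


handle_request("example_user")
print(len(data_lake), len(incident_tool))
```

The design choice being illustrated is the separation: `handle_request` never names a vendor, so trying out another destination does not mean touching the instrumentation again.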

Corey: There is a recurring theme that I’ve noticed in the observability space that annoys me to no end. And that is—I don’t know if it’s coming from investor pressure, from folks never being satisfied with what they have, or what it is, but there are so many startups that I have seen and worked with in varying aspects of the observability space that I think, “This is awesome. I love the thing that they do.” And invariably, every time they start getting more and more features bolted onto them, where, hey, you love this whole thing that winds up just basically doing a tail -f on a log file, so it just streams your logs in the application and you can look for certain patterns. I love this thing. It’s great.

Oh, what’s this? Now, it’s trying to also be the thing that alerts me and wakes me up in the middle of the night. No. That’s what PagerDuty does. I want PagerDuty to do that thing, and I want other things—I want you just to be the log analysis thing and the way that I contextualize logs. And it feels like they keep bolting things on and bolting things on, where everything is more or less trying to evolve into becoming its own version of Datadog. What’s up with that?

Ian: Yeah, the, sort of, dreaded platform play. I—[laugh] I was at New Relic when there were essentially two products that they sold. And then by the time I left, I think there were seven different products that were being sold, which is kind of a crazy, crazy thing when you think about it. And I think Datadog has definitely exceeded that now. And I definitely see many, many vendors in the market—and even open-source solutions—sort of presenting themselves as, like, this integrated experience.

But to your point, even before, about your experience at these banks, it oftentimes becomes sort of a tick-a-box feature approach of, “Hey, I can do this thing, so buy more. And here’s a shared navigation panel.” But are they really integrated? Like, are you getting real value out of it? One of the things that I do in my role is I get to work with our internal product teams very closely, particularly around new initiatives like tracing functionality, and the constant sort of conversation is like, “What is the outcome? What is the value?”

It’s not about the feature; it’s not about having a list of 19 different features. It’s like, “What is the user able to do with this?” And so, for example, there are lots of platforms that have metrics, logs, and tracing. The new one-upmanship is saying, “Well, we have events as well. And we have incident response. And we have security. And all these things sort of tie together, so it’s one invoice.”

And constantly I talk to customers, and I ask them, like, “Hey, what are the outcomes that you’re getting when you’ve invested so heavily in one vendor?” And oftentimes, the response is, “Well, I only need to deal with one vendor.” Okay, but that’s not an outcome. [laugh]. And it’s like the business having a single invoice.

Corey: Yeah, that is something that’s already attainable today. If you want to just have one vendor with a whole bunch of crappy offerings, that’s what AWS is for. They have AmazonBasics versions of everything you might want to use in production. Oh, you want to go ahead and use MongoDB? Well, use AmazonBasics MongoDB, but they call it DocumentDB because of course they do. And so on and so forth.

There are a bunch of examples of this, but those companies are still in business and doing very well because people often want the genuine article. If everyone was trying to do just everything to check a box for procurement, great. AWS has already beaten you at that game, it seems.

Ian: I do think that, you know, people are hoping for that greater value and those greater outcomes, so being able to actually provide differentiation in that market I don’t think is terribly difficult, right? There are still huge gaps in let’s say, root cause analysis during an investigation time. There are huge issues with vendors who don’t think beyond sort of just the one individual who’s looking at a particular dashboard or looking at whatever analysis tool there is. So, getting those things actually tied together, it’s not just, “Oh, we have metrics, and logs, and traces together,” but even if you say we have metrics and tracing, how do you move between metrics and tracing? One of the goals in the way that we’re developing product at Chronosphere is that if you are alerted to an incident—you as an engineer; doesn’t matter whether you are massively sophisticated, you’re a lead architect who has been with the company forever and you know everything or you’re someone who’s just come out of onboarding and is your first time on call—you should not have to think, “Is this a tracing problem, or a metrics problem, or a logging problem?”

And this is one of those things that I mentioned before of requiring that really heavy level of knowledge and understanding about the observability space and your data and your architecture to be effective. And so, with the, you know, particularly observability teams and all of the engineers that I speak with on a regular basis, you get this sort of circumstance where well, I guess, let’s talk about a real outcome and a real pain point because people are like, okay, yeah, this is all fine; it’s all coming from a vendor who has a particular agenda, but the thing that constantly resonates is for large organizations that are moving fast, you know, big startups, unicorns, or even more traditional enterprises that are trying to undergo, like, a rapid transformation and go really cloud-native and make sure their engineers are moving quickly, a common question I will talk about with them is, who are the three people in your organization who always get escalated to? And it’s usually, you know, between two and five people—

Corey: And you can almost pick those perso—you say that and you can—at least anyone who’s worked in environments or through incidents like this more than a few times, already have thought of specific people in specific companies. And they almost always fall into some very predictable archetypes. But please, continue.

Ian: Yeah. And people think about these people, they always jump to mind. And one of the things I asked about is, “Okay, so when you did your last innovation around observability”—it’s not necessarily buying a new thing, but maybe it was like introducing a new data type or it was you’re doing some big investment in improving instrumentation—“What changed about their experience?” And oftentimes, the most that can come out is, “Oh, they have access to more data.” Okay, that’s not great.

It’s like, “What changed about their experience? Are they still getting woken up at 3 am? Are they constantly getting pinged all the time?” One of the vendors that I worked at, when they would go down, there were three engineers in the company who were capable of generating a list of customers who were actually impacted by damage. And so, every single incident, one of those three engineers got paged into the incident.

And it became borderline intolerable for them because nothing changed. And it got worse, you know? The platform got bigger and more complicated, and so there were more incidents and they were the ones having to generate that. But from a business level, from an observability outcomes perspective, if you zoom all the way up, it’s like, “Oh, were we able to generate the list of customers?” “Yes.”

And this is where I think the observability industry has sort of gotten stuck—you know, at least one of the ways—is that, “Oh, can you do it?” “Yes.” “But is it effective?” “No.” And by effective, I mean those three engineers become the focal point for an organization.

And when I say three—you know, two to five—it doesn’t matter whether you’re talking about a team of a hundred or you’re talking about a team of a thousand. It’s always the same number of people. And as you get bigger and bigger, it becomes more and more of a problem. So, does the tooling actually make a difference to them? And you might ask, “Well, what do you expect from the tooling? What do you expect it to do for them?” Is it you give them deeper analysis tools? Is it, you know, you do AI Ops? No.

The answer is, how do you take the capabilities that those people have and how do you spread it across a larger population of engineers? And that, I think, is one of those key outcomes of observability that no one, whether it be in open-source or the vendor side is really paying a lot of attention to. It’s always about, like, “Oh, we can just shove more data in. By the way, we’ve got petabyte scale and we can deal with, you know, 2 billion active time series, and all these other sorts of vanity measures.” But we’ve gotten really far away from the outcomes. It’s like, “Am I getting return on investment of my observability tooling?”

And I think tracing is this—as you’ve said, it can be difficult to reason about, right? And people are not sure. They’re feeling, “Well, I’m in a microservices environment; I’m in cloud-native; I need tracing because my older APM tools appear to be failing me. I’m just going to go and wriggle my way through implementing OpenTelemetry.” Which has significant engineering costs. I’m not saying it’s not worth it, but there is a significant engineering cost—and then I don’t know what to expect, so I’m going to go and throw my data somewhere and see whether we can achieve those outcomes.

And I do a pilot and my most sophisticated engineers are in the pilot. And they’re able to solve the problems. Okay, I’m going to go buy that thing. But I’ve just transferred my problems. My engineers have gone from solving problems in maybe logs and grepping through petabytes worth of logs to using some sort of complex proprietary query language to go through tens of petabytes of trace data, but I actually haven’t solved any problem. I’ve just moved it around and probably just cost myself a lot, both in terms of engineering time and real dollars spent as well.

Corey: One of the challenges that I’m seeing across the board is that observability, for certain use cases, once you start to see what it is and its potential for certain applications—certainly not all; I want to hedge that a little bit—but it’s clear that there is definite and distinct value versus other ways of doing things. The problem is that value often becomes apparent only after you’ve already done it and can see what that other side looks like. But let’s be honest here. Instrumenting an application is going to take some significant level of investment, in many cases. How do you wind up viewing any return on investment that it takes for the very real cost, if only in people’s time, to go ahead and instrument for observability in complex environments?

Ian: So, I think that you have to look at the fundamentals, right? You have to look at—pretend we knew nothing about tracing. Pretend that we had just invented logging, and you needed to start small. It’s like, I’m not going to go and log everything about every application that I’ve had forever. What I need to do is I need to find the points where that logging is going to be the most useful, most impactful, across the broadest audience possible.

And one of the useful things about tracing is because it’s built in distributed environments, primarily for distributed environments, you can look at, for example, the biggest intersection of requests. A lot of people have things like API Gateways, or they have parts of a monolith which is still handling a lot of request routing; those tend to be areas to start digging into. And I would say that, just like for anyone who’s used Prometheus or decided to move away from Prometheus, no one’s ever gone and evaluated a Prometheus solution without having some sort of Prometheus data, right? You don’t go, “Hey, I’m going to evaluate a replacement for Prometheus or my StatsD without having any data, and I’m simultaneously going to generate my data and evaluate the solution at the same time.” It doesn’t make any sense.

With tracing, you have decent open-source projects out there that allow you to visualize individual traces and understand sort of the basic value you should be getting out of this data. So, it’s a good starting point to go, “Okay, can I reason about a single request? Can I go and look at my request end-to-end, even in a relatively small slice of my environment, and can I see the potential for this? And can I think about the things that I need to be able to solve with many traces?” Once you start developing these ideas, then you can have a better idea of, “Well, where do I go and invest more in instrumentation? Look, databases never appear to be a problem, so I’m not going to focus on database instrumentation. The real problem is my external dependencies. Facebook API is the one that everyone loves to use. I need to go instrument that.”

And then you start to get more clarity. Tracing has this interesting network effect. You can basically just follow the breadcrumbs. Where is my biggest problem here? Where are my errors coming from? Is there anything else further down the call chain? And you can sort of take that exploratory approach rather than doing everything up front.
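As a rough illustration of following the breadcrumbs down a call chain, here is a minimal sketch. The `Span` class, the service names, and the `deepest_error_path` helper are all hypothetical and much simpler than any real tracing data model; the point is only that a single trace can be walked from the entry point toward the span where the errors originate.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """A hypothetical, simplified span (not the OpenTelemetry data model)."""
    name: str
    error: bool = False
    children: list["Span"] = field(default_factory=list)


def deepest_error_path(span: Span) -> list[str]:
    """Return the longest chain of erroring spans that starts at `span`."""
    if not span.error:
        return []
    best: list[str] = []
    for child in span.children:
        path = deepest_error_path(child)
        if len(path) > len(best):
            best = path
    return [span.name] + best


# Invented example trace: the gateway fails because an external API fails.
trace = Span("api-gateway", error=True, children=[
    Span("auth"),
    Span("checkout", error=True, children=[
        Span("database"),
        Span("external-payments-api", error=True),
    ]),
])

print(" -> ".join(deepest_error_path(trace)))
```

In this invented trace the path ends at the external dependency rather than the database, which is the kind of signal that would suggest where to invest in instrumentation next.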

But it is important to do something before you start trying to evaluate what is my end state. End state obviously being sort of a nebulous term in today’s world, but where do I want to be in two years’ time? I would like to have a solution. Maybe it’s an open-source solution, maybe it’s a vendor solution, maybe it’s one of those platform solutions we talked about, but how do I get there? It’s really going to be I need to take an iterative approach and I need to be very clear about the value and outcomes.

There’s no point in doing a whole bunch of instrumentation effort in things that are just working fine, right? You want to go and focus your time and attention on that. And also you don’t want to go and burn just singular engineers. The observability team’s purpose in life is probably not to just write instrumentation or just deploy OpenTelemetry. Because then we get back into the land where engineers themselves know nothing about the monitoring or observability they’re doing and it just becomes a checkbox of, “I dropped in an agent. Oh, when it comes time for me to actually deal with an incident, I don’t know anything about the data and the data is insufficient.”

So, a level of ownership supported by the observability team is really important. On that return on investment, sort of, though it’s not just the instrumentation effort. There’s product training and there are some very hard costs. People think oftentimes, “Well, I have the ability to pay a vendor; that’s really the only cost that I have.” There’s things like egress costs, particularly at high volumes of data. There’s the infrastructure costs. A lot of the times there will be elements you need to run in your own environment; those can be very costly as well, and ultimately, they’re sort of icebergs in this overall ROI conversation.

The other side of it—you know, return and investment—return, there’s a lot of difficulty in reasoning about, as you said, what is the value of this going to be if I go through all this effort? Everyone knows a sort of, you know, meme or archetype of, “Hey, here are three options; pick two because there’s always going to be a trade off.” Particularly for observability, it’s become an element of, I need to pick between performance, data fidelity, or cost. Pick two. And by data fidelity—particularly in tracing—I’m talking about the ability to not sample, right?

If you have edge cases, if you have narrow use cases and ways you need to look at your data, if you heavily sample, you lose data fidelity. But oftentimes, cost is a reason why you do that. And then obviously, performance as you start to get bigger and bigger datasets. So, there’s a lot of different things you need to balance on that return. As you said, oftentimes you don’t get to understand the magnitude of those until you’ve got the full data set in and you’re trying to do this, sort of, for real. But being prepared and iterative as you go through this effort and not saying, “Okay, well, I’m just going to buy everything from one vendor because I’m going to assume that’s going to solve my problem,” is probably that undercurrent there.
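The fidelity cost of heavy sampling can be put in rough numbers. The sketch below assumes uniform, independent sampling of each trace, which is a simplification rather than a description of any particular vendor's sampler; `chance_of_missing` is a hypothetical helper that computes the probability that none of a handful of rare traces survive.

```python
def chance_of_missing(rare_traces: int, sample_rate: float) -> float:
    """Probability that uniform sampling keeps none of the rare traces.

    Assumes each trace is kept independently with probability `sample_rate`.
    """
    return (1.0 - sample_rate) ** rare_traces


# Ten occurrences of an edge case, viewed at three different sample rates.
for rate in (1.0, 0.10, 0.01):
    p = chance_of_missing(rare_traces=10, sample_rate=rate)
    print(f"sample rate {rate:>4.0%}: {p:.1%} chance the edge case is absent")
```

Under these assumptions, keeping 1% of traces leaves roughly a nine-in-ten chance that an edge case seen only ten times never appears in the stored data at all.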

Corey: As I take a look across the entire ecosystem, I can’t shake the feeling—and my apologies in advance if this is an observation, I guess, that winds up throwing a stone directly at you folks—

Ian: Oh, please.

Corey: But I see that there’s a strong observability community out there that is absolutely aligned with the things I care about and things I want to do, and then there’s a bunch of SaaS vendors, where it seems that they are, in many cases, yes, advancing the state of the art, I am not suggesting for a second that money is making observability worse. But I do think that when the tool you sell is a hammer, then every problem starts to look like a nail—or in my case, like my thumb. Do you think that there’s a chance that SaaS vendors are in some ways making this entire space worse?

Ian: As we’ve sort of gone into more cloud-native scenarios and people are building things specifically to take advantage of cloud from a complexity standpoint, from a scaling standpoint, you start to get, like, vertical issues happening. So, you have things like we’re going to charge on a per-container basis; we’re going to charge on a per-host basis; we’re going to charge based off the amount of gigabytes that you send us. These are sort of like more horizontal pricing models, and the way the SaaS vendors have delivered this is they’ve made it pretty opaque, right? Everyone has experiences, or has jokes about overages from observability vendors, massive spikes. I’ve worked with customers who have used—accidentally used some features and they’ve been billed a quarter million dollars on a monthly basis for accidental overages from a SaaS vendor.

And these are all terrible things. Like, but we’ve gotten used to this. Like, we’ve just accepted it, right, because everyone is operating this way. And I really do believe that the move to SaaS was one of those things. Like, “Oh, well, you’re throwing us more data, and we’re charging you more for it.” As a vendor—

Corey: Which sort of erodes your own value proposition that you’re bringing to the table. I mean, I don’t mean to be sitting over here shaking my fist yelling, “Oh, I could build a better version in a weekend,” except that I absolutely know how to build a highly available Rsyslog cluster. I’ve done it a handful of times already and the technology is still there. Compare and contrast that with, at scale, the fact that I’m paying 50 cents per gigabyte ingested to CloudWatch logs, or a multiple of that for a lot of other vendors, it’s not that much harder for me to scale that fleet out and pay a much smaller marginal cost.

Ian: And so, I think the reaction that we’re seeing in the market and we’re starting to see—we’re starting to see the rise of, sort of, a secondary class of vendor. And by secondary, I don’t mean that they’re lesser; I mean that they’re, sort of like, specifically trying to address problems of the primary vendors, right? Everyone’s aware of vendors who are attempting to reduce—well, let’s take the example you gave on logs, right? There are vendors out there whose express purpose is to reduce the cost of your logging observability. They just sit in the middle; they are a middleman, right?

Essentially, hey, use our tool and even though you’re going to pay us a whole bunch of money, it’s going to generate an overall return that is greater than if you had just continued pumping all of your logs over to your existing vendor. So, that’s great. What we think really needs to happen, and one of the things we’re doing at Chronosphere—unfortunate plug—is we’re actually building those capabilities into the solution so it’s actually end-to-end. And by end-to-end, I mean, a solution where I can ingest my data, I can preprocess my data, I can store it, query it, visualize it, all those things, aligned with open-source standards, but I have control over that data, and I understand what’s going on with particularly my cost and my usage. I don’t just get a bill at the end of the month going, “Hey, guess what? You’ve spent an additional $200,000.”

Instead, I can know in real time, well, what is happening with my usage. And I can attribute it. It’s this team over here. And it’s because they added this particular label. And here’s a way for you, right now, to address that and cap it so it doesn’t cost you anything and it doesn’t have a blast radius of, you know, maybe degraded performance or degraded fidelity of the data.
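To make the attribution idea concrete, here is a minimal sketch of tracing usage back to a team and then to a label. It is not how Chronosphere implements this; the `active_series` data, the `team` label, and both helper functions are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical label sets for currently active time series.
active_series = [
    {"team": "payments", "endpoint": "/pay", "user_id": "u1"},
    {"team": "payments", "endpoint": "/pay", "user_id": "u2"},
    {"team": "payments", "endpoint": "/pay", "user_id": "u3"},
    {"team": "payments", "endpoint": "/refund", "user_id": "u1"},
    {"team": "search", "endpoint": "/query"},
    {"team": "search", "endpoint": "/suggest"},
]


def series_per_team(series):
    """Count how many active series each team is responsible for."""
    return Counter(s["team"] for s in series)


def label_cardinality(series, team):
    """For one team, count the distinct values seen for each label."""
    values = defaultdict(set)
    for s in series:
        if s["team"] != team:
            continue
        for label, value in s.items():
            if label != "team":
                values[label].add(value)
    return {label: len(vals) for label, vals in values.items()}


usage = series_per_team(active_series)
top_team, top_count = usage.most_common(1)[0]
print(top_team, top_count)
print(label_cardinality(active_series, top_team))
```

In this toy data the `payments` team owns most of the series, and the `user_id` label is the one multiplying them, which is the "this team, this label" answer described above.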

That though is diametrically opposed to the way that most vendors are set up. And unfortunately, the open-source projects tend to take a lot of their cues, at least recently, from what’s happening in the vendor space. One of the ways that you can think about it is a sort of like a speed of light problem. Everyone knows that, you know, there’s basic fundamental latency; everyone knows how fast disk is; everyone knows the, sort of like, you can’t just make your computations happen magically, there’s a cost of running things horizontally. But a lot of the way that the vendors have presented efficiency to the market is, “Oh, we’re just going to incrementally get faster as AWS gets faster. We’re going to incrementally get better as compression gets better.”

And of course, you can’t go and fit a petabyte worth of data into a kilobyte, unless you’re really just doing some sort of weird dictionary stuff, so you feel—you’re dealing with some fundamental constraints. And the vendors just go, “I’m sorry, you know, we can’t violate the speed of light.” But what you can do is you can start taking a look at, well, how is the data valuable, and start giving the people controls on how to make it more valuable. So, one of the things that we do with Chronosphere is we allow you to reshape Prometheus metrics, right? You go and express Prometheus metrics—let’s say it’s a business metric about how many transactions you’re doing as a business—you don’t need that on a per-container basis, particularly if you’re running 100,000 containers globally.

When you go and take a look at that number on a dashboard, or you alert on it, what is it? It’s one number, one time series. Maybe you break it out per region. You have five regions, you don’t need 100,000 data points every minute behind that. It’s very expensive, it’s not very performant, and as we talked about earlier, it’s very hard to reason about as a human being.

So, giving the tools to be able to go and condense that data down and make it more actionable and more valuable, you get performance, you get cost reduction, and you get the value that you ultimately need out of the data. And it’s one of the reasons why, I guess, I work at Chronosphere. Which I’m hoping is the last observability [laugh] venture I ever work for.
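The condensing step described here, collapsing per-container series into one series per region, can be sketched in a few lines. This is an illustration of the general idea only, not Chronosphere's mechanism; the region names, container names, and transaction counts are made up. In PromQL terms the same rollup would be written with a `sum by (region)` aggregation.

```python
from collections import defaultdict

# Hypothetical regions and per-container samples; all values are invented.
REGIONS = ["us-east", "us-west", "eu-west", "ap-south", "sa-east"]

raw_samples = [
    {"region": region, "container": f"{region}-c{i}", "transactions": 10 * (i + 1)}
    for region in REGIONS
    for i in range(4)
]


def rollup_by_region(samples):
    """Drop the container label and sum transactions per region."""
    totals = defaultdict(int)
    for sample in samples:
        totals[sample["region"]] += sample["transactions"]
    return dict(totals)


per_region = rollup_by_region(raw_samples)
print(f"{len(raw_samples)} series in, {len(per_region)} series out")
print(per_region["us-east"])
```

With four containers per region this only shrinks 20 series to 5, but the output size stays at five series no matter how many containers sit behind each region.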

Corey: Yeah, for me a lot of the data that I see in my logs, which is where a lot of this stuff starts and how I still contextualize these things, is nonsense that I don’t care about and will never care about. I don’t care about load balancer health checks. I don’t particularly care about 200 results for the favicon when people visit the site. I care about other things, but just weed out the crap, especially when I’m paying by the pound—or at least by the gigabyte—in order to get that data into something. Yeah. It becomes obnoxious and difficult to filter out.
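A filter for the kind of noise mentioned here might look like the sketch below. The log entry shape and the `is_noise` helper are hypothetical, and matching health checks on an `ELB-HealthChecker` user-agent prefix is an assumption about how the load balancer identifies itself; a real deployment would match whatever its own health checker sends.

```python
def is_noise(entry: dict) -> bool:
    """Return True for log entries this sketch treats as not worth shipping."""
    # Assumption: health checks are identifiable by a user-agent prefix.
    if entry.get("user_agent", "").startswith("ELB-HealthChecker"):
        return True
    # Successful favicon fetches carry no useful signal; failures are kept.
    if entry.get("path") == "/favicon.ico" and entry.get("status") == 200:
        return True
    return False


# Invented sample entries.
entries = [
    {"path": "/health", "status": 200, "user_agent": "ELB-HealthChecker/2.0"},
    {"path": "/favicon.ico", "status": 200, "user_agent": "Mozilla/5.0"},
    {"path": "/favicon.ico", "status": 404, "user_agent": "Mozilla/5.0"},
    {"path": "/checkout", "status": 500, "user_agent": "Mozilla/5.0"},
]

kept = [e for e in entries if not is_noise(e)]
print(f"kept {len(kept)} of {len(entries)} entries")
```

Running the filter before ingestion, rather than after, is what matters when the bill is computed per gigabyte sent.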

Ian: Yeah. And the vendors just haven’t done any of that because why would they, right? If you went and reduced the amount of log—

Corey: Put engineering effort into something that reduces how much I can charge you? That sounds like lunacy. Yeah.

Ian: Exactly. Their business models are entirely based off it. So, if you went and reduced everyone’s logging bill by 30%, or everyone’s logging volume by 30% and reduced the bills by 30%, it’s not going to be a great time if you’re a publicly traded company who has built your entire business model on essentially a very SaaS volume-driven—and in my eyes—relatively exploitative pricing and billing model.

Corey: Ian, I want to thank you for taking so much time out of your day to talk to me about this. If people want to learn more, where can they find you? I mean, you are a Field CTO, so clearly you’re outstanding in your field. But if, assuming that people don’t want to go to farm country, where’s the best place to find you?

Ian: Yeah. Well, it’ll be a bunch of different conferences. I’ll be at KubeCon this year. But chronosphere.io is the company website. I’ve had the opportunity to talk to a lot of different customers, not from a hard sell perspective, but you know, conversations like this about what are the real problems you’re having and what are the things that you sort of wish that you could do?

One of the favorite things that I get to ask people is, “If you could wave a magic wand, what would you love to be able to do with your observability solution?” That’s, A, a really great part, but oftentimes being able to say, “Well, actually, that thing you want to do, I think I have a way to accomplish that,” is a really rewarding part of this particular role.

Corey: And we will, of course, put links to that in the show notes. Thank you so much for being so generous with your time. I appreciate it.

Ian: Thanks, Corey. It’s great to be here.

Corey: Ian Smith, Field CTO at Chronosphere on this promoted guest episode. I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry comment, which is going to be super easy in your case, because it’s just one of the things that the omnibus observability platform that your company sells offers as part of its full suite of things you’ve never used.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.