The Evolution of OpenTelemetry with Austin Parker

Episode Summary

Austin Parker, Community Maintainer at OpenTelemetry, joins Corey on Screaming in the Cloud to discuss OpenTelemetry’s mission in the world of observability. Austin explains how the OpenTelemetry community was able to scale the OpenTelemetry project to a commercial offering, and the way OpenTelemetry is driving innovation in the data space. Corey and Austin also discuss why Austin decided to write a book on OpenTelemetry, and the book’s focus on the evergreen applications of the tool.

Episode Show Notes & Transcript

About Austin

Austin Parker is the OpenTelemetry Community Maintainer, as well as an event organizer, public speaker, author, and general bon vivant. They've been a part of OpenTelemetry since its inception in 2019.

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: Look, I get it. Folks are being asked to do more and more. Most companies don’t have a dedicated DBA because that person now has a full-time job figuring out which one of AWS’s multiple managed database offerings is right for every workload. Instead, developers and engineers are being asked to support, and heck, if time allows, optimize their databases. That’s where OtterTune comes in. Their AI is your database co-pilot for MySQL and PostgreSQL on Amazon RDS or Aurora. It helps improve performance by up to four x or reduce costs by 50 percent – both of those are decent options. Go to ottertune dot com to learn more and start a free trial. That’s O-T-T-E-R-T-U-N-E dot com.

Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. It’s been a few hundred episodes since I had Austin Parker on to talk about the things that Austin cares about. But it’s time to rectify that. Austin is the community maintainer for OpenTelemetry, which is a CNCF project. If you’re unfamiliar with it, we’re probably going to fix that in short order. Austin, welcome back, it’s been a month of Sundays.

Austin: It has been a month-and-a-half of Sundays. A whole pandemic-and-a-half.

Corey: So, much has happened since then. I tried to instrument something with OpenTelemetry about a year-and-a-half ago, and in defense of the project, my use case is always very strange, but it felt like—a lot of things have sharp edges, but it felt like this had so many sharp edges that you just pivot to being a chainsaw, and I would have been at least a little bit more understanding of why it hurts so very much. But I have heard from people that I trust that the experience has gotten significantly better. Before we get into the nitty-gritty of me lobbing passive-aggressive bug reports at you for you to fix in a scenario in which you can’t possibly refuse me, let’s start with the beginning. What is OpenTelemetry?

Austin: That’s a great question. Thank you for asking it. So, OpenTelemetry is an observability framework. It is run by the CNCF, you know, home of such wonderful award-winning technologies as Kubernetes, and you know, the second biggest source of YAML in the known universe [clear throat].

Corey: On some level, it feels like that is right there with hydrogen as far as unlimited resources in our universe.

Austin: It really is. And, you know, as we all know, there are two things that make, sort of, the DevOps and cloud world go around: one of them being, as you would probably know, AWS bills; and the second being YAML. But OpenTelemetry tries to kind of carve a path through this, right, because we’re interested in observability. And observability, for those that don’t know or have been living under a rock or not reading blogs, it’s a lot of things. It’s a—but we can generally sort of describe it as, like, this is how you understand what your system is doing.

I like to describe it as, it’s a way that we can model systems, especially complex, distributed, or decentralized software systems that are pretty commonly found in larg—you know, organizations of every shape and size, quite often running on Kubernetes, quite often running in public or private clouds. And the goal of observability is to help you, you know, model this system and understand what it’s doing, which is something that I think we can all agree, a pretty important part of our job as software engineers. Where OpenTelemetry fits into this is as the framework that helps you get the telemetry data you need from those systems, put it into a universal format, and then ship it off to some observability back-end, you know, a Prometheus or a Datadog or whatever, in order to analyze that data and get answers to your questions you have.
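As a sketch of that “collect, normalize, ship to a back-end” flow, a minimal OpenTelemetry Collector configuration might look something like this (the ports and the choice of a `prometheus` exporter are illustrative, and assume a Collector build that includes those components):

```yaml
# Minimal Collector pipeline sketch: receive OTLP metrics,
# batch them, and expose them for a Prometheus-style back-end.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # default OTLP/gRPC port

processors:
  batch: {}                      # batch telemetry before export

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889      # illustrative scrape endpoint

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```

Swapping the exporter is the whole point: the same pipeline could ship to any OTLP-compatible vendor instead.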

Corey: From where I sit, the value of OTel—or OpenTelemetry; people in software engineering love abbreviations that are impenetrable from the outside, so of course, we’re going to lean into that—but what I found for my own use case is the shining value prop was that I could instrument an application with OTel—in theory—and then send whatever I wanted that was emitted in terms of telemetry, be it events, be it logs, be it metrics, et cetera, and send that to any or all of a curation of vendors on a case-by-case basis, which meant that suddenly it was the first step in, I guess, an observability pipeline, which increasingly is starting to feel like a milit—like an industrial-observability complex, where there’s so many different companies out there, it seems like a good approach to use, to start, I guess, racing vendors in different areas to see which performs better. One of the challenges I’ve had with that when I started down that path is it felt like every vendor who was embracing OTel did it from a perspective of their implementation. Here’s how to instrument it to—send it to us because we’re the best, obviously. And you’re a community maintainer, despite working at observability vendors yourself. You have always been one of those community-first types where you care more about the user experience than you do this quarter for any particular employer that you have, which to be very clear, is intended as a compliment, not a terrifying warning. It’s why you have this authentic air to you and why you are one of those very few voices that I trust in a space where normally I need to approach it with significant skepticism. How do you see the relationship between vendors and OpenTelemetry?

Austin: I think the hard thing is that I know who signs my paychecks at the end of the day, right, and you always have, you know, some level of, you know, let’s say bias, right? Because it is a bias to look after, you know, them who brought you to the dance. But I think you can be responsible with balancing, sort of, the needs of your employer, and the needs of the community. You know, the way I’ve always described this is that if you think about observability as, like, a—you know, as a market, what’s the total addressable market there? It’s literally everyone that uses software; it’s literally every software company.

Which means there’s plenty of room for people to make their numbers and to buy and sell and trade and do all this sort of stuff. And by taking that approach, by taking sort of the big picture approach and saying, “Well, look, you know, there’s going to be—you know, of all these people, there are going to be some of them that are going to use our stuff and there are some of them that are going to use our competitor’s stuff.” And that’s fine. Let’s figure out where we can invest… in an OpenTelemetry, in a way that makes sense for everyone and not just, you know, our people. So, let’s build things like documentation, right?

You know, one of the things I’m most impressed with, with OpenTelemetry over the past, like, two years is we went from being, as a project, like, if you searched for OpenTelemetry, you would go and you would get five or six or ten different vendor pages coming up trying to tell you, like, “This is how you use it, this is how you use it.” And what we’ve done as a community is we’ve said, you know, “If you go looking for documentation, you should find our website. You should find our resources.” And we’ve managed to get the OpenTelemetry website to basically rank above almost everything else when people are searching for help with OpenTelemetry. And that’s been really good because, one, it means that now, rather than vendors or whoever coming in and saying, like, “Well, we can do this better than you,” we can be like, “Well, look, just, you know, put your effort here, right? It’s already the top result. It’s already where people are coming, and we can prove that.”

And two, it means that as people come in, they’re going to be put into this process of community feedback, where they can go in, they can look at the docs, and they can say, “Oh, well, I had a bad experience here,” or, “How do I do this?” And we get that feedback and then we can improve the docs for everyone else by acting on that feedback, and the net result of this is that more people are using OpenTelemetry, which means there are more people kind of going into the tippy-tippy top of the funnel, right, that are able to become a customer of one of these myriad observability back ends.

Corey: You touched on something very important here, when I first was exploring this—you may have been looking over my shoulder as I went through this process—my impression initially was, oh, this is a ‘CNCF project’ in quotes, where—this is not true universally, of course, but there are cases where it clearly—is where this is an, effectively, vendor-captured project, not necessarily by one vendor, but by an almost consortium of them. And that was my takeaway from OpenTelemetry. It was conversations with you, among others, that led me to believe no, no, this is not in that vein. This is clearly something that is a win. There are just a whole bunch of vendors more-or-less falling all over themselves, trying to stake out thought leadership and imply ownership, on some level, of where these things go. But I definitely left with a sense that this is bigger than any one vendor.

Austin: I would agree. I think, to even step back further, right, there’s almost two different ways that I think vendors—or anyone—can approach OpenTelemetry, you know, from a market perspective, and one is to say, like, “Oh, this is socializing, kind of, the maintenance burden of instrumentation.” Which is a huge cost for commercial players, right? Like, if you’re a Datadog or a Splunk or whoever, you know, you have these agents that you go in and they rip telemetry out of your web servers, out of your gRPC libraries, whatever, and it costs a lot of money to pay engineers to maintain those instrumentation agents, right? And the cynical take is, oh, look at all these big companies that are kind of like pushing all that labor onto the open-source community, and you know, I’m not casting any aspersions here, like, I do think that there’s an element of truth to it though because, yeah, that is a huge fixed cost.

And if you look at the actual lived reality of people and you look at back when SignalFx was still a going concern, right, and they had their APM agents open-sourced, you could go into the SignalFx repo and diff, like, their [Node Express 00:10:15] instrumentation against the Datadog Node Express instrumentation, and it’s almost a hundred percent the same, right? Because it’s truly a commodity. There’s no—there’s nothing interesting about how you get that telemetry out. The interesting stuff all happens after you have the telemetry and you’ve sent it to some back-end, and then you can, you know, analyze it and find interesting things. So, yeah, like, it doesn’t make sense for there to be five or six or eight different companies all competing to rebuild the same wheels over and over and over and over when they don’t have to.

I think the second thing that some people are starting to understand is that it’s like, okay, let’s take this a step beyond instrumentation, right? Because the goal of OpenTelemetry really is to make sure that this instrumentation is native so that you don’t need a third-party agent, you don’t need some other process or jar or whatever that you drop in and it instruments stuff for you. The JVM should provide this, your web framework should provide this, your RPC library should provide this, right? Like, this data should come from the code itself and be in a normalized fashion that can then be sent to any number of vendors or back ends or whatever. And that changes, sort of, the competitive landscape a lot, I think, for observability vendors because rather than, kind of, what you have now, which is people competing on, like, well, how quickly can I throw this agent in and get set up and get a dashboard going, it really becomes more about, like, okay, how are you differentiating yourself against every other person that has access to the same data, right? And you get more interesting use cases and much more interesting analysis features, and that results in more innovation in, sort of, the industry than we’ve seen in a very long time.

Corey: For me, just from the customer side of the world, one of the biggest problems I had with observability in my career as an SRE-type for years was you would wind up building your observability pipeline around whatever vendor you had selected and that meant emphasizing the things they were good at and de-emphasizing the things that they weren’t. And sometimes it’s worked to your benefit; usually not. But then you always had this question when it got things that touched on APM or whatnot—or Application Performance Monitoring—where oh, just embed our library into this. Okay, great. But a year-and-a-half ago, my exposure to this was on an application that I was running in distributed fashion on top of AWS Lambda.

So great, you can either use an extension for this or you can build in the library yourself, but then there’s always a question of precedence where when you have multiple things that are looking at this from different points of view, which one gets done first? Which one is going to see the others? Which one is going to enmesh the other—enclose the others in its own perspective of the world? And it just got incredibly frustrating. One of the—at least for me—bright lights of OTel was that it got away from that where all of the vendors receiving telemetry got the same view.

Austin: Yeah. They all get the same view, they all get the same data, and you know, there’s a pretty rich collection of tools that we’re starting to develop to help you build those pipelines yourselves and really own everything from the point of generation to intermediate collection to actually outputting it to wherever you want to go. For example, a lot of really interesting work has come out of the OpenTelemetry collector recently; one of them is this feature called Connectors. And Connectors let you take the output of certain pipelines and route them as inputs to another pipeline. And as part of that connection, you can transform stuff.

So, for example, let’s say you have a bunch of [spans 00:14:05] or traces coming from your API endpoints, and you don’t necessarily want to keep all those traces in their raw form because maybe they aren’t interesting or maybe there’s just too high of a volume. So, with Connectors, you can go and you can actually convert all of those spans into metrics and export them to a metrics database. You could continue to save that span data if you want, but you have options now, right? Like, you can take that span data and put it into cold storage or put it into, like, you know, some sort of slow blob storage thing where it’s not actively indexed and it’s slow lookups, and then keep a metric representation of it in your alerting pipeline, use metadata exemplars or whatever to kind of connect those things back. And so, when you do suddenly see it’s like, “Oh, well, there’s some interesting p99 behavior,” or we’re hitting an alert or violating an SLO or whatever, then you can go back and say, like, “Okay, well, let’s go dig through the slow da—you know, let’s look at the cold data to figure out what actually happened.”
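A hedged sketch of the Connectors pattern Austin describes, assuming a Collector build that includes the `spanmetrics` connector (the back-end endpoint name is illustrative):

```yaml
# Sketch: the spanmetrics connector sits between two pipelines,
# consuming spans as an "exporter" and emitting metrics as a "receiver".
receivers:
  otlp:
    protocols:
      grpc: {}

connectors:
  spanmetrics: {}                 # derives request/duration metrics from spans

exporters:
  otlp/metrics-backend:
    endpoint: metrics-backend:4317   # illustrative metrics back-end

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics]       # connector as the traces pipeline's sink
    metrics:
      receivers: [spanmetrics]       # same connector feeds the metrics pipeline
      exporters: [otlp/metrics-backend]
```

The traces pipeline could additionally export the raw spans to cold storage alongside the connector, which is the “keep a cheap metric representation, archive the spans” pattern described above.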

And those are features that, historically, you would have needed to go to a big, important vendor and say, like, “Hey, here’s a bunch of money,” right? Like, “Do this for me.” Now, you have the option to kind of do all that more interesting pipeline stuff yourself and then make choices about vendors based on, like, who is making a tool that can help me with the problem that I have? Because most of the time, I don’t—I feel like we tend to treat observability tools as—it depends a lot on where you sit in the org—but you’ve certainly seen this movement towards, like, “Well, we don’t want a tool; we want a platform. We want to go to Lowe’s and we want to get the 48-in-one kit that has a bunch of things in it. And we’re going to pay for the 48-in-one kit, even if we only need, like, two things or three things out of it.”

OpenTelemetry lets you kind of step back and say, like, “Well, what if we just got, like, really high-quality tools for the two or three things we need, and then for the rest of the stuff, we can use other cheaper options?” Which is, I think, really attractive, especially in today’s macroeconomic conditions, let’s say.

Corey: One thing I’m trying to wrap my head around because we all find when it comes to observability, in my experience, it’s the parable of three blind people trying to describe an elephant by touch; depending on where you are on the elephant, you have a very different perspective. What I’m trying to wrap my head around is, what is the vision for OpenTelemetry? Is it specifically envisioned to be the agent that runs wherever the workload is, whether it’s an agent on a host or a layer in a Lambda function, or a sidecar or whatnot in a Kubernetes cluster that winds up gathering and sending data out? Or is the vision something different? Because part of what you’re saying aligns with my perspective on it, but other parts of it seem to—that there’s a misunderstanding somewhere, and it’s almost certainly on my part.

Austin: I think the long-term vision is that you as a developer, you as an SRE, don’t even have to think about OpenTelemetry, that when you are using your container orchestrator or you are using your API framework or you’re using your Managed API Gateway, or any kind of software that you’re building something with, that the telemetry data from that software is emitted in an OpenTelemetry format, right? And when you are writing your code, you know, and you’re using gRPC, let’s say, you could just natively expect that OpenTelemetry is kind of there in the background and it’s integrated into the actual libraries themselves. And so, you can just call the OpenTelemetry API and it’s part of the standard library almost, right? You add some additional metadata to a span and say, like, “Oh, this is the customer ID,” or, “This is some interesting attribute that I want to track for later on,” or, “I’m going to create a histogram here or counter,” whatever it is, and then all that data is just kind of there, right, invisible to you unless you need it. And then when you need it, it’s there for you to kind of pick up and send off somewhere to any number of back-ends or databases or whatnot that you could then use to discover problems or better model your system.

That’s the long-term vision, right, that it’s just there, everyone uses it. It is a de facto and de jure standard. I think in the medium term, it does look a little bit more like OpenTelemetry is kind of this Swiss army knife agent that’s running as sidecars in Kubernetes or it’s running on your EC2 instance. Until we get to the point where everyone just agrees that we’re going to use the OpenTelemetry protocol for the data and we just natively emit it, that’s going to be how long we’re in that midpoint. But that’s sort of the medium- and long-term vision, I think. Does that track?

Corey: It does. And I’m trying to equate this to—like the evolution back in the Stone Age was back when I was first getting started, Nagios was the gold standard. It was kind of the original Call of Duty. And it was awful. There were a bunch of problems with it, but it also worked.

And I’m not trying to dunk on the people who built that. We all stand on the shoulders of giants. It was an open-source project that was awesome at doing exactly what it did, but it was a product built for a very different time. It completely had the wheels fall off as soon as you got to things that were even slightly ephemeral, because it required the server to know where all of the things it was monitoring lived, on an individual-host basis, so there was this constant joy of, “Oh, we’re going to add things to a cluster.” Its perspective was, “What’s a cluster?” Or you’d have these problems with a core switch going down and suddenly everything else would explode as well.

And even setting up an on-call rotation for who got paged when was nightmarish. And a bunch of things have evolved since then, which is putting it mildly. Like, you could say that about fire, the invention of the wheel. Yeah, a lot of things have evolved since the invention of the wheel, and here we are tricking sand into thinking. But we find ourselves just—now it seems that the outcome of all of this has been instead of one option that’s the de facto standard that’s kind of terrible in its own ways, now, we have an entire universe of different products, many of which are best-of-breed at one very specific thing, but nothing’s great at everything.

It’s the multifunction printer conundrum, where you find things that are great at one or two things at most, and then mediocre at best at the rest. I’m excited about the possibility for OpenTelemetry to really get to a point of best-of-breed for everything. But it also feels like the money folks are pushing for consolidation, if you believe a lot of the analyst reports around this of, “We already pay for seven different observability vendors. How about we knock it down to just one that does all of these things?” Because that would be terrible. Where do you land on that?

Austin: Well, as I intu—or alluded to this earlier, I think the consolidation in the observability space, in general, is very much driven by that force you just pointed out, right? The buyers want to consolidate more and more things into single tools. And I think there’s a lot of… there are reasons for that that—you know, there are good reasons for that, but I also feel like a lot of those reasons are driven by fundamentally telemetry-side concerns, right? So like, one example of this is if you were Large Business X, and you see—you are an engineering director and you get a report, that’s like, “We have eight different metrics products.” And you’re like, “That seems like a lot. Let’s just use Brand X.”

And Brand X will very, very happily tell you, like, “Oh, you just install our thing everywhere and you can get rid of all these other tools.” And usually, there’s two reasons that people pick tools, right? One reason is that they are forced to, and then they are forced to do a bunch of integration work to get whatever the old stuff was working in the new way, but the other reason is because they tried a bunch of different things and they found the one tool that actually worked for them. And what happens invariably in these sort of consolidation stories is, you know, the new vendor comes in on a shining horse to consolidate, and you wind up, instead of eight distinct metrics tools, now you have nine distinct metrics tools because there’s never any bandwidth for people to go back and, you know—your Nagios example, right, Nag—people still use Nagios every day. What’s the economic justification to take all those Nagios installs, if they’re working, and put them into something else, right?

What’s the economic justification to go and take a bunch of old software that hasn’t been touched for ten years that still runs and still does what it needs to do, like, where’s the incentive to go and re-instrument that with OpenTelemetry or anything else? It doesn’t necessarily exist, right? And that’s a pretty, I think, fundamental decision point in everyone’s observability journey, which is what do you do about all the old stuff? Because most of the stuff is the old stuff, and the worst part is, most of the stuff that you make money off of is the old stuff as well. So, you can’t ignore it, and if you’re spending, you know, millions and millions of dollars on the new stuff—like, there was a story that went around a while ago, I think, Coinbase spent something like, what, $60 million on Datadog… I hope they asked for it in real money and not Bitcoin. But—

Corey: Yeah, something I’ve noticed about all the vendors, and even Coinbase themselves, very few of them actually transact in cryptocurrency. It’s always cash on the barrelhead, so to speak.

Austin: Yeah, smart. But still, like, that’s an absurd amount of money [laugh] for any product or service, I would argue, right? But that’s just my perspective. I do think, though, it goes to show you that, you know, it’s very easy to get into these sort of things where you’re just spending over a barrel for, like, the newest vendor that’s going to come in and solve all your problems for you. And just, it often doesn’t work that way because most places—especially large organizations—just aren’t built in a way that’s like, “Oh, we can go through and we can just redo stuff,” right? “We can just roll out a new agent through… whatever.”

We have mainframes [unintelligible 00:25:09] to think about, you have… in many cases, you have an awful lot of business systems that most, kind of, cloud people don’t, like, think about, right, like SAP or Salesforce or ServiceNow, or whatever. And those sort of business process systems are actually responsible for quite a few things that are interesting from an observability point of view. But you don’t see—I mean, hell, you don’t even see OpenTelemetry going out and saying, like, “Oh, well, here’s the thing to let you, you know, observe Apex applications on Salesforce,” right? It’s kind of an undiscovered country in a lot of ways and it’s something that I think we will have to grapple with as we go forward. In the shorter term, there’s a reason that OpenTelemetry mostly focuses on cloud-native applications: because that’s a little bit easier to actually do what we’re trying to do on them and that’s where the heat and light is. But once we get done with that, then the sky is the limit.

[midroll 00:26:11]

Corey: It still feels like OpenTelemetry is evolving rapidly. It’s certainly not, I don’t want to say it’s not feature complete, which, again, what—software is never done. But it does seem like even quarter-to-quarter or month-to-month, its capabilities expand massively. Because you apparently enjoy pain, you’re in the process of writing a book. I think it’s in early release or early access that comes out next year, 2024. Why would you do such a thing?

Austin: That’s a great question. And if I ever figure out the answer I will tell you.

Corey: Remember, no one wants to write a book; they want to have written the book.

Austin: And the worst part is, is I have written the book and for some reason, I went back for another round. I—

Corey: It’s like childbirth. No one remembers exactly how horrible it was.

Austin: Yeah, my partner could probably attest to that. Although I was in the room, and I don’t think I’d want to do it either. So, I think the real, you know, the real reason that I decided to go and kind of write this book—and it’s Learning OpenTelemetry; it’s in early release right now on the O’Reilly learning platform and it’ll be out in print and digital next year, I believe, we’re targeting right now, early next year.

But the goal is, as you pointed out so eloquently, OpenTelemetry changes a lot. And it changes month to month sometimes. So, why would someone decide—say, “Hey, I’m going to write the book about learning this?” Well, there’s a very good reason for that and it is that I’ve looked at a lot of the other books out there on OpenTelemetry, on observability in general, and they talk a lot about, like, here’s how you use the API. Here’s how you use the SDK. Here’s how you make a trace or a span or a log statement or whatever. And it’s very technical; it’s very kind of in the weeds.

What I was interested in is saying, like, “Okay, let’s put all that stuff aside because you don’t necessarily…” I’m not saying any of that stuff’s going to change. And I’m not saying that how to make a span is going to change tomorrow; it’s not, but learning how to actually use something like OpenTelemetry isn’t just knowing how to create a measurement or how to create a trace. It’s, how do I actually use this in a production system? To my point earlier, how do I use this to get data about, you know, these quote-unquote, “Legacy systems?” How do I use this to monitor a Kubernetes cluster? What are the important parts of building these observability pipelines? If I’m maintaining a library, how should I integrate OpenTelemetry into that library for my users? And so on, and so on, and so forth.

And the answers to those questions actually probably aren’t going to change a ton over the next four or five years. Which is good because that makes it the perfect thing to write a book about. So, the goal of Learning OpenTelemetry is to help you learn not just how to use OpenTelemetry at an API or SDK level, but it’s how to build an observability pipeline with OpenTelemetry, it’s how to roll it out to an organization, it’s how to convince your boss that this is what you should use, both for new and maybe picking up some legacy development. It’s really meant to give you that sort of 10,000-foot view of what are the benefits of this, how does it bring value and how can you use it to build value for an observability practice in an organization?

Corey: I think that’s fair. Looking at the more quote-unquote, “Evergreen,” style of content as opposed to—like, that’s the reason for example, I never wind up doing tutorials on how to use an AWS service because one console change away and suddenly I have to redo the entire thing. That’s a treadmill I never had much interest in getting on. One last topic I want to get into before we wind up wrapping the episode—because I almost feel obligated to sprinkle this all over everything because the analysts told me I have to—what’s your take on generative AI, specifically with an eye toward observability?

Austin: [sigh], gosh, I’ve been thinking a lot about this. And—hot take alert—as a skeptic of many technological bubbles over the past five or so years, ten years, I’m actually pretty hot on AI—generative AI, large language models, things like that—but not for the reasons that people like to kind of hold them up, right? Not so that we can all make our perfect, funny [sigh], deep-dream meme characters or whatever through Stable Diffusion or whatever ChatGPT spits out at us when we ask for a joke. I think the real win here is that this to me is, like, the biggest advance in human-computer interaction since resistive touchscreens. Actually, probably since the mouse.

Corey: I would agree with that.

Austin: And I don’t know if anyone has tried to get someone that is, you know, over the age of 70 to use a computer at any time in their life, but mapping human language to trying to do something on an operating system or do something on a computer on the web is honestly one of the most challenging things that faces interface designers, faces OS designers, faces anyone. And I think this also applies to dev tools in general, right? Like, if you think about observability, if you think about, like, well, what are the actual tasks involved in observability? It’s like, well, you’re making—you’re asking questions. You’re saying, like, “Hey, for this metric named HTTPRequestsByCode,” and there’s four or five dimensions, and you say, like, “Okay, well, break this down for me.” You know, you have to kind of know the magic words, right? You have to know the magic PromQL sequence or whatever else to plug in and to get it to graph that for you.
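The “magic words” Austin mentions look something like this in PromQL (the metric and label names are illustrative, echoing the spoken example above):

```promql
# Request rate over the last 5 minutes, broken down by status code
sum by (code) (rate(http_requests_total[5m]))
```

The human-language equivalent—“break requests down by status code for me”—is exactly the kind of query a language-model interface could translate into this.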

And you as an operator have to have this very, very well-developed, like, depth of knowledge in math and statistics to really kind of get a lot of—

Corey: You must be at least this smart to ride on this ride.

Austin: Yeah. And I think that, to me, is the real—the short-term win, certainly for generative AI and large language models: the ability to create human language interfaces to observability tools, that—

Corey: As opposed to learning your own custom SQL dialect, which I see a fair number of times.

Austin: Right. And, you know, it’s actually very funny, because one of my kind of side projects for the past [sigh] little bit [unintelligible 00:32:31] idea of, like, well, can we make, like, a universal query language or universal query layer that you could ship your dashboards or ship your alerts or whatever with. And then generative AI kind of just, you know, completely leapfrogs that, right? It just says, like, well, why would you need a query language if you can just ask the computer and it works, right?

Corey: The most common programming language is about to become English.

Austin: Which I mean, there’s an awful lot of externalities there—

Corey: Which is great. I want to be clear. I’m not here to gatekeep.

Austin: Yeah. I mean, I think there’s a lot of externalities there, and the kind of hype-to-provable-benefit ratio is very skewed right now towards hype. That said, one of the things that is concerning to me as sort of an observability practitioner is the amount of people that are just, like, whole-hog throwing themselves into, like, oh, we need to integrate generative AI, right? Like, we need to put in AI chatbots and we need to have ChatGPT built into our products and da-da-da-da-da. And now you kind of have this perfect storm of people that really don’t understand what it’s doing—because they’re just using these APIs to integrate gen AI stuff—and, you know, a lot of it is very complex. I’ll be the first to admit that I really don’t understand what a lot of it is doing on the deep, foundational math side.

But if we’re going to have trust in, kind of, any kind of system, we have to understand what it’s doing, right? And so, the only way that we can understand what it’s doing is through observability, which means it’s incredibly important for organizations and companies that are building products on generative AI to, like, drop what you’re doing—don’t walk, run—towards something that is going to give you observability into these language models.

Corey: Yeah. “The computer said so,” is strangely dissatisfying.

Austin: Yeah. You need to have that base, you know, sort of, performance [goals and signals 00:34:31], obviously, but you also need to really understand what are the questions being asked. As an example, let’s say you have something that is tokenizing questions. You really probably do want to have some sort of observability on the hot path there that lets you kind of break down common tokens, especially if you were using, like, custom dialects or, like, vectors or whatever to modify the, you know, neural network model. Like, you really want to see, well, what’s the frequency of the certain tokens that I’m getting, and are they hitting the vectors or not, right? Like, where can I improve these sorts of things? Where am I getting, like, unexpected results?
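[Editor’s note: a minimal sketch of the token-frequency tracking Austin is gesturing at might look like the snippet below. The naive whitespace tokenizer, the `CUSTOM_VOCAB` set standing in for “tokens covered by your custom vectors,” and the sample questions are all illustrative assumptions, not anything from the episode or a real product.]

```python
from collections import Counter

# Hypothetical set of tokens covered by custom vectors/embeddings;
# in a real system this would come from the model configuration.
CUSTOM_VOCAB = {"opentelemetry", "latency", "collector"}

def record_tokens(question: str, hits: Counter, misses: Counter) -> None:
    """Tokenize a question (naive whitespace split) and count how often
    each token does or does not hit the custom vocabulary."""
    for token in question.lower().split():
        if token in CUSTOM_VOCAB:
            hits[token] += 1    # token covered by a custom vector
        else:
            misses[token] += 1  # uncovered token worth investigating

hits, misses = Counter(), Counter()
record_tokens("Why is OpenTelemetry latency high", hits, misses)
record_tokens("OpenTelemetry collector setup", hits, misses)

print(hits.most_common(3))    # most frequent covered tokens
print(misses.most_common(3))  # most frequent uncovered tokens
```

Feeding these counters into a metrics pipeline (rather than printing them) is what would turn this into the hot-path observability described above.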

And maybe even have some sort of continuous feedback mechanism that could be either analyzing the tone and tenor of end-user responses, or you can have the little, like, frowny and happy face, whatever it is—something that is giving you that kind of constant feedback about, like, hey, this is how people are actually, like, interacting with it. Because I think there are way too many stories right now of people just kind of, like, saying, “Oh, okay. Here’s some AI-powered search,” and people just, like, hating it. Because people are already very primed to distrust AI, I think. And I can’t blame anyone.

Corey: Well, we’ve had an entire lifetime of movies telling us that’s going to kill us all.

Austin: Yeah.

Corey: And now you have a bunch of, also, billionaire tech owners who are basically intent on making that reality. But that’s neither here nor there.

Austin: It isn’t, but like I said, it’s difficult. It’s actually one of the first times I’ve been like—that I’ve found myself very conflicted.

Corey: Yeah, I’m a booster of this stuff; I love it, but at the same time, you have some of the ridiculous hype around it and the complete lack of attention to the safety and humanity aspects of it. I like the technology and I think it has a lot of promise, but I don’t want to get lumped in with that set.

Austin: Exactly. Like, the technology is great. The fan base is… ehh, maybe something a little different. But I do think that, for lack of a better—not to be an inevitable-ist or whatever, but I do think that there is a significant amount of, like, this is a genie you can’t put back in the bottle and it is going to have, like, wide-ranging, transformative effects on the discipline of, like, software development, software engineering, and white collar work in general, right? Like, there’s a lot of—if your job involves, like, putting numbers into Excel and making pretty spreadsheets, then ooh, that doesn’t seem like something that’s going to do too hot when I can just have Excel do that for me.

And I think we do need to be aware of that, right? Like, we do need to have that sort of conversation about, like… what are we actually comfortable doing here in terms of displacing human labor? When we do displace human labor, are we doing it so that we can actually give people leisure time or so that we can just cram even more work down the throats of the humans that are left?

Corey: And unfortunately, I think we might know what that answer is, at least on our current path.

Austin: That’s true. But you know, I’m an optimist.

Corey: I… don’t do well with disappointment. Which the show has certainly not been. I really want to thank you for taking the time to speak with me today. If people want to learn more, where’s the best place for them to find you?

Austin: Welp, you can find me on most social media. Many, many social medias. I used to be on Twitter a lot, and we all know what happened there. The best place to figure out what’s going on is to check out my bio, which will give you all the links to where I am. And yeah, it’s been great talking with you.

Corey: Likewise. Thank you so much for taking the time out of your day. Austin Parker, community maintainer for OpenTelemetry. I’m Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry comment pointing out that actually, physicists say the vast majority of the universe is empty space, so that we can later correct you by saying ah, but it’s empty whitespace. That’s right. YAML wins again.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit to get started.
