The Evolution of Cloud Services with Richard Hartmann

Episode Summary

Richard Hartmann, Director of Community at Grafana Labs, joins Corey to discuss the evolution of cloud services and the infrastructure behind them. Whether it’s how monitoring became observability, or how predicting server breakdowns at scale has become more of a science than simply grunting and pointing, Richard provides a balanced perspective on what users actually want and why they want it. Richard also reveals how pricing impacts community-building efforts, creating psychological safety for your users to promote long-term adoption, and why not optimizing your data makes about as much sense as storing 10,000 copies of Lord of the Rings in your living room.

Episode Show Notes & Transcript

About Richard

Richard "RichiH" Hartmann is the Director of Community at Grafana Labs, Prometheus team member, OpenMetrics founder, OpenTelemetry member, CNCF Technical Advisory Group Observability chair, CNCF Technical Oversight Committee member, CNCF Governing Board member, and more. He also leads, organizes, or helps run various conferences from hundreds to 18,000 attendees, including KubeCon, PromCon, FOSDEM, DENOG, DebConf, and Chaos Communication Congress. In the past, he made mainframe databases work, ISP backbones run, kept the largest IRC network on Earth running, and designed and built a datacenter from scratch. Go through his talks, podcasts, interviews, and articles at https://github.com/RichiH/talks or follow him on Twitter at https://twitter.com/TwitchiH for musings on the intersection of technology and society.

Links Referenced:

Grafana Labs: https://grafana.com/
Twitter: https://twitter.com/TwitchiH
Richard Hartmann list of talks: https://github.com/richih/talks

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is sponsored in part by our friends at AWS AppConfig. Engineers love to solve, and occasionally create, problems. But not when it’s an on-call fire-drill at 4 in the morning. Software problems should drive innovation and collaboration, NOT stress, and sleeplessness, and threats of violence. That’s why so many developers are realizing the value of AWS AppConfig Feature Flags. Feature Flags let developers push code to production, but hide that feature from customers so that the developers can release their feature when it’s ready. This practice allows for safe, fast, and convenient software development. You can seamlessly incorporate AppConfig Feature Flags into your AWS or cloud environment and ship your Features with excitement, not trepidation and fear. To get started, go to snark.cloud/appconfig. That’s snark.cloud/appconfig.

Corey: This episode is brought to us in part by our friends at Datadog. Datadog's SaaS monitoring and security platform enables full stack observability for developers, IT operations, security, and business teams in the cloud age. Datadog's platform, along with 500 plus vendor integrations, allows you to correlate metrics, traces, logs, and security signals across your applications, infrastructure, and third party services in a single pane of glass.

Combine these with drag and drop dashboards and machine learning based alerts to help teams troubleshoot and collaborate more effectively, prevent downtime, and enhance performance and reliability. Try Datadog in your environment today with a free 14 day trial and get a complimentary T-shirt when you install the agent.

To learn more, visit datadoghq/screaminginthecloud. That's www.datadoghq/screaminginthecloud.

Corey: Welcome to Screaming in the Cloud, I’m Corey Quinn. There are an awful lot of people who are incredibly good at understanding the ins and outs and the intricacies of the observability world. But they didn’t have time to come on the show today. Instead, I am talking to my dear friend of two decades now, Richard Hartmann, better known on the internet as RichiH, who is the Director of Community at Grafana Labs, here to suffer—in a somewhat atypical departure for the theme of this show—personal attacks for once. Richie, thank you for joining me.

Richard: And thank you for agreeing on personal attacks.

Corey: Exactly. It was one of your riders. Like, there have to be the personal attacks back and forth or you refuse to appear on the show. You’ve been on before. In fact, the last time we did a recording, I believe you were here in person, which was a long time ago. What have you been up to?

You’re still at Grafana Labs. And in many cases, I would point out that, wow, you’ve been there for many years; that seems to be an atypical thing, which is an American tech industry perspective because every time you and I talk about this, you look at folks who—wow, you were only at that company for five years. What’s wrong with you—you tend to take the longer view and I tend to have the fast twitch, time to go ahead and leave jobs because it’s been more than 20 minutes approach. I see that you’re continuing to live what you preach, though. How’s it been?

Richard: Yeah, so there’s a little bit of Covid brains, I think. When we talked in 2018, I was still working at SpaceNet, building a data center. But the last two-and-a-half years didn’t really happen for many people, myself included. So, I guess [laugh] that includes you.

Corey: No, no you’re right. You’ve only been at Grafana Labs a couple of years. One would think I would check the notes for shooting my mouth off. But then, one wouldn’t know me.

Richard: What notes? Anyway, I’ve been around Prometheus and Grafana since 2015. But it’s like, real, full-time everything is 2020. There was something in between. Since 2018, I contracted to do vulnerability handling and everything for Grafana Labs because they had something and they didn’t know how to deal with it.

But no, full time is 2020. But as to the space in the [unintelligible 00:02:45] of itself, it’s maybe a little bit German of me, but trying to understand the real world and trying to get an overview of systems and how they actually work, and if they are working correctly and as intended, and if not, how they’re not working as intended, and how to fix this is something which has always been super important to me, in part because I just want to understand the world. And this is a really, really good way to automate understanding of the world. So, it’s basically a work-saving mechanism. And that’s why I’ve been sticking to it for so long, I guess.

Corey: Back in the early days of monitoring systems—so we called it monitoring back then because, you know, using simple words that lack nuance was sort of de rigueur back then—we wound up effectively having tools. Nagios is the one that springs to mind, and it was terrible in all the ways you would expect a tool written in janky Perl in the early-2000s to be. But it told you what was going on. It tried to do a thing, generally reach a server or query it about things, and when things fell out of certain specs, it screamed its head off, which meant that when you had things like the core switch melting down—thinking of one very particular incident—you didn’t get a Nagios alert; you got 4000 Nagios alerts. But start to finish, you could wrap your head rather fully around what Nagios did and why it did the sometimes strange things that it did.

These days, when you take a look at Prometheus, which we hear a lot about, particularly in the Kubernetes space and Grafana, which is often mentioned in the same breath, it’s never been quite clear to me exactly where those start and stop. It always feels like it’s a component in a larger system to tell you what’s going on rather than a one-stop shop that’s going to, you know, shriek its head off when something breaks in the middle of the night. Is that the right way to think about it? The wrong way to think about it?

Richard: It’s a way to think about it. So personally, I use the terms monitoring and observability pretty much interchangeably. Observability is a relatively well-defined term, even though most people won’t agree. But if you look back into the ’70s into control theory where the term is coming from, it is the measure of how much you’re able to determine the internal state of a system by looking at its inputs and its outputs. Depending on the definition, some people don’t include the inputs, but that is the OG definition as far as I’m aware.

And from this, there flow a lot of things. This question of—or this interpretation of the difference between telling that, yes, something’s broken versus why something’s broken. Or if you can’t ask new questions on the fly, it’s not observability. Like all of those things are fundamentally mapped to this definition of, I need enough data to determine the internal state of whatever system I have just by looking at what is coming in, what is going out. And that is at the core the thing. Now, obviously, it’s become a buzzword, which is oftentimes the fate of successful things. So, it’s become a buzzword, and you end up with cargo culting.

Corey: I would argue, periodically, that observability is hipster monitoring. If you call it monitoring, you get yelled at by Charity Majors. Which is tongue in cheek, but she has opinions, made, nonetheless shall I say, frustrating by the fact that she is invariably correct in those opinions, which just somehow makes it so much worse. It would be easy to dismiss things she says if she weren’t always right. And the world is changing, especially as we get into the world of distributed systems.

Is the server that runs the app working or not working loses meaning when we’re talking about distributed systems, when we’re talking about containers running on top of Kubernetes, which turns every outage into a murder mystery. We start having distributed applications composed of microservices, so you have no idea necessarily where an issue is. Okay, is this one microservice having an issue related to the request coming into a completely separate microservice? And it seems that for those types of applications, the answer has been tracing for a long time now, where originally that was something that felt like it was sprung, fully-formed from the forehead of some God known as one of the hyperscalers, but now is available to basically everyone, in theory.

In practice, it seems that instrumenting applications is still one of the hardest parts of all of this. I tried hooking up one of my own applications to be observed via OTEL, the OpenTelemetry project, and it turns out that right now, OTEL and AWS Lambda have an intersection point that makes everything extremely difficult to work with. It’s not there yet; it’s not baked yet. And someday, I hope that changes because I would love to interchangeably just throw metrics and traces and logs to all the different observability tools and see which ones work, which ones don’t, but that still feels very far away from the current state of the art.

Richard: Before we go there, maybe one thing which I don’t fully agree with. You said that previously, you were told if a service is up or down, that’s the thing which you cared about, and I don’t think that’s what people actually cared about. At that time, also, what they fundamentally cared about: is the user-facing service up, or down, or impacted? Is it slow? Does it return errors for X percent of requests, something like this?

Corey: Is the site up? And—you’re right, I was hand-waving over a whole bunch of things. It was, “Okay. First, the web server is returning a page, yes or no? Great. Can I ping the server?” Okay, well, there are ways a server can crash and still leave enough of the TCP/IP stack up that it can respond to pings and do little else.

And then you start adding things to it. But the Nagios thing that I always wanted to add—and had to—was, is the disk full? And that was annoying. And, on some level, like, why should I care in the modern era how much stuff is on the disk because storage is cheap and free and plentiful? The problem is, after the third outage in a month because the disk filled up, you start to not have a good answer for well, why aren’t you monitoring whether the disk is full?

And that was one of the contributors to taking down the server. When the website broke, there were what felt like a relatively small number of reasonably well-understood contributors to that at small to midsize applications, which is what I’m talking about, the only things that people would let me touch. I wasn’t running hyperscale stuff where you have a fleet of 10,000 web servers and, “Is the server up?” Yeah, in that scenario, no one cares. But when we’re talking about the database server and the two application servers and the four web servers talking to them, you think about it more in terms of pets than you do cattle.

Richard: Yes, absolutely. Yet, I think that was a mistake back then, and I tried to do it differently, as a specific example with the disk. And I’m absolutely agreeing that previous generation tools limit you in how you can actually work with your data. In particular, once you’re with metrics where you can do actual math on the data, it doesn’t matter if the disk is almost full. It matters if that disk is going to be full within X amount of time.

If that disk is 98% full and it sits there at 98% for ten years and provides the service, no one cares. The thing is, will it actually run out in the next two hours, in the next five hours, what have you. Depending on this, if this is currently or imminently customer-impacting or user-impacting, then yes, alert on it, raise hell, wake people, make them fix it, as opposed to this thing can be dealt with during business hours on the next workday. And you don’t have to wake anyone up.

Corey: Yeah. The big filer with massive amounts of storage has crossed the 70% line. Okay, now it’s time to start thinking about that, what do you want to do? Maybe it’s time to order another shelf of disks for it, which is going to take some time. That’s a radically different scenario than the 20 gigabyte root volume on your server just started filling up dramatically; the rate of change is such that it’ll be full in 20 minutes.

Yeah, one of those is something you want to wake people up for. Generally speaking, you don’t want to wake people up for what is fundamentally a longer-term strategic business problem. That can be sorted out in the light of day versus, “[laugh] we’re not going to be making money in two hours if I don’t wake up and fix this now.” That’s the kind of thing you generally want to be woken up for. Well, let’s be honest, you don’t want that to happen at all, but if it does happen, you kind of want to know in advance rather than after the fact.

Richard: You’re literally describing predict_linear from Prometheus, which is precisely for this, where I can look back over X amount of time and make a linear prediction because everything else breaks down at scale, blah, blah, blah, too detailed. But the thing is, I can draw a line with my pencil by hand on my data and I can predict when this thing is going to hit. Which is obviously precisely correct if I have a TLS certificate. It’s a little bit more hand-wavy when it’s a disk. But still, you can look into the future and you say, “What will be happening if current trends for the last X amount of time continue in Y amount of time.” And that’s precisely a thing where you get this more powerful ability of doing math with your data.
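[Editor's note: The idea Richard describes, fitting a straight line through recent samples and extrapolating it forward, can be sketched in a few lines of Python. This is an illustrative sketch only, not Prometheus's actual implementation: the function names `predict_linear_sketch` and `should_page`, the sample format of (timestamp in seconds, free bytes), and the choice to extrapolate from the newest sample's timestamp are all assumptions made for the example.]

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, observed value)


def predict_linear_sketch(samples: Sequence[Sample], seconds_ahead: float) -> float:
    """Fit a least-squares line through the samples and extrapolate it.

    Hypothetical helper: the prediction is made for `seconds_ahead` seconds
    after the newest sample's timestamp.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("all samples share the same timestamp")
    slope = cov / var
    intercept = mean_v - slope * mean_t
    newest_t = max(t for t, _ in samples)
    return slope * (newest_t + seconds_ahead) + intercept


def should_page(free_bytes: Sequence[Sample], horizon_seconds: float) -> bool:
    """Page only if the trend says free space runs out within the horizon."""
    return predict_linear_sketch(free_bytes, horizon_seconds) <= 0


if __name__ == "__main__":
    # Free space shrinking by 10 units per minute, starting from 100.
    shrinking = [(0.0, 100.0), (60.0, 90.0), (120.0, 80.0)]
    # A disk that is nearly full but not changing at all.
    steady = [(0.0, 2.0), (60.0, 2.0), (120.0, 2.0)]
    print(should_page(shrinking, 600))  # trend crosses zero within 10 minutes
    print(should_page(steady, 600))     # nearly full, but flat trend
```

[The second case is the "98% full for ten years" disk from earlier in the conversation: a flat trend never crosses zero, so nobody gets woken up, whereas a fixed fullness threshold would have fired.]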

Corey: See, when you say it like that, it sounds like it actually is a whole term of art, where you’re focusing on an in-depth field, where salaries are astronomical. Whereas the tools that I had to talk about this stuff back in the day made me sound like, effectively, the sysadmin that I was, grunting and pointing: “This is gonna fill up.” And that is how I thought about it. And this is the challenge where it’s easy to think about these things in narrow, defined contexts like that, but at scale, things break.

Like the idea of anomaly detection. Well, okay, great if normally, the CPU and these things are super bored and suddenly it gets really busy, that’s atypical. Maybe we should look into it, assuming that it has a challenge. The problem is, that is a lot harder than it sounds because there are so many factors that factor into it. And as soon as you have something, quote-unquote, “Intelligent,” making decisions on this, it doesn’t take too many false positives before you start ignoring everything it has to say, and missing legitimate things. It’s this weird and obnoxious conflation of both hard technical problems and human psychology.

Richard: And the breaking up of old service boundaries. Of course, when you say microservices, and such, fundamentally, functionally a microservice or nanoservice, picoservice—but the pendulum is already swinging back to larger units of complexity—but it fundamentally does not make any difference if I have a monolith on some mainframe or if I have a bunch of microservices. Yes, I can scale differently, I can scale horizontally a lot more easily, vertically, it’s a little bit harder, blah, blah, blah, but fundamentally, the logic and the complexity which is being packaged is fundamentally the same. More users, everything, but it is fundamentally the same. What’s happening again and again is I’m breaking up those old boundaries, which means the old tools which have assumptions built in about certain aspects of how I can actually get an overview of a system just start breaking down. When my complexity unit or my service or what have you is usually congruent with a physical piece of hardware, or several services are congruent with that piece of hardware, it absolutely makes sense to think about things in terms of this one physical server. The fact that you have different considerations in cloud, and microservices, and blah, blah, blah, is not inherently that it is more complex.

On the contrary, it is fundamentally the same thing. It scales with users, everything, but it is fundamentally the same thing, but I have different boundaries of where I put interfaces onto my complexity, which basically allow me to hide all of this complexity from the downstream users.

Corey: That’s part of the challenge that I think we’re grappling with across this entire industry from start to finish. Where we originally looked at these things and could reason about it because it’s the computer and I know how those things work. Well, kind of, but okay, sure. But then we start layering levels of complexity on top of layers of complexity on top of layers of complexity, and suddenly, when things stop working the way that we expect, it can be very challenging to unpack and understand why. One of the ways I got into this whole space was understanding, to some degree, of how system calls work, of how the kernel wound up interacting with userspace, about how Linux systems worked from start to finish. And these days, that isn’t particularly necessary most of the time for the care and feeding of applications.

The challenge is when things start breaking, suddenly having that in my back pocket to pull out could be extremely handy. But I don’t think it’s nearly as central as it once was and I don’t know that I would necessarily advise someone new to this space to spend a few years as a systems person, digging into a lot of those aspects. And this is why you need to know what inodes are and how they work. Not really, not anymore. It’s not front and center the way that it once was, in most environments, at least in the world that I live in. Agree? Disagree?

Richard: Agreed. But it’s very much unsurprising. You probably can’t tell me how to precisely grow sugar cane or corn, you can’t tell me how to refine the sugar out of it, but you can absolutely bake a cake. But you will not be able to tell me even a third of—and I’m—for the record, I’m also not able to tell you even a third about the supply chain which just goes from I have a field and some seeds and I need to have a package of refined sugar—you’re absolutely enabled to do any of this. The thing is, you’ve been part of the previous generation of infrastructure where you know how this underlying infrastructure works, so you have more ability to reason about this, but it’s not needed for cloud services nearly as much.

You need different types of skill sets, but that doesn’t mean the old skill set is completely useless, at least not as of right now. It’s much more a case of you need fewer of those people and you need them in different places because those things have become infrastructure. Which is basically the cloud play, where a lot of this is just becoming infrastructure more and more.

Corey: Oh, yeah. Back then I distinctly remember my elders looking down their noses at me because I didn’t know assembly, and how could I possibly consider myself a competent systems admin if I didn’t at least have a working knowledge of assembly? Or at least C, which I, over time, learned enough about to know that I didn’t want to be a C programmer. And you’re right, this is the value of cloud and going back to those days getting a web server up and running just to compile Apache’s httpd took a week and an in-depth knowledge of GCC flags.

And then in time, oh, great. We’re going to have rpm or debs. Great, okay, then in time, you have apt, if you’re in Debian land because I know you are a Debian developer, but over in Red Hat land, we had yum and other tools. And then in time, it became oh, we can just use something like Puppet or Chef to wind up ensuring that thing is installed. And then oh, just docker run. And now it’s a checkbox in a web console for S3.

These things get easier with time and step by step by step we’re standing on the shoulders of giants. Even in the last ten years of my career, I used to have a great challenge question that I would interview people with of, “Do you know what TinyURL is? It takes a short URL and then expands it to a longer one. Great, on the whiteboard, tell me how you would implement that.” And you could go up one side and down the other, and then you could add constraints, multiple data centers, now one goes offline, how do you not lose data? Et cetera, et cetera.
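[Editor's note: For readers who have not met this interview question, the simplest single-machine answer might look like the Python sketch below. It is a whiteboard-level illustration under stated assumptions, not how TinyURL or any real service works: the `Shortener` class, the in-memory dictionaries, and the counter-plus-base-62 scheme are all hypothetical choices, and none of the follow-up constraints Corey lists (multiple data centers, failover, durability) are addressed.]

```python
import string
from typing import Dict, Optional

# 62 symbols: digits, then lowercase letters, then uppercase letters.
ALPHABET = string.digits + string.ascii_letters


def encode_base62(number: int) -> str:
    """Encode a non-negative integer as a base-62 string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, len(ALPHABET))
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


class Shortener:
    """In-memory URL shortener: a counter supplies IDs, base-62 makes codes."""

    def __init__(self) -> None:
        self._next_id = 1
        self._code_to_url: Dict[str, str] = {}
        self._url_to_code: Dict[str, str] = {}

    def shorten(self, long_url: str) -> str:
        """Return the existing code for a URL, or mint a new one."""
        existing = self._url_to_code.get(long_url)
        if existing is not None:
            return existing
        code = encode_base62(self._next_id)
        self._next_id += 1
        self._code_to_url[code] = long_url
        self._url_to_code[long_url] = code
        return code

    def expand(self, code: str) -> Optional[str]:
        """Return the long URL for a code, or None if the code is unknown."""
        return self._code_to_url.get(code)


if __name__ == "__main__":
    shortener = Shortener()
    code = shortener.shorten("https://example.com/a/very/long/path")
    print(code, shortener.expand(code))
```

[The interesting part of the interview question starts where this sketch stops: the single counter and the two dictionaries are exactly the pieces that become hard once there is more than one machine.]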

But these days, there are so many ways to do that using cloud services that it almost becomes trivial. It’s okay, multiple data centers, API Gateway, a Lambda, and a global DynamoDB table. Now, what? “Well, now it gets slow. Why is it getting slow?”

“Well, in that scenario, probably because of something underlying the cloud provider.” “And so now, you lose an entire AWS region. How do you handle that?” “Seems to me when that happens, the entire internet’s kind of broken. Do people really need longer URLs?”

And that is a valid answer, in many cases. The question doesn’t really work without a whole bunch of additional constraints that make it sound fake. And that’s not a weakness. That is the fact that computers and cloud services have never been as accessible as they are now. And that’s a win for everyone.

Richard: There’s one aspect of accessibility which is actually decreasing—or two. A, you need to pay for them on an ongoing basis. And B, you need an internet connection which is suitably fast, low latency, what have you. And those are things which actually do make things harder for a variety of reasons. If I look at our back-end systems—as in Grafana—all of them have single binary modes where you literally compile everything into a single binary and you can run it on your laptop because if you’re stuck on a plane, you can’t do any work on it. That kind of is not the best of situations.

And if you have a huge CI/CD pipeline, everything in this cloud and fine and dandy, but your internet breaks. Yeah, so I do agree that it is becoming generally more accessible. I disagree that it is becoming more accessible along all possible axes.

Corey: I would agree. There is a silver lining to that as well, where yes, they are fraught and dangerous and I would preface this with a whole bunch of warnings, but from a cost perspective, all of the cloud providers do have a free tier offering where you can kick the tires on a lot of these things in return for no money. Surprisingly, the best one of those is Oracle Cloud where they have an unlimited free tier, use whatever you want in this subset of services, and you will never be charged a dime. As opposed to the AWS model of free tier where well, okay, it suddenly got very popular or you misconfigured something, and surprise, you now owe us enough money to buy Belize. That doesn’t usually lead to a great customer experience.

But you’re right, you can’t get away from needing an internet connection of at least some level of stability and throughput in order for a lot of these things to work. The stuff you would do locally on a Raspberry Pi, for example, if you’re budget-constrained and want to get something out here, or your laptop. Great, that’s not going to work in the same way as a full-on cloud service will.

Richard: It’s not free unless you have hard guarantees that you’re not going to ever pay anything. It’s fine to send warnings, it’s fine to switch the thing off, it’s fine to have you hit random hard and soft quotas. It is not a free service if you can’t guarantee that it is free.

Corey: I agree with you. I think that there needs to be a free offering where, “Well, okay, you want us to suddenly stop serving traffic to the world?” “Yes. When the alternative is you have to start charging me through the nose, yes I want you to stop serving traffic.” That is definitionally what it says on the tin.

And as an independent learner, that is what I want. Conversely, if I’m an enterprise, yeah, I don’t care about money; we’re running our Superbowl ad right now, so whatever you do, don’t stop serving traffic. Charge us all the money. And there’s been a lot of hand wringing about, well, how do we figure out which direction to go in? And it’s, have you considered asking the customer?

So, on a scale of one to bank, how serious is this account going to be [laugh]? Like, what are your big concerns: never charge me or never go down? Because we can build for either of those. Just let’s make sure that all of those expectations are aligned. Because if you guess, you’re going to get it wrong and then no one’s going to like you.

Richard: I would argue this. All those services from all cloud providers actually build to address both of those. It’s a deliberate choice not to offer certain aspects.

Corey: Absolutely. When I talk to AWS, like, “Yeah, but there is an eventual consistency challenge in the billing system where it takes”—as anyone who’s looked at the billing system can see—“multiple days, sometimes, for usage data to show up. So, how would we be able to stop things if the usage starts climbing?” To which my relatively direct response is, that sounds like a huge problem. I don’t know how you’d fix that, but I do know that if suddenly you decide, as a matter of policy, to okay, if you’re in the free tier, we will not charge you, or even we will not charge you more than $20 a month.

So, you build yourself some headroom, great. And anything that people are able to spin up, well, you’re just going to have to eat the cost as a provider. I somehow suspect that would get fixed super quickly if that were the constraint. The fact that it isn’t is a conscious choice.

Richard: Absolutely.

Corey: And the reason I’m so passionate about this, about the free space, is not because I want to get a bunch of things for free. I assure you I do not. I mean, I spend my life fixing AWS bills and looking at AWS pricing, and my argument is very rarely, “It’s too expensive.” It’s that the billing dimension is hard to predict or doesn’t align with a customer’s experience or prices a service out of a bunch of use cases where it’ll be great. But very rarely do I just sit here shaking my fist and saying, “It costs too much.”

The problem is when you scare the living crap out of a student with a surprise bill that’s more than their entire college tuition, even if you waive it a week or so later, do you think they’re ever going to be as excited as they once were to go and use cloud services and build things for themselves and see what’s possible? I mean, you and I met on IRC 20 years ago because back in those days, the failure mode and the risk financially was extremely low. It’s yeah, the biggest concern that I had back then when I was doing some of my Linux experimentation is if I typed the wrong thing, I’m going to break my laptop. And yeah, that happened once or twice, and I’ve learned not to make those same kinds of mistakes, or put guardrails in so the blast radius was smaller, or use a remote system instead. Yeah, someone else’s computer that I can destroy. Wonderful. But that was on we live and we learn as we were coming up. There was never an opportunity for us, to my understanding, to wind up accidentally running up an $8 million charge.

Richard: Absolutely. And psychological safety is one of the most important things in what most people do. We are social animals. Without this psychological safety, you’re not going to have long-term, self-sustaining groups. You will not make someone really excited about it. There’s two basic ways to sell: trust or force. Those are the two ones. There’s none else.

Corey: Managing shards. Maintenance windows. Overprovisioning. ElastiCache bills. I know, I know. It's a spooky season and you're already shaking. It's time for caching to be simpler. Momento Serverless Cache lets you forget the backend to focus on good code and great user experiences. With true autoscaling and a pay-per-use pricing model, it makes caching easy. No matter your cloud provider, get going for free at gomomento.co/screaming. That's GO M-O-M-E-N-T-O dot co slash screaming.

Corey: Yeah. And it also looks ridiculous. I was talking to someone somewhat recently who’s used to spending four bucks a month on their AWS bill for some S3 stuff. Great. Good for them. That’s awesome. Their credentials got compromised. Yes, that is on them to some extent. Okay, great.

But now after six days, they were told that they owed $360,000 to AWS. And I don’t know how, as a cloud company, you can sit there and ask a student to do that. That is not a realistic thing. They are what is known, in the United States at least, in the world of civil litigation as quote-unquote, “Judgment proof,” which means, great, you could wind up finding that someone owes you $20 billion. Most of the time, they don’t have that, so you’re not able to recoup it. Yeah, the judgment feels good, but you’re never going to see it.

That’s the problem with something like that. It’s yeah, I would declare bankruptcy long before, as a student, I wound up paying that kind of money. And I don’t hear any stories about them releasing the collection agency hounds against people in that scenario. But I couldn’t guarantee that. I would never urge someone to ignore that bill and see what happens.

And it’s such an off-putting thing that, from my perspective, is beneath the company. And let’s be clear, I see this behavior at times on Google Cloud, and I see it on Azure as well. This is not something that is unique to AWS, but they are the 800-pound gorilla in the space, and that’s important. Or as I just want to mention right now, like, as I—because I was about to give you crap for this, too, but if I go to grafana.com, it says, and I quote, “Play around with the Grafana Stack. Experience Grafana for yourself, no registration or installation needed.”

Good. I was about to yell at you if it’s, “Oh, just give us your credit card and go ahead and start spinning things up and we won’t charge you. Honest.” Even your free account does not require a credit card; you’re doing it right. That tells me that I’m not going to get a giant surprise bill.

Richard: You have no idea how much thought and work went into our free offering. There was a lot of math involved.

Corey: None of this is easy, I want to be very clear on that. Pricing is one of the hardest things to get right, especially in cloud. And it also, when you get it right, it doesn’t look like it was that hard for you to do. But I fix [sigh] people’s AWS bills for a living and still, five or six years in, one of the hardest things I still wrestle with is pricing engagements. It’s incredibly nuanced, incredibly challenging, and at least for services in the cloud space where you’re doing usage-based billing, that becomes a problem.

But glancing at your pricing page, you do hit the two things that are incredibly important to me. The first one is use something for free. As an added bonus, you can use it forever. And I can get started with it right now. Great. When I go and look at your pricing page or I want to use your product and it tells me to ‘click here to contact us,’ that tells me it’s an enterprise sales cycle, it’s got to be really expensive, and I’m not solving my problem tonight.

Whereas the other side of it, the enterprise offering needs to be ‘contact us’ and you do that, that speaks to the enterprise procurement people who don’t know how to sign a check that doesn’t have two commas in it, and they want to have custom terms and all the rest, and they’re prepared to pay for that. If you don’t have that, you look too small-time. When it doesn’t matter what price you put on it, you wind up offering your enterprise tier at some large number, it’s yeah, for some companies, that’s a small number. You don’t necessarily want to back yourself in, depending upon what the specific needs are. You’ve gotten that right.

Every common criticism that I have about pricing, you folks have gotten right. And I definitely can pick up on your fingerprints on a lot of this. Because it sounds like a weird thing to say of, “Well, he’s the Director of Community, why would he weigh in on pricing?” It’s, “I don’t think you understand what community is when you ask that question.”

Richard: Yes, I fully agree. It’s super important to get pricing right, or to get many things right. And usually the things which just feel naturally correct are the ones which took the most effort and the most time and everything. And yes, at least from the—like, I was in those conversations or part of them, and the one thing which was always clear is when we say it’s free, it must be free. When we say it is forever free, it must be forever free. No games, no lies, do what you say and say what you do. Basically.

We have things where initially you get certain pro features and you can keep paying and you can keep using them, or after X amount of time they go away. Things like these are built in because that’s what people want. They want to play around with the whole thing and see, hey, is this actually providing me value? Do I want to pay for this feature which is nice or this and that plugin or what have you? And yeah, you’re also absolutely right that once you leave these constraints of basically self-serve cloud, you are talking about bespoke deals, but you’re also talking about okay, let’s sit down, let’s actually understand what your business is: what are your business problems? What are you going to solve today? What are you trying to solve tomorrow?

Let us find a way of actually supporting you and invest into a mutual partnership and not just grab the money and run. We have extremely low churn for, I would say, pretty good reasons. Because this thing about our users, our customers being successful, we do take it extremely seriously.

Corey: It’s one of those areas that I just can’t shake the feeling is underappreciated industry-wide. And the reason I say that this is your fingerprints on it is because if this had been wrong, you have a lot of… we’ll call them idiosyncrasies, where there are certain things you absolutely will not stand for, and misleading people and tricking them into paying money is high on that list. One of the reasons we’re friends. So yeah, when I say I see your fingerprints on this, it’s yeah, if this hadn’t been worked out the way that it is, you would not still be there. One other thing that I wanted to call out about, well, I guess it’s a confluence of pricing and logging and the rest, I look at your free tier, and it offers up to 50 gigabytes of ingest a month.

And it’s easy for me to sit here and compare that to other services, other tools, and other logging stories, and then I have to stop and think for a minute that yeah, discs have gotten way bigger, and internet connections have gotten way faster, and even the logs have gotten way wordier. I still am not sure that most people can really contextualize just how much logging fits into 50 gigs of data. Do you have any, I guess, ballpark examples of what that looks like? Because it’s been long enough since I’ve been playing in these waters that I can’t really contextualize it anymore.

Richard: Lord of the Rings is roughly five megabytes. It’s actually less. So, we’re talking literally 10,000 Lord of the Rings, which you can just shove in us and we’re just storing this for you. Which also tells you that you’re not going to be reading any of this. Or some of it, yes, but not all of it. You need better tooling and you need proper tooling.
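As a quick check of that arithmetic, here is a minimal sketch. It assumes decimal units (1 GB = 1,000 MB) and takes the rough five-megabyte figure quoted above at face value; the variable names are purely illustrative.

```python
# Back-of-the-envelope check of the "10,000 copies" figure.
# Assumptions: decimal units (1 GB = 1000 MB) and ~5 MB per copy, as quoted.
FREE_TIER_GB = 50
BOOK_MB = 5

copies = FREE_TIER_GB * 1000 // BOOK_MB
print(copies)  # 10000
```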

And some of this is more modern. Some of this is where we actually pushed the state of the art. But I’m also biased. But I, for myself, do claim that we did push the state of the art here. But at the same time you come back to those absolute fundamentals of how humans deal with data.

If you look back basically as far as we have writing—literally 6000 years ago, is the oldest writing—humans have always dealt with information with the state of the world in very specific ways. A, is it important enough to even write it down, to even persist it in whatever persistence mechanisms I have at my disposal? If yes, write a detailed account or record a detailed account of whatever the thing is. But it turns out, this is expensive and it’s not what you need. So, over time, you optimize towards only taking down key events and only noting key events. Maybe with their interconnections, but fundamentally, the key events.

As your data grows, as you have more stuff, as this still is important to your business and keeps being more important to—or doesn’t even need to be a business; can be social, can be whatever—whatever thing it is, it becomes expensive, again, to retain all of those key events. So, you turn them into numbers and you can do actual math on them. And that’s this path which you’ve seen again, and again, and again, and again, throughout humanity’s history. Literally, as long as we have written records, this has played out again, and again, and again, and again, for every single field which humans actually cared about. At different times, like, power networks are way ahead of this, but fundamentally power networks work on metrics, but for transient load spikes and everything, they have logs built into their power measurement devices, but those are only few and far between. Of course, the main thing is just metrics, time-series. And you see this again, and again.

You also were a sysadmin; in internet-related things, all switches have been metrics-based or metrics-first for basically forever, for 20, 30 years. But that stands to reason. Of course, the internet is running roughly 20 years ahead of the cloud scale-wise, because obviously you need the internet; otherwise you wouldn’t be having a cloud. So, all of those growing pains, why metrics are all of a sudden the thing, or have been for a few years now, is basically, of course, people who were writing software, providing their own software services, hit the scaling limitations which you hit for Internet service providers two decades, three decades ago. But fundamentally, you have this complete system. Basically profiles or distributed tracing depending on how you view distributed tracing.

You can also argue that distributed tracing is key events which are linked to each other. Logs sit firmly in the key event thing and then you turn this into numbers and that is metrics. And that’s basically it. You have extremes at the ends where you can have valid, depending on your circumstances, engineering trade-offs of where you invest the most, but fundamentally, that is why those always appear again in humanity’s dealing with data, and observability is no different.

Corey: I take a look at last month’s AWS bill. Mine is pretty well optimized. It’s a bit over 500 bucks. And right around 150 of that is various forms of logging and detecting change in the environment. And on the one hand, I sit here, and I think, “Oh, I should optimize that,” because the value of those logs to me is zero.

Except that whenever I have to go in and diagnose something or respond to an incident or have some forensic exploration, they then are worth an awful lot. And I am prepared to pay 150 bucks a month for that because the potential value of having that when the time comes is going to be extraordinarily useful. And it basically just feels like a tax on top of what it is that I’m doing. The same thing happens with application observability where, yeah, when you just want the big substantial stuff, yeah, until you’re trying to diagnose something. But in some cases, yeah, okay, then crank up the verbosity and then look for it.

But if you’re trying to figure it out after an event that isn’t likely or hopefully won’t recur, you’re going to wish that you spent a little bit more on collecting data out of it. You’re always going to be wrong, you’re always going to be unhappy, on some level.

Richard: Ish. You could absolutely be optimizing this. I mean, for $500, it’s probably not worth your time unless you take it as an exercise, but outside of due diligence where you need specific logs tied to—or specific events tied to specific times, I would argue that a lot of the problems with logs is just dealing with it wrong. You have this one extreme of full-text indexing everything, and you have this other extreme of a data lake—which is just a euphemism for never looking at the data again—to keep storage vendors happy. There is an in between.

Again, I’m biased, but like for example, with Loki, you have those same label sets as you have on your metrics with Prometheus, and you have literally the same, which means you only index that part and you only extract on ingestion time. If you don’t have structured logs yet, only put the metadata about whatever you care about extracted and put it into your label set and store this, and that’s the only thing you index. But it goes further than just this. You can also turn those logs into metrics.
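The "only index the label set" idea can be pictured with a toy model. This is a sketch under stated assumptions, not how Loki is actually implemented: the `streams` store and the `push` and `query` helpers are hypothetical names, and the label values and log lines are made up for illustration. The only thing looked up by key is the label set; the log text itself is stored as-is and scanned only after the labels have narrowed things down.

```python
from collections import defaultdict

# Toy model only: streams are keyed by their label set; log lines are
# stored as-is and never full-text indexed.
streams = defaultdict(list)

def push(labels, line):
    """Append a raw log line to the stream identified by its labels."""
    streams[frozenset(labels.items())].append(line)

def query(selector, needle=""):
    """Pick streams via the label index, then scan only those lines."""
    wanted = set(selector.items())
    return [
        line
        for key, lines in streams.items()
        if wanted <= key
        for line in lines
        if needle in line
    ]

push({"app": "checkout", "env": "prod"}, "GET /cart 200")
push({"app": "checkout", "env": "prod"}, "POST /pay 500 upstream timeout")
push({"app": "search", "env": "prod"}, "GET /q 500 index unavailable")

hits = query({"app": "checkout"}, needle=" 500 ")
print(hits)  # ['POST /pay 500 upstream timeout']
```

The trade-off in this sketch is that the index stays tiny because it only ever contains label sets, while the text search is a plain scan over the lines of the selected streams.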

And to me this is a path of optimization. Where previously I logged this and that error. Okay, fine, but it’s just a log line telling me it’s HTTP 500. No one cares that this is at this precise time. Log levels are also basically an anti-pattern because they’re just trying to deal with the amount of data which I have, and try and get a handle on this on that level whereas it would be much easier if I just counted every time I have an HTTP 500, I just up my counter by one. And again, and again, and again.
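Counting each HTTP 500 instead of storing a line for it might look like the sketch below. Everything here is an assumption for illustration: the log line is invented, the counter is a plain in-process `Counter` rather than a real metrics client, and the eight-byte figure ignores timestamps, scrape intervals, and compression, so the printed saving is not meant to match any real-world number.

```python
from collections import Counter

# Hypothetical log line; real lines and real savings will differ.
LINE = '2022-10-04T12:00:00Z level=error msg="request failed" status=500\n'
EVENTS = 1000

# Logging approach: every occurrence is stored in full.
log_bytes = len(LINE.encode()) * EVENTS

# Counting approach: the same events collapse into one number.
errors = Counter()
for _ in range(EVENTS):
    errors["http_500"] += 1

counter_bytes = 8  # assumption: a single 64-bit value
saved = 1 - counter_bytes / log_bytes
print(errors["http_500"], f"{saved:.2%}")
```

In a real service the counter would normally come from a metrics client library and be sampled over time, so the stored size grows with the number of samples rather than with the number of errors.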

And all of a sudden, I have literally—and I did the math on this—over 99.8% of the data which I have to store just goes away. It’s just magic the way—and we’re only talking about the first time I’m hitting this logline. The second time I’m hitting this logline is functionally free if I turn this into metrics. It becomes cheap enough that one of the mantras which I have, if you need to onboard your developers on modern observability, blah, blah, blah, blah, blah, the whole bells and whistles, usually people have logs, like that’s what they have, unless they were from ISPs or power companies, or so; there they usually start with metrics.

But most users, which I see both with my Grafana and with my Prometheus [unintelligible 00:38:46] tend to start with logs. They have issues with those logs because they’re basically unstructured and useless and you need to first make them useful to some extent. But then you can leverage on this and instead of having a debug statement, just put a counter. Every single time you think, “Hey, maybe I should put a debug statement,” just put a counter instead. In two months’ time, see if it was worth it or if you delete that line and just remove that counter.

It’s so much cheaper, you can just throw this on and just have it run for a week or a month or whatever timeframe and done. But it goes beyond this because all of a sudden, if I can turn my logs into metrics properly, I can start rewriting my alerts on those metrics. I can actually persist those metrics and can more aggressively throw my logs away. But also, I have this transition made a lot easier where I don’t have this huge lift, where this day in three months is to be cut over and we’re going to release the new version of this and that software and it’s not going to have that, it’s going to have 80% less logs and everything will be great and then you missed the first maintenance window or someone is ill or what have you, and then the next Big Friday is coming so you can’t actually deploy there. I mean Black Friday. But we can also talk about deploying on Fridays.

But the thing is, you have this huge thing, whereas if you have this as a continuous improvement process, I can just look at, this is the log which is coming out. I turn this into a number, I start emitting metrics directly, and I see that those numbers match. And so, I can just start—I build new stuff, I put it into a new data format, I actually emit the new data format directly from my code instrumentation, and only then do I start removing the instrumentation for the logs. And that allows me to, with full confidence, with psychological safety, just move a lot more quickly, deliver much more quickly, and also cut down on my costs more quickly because I’m just using more efficient data types.
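The continuous-improvement process described above, emitting the old and new formats side by side and checking that the numbers match before removing the logs, can be sketched roughly as follows. The `handle` function, the `http_5xx_total` name, and the sample status codes are hypothetical, and a plain `Counter` stands in for a real metrics client.

```python
from collections import Counter

log_lines = []        # existing instrumentation: one line per event
metrics = Counter()   # new instrumentation: one counter per event type

def handle(status):
    """Hypothetical request handler that emits both formats during migration."""
    if status >= 500:
        log_lines.append(f"level=error status={status}")
        metrics["http_5xx_total"] += 1

for status in [200, 500, 503, 200, 500]:
    handle(status)

# Derive the same number from the logs and compare before removing them.
from_logs = sum(1 for line in log_lines if "level=error" in line)
numbers_match = from_logs == metrics["http_5xx_total"]
print(from_logs, metrics["http_5xx_total"], numbers_match)  # 3 3 True
```

Only once the two numbers agree over a long enough period would the `log_lines.append` call be deleted, which is what keeps the change small and reversible at every step.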

Corey: I really want to thank you for spending as much time as you have. If people want to learn more about how you view the world and figure out what other personal attacks they can throw your way, where’s the best place for them to find you?

Richard: Personal attacks, probably Twitter. It’s, like, the go-to place for this kind of thing. For actually tracking, I stopped maintaining my own website. Maybe I’ll do again, but if you go on github.com/richih/talks, you’ll find a reasonably up-to-date list of all the talks, interviews, presentations, panels, what have you, which I did over the last whatever amount of time. [laugh].

Corey: And we will, of course, put links to that in the show notes. Thanks again for your time. It’s always appreciated.

Richard: And thank you.

Corey: Richard Hartmann, Director of Community at Grafana Labs. I’m Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice, along with an insulting comment. And then when someone else comes along with an insulting comment they want to add, we’ll just increment the counter by one.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.
