Episode Show Notes & Transcript
Previously, Harry ran, and later sold, a cloud hosting provider where he was working hands-on with systems administration. He studied information security and lives in the UK.
- Sysdig: https://sysdig.com/
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: This episode is brought to us by our friends at Pinecone. They believe that all anyone really wants is to be understood, and that includes your users. AI models combined with the Pinecone vector database let your applications understand and act on what your users want… without making them spell it out. Make your search application find results by meaning instead of just keywords, your personalization system make picks based on relevance instead of just tags, and your security applications match threats by resemblance instead of just regular expressions. Pinecone provides the cloud infrastructure that makes this easy, fast, and scalable. Thanks to my friends at Pinecone for sponsoring this episode. Visit Pinecone.io to understand more.
Corey: This episode is brought to you in part by our friends at Veeam. Do you care about backups? Of course you don’t. Nobody cares about backups. Stop lying to yourselves! You care about restores, usually right after you didn’t care enough about backups. If you’re tired of the vulnerabilities, costs, and slow recoveries when using snapshots to restore your data, assuming you even have them at all living in AWS-land, there is an alternative for you. Check out Veeam, that's V-E-E-A-M for secure, zero-fuss AWS backup that won’t leave you high and dry when it’s time to restore. Stop taking chances with your data. Talk to Veeam. My thanks to them for sponsoring this ridiculous podcast.
Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. This promoted episode has been brought to us by our friends at Sysdig, and they have sent one of their principal product managers to suffer my slings and arrows. Please welcome Harry Perks.
Harry: Hey, Corey, thanks for hosting me. Good to meet you.
Corey: An absolute pleasure and thanks for basically being willing to suffer all of the various nonsense I’m about to throw in your direction. Let’s start with origin stories; I find that those tend to wind up resonating the most. Back when I first noticed Sysdig coming into the market, because it was just launching at that point, it seemed like it was a… we’ll call it an innovative approach to observability, though I don’t recall that we used the term observability back then. It more or less took a look at whatever an application was doing almost at a system call level, traced what was going on as those requests worked on an individual system, and then provided those in a variety of different forms to reason about. Is that directionally correct as far as the origin story goes, or am I misremembering an evening event I went to what feels like half a lifetime ago?
Harry: I’d say the latter, but just because it’s a funnier answer. But that’s correct. So, Sysdig was created by Loris Degioanni, one of the founders of Wireshark. And when containers and Kubernetes were being incepted, you know, it kind of created this problem where you kind of lacked visibility into what’s going on inside these opaque boxes, right? These black boxes which are containers.
So, we started using system calls as a source of truth for… I don’t want to say observability, but observability, and using those system calls to essentially see what’s going on inside containers from the outside. And leveraging system calls, we were able to pull up metrics, such as what are the golden signals of applications running in containers, network traffic. So, it’s a very simple way to instrument applications. And that was really how monitoring started. And then Sysdig kind of morphed into a security product.
Corey: What was it that drove that transformation? Because generally speaking, a product that’s in a particular space and aimed at a particular niche pivoting into something that feels as orthogonal as security doesn’t tend to be something that you see all that often. What did you folks see that wound up driving that change?
Harry: The challenges that containers and microservices presented for monitoring were the same challenges for security. So, for runtime security, it was very difficult for our customers to be able to understand what the heck is going on inside the container. Is a crypto miner being spun up? Is there malicious activity going on? So, it made logical sense to use that same data source, system calls, to understand both the monitoring and the security posture of applications.
Corey: One of the big challenges out there is that security tends to be one of those pervasive things (I would argue that observability does too) once you have a position of being able to see what is going on inside of an environment and be able to reason about it. And this goes double for inside of containers, which from a cloud provider perspective, at least seems to be, “Oh, yeah, just give us the containers, we don’t care what’s going on inside, so we’re never going to ask, notice, or care.” And being able to bridge that lack of visibility between the outside of container land and the inside of container land has been a perennial problem. There are security implications, there are cost implications, there are observability challenges to be sure, and of course, reliability concerns that flow directly from that, which is, I think, how most people, at least historically, contextualize observability. It’s a fancy word to describe “is the site about to fall over and crash into the sea?” At least in my experience. Is that your definition of observability, or have I basically been hijacked by a number of vendors who have decided to relabel what they’d been doing for 15 years as observability?
Harry: [laugh]. I think observability is one of those things that is down to interpretation depending on what is the most recent vendor you’ve been speaking with. But to me, observability is: am I happy? Am I sad? Are my applications happy? Are they sad?
Am I able to complete business-critical transactions that keep me online, and keep me afloat? So, it’s really as simple as that. There are different ways to implement observability, but it’s really, you know, you can’t improve the performance, and you can’t improve the security posture, of things you can’t see, right? So, how do I make sure I can see everything? And what do I do with that data is really what observability means to me.
Corey: The entire observability space across the board is really one of those areas that is defined, on some level, by outliers within it. It’s easy to wind up saying that any given observability tool will—oh, it alerts you when your application breaks. The problem is that the interesting stuff is often found in the margins, in the outlier products that wind up emerging from it. What is the specific area of that space where Sysdig tends to shine the most?
Harry: Yeah, so you’re right. The outliers typically cause problems and often you don’t know what you don’t know. And I think if you look at Kubernetes specifically, there is a whole bunch of new problems and challenges and things that you need to be looking at that didn’t exist five to ten years ago, right? There are new things that can break. You know, you’ve got a pod that’s stuck in a CrashLoopBackOff.
And hey, I’m a developer who’s running my application on Kubernetes. I’ve got this pod in a CrashLoopBackOff. I don’t know what that means. And then suddenly I’m being expected to alert on these problems. Well, how can I alert on things that I didn’t even know were a problem?
So, one of the things that Sysdig is doing on the observability side is we’re looking at all of this data and we’re actually presenting opinionated views that help customers make sense of that data. Almost like, you know, I could present this data and give it to my grandma, and she would say, “Oh, yeah, okay. You’ve got these pods in CrashLoopBackOff, you’ve got these pods that are being CPU throttled. Hey, you know, I didn’t know I had to worry about CPU limits, or, you know, memory limits and now I’m suffering, kind of, OOM kills.” So, I think one of the things that’s quite unique about Sysdig on the monitoring side that a lot of customers are getting value from is kind of demystifying some of those challenges and making a lot of that data actionable.
Corey: At the time of this recording, I’ve not yet bothered to run Kubernetes in anger by which I, of course, mean production. My production environment is of course called ‘Anger’ similarly to the way that my staging environment is called ‘Theory’ because things work in theory, but not in production. That is going to be changing in the first quarter of next year, give or take. The challenge with that, though, is that so much has changed—we’ll say—since the evolution of Kubernetes into something that is mainstream production in most shops. I stopped working in production environments before that switch really happened, so I’m still at a relatively amateurish level of understanding around a lot of these things.
I’m still thinking about old-school problems, like, “Okay, how big do I make each one of the nodes in my Kubernetes cluster?” Yeah, if I get big systems, it’s likelier that there will be economies of scale that start factoring in fewer nodes to manage, but it does increase the blast radius if one of those nodes gets affected by something that takes it offline for a while. I’m still at the very early stages of trying to wrap my head around the nuances of running these things in a production environment. Cost is, of course, a separate argument. My clients run it everywhere and I can reason about it surprisingly well for something that is not lending itself to easy understanding it by any sense of the word and you almost have to intuit its existence just by looking at the AWS bill.
Harry: No, I like your observations. And I think the last part there around costs is something that I’m seeing a lot in the industry and in our customers, which is, okay, suddenly, you know, I’ve got a great monitoring posture, or observability posture, whatever that really means. I’ve got a great security posture. As customers are maturing in their journey to Kubernetes, suddenly there are a bunch of questions that are being asked from the top (and we’ve kind of seen this internally), such as, “Hey, what is the ROI of each customer?” Or, “What is the ROI of a specific product line or feature that we deliver to our customers?”
And we couldn’t answer those questions. And we couldn’t answer them because we’re running a bunch of applications and software on Kubernetes, and when we received our billing reports from the multiple different cloud providers we use (Azure, AWS, and GCP), we just received a big fat bill that was compute, and we were unable to kind of break that down by the different teams and business units, which is a real problem. And it’s one of the problems that we really wanted to start solving, both for internal uses and for our customers as well.
Corey: Yeah, when you have a customer coming in, the easy part of the equation is well how much revenue are we getting from a customer? Well, that’s easy enough to just wind up polling your finance group and, “Yeah, how much have they paid us this year?” “Great. Good to know.” Then it gets really confusing over on the cost side because it gets into a unit economic model that I think most shops don’t have a particularly advanced understanding of.
If we have another hundred customers sign up this month, what will it cost us to service them? And what are the variables that change those numbers? It really gets into a fascinating model where people more or less, do some gut checks and some rounding, but there are a bunch of areas where people get extraordinarily confused, start to finish. Kubernetes is very much one of them because from a cloud provider’s perspective, it’s just a single-tenant app that is really gnarly in terms of its behavior, it does a bunch of different things, and from the bill alone, it’s hard to tell that you’re even running Kubernetes unless you ask.
Harry: Yeah, absolutely. And there was a survey from the CNCF recently that said 68% of folks are seeing increased Kubernetes costs, of course, and 69% of respondents said that they have no cost monitoring in place or just cost estimates, which is simply not good enough, right? People want to break down that line item to those individual business units and teams. Which is a huge challenge that cloud providers aren’t fulfilling today.
Corey: Where do you see most of the cost issue breaking down? I mean, there’s some of the stuff that we are never allowed to talk about when it comes to cost, which is the realistic assessment that people who work on technology cost more than the technology itself. There’s a certain, how do we put this, unflattering perspective that a lot of people are deploying Kubernetes into environments because they want to bolster their own resume, not because it’s the actual right answer to anything that they have going on. So, that’s a little hit or miss, on some level. I don’t know that I necessarily buy into that, but you take a look at the compute, the storage, you look at the data transfer side, which it seems that almost everyone mostly tends to ignore, despite the fact that Kubernetes itself has no zone affinity, so it has no idea whether its internal communication is free or expensive, and it just adds up to a giant question mark.
Then you look at Kubernetes architecture diagrams, or God forbid the CNCF landscape diagram, and realize, oh, my God, they have more of these things than they do Pokémon, and people give up any hope of understanding it other than just saying, “It’s complicated,” and accepting that that’s just the way that it is. I’m a little less fatalistic, but I also think it’s a heck of a challenge.
Harry: Absolutely. I mean, the economics of cloud, right? Why is ingress free, but egress is not free? Why is it so difficult to [laugh] understand that intra AZ traffic is completely billed separately to public traffic, for example? And I think network costs is one thing that is extremely challenging for customers.
One, they don’t even have that visibility into what is the network traffic: what is internal traffic, what is public traffic. But then there’s also a whole bunch of other challenges that are causing Kubernetes costs to rise, right? You’ve got folks that struggle with setting the right requests for Kubernetes, which ultimately blows up the scale of a Kubernetes cluster. You’ve got the complexity of AWS, for example, economics of instance types, you know? I don’t know whether I need to be running ten m5.xlarge versus four Graviton instances.
And this ability to, kind of, size a cluster correctly as well as size a workload correctly is very, very difficult and customers are not able to establish that baseline today. And obviously, you can’t optimize what you can’t see, right, so I think a lot of customers struggle with both that visibility. But then the complexity means that it’s incredibly difficult to optimize those costs.
Corey: You folks are starting to dip your toes in the Kubernetes costing space. What approach are you taking?
Harry: Sysdig builds products Kubernetes-first. So, if you look at what we’re doing in the monitoring space, we really kind of pioneered what customers want to get out of Kubernetes observability, and then we were doing similar things for security. So, making sure our security product is, I want to say, Kubernetes-native. And what we’re doing on the cost side of things is, of course, there are a lot of cost products out there that will give you the ability to slice and dice by AWS service, for example, but they don’t give you that Kubernetes context to then break those costs down by teams and business units. So at Sysdig, we’ve already been collecting usage information, resource usage information (requests, the container CPU, the memory usage), and a lot of customers have been using that data today for right-sizing, but one of the things they said was, “Hey, I need to quantify this. I need to put a big fat dollar sign in front of some of these numbers we’re seeing so I can go to these teams and management and actually prompt them to right-size.”
So, it’s quite simple. We’re essentially augmenting that resource usage information with cost data from cloud providers. So, instead of customers saying, “Hey, I’m wasting one terabyte of memory,” they can say, “Hey, I’m wasting 500 bucks on memory each month.” So, it’s very much Kubernetes specific, using a lot of Kubernetes context and metadata.
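The idea Harry describes here, putting a dollar figure on requested-but-unused resources, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Sysdig's implementation: the `Workload` record, the `monthly_waste_usd` and `waste_by_team` functions, and the per-GiB-hour price passed in are all hypothetical, and a real price would have to come from your cloud provider's billing data.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


@dataclass
class Workload:
    team: str                    # Kubernetes context, e.g. from a label (hypothetical)
    memory_requested_gib: float  # what the pod spec asks for
    memory_used_gib: float       # what the workload actually used


def monthly_waste_usd(w: Workload, usd_per_gib_hour: float) -> float:
    """Monthly cost of memory that is requested but not used."""
    unused_gib = max(w.memory_requested_gib - w.memory_used_gib, 0.0)
    return unused_gib * usd_per_gib_hour * HOURS_PER_MONTH


def waste_by_team(workloads, usd_per_gib_hour):
    """Sum the wasted spend per team so it can be shown to that team."""
    totals = {}
    for w in workloads:
        totals[w.team] = totals.get(w.team, 0.0) + monthly_waste_usd(w, usd_per_gib_hour)
    return totals
```

Grouping by a team label is what turns "one terabyte of wasted memory" into a number each business unit can act on; the same arithmetic applies to CPU with a per-vCPU-hour price.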
Corey: This episode is sponsored in part by our friends at Uptycs, because they believe that many of you are looking to bolster your security posture with CNAPP and XDR solutions. They offer both cloud and endpoint security in a single UI and data model. Listeners can get Uptycs for up to 1,000 assets through the end of 2023 (that is next year) for $1. But this offer is only available for a limited time on UptycsSecretMenu.com. That’s U-P-T-Y-C-S Secret Menu dot com.
Corey: Part of the whole problem that I see across the space is that the way to solve some of these problems internally, when you start trying to divide costs between different teams, has been, “Well, we’re just going to give each one their own cluster, or their own environment.” That does definitely solve the problem of shared services. The counterpoint is it solves them by making every team individually incur them. That doesn’t necessarily seem like the best approach in every scenario. One thing I have learned, though, is that, for some customers, that is the right approach. Sounds odd, but that’s the world we live in where context absolutely matters a lot. I’m very reluctant these days to say at a glance, “Oh, you’re doing it wrong.” You eat a whole lot of crow when you do that, it turns out.
Harry: I see this a lot. And I see customers giving their own business units their own AWS account, which I kind of feel like is a step backwards, right? I don’t think you’re properly harnessing the power of Kubernetes and creating this, kind of, shared tenancy model, when you’re giving a team their own AWS account. I think it’s important we break down those silos. You know, there’s so much operational overhead with maintaining these different accounts, but there must be a better way to address some of these challenges.
Corey: It’s one of those areas where “it depends” becomes the appropriate answer to almost anything. I’m a fan of having almost every workload have its own AWS account within the same shared AWS organization, then with shared VPCs, which tend to work out. But that does add some complexity to observing how things interact there. One of the guidances that I’ve given people is assume in the future that in any architecture diagram you ever put up there, that there will be an AWS account boundary between any two resources because someone’s going to be doing it somewhere. And that seems to be something that AWS themselves are just slowly starting to awaken to as well. It’s getting easier and easier every week to wind up working with multiple accounts in a more complicated structure.
Harry: Absolutely. And I think when you start to adopt a multi-cloud strategy, suddenly, you’ve got so many more dimensions. I’m running an application in AWS, Azure, and GCP, and now suddenly, I’ve got all of these subaccounts. That is an operational overhead that I don’t think jibes very well, considering there is such a shortage of folks that are real experts, I want to say experts, in operating these environments. And that’s really, you know, I think one of the challenges that isn’t being spoken about enough today.
Corey: It feels like so much of the time that Kubernetes is winding up being an expression of the same way that getting into microservices was, which is, “Well, we have a people problem, we’re going to solve it with this approach.” Great, but then you wind up with people adopting it where they don’t have the context that applied when the stuff was originally built and designed. Like with monorepos. Yeah, it was a problem when you had 5000 developers all trying to work on the same thing and stomping on each other, so breaking that apart made sense. But the counterpoint, where you wind up with companies with 20 developers and 200 microservices, starts to be a little… okay, has this pendulum swung too far?
Harry: Yeah, absolutely. And I think that when you’ve got so many people being thrown at a problem, there’s lots of kinds of changes being made, there’s new deployments, and I think things can spiral out of control pretty quickly, especially when it comes to costs. “Hey, I’m a developer and I’ve just made this change. And how do I understand, you know, what is the financial impact of this change?”
“Has this blown up my network costs because suddenly, I’m not traversing the right network path?” Or, suddenly, I’m consuming so much more CPU, and actually, there is a physical compute cost of this. There’s a lot of cooks in the kitchen and I think that is causing a lot of challenges for organizations.
Corey: You’ve been working in product for a while and one of my favorite parts of being in a position where you are so close to the core of what it is your company does, is that you find it’s almost impossible to not continue learning things just based upon how customers take what you built and the problems that they experienced, both that they bring you in to solve, and of course, the new and exciting problems that you wind up causing for them—or to be more charitable surfacing that they didn’t realize already existed. What have you learned lately from your customers that you didn’t see coming?
Harry: One of the biggest problems that I’ve been seeing is this: I speak to a lot of customers, and I’ve maybe spoken to 40 or 50 customers over the last, you know, few months, about a variety of topics, whether it’s observability in general or, you know, on the financial side, Kubernetes costs, and what I hear about time and time again, regardless of the vertical or the size of the organization, is that the platform teams, the people closest to Kubernetes, know their stuff. They get it. But a lot of their internal customers, so the internal business units and teams, they, of course, don’t have the same kind of clarity and understanding, and these are the people that are getting the most frustrated. I’ve been shipping software for 20 years and now I’m modernizing applications, I’m starting to use Kubernetes, I’ve got so many new different things to learn about that I’m simply drowning in problems, in cloud-native problems.
And I think we forget about that, right? Too often, we kind of spend time throwing fancy technology at the people, such as the, you know, the DevOps engineers, the platform teams, but a lot of internal customers are struggling to leverage that technology to actually solve their own problems. They can’t make sense of this data and they can’t make the right changes based off of that data.
Corey: I would say that is a very common affliction of Kubernetes where so often it winds up handling things that are now abstracted away to the point where we don’t need to worry about that. That’s true right up until the point where they break and now you have to go diving into the magic. That’s one of the reasons that I was such a fan of Sysdig when it first came out was the idea that it was getting into what I viewed at the time as operating system fundamentals and actually seeing what was going on, abstracted away from the vagaries of the code and a lot more into what system calls is it making. Great, okay, now I’m starting to see a lot of calls that it shouldn’t necessarily be making, or it’s thrashing in a particular way. And it’s almost impossible to get to that level of insight—historically—through traditional observability tools, but being able to take a look at what’s going on from a more fundamentals point of view was extraordinarily helpful.
I’m optimistic if you can get to a point where you’re able to do that with Kubernetes, given its enraging ecosystem, for lack of a better term. Whenever you wind up rolling out Kubernetes, you’ve also got to pick some service delivery stuff, some observability tooling, some log routers, and so on and so forth. It feels like by the time you’re running anything in production, you’ve made so many choices along the way that the odds that anyone else has made the same choices you have are vanishingly small, so you’re running your own bespoke unicorn somewhere.
Harry: Absolutely. Flip a coin. And that’s probably one [laugh] of the solutions that you’re going to throw at a problem, right? And you keep flipping that coin and then suddenly, you’re going to reach a combination that nobody else has done before. And you’re right, the knowledge that you have gained from, I don’t know, Corey Quinn Enterprises is probably not going to ring true at Harry Perks Enterprise Limited, right?
There is a whole different set of problems and technology and people that, you know, of course, you can bring some of that knowledge along (there are some common denominators), but every organization is ultimately using technology in different ways. Which is problematic, right, for the people that are actually pioneering some of these cloud-native applications.
Corey: Given my professional interest, I am curious about what it is you’re doing as you start moving a little bit away from the security and observability sides and into cost observability. How are you approaching that? What are the mistakes that you see people making and how are you meeting them where they are?
Harry: The biggest challenge that I am seeing is with sizing workloads and sizing clusters. And I see this time and time again. Our product shines the light on the capacity utilization of compute. And what it really boils down to is two things. Platform teams are not using the correct instance types or the combination of instance types to run the workloads for their teams, their application teams, but also application developers are not setting things like requests correctly.
Which makes sense. Again, I flip a coin and maybe that’s the request I’m going to set. I used to size a VM with one gig of memory, so now I’m going to size my pod with one gig of memory. But it doesn’t really work like that. And of course, what you request is essentially my slice of the pizza that’s been carved out.
And even if I don’t see that entire slice of pizza, it’s for me, nobody else can use it. So, what we’re trying to do is really help customers with that challenge. So, if I’m a developer, I would be looking at the historical usage of our workloads. Maybe it’s the maximum usage or, you know, the p99 or the p95, and then setting my workload request to that. You keep doing that over the course of the different teams’ applications you have and suddenly, you start to establish this baseline of what is the compute actually needed to run all of these applications.
And that helps me answer the question, what should I size my cluster to? And that’s really important because until you’ve established that baseline, you can’t start to do things like cluster reshaping, to pick a different combination of instance types to power your cluster.
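The approach Harry outlines, setting each workload's request to a high percentile of its historical usage and then summing those requests into a cluster baseline, can be sketched as follows. This is a minimal Python illustration, not Sysdig's method: the function names, the nearest-rank percentile definition, and the choice of millicores as the unit are all assumptions made for the example.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that at least
    pct percent of all samples are less than or equal to."""
    if not samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


def recommended_requests(usage_by_workload, pct=95):
    """Map each workload name to a request based on its historical usage.

    usage_by_workload: dict of workload name -> list of usage samples
    (for example CPU in millicores); the shape of this data is hypothetical.
    """
    return {name: percentile(samples, pct)
            for name, samples in usage_by_workload.items()}


def cluster_baseline(requests):
    """Total compute the cluster needs to satisfy every request."""
    return sum(requests.values())
```

Choosing p95 over the maximum trades a little headroom for a tighter baseline; whichever percentile is used, the summed total is the figure that cluster reshaping would be measured against.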
Corey: On some level, a lack of diversity in instance types is a bit of a red flag, just because it generally means that someone said, “Oh, yeah, we’re going to start with this default instance size and then we’ll adjust as time goes on,” and spoilers: just like anything else labeled ‘TODO’ in your codebase, it never gets done. So, you find yourself pretty quickly in a scenario where some workloads are struggling to get the resources they need inside of whatever that default instance size is, and on the other, you wind up with some things that are more or less running a cron job once a day and sitting there completely idle but running the whole time, regardless. And optimization and right-sizing on a lot of these scenarios is a little bit tricky. I’ve been something of a, I’ll say, a pessimist, when it comes to the idea of right-sizing EC2 instances, just because so many historical workloads are challenging to get recertified on newer instance families and the rest, whereas when we’re running on Kubernetes already, presumably everything’s built in such a way that it can stop existing in a stateless way and the service still continues to work. If not, it feels like there are some necessary Kubernetes prerequisites that may not have circulated fully internally yet.
Harry: Right. And to make this even more complicated, you’ve got applications that may be more memory intensive or CPU intensive, so understanding the ratio of CPU to memory requirements for their applications depending on how they’ve been architected makes this more challenging, right? I mean, pods are jumping around and that makes it incredibly difficult to track these movements and actually pick the instances that are going to be most appropriate for my workloads and for my clusters.
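One way to make the CPU-to-memory ratio Harry mentions concrete is to compute it from the requests and compare it with the shape of the available node types. The Python sketch below is an assumption-laden illustration: the `NODE_SHAPES` table uses made-up generic labels with round GiB-per-vCPU figures, and real instance families and their specifications should be taken from your provider's documentation.

```python
# Hypothetical node shapes, expressed as GiB of memory per vCPU.
NODE_SHAPES = {
    "compute-optimized": 2.0,
    "general-purpose": 4.0,
    "memory-optimized": 8.0,
}


def gib_per_vcpu(cpu_requests_vcpu, memory_requests_gib):
    """Aggregate memory-to-CPU ratio across a set of workload requests."""
    total_cpu = sum(cpu_requests_vcpu)
    if total_cpu <= 0:
        raise ValueError("total CPU requests must be positive")
    return sum(memory_requests_gib) / total_cpu


def closest_shape(ratio, shapes=NODE_SHAPES):
    """Pick the node shape whose memory-to-CPU ratio is nearest the workloads'."""
    return min(shapes, key=lambda name: abs(shapes[name] - ratio))
```

This only looks at the aggregate ratio; it ignores pod scheduling, per-node overhead, and price, so it is a starting point for choosing instance types rather than an answer.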
Corey: I really want to thank you for being so generous with your time. If people want to learn more, where’s the best place for them to find you?
Harry: sysdig.com is where you can learn more about what Sysdig is doing as a company and our platform in general.
Corey: And we will, of course, put a link to that in the show notes. Thank you so much for your time. I appreciate it.
Harry: Thank you, Corey. Hope to speak to you again soon.
Corey: Harry Perks, principal product manager at Sysdig. I’m Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry, insulting comment that we will lose track of because we don’t know where it was automatically provisioned.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
Announcer: This has been a HumblePod production. Stay humble.