Benchmarking Security Attack Response Times in the Age of Automation with Anna Belak

Episode Summary

Anna Belak, Director of the Office of Cybersecurity Strategy at Sysdig, joins Corey on Screaming in the Cloud to discuss the newest benchmark for responding to security threats, 5/5/5. Anna describes why it was necessary to set a new benchmark for responding to security threats in a timely manner, and how the Sysdig team did research to determine the best practices for detecting, correlating, and responding to potential attacks. Corey and Anna discuss the importance of focusing on improving your own benchmarks towards a goal, as well as how prevention and threat detection are both essential parts of a solid security program. 

Episode Show Notes & Transcript

About Anna

Anna has nearly ten years of experience researching and advising organizations on cloud adoption with a focus on security best practices. As a Gartner Analyst, Anna spent six years helping more than 500 enterprises with vulnerability management, security monitoring, and DevSecOps initiatives. Anna's research and talks have been used to transform organizations' IT strategies and her research agenda helped to shape markets. Anna is the Director of Thought Leadership at Sysdig, using her deep understanding of the security industry to help IT professionals succeed in their cloud-native journey. 
Anna holds a PhD in Materials Engineering from the University of Michigan, where she developed computational methods to study solar cells and rechargeable batteries.

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. I am joined again—for another time this year—on this promoted guest episode brought to us by our friends at Sysdig, returning is Anna Belak, who is their director of the Office of Cybersecurity Strategy at Sysdig. Anna, welcome back. It’s been a hot second.

Anna: Thank you, Corey. It’s always fun to join you here.

Corey: Last time we were here, we were talking about your report that you folks had come out with, the, “Cybersecurity Threat Landscape for 2022.” And when I saw you were doing another one of these to talk about something, I was briefly terrified. “Oh, wow, please tell me we haven’t gone another year and the cybersecurity threat landscape is moving that quickly.” And it sort of is, sort of isn’t. You’re here today to talk about something different, but it also—to my understanding—distills down to just how quickly that landscape is moving. What have you got for us today?

Anna: Exactly. For those of you who remember that episode, one of the key findings in the Threat Report for 2023 was that the average length of an attack in the cloud is ten minutes. To be clear, that is from when you are found by an adversary to when they have caused damage to your system. And that is really fast. Like, we talked about how that relates to on-prem attacks or other sort of averages from other organizations reporting how long it takes to attack people.

And so, we went from weeks or days to minutes, potentially seconds. And so, what we’ve done is we looked at all that data, and then we went and talked to our amazing customers and our many friends at analyst firms and so on, to kind of get a sense for if this is real, like, if everyone is seeing this or if we’re just seeing this. Because I’m always like, “Oh, God. Like, is this real? Is it just me?”

And as it turns out, everyone’s not only—I mean, not necessarily everyone’s seeing it, right? Like, there’s not really been proof until this year, I would say because there’s a few reports that came out this year, but lots of people sort of anticipated this. And so, when we went to our customers, and we asked for their SLAs, for example, they were like, “Oh, yeah, my SLA for a [PCRE 00:02:27] cloud is like 10, 15 minutes.” And I was like, “Oh, okay.” So, what we set out to do is actually set a benchmark, essentially, to see how well are you doing. Like, are you equipped with your cloud security program to respond to the kind of attack that a cloud security attacker is going to—sorry, an anti-cloud security—I guess—attacker is going to perpetrate against you.

And so, the benchmark is—drumroll—5/5/5. You have five seconds to detect a signal that is relevant to potentially some attack in the cloud—hopefully, more than one such signal—you have five minutes to correlate all such relevant signals to each other so that you have a high fidelity detection of this activity, and then you have five more minutes to initiate an incident response process to hopefully shut this down, or at least interrupt the kill chain before your environments experience any substantial damage.
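The three numbers translate directly into a time budget per phase. A minimal sketch, in Python, of checking an incident timeline against that budget; the phase names and thresholds come straight from the 5/5/5 definition above, and everything else is illustrative:

```python
from datetime import timedelta

# 5/5/5 budgets: five seconds to detect, five minutes to correlate,
# five more minutes to initiate a response.
BUDGETS = {
    "detect": timedelta(seconds=5),
    "correlate": timedelta(minutes=5),
    "respond": timedelta(minutes=5),
}

def missed_budgets(detect, correlate, respond):
    """Given the elapsed time of each phase as a timedelta,
    return the list of phases that blew their budget."""
    elapsed = {"detect": detect, "correlate": correlate, "respond": respond}
    return [phase for phase, t in elapsed.items() if t > BUDGETS[phase]]

# Detection took 3 seconds, correlation 4 minutes, but response took 12.
print(missed_budgets(timedelta(seconds=3), timedelta(minutes=4),
                     timedelta(minutes=12)))  # -> ['respond']
```

The hard part in practice is instrumenting the phase boundaries, not the arithmetic; the budget itself really is this simple.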

Corey: To be clear, that is from a T0, a starting point, the stopwatch begins, the clock starts when the event happens, not when an event shows up in your logs, not once someone declares an incident. From J. Random Hackerman, effectively, we’re pressing the button and getting the response from your API.

Anna: That’s right because the attackers don’t really care how long it takes you to ship logs to wherever you’re mailing them to. And that’s why it is such a short timeframe because we’re talking about, they got in, you saw something hopefully—and it may take time, right? Like, some of the—which we’ll describe a little later, some of the activities that they perform in the early stages of the attack are not necessarily detectable as malicious right away, which is why your correlation has to occur, kind of, in real time. Like, things happen, and you’re immediately adding them, sort of like, to increase the risk of this detection, right, to say, “Hey, this is actually something,” as opposed to, you know, three weeks later, I’m parsing some logs and being like, “Oh, wow. Well, that’s not good.” [laugh].

Corey: The number five seemed familiar to me in this context, so I did a quick check, and sure enough, allow me to quote chapter and verse from the CloudTrail documentation over in AWS-land: “CloudTrail typically delivers logs within an average of about five minutes of an API call. This time is not guaranteed.” So effectively, if you’re waiting for anything that’s CloudTrail-driven to tell you that you have a problem, it is almost certainly too late by the time that pops up, no matter what that notification vector is.
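The arithmetic here is unforgiving: if the log source averages around five minutes of delivery delay, the five-second detection budget is spent before the record even lands. A hedged sketch of that comparison; the function and field names are illustrative stand-ins, not CloudTrail's actual schema:

```python
from datetime import datetime, timedelta

def delivery_lag(event_time: datetime, received_time: datetime) -> timedelta:
    """Lag between when the API call actually happened and when
    the log record reached us."""
    return received_time - event_time

def too_late_for_555(event_time: datetime, received_time: datetime,
                     budget: timedelta = timedelta(seconds=5)) -> bool:
    """True if the record arrived after the 5-second detection
    budget was already spent."""
    return delivery_lag(event_time, received_time) > budget

t0 = datetime(2023, 11, 1, 12, 0, 0)
# CloudTrail's typical ~5-minute delivery blows the detection
# budget before the record is even available to look at.
print(too_late_for_555(t0, t0 + timedelta(minutes=5)))  # -> True
```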

Anna: That is, unfortunately or fortunately, true. I mean, it’s kind of a fact of life. I guess there is a little bit of a veiled [unintelligible 00:04:43] at our cloud provider friends because, really, they have to do better ultimately. But the flip side to that argument is CloudTrail—or your cloud log source of choice—cannot be your only source of data for detecting security events, right? So, if you are operating purely on the basis of, “Hey, I have information in CloudTrail; that is my security information,” you are going to have a bad time, not just because it’s not fast enough, but also because there’s not enough data in there, right? Which is why part of the first, kind of, benchmark component is that you must have multiple data sources for the signals, and they—ideally—all will be delivered to you within five seconds of an event occurring or a signal being generated.

Corey: And give me some more information on that because I have my own alerter, specifically, it’s a ClickOps detector. Whenever someone in one of my accounts does something in the console, that has a write aspect to it rather than just a read component—which again, look at what you want in the console, that’s fine—if you’re changing things that is not being managed by code, I want to know that it’s happening. It’s not necessarily bad, but I want to at least have visibility into it. And that spits out the principal, the IP address it emits from, and the rest. I haven’t had a whole lot where I need to correlate those between different areas. Talk to me more about the triage step.
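A detector like the one Corey describes boils down to a filter over CloudTrail-style records: console-originated events with a write component. A rough sketch of that heuristic; the field names mirror CloudTrail records, but the classification logic is a simplification invented for illustration, not Corey's actual implementation:

```python
def is_clickops_write(event: dict) -> bool:
    """Flag console-originated write calls. Console sessions show a
    browser user agent, and write calls are (roughly) anything that
    isn't a read-only List*/Get*/Describe*/Head* action. This is a
    deliberately simplified heuristic."""
    ua = event.get("userAgent", "")
    from_console = "console.amazonaws.com" in ua or "Mozilla" in ua
    action = event.get("eventName", "")
    read_only = action.startswith(("List", "Get", "Describe", "Head"))
    return from_console and not read_only

# A console user changing a bucket policy: principal, source IP, and
# action are all right there in the record to emit in the alert.
evt = {"userAgent": "Mozilla/5.0", "eventName": "PutBucketPolicy",
       "sourceIPAddress": "203.0.113.7",
       "userIdentity": {"arn": "arn:aws:iam::123456789012:user/alice"}}
print(is_clickops_write(evt))  # -> True
```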

Anna: Yeah, so I believe that the correlation step is the hardest, actually.

Corey: Correlation step. My apologies.

Anna: Triage is fine. It’s [crosstalk 00:06:06]—

Corey: Triage, correlations, the words we use matter on these things.

Anna: Dude, we argued about the words on this for so long, you could even imagine. Yeah, triage, correlation, detection, you name it, we are looking at multiple pieces of data, we’re going to connect them to each other meaningfully, and that is going to provide us with some insight about the fact that a bad thing is happening, and we should respond to it. Perhaps automatically respond to it, but we’ll get to that. So, a correlation, okay. The first thing is, like I said, you must have more than one data source because otherwise, I mean, you could correlate information from one data source; you actually should do that, but you are going to get richer information if you can correlate multiple data sources, and if you can access, for example, like through an API, some sort of enrichment for that information.

Like, I’ll give you an example. For SCARLETEEL, which is an attack we describe in the threat report, and we actually described before—we’re, like, on SCARLETEEL, I think, version three now because there’s so much—this particular actor is very active [laugh].

Corey: And they have a better versioning scheme than most companies I’ve spoken to, but that’s neither here nor there.

Anna: [laugh]. Right? So, one of the interesting things about SCARLETEEL is you could eventually detect that it had happened if you only had access to CloudTrail, but you wouldn’t have the full picture ever. In our case, because we are a company that relies heavily on system calls and machine learning detections, we [are able to 00:07:19] connect the system call events to the CloudTrail events, and between those two data sources, we’re able to figure out that there’s something more profound going on than just what you see in the logs. And I’ll actually tell you which things, for example, are being detected.

So, in SCARLETEEL, one thing that happens is there’s a crypto miner. And a crypto miner is one of these events where you’re, like, “Oh, this is obviously malicious,” because as we wrote, I think, two years ago, it costs $53 to mine $1 of Bitcoin in AWS, so it is very stupid for you to be mining Bitcoin in AWS, unless somebody else is—

Corey: In your own accounts.

Anna: —paying the cloud bill. Yeah, yeah [laugh] in someone else’s account, absolutely. Yeah. So, if you are a sysadmin or a security engineer, and you find a crypto miner, you’re like, “Obviously, just shut that down.” Great. What often happens is people see them, and they think, “Oh, this is a commodity attack,” like, people are just throwing crypto miners whatever, I shut it down, and I’m done.

But in the case of this attack, it was actually a red herring. So, they deployed the miner to see if they could. They could, then they determined—presumably; this is me speculating—that, oh, these people don’t have very good security because they let random idiots run crypto miners in their account in AWS, so they probed further. And when they probed further, what they did was some reconnaissance. So, they type in commands, listing, you know, like, list accounts or whatever. They try to list all the things they can list that are available in this account, and then they reach out to an EC2 metadata service to kind of like, see what they can do, right?

And so, each of these events, like, each of the things that they do, like, reaching out to an EC2 metadata service, assuming a role, doing a recon, even lateral movement is, like, by itself, not necessarily a scary, big red flag malicious thing because there are lots of, sort of, legitimate reasons for someone to perform those actions, right? Like, reconnaissance, for one example, is you’re, like, looking around the environment to see what’s up, right? So, you’re doing things, like, listing things, [unintelligible 00:09:03] things, whatever. But a lot of the graphical interfaces of security tools also perform those actions to show you what’s, you know, there, so it looks like reconnaissance when your tool is just, like, listing all the stuff that’s available to you to show it to you in the interface, right? So anyway, the point is, when you see them independently, these events are not scary. They’re like, “Oh, this is useful information.”

When you see them in rapid succession, right, or when you see them alongside a crypto miner, then your tooling and/or your process and/or your human being who’s looking at this should be like, “Oh, wait a minute. Like, just the enumeration of things is not a big deal. The enumeration of things after I saw a miner, and you try and talk to the metadata service, suddenly I’m concerned.” And so, the point is, how can you connect those dots as quickly as possible and as automatically as possible, so a human being doesn’t have to look at, like, every single event because there’s an infinite number of them.
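That escalation logic can be sketched as a toy scoring correlator: no single signal crosses the alert threshold on its own, but a cluster of them inside a short window does. The signal names, weights, and threshold below are all invented for illustration:

```python
from datetime import datetime, timedelta

# Invented risk weights: none crosses the threshold alone,
# but a miner plus recon activity in one window does.
WEIGHTS = {"crypto_miner": 60, "enumeration": 25,
           "metadata_access": 25, "assume_role": 15}
THRESHOLD = 80
WINDOW = timedelta(minutes=5)

def correlate(signals) -> bool:
    """signals: list of (timestamp, signal_name) tuples. Returns
    True if any 5-minute window accumulates enough combined risk
    to fire a high-fidelity alert."""
    signals = sorted(signals)
    for i, (t0, _) in enumerate(signals):
        score = sum(WEIGHTS.get(name, 0)
                    for t, name in signals[i:] if t - t0 <= WINDOW)
        if score >= THRESHOLD:
            return True
    return False

t = datetime(2023, 11, 1, 12, 0)
benign = [(t, "enumeration")]  # a security tool listing resources
attack = [(t, "crypto_miner"),
          (t + timedelta(minutes=2), "enumeration"),
          (t + timedelta(minutes=3), "metadata_access")]
print(correlate(benign), correlate(attack))  # -> False True
```

A real correlation engine keys off far richer context (identities, workloads, enrichment APIs), but the shape of the problem is this: join the signals in near real time so the risk compounds as the events arrive.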

Corey: I guess the challenge I’ve got is that in some cases, you’re never going to be able to catch up with this. Because if it’s an AWS call to one of the APIs that they manage for you, they explicitly state there’s no guarantee of getting information on this until the show’s all over, more or less. So, how is there… like, how is there hope?

Anna: [laugh]. I mean, there’s always a forensic analysis, I guess [laugh] for all the things that you’ve failed to respond to.

Corey: Basically we’re doing an after-action thing because humans aren’t going to react that fast. We’re just assuming it happened; we should know about it as soon as possible. On some level, just because something is too late doesn’t necessarily mean there’s not value added to it. But just trying to turn this into something other than a, “Yeah, they can move faster than you, and you will always lose. The end. Have a nice night.” Like, that tends not to be the best narrative vehicle for these things. You know, if you’re trying to inspire people to change.

Anna: Yeah, yeah, yeah, I mean, I think one clear point of hope here is that sometimes you can be fast enough, right? And a lot of this—I mean, first of all, you’re probably not going to—sorry, cloud providers—you don’t go with just the cloud provider defaults for that level of performance, you are going with some sort of third-party tool. On the, I guess, bright side, that tool can be open-source; like, there’s a lot of open-source tooling available now that is fast and free. One example is our favorite, of course, Falco, which is looking at system calls on endpoints and containers, and can detect things within seconds of them occurring and let you know immediately. There is other eBPF-based instrumentation that you can use out there from various vendors and/or open-source providers, and there’s, of course, network telemetry.

So, if you’re into the world of service mesh, there is data you can get off the network, also very fast. So, the bad news or the flip side to that is you have to be able to manage all that information, right? So, that means—again, like I said, you’re not expecting a SOC analyst to look at thousands of system calls and thousands of, you know, network packets or flow logs or whatever you’re looking at, and just magically know that these things go together. You are expecting to build, or have built for you by a vendor or the open-source community, some sort of detection content that is taking this into account and then is able to deliver that alert at the speed of 5/5/5.

Corey: When you see the larger picture stories playing out, as far as what customers are seeing, what the actual impact is, what gave rise to the five-minute number around this? Just because that tends to feel like it’s a… it is both too long and also too short on some level. I’m just wondering how you wound up at—what is this based on?

Anna: Man, we went through so many numbers. So, we [laugh] started with larger numbers, and then we went to smaller numbers, then we went back to medium numbers. We aligned ourselves with the timeframes we’re seeing for people. Like I said, a lot of folks have an SLA of responding to a P0 within 10 or 15 minutes because their point basically—and there’s a little bit of bias here into our customer base because our customer base is, A, fairly advanced in terms of cloud adoption and in terms of security maturity, and also, they’re heavily in, let’s say, financial industries and other industries that tend to be early adopters of new technology. So, if you are kind of a laggard, like, you probably aren’t as close to meeting this benchmark as you are if you’re, say, in financial, right? So, we asked them how they operate, and they basically pointed out to us that, like, knowing 15 minutes later is too late because I’ve already lost, like, some number of millions of dollars if my environment is compromised for 15 minutes, right? So, that’s kind of where the ten minutes comes from. Like, we took our real threat research data, and then we went around and talked to folks to see kind of what they’re experiencing and what their own expectations are for their incident response in SOC teams, and ten minutes is sort of where we landed.

Corey: Got it. When you see this happening, I guess, in various customer environments, assuming someone has missed that five-minute window, is a game over effectively? How should people be thinking about this?

Anna: No. So, I mean, it’s never really game over, right? Like until your company is ransomed to bits, and you have to close your business, you still have many things that you can do, hopefully, to save yourself. And also, I want to be very clear that 5/5/5 as a benchmark is meant to be something aspirational, right? So, you should be able to meet this benchmark for, let’s say, your top use cases if you are a fairly high maturity organization, in threat detection specifically, right?

So, if you’re just beginning your threat detection journey, like, tomorrow, you’re not going to be close. Like, you’re going to be not at all close. The point here, though, is that you should aspire to this level of greatness, and you’re going to have to create new processes and adopt new tools to get there. Now, before you get there, I would argue that if you can do, like, 10-10-10 or, like, whatever number you start with, you’re on a mission to make that number smaller, right? So, if today, you can detect a crypto miner in 30 minutes, that’s not great because crypto miners are pretty detectable these days, but give yourself a goal of, like, getting that 30 minutes down to 20, or getting that 30 minutes down to 10, right?

Because we are so obsessed with, like, measuring ourselves against our peers and all this other stuff that we sometimes lose track of what actually is improving our security program. So yes, compare it to yourself first. But ultimately, if you can meet the 5/5/5 benchmark, then you are doing great. Like, you are faster than the attackers in theory, so that’s the dream.

Corey: So, I have to ask, and I suspect I might know the answer to this, but given that it seems very hard to move this quickly, especially at scale, is there an argument to be made that effectively prevention obviates the need for any of this, where if you don’t misconfigure things in ways that should be obvious, if you practice defense-in-depth to a point where you can effectively catch things that the first layer meets with successive layers, as opposed to, “Well, we have a firewall. Once we’re inside of there, well [laugh], it’s game over for us.” Is prevention sufficient in some ways to obviate this?

Anna: I think there are a lot of people that would love to believe that that’s true.

Corey: Oh, I sure would. It’s such a comforting story.

Anna: And we’ve done, like, I think one of my opening sentences in the benchmark, kind of, description, actually, is that we’ve done a pretty good job of advertising prevention in Cloud as an important thing and getting people to actually, like, start configuring things more carefully, or like, checking how those things have been configured, and then changing that configuration should they discover that it is not compliant with some mundane standard that everyone should know, right? So, we’ve made great progress, I think, in cloud prevention, but as usual, like, prevention fails, right? Like I still have smoke detectors in my house, even though I have done everything possible to prevent it from catching fire and I don’t plan to set it on fire, right? But like, threat detection is one of these things that you’re always going to need because no matter what you do, A, you will make a mistake because you’re a human being, and there are too many things, and you’ll make a mistake, and B, the bad guys are literally in the business of figuring ways around your prevention and your protective systems.

So, I am full on on defense-in-depth. I think it’s a beautiful thing. We should only obviously do that. And I do think that prevention is your first step to a holistic security program—otherwise, what even is the point—but threat detection is always going to be necessary. And like I said, even if you can’t go 5/5/5, you don’t have threat detection at that speed, you need to at least be able to know what happened later so you can update your prevention system.

Corey: This might be a dangerous question to get into, but why not, that’s what I do here. This [could 00:17:27] potentially be an argument against Cloud, by which I mean that if I compromise someone’s Cloud account on any of the major cloud providers, once I have access of some level, I know where everything else in the environment is as a general rule. I know that you’re using S3 or its equivalent, and what those APIs look like and the rest, whereas as an attacker, if I am breaking into someone’s crappy data center-hosted environment, everything is going to be different. Maybe they don’t have a SAN at all, for example. Maybe they have one that hasn’t been patched in five years. Maybe they’re just doing local disk for some reason.

There’s a lot of discovery that has to happen that is almost always removed from Cloud. I mean, take the open S3 bucket problem that we’ve seen as a scourge for 5, 6, 7 years now, where it’s not that S3 itself is insecure, but once you make a configuration mistake, you are now in line with a whole bunch of other folks who may have much more valuable data living in that environment. Where do you land on that one?

Anna: This is the ‘leave cloud to rely on security through obscurity’ argument?

Corey: Exactly. Which I’m not a fan of, but it’s also hard to argue against from time-to-time.

Anna: My other way of phrasing it is ‘the attackers are moving up the stack’ argument. Yeah, so—and there is some sort of truth in that, right? Part of the reason that attackers can move that fast—and I think we say this a lot when we talk about the threat report data, too, because we literally see them execute this behavior, right—is they know what the cloud looks like, right? They have access to all the API documentation, they kind of know what all the constructs are that you’re all using, and so they literally can practice their attack and create all these scripts ahead of time to perform their reconnaissance because they know exactly what they’re looking at, right? On-premise, you’re right, like, they’re going to get into—even to get through my firewall, whatever, they’re getting into my data center—they do not know what disaster I have configured, what kinds of servers I have where, and, like, what the network looks like. They have no idea, right?

In Cloud, this is kind of all gifted to them because it’s so standard, which is a blessing and a curse. It’s a blessing because—well for them, I mean, because they can just programmatically go through this stuff, right? It’s a curse for them because it’s a blessing for us in the same way, right? Like, the defenders… A, have a much easier time knowing what they even have available to them, right? Like, the days of there’s a server in a closet I’ve never heard of are kind of gone, right? Like, you know what’s in your Cloud account because, frankly, AWS tells you. So, I think there is a trade-off there.

The other thing is—about the moving up the stack thing, right—like no matter what you do, they will come after you if you have something worth exploiting you for, right? So, by moving up the stack, I mean, listen, we have abstracted all the physical servers, all of the, like, stuff we used to have to manage the security of because the cloud just does that for us, right? Now, we can argue about whether or not they do a good job, but I’m going to be generous to them and say they do a better job than most companies [laugh] did before. So, in that regard, like, we say, thank you, and we move on to, like, fighting this battle at a higher level in the stack, which is now the workloads and the cloud control plane, and the you name it, whatever is going on after that. So, I don’t actually think you can sort of trade apples for oranges here. It’s just… bad in a different way.

Corey: Do you think that this benchmark is going to be used by various companies who will learn about it? And if so, how do you see that playing out?

Anna: I hope so. My hope when we created it was that it would sort of serve as a goalpost or a way to measure—

Corey: Yeah, it would just be marketing words on a page and never mentioned anywhere, that’s our dream here.

Anna: Yeah, right. Yeah, I was bored. So, I wrote some—[laugh].

Corey: I had a word minimum to get out the door, so there we are. It’s how we work.

Anna: Right. As you know, I used to be a Gartner analyst, and my desire is always to, like, create things that are useful for people to figure out how to do better in security. And my, kind of, tenure at the vendor is just a way to fund that [laugh] more effectively [unintelligible 00:21:08].

Corey: Yeah, I keep forgetting you’re ex-Gartner. Yeah, it’s one of those fun areas of, “Oh, yeah, we just want to basically talk about all kinds of things because there’s a—we have a chart to fill out here. Let’s get after it.”

Anna: I did not invent an acronym, at least. Yeah, so my goal was the following. People are always looking for a benchmark or a goal or standard to be like, “Hey, am I doing a good job?” Whether I’m, like a SOC analyst or director, and I’m just looking at my little SOC empire, or I’m a full on CSO, and I’m looking at my entire security program to kind of figure out risk, I need some way to know whether what is happening in my organization is, like, sufficient, or on par, or anything. Is it good or is it bad? Happy face? Sad face? Like, I need some benchmark, right?

So normally, the Gartner answer to this, typically, is like, “You can only come up with benchmarks that are—” they’re, like, “Only you know what is right for your company,” right? It’s like, you know, the standard, ‘it depends’ answer. Which is true, right, because I can’t say that, like, oh, a huge multinational bank should follow the same benchmark as, like, a donut shop, right? Like, that’s unreasonable. So, this is also why I say that our benchmark is probably more tailored to the more advanced organizations that are dealing with kind of high maturity phenomena and are more cloud-native, but the donut shops should kind of strive in this direction, right?

So, I hope that people will think of it this way: that they will, kind of, look at their process and say, “Hey, like, what are the things that would be really bad if they happened to me, in terms of, sort of, detection?” Like, “What are the threats I’m afraid of where if I saw this in my cloud environment, I would have a really bad day?” And, “Can I detect those threats in 5/5/5?” Because if I can, then I’m actually doing quite well. And if I can’t, then I need to set, like, some sort of roadmap for myself on how I get from where I am now to 5/5/5 because that implies you would be doing a good job.

So, that’s sort of my hope for the benchmark is that people think of it as something to aspire to, and if they’re already able to meet it, then that they’ll tell us how exactly they’re achieving it because I really want to be friends with them.

Corey: Yeah, there’s a definite lack of reasonable ways to think about these things, at least in ways that can be communicated to folks outside of the bounds of the security team. I think that’s one of the big challenges currently facing the security industry is that it is easy to get so locked into the domain-specific acronyms, philosophies, approaches, and the rest, that even coming from, “Well, I’m a cloud engineer who ostensibly needs to know about these things.” Yeah, wander around the RSA floor with that as your background, and you get lost very quickly.

Anna: Yeah, I think that’s fair. I mean, it is a very, let’s say, dynamic and rapidly evolving space. And by the way, like, it was really hard for me to pick these numbers, right, because I… very much am on that whole, ‘it depends’ bandwagon of I don’t know what the right answer is. Who knows what the right answer is [laugh]? So, I say 5/5/5 today. Like, tomorrow, the attack takes five minutes, and now it’s two-and-a-half/two-and-a-half, right? Like it’s whatever.

You have to pick a number and go for it. So, I think, to some extent, we have to try to, like, make sense of the insanity and choose some best practices to anchor ourselves in or some, kind of like, sound logic to start with, and then go from there. So, that’s sort of what I go for.

Corey: So, as I think about the actual reaction times needed for 5/5/5 to actually be realistic, people can’t reliably get a hold of me on the phone within five minutes, so it seems like this is not something you’re going to have humans in the loop for. How does that interface with the idea of automating things versus giving automated systems too much power to take your site down as a potential failure mode?

Anna: Yeah. I don’t even answer the phone anymore, so that wouldn’t work at all. That’s a really, really good question, and probably the question that gives me the most… I don’t know, I don’t want to say lost sleep at night because it’s actually, it’s very interesting to think about, right? I don’t think you can remove humans from the loop in the SOC. Like, certainly there will be things you can auto-respond to some extent, but there’d better be a human being in there because there are too many things at stake, right?

Some of these actions could take your entire business down for far more hours or days than whatever the attacker was doing before. And that trade-off of, like, is my response to this attack actually hurting the business more than the attack itself is a question that’s really hard to answer, especially for most of us technical folks who, like, don’t necessarily know the business impact of any given thing. So, first of all, I think we have to embrace other response actions. Back to our favorite crypto miners, right? Like there is no reason to not automatically shut them down. There is no reason, right? Just build in a detection and an auto-response: every time you see a crypto miner, kill that process, kill that container, kill that node. I don’t care. Kill it. Like, why is it running? This is crazy, right?

I do think it gets nuanced very quickly, right? So again, in SCARLETEEL, there are essentially, like, five or six detections that occur, right? And each of them theoretically has a potential auto-response that you could have executed depending on your, sort of, appetite for that level of intervention, right? Like, when you see somebody assuming a role, that’s perfectly normal activity most of the time. In this case, I believe they actually assumed a machine role, which is less normal. Like, that’s kind of weird.

And then what do you do? Well, you can just, like, remove the role. You can remove that person’s ability to do anything, or remove that role’s ability to do anything. But that could be very dangerous because we don’t necessarily know what the full scope of that role is as this is happening, right? So, you could take a more mitigated auto-response action and add a restrictive policy to that role, for example, to just prevent activity from that IP address that you just saw, right, because we’re not sure about this role, but we’re sure about this IP address, right?
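A minimal sketch of that mitigated response in AWS terms: instead of deleting the role, attach an inline Deny policy scoped to the suspicious source IP. The role name, policy name, and IP address below are purely illustrative:

```python
# Hypothetical mitigated auto-response: deny a compromised role's activity
# from one suspicious IP, without touching the role's normal permissions.

def build_ip_deny_policy(ip: str) -> dict:
    """Build an IAM policy document denying all actions from one source IP."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenySuspiciousSourceIp",
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            # Deny only requests originating from the flagged address.
            "Condition": {"IpAddress": {"aws:SourceIp": f"{ip}/32"}},
        }],
    }

# Attaching it might look like this (boto3 assumed; names are illustrative):
# import json, boto3
# boto3.client("iam").put_role_policy(
#     RoleName="compromised-machine-role",
#     PolicyName="block-suspicious-ip",
#     PolicyDocument=json.dumps(build_ip_deny_policy("203.0.113.99")),
# )
```

Because IAM evaluates explicit Deny before any Allow, this blocks the attacker's path while leaving legitimate use of the role intact.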

So, you have to get into these, sort of, risk-tiered response actions where you say, “Okay, this is always okay to do automatically. And this is, like, sometimes, okay, and this is never okay.” And as you develop that muscle, it becomes much easier to do something rather than doing nothing and just, kind of like, analyzing it in forensics and being, like, “Oh, what an interesting attack story,” right? So, that’s step one, is just start taking these different response actions.
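Those risk tiers can be sketched as a simple policy table. The detection names and tier assignments below are examples of the idea, not a recommended taxonomy:

```python
# Sketch of risk-tiered auto-response: each detection class maps to
# "always auto-act", "sometimes (needs a quick human ack)", or "never".
from enum import Enum

class Tier(Enum):
    ALWAYS = "auto"        # always okay to act automatically
    SOMETIMES = "approve"  # okay with human approval
    NEVER = "alert"        # never auto-act; escalate to a human

# Example assignments -- a real table would be tuned per organization.
RESPONSE_TIERS = {
    "crypto_miner": Tier.ALWAYS,         # kill it, no debate
    "drift_from_image": Tier.SOMETIMES,  # act if the workload is immutable
    "role_assumption": Tier.NEVER,       # usually legitimate activity
}

def decide(detection: str, approved: bool = False) -> str:
    """Return 'respond' to auto-act, otherwise 'escalate' to a human."""
    tier = RESPONSE_TIERS.get(detection, Tier.NEVER)
    if tier is Tier.ALWAYS or (tier is Tier.SOMETIMES and approved):
        return "respond"
    return "escalate"
```

Starting with a table like this makes "do something" the default for the safe cases while keeping humans in the loop everywhere else.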

And then step two is more long-term, and it’s that you have to embrace the cloud-native way of life, right? Like this immutable, ephemeral, distributed religion that we’ve been selling, it actually works really well if you, like, go all-in on the religion. I sound like a real cult leader [laugh]. Like, “If you just go all in, it’s going to be great.” But it’s true, right?

So, if your workloads are immutable—that means they cannot change as they’re running—then when you see them drifting from their original configuration, you know that is bad. So, you can immediately know that it’s relatively safe to take an auto-response action and kill that workload, because you are, like, a hundred percent certain it is not doing the right things, right? And then furthermore, if all of your deployments are defined as code, which they should be, then it is approximately—though not entirely—trivial to get that workload back, right? Because you just push a button, and it just generates that same Kubernetes cluster with those same nodes doing all those same things, right? Whereas in the on-premises world, shooting a server was potentially a fireable offense, because if that server was running something critical and you couldn’t get it back, you were done.

In the cloud, this is much less dangerous because there’s, like, an infinite quantity of servers that you could bring back, and hopefully Infrastructure-as-Code and, kind of, Configuration-as-Code sitting in some wonderful version-controlled registry for you to rely on to rehydrate all that stuff, right? So again, to sort of TL;DR: get used to taking auto-response actions, but do so carefully. Like, define a scope for those actions that makes sense, and not just, like, “Something bad happened; burn it all down,” obviously. And then as you become more cloud-native—which sometimes requires refactoring entire applications and, by the way, could take years—just embrace the joy of Everything-as-Code.
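The immutable-workload response described above can be sketched end to end: flag any process the image never declared, kill the pod, and let the declarative deployment rehydrate it. The process baseline and `kubectl` command are illustrative assumptions:

```python
# Sketch of drift-based auto-response for immutable workloads.
# The expected-process baseline would come from the image manifest or a
# runtime profile; here it is passed in as a plain set for illustration.

def detect_drift(expected: set[str], running: set[str]) -> set[str]:
    """Processes observed at runtime that the immutable image never declared."""
    return running - expected

def auto_respond(pod: str, expected: set[str], running: set[str]) -> list[str]:
    """Return the delete command if the pod has drifted, else no action."""
    drifted = detect_drift(expected, running)
    if not drifted:
        return []
    # Safe to kill: the Deployment's desired state lives in version control,
    # so Kubernetes replaces the pod with a clean copy automatically.
    return ["kubectl", "delete", "pod", pod]
```

The safety argument rests entirely on the Everything-as-Code premise: deleting the pod is recoverable only because its definition is.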

Corey: That’s a good way of thinking about it. I just, I wish there were an easier path to get there, for an awful lot of folks who otherwise don’t find a clear way to unlock that.

Anna: There is not, unfortunately [laugh]. I mean, the upside is, like, there are a lot of people that have done it successfully, I have to say. I couldn’t have said that to you, like, six, seven years ago when we were just getting started on this journey, but especially for those of you who were just at KubeCon—however long ago that was before this airs—you see a pretty robust ecosystem around Kubernetes, around containers, around cloud in general, and so even if you feel like your organization’s behind, there are a lot of folks you can reach out to to learn from, to get some help, to just sort of start joining the masses of cloud-native types. So, it’s not nearly as hopeless as before. And also, one thing I like to say always is, almost every organization is going to have some technical debt and some legacy workload that they can’t convert to the religion of cloud.

And so, you’re not going to have a 5/5/5 threat detection SLA on those workloads. Probably. I mean, maybe you can, but probably you’re not, and you may not be able to take auto-response actions, and you may not have all the same benefits available to you, but that’s okay. That’s okay. Hopefully, whatever that thing is running is, you know, worth keeping alive, but set this new standard for your new workloads. So, when your team is building a new application, or refactoring an application for the new world, set the standard on them, and don’t, kind of like, torment the legacy folks, because it doesn’t necessarily make sense. Like, they’re going to have different SLAs for different workloads.

Corey: I really want to thank you for taking the time to speak with me yet again about the stuff you folks are coming out with. If people want to learn more, where’s the best place for them to go?

Anna: Thanks, Corey. It’s always a pleasure to be on your show. If you want to learn more about the 5/5/5 benchmark, you should go to

Corey: And we will, of course, put links to that in the show notes. Thank you so much for taking the time to speak with me today. As always, it’s appreciated. Anna Belak, Director of the Office of Cybersecurity Strategy at Sysdig. I’m Cloud Economist Corey Quinn, and this has been a promoted guest episode brought to us by our friends at Sysdig. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry, insulting comment that I will read nowhere even approaching within five minutes.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business, and we get to the point. Visit to get started.
