Creating A Resilient Security Strategy Through Chaos Engineering with Kelly Shortridge

Episode Summary

Kelly Shortridge, Senior Principal Engineer at Fastly, joins Corey on Screaming in the Cloud to discuss their recently released book, Security Chaos Engineering: Sustaining Resilience in Software and Systems. Kelly explains why a resilient strategy is far preferable to a bubble-wrapped approach to cybersecurity, and how developer teams can use evidence to mitigate security threats. Corey and Kelly discuss how the risks of working with complex systems are perfectly illustrated by Jurassic Park, and Kelly also highlights why it’s critical to address both system vulnerabilities and human vulnerabilities in your development environment rather than pointing fingers when something goes wrong.

Episode Show Notes & Transcript

About Kelly

Kelly Shortridge is a senior principal engineer at Fastly in the office of the CTO and lead author of "Security Chaos Engineering: Sustaining Resilience in Software and Systems" (O'Reilly Media). Shortridge is best known for their work on resilience in complex software systems, the application of behavioral economics to cybersecurity, and bringing security out of the dark ages. Shortridge has been a successful enterprise product leader as well as a startup founder (with an exit to CrowdStrike) and investment banker. Shortridge frequently advises Fortune 500s, investors, startups, and federal agencies and has spoken at major technology conferences internationally, including Black Hat USA, O'Reilly Velocity Conference, and SREcon. Shortridge's research has been featured in ACM, IEEE, and USENIX, spanning behavioral science in cybersecurity, deception strategies, and the ROI of software resilience. They also serve on the editorial board of ACM Queue.

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: Have you listened to the new season of Traceroute yet? Traceroute is a tech podcast that peels back the layers of the stack to tell the real, human stories about how the inner workings of our digital world affect our lives in ways you may have never thought of before. Listen and follow Traceroute on your favorite platform. My thanks to them for sponsoring this ridiculous podcast.

Corey: Welcome to Screaming in the Cloud, I’m Corey Quinn. My guest today is Kelly Shortridge, who is a Senior Principal Engineer over at Fastly, as well as the lead author of the recently released Security Chaos Engineering: Sustaining Resilience in Software and Systems. Kelly, welcome to the show.

Kelly: Thank you so much for having me.

Corey: So, I want to start with the honest truth that in that title, I think I know what some of the words mean, but when you put them together in that particular order, I want to make sure we’re talking about the same thing. Can you explain that like I’m five, as far as what your book is about?

Kelly: Yes. I’ll actually start with an analogy I make in the book, which is, imagine you were trying to rollerblade to some destination. Now, one thing you could do is wrap yourself in a bunch of bubble wrap and become the bubble person, and you can waddle down the street trying to make it to your destination on the rollerblades, but if there’s a gust of wind or a dog barks or something, you’re going to flop over, you’re not going to recover. However, if you instead do what everybody does, which is you know, kneepads and other things that keep you flexible and nimble, the gust you know, there’s a gust of wind, you can kind of be agile, navigate around it; if a dog barks, you just roller-skate around it; you can reach your destination. The former, the bubble person, that’s a lot of our cybersecurity today. It’s just keeping us very rigid, right? And then the alternative is resilience, which is the ability to recover from failure and adapt to evolving conditions.

Corey: I feel like I am about to torture your analogy to death because back when I was in school in 2000, there was an annual tradition at the school I was attending before failing out, where a bunch of us would paint ourselves green every year and then bike around the campus naked. It was the green bike ride. So, one year I did this on rollerblades. So, if you wind up looking—there’s the bubble wrap, there’s the safety gear, and then there’s wearing absolutely nothing, which feels—

Kelly: [laugh]. Yes.

Corey: —kind of like the startup approach to InfoSec. It’s like, “It’ll be fine. What’s the worst that happens?” And you’re super nimble, super flexible, until suddenly, oops, now I really wish I’d done things differently.

Kelly: Well, there’s a reason why I don’t say rollerblade naked, which other than it being rather visceral, what you described is what I’ve called YOLOSec before, which is not what you want to do. Because the problem when you think about it from a resilience perspective, again, is you want to be able to recover from failure and adapt. Sure, you can oftentimes move quickly, but you’re probably going to erode software quality over time, so to a certain point, there’s going to be some big incident, and suddenly, you aren’t fast anymore, you’re actually pretty slow. So, there’s this, kind of, happy medium where you have enough, I would like security by design—we can talk about that a bit if you want—where you have enough of the security by design baked in and you can think of it as guardrails that you’re able to withstand and recover from any failure. But yeah, going naked, that’s a recipe for not being able to rollerblade, like, ever again, potentially [laugh].

Corey: I think, on some level, that the correct dialing in of security posture is going to come down to context, in almost every case. “I’m building something in my spare time in the off hours” does not need the same security posture—mostly—as “we are a bank.” It feels like there’s a very wide gulf between those two extremes. Unfortunately, I find that there’s a certain tone-deafness coming from a lot of the security industry around, “Oh, everyone must have security as their number one thing, ever.” I mean, with my clients whose AWS bills I fix, I have to care about security contractually, but the secrets that I hold are boring: how much money certain companies pay another very large company.

Yes, I’ll get sued into oblivion if that leaks, but nobody dies. Nobody is having their money stolen as a result. It’s slightly embarrassing in the tech press for a cycle and then it’s over and done with. That’s not the same thing as a brief stint I did running tech ops at Grindr ten years ago where, leak that database and people will die. There’s a strong difference between those threat models, and on some level, being able to act accordingly has been one of the more eye-opening approaches to increasing velocity in my experience. Does that align with the thesis of your book, since my copy has not yet arrived for this recording?

Kelly: Yes. I am not afraid to say “it depends” in the book, and you’re right, it depends on context. I actually talk about this resilience potion recipe that you can check out if you want, these ingredients so we can sustain resilience. A key one is defining your critical functions, just what is your system’s reason for existence, and that is what you want to make sure can recover and still operate under adverse conditions, like you said.

Another example I give all the time is most SaaS apps have some sort of reporting functionality. Guess what? That’s not mission-critical. You don’t need the utmost security on that, for the most part. But if it’s processing transactions, yeah, probably you want to invest more security there. So yes, I couldn’t agree more that it’s context-dependent and oh, my God, does the security industry ignore that so much of the time, and it’s been my gripe for, I feel like as long as I’ve been in the industry.

Corey: I mean, there was a great talk that Netflix gave years ago where they mentioned in passing that all developers have root in production. And that’s awesome, and the person next to me was super excited and I looked at their badge, and holy hell, they worked at an actual bank. That seems like a bad plan. But talking to the Netflix speaker, Dave Hahn, after the fact, something that I found extraordinarily insightful was, “Yeah, well, we just isolate off the PCI environment and the rest, so sensitive data lives in its own compartmentalized area.” So, at that point, yeah, you’re not going to be able to break much in that scenario.

It’s like, that would have been helpful context to put in the talk. Which I’m sure he did, but my attention span had tripped out and I missed that. But that’s, on some level, constraining blast radius and not having compliance and regulatory issues extending to every corner of your environment really frees you up to do things appropriately. But there are some things where you do need to care about this stuff, regardless of how small the surface area is.

Kelly: Agreed. And I introduced the concept of the effort investment portfolio in the book, which is basically: where does it matter to invest effort, and where can you maybe save some resources? I think one thing you touched on, though, is, we’re really talking about isolation, and I actually think people don’t think about isolation in as much detail or as expansively as they could. Because we want temporal, logical, and spatial isolation. What you talked about is, yeah, there are some cases where you want to isolate data, you want to isolate certain subsystems, and that could be containers, it could also be AWS security groups.

It could take a bunch of different forms, it could be something like RLBox in WebAssembly land. But I think that’s something that I really try to highlight in the book is, there’s actually a huge opportunity for security engineers starting from the design of a system to really think about how can we infuse different forms of isolation to sustain resilience.

Corey: It’s interesting that you use the word investment. When fixing AWS bills for a living, I’ve learned over the last almost seven years now of doing this that cost and architecture and cloud are fundamentally the same thing. And resilience is something that comes with a very real cost, particularly when you start looking at what the architectural choices are. And one of the big reasons that I only ever work on a fixed-fee basis is because if I’m charging for a percentage of savings or something, it inspires me to say really uncomfortable things like, “Backups are for cowards.” And, “When was the last time you saw an entire AWS availability zone go down for so long that it mattered? You don’t need to worry about that.” And it does cut off an awful lot of cost issues, at the price of making the environment more fragile.

That’s where the context thing starts to come in. I mean, in many cases, if AWS is having a bad day in a given region, well, does your business need that workload to be functional? For my newsletter, I have a publication system that’s single-homed out of the Oregon region. If that whole thing goes down for multiple days, I’m writing that week’s issue by hand because I’m going to have something different to talk about anyway. For me, there is no value in making that investment. But for companies, there absolutely is, but there also seems to be a lack of awareness around how much is a reasonable investment in that area, and when do you start making that investment? And most critically, when do you stop?

Kelly: I think that’s a good point, and luckily, what’s on my side is the fact that there’s a lot of just profligate spending in cybersecurity and [laugh] that’s really what I’m focused on is, how can we spend those investments better? And I actually think there’s an opportunity in many cases to ditch a ton of cybersecurity tools and focus more on some of the stuff you talked about. I agree, by the way; I’ve seen some threat models where it’s like, well, AWS, all regions go down. I’m like, at that point, we have, like, a severe, bigger-than-whatever-you’re-thinking-about problem, right?

Corey: Right. So, does your business continuity plan account for every one of your staff suddenly quitting on the spot because there’s a whole bunch of companies with very expensive consulting, like, problems that I’m going to go work for a week and then buy a house in cash. It’s one of those areas where, yeah, people are not going to care about your environment more than they are about their families and other things that are going on. Plan accordingly. People tend to get so carried away with these things with tabletop planning exercises. And then of course, they forget little things like I overwrote the database by dropping the wrong thing. Turns out that was production. [laugh]. Remembering for [a me 00:10:00] there.

Kelly: Precisely. And a lot of the chaos experiments that I talk about in the book are a lot of those, like, let’s validate some of those basics, right? That’s actually some of the best investments you can make. Like, if you do have backups, I can totally see your argument about backups are for cowards, but if you do have them, like, maybe you conduct experiments to make sure that they’re available when you need them, and the same thing, even on the [unintelligible 00:10:21] side—

Corey: No one cares about backups, but everyone really cares about restores, suddenly, right after—

Kelly: Yeah.

Corey: —they really should have cared about backups.

Kelly: Exactly. So, I think it’s looking at those experiments where it’s like, okay, you have these basic assumptions in place that you assume to be invariants or assume that they’re going to bail you out if something goes wrong. Let’s just verify. That’s a great place to start because I can tell you—I know you’ve been to the RSA hall floor—how many cybersecurity teams are actually assessing the efficacy and actually experimenting to see if those tools really help them during incidents? It’s pretty few.
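
To make the "let's just verify" idea concrete, here is a minimal Python sketch of a restore experiment. It is only an illustration under stated assumptions: the file names, the `verify_restore` helper, and the `copy_restore` stand-in are all hypothetical, and a real experiment would call whatever restore procedure your backup system actually provides.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_restore(source: Path, backup: Path,
                   restore: Callable[[Path, Path], Path]) -> bool:
    """Restore `backup` into a scratch directory and compare it to `source`.

    `restore` is a hypothetical hook: it takes the backup path and a
    destination directory, and returns the path of the restored file.
    """
    with tempfile.TemporaryDirectory() as scratch:
        restored = restore(backup, Path(scratch))
        return restored.exists() and sha256_of(restored) == sha256_of(source)


def copy_restore(backup: Path, dest_dir: Path) -> Path:
    """Stand-in restore procedure (an assumption): a plain file copy."""
    target = dest_dir / backup.name
    shutil.copy2(backup, target)
    return target


# Demo with throwaway files: one healthy backup, then a corrupted one.
with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "orders.db"
    backup = Path(workdir) / "orders.db.bak"
    source.write_bytes(b"critical records")
    shutil.copy2(source, backup)
    healthy = verify_restore(source, backup, copy_restore)

    backup.write_bytes(b"truncated")  # simulate a backup that silently went bad
    corrupted = verify_restore(source, backup, copy_restore)

print(f"healthy backup restores correctly: {healthy}")
print(f"corrupted backup restores correctly: {corrupted}")
```

The point of the sketch is the shape of the experiment: the assumption "our backups will bail us out" becomes a check that can fail, and it is run before an incident rather than during one.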

Corey: Oh, vendors do not want to do those analyses. They don’t want you to do those analyses, either, and if you do, for God’s sakes, shut up about it. They’re trying to sell things here, mostly firewalls.

Kelly: Yeah, cybersecurity vendors aren’t necessarily happy about my book and what I talk about because I have almost this ruthless focus on evidence and [unintelligible 00:11:08] cybersecurity vendors kind of thrive on a lack of evidence. So.

Corey: There’s so much fear, uncertainty, and doubt in that space and I do feel for them. It’s a hard market to sell in without having to talk about here’s the thing that you’re defending against. In my case, it’s easy to sell “the AWS bill is high” because if I have to explain why more or less setting money on fire is a bad thing, I don’t really know what to tell you. I’m going to go look for a slightly different customer profile. That’s not really how it works in security. I’m sure there are better go-to-market approaches, but they’re hard to find, at least ones that work holistically.

Kelly: There are. And one of my priorities with the book was to really enumerate how many opportunities there are to take software engineering practices that people already know, let’s say something like type systems even, and how those can actually help sustain resilience. Even things like integration testing or infrastructure as code, there are a lot of opportunities just to extend what we already do for systems reliability to sustain resilience against things that are attacks and just make sure that, you know, we cover a few of those cases as well. A lot of it should be really natural to software engineering teams. Again, security vendors don’t like that because it turns out software engineering teams don’t particularly like security vendors.

Corey: I hadn’t noticed that. I do wonder, though, for those who are unaware, chaos engineering started off as breaking things on purpose, which I feel like one person had a really good story and thought about it super quickly when they were about to get fired. Like, “No, no, it’s called Chaos Engineering.” Good for them. It’s now a well-regarded discipline. But I’ve always heard of it in the context of reliability of, “Oh, you think your site is going to work if the database falls over? Let’s push it over and see what happens.” How does that manifest in a security context?

Kelly: So, I will clarify, I think that’s a slight misconception. It’s really about fixing things in production, and that’s the end goal. I think we should not break things just to break them, right? But I’ll give a simple example, which I know is based on an experiment Aaron Rinehart conducted at UnitedHealth Group, which is, okay, let’s inject a misconfigured port as an experiment and see what happens, end-to-end. In their case, the firewall only detected the misconfigured port 60% of the time, so 60% of the time, it works every time.

But it was actually the cloud, the very common, like, Cloud configuration management tool that caught the change and alerted responders. So, it’s that kind of thing where we’re still trying to verify those assumptions that we have about our systems and how they behave, again, end-to-end. In a lot of cases, again, with security tools, they are not behaving as we expect. But I still argue security is just a subset of software quality, so if we’re experimenting to verify, again, our assumptions and observe system behavior, we’re benefiting software quality, and security is just a subset of that. Think about C code, right? It’s not like there’s, like, a healthy memory corruption, so it’s bad for both quality and security reasons.
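
The experiment Kelly describes can be sketched as a small harness: inject the same change repeatedly and record which controls notice it. The Python below is a simulation only, not the tooling used at UnitedHealth Group. The injector, the two detector functions, the port number, and the deterministic 3-in-5 firewall behaviour are all assumptions chosen so the example is reproducible.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ExperimentResult:
    trials: int
    detections: Dict[str, int]

    def rate(self, control: str) -> float:
        """Fraction of trials in which `control` detected the change."""
        return self.detections[control] / self.trials


def run_port_experiment(
    inject: Callable[[int], dict],
    controls: Dict[str, Callable[[dict], bool]],
    trials: int,
) -> ExperimentResult:
    """Inject one change per trial and count detections for each control."""
    detections = {name: 0 for name in controls}
    for trial in range(trials):
        change = inject(trial)
        for name, detects in controls.items():
            if detects(change):
                detections[name] += 1
    return ExperimentResult(trials, detections)


def inject_misconfigured_port(trial: int) -> dict:
    """Hypothetical injector: describes a port opened to the whole internet."""
    return {"trial": trial, "port": 8443, "cidr": "0.0.0.0/0"}


def simulated_firewall(change: dict) -> bool:
    """Assumed flaky control: notices only 3 out of every 5 injections."""
    return change["trial"] % 5 < 3


def simulated_config_monitor(change: dict) -> bool:
    """Assumed control that flags any world-open rule."""
    return change["cidr"] == "0.0.0.0/0"


result = run_port_experiment(
    inject_misconfigured_port,
    {"firewall": simulated_firewall, "config_monitor": simulated_config_monitor},
    trials=10,
)
for control in result.detections:
    print(f"{control}: detected {result.rate(control):.0%} of injections")
```

In a real environment the injector would make an actual, reversible change in a scoped test setup, and the detectors would query real alerting systems. What carries over from the sketch is the comparison between what each control was assumed to catch and what it actually caught.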

Corey: One problem that I’ve had in the security space for a while is—let’s [unintelligible 00:14:05] on this to AWS for a second because that is the area in which I spend the most of my time, which probably explains a lot about my personality challenges. But the problem that I keep smacking into is if I go ahead and configure everything the way that I should according to best practices and the rest, I wind up with a firehose torrent of information in terms of CloudTrail logs, et cetera. And it’s expensive in its own right. But then to sort through it or to do a lot of things in security, there are basically two options. I can either buy a vendor’s product, which generally tends to start around $12,000 a year and goes up rapidly from there on my current $6,000 a year bill, so okay, twice as much as the infrastructure for security monitoring. Okay.

Or alternately, find a bunch of different random scripts and tools on GitHub of wildly diverging quality and sort of hope for the best on that. It feels like there’s nothing in between. And the reason I care about this is not because I’m cheap but because when you have an individual learner who is either a student or a career switcher or someone just trying to experiment with this, you want them to begin as you want them to go on, and things that are no money for an enterprise are all the money to them. They’re going to learn to work with the tools that they can afford. That feels like it’s a big security swing and a miss. Do you agree or disagree? What’s the nuance I’m missing here?

Kelly: No, I don’t think there’s nuance you’re missing. I think security observability, for one, isn’t a buzzword that particularly exists. I’ve been trying to make it a thing, but I’m solely one individual screaming into the void. But observability just hasn’t been a thing. We haven’t really focused on, okay, so what, like, we get data and what do we do with it?

And I think, again, from a software engineering perspective, I think there’s a lot we can do. One, we can just avoid duplicating efforts. We can treat observability, again, of any sort of issue as similar, whether that’s an attack or a performance issue. I think this is another place where security, or any sort of chaos experiment, shines though because if you have an idea of here’s an adverse scenario we care about, you can actually see how does it manifest in the logs and you can start to figure out, like, what signals do we actually need to be looking for, what signals matter to be able to narrow it down. Which again, it involves time and effort, but also, I can attest when you’re buying the security vendor tool and, in theory, absolving some of that time and effort, it’s maybe, maybe not, because it can be hard to understand what the outcomes are or what the outputs are from the tool and it can also be very difficult to tune it and to be able to explain some of the outputs. It’s kind of like trading upfront effort versus long-term overall overhead if that makes sense.

Corey: It does. On that note, the title of your book includes the magic key phrase ‘sustaining resilience.’ I have found that security effort and investment tends to resemble a fire drill in—

Kelly: [laugh].

Corey: —an awful lot of places, where, “We care very much about security,” says the company, right after they very clearly failed to care about security, and I know this because I’m reading an email about a breach that they’ve just sent me. And then there’s a whole bunch of running around and hair-on-fire moments. But then there’s a new shiny that always comes up, a new strategic priority, and it falls to the wayside again. What do you see that drives that sustained effort and focus on resilience in a security context?

Kelly: I think it’s really making sure you have a learning culture, which sounds very [unintelligible 00:17:30], but things again, like, experiments can help just because when you do simulate those adverse scenarios and you see how your system behaves, it’s almost like running an incident and you can use that as very fresh, kind of, like collective memory. And I even strongly recommend starting off with prior incidents and simulating those, just to see like, hey, did the improvements we made actually help? If they didn’t, that can be kind of another fire under the butt, so to speak, to continue investing. So, definitely in practice—and there’s some case studies in the book—it can be really helpful just to kind of like sustain that memory and sustain that learning and keep things feeling a bit fresh. It’s almost like prodding the nervous system a little, just so it doesn’t go back to that complacent and convenient feeling.

Corey: It’s one of the hard problems because—I’m sure I’m going to get castigated for this by some of the listeners—but computers are easy, particularly compared to the people. There are deterministic ways to solve almost any computer problem, but people are always going to be a little bit different, and getting them to perform the same way today that they did yesterday is an exercise in frustration. Changing the culture, changing the approach and the attitude that people take toward a lot of these things feels, from my perspective, like, something of an impossible job. Cultural transformations are things that everyone talks about, but it’s rare to see them succeed.

Kelly: Yes, and that’s actually something that I very strongly weaved throughout the book is that if your security solutions rely on human behavior, they’re going to fail. We want to either reduce hazards or eliminate hazards by design as much as possible. So, my view is very much again, like, can you make processes more repeatable? That’s going to help security. If anyone takes away from my book that they need to have, like, a thousand hours of training to change hearts and minds, then they have completely misunderstood most of the book.

The idea is very much like, what are practices that we want for other outcomes anyway—again, reliability or faster time to market—and how can we harness those to also be improving resilience or security at the same time? It’s very much trying to think about those opportunities rather than, you know, trying to drill into people’s heads, like, “Thou shalt not,” or, “Thou shall.”

Corey: Way back in 2018, you gave a keynote at some conference or another and you built the entire thing on the story of Jurassic Park, specifically Ian Malcolm as one of your favorite fictional heroes, and you tied it into security in a bunch of different ways. You hadn’t written this book then unless the authorship process is way longer than I think it is. So, I’m curious to get your take on what Jurassic Park can teach us about software security.

Kelly: Yes, so I talk about Jurassic Park as a reference throughout the book, frequently. I’ve loved that book since I was a very young child. Jurassic Park is a great example of a complex system gone wrong because you can’t point to any one thing. Like there’s Dennis Nedry, you know, messing up the power system, but then there’s also the software was looking for a very specific count of dinosaurs and they didn’t anticipate there could be more in the count. Like, there are so many different factors that influenced it, you can’t actually blame just, like, human error or point fingers at one thing.

That’s a beautiful example of how things go wrong in our software systems because like you said, there’s this human element and then there’s also how the humans interact and how the software components interact. But with Jurassic Park, too, I think the great thing is dinosaurs are going to do dinosaur things like eating people, and there are also equivalents in software, like C code. C code is going to do C code things, right? It’s not a memory safe language, so we shouldn’t be surprised when something goes wrong. We need to prepare accordingly.
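
The dinosaur-count failure Kelly mentions is a familiar software pattern: a check that stops as soon as its expectation is met cannot report that reality exceeded the expectation. The Python below is a loose, hypothetical illustration of that pattern; the animal names and both function names are invented for the example.

```python
from typing import Iterable


def count_until_expected(tracked: Iterable[str], expected: int) -> int:
    """Count animals, but stop as soon as the expected total is reached."""
    found = 0
    for _ in tracked:
        if found >= expected:
            break  # the check declares success and never looks further
        found += 1
    return found


def count_everything(tracked: Iterable[str]) -> int:
    """Count every tracked animal, with no assumption about the total."""
    return sum(1 for _ in tracked)


EXPECTED_ANIMALS = 3
# Hypothetical tracker data: two more animals than anyone planned for.
animals = ["rex-1", "raptor-1", "raptor-2", "raptor-3", "compy-1"]

assumed = count_until_expected(animals, EXPECTED_ANIMALS)
actual = count_everything(animals)
surplus = actual - EXPECTED_ANIMALS

print(f"check that stops at the expected total reports: {assumed}")
print(f"full count reports: {actual} ({surplus} unaccounted for)")
```

The first function always confirms the assumption it was built around, which is why it looks healthy right up until it matters.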

Corey: “How could this happen? Again?” Yeah.

Kelly: Right. At a certain point, it’s like, there’s probably no way to sufficiently introduce isolation for dinosaurs unless you put them in a bunker where no one can see them, and it’s the same thing sometimes with things like C code. There’s just no amount of effort you can invest, and you’re just kind of investing for a really unclear and generally not fortuitous outcome. So, I like it as kind of this analogy to think about, okay, where do our effort investments make sense and where is it sometimes like, we really just do need to refactor because we’re dealing with dinosaurs here.

Corey: When I was a kid, that was one of my favorite books, too. The problem is, I didn’t realize I was getting a glimpse of my future at a number of crappy startups that I worked at. Because you have John Hammond, who was the owner of the park talking constantly about how, “We spared no expense,” but then you look at what actually happened and he spared every frickin expense. You have one IT person who is so criminally underpaid that smuggling dinosaur embryos off the island becomes a viable strategy for this. He wound up, “Oh, we couldn’t find the right DNA, so we’re just going to, like, splice some other random stuff in there. It’ll be fine.”

Then you have the massive overconfidence because it sounds very much like he had this almost Muskian desire to fire anyone who disagreed with him, and yeah, there was a certain lack of investment that could have been made, despite loud protestations to the contrary. I’d say that he is the root cause, he is the proximate reason for the entire failure of the park. But I’m willing to entertain disagreement on that point.

Kelly: I think there are other individuals, like Dr. Wu, if you recall, like, deciding to do the frog DNA and not thinking that maybe something could go wrong. I think there was a lot of overconfidence, which you’re right, we do see a lot in software. So, I think that’s actually another very important lesson is that incentives matter and incentives are very hard to change, kind of like what you talked about earlier. It doesn’t mean that we shouldn’t include incentives in our threat model.

So like, in the book I talk about how our threat models should include things like maybe yeah, people are underpaid or there is a ton of pressure to deliver things quickly or, you know, do things as cheaply as possible. That should be just as much a part of our threat models as all of the technical stuff too.

Corey: I think that there’s a lot that was in that movie that was flat-out wrong. For example, one of the kids—I forget her name; it’s been a long time—was logging in and said, “Oh, this is Unix. I know Unix.” And having learned Unix as my first basically professional operating system, “No, you don’t. No one knows Unix. They get very confused at some point, the question is, just how far down what rabbit hole it is.”

I feel so sorry for that kid. I hope she wound up seeking therapy when she was older to realize that, no, you don’t actually know Unix. It’s not that you’re bad at computers, it’s that Unix is user-hostile, actively so. Like, the raptors, like, that’s the better metaphor when everything winds up shaking out.

Kelly: Yeah. I don’t disagree with that. The movie definitely takes many liberties. I think what’s interesting, though, is that Michael Crichton, specifically, when he talks about writing the book—I don’t know how many people know this—dinosaurs were just a mechanism. He knew people would want to read it in an airport.

What he cared about was communicating really the danger of complex systems and how if you don’t respect them and respect that interactivity and that it can baffle and surprise us, like, things will go wrong. So, I actually find it kind of beautiful in a way that the dinosaurs were almost like an afterthought. What he really cared about was exactly what we deal with all the time in software, is when things go wrong with complexity.

Corey: Like one of his other books, Airframe, talked about an air disaster. There’s a bunch of contributing factors and the rest, and for some reason, that did not receive the wild acclaim that Jurassic Park did to become a cultural phenomenon that we’re still talking about, what, 30 years later.

Kelly: Right. Dinosaurs are very compelling.

Corey: They really are. I have to ask though—this is the joy of having a kid who is almost six—what is your favorite dinosaur? Not a question most people get asked very often, but I am going to trot that one out.

Kelly: No. Oh, that is such a good question. Maybe a Deinonychus.

Corey: Oh, because they get so angry they spit and kill people? That’s amazing.

Kelly: Yeah. And I like that, kind of like, nimble, smarter one, and also the fact that most of the smaller ones allegedly had feathers, which I just love this idea of, like, feather-ful murder machines. I have the classic, like, nerd kid syndrome, though, where I read all these dinosaur names as a kid and I’ve never pronounced them out loud. So, I’m sure there are others—

Corey: Yep.

Kelly: —that I would just word salad. But honestly, it’s hard to go wrong with choosing a favorite dinosaur.

Corey: Oh, yeah. I’m sure some paleontologist is sitting out there in the field on the dig somewhere listening to this podcast, just getting very angry at our pronunciation and things. But for God’s sake, I call the database Postgres-squeal. Get in line. There’s a lot of that out there where looking at complex system failures and different contributing factors and the rest makes stuff—that’s what makes things interesting.

I think that the idea of a root cause is almost always incorrect. It’s not, “Okay, who tripped over the buried landmine,” is not the interesting question. It’s, “Who buried the thing?” What were all the things that wound up contributing to this? And you can’t even frame it that way in the blaming context, just because you start doing that and people clam up, and good luck figuring out what really happened.

Kelly: Exactly. That’s so much of what the cybersecurity industry is focused on is how do we assign blame? And it’s, you know, the marketing person clicked on a link. And it’s like, they do that thousands of times, like a month, and the one time, suddenly, they were stupid for doing it? That doesn’t sound right.

So, I’m a big fan of, yes, vanquishing root cause, thinking about contributing factors, and in particular, in any sort of incident review, you have to think about, was there a design or process problem? You can’t just think about the human behavior; you have to think about where are the opportunities for us to design things better, to make the secure way more of the default way.

Corey: When you talk about resilience and reliability and big, notable outages, most forward-thinking companies are going to go and do a variety of incident reviews and disclosures around everything that happened to it, depending upon levels of trust and whether you’re NDA’ed or not, and how much gets public is going to vary from place to place. But from a security perspective, that feels like the sort of thing that companies will clam up about and never say a word.

Kelly: Yes.

Corey: Because I can wind up pouring a couple of drinks into people and get the real story of outages, or the AWS bill, but security stuff, they start to wonder if I’m a state actor, on some level. When you were building all of this, how did you wind up getting people to talk candidly and forthrightly about issues that, if it became tied to them that they were talking about this in public, would almost certainly have negative career impact for them?

Kelly: Yes, so that’s almost like a trade secret, I feel like. A lot of it is yes, over the years talking with people over, generally at a conference where you know, things are tipsy. I never want to betray confidentiality, to be clear, but certainly pattern-matching across people’s stories.

Corey: Yeah, we’re both in positions where if even the hint of they can’t be trusted enters the ecosystem, I think both of our careers explode and never recover. Like it’s—

Kelly: Exactly.

Corey: —yeah. Oh, yeah. They play fast and loose with secrets is never the reputation you want as a professional.

Kelly: No. No, definitely not. So, it’s much more pattern matching and trying to generalize. But again, a lot of what can go wrong is not that different when you think about a developer being really tired and making a bunch of mistakes versus an attacker. A lot of times they’re very much the same, so luckily there’s commonality there.

I do wish the security industry was more forthright and less clandestine because frankly, all of the public postmortems that are out there about performance issues are just such, such a boon for everyone else to improve what they’re doing. So, that’s a change I wish would happen.

Corey: So, I have to ask, given that you talk about security, chaos engineering, and resilience, and of course software and systems, all in the title of the O’Reilly book, who is the target audience for this? Is it folks who have the word security featured three times in their job title? Is it folks who are new to the space? Where does your target audience start and stop?

Kelly: Yes, so I have kept it pretty broad and it’s anyone who works with software, but I’ll talk about the software engineering audience because that is, honestly, probably out of anyone who I would love to read the book the most because I firmly believe that there’s so much that software engineering teams can do to sustain resilience and security and they don’t have to be security experts. So, I’ve tried to demystify security, make it much less arcane, even down to, like, how attackers, you know, they have their own development lifecycle. I try to demystify that, too. So, it’s very much for any team, especially, like, platform engineering teams, SREs, to think about, hey, what are some of the things maybe I’m already doing that I can extend to cover, you know, the security cases as well? So, I would love for every software engineer to check it out to see, like, hey, what are the opportunities for me to just do things slightly differently and have these great security outcomes?

Corey: I really want to thank you for taking the time to talk with me about how you view these things. If people want to learn more, where’s the best place for them to find you?

Kelly: Yes, I have all of the social media which is increasingly fragmented, [laugh] I feel like, but I also have my personal site, The official book site is as well. But otherwise, find me on LinkedIn, Twitter, Mastodon, Bluesky. I’m probably blanking on the others. There’s probably already a new one while we’ve spoken.

Corey: Blue-ski is how I insist on pronouncing it as well, while we’re talking about—

Kelly: Blue-ski?

Corey: Funhouse pronunciation on things.

Kelly: I like it.

Corey: Excellent. And we will, of course, put links to all of those things in the show notes. Thank you so much for being so generous with your time. I really appreciate it.

Kelly: Thank you for having me and being a fellow dinosaur nerd.

Corey: [laugh]. Kelly Shortridge, Senior Principal Engineer at Fastly. I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice, along with an insulting comment about how our choice of dinosaurs is incorrect, then put the computer away and struggle to figure out how to open a door.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit to get started.