How to Investigate the Post-Incident Fallout with Laura Maguire, PhD

Episode Summary

It turns out that when it comes to incidents, you can do more than just blow past them and onto the next one! Laura Maguire, lead of the research program at Jeli.io, is changing the “leave it in your tracks” mentality and focusing on post-incident investigative work. Laura, who holds a PhD in Cognitive Systems Engineering, draws on her doctoral work on DevOps teams responsible for critical digital services. Her work brings a suite of insights into how organizations can better function in the post-incident fallout.

Laura discusses her history working in high-risk, high-consequence environments, notably extreme mountain sports! She translates the potentially life-threatening risks of alpine and mountain sports into studying the societal risks of DevOps and incidents, and brings those perspectives to her work with software engineers. She offers up some wisdom on how organizations can better handle incidents. She also discusses Jeli.io’s post-incident guide and the company’s efforts to guide how post-incident investigations can, and should, be carried out.

Episode Show Notes & Transcript

About Laura

Laura leads the research program at Jeli.io. She has a Master’s degree in Human Factors & Systems Safety and a PhD in Cognitive Systems Engineering. Her doctoral work focused on distributed incident response practices in DevOps teams responsible for critical digital services. She was a researcher with the SNAFU Catchers Consortium from 2017-2020 and her research interests lie in resilience engineering, coordination design and enabling adaptive capacity across distributed work teams. As a backcountry skier and alpine climber, she also studies cognition & resilient performance in high risk, high consequence mountain environments.

Links:

Howie: The Post-Incident Guide: https://www.jeli.io/howie-the-post-incident-guide/
Jeli: https://www.jeli.io
Twitter: https://twitter.com/lauramdmaguire

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: Today’s episode is brought to you in part by our friends at MinIO, the high-performance Kubernetes native object store that’s built for the multi-cloud, creating a consistent data storage layer for your public cloud instances, your private cloud instances, and even your edge instances, depending upon what the heck you’re defining those as, which depends probably on where you work. Getting that unified is one of the greatest challenges facing developers and architects today. It requires S3 compatibility, enterprise-grade security and resiliency, the speed to run any workload, and the footprint to run anywhere, and that’s exactly what MinIO offers. With superb read speeds in excess of 360 gigs and a 100 megabyte binary that doesn’t eat all the data you’ve gotten on the system, it’s exactly what you’ve been looking for. Check it out today at min.io/download, and see for yourself. That’s min.io/download, and be sure to tell them that I sent you.

Corey: This episode is sponsored in part by our friends at Sysdig. Sysdig is the solution for securing DevOps. They have a blog post that went up recently about how an insecure AWS Lambda function could be used as a pivot point to get access into your environment. They’ve also gone deep in-depth with a bunch of other approaches to how DevOps and security are inextricably linked. To learn more, visit sysdig.com and tell them I sent you. That’s S-Y-S-D-I-G dot com. My thanks to them for their continued support of this ridiculous nonsense.

Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. One of the things that’s always been a treasure and a joy in working in production environments is things breaking. What do you do after the fact? How do you respond to that incident?

Now, very often in my experience, you dive directly into the next incident because no one has time to actually fix the problems but just spend their entire careers firefighting. It turns out that there are apparently alternate ways. My guest today is Laura Maguire who leads the research program at Jeli, and her doctoral work focused on distributed incident response in DevOps teams responsible for critical digital services. Laura, thank you for joining me.

Laura: Happy to be here, Corey, thanks for having me.

Corey: I’m still just trying to wrap my head around the idea of there being a critical digital service, as someone whose primary output is, let’s be honest, shitposting. But that’s right, people do use the internet for things that are a bit more serious than making jokes that are at least funny only to me. So, what got you down this path? How did you get to be the person that you are in the industry and standing in the position you hold?

Laura: Yeah, I have had a long circuitous route to get to where I am today, but one of the common threads is about safety and risk and how do people manage safety and risk? I started off in natural resource industries, in mountain safety, trying to understand how do we stop things from crashing, from breaking, from exploding, from catching fire, and how do we help support the people in those environments? And when I went back to do my PhD, I was tossed into the world of software engineers. And at first I thought, now, what do firefighters, pilots, you know, emergency room physicians have to do with software engineers and risk in software engineering? And it turns out, there’s actually a lot, there’s a lot in common between the types of people who handle real-time failures that have widespread consequences and the folks who run continuous deployment environments.

And so one of the things that the pandemic did for us is it made it immediately apparent that digital service delivery is a critical function in society. Initially, we’d been thinking about these kinds of things as being financial markets, as being availability of electronic health records, communication systems for disaster recovery, and now we’re seeing things like communication and collaboration systems for schools, for businesses, this helps keep society functioning.

Corey: What makes part of this field so interesting is that the evolution in the space where, back when I first started my career about a decade-and-a-half ago, there was a very real concern in my first Linux admin gig when I accidentally deleted some of the data from the data warehouse that, “Oh, I don’t have a job anymore.” And I remember being surprised and grateful that I still did because, “Oh, you just learned something. You going to do it again?” “No. Well, not like that exactly, but probably some other way, yeah.”

And we have evolved so far beyond that now, to the point where when that doesn’t happen after an incident, it becomes almost noteworthy in its own right and it blows up on social media. So, the Overton window of what is acceptable disaster response and incident management, and how we learn from those things has dramatically shifted even in the relatively brief window of 15 years. And we’re starting to see now almost a next-generation approach to this. One thing that you were, I believe the principal author behind is Howie: The Post-Incident Guide, which is a thing that you have up on jeli.io—that’s J-E-L-I dot I-O—talking about how to run post-incident investigations. What made you decide to write something like this?

Laura: Yeah, so what you described at the beginning there about this kind of shift from blameless—blameful-type approaches to incident response to thinking more broadly about the system of work, thinking about what does it mean to operate in continuous deployment environments is really fundamental. Because working in these kinds of worlds, we don’t have an established knowledge base about how these systems work, about how they break because they’re continuously changing, the knowledge, the expertise required to manage them is continuously changing. And so that shift towards a blameless or blame-aware post-incident review is really important because it creates this environment where we can actually share knowledge, share expertise, and distribute more of our understandings of how these systems work and how they break. So that, kind of, led us to create the Howie Guide—the how we got here post-incident guide. And it was largely because companies were kind of coming from this position of, we find the person who did the thing that broke the system and then we can all rest easy and move forward. And so it was really a way to provide some foundation, introduce some ideas from the resilience engineering literature, which has been around for, you know, the last 30 or 40 years—

Corey: It’s kind of amazing, on some level, how tech as an industry has always tried to reinvent things from first principles. I mean, we figured out long before we started caring about computers in the way we do that when there was an incident, the right response to get the learnings from it for things like airline crashes—always a perennial favorite topic in this space for conference talks—is to make sure that everyone can report what happened in a safe way that’s non-accusatory, but even in the early-2010s, I was still working in environments where the last person to break production or break the bill had the shame trophy hanging out on their desk, and it would stay there until the next person broke it. And it was just a weird, perverse incentive where it’s, “Oh if I broke something, I should hide it.”

That is absolutely the most dangerous approach because when things are broken, yes, it’s generally a bad thing, so you may as well find the silver lining in it from my point of view and figure out, okay, what have we learned about our systems as a result of the way that these things break? And sometimes the things that we learn are, in fact, not that deep, or there’s not a whole lot of learnings about it, such as when the entire county loses power, computers don’t work so well. Oh, okay. Great, we have learned that. More often, though, there seem to be deeper learnings.

And I guess what I’m trying to understand is, I have a relatively naive approach on what the idea of incident response should look like, but it’s basically based on the last time I touched things that were production-looking, which was six or seven years ago. What is the current state of the art that the advanced leaders in the space as they start to really look at how to dive into this? Because I’m reasonably certain it’s not still the, “Oh, you know, you can learn things when your computers break.” What is pushing the envelope these days?

Laura: Yeah, so it’s kind of interesting. You brought up incident response because incident response and incident analysis are the, sort of like, what do we learn from those things are very tightly coupled. What we can see when we look at someone responding in real-time to a failure is, it’s difficult to detect all of the signals; they don’t pop up and wave a little flag and say, like, “I am what’s broken.” There’s multiple compounding and interacting factors. So, there’s difficulty in the detection phase; diagnosis is always challenging because of how the systems are interrelated, and then the repair is never straightforward.

But when we stop and look at these kinds of things after the fact, a really common theme emerges, and that is that it’s not necessarily about a specific technical skill set or understanding about the system, it’s about the shared, distributed understanding of that. And so to put that in plain speak, it’s what do you know that’s important to the problem? What do I know that’s important to the problem? And then how do we collectively work together to extract that specific knowledge and expertise, and put that into practice when we’re under time pressure, when there’s a lot of uncertainty, when we’ve got the VP DMing us and being like, “When’s the system going to be back up?” and Twitter’s exploding with unhappy customers?

So, when we think about the cutting edge of what’s really interesting and relevant, I think organizations are starting to understand that it’s how do we coordinate and we collaborate effectively? And so using incident analysis as a way to recognize not only the technical aspects of what went wrong but the social aspects of that as well. And the teamwork aspects of that is really driving some innovation in this space.

Corey: It seems to me, on some level, that the increasing sophistication of what environments look like is also potentially driving some of these things. I mean, again, when you have three web servers and one of them’s broken, okay, it’s a problem; we should definitely jump on that and fix it. But now you have thousands of containers running hundreds of microservices for some Godforsaken reason because we decided that the problem of 500 engineers working on the same repository is a political problem, so now we’re going to use microservices for everything because, you know, people. Great. But then it becomes this really difficult-to-identify problem of what is actually broken?

And past a certain point of scale, it’s no longer a question of, “Is it broken?” so much as, “How broken is it at any given point in time?” And getting real-time observability into what’s going on does pose more than a little bit of a challenge.

Laura: Yeah, absolutely. So, the more complexity that you have in the system, the more diversity of knowledge and skill sets that you have. One person is never going to know everything about the system, obviously, and so you need kind of variability in what people know, how current that knowledge is, you need some people who have legacy knowledge, you have some people who have bleeding edge, my fingers were on the keyboard just moments ago, I did the last deploy, that kind of variability in whose knowledge and skill sets you have to be able to bring to bear to the problem in front of you. One of the really interesting aspects, when you step back and you start to look really carefully about how people work in these kinds of incidents, is you have folks that are jumping, get things done, probe a lot of things, they look at a lot of different areas trying to gather information about what’s happening, and then you have people who sit back and they kind of take a bit of a broader view, and they’re trying to understand where are people trying to find information? Where might our systems not be showing us what’s going on?

And so it takes this combination of people working in the problem directly and people working on the problem more broadly to be able to get a better sense of how it’s broken, how widespread is that problem, what are the implications, what might repair actually look like in this specific context?

Corey: Do you suspect that this might be what gives rise, sometimes, to it seems middle management’s perennial quest to build the single pane of glass dashboard of, “Wow, it looks like you’re poking around through 15 disparate systems trying to figure out what’s going on. Why don’t we put that all on one page?” It’s a, “Great, let’s go tilt at that windmill some more.” It feels like it’s very aligned with what you’re saying. And I just, I don’t know where the pattern comes from; I just know I see it all the time, and it drives me up a wall.

Laura: Yeah, I would call that pattern pretty common across many different domains that work in very complex, adaptive environments. And that is—like, it’s an oversimplification. We want the world to be less messy, less unstructured, less ad hoc than it often is when you’re working at the cutting edge of whatever kind of technology or whatever kind of operating environment you’re in. There are things that we can know about the problems that we are going to face, and we can defend against those kinds of failure modes effectively, but to your point, these are very largely unstructured problem spaces when you start to have multiple interacting failures happening concurrently. And so Ashby, who back in 1956 started talking about, sort of, control systems really hammered this point home when he was talking about, if you have a world where there’s a lot of variability—in this case, how things are going to break—you need a lot of variability in how you’re going to cope with those potential types of failures.

And so part of it is, yes, trying to find the right dashboard or the right set of metrics that are going to tell us about the system performance, but part of it is also giving the responders the ability to, in real-time, figure out what kinds of things they’re going to need to address the problem. So, there’s this tension between wanting to structure unstructured problems—put those all in a single pane of glass—and what most folks who work at the frontlines of these kinds of worlds know is, it’s actually my ability to be flexible and to be able to adapt and to be able to search very quickly to gather the information and the people that I need, that are what’s really going to help me to address those hard problems.

Corey: Something I’ve noticed for my entire career, and I don’t know if it’s just unfounded arrogance, and I’m very much on the wrong side of the Dunning-Kruger curve here, but it always struck me that the corporate response to any form of outage is generally trending toward oh, we need a process around this, where it seems like the entire idea is that every time a thing happens, there should be a documented process and a runbook on how to perform every given task, with the ultimate milestone on the hill that everyone’s striving for being, ah, with enough process and enough runbooks, we can then eventually get rid of all the people who know how all this stuff works, and basically staff it up with people who know how to follow a script and push the button when told to by the instruction manual. And that’s always rankled, as someone who got into this space because I enjoy creative thinking, I enjoy looking at the relationships between things. Cost and architecture are the same thing; that’s how I got into this. It’s not due to an undying love of spreadsheets on my part. That’s my business partner’s problem.

But it’s this idea of being able to play with the puzzle, and the more you document things with process, the more you become reliant on those things. On some level, it feels like it ossifies things to the point where change is no longer easily attainable. Is that actually what happens, or am I just wildly overstating the case? Either is possible. Or a third option, too. You’re the expert; I’m just here asking ridiculous questions.

Laura: Yeah, well, I think it’s a balance between needing some structure, needing some guidelines around expected actions to take place. This is for a number of reasons. One, we talked about earlier about how we need multiple diverse perspectives. So, you’re going to have people from different teams, from different roles in the organization, from different levels of knowledge, participating in an incident response. And so because of that, you need some form of script, some kind of process that creates some predictability, creates some common ground around how is this thing going to go, what kinds of tools do we have at our disposal to be able to either find out what’s going on, fix what’s going on, get the right kinds of authority to be able to take certain kinds of actions.

So, you need some degree of process around that, but I agree with you that too much process and the idea that we can actually apply operational procedures to these kinds of environments is completely counterproductive. And what it ends up doing is it ends up, kind of, saying, “Well, you didn’t follow those rules and that’s why the incident went the way it did,” as opposed to saying, “Oh, these rules actually didn’t apply in ways that really matter, given the problem that was faced, and there was no latitude to be able to adapt in real-time or to be able to improvise, to be creative in how you’re thinking about the problem.” And so you’ve really kind of put the responders into a bit of a box, and not given them productive avenues to, kind of, move forward from. So, having worked in a lot of very highly regulated environments, I recognize there’s value in having prescription, but it’s also about enabling performance and enabling adaptive performance in real-time when you’re working at the speeds and the scales that we are in this kind of world.

Corey: This episode is sponsored by our friends at Oracle. HeatWave is a new high-performance query accelerator for the Oracle MySQL Database Service, although I insist on calling it “my squirrel.” While MySQL has long been the world’s most popular open source database, shifting from transacting to analytics required way too much overhead and, ya know, work. With HeatWave you can run your OLAP and OLTP—don’t ask me to pronounce those acronyms again—workloads directly from your MySQL database and eliminate the time-consuming data movement and integration work, while also performing 1100X faster than Amazon Aurora and 2.5X faster than Amazon Redshift, at a third of the cost. My thanks again to Oracle Cloud for sponsoring this ridiculous nonsense.

Corey: Yeah, and let’s be fair, here; I am setting up something of a false dichotomy. I’m not suggesting that the answer is oh, you either are mired in process, or it is the complete Wild West. If you start a new role and, “Great. How do I get started? What’s the onboarding process?” Like, “Step one, write those docs for us.”

Or how many times have we seen the pattern where day-one onboarding is, “Well, here’s the GitHub repo, and there’s some docs there. And update it as you go because this stuff is constantly in motion.” That’s a terrible first-time experience for a lot of folks, so there has to be something that starts people off in the right direction, a sort of a quick guide to this is what’s going on in the environment, and here are some directions for exploration. But also, you aren’t going to be able to get that to a level of granularity where it’s going to be anything other than woefully out of date in most environments without resorting to draconian measures. I feel like—

Laura: Yeah.

Corey: —the answer is somewhere in the middle, and where that lives depends upon whether you’re running Twitter for Pets or a nuclear reactor control system.

Laura: Yeah. And it brings us to a really important point of organizational life, which is that we are always operating under constraints. We are always managing trade-offs in this space. It’s very acute when you’re in an incident and you’re like, “Do I bring the system back up but I still don’t know what’s wrong or do I leave it down a little bit longer and I can collect more information about the nature of the problem that I’m facing?”

But more chronic is the fact that organizations are always facing this need to build the next thing, not focus on what just happened. You talked about the next incident starting and jumping in before we can actually really digest what just happened with the last incident; these kinds of pressures and constraints are a very normal part of organizational life, and we are balancing those trade-offs between time spent on one thing versus another as being innovating, learning, creating change within our environment. The reason why it’s important to surface that is that it helps change the conversation when we’re doing any kind of post-incident learning session.

It’s like, oh, it allows us to surface things that we typically can’t say in a meeting. “Well, I wasn’t able to do that because I know that team has a code freeze going on right now.” Or, “We don’t have the right type of, like, service agreement to get our vendor on the phone, so we had to sit and wait for the ticket to get dealt with.” Those kinds of things are very real limiters to how people can act during incidents, and yet, don’t typically get brought up because they’re just kind of chronic, everyday things that people deal with.

Corey: As you look across the industry, what do you think that organizations are getting, I guess, it’s the most wrong when it comes to these things today? Because most people are no longer in the era of, “All right. Who’s the last person to touch it? Well, they’re fired.” But I also don’t think that they’re necessarily living the envisioned reality that you described in the Howie Guide, as well as the areas of research you’re exploring. What’s the most common failure mode?

Laura: Hmm. I got to tweak that a little bit to make it less about the failure mode and more about the challenges that I see organizations facing because there are many failure modes, but some common issues that we see companies facing is they’re like, “Okay, we buy into this idea that we should start looking at the system, that we should start looking beyond the technical thing that broke and more broadly at how did different aspects of our system interact.” And I mean, both people as a part of the system, I mean processes part of the system, as well as the software itself. And so that’s a big part of why we wrote the Howie Guide, is because companies are struggling with that gap between, “Okay, we’re not entirely sure what this means to our organization, but we’re willing to take steps to get there.” But there’s a big gap between recognizing that and jumping into the academic literature that’s been around for many, many years from other kinds of high-risk, high-consequence type domains.

So, I think some of the challenges they face is actually operationalizing some of these ideas, particularly when they already have processes and practices in place. There’s ideas that are very common throughout an organization that take a long time to shift people’s thinking around, the implicit biases or orientations towards a problem that we as individuals have, all of those kinds of things take time. You mentioned the Overton window, and that’s a great example of it is intolerable in some organizations to have a discussion about what do people know and not know about different aspects of the system because there’s an assumption that if you’re the engineer responsible for that, you should know everything. So, those challenges, I think, are quite limiting to helping organizations move forward. Unfortunately, we see not a lot of time being put into really understanding how an incident was handled, and so typically, reviews get done on the side of the desk, they get done with a minimal amount of effort, and then the learnings that come out of them are quite shallow.

Corey: Is there a maturity model, where it makes sense to begin investing in this, whereas if you do it too quickly, you’re not really going to be able to ship your MVP and see what happens; if you go too late, you have a globe-spanning service that winds up being down all the time so no one trusts it. What is the sweet spot for really starting to care about incident response? In other words, how do people know that it’s time to start taking this stuff more seriously?

Laura: Ah. Well… you have kids?

Corey: Oh, yes. One and four. Oh yeah.

Laura: Right—

Corey: Demons. Little demons whom I love very much.

Laura: [laugh]. They look angelic, Corey. I don’t know what you’re talking about. Would you not teach them how to learn or not teach them about the world until they started school?

Corey: No, but it would also be considered child abuse at this age to teach them about the AWS bill. So, there is a spectrum as far as what is appropriate learnings at what stage.

Laura: Yeah, absolutely. So, that’s a really good point is that depending on where you are at in your operation, you might not have the resources to be able to launch full-scale investigations. You may not have the complexity within your system, within your teams, and you don’t have the legacy to, sort of, draw through, to pull through, that requires large-scale investigations with multiple investigators. That’s really why we were trying to make the Howie Guide very applicable to a broad range of organizations is, here are the tools, here are the techniques that we know can help you understand more about the environment that you’re operating in, the people that you’re working with, so that you can level up over time, you can draw more and more techniques and resources to be able to go deeper on those kinds of things over time. It might be appropriate at an early stage to say, hey, let’s do these really informally, let’s pull the team together, talk about how things got set up, why choices were made to use the kinds of components that we use, and talk a little bit more about why someone made a decision they did.

That might be low-risk when you’re small because y’all know each other, largely you know the decisions, those conversations can be more frank. As you get larger, as more people you don’t know are on those types of calls, you might need to handle them differently so that people have psychological safety, to be able to share what they knew and what they didn’t know at the time. It can be a graduated process over time, but we’ve also seen very small, early-stage companies really treat this seriously right from the get-go. At Jeli, I mean, one of our core fundamentals is learning, right, and so we do, we spend time on sharing with each other, “Oh, my mental model about this was X. Is that the same as what you have?” “No.” And then we can kind of parse what’s going on between those kinds of things. So, I think it really is an orientation towards learning that is appropriate at any size or scale.

Corey: I really want to thank you for taking the time to speak with me today. If people want to learn more about what you’re up to, how you view these things and possibly improve their own position on these areas, where can they find you?

Laura: So, we have a lot of content on jeli.io. I am also on Twitter at—

Corey: Oh, that’s always a mistake.

Laura: [laugh]. @lauramdmaguire. And I love to talk about this stuff. I love to hear how people are interpreting, kind of, some of the ideas that are in the resilience engineering space. Should I say, “Tweet at me,” or is that dangerous, Corey?

Corey: It depends. I find that the listeners to this show are all far more attractive than the average, and good people, through and through. At least that’s what I tell the sponsors. So yeah, it should be just fine. And we will of course include links to those in the show notes.

Laura: Sounds good.

Corey: Thank you so much for your time. I really appreciate it.

Laura: Thank you. It’s been a pleasure.

Corey: Laura Maguire, researcher at Jeli. I’m Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please give a five-star review on your podcast platform of choice along with an angry, insulting comment that I will read just as soon as I get them all to display on my single-pane-of-glass dashboard.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.
