Screaming in the Cloud
Non-Incidentally Keeping Tabs on the Internet with Courtney Nash
Episode Summary
What does an Internet Incident Librarian do? Courtney Nash is here to tell us. It turns out when websites don’t fly, it isn’t widely advertised. Courtney is here to keep track of those incidents for the edification of the rest of us. These incidents impact us all in so many ways, and that is something Courtney wants to share with the rest of us! From the overwhelming volume of dependency on AWS, to the parallels with the airline industry, to the growing importance of stuff simply having to work in our day-to-day lives, Courtney’s conversation shows us a lot. Not only are these dependencies becoming more prevalent every day, but so is the need to build systems that can cope with these outages, be they at the hands of backhoes in the woods or beavers. She discusses VOID, what it is and how it works, and more!
Episode Show Notes and Transcript
About Courtney
Courtney Nash is a researcher focused on system safety and failures in complex sociotechnical systems. An erstwhile cognitive neuroscientist, she has always been fascinated by how people learn, and the ways memory influences how they solve problems. Over the past two decades, she’s held a variety of editorial, program management, research, and management roles at Holloway, Fastly, O’Reilly Media, Microsoft, and Amazon. She lives in the mountains where she skis, rides bikes, and herds dogs and kids.



Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.


Corey: This episode is sponsored in part by our friends at Jellyfish. So, you’re sitting in your office chair, bleary-eyed, parked in front of a PowerPoint and, oh my sweet feathery Jesus, it’s the night before the board meeting, because of course it is! As you slot that crappy screenshot of traffic-light-colored Excel tables into your deck, or sift through endless spreadsheets looking for just the right data set, have you ever wondered why it is that sales and marketing get all these shiny, awesome analytics and insights tools, whereas engineering basically gets left with the dregs? Well, the founders of Jellyfish certainly did. That’s why they created the Jellyfish Engineering Management Platform, but don’t you dare call it JEMP! Designed to make it simple to analyze your engineering organization, Jellyfish ingests signals from your tech stack, including JIRA, Git, and collaborative tools. Yes, depressing to think of those things as your tech stack, but this is 2021. They use that to create a model that accurately reflects just how the breakdown of engineering work aligns with your wider business objectives. In other words, it translates from code into spreadsheet. When you have to explain what you’re doing from an engineering perspective to people whose primary IDE is Microsoft PowerPoint, consider Jellyfish. That’s Jellyfish.co, and tell them Corey sent you! Watch for the wince; that’s my favorite part.


Corey: This episode is sponsored in part by our friends at VMware. Let’s be honest: the past year has been far from easy, due to, well, everything. It caused us to rush cloud migrations and digital transformation, which of course means long hours refactoring your apps, surprises on your cloud bill, misconfigurations, and headaches for everyone trying to manage disparate and fractured cloud environments. VMware has an answer for this. With VMware multi-cloud solutions, organizations have the choice, speed, and control to migrate and optimize applications seamlessly without recoding, take the fastest path to modern infrastructure, and operate consistently across the data center, the edge, and any cloud. I urge you to take a look at vmware.com/go/multicloud. You know my opinions on multi-cloud by now, but there’s a lot of stuff in here that works on any cloud. But don’t take it from me. That’s VMware.com/go/multicloud, and my thanks to them again for sponsoring my ridiculous nonsense.


Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. Periodically, websites like to fall into the sea and explode. And it’s sort of a thing that we’ve accepted happens. Well, most of us have. My guest today is Courtney Nash, Internet Incident Librarian at Verica. Courtney, thank you for joining me.


Courtney: Hi, Corey. Thanks so much for having me.


Corey: So, I’m going to assume that my intro is somewhat accurate, that we’ve sort of accepted that sites will crash into the sea, the internet will break, and then everyone tears their hair out and complains on Twitter, assuming that’s not the thing that fell over this time—


Courtney: [laugh].


Corey: —but what does an Internet Incident Librarian do?


Courtney: Yeah, I’ll come back to the first part about how—some people have accepted it and some people haven’t, I think is the interesting part. So technically, I think my official real title is, like, research analyst or something really boring, but I have a background in the cognitive sciences and also in technology, and I’m really—have always been fascinated by how these socio-technical systems work. And so as an Internet Incident Librarian, I am doing a number of things to try to better understand—both for myself and, obviously, the company I work for, but for the industry as a whole—what do we really know about how incidents happen, why they happen, when they happen, and what do we do when they happen? And how do we learn from that? So, one of the first things that I’m doing along those lines is actually collecting a database of all of the public write-ups of incidents that happened at companies that are software-related.


So, there’s already bodies of work of people who collect airline incidents and other kinds of things. And we don’t have that [laugh] as an industry, which I think is—I want to solve that problem because I think other industries that have spent some time introspecting about why things fall down, or when things fall down and how they fall down. Take the airline industry for example; planes don’t really fall out of the sky very often.


Corey: No. When it does, it makes news and everyone’s scared about flying, but at the same time, it’s yeah, do you have any idea how many people die in car crashes in a given hour?


Courtney: Yeah, yeah. And we’ll come back to how the media covers things in a minute because that is definitely something I have opinions about. But, I’m not trying to say I want to create the NTSB of the internet; I don’t think that’s quite the same thing, and I really want something in the spirit of software, and the internet, and open-source that’s more collaborative and it’s very open to all of us. So, the first step is to just get them in one place. There is no single place where you could go and say, “Oh, where are all of the X incident reports? Where are all the ones that Microsoft’s written, and also Amazon, or Google, or, you know, whoever?”


Corey: They have them, but they hide them so thoroughly. It turns out that they don’t really put that in big letters on their corporate blog with links to it. And when you look at one incident report, they don’t say, “Here, look at our previous incident reports.” They really—


Courtney: Yeah.


Corey: —should but no one does.


Courtney: And I think that’s fascinating because there’s a precedent. So, there’s two precedents, and I just gave you basically one side of the two, which is, the airline industry has done this and it’s not like people don’t fly, right? So, a lot of internet companies, a lot of software-based companies, seem to be afraid of what their customers, or what the stock market, or what folks will think. Mind you, these are publicly traded [laugh] airline companies. People aren’t going to stop using Amazon just because you give more of this information out.


And so I think that piece is—I would love to see that stop being the case. Because the flip side of the coin is that this is a rising tide lifts all boats kind of thing, which granted, not all companies agree on, especially really big ones because their boats are already mowing all the little ones out of the ocean. But that’s another story.


Corey: Sure, but also, it’s easy to hide an outage. Our site is down for, you can say, three days. Great, if a customer didn’t try to access the site at all during those three days, was the site really down in the first place?


Courtney: Oh, the tree in the forest of internet outages. Yes, it’s true, although I think that companies are—they know that people go complain on social media, right? I think there’s more and more of that happening now. It’s not like you can hide it as easily as you could have before Twitter or Instagram or—


Corey: Right. Whereas a plane falls out of the sky, generally it’s one of those things that people notice.


Courtney: Yeah. Even if you weren’t interested in that flight at all.


Corey: Right. When it lands in your garden, you sort of have a comment on this.


Courtney: [laugh]. Yeah. Pieces fall out of the sky. That has happened. But I think the other flip side of the coin I already mentioned is the safety of the airline industry has increased so significantly over the past, you know, whatever, 30, 40 years because of this concerted effort.


And the other piece of it, then, as an industry, as technologists, as people who use software to run their businesses, some of those things are now safety-critical. And this comes back to the whole software is running the world now. Planes now actually could fall out of the sky because of software, not just because of hardware failures. And nuclear power plants are [laugh] run by software, and your electric grid, and your health care systems, heart rate monitors, insulin pumps. There are a lot of really critical things, and now our phone services and our internet stuff are so entwined in our lives, that people can’t be on their Zoom calls, people can’t run their businesses. So, this stuff has a massive impact on people’s lives. It’s no longer just pictures of cats on the internet, which admittedly, we’ve really honed the machine for that.


Corey: No, but now when software goes down, the biggest arguments people make, the stories people tell is, “Oh, well, it meant that the company lost this much money during that timeframe.” And great, maybe. We can argue about is that really true or is it not? It depends entirely on the company’s business model, but I tend not to accept those things at face value. But yeah, that’s the small-scale thing, especially when you start getting to these massive platform providers. There are a lot of second and third-order effects that are a lot more interesting slash important to people’s lives than, well, we couldn’t show ads to people for an hour and a half.


Courtney: Right. Yes. Absolutely. So, T-Mobile had this outage, what is it, how is time—time is still not working very well, for me. I’m trying to remember if it was earlier this year, or if it was in—it was last year. I think it was 2020. And you’re like, T-Mobile, oh okay, whatever. You know, like, cell phones, yadda, yadda. 911 stopped working. [laugh].


And it was a fascinating outage because these are now actually regulated industries that are heavily software-backed. There was a government investigation into that the same way we have NTSB investigations into airline accidents, and they looked at all of those, kind of, second or third-order effects of people who—you know, a grandma who was stranded on the road, people who couldn’t call 911, those kinds of things that are really significant impacts on people’s lives. And the second-order effect is, oh, yeah, AWS goes down—like you said—and Amazon or people like to say, Jeff Bezos—I guess, now, are they going to complain about how much money Andy loses? I guess so—but [laugh] what lives on AWS, that’s crazy to think about, right?


Corey: Yeah, the more I learn the answer to that question, the more disturbed I become.


Courtney: Well, you’d probably know a better answer to that question [laugh] than a lot of people.


Corey: They have the big companies they can talk about. What’s really interesting is the companies that they don’t and can’t. An easy example: financial services is an industry that is notorious for never granting logo rights. Like, at some point, they’ll begrudgingly admit, “Yes, our multinational bank does use computers.” But it’s always like pulling teeth, and I get it on some level; the entire philosophy of a lot of these companies is risk-mitigation, rather than growth and advancing the current awareness of knowledge. But it does become a problem.


Courtney: Yeah. It’s interesting, I need more data, which we’ll get to—help me, people—but I am able to start seeing some of those interesting graphs of, kind of these cascading effects of these kinds of outages. And so I strongly believe that we need to talk about them more, that more companies need to write them up, and publish them, and be a lot more transparent about it. And I think there’s a number of companies that are showing the way there that—and it has to do with your first question which is, we’ve all sort of accepted this, right? But I disagree with that.


I think those of us who are super close to these kinds of complex, dynamic distributed systems totally know that they’re going to fail, and that’s not shocking, nor a case of incompetence. We are building systems that are so big and so complex, no one person, no 10X engineer out there could possibly model or hold the whole thing in their head. Especially because it’s not even just your systems… we were just talking about, right? Your stuff’s on GitHub; it’s on AWS; there’s, like, three other upstream providers; there’s this API from over there. These systems are too intricate, too complex; they’re going to fail.


Corey: So, we’re back to why all these things failed simultaneously and it turns out it’s a Northern woods, middle of nowhere backhoe incident. That’s right, if we look at the natural food chain of things, fiber optic cable has a natural predator in the form of a backhoe. To the point where if I’m ever lost in the woods, I will drop a length of fiber, kick some dirt over it, wait a few minutes; a backhoe will be along to sever it. Then I can follow the backhoe back to civilization. They don’t teach that one in the Boy Scout manual, but they really should.


Courtney: Yeah. Oh, my gosh. There was a beaver outage in Canada, which is the—[laugh] God, that’s the most Canadian thing ever.


Corey: Can you come up with a more Canadian—


Courtney: No.


Corey: —story than that? I would posit you could not, but give it a shot.


Courtney: No, probably not. Anyhoo. So, I think, like I was saying, those of us close to it accept that, understand it, and are trying to now think about, okay, well, how do we change our approach and our philosophy about this, knowing that things will fall down? But I think if you look at a lot of the rest of the world, people are still like, “What are those idiots doing over there? Why did their site fall down?”


Corey: Oh, my God—


Courtney: Right?


Corey: —the general population is the worst on stuff like this. The absolute worst.


Courtney: The media is the worst. [laugh].


Corey: It’s, “How did they wind up going down?” “Yeah, because this stuff is complicated.” Back when I was getting started in tech, I thought the whole thing worked on magic, so I started figuring out how different pieces of it worked. And now I’m convinced; it runs on magic. The most amazing thing is this all works together. Because—


Courtney: Yeah.


Corey: —spit and duct tape and baling wire holding this stuff together would be an upgrade from a lot of the stuff that currently exists in the real world. And it’s amazing.


Courtney: I know the secret, Corey. You know what holds it all together?


Corey: Hit me with it. Hope? Tears?


Courtney: People.


Corey: Mmm.


Courtney: Technology is Soylent Green, Corey. It’s Soylent Green. It’s made of people.


Corey: And that’s the thing that always bugs me on Twitter. The whole HugOps movement has it right. When you see a big provider taking an outage, all their competitors are immediately there with, “Man, hope things get back together soon. Best of luck. Let us know if we can help.” And that’s super reassuring because today is their outage; tomorrow it’s yours.


Courtney: Yep.


Corey: And once in a blue moon, you see someone who’s relatively new to the industry trying to market their stuff based on someone else’s outage, and they basically get their butts fed to them, just because it’s this—it’s not what you do, and it’s not how we operate. And it’s one of the few moments where I look at this and realize that maybe people’s inherent nature isn’t all terrible.


Courtney: [laugh]. Oh. Oh, I would hope that would be something that comes out of all of this.


Corey: Yeah.


Courtney: No one goes to work at their day job doing what we do, to suck. [laugh]. Right? To do a bad job.


Corey: Right. Unless you’re in Facebook’s ethics department, I completely agree with you.


Courtney: Okay. Yes. All right. There are a few caveats to that, probably. But you know, we all want to show up and do good stuff. So, nobody’s going in trying to take the site down, barring bad actor stuff that’s not relevant.


Corey: When Azure takes an outage, AWS is not sitting there going, “Ah, we’re going to win more cloud deals because of this,” because they’re smarter than that. It’s, no, people are going to look at this and say, “Ah, see. Told you the cloud was dangerous.” It sets the entire industry back.


Courtney: Yeah. That’s why we need to talk about it more, and we need to just normalize that these things happen and that we can all level up as an industry if we get a lot smarter about how we, A) think about that, and B) how we react to them. And we will develop much more useful models of our safety boundaries, right? That’s really it. You don’t know—no one at any of these companies hardly knows if you’re five steps from the cliff, five feet, driving a Ferrari 90 miles an hour towards the edge of it.


Like, we don’t know, it’s amazing to me just how much in the dark we are as an industry and how much of the world we’re running. So, I think this is one tiny, first little step in what could be sort of a sea change about how all of this works. So, that’s a big part of why I’m doing what I’m doing.


Corey: Well, let’s talk about something else you’re doing. So, tell me a little bit about VOID?


Courtney: Yeah. So, that’s the first iteration of this. So, it’s the Verica Open Incident Database. I feel like I have to say this almost every time: John Allspaw would like me to say that it’s the Verica Open Incident Report Database, but VOID is way cooler than—


Corey: VOIRD?


Courtney: VOIRD.


Corey: Yeah, that sounds like you’re trying to make fun of someone ineffectively.


Courtney: Yeah. And there’s a reason why he’s not in marketing. But what this is is a collection of all of the publicly available incident reports in one place, easily searchable. You can search by company, you can search by technology, you can filter things by the types of, sort of, kinds of failure modes that we’re seeing. And it’s, I hope, valuable to a wide swath of folks, both technologists and otherwise: researchers, media and press types, analysts, and whatnot.


And my biggest desire is that people will look at it, realize how incomplete it is, and then help me fill it. [laugh]. Help me fill the VOID, people. I think I have right now, at the time we’re talking, about 1700, maybe 1800 of these. And they run the gamut. And I know some people who like to quibble about language—and I am one of those people having been an editor in various flavors of my life—not all of these are what a lot of people directly related to these, sort of, incident management and whatnot would call ‘incident reports.’


I wanted to collect a corpus that reflects all of the public information about software-related incidents. So, it’s anything from tweets—either from a company or just from people—to a status page, to a media article, a news article, an online article, to a full-blown deep-dive retrospective or post-mortem from a company that really does go into detail. It’s the whole gamut. It’s all of those things. I have no opinionated take on that.


I want that all to be available to people. And we’ve collected some metadata on all of the incidents as well. So, we’re collecting the obvious things like when did it happen? What date was it, if we can figure it out, or if it’s explicit—how long was it? And those kinds of things and then we collect some metadata, like I said. We add some tags: was this a complete production outage, was it a partial outage? Those kinds of things.


And this is all directly just taken from the language of the report. And we’re not trying—like I said—we’re trying not to have any sort of really subjective takes on any of that, but a bit of metadata that helps people spelunk some of this stuff. So, if it is the kind of report—these are usually from a status page, or a company post about it—what kinds of things were involved in this outage? So, sometimes you’ll get lucky and the company will tell you, “It was DNS,” because, you know, it’s always DNS.


Corey: On some level, it always is. That’s why—


Courtney: It always is.


Corey: —DNS is my database. It’s a database problem.


Courtney: It’s a database problem. And sometimes you get even more detail. And so we will put as much of that that’s in the report into a set of metadata about these things. So, I think there’s some fascinating, really easy things that I’ve already seen from some of these data, and we kind of hit on one of these, which is the way that companies themselves talk about these outages versus the way that press and media and other types of organizations talk about these things. So, I think there’s a whole bunch of really fascinating analysis that’s going to be available to nerdy research-minded type folks like myself.


I think it’s a place, though, where technologists can also go and spelunk things that they’re interested in, looking for patterns, anything that’s really—there’s an opportunity for experts in the field to add insights to what we can discern from these public incident reports. They are, like, two orders abstracted from what happened internally, but I think there’s still a lot that we can learn from those. So, the first iteration of the VOID will allow people to get a first look at some of the data and to help me, hopefully, add to it, grow that corpus over time, and we’ll see where that goes.
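As a rough illustration of the metadata Courtney describes above, a single entry in a collection like the VOID could be modeled along these lines. This is a minimal sketch and not the actual VOID schema: the class name, the field names, the search helper, and the sample entries are all hypothetical, chosen only to mirror what is mentioned in the conversation (the kind of source, the date and duration when the report gives them, tags such as complete or partial outage, and the technologies the report names).

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional, Set


@dataclass
class IncidentReport:
    """One public write-up of a software-related incident (hypothetical schema)."""
    company: str
    source_kind: str                      # e.g. "tweet", "status page", "news article", "retrospective"
    occurred_on: Optional[date] = None    # set only when the report makes the date clear
    duration: Optional[timedelta] = None  # set only when the report states how long it lasted
    tags: Set[str] = field(default_factory=set)          # e.g. {"partial outage"}
    technologies: Set[str] = field(default_factory=set)  # e.g. {"DNS"}


def search(reports, company=None, technology=None, tag=None):
    """Return the reports that match every filter that was supplied."""
    results = []
    for report in reports:
        if company is not None and report.company.lower() != company.lower():
            continue
        if technology is not None and technology not in report.technologies:
            continue
        if tag is not None and tag not in report.tags:
            continue
        results.append(report)
    return results


# Made-up placeholder entries, not real incidents.
reports = [
    IncidentReport("ExampleCo", "status page", date(2021, 1, 5), timedelta(hours=2),
                   {"complete production outage"}, {"DNS"}),
    IncidentReport("ExampleCo", "tweet", tags={"partial outage"}),
    IncidentReport("OtherCorp", "retrospective", date(2021, 3, 9), timedelta(minutes=45),
                   {"partial outage"}, {"DNS", "YAML"}),
]

# Find every write-up that names DNS as one of the things involved.
dns_reports = search(reports, technology="DNS")
print([report.company for report in dns_reports])
```

Leaving the date and duration optional reflects the point that everything is taken directly from the language of the report: a tweet may say nothing about timing, and the sketch records that absence rather than guessing.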


Corey: This episode is sponsored by our friends at Oracle Cloud. Counting the pennies, but still dreaming of deploying apps instead of "Hello, World" demos? Allow me to introduce you to Oracle's Always Free tier. It provides over 20 free services and infrastructure, networking, databases, observability, management, and security.


And, let me be clear here, it's actually free. There's no surprise billing until you intentionally and proactively upgrade your account. This means you can provision a virtual machine instance or spin up an autonomous database that manages itself, all while gaining the networking, load balancing, and storage resources that somehow never quite make it into most free tiers, needed to support the application that you want to build.


With Always Free, you can do things like run small-scale applications or do proof-of-concept testing without spending a dime. You know that I always like to put asterisks next to the word free. This is actually free. No asterisk. Start now. Visit https://snark.cloud/oci-free. That's https://snark.cloud/oci-free.


Corey: I love the idea of having a centralized place where outages, post-mortems, root cause analyses—I’ll let you tear into that in a minute—and other things that are all tied to where can I find a list of outages. Because companies list these on their websites, they put them in blog posts, and it’s always very begrudging; they don’t link them from any other place, you have to know the magic incantation to find the buried link on their site. Having something that is easily searchable for outages is really something that’s kind of valuable.


Courtney: Yeah. And I mean, some of them are like—I’m looking at you, Microsoft—I like you for a lot of reasons, but hey, I have to scroll your status page. I can’t link directly to their write-ups, and—this is Azure—and it [laugh] please stop. Make it easier. [laugh]. You’re driving me crazy; I don’t even have a data model to figure out how to make this work for people, other than, like, taking screenshots of them.


So yeah, so there’s shades of grey and black in how much they’ll share, or how easy it is to find these things. So, it’ll be interesting to see if there’s any less-than-positive [laugh] reactions to all of this being available in one place. I’m anticipating at least a little bit of that.


There is one other type of metadata that we collect for the VOID. And that is the type of analysis that is conducted if it is clear what that type of analysis is. And there, some companies explicitly say, or call it an RCA, “We did a Root Cause Analysis.” There’s a few other types; some people talk about having a Contributing Factors Analysis. Most people don’t consider a formal analysis type, but I am trying to collect and categorize these because I do think there are some fascinating implications buried therein, and I would like to see if I can keep track of whether or not those change over time. And yes, you’ve hit on one of my favorite hot-take soapbox things, which is root cause.


Corey: Please, take it away.


Courtney: Yeah. Well, and anyone who’s close to these systems and has watched these things fall down has the inherent sense that there is no root cause. Like—[laugh]—let’s—great. One of my favorite ones: human error. We don’t have enough hours for this, Corey. I’m sorry. That’s one of my favorite other ones. But let’s say somebody fat-fingers a config change. Which happens—


Corey: That was fundamentally the S3 service disruption back in—


Courtney: Yes.


Corey: —2017 that took down S3 for hours on end.


Courtney: And took down so many other people that relied on S3.


Corey: Everything was tied to that. And that’s an interesting question; when something like that hits, does that mean that everything it takes down get its own entry in VOID?


Courtney: I hope so. If everybody writes them up, then yes. [laugh]. So, if S3 goes down, and you go down, and you write it up, and you put it in the VOID, then we can see those things, which would be so cool. But let’s go back to the fat-fingered config file—which if you haven’t ever done, you’re lying, first of all—


Corey: Or you haven’t been allowed to touch anything large and breakable yet, which, either way, you’re lying on some level. So, please—


Courtney: Yeah. I mean, I took down Holloway’s homepage when it was on Hacker News because of YAML. So, anywho. Even if you fat-finger a config change, that’s not the root cause because you have this system wherein a fat-fingered config change can take down S3. That is a very big, complex, and I might add, socio-technical system.


There are decisions that were made long ago about why it was structured that way, or why this happens that way, or what kinds of checks and balances you have. It’s just, get over it people. There is no root cause. These are complex, highly dynamic systems that when they fail, they fail in unpredictable and weird ways because we’ve built them that way. They’re complex because you’re successful at pushing the envelope and your safety boundaries.


So, if we could get past the root cause thing as an industry, I mean, I could probably just retire happy, honestly. [laugh]. I’m a simple woman; could we just get one thing, people? [laugh]. First of all, then it gives non-technologists, people outside of our bubble, the media, you can’t hang it on these things anymore. We all have to then grapple with the complexity, which admittedly humans, not big fans of, but—


Corey: People want simple stories, simple narratives. When people say, “Oh, remember the S3 outage?” They don’t want to sit there and have to recount 50,000 different details. They want to say, “Oh, yeah. It took down a few big sites like Instagram, United Airlines, and it was a real mess.” The end. They want something that fits in a tweet, not something that fits in a thesis.


Courtney: Well, and if you have a single root cause, then you can fix the root cause and it will never happen again. Right?


Corey: That’s the theory. If we’re just a little bit more careful, we’re never going to have outages anymore.


Courtney: Yeah, if we could just train those humans to not try to make the best possible high-quality decision they could possibly make in that situation given the information they have at the time, then we’ll do better. But I mean, that’s why your systems stay up most of the time, if you think about it. It’s shocking how well these things actually work the vast majority of the time. And that’s what we could learn from this, too. We could, you know—oh if we would write near-misses up, please.


I mean, if I could have one more wish, I think one of the coolest things the airline industry and the government side of that did was start writing up near-misses. It’s, wow, what do we learn from when we’re successful, versus trying to, like, spelunk and nitpick the failures.


Corey: Most of us aren’t so good at the whole introspection part. We need failures, we need painful outages to really force us to make difficult, introspective, soul-searching decisions and learn from them.


Courtney: Yeah. And I don’t disagree with that. I just wish one of the things we would learn is that we should study our successes, too. There’s more to be mined from our successes, if we can figure out how to do that, than there is from our failures. So, I have a metadata category in the VOID called ‘near-miss.’


And oh man, I really wish people would write those up more. I mean, I think there’s, like, five things in there that I’ve found so far. Because the humans hold these systems together. We make these things work the vast majority of the time. That’s why there is no root cause, and even when we’re involved in these things, we’re also involved in preventing them, or solving them, or remediating them. So, yeah, there’s no root cause. Humans aren’t the problem. Those are my big hot button ones.


Corey: I really wish more places would embrace that. Even Amazon uses the ‘root cause’ terminology internally, and I’m not going to sit here and tell them how to run large things at scale; that’s what I pay them to figure out for me. But I can’t shake the feeling that by using that somewhat reductive terminology that they’re glossing over an awful lot of things the rest of us could really benefit from.


Courtney: Well, so the question then—one of the other things that I look at is, personally when I read and analyze these incident reports, these public ones a lot, I always ask myself, “Who’s the audience for this?” And there are different audiences for different types of incident reports and different things. The vast majority of them are for customers, partners, investors.


Corey: The stock market. Yes. Yes.


Courtney: They’re not actually for the organization. There’s usually an internal one that we don’t get to see—maybe—that’s for the organization. But a lot of places feel that if you have a process, and a template, and a checklist, and a list of action items at the end, then you’ve done the right thing. You’ve had your incident, you’ve talked about it, you’ve got your action items. Move on.


Corey: Right, and it always seems with companies, that as you get further into the company, the more honest and transparent the actual analysis is. Like, at some point, you wind up with the, like, they’re very public and very cagey, and under NDA, they open up a little bit more, and a little bit more, and finally, when you work there, their executive team, it turns out, the actual thing was, “Well, Dewey was carrying an armful of boxes in the data center, tripped, went cascading face-first into the EPO cutoff switch that cut power to the entire facility.” The cagier they get, the—I guess, not to be unkind here—but the more ridiculous whatever the actual answer is. It’s one of those things where, “Really? Someone tripped and hit a button. You didn’t have a plan for that?” “Well, not really. We sort of assumed that people would”—


Courtney: Why would you have a plan for that, right?


Corey: Right.


Courtney: I mean like—[laugh].


Corey: Why would you have a plan for that, the first time?


Courtney: Yeah. I mean, so imagine this exercise: sitting down in a room with a bunch of people and going, “What are all the things that could go wrong?” I mean, [laugh] ain’t nobody got time for that? That’s not how it works. You all have other jobs to do, too, and systems to build, and pressures, and customers, and partners, and features to build, so admit and acknowledge that you just won’t know all of the antecedents and how do you respond when things happen?


Which is a whole other, you know—I know you told me you recorded an episode with Dr. Christina Maslach on burnout, which I’m so happy you did, and there’s a whole ‘nother piece of incidents and incident response, and burning people out, and blaming people, and all that stuff that’s a whole ‘nother pod—it sounds like you might—you know, probably not incidents with her. But still, these things take a toll on people. And people who, like I said, show up every day really hoping to do their best job, and go up a ladder, and get a promotion, and whatever. So, I think not just treating those things as checklists has broader implications as well, just for the wellbeing of your organization.


Corey: On some level, the biggest problem that I think we’ve run into is that, as you said, it all comes down to people. Unfortunately, legally, we can’t patch those. Yet.


Courtney: No, [laugh]. No, no. Not most kinds of patches, no. And that’s messy. And I know some people are like, “Everyone should learn to code.” And I’m like, “Actually, everyone should get a liberal arts degree.” Come on, help me out people. Because there’s so much of these socio-technical systems where the socio part of it is more relevant than the actual technical part.


Corey: I believe you’re right, for better or worse; there’s no way around it. Thank you so much for taking the time to speak with me. If people want to learn more about what you’re up to, where can they find you? And we will, of course, throw a link to VOID in the show notes.


Courtney: Yeah, I also like to talk on Twitter, like you do. I’m not as good at it as you are, but I try. So yeah, I’m @courtneynash on Twitter. And at Verica, you can find me at Verica as well, [email protected]. And those are the best ways to find me, I would say. And yeah, please people, write up your incidents, send them to the VOID and let’s all learn and get better together, please.


Corey: Thank you so much for taking the time to speak with me today. I really do appreciate it.


Courtney: Thank you for having me on. I know—do people say this: I’m like, “Yeah, big fan,” but I am. I’m a [laugh] big fan [laugh] of the podcast.


Corey: Oh, dear Lord, find better things to listen to. My God.


Courtney: [laugh]. But it’s been a treat. Thank you.


Corey: Courtney Nash, Internet Incident Librarian at Verica. I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice along with a comment making it very clear that for whatever reason the website is down, it is most certainly not your fault.


Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.


Announcer: This has been a HumblePod production. Stay humble.