The Art of Effective Incident Response with Emily Ruppe

Episode Summary

Emily Ruppe, Solutions Engineer at Jeli.io, joins Corey on Screaming in the Cloud to discuss the best practices she’s discovered for effectively handling incident response. Emily explains how she fell into incident response and why it suits her mindset, as well as the different ways she’s seen organizations handle incident response and what seems to be most effective. Emily describes how she managed to not only survive but thrive through an acquisition, why blameless root cause analysis is well-intentioned but misses finer points for learning, and what she has most enjoyed about working at Jeli.io.

Episode Show Notes & Transcript

About Emily

Emily Ruppe is a Solutions Engineer at Jeli.io whose greatest accomplishment was once being referred to as “the Bob Ross of incident reviews.” Previously, Emily wrote hundreds of status posts, incident timelines, and analyses at SendGrid, and was a founding member of the Incident Command team at Twilio. She’s written on human-centered incident management and facilitating incident reviews. Emily believes the most important thing in both life and incidents is having enough snacks.

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is sponsored by our friends at Logicworks. Getting to the cloud is challenging enough for many places, especially maintaining security, resiliency, cost control, agility, etc, etc, etc. Things break, configurations drift, technology advances, and organizations, frankly, need to evolve. How can you get to the cloud faster and ensure you have the right team in place to maintain success over time? Day 2 matters. Work with a partner who gets it: Logicworks combines cloud expertise and platform automation to customize solutions to meet your unique requirements. Get started by chatting with a cloud specialist today at snark.cloud/logicworks. That’s snark.cloud/logicworks.

Corey: Cloud native just means you’ve got more components or microservices than anyone (even a mythical 10x engineer) can keep track of. With OpsLevel, you can build a catalog in minutes and forget needing that mythical 10x engineer. Now, you’ll have a 10x service catalog to accompany your 10x service count. Visit OpsLevel.com to learn how easy it is to build and manage your service catalog. Connect to your git provider and you’re off to the races with service import, repo ownership, tech docs, and more. 

Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. My guest today is Emily Ruppe, who’s a solutions engineer over at Jeli.io, but her entire career has generally focused around incident management. So, I sort of view her as being my eternal nemesis, just because I like to cause problems by and large and then I make incidents for other people to wind up solving. Emily, thank you for joining me and agreeing to suffer my slings and arrows here.

Emily: Yeah. Hey, I like causing problems too. I am a solutions engineer, but sometimes we like to call ourselves problems engineers. So.

Corey: Yeah, I’m a problems architect is generally how I tend to view it. But doing the work, ah, one wonders. So, you are at Jeli, where, as of this recording, you’ve been for a year now. And before that, you spent some time over at Twilio slash SendGrid—spoiler, it’s kind of the same company, given the way acquisitions tend to work and all. And—

Emily: Now, it is.

Corey: Yeah. Oh, yeah. You were there during the acquisition.

Emily: Mm-hm. Yes, they acquired me and that’s why they bought SendGrid.

Corey: Indeed. It’s a good reason to acquire a company. That one person I want to bring in. Absolutely. So, you started with email and then effectively continued in that general direction, given that Twilio now has eaten that business whole. And that’s where I started my career.

The one thing I’ve learned about email systems is that they love to cause problems because it’s either completely invisible and no one knows, or suddenly an email didn’t go through and everyone’s screaming at you. And there’s no upside, only down. So, let me ask the obvious question I suspect I know the answer to here. What made you decide to get into incident management?

Emily: [laugh]. Well, I joined SendGrid, actually. I love mess. I run towards problems. I’m someone who really enjoys that. With my ADHD, I hyperfocus, and incidents are like that perfect environment of just, like, all of the problems laying themselves out right in front of you; the distraction is the focus. It’s kind of a wonderful place where I really enjoy the flow of that.

But I started in customer support. I’ve been in technical support and customer—I used to work at the Apple Store, I worked at the Genius Bar for a long time, moved into technical support over the phone, and whenever things broke really bad, I really enjoyed that process and kind of getting involved in incidents. And I came in as one of two weekend support people at SendGrid, during a time of change and growth. And everyone knows that growth, especially exponential growth, usually happens very smoothly and nothing breaks during that time. So… no, there were a lot of incidents.

And because I was on the weekend, one of the only people on the weekend, I kind of had to very quickly find my way and learn when do I escalate this. How do I make the determination that this is something that is an incident? And, you know, is this worth paging engineers that are on their weekend? And getting involved in incidents and being kind of a core communication point between our customers and engineers.

Corey: For those who might not have been involved in sufficiently scaled-out environments, that sounds counterintuitive, but one of the things that you learn—very often the hard way—is that as you continue down the path of building a site out and scaling it, it stops being an issue relatively quickly of, “Is the site up or down?” And instead becomes a question of, “How up is it?” So, it doesn’t sound obvious until you’ve lived it, but declaring what is an incident versus what isn’t an incident is incredibly nuanced and it’s not the sort of thing that lends itself to casual solutions. Because every time a customer gets an error, we should open an incident on that? Well, I’ve worked at companies that throw dozens of 500 errors every second at their scale. You will never hire enough people to solve that if you do an incident process on even 10% of them.

Emily: Yeah. So, I mean, it actually became something that when you join Twilio, they have you create a project using Twilio’s API to earn your track jacket, essentially. It’s kind of like an onboarding thing. And as they absorbed SendGrid, we all did that onboarding process. And mine was a number for support people to text and it would ask them six questions and if they answered yes to more than two of them, it would text back, “Okay, maybe you should escalate this.”

And the questions were pretty simple of, “Can emails be sent?” [laugh]. Can customers log into their website? Are you able to view this particular part of the website? Because it is—with email in particular, at SendGrid in particular—the bulk of it is the email API. So, like, the site being up or down was the easiest type of incident, the easiest thing to flex on because that’s so much easier to see.
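For a sense of how small that onboarding project could be, here is a minimal sketch of an escalation-check SMS bot along the lines Emily describes, assuming Twilio’s Python helper library and a Flask webhook. The six-question checklist and the “more than two yeses means escalate” rule come from her description; the question wording, the route, and the reply text beyond “Okay, maybe you should escalate this” are invented for illustration.

```python
# Hypothetical sketch of the escalation-check bot Emily describes:
# a support person texts in yes/no answers to six questions, and more
# than two "yes" answers earns a nudge to escalate. Question wording,
# route, and reply copy are stand-ins, not the real SendGrid checklist.
from flask import Flask, request
from twilio.twiml.messaging_response import MessagingResponse

app = Flask(__name__)

QUESTIONS = [
    "Are emails failing to send?",
    "Are customers unable to log in?",
    "Is a key part of the site unreachable?",
    "Are multiple customers reporting the same thing?",
    "Is the impact getting worse over time?",
    "Are you unsure which team owns the problem?",
]

@app.route("/sms", methods=["POST"])
def sms_reply():
    # Twilio posts the inbound message body as form data named "Body".
    body = request.values.get("Body", "").lower()
    answers = body.split()
    resp = MessagingResponse()

    if len(answers) < len(QUESTIONS):
        # Not enough answers yet: echo the checklist back to the texter.
        prompt = " ".join(f"{i + 1}) {q}" for i, q in enumerate(QUESTIONS))
        resp.message(f"Reply with six yes/no answers, in order: {prompt}")
    else:
        yes_count = sum(1 for a in answers[: len(QUESTIONS)] if a.startswith("y"))
        if yes_count > 2:
            resp.message("Okay, maybe you should escalate this. Page the on-call engineer.")
        else:
            resp.message("This can probably wait. Keep an eye on it and check again in a bit.")
    return str(resp)

if __name__ == "__main__":
    app.run(port=5000)
```

The real version presumably kept per-number state and asked one question at a time; collapsing everything into a single “yes no yes…” reply keeps this sketch stateless and short.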

Being able to determine, like, what percentage or what level, like, how many emails are not processing? Are they getting stuck, or is this, like, the correct amount of things that should be bouncing because of IP reput—there’s, like, a thousand different things. We had kind of this visualization of this mail pipeline that was just a mess of all of these different pipes kind of connected together. And mail could get stuck in a lot of different places, so it was a lot of spending time trying to find that, and I segued into project management. I was in QA for a little while doing QA work.

Became a project manager and learned a lot about imposing process because you’re supposed to, and that sometimes imposing process on teams that are working well can actually destroy them [laugh]. So, I learned a lot of interesting things about process the hard way. And during all of that time that I was doing project management, I kind of accidentally started owning the incident response process because a lot of people left. I had been a part of the incident analysis group as well, and so I kind of became the sole owner of that. And when Twilio purchased SendGrid, I found out they were creating an incident commander team and I just reached out and said, “Here’s all of SendGrid’s incident response stuff. We just created a new Slackbot, I just retrained the entire team on how to talk to each other and recognize when something might be an incident. Please don’t rewrite all of this to be Twilio’s response process.”

And Terry, the person who was putting together that team, said, “Excellent. You’re going to be [laugh] welcome to Twilio Incident Command. This is your problem and it’s a lot worse than you thought because here’s all the rest of it.” So yeah, it was a really interesting experience coming into technically the same company, but an entirely different company, and finding out—like, really trying to learn and understand all of the differences, and you know, the different problems, the different organizational history, the, like, fascia that has been built up between some of these parts of the organization—to understand why things are the way that they are within process. It’s very interesting.

And I kind of get to do it now as my job. I get to learn about the full organizational subtext of [laugh] all of these different companies to understand how incident response works, how incident analysis works, and maybe some of the whys. Like, what are the places where there was a very bad incident, so we put in very specific, very strange process pieces in order to navigate that, or teams that are difficult to work with, so we’ve built up interesting process around them. So yeah.

Corey: It feels like that can almost become ossified if you’re not careful because you wind up with a release process that’s two thousand steps long, and each one of them is there to wind up avoiding a specific type of failure that had happened previously. And this gets into a world where, in so many cases, there needs to be a level of dynamism to how you wind up going about your work. It feels almost like companies have this idealized vision of the future where if they can distill every task that happens within the company down to a series of inputs and responses—scripts almost—you can either wind up replacing your staff with a bunch of folks who just work from a runbook and cost way less money or computers in the ultimate sense of things. But that’s been teased for generations now and I have a very hard time seeing a path where you’re ever going to be able to replace the contextually informed level of human judgment that, honestly, has fixed every incident I’ve ever seen.

Emily: Yeah. The problem comes down to, in my opinion, the fact that humans wrote this code: people with specific context and specific understanding of how the thing needs to work in a specific way, and the shortcomings and limitations they have in the libraries they’re using or the different things they’re trying to integrate. A human being is who’s writing the code. Code is not being written by computers; it’s being written by people who have understanding and subtext. And so, when you have that code written and then maybe that person leaves, or that person joins a different team and their focus and priorities are on something else, there is still human subtext that exists within the services that have been written. We have it call in this specific way and time out in this specific amount of time because when we were writing it, there was this ancient service that we had to integrate with.

Like, there’s always just these little pieces of: we had to do things because we were people trying to make connections with lines of code. We’re trying to connect a bunch of things to do some sort of task, and we have a human understanding of how to get from A to B, and probably if a computer wrote this code, it would work in an entirely different way. So, in order to debug a problem, the humans usually need some sort of context, like, why did we do this the way that we did this? And I think it’s a really interesting thing that we’re finding that it is very hard to replace humans around computers, even though intellectually we think, like, this is all computers. But it’s not. It’s people convincing computers to do things that maybe they shouldn’t necessarily be doing. Sometimes they’re things that computers shouldn’t be doing, maybe, but a lot of the time, it’s kind of a miracle [laugh] that any of these things continue to work on a given basis. And I think that it’s very interesting when we think that we can take people out of it.

Corey: The problem I keep running into though, the more I think about this and the more I see it out there, is I don’t think that it necessarily did incident management any favors when it was originally cast as the idea of blamelessness and blameless postmortems. Just because it seems an awful lot to me like the people who are the most ardent champions of approaching things from a blameless perspective and having a blameless culture are the people who would otherwise have been blamed themselves. So, it really kind of feels on some broader level, like, “Oh, was this entire movement really just about being self-serving so that people don’t themselves get in trouble?” Because if you’re not going to blame no one, you’re going to blame me instead. I think that, on some level, this set up a framing that was not usually helpful for folks with only a limited understanding of what the incident lifecycle looks like.

Emily: Mmm. Yeah, I think we’ve evolved, right? I think, from the blameless, I think there were good intentions there, but I think that we actually missed the really big part of that boat that a lot of folks glossed over because then, as it is now, it’s a little bit harder to sell. When we’re talking about being blameless, we have to talk about circumventing blame in order to get people to talk candidly about their experiences. And really, it’s less about blaming someone and what they’ve done, because we as humans blame—there’s a great Brené Brown talk that she gives, I think it’s a TED talk, about blame and how we as humans cannot physically avoid blaming, placing blame on things.

It’s about understanding where that’s coming from and working through it; that is actually how we grow. And I think that we’re starting to kind of shift into this more blame-aware culture. But I think the hard pill to swallow about blamelessness is that we actually need to talk about the way that this stuff makes us feel as people. Like feelings, like emotions [laugh]. Talking about emotions during a technical incident review is not really an easy thing to get some tech executives to swallow.

Or even engineers. There’s a lot of engineers who are just kind of like, “Why do you care about how I felt about this problem?” But in reality, you can’t measure emotions as easily as you can measure Mean Time to Resolution. But Mean Time to Resolution is impacted really heavily by, like, were we freaking out? Did we feel like we had absolutely no idea what we were trying to solve, or did we understand this problem, and we were confident that we could solve it; we just couldn’t find the specific place where this bug was happening. All of that is really interesting and important context about how we work together and how our processes work for us, but it’s hard because we have to talk about our feelings.

Corey: I think that you’re onto something here because I look back at the key outages that really define my perspective on things over the course of my career, and most of the early ones were beset by a sense of panic of am I going to get fired for this? Because at the time, I was firmly convinced that, well, root cause is me. I am the person that did the thing that blew up production. And while I am certainly not blameless in some of those things, I was never setting out with an intent to wind up tearing things down. So, it was not that I was a bad actor subverting internal controls because, in many companies, you don’t need that level of rigor.

This was a combination of factors that made it easy or possible to wind up tearing things down when I did not mean to. So, there were absolutely systemic issues there. But I still remember that rising tide of panic. Like, should I be focused on getting the site back up or updating my resume? Which of these is going to be the better longer-term outcome? And now that I’ve been in this industry long enough and I’ve seen enough of these, you almost don’t feel the blood pressure rise anymore when something gets panicky. But it takes time and nuance to get there.

Emily: Yeah. Well, and it’s also, in order to best understand how you got in that situation, like, were you willing to tell people that you were absolutely panicked? Would you have felt comfortable, like, if someone was saying like, “Okay, so what happened? How did—walk me through what you were experiencing?” Would you have said like, “I was scared out of my goddamn mind?”

Were you absolutely panicking, or did you feel like you had something, like, grasping at some straws? Like, where were you? Because uncovering that for the person who was experiencing the incident can help us understand what resources they felt like they knew to go to. Or where did they go? Like, what resource did they decide in the middle of this panicked haze to grasp for? Is that something that we should start using as, “Hey, if it’s your first time on call, this is a great thing to pull into,” because that’s where instinctively you went?

Like, there’s so much that we can learn from the people who are experiencing [laugh] this massive amount of panic during the incident. But sometimes we will, if we’re being quote-unquote, “Blameless,” gloss over your entire, like, your involvement in that entirely. Because we don’t want to blame Corey for this thing happening. Instead, we’ll say, “An engineer made a decision and that’s fine. We’ll move past that.” But there’s so much wealth of information there.

Corey: Well, I wound up in postmortems later when I ran teams, and I’d say, “Okay, so an engineer made a mistake.” It’s like, “Well, hang on. There’s always more to it than that”—

Emily: Uh-huh.

Corey: —“Because we don’t hire malicious people and the people we have are competent for their role.” So, it goes a bit beyond that. We will never get into a scenario where people do not make mistakes in a variety of different ways. So, that’s not a helpful framing; it’s a question of, if they made a mistake, sure, what was it that brought them to that place? Because that’s where it gets really interesting. The problem is when you’re trying to figure out in a business context why a customer is super upset—if they’re a major partner, for example—and there’s a sense of, “All right, we’re looking for a sacrificial lamb or someone that we can blame for this because we tend to think in relatively straight lines.”

And in those scenarios, often, a nuanced understanding of the systemic failure modes within your organization that might wind up being useful in the mid to long term is not helpful for the crisis there. So, trying to stuff too much into a given incident response might be a symptom there. I’m thinking of one or two incidents in the course of my later career that really had that stink to them, for lack of a better term. What’s your take on the idea?

Emily: I’ve been in a lot of incidents where the desire to be able to point and say a person made this mistake is high. It’s definitely something where the “organization”—and I put the organization in quotes there; say, technical leadership, or maybe PR or the comms team—said, like, “We’re going to say a person made this mistake,” when in reality, I mean, nine times out of ten, calling it a mistake is hindsight, right? Sometimes we know that we made a mistake, and it’s the recovery from that that is the response. But a lot of times we are making an informed decision, you know? An engineer has the information that they have available to them at the time and they’re making an informed decision, and oh, no [laugh], it does not go as we planned, things in the system that we didn’t fully understand are coexisting, and it’s a perfect storm of these events that leads to impact to this important customer.

For me, I’ve been customer-facing for a very long time, and from my observation, customers tend to—like, if you say, “This person did something wrong,” versus, “We learned more about how the system works together, and we understand how these different pieces and mechanisms within our system are not necessarily single points of failure, but points at which they interact in ways that we didn’t understand could cause impact before, and now we have a better understanding of how our system works and we’re making some changes to some pieces,” I feel like personally, as someone who has had to say that kind of stuff to customers a thousand times, saying, “It was a person who did this thing,” shows so much less understanding of the event and understanding of the system than actually talking through the different components and the different contributing factors that went wrong. So, I feel like there’s a lot of growth that we as an industry can make, going from blaming things on an intern to actually saying, “No, we invested time in understanding how a single person could perform these actions that would lead to this impact, and now we have a deeper understanding of our system.” That, in my opinion, builds a little bit more confidence from the customer side.

Corey: This episode is sponsored in part by Honeycomb. I’m not going to dance around the problem. Your. Engineers. Are. Burned. Out. They’re tired from pagers waking them up at 2 am for something that could have waited until after their morning coffee. They’re fed up with relying on two or three different “monitoring tools” that still require them to manually trudge through logs to decipher what might be wrong. Simply put, there’s a better way. Observability tools like Honeycomb show you the patterns and outliers of how users experience your code in complex and unpredictable environments so you can spend less time firefighting and more time innovating. It’s great for your business, great for your engineers, and, most importantly, great for your customers. Try FREE today at honeycomb.io/screaminginthecloud. That’s honeycomb.io/screaminginthecloud.

Corey: I think so much of this is—I mean, it gets back to your question to me that I sort of dodged: was I willing to talk about my emotional state in these moments? And yeah, I was visibly sweating and very nervous, and I’ve always been relatively okay with calling out the fact that I’m not in a great place at the moment and I’m panicking. And it wasn’t helped in some cases by, in those early days, the CEO of the company standing over my shoulder, coming down from the upstairs building to know what was going on, and everything had broken. And in that case, I was only coming in to do mop-up; I wasn’t one of the factors contributing to this, at least not to a primary or secondary degree, and it still was incredibly stress-inducing. So, from that perspective, it feels odd.

But you also talk about ‘we,’ in the sense of as an industry, as a culture, and the rest. I’m going to push back on that a little bit because there are still companies today in the closing days of 2022 that are extraordinarily far behind where many of us are at the companies we work for. And they’re still stuck in the relative Dark Ages technically, where, “Well, are VMs okay, or should we stay on bare metal?” is still the era that they’re in, let alone cloud, let alone containerization, let alone infrastructure as code, et cetera, et cetera. I’m unconvinced that they have meaningfully progressed on the interpersonal aspects of incident management when they’ve been effectively frozen in amber from a technical basis.

Emily: Mmm, I don’t think that’s fair [laugh].

Corey: No. Excellent. Let’s talk about that.

Emily: [laugh]. I think just because an organization is still, like, maybe in DCs and using hardware and maybe hasn’t advanced so thoroughly within the technical aspect of things, that doesn’t necessarily mean that they haven’t adopted new—

Corey: Ah, very fair. Let me add one point of clarification, then, on this because what I’m talking about here is the fact there are companies who are that far behind on a technical basis, they are not necessarily one and the same, too—

Emily: Correct.

Corey: Because you’re using older technology, that means your processes are stuck in the past, too.

Emily: Right.

Corey: But rather, just as there are companies that are ancient on the technology basis, there are also companies who will be 20 years behind in learnings—

Emily: Yes.

Corey: —compared to how the more progressive folks have already internalized some of these things ages ago. Blamelessness is still in the future for them. They haven’t gotten there yet.

Emily: I mean, yeah, there are still places that are doing root cause analysis, that are doing the five whys. And I think that we’re doing our best [laugh]. I mean, I think it really takes—that’s a cultural change. A lot of the actual change in approach of incident analysis and incident response is a cultural change. And I can speak from firsthand experience that that’s really hard to do; especially from the inside, it’s very hard to do.

So luckily, with the role that I’m in now at Jeli.io, I get to kind of support those folks who are trying to champion a change like that internally. And right now, my perspective is just trying to generate as much material as I can for those folks to send internally, to say, like, “Hey, there’s a better way. Hey, there’s a different approach for this that can maybe get us around these things that are difficult.” I do think that there’s this tendency—and I’ve used this analogy before—for us to think that our junk drawers are better than somebody else’s junk drawers.

I see an organization as just a junk drawer, a drawer full of weird odds and ends and spilled glue and, like, a broken box of tacks. And when you pull out somebody else’s junk drawer, you’re like, “This is a mess. This is an absolute mess. How can anyone live like this?” But when you pull out your own junk drawer, like, I know there are 17 rubber bands in this drawer, somehow. I am going to just completely rifle through this drawer until I find those things that I know are in here.

Just a difference of knowing where our mess is, knowing where the bodies are buried, or where the skeletons are in each closet, whatever analogy works best. But I think that some organizations have this thought process that—by organizations, I mean executive leadership; organizations are not an entity with an opinion, they’re made up of a bunch of individuals doing [laugh] the work that they need to do—but they think that their problems are harder or more unique than at other organizations. And so, it’s a lot harder to kind of help them see that, yes, there is a very unique situation, the way that your people work together with their technology is unique to every single different organization, but it’s not that those problems cannot be solved in new and different ways. Just because we’ve always done something in this way does not mean that is the way that is serving us the best in this moment. So, we can experiment and we can make some changes.

Especially with process, especially with the human aspect of things of how we talk to each other during incidents and how we communicate externally during incidents. Those aren’t hard-coded. We don’t have to do a bunch of code reviews and make sure it’s working with existing integrations to be able to make those changes. We can experiment with that kind of stuff and I really would like to try to encourage folks to do that even though it seems scary because incidents are… [unintelligible 00:24:33] people think they’re scary. They’re not. They’re [unintelligible 00:24:35].

Corey: They seem to be. For a lot of folks, they are. Let’s not be too dismissive on that.

Emily: But we were both talking about panic [laugh] and the panic that we have felt during incidents. And I don’t want to dismiss that and say that it’s not real. But I also think that we feel that way because we’re worried about how we’re going to be judged for our involvement in them. We’re panicking because, “Oh no, we have contributed to this in some way, and the fact that I don’t know what to do, or the fact that I did something is going to reflect poorly on me, or maybe I’m going to get fired.” And I think that the panic associated with incidents also very often has to do with the environment in which you are experiencing that incident and how that is going to be accepted and discussed. Are you going to be blamed regardless of how, quote-unquote, “Blameless,” your organization is?

Corey: I wish there was a better awareness of a lot of these things, but I don’t think that we are at a point yet where we’re there.

Emily: No.

Corey: How does this map to what you do day-to-day over at Jeli.io?

Emily: It is what I do every single day. So, I mean, I do a ton of different things. We’re a very small startup, so I’m doing a lot, but the main thing that I’m doing is working with our customers to tackle these hurdles within each of their organizations. Our customers vary from very small organizations to very, very large organizations, and I’m working with them to find how to make movement, how to sell this internally, sell this idea of: let’s talk about our incidents a little bit differently, let’s maybe dial back some of the hard-coded automation that we’re doing around response and change that to speaking to each other, as opposed to, we need 11 emails sent automatically upon the creation of an incident that will automatically map to these three PagerDuty schedules. A lot more of it can be us working through the issue together and then talking about it afterwards, not just in reference to the root cause, but in how we interfaced: how did it go, how did response work, as well as how did we solve the technical problem that occurred?

So, I kind of pinch myself. I feel very lucky that I get to work with a lot of different companies to understand these human aspects and the technical aspects of how to do these experiments and make some change within organizations to help make incidents easier. That’s the whole feeling, right? We were talking about the panic. It doesn’t need to be as hard as it feels, sometimes. And I think that it can be easier than we let ourselves think.

Corey: That’s a good way of framing it. It just feels on so many levels like this is one of the hardest areas to build a company in because you’re not really talking about fixing technical, broken systems out there. You’re talking about solving people problems. And “I have some software that solves your people problems”—I’m not sure if that’s ever been true.

Emily: Yeah, it’s not the software that’s going to solve the people problems. It’s building the skills. A lot of what we do is we have software that helps you immensely in the analysis process, and in building out a story as opposed to just building out a timeline, trying to tell, kind of, the narrative of the incident, because that’s what works. Like, anthropologically, we’ve been conveying information through folklore, through tales; telling tales of things that happened in order to help teach people lessons is kind of how we’ve—how oral history has worked for [laugh] thousands of years. And we aren’t better than that just because we have technology, so it’s really about helping people uncover those things by using the technology we have: pulling in Slack transcripts, and PagerDuty alerts, and Zoom transcripts, and all of this different information that we have available to us, and helping people tell that story and convey that story to the folks that were involved in it, as well as other people in your organization who might have similar things come up in the future.

And that’s how we learn. That’s how we teach. I feel like there’s a big difference—I’m understanding that there’s a big difference—between being taught something and learning something, because you usually have to earn that knowledge when you learn it. You can be taught something a thousand times and then you’ve learned it once.

And so, we’re trying to use those moments that we actually learn it where we earn that hard-earned information through an incident and tell those stories and convey that, and our team—the solutions team—is in there, helping people build these skills, teaching people how to talk to each other [laugh] and really find out this information during incidents, not after them.

Corey: I really want to thank you for being as generous with your time as you have been. And if people want to learn more, where’s the best place to find you?

Emily: Oh. I was going to say Twitter, but… [laugh].

Corey: Yeah, that’s a big open question these days, isn’t it? Assuming it’s still there by the time this episode airs, it might be a few days between now and then. Where should they find you on Twitter, with a big asterisk next to it?

Emily: It’s at @themortalemily. Which, I started this by saying I like mess and I’m someone who loves incidents, so I’ll be on Twitter [laugh].

Corey: We’re there to watch it all burn.

Emily: Oh, I feel terrible saying that. Actually, if any Twitter engineers are listening to this, someone has found that the TLS certificate is going to expire at the end of this year. Please check Twitter for where that TLS certificate lives so that you all can renew it. Also, Jeli.io: we have a blog that a lot of us write, our solutions team, and honestly a lot of us; we tend to hire folks who have a lot of experience in incident response and analysis.

I’ve never been a solutions engineer before in my life, but I’ve done a lot of incident response. So, we put up a lot of stuff and our goal is to build resources that are available to folks who are trying to make these changes happen, who are in those organizations where they’re still doing five whys, and RCAs, and are trying to convince people to experiment and change. We have our Howie Guide, which is available for free. It’s ‘How We Got Here’ which is, like, a full, free incident analysis guide and a lot of cool blogs and stuff there. So, if you can’t find me on Twitter, we’re writing… things… there [laugh].

Corey: We will, of course, put links to all of that in the [show notes 00:30:46]. Thank you so much for your time today. It’s appreciated.

Emily: Thank you, Corey. This was great.

Corey: Emily Ruppe, solutions engineer at Jeli.io. I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this episode, please leave a five-star review on your podcast platform of choice, along with an angry comment talking about how we’ve gotten it wrong and it is always someone’s fault.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.