Incidents, Solutions, and ChatOps Integration with Chris Evans

Episode Summary

Today Corey chats with Chris Evans, Co-founder and CPO of incident.io. After defining “incident,” they talk about the complexity of systems at organizations and how incident.io comes in and provides communication and structural solutions when networks go down or problems arise. Chris explains how incident.io is more effective and pragmatic than mere documentation when it comes to addressing systemic issues. They have a conversation about circular dependencies, and how incident.io integrates with existing systems to complement and augment them.

Episode Show Notes & Transcript

About Chris

Chris is the Co-founder and Chief Product Officer at incident.io, where they're building incident management products that people actually want to use. A software engineer by trade, Chris is no stranger to gnarly incidents, having participated in (and caused!) them at everything from early stage startups through to enormous IT organizations.

Links Referenced:

incident.io: https://incident.io
Practical Guide to Incident Management: https://incident.io/guide/

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: DoorDash had a problem. As their cloud-native environment scaled and developers delivered new features, their monitoring system kept breaking down. In an organization where data is used to make better decisions about technology and about the business, losing observability means the entire company loses their competitive edge. With Chronosphere, DoorDash is no longer losing visibility into their applications suite. The key? Chronosphere is an open-source compatible, scalable, and reliable observability solution that gives the observability lead at DoorDash business confidence and peace of mind. Read the full success story at snark.cloud/chronosphere. That's snark.cloud slash C-H-R-O-N-O-S-P-H-E-R-E.

Corey: Let’s face it, on-call firefighting at 2am is stressful! So there’s good news and there’s bad news. The bad news is that you probably can’t prevent incidents from happening, but the good news is that incident.io makes incidents less stressful and a lot more valuable. incident.io is a Slack-native incident management platform that allows you to automate incident processes, focus on fixing the issues and learn from incident insights to improve site reliability and fix your vulnerabilities. Try incident.io, recover faster and sleep more.

Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. Today’s promoted guest is Chris Evans, who’s the CPO and co-founder of incident.io. Chris, first, thank you very much for joining me. And I’m going to start with an easy question (well, easy question, hard answer, I think): what is incident.io exactly?

Chris: Incident.io is a software platform that helps entire organizations to respond to, recover from, and learn from incidents.

Corey: When you say incident, that means an awful lot of things. And depending on where you are in the ecosystem in the world, that means different things to different people. For example, oh, incident. Like, “Are you talking about the noodle incident because we had an agreement that we would never speak about that thing again,” style, versus folks who are steeped in DevOps or SRE culture, which is, of course, a fancy way to say those who are sad all the time, usually about computers. What is an incident in the context of what you folks do?

Chris: That, I think, is the killer question. I think if you look at organizations in the past, I think incidents were those things that happened once a quarter, maybe once a year, and they were the thing that brought the entirety of your site down because your big central database that was in a data center sort of disappeared. The way that modern companies run means that the definition has to be very, very different. So, most places now rely on distributed systems and there is no, sort of, binary sense of up or down these days. And essentially, in the general case, like, most companies are continually in a sort of state of things being broken all of the time.

And so, for us, when we look at what an incident is, it is essentially anything that takes you away from your planned work with a sense of urgency. And that’s the sort of the pithy definition that we use there. Generally, that can mean anything—it means different things to different folks, and, like, when we talk to folks, we encourage them to think carefully about what that threshold is, but generally, for us at incident.io, that means basically a single error that is worthwhile investigating that you would stop doing your backlog work for is an incident. And also an entire app being down, that is an incident.

So, there’s quite a wide range there. But essentially, by sort of having more incidents and lowering that threshold, you suddenly have a heap of benefits, which I can go very deep into and talk for hours about.

Corey: It’s a deceptively complex question. When I talk to folks about backups, one of the biggest problems in the world of backup and building a DR plan, it’s not building the DR plan (though that’s no picnic either); it’s, okay, in the time of cloud, all your planning figures out, “Okay, suddenly the site is down, how do we fix it?” There are different levels of down, and that means different things to different people where, especially the way we build apps today, it’s not “is the service or site up or down,” but with distributed systems, it’s “how down is it?”

And oh, we’re seeing elevated error rates in us-tire-fire-1 region of AWS. At what point do we begin executing on our disaster plan? Because the worst answer, in some respects is, every time you think you see a problem, you start failing over to other regions and other providers and the rest, and three minutes in, you’ve irrevocably made the cutover and it’s going to take 15 minutes to come back up. And oh, yeah, then your primary site comes back up because whoever unplugged something, plugged it back in and now you’ve made the wrong choice. Figuring out all the things around the incident, it’s not what it once was.

When you were running your own blog on a single web server and it’s broken, it’s pretty easy to say, “Is it up or is it down?” As you scale out, it seems like that gets more and more diffuse. But it feels to me that it’s also less of a question of how the technology has scaled, but also how the culture and the people have scaled. When you’re the only engineer somewhere, you pretty much have no choice but to have the entire state of your stack shoved into your head. When that becomes 15 or 20 different teams of people, in some cases, it feels like it’s almost less of a technology problem than it is a problem of how you communicate and how you get people involved, and get the issues in front of the people who are empowered and insightful in a certain area that needs fixing.

Chris: A hundred percent. This is, like, a really, really key point, which is that organizations themselves are very complex. And so, you’ve got this combination of systems getting more and more complicated, more and more sort of things going wrong and perpetually breaking, but you’ve got very, very complicated information structures and communication throughout the whole organization to keep things up and running. The very best orgs are the ones where they can engage, sort of, every corner of the organization when things do go wrong. And I lived and breathed this firsthand at various different previous companies, but most recently at Monzo, which is a bank here in the UK. When an incident happened there, like, one of our two physical data center locations went down, the bank wasn’t offline. Everything was resilient to that, but that required an immediate response.

And that meant that engineers were deployed to go and fix things. But it also meant the customer support folks might be required to get involved because we might be slightly slower processing payments. And it means that risk and compliance folks might need to get involved because they need to be reporting things to regulators. And the list goes on. There’s, like, this need for a bunch of different people who almost certainly have never worked together or rarely worked together to come together, land in this sort of like empty space of this incident room or virtual incident room, and figure out how they’re going to coordinate their response and get things back on track in the sort of most streamlined way and as quick as possible.

Corey: Yeah, when your bank is suddenly offline, that seems like a really inopportune time to be introduced to the database team. It’s, “Oh, we have one of those. Wonderful. I feel like you folks are going to come in handy later today.” You want to have those pathways of communication open well in advance of these issues.

Chris: A hundred percent. And I think the thing that makes incidents unique is that fact. And I think the solution to that is this sort of consistent, level playing field that you can put everybody on. So, if everybody understands that the way that incidents are dealt with is consistent, we declare it like this, and under these conditions, these things happen. And, you know, if I flag this kind of level of impact, we have to pull in someone else to come and help make a decision.

At the core of it, there’s this weird kind of duality to incidents where they are both kind of semi-formulaic, in that you can basically encode a lot of the processes that happen, but equally, they are incredibly chaotic and require a lot of human input to be resilient and figure these things out because stuff that you have never seen happen before is happening and failing in ways that you never predicted. And so, where incident.io plays into this is that we try to take the first half of that off of your hands, which is, we will help you run your process so that all of the brain capacity you have goes on to the bit that humans are uniquely placed to be able to do, which is responding to these very, very chaotic, sort of, surprise events that have happened.

Corey: I feel as well—because I played around in this space a bit before I used to run ops teams—and, more or less I really should have had a t-shirt then that said, “I am the root cause,” because yeah, I basically did a lot of self-inflicted outages in various environments because it turns out, I’m not always the best with computers. Imagine that. There are a number of different companies that play in the space that look at some part of the incident lifecycle. And from the outside, first, they all look alike because it’s, “Oh, so you’re incident.io. I assume you’re PagerDuty. You’re the thing that calls me at two in the morning to make sure I wake up.”

Conversely, for folks who haven’t worked deeply in that space, as well, of setting things on fire, what you do sounds like it’s highly susceptible to the Hacker News problem. Where, “Wait, so what you do is effectively just getting people to coordinate and talk during an incident? Well, that doesn’t sound hard. I could do that in a weekend.” And no, no, you can’t.

If this were easy, you would not have been in business as long as you have, have the team the size that you do, the customers that you do. But it’s one of those things that until you’ve been in a very specific set of a problem, it doesn’t sound like it’s a real problem that needs solving.

Chris: Yeah, I think that’s true. And I think that the Hacker News point is a particularly pertinent one and that someone else, sort of, in an adjacent area launched on Hacker News recently, and the amount of feedback they got around, you know, “You’re a Slack bot. How is this a company?” Was kind of staggering. And I think generally where that comes from is—well, first of all that bias that engineers have, which is just everything you look at as an engineer is like, “Yeah, I can build that in a weekend.” I think there’s often infinite complexity under the hood that just gets kind of brushed over. But yeah, I think at the core of it, you probably could build a Slack bot in a weekend that creates a channel for you in Slack and allows you to post somewhere that some—

Corey: Oh, good. More channels in Slack. Just what everyone wants.

Chris: Well, there you go. I mean, that’s a particularly pertinent one because, like, our tool does do that. So, I built at Monzo a version of incident.io that we used at the company there, and that was something that I built evenings and weekends. And among the many, many things I never got around to building, archiving and cleaning up channels was one of the ones that was always on that list.

And so, Monzo did have this problem of littered channels everywhere. I think that, sort of like, part of the problem here is, like, it is easy to look at a product like ours and sort of assume it is this sort of friendly Slack bot that helps you orchestrate some very basic commands. And I think when you actually dig into the problems that organizations above a certain size have, they’re not solved by Slack bots. They’re solved by platforms that help you to encode your processes that otherwise have to live on a Google Doc somewhere which is five pages long, and when it’s 2 a.m. and everything’s on fire, I guarantee you not a single person reads that Google Doc, so your process is as good as not in place at all. That’s the beauty of a tool like ours. We have a powerful engine that helps you basically to encode that and take some load off of you.

Corey: To be clear, I’m also not coming at this from a position of judging other people. I just look right now at the Slack workspace that we have at The Duckbill Group, and we have something like a ten-to-one channel-to-human ratio. And the proliferation of channels is a very real thing. And the problem that I’ve seen across the board with other things that try to address incident management has always been that they’re fanciful at best about what really happens when something breaks. Like, you talk about, oh, here’s what happens. Step one: you will pull up the Google Doc, or you will pull up the wiki or the rest, or in some aspirational places, ah, something seems weird, I will go open a ticket in Jira.

Meanwhile, here in reality, anyone who’s ever worked in these environments knows that step one is, “Oh shit, oh shit, oh shit, oh shit, oh shit. What are we going to do?” And all the practices and procedures that often exist, especially in orgs that aren’t very practiced at these sorts of things, tend to fly out the window and people are going to do what they’re going to do. So, any tool or any platform that winds up addressing that has to accept the reality of meeting people where they are, not trying to educate people into different patterns of behavior as such. One of the things I like about your approach is, yeah, it’s going to be a lot of conversation in Slack; that is a given. We can pretend otherwise, but here in reality, that is how work gets communicated, particularly in extremis. And I really appreciate the fact that you are not trying to, like, fight what feels almost like a law of nature at this point.

Chris: Yeah, I think there’s a few things in that. The first point around the document approach or the clearly defined steps of how an incident works. In my experience, those things have always gone wrong because—

Corey: The data center is down, so we’re going to the wiki to follow our incident management procedure, which is in the data center that just lost power.

Chris: Yeah.

Corey: There’s a dependency problem there, too. [laugh].

Chris: Yeah, a hundred percent. [laugh]. A hundred percent. And I think part of the problem that I see there is that very, very often, you’ve got this situation where the people designing the process are not the people following the process. And so, there’s this classic, I’ve heard it through John Allspaw, but it’s a bunch of other folks who talk about the difference between people, you know, at the sharp end or the blunt end of the work.

And I think the problem that people have faced in the past is you have these people who sit in the, sort of, metaphorical upstairs of the office and think that they make a company safe by defining a process on paper. And they ship the piece of paper and go, “That is a good job for me done. I’m going to leave and know that I’ve made the bank, or whatever your organization does, much, much safer.” And I think this is where things fall down because—

Corey: I want to ambush some of those people in their performance reviews with, “Cool. Just for fun, all the documentation here, we’re going to pull up the analytics to see how often that stuff gets viewed. Oh, nobody ever sees it. Hmm.”

Chris: It’s frustrating. It’s frustrating because that never ever happens, clearly. But the point you made around, like, meeting people where they are, I think that is a huge one, which is incidents are founded on great communication. Like, as I said earlier, this is, like, forming a team with someone you’ve never ever worked with before and the last thing you want to do is be, like, “Hey, Corey, I’ve never met you before, but let’s jump out onto this other platform somewhere that I’ve never been or haven’t been for weeks and we’ll try and figure stuff out over there.” It’s like, no, you’re going to be communicating—

Corey: We use Slack internally, but we have a WhatsApp chat that we wind up using for incident stuff, so go ahead and log into WhatsApp, which you haven’t done in 18 months, and join the chat. Yeah, in the dawn of time, in the mists of antiquity, you vaguely remember hearing something about that your first week and then never again. This stuff has to be practiced and it’s important to get it right. How do you approach the inherent and often unfortunate reality that incident response and management inherently becomes very different depending upon the specifics of your company or your culture or something like that? In other words, how cookie-cutter is what you have built versus adaptable to different environments it finds itself operating in?

Chris: Man, the amount of time we spent as a founding team in the early days deliberating over how opinionated we should be versus how flexible we should be was staggering. The way we like to describe it is: we are quite opinionated about how we think incidents should be run; however, we let you imprint your own process into that. So, putting some color onto that: we expect incidents to have a lead. That is something you cannot get away from. However, you can call the lead whatever makes sense for you at your organization. So, some folks call them an incident commander or a manager or whatever else.

Corey: There’s overwhelming militarization of these things. Like, oh, yes, we’re going to wind up taking a bunch of terms from the military here. It’s like, you realize that your entire giant screaming fire is that the lights on the screen are in the wrong pattern. You’re trying to make them in the right pattern. No one dies here in most cases, so it feels a little grandiose for some of those terms being tossed around in some cases, but I get it. You’ve got to make something that is unpleasant and tedious in many respects, a little bit more gripping. I don’t envy people. Messaging is hard.

Chris: Yeah, it is. And I think if you’re overly virtuoustic and inflexible, you’re sort of fighting an uphill battle here, right? So, folks are going to want to call things what they want to call things. And you’ve got people who want to import ITIL definitions for severities into the platform because that’s what they’re familiar with. That’s fine.

What we are opinionated about is that you have some severity levels because absent academic criticism of severity levels, they are a useful mechanism to very coarsely and very quickly assess how bad something is and to take some actions off of it. So yeah, we basically have various points in the product where you can customize and put your own sort of flavor on it, but generally, we have a relatively opinionated end-to-end expectation of how you will run that process.

Corey: The thing that I find that annoys me—in some cases—the most is how heavyweight the process is, and it’s clearly built by people in an ivory tower somewhere where there’s effectively a two-day long postmortem analysis of the incident, and so on and so forth. And okay, great. Your entire site has been blown off the internet, yeah, that probably makes sense. But as soon as you start broadening that to things like okay, an increase in 500 errors on this service for 30 minutes, “Great. Well, we’re going to have a two-day postmortem on that.” It’s, “Yeah, sure would be nice if we could go two full days without having another incident of that caliber.” So, in other words, whose foot—are we going to hire a new team whose full-time job it is, is to just go ahead and triage and learn from all these incidents? Seems to me like that’s sort of throwing wood behind the wrong arrows.

Chris: Yeah, I think it’s very reductive to suggest that learning only happens in a postmortem process. So, I wrote a blog, actually, not so long ago that is about running postmortems and when it makes sense to do it. And as part of that, I had a sort of a statement that was [laugh] that we hadn’t run a single postmortem at incident.io when I wrote this blog. Which is probably shocking to many people because we’re an incident company, and we talk about this stuff, but we were also a company of five people and when something went wrong, the learning was happening and these things were sort of—we were carving out the time, whether it was called a postmortem or not, to learn and figure out these things. Extrapolating that to bigger companies, there is little value in following processes for the sake of following processes. And so, you could have—

Corey: Someone in compliance just wound up spitting their coffee over their desktop as soon as you said that. But I hear you.

Chris: Yeah. And it's those same folks who are the ones who care about the document being written, not the process and the learning happening. And I think that’s deeply frustrating to me as—

Corey: All the plans, of course, assume that people will prioritize the company over their own family for certain kinds of disasters. I love that, too. It’s divorced from reality; that’s ridiculous, on some level. Speaking of ridiculous things, as you continue to grow and scale, I imagine you integrate with things beyond just Slack. You grab other data sources in the fullness of time.

For example, I imagine one of your most popular requests from some of your larger customers is to integrate with their HR system in order to figure out who’s the last engineer who left, therefore everything is immediately their fault because lord knows the best practice is to pillory whoever was the last to leave because then they’re not there to defend themselves anymore and no one’s going to get dinged for that irresponsible jackass’s decisions, even if they never touched the system at all. I’m being slightly hyperbolic, but only slightly.

Chris: Yeah. I think [laugh] that's an interesting point. I am definitely going to raise that feature request for a prefilled root cause category, which is, you know, the value is just that last person who left the organization. That’s a wonderful scapegoat situation there. I like it.

To the point around what we do integrate with, I think the thing with incidents that’s actually quite interesting is there is a lot of tooling that exists in this space that does little pockets of useful, valuable things in the shape of incidents. So, you have PagerDuty, which is this system that does a great job of making people’s phones make noise, but that happens, and then you’re dropped into this sort of empty void of nothingness and you’ve got to go and figure out what to do. And then you’ve got things like Jira where clearly you want to be able to track actions that are coming out of things going wrong in some cases, and that’s a great tool for that. And various other things in the middle there. And yeah, our value proposition, if you want to call it that, is to bring those things together in a way that is massively ergonomic during an incident.

So, when you’re in the middle of an incident, it is really handy to be able to go, “Oh, I have shipped this horrible fix to this thing. It works, but I must remember to undo that.” And we put that at your fingertips in an incident channel from Slack, that you can just log that action, lose that cognitive load that would otherwise be there, move on with fixing the thing. And you have this sort of—I think it’s, like, that multiplied by 1000 in incidents that is just what makes it feel delightful. And I cringe a little bit saying that because it’s an incident at the end of the day, but genuinely, it feels magical when some things happen that are just like, “Oh, my gosh, you’ve automatically hooked into my GitHub thing and someone else merged that PR and you’ve posted that back into the channel for me so I know that that happens. That would otherwise have been a thing where I jump out of the incident to go and figure out what was happening.”

Corey: This episode is sponsored in part by our friend EnterpriseDB. EnterpriseDB has been powering enterprise applications with PostgreSQL for 15 years. And now EnterpriseDB has you covered wherever you deploy PostgreSQL: on-premises, private cloud, and they just announced a fully-managed service on AWS and Azure called BigAnimal, all one word. Don’t leave managing your database to your cloud vendor because they’re too busy launching another half-dozen managed databases to focus on any one of them that they didn’t build themselves. Instead, work with the experts over at EnterpriseDB. They can save you time and money, they can even help you migrate legacy applications, including Oracle, to the cloud. To learn more, try BigAnimal for free. Go to biganimal.com/snark, and tell them Corey sent you.

Corey: The problem with the cloud, too, is that when there starts to be an incident happening, the number one decision point, or almost the number one, is: is this my shitty code, something we have just pushed in our stuff, or is it the underlying provider itself? Which is why the AWS status page being slow to update is so maddening. Because those are two completely different paths to go down and you are having to pursue both of them equally at the same time until one can be ruled out. And that is why time to identify at least what side of the universe it’s on is so important. That has always been a bit of a tricky challenge.

I want to talk a bit about circular dependencies. You target a certain persona of customer, but I’m going to go out on a limb and assume that one explicit company that you are not going to want to do business with in your current iteration is Slack itself because a tool to manage—okay, so our service is down, so we’re going to go to Slack to fix it doesn’t work when the service is Slack itself. So, that becomes a significant challenge. As you look at this across the board, are you seeing customers having problems where you have circular dependency issues with this? Easy example: Slack is built on top of AWS.

When there’s an underlying degradation of, huh, suddenly us-east-1 is not doing what it’s supposed to be doing, now, Slack is degraded as well, as well as the customer site, it seems like at that point, you’re sort of in a bit of tricky positioning as a customer. Counterpoint, when neither Slack nor your site are working, figuring out what caused that issue doesn’t seem like it’s the biggest stretch of the imagination at that point.

Chris: I’ve spent a lot of my career working in infrastructure, platform-type teams, and I think you can end up tying yourself in knots if you try and over-optimize for, like, avoiding these dependencies. I think it’s one of those, sort of, turtles all the way down situations. So yes, Slack are unlikely to become a customer because they are clearly going to want to use our product when they are down.

Corey: They reach out, “We’d like to be your customer.” Your response is, “Please don’t be.” None of us are going to be happy with this outcome.

Chris: Yeah, I mean, the interesting thing there is that we’re friends with some folks at Slack, and, believe it or not, they do use Slack to navigate their incidents. They have an internal tool that they have written. And I think this sort of speaks to the point we made earlier, which is that incidents and things failing are not these sort of big binary events. And so—

Corey: All of Slack is down is not the only kind of incident that a company like Slack can experience.

Chris: I’d go as far as that it’s most commonly not that. It’s most commonly that you’re navigating incidents where it is a degradation, or some edge case, or something else that’s happened. And so, like, the pragmatic solution here is not to avoid the circular dependencies, in my view; it’s to accept that they exist and make sure you have sensible escape hatches so that when something does go wrong—so a good example, we use incident.io at incident.io to manage incidents that we’re having with incident.io. And 99% of the time, that is absolutely fine because we are having some error in some corner of the product or a particular customer is doing something that is a bit curious.

And I could count literally on one hand the number of times that we have not been able to use our products to fix our product. And in those cases, we have a fallback which is jump into—

Corey: I assume you put a little thought into what happened. “Well, what if our product is down?” “Oh well, I guess we’ll never be able to fix it or communicate about it.” It seems like that’s the sort of thing that, given what you do, you might have put more than ten seconds of thought into.

Chris: We’ve put a fair amount of thought into it. But at the end of the day, [laugh] it’s like if stuff is down, like, what do you need to do? You need to communicate with people. So, jump on a Google Chat, jump on a Slack huddle, whatever else it is. We have various different, like, fallbacks in different order. And at the core of it, I think this is the thing: like, you cannot be prepared for every single thing going wrong, and so what you can be prepared for is to be unprepared and just accept that humans are incredibly good at being resilient, and therefore, all manner of things are going to happen that you’ve never seen before and I guarantee you will figure them out and fix them, basically.

But yeah, I say this; if my SOC 2 auditor is listening, we also do have a very well-defined, like, backup plan in our SOC 2 [laugh] in our policies and processes, which is the thing that we will follow. But yeah.

Corey: The fact that you’re saying the magic words of SOC 2, yes, exactly. Being a responsible adult and living up to some baseline compliance obligations is really the sign of a company that’s put a little thought into these things. So, as I pull up incident.io (the website, not the company, to be clear) and look through what you’ve written and how you talk about what you’re doing, you’ve avoided what I would almost certainly have not because your tagline front and center on your landing page is, “Manage incidents at scale without leaving Slack.” If someone were to reach out and say, well, we’re down all the time, but we’re using Microsoft Teams, so I don’t know that we can use you, like, the immediate instinctive response that I would have for that to the point where I would put it in the copy is, “Okay, this piece of advice is free. I would posit that you’re down all the time because you’re the kind of company to use Microsoft Teams.” But that doesn’t tend to win a whole lot of friends in various places. In a slightly less sarcastic bent, do you see people reaching out with, “Well, we want to use you because we love what you’re doing, but we don’t use Slack.”

Chris: Yeah. We do. A lot of folks actually. And we will support Teams one day, I think. There is nothing especially unique about the product that means that we are tied to Slack.

It is a great way to distribute our product and it sort of aligns with the companies that think in the way that we do in the general case but, like, at the core of what we’re building, it’s a platform that augments a communication platform to make it much easier to deal with a high-stress, high-pressure situation. And so, in the future, we will support ways for you to connect Microsoft Teams or, if Zoom sorted out getting rich app experiences, talk on a Zoom and be able to do various things like logging actions and communicating with other systems and things like that. But yeah, for the time being, it’s a very, very deliberate focus mechanism for us. We’re a small company with, like, 30 people now, and so yeah, focusing on that sort of very slim vertical is working well for us.

Corey: And it certainly seems to be working to your benefit. Every person I’ve talked to who has encountered you folks has nothing but good things to say. We have a bunch of folks in common listed on the wall of logos, the social proof eye chart thing of here’s people who are using us. And these are serious companies. I mean, your last job before starting incident.io was at Monzo, as you mentioned.

You know what you’re doing in a regulated, serious sense. I would be, quite honestly, extraordinarily skeptical if your background were significantly different from this because, “Well, yeah, we worked at Twitter for Pets in our three-person SRE team, we can tell you exactly how to go ahead and handle your incidents.” Yeah, there’s a certain level of operational maturity that I, kind of just based upon the name of the company there, don’t think that Twitter for Pets is going to nail. Monzo is a bank. Guess you know what you’re talking about, given that you have not, basically, been shut down by an army of regulators. It really does breed an awful lot of confidence.

But what’s interesting to me is the number of people that we talk to in common are not themselves banks. Some are and they do very serious things, but others are not these highly regulated, command-and-control, top-down companies. You are nimble enough that you can get embedded at the most startup-y of startup companies once they hit a certain point of scale and wind up helping them arrive at a better outcome. It’s interesting in that you don’t normally see a whole lot of tools that wind up being able to speak to both sides of that very broad spectrum, and most things in between, very effectively. But you’ve somehow managed to thread that needle. Good work.

Chris: Thank you. Yeah. What else can I say other than thank you? I think, like, it’s a deliberate product positioning that we’ve gone down to try and be able to support those different use cases. So, I think, at the core of it, we have always tried to maintain that incident.io should be installable and usable in your very first incident without you having to have a very steep learning curve, but there is depth behind it that allows you to support a much more sophisticated incident setup.

So, like, I mean, you mentioned Monzo. Like, I just feel incredibly fortunate to have worked at that company. I joined back in 2017 when they had, I don’t know, like, 150,000 customers and it was just getting its banking license. And I was there for four years and was able to then see it scale up to 6 million customers and all of the challenges and pain that goes along with that, both from building infrastructure on the technical side of things and from an organizational side of things. And I had, like, a front-row seat to being able to work with some incredibly smart people and sort of see all these various different pain points.

And honestly, it feels a little bit like being in sort of a cheat mode where we get to import a lot of that knowledge and pain that we felt at Monzo into the product. And that happens to resonate with a bunch of folks. So yeah, I feel like things are sort of coming out quite well at the moment for folks.

Corey: The one thing I will say before we wind up calling this an episode is just how grateful I am that I don’t have to think about things like this anymore. There’s a reason that the problem that I chose to work on, expensive AWS bills, is very much a business-hours-only style of problem. We’re a services company. We don’t have production infrastructure that is externally facing. “Oh, no, one of our data analysis tools isn’t working internally.”

That’s an interesting curiosity, but it’s not an emergency in the same way that, “Oh, we’re an ad network and people aren’t looking at ads right now because we’re broken,” is. So, I am grateful that I don’t have to think about these things anymore. And also a little wistful because there’s so much that you do that would have made dealing with expensive and dangerous outages back in my production years a lot nicer.

Chris: Yep. I think that’s what a lot of folks are telling us essentially. There’s this curious thing with, like, this product didn’t exist however many years ago and I think it’s sort of been quite emergent in a lot of companies that, you know, as sort of things have moved on, that something needs to exist in this little pocket of space, dealing with incidents in modern companies. So, I’m very pleased that what we’re able to build here is sort of working and filling that for folks.

Corey: Yeah. I really want to thank you for taking so much time to go through the ethos of what you do, why you do it, and how you do it. If people want to learn more, where’s the best place for them to go? Ideally, not during an incident.

Chris: Not during an incident, obviously. Handily, the website is the company name. So, incident.io is a great place to go and find out more. We’ve literally—literally just today, actually—launched our Practical Guide to Incident Management, which is, like, a really full piece of content which, hopefully, will be useful to a bunch of different folks.

Corey: Excellent. We will, of course, put a link to that in the show notes. I really want to thank you for being so generous with your time. Really appreciate it.

Chris: Thanks so much. It’s been an absolute pleasure.

Corey: Chris Evans, Chief Product Officer and co-founder of incident.io. I’m Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this episode, please leave a five-star review on your podcast platform of choice along with an angry comment telling me why your latest incident is all the intern’s fault.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.
