The Kinesis Outage

Episode Summary

Join Pete and Jesse for a lively discussion about the recent AWS Kinesis outage. They touch upon why you should never throw shade at someone else’s outage, how there might not even be a single person at AWS who understands how every AWS service works together, what the downstream effects were when Kinesis was knocked offline, how AWS outages are a good reminder of how we’re all human and no one is immune to these kinds of things, why you shouldn’t decide to move away from AWS because of an outage, why multi-cloud strategies need to be proactive and not reactive, how it’s great how AWS released an in-depth blog post about the outage, and more.

Episode Show Notes & Transcript

Links

Follow Last Week In AWS on Twitter
AWS Outage Message
"Kinesis Outage" by Ryan Frantz

Transcript
Corey: This episode is sponsored in part by our friends at Linode. You might be familiar with Linode; they’ve been around for almost 20 years. They offer Cloud in a way that makes sense rather than a way that is actively ridiculous by trying to throw everything at a wall and see what sticks. Their pricing winds up being a lot more transparent—not to mention lower—their performance kicks the crap out of most other things in this space, and—my personal favorite—whenever you call them for support, you’ll get a human who’s empowered to fix whatever it is that’s giving you trouble. Visit linode.com/screaminginthecloud to learn more, and get $100 in credit to kick the tires. That’s linode.com/screaminginthecloud.

Pete: Hello, everyone. Welcome to the AWS Morning Brief. It's Pete Cheslock again—

Jesse: And Jesse DeRose.

Pete: We are back to talk about ‘The Kinesis Outage.’

Jesse: [singing] bom bom bum.

Pete: So, at this point, as you're listening to this, it's been a couple of weeks since the Kinesis outage has happened, and I'm sure there are many, many armchair sysadmins out there speculating at all the reasons why Amazon should not have had this outage. And guess what? You have two more system administrators here to armchair quarterback this as well.

Jesse: We are happy to discuss what happened, why it happened. I will try to put on my best announcer voice, but I think I normally fall more into the golf announcer voice than the football announcer voice, so I'm not really sure if that's going to play as well into our story here.

Pete: It's going, it's going, it's gone.

Jesse: It’s—and it's just down. It's down—

Pete: It's just—

Jesse: —and it's gone.

Pete: No, but seriously, we're not critiquing it. That is not the purpose of this talk today. We're not critiquing the outage because you should never critique other people's outages; never throw shade at another person's outage. That's not only crazy to do because you have no context into their world. It's just, it's not nice either, so just try to be nice out there.

Jesse: Yeah, nobody wants to get critiqued when their company has an outage and when they're under pressure to fix something. So, we're not here to do that. We don't want to point any fingers. We're not blaming anyone. We just want to talk about what happened because honestly, it's a fascinating, complex conversation.

Pete: It is so fascinating and honestly, loved the detail, a far cry from the early years of Amazon outages that were just, “We had a small percentage of instances have some issues.” This was very detailed. This gave out a lot of information. And the other thing too is that, when it comes to critiquing outages, you have to imagine that there are unlikely to be more than a handful of people even inside Amazon Web Services that fully understand the scope of the size and the interactions of all these different services. There may not even be a single person who truly understands how these dozens of services interact with each other.

I mean, it takes teams and teams of people working together to build these things and to have these understandings. So, that being said, let's dive in. So, the Wednesday before Thanksgiving, Kinesis decided to take off early. You know, long weekend coming up, right? But really, what happened was that there was an addition of capacity to Kinesis, which caused it to hit an operating system limit, causing an outage.

But interestingly enough—and what we'll talk about today—are the interesting and downstream effects that occurred via CloudWatch, Cognito, even the status page, and the Personal Health Dashboard. I mean, that's a really interesting contributing factor or a correlating outage. I don't know the words here, but it's interesting to hear that both CloudWatch goes down and the Personal Health Dashboard goes down.

Jesse: That's when somebody from the product side says, “Oh, that's a feature, definitely not a bug.”

Pete: But the outage to CloudWatch then even affected some of the downstream services to CloudWatch—such as Lambda—which also included auto-scaling events. It even included EventBridge, which was impacted, and that even caused some ECS and EKS delays with provisioning new clusters and scaling of existing clusters.

Jesse: So, right off the bat, I just want to say huge kudos to AWS for dogfooding all of their services within AWS itself: not just providing the services to its customers, but actually using Kinesis internally for other things like CloudWatch and Cognito. They called that out in the write-up and said, “Kinesis is leveraged for CloudWatch, and Cognito, and for other things, for various different use cases.” That's fantastic. That's definitely what you want from your service provider.

Pete: Yeah, I mean, it's a little amazing to hear, and also a little terrifying, that all of these services are built based on all of these other services. So, again, the complexity of the dependencies is pretty dramatic. But at the end of the day, it's still software underneath it; it's still humans. And I don't want to say that I am happy that Amazon had this outage at all, but watching a company of this stature, of this operational expertise, have an outage, it's kind of like watching the Masters when Tiger Woods duffs one into the water or something like that. It's just—it's a good reminder that—listen, we're all human, we're all working under largely the same constraints, and this stuff happens to everyone; no one is immune.

Jesse: And I think it's also a really great opportunity—after the write-up is released—to see how the Masters go about doing what they do. Because everybody at some point is going to have to troubleshoot some kind of technology problem, and we get to see firsthand from this, how they go about troubleshooting these technology problems.

Pete: Exactly. So, of course, one of the first things that I saw everywhere is everyone is, en masse, moving off of Amazon, right? They had an outage, so we're just going to turn off all our servers and just move over to GCP, or Azure, right?

Jesse: Because GCP is a hundred percent uptime. Azure is a hundred percent uptime. They're never going to have any kind of outages like this. Google would never do something to maybe turn off a service, or sunset something.

Pete: Yeah, exactly. So, with the whole talk about hybrid-cloud and multi-cloud strategies, you got to know that there's a whole slew of people out there, probably some executive at some business, who says, “Well, we need to engineer for this type of durability, this type of thing to happen again,” but could you even imagine the complexity of just the authentication systems that exist differently between two systems? Like IAM in one, and whatever's in GCP. But then, if you've built for Kinesis, and then using, like, Amazon, or Google's Pub/Sub, building for the interoperability, like, just from a technical perspective, I would love to see someone do that. And then please do a conference talk so I can listen to it because that sounds technologically impressive.

Jesse: Absolutely. And full disclosure, both Pete and I and the folks at Duckbill Group have mixed feelings on a multi-cloud strategy, but the point that we want to, especially, stress here is that there are places where a multi-cloud strategy may be beneficial for your company, for a business use case, and we're not trying to say that's wrong. But running into an outage with AWS, running into an outage with the cloud provider that has the largest share of the industry isn't necessarily the right move. Don't just move because you ran into an outage. Move purposefully, or develop a multi-cloud strategy purposefully for a business use case, not because you don't want outages because let's be honest: outages are going to happen, no matter which service provider you use.

Pete: Yeah, exactly. So, let's dive into the details. So, back on Wednesday, November 25, Kinesis, like I said, decided to take off for the long weekend. But the trigger for this event was a small addition of capacity that was added about 2:44 a.m. PST, and it took about an hour for that to complete.

This was specifically the frontend systems that handle authentication, throttling, and request routing. And again, definitely read through this whole outline of the outage because it gives tremendous detail into more about these frontend systems, why it takes so long for them to come on board, really, just all of the complexity involved here. It's really fascinating. So, on adding that capacity, they talked about that servers that are operating members of the fleet, they have to learn of the new servers joining and they'll establish threads to those other systems. And they mentioned it would take up to an hour for existing frontend members of the fleet to learn these new participants.
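As a rough illustration of why a small capacity addition can matter here, the following minimal Python sketch models the thread-per-peer pattern Pete describes. It assumes each frontend server holds one thread for every other member of the fleet. The function names, the fleet sizes, and the limit of 10,000 are hypothetical and are not taken from the AWS write-up.

```python
def peer_threads(fleet_size: int) -> int:
    """Threads one server needs if it keeps one thread per *other* fleet member."""
    return max(fleet_size - 1, 0)


def exceeds_limit(fleet_size: int, other_threads: int, thread_limit: int) -> bool:
    """True if peer threads plus the server's remaining threads pass the limit."""
    return peer_threads(fleet_size) + other_threads > thread_limit


# Hypothetical numbers: a fleet sitting just under its limit, then small additions.
LIMIT = 10_000   # assumed per-server thread limit
OTHER = 500      # assumed threads used for everything except peer connections

for size in (9_400, 9_500, 9_600):
    total = peer_threads(size) + OTHER
    print(f"fleet={size} threads_per_server={total} over_limit={exceeds_limit(size, OTHER, LIMIT)}")
```

In this model every existing server gains one thread for each new member, so the whole fleet drifts toward the limit together rather than one host at a time.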

So, about an hour and a half after bringing that capacity online, they started getting alerts from Kinesis, and they thought—like many would think—it was likely related to the new capacity, but they were unsure because some of the errors that they were seeing just didn't correlate to that. But they still decided to start removing the new capacity anyway, right? That's a pretty logical first step right, Jesse? Undo the thing I did when you start getting alerts?

Jesse: Absolutely. And I think that is the logical first step. And I think it's even more important to point out that those alarms started an hour after they deployed the services and that's really, really tough because when you deploy something, you want to know immediately if it fails. The fact that they didn't start seeing alerts until an hour later, gives any engineer on call that kind of sinking feeling of dread of, “Well, I thought everything was good to go, and I went back to sleep after running five or ten minutes worth of tests or looking at the data. But clearly, it's not, and now I need to dive into this more deeply.”

Pete: Yeah, so about two and a half hours later, after those alerts went off, they narrowed things down, and they believed that a full restart of those frontend systems would be involved. And now, Amazon does a really good job to explain why this is. Again, go and read the full outline, we're just summarizing here. And I think, for anyone who's ever run any sort of large distributed database at scale knows that adding and removing capacity—or just restarting in general—can be really challenging and time-consuming because you have to check, ensure consistency along the way. They even pointed out that they were worried about systems being overloaded, and because they were being overloaded, they might be marked as unhealthy, and then those would be removed from the pool as well. So, that's a really interesting caveat here when they talk about what it would take to actually resolve this issue.

Jesse: Yeah, my heart goes out to the engineers who were diagnosing this outage because diagnosing an outage is stressful enough, but diagnosing an outage with multiple potential influencing causes, and different metrics and alerts, your brain is already working so, so hard to keep up, and it's not great for your mental health. This is probably why we see a lot of burnout because there's, there's a lot of different potential influencing causes for this kind of outage, and when you're running any kind of distributed database at scale, it's really tough to really clearly and easily nail down one thing that caused a service outage, quickly and easily. I also really want to quickly call out, AWS mentioned that this process of restarting the frontend servers was a long and careful process. And I have to admit, I always cringe a little bit whenever an organization says that a technical workflow is going to be a long and careful process because it points out a system that is extremely susceptible to human error or negative environmental forces. That's a big business risk.

And it doesn't mean that there's anything in the process that is wrong, or incorrect, but it points out a great opportunity for improvement. It points to a place where maybe more testing needs to happen. Or maybe this kind of process needs to be broken down into smaller, more manageable processes that have been tested and can be either automated or can be tested on a more regular basis to make sure that when this type of issue comes up again, it's able to be handled much more quickly and efficiently.

Pete: Yeah, I mean, I've been on the brunt side of the database restarting game, adding capacity to systems, and it just pushes something over the limit. You're not sure what but, I mean, it's like, reading through this has reminded me of so many outages that I've had to deal with. But distributed databases are hard. Distributed databases at Amazon scale are next-level hard. I mean, you're dealing with edge cases that most people are unlikely to see.

Jesse: And again, this is why my heart goes out to all the engineers who worked on this, not just managing these systems day to day in general, but who were part of troubleshooting and managing this outage. That's a lot of work.

Pete: So, about half an hour after where we last stopped when they had, kind of, narrowed this down—so this is now four and a half hours after the initial alarms fired—they got to identify the contributing factor—I'm not going to say ‘root cause.’ They say ‘root cause’ enough for everyone in that document. But they found that when adding the new systems, they hit a thread limit on the operating system. And so, it was one of those classic Linux limits that have been set historically low for decades past. I can't tell you how many times I've hit random Linux thread limits, and file count limits, and socket limits, and—I mean, it's just—it’s—

Jesse: It's annoying.

Pete: It's annoying, yeah. And one thing I really want to call out is that when they talked about these number of files, the adding of those systems—because they required them to talk to other ones—it was increasing the number of sockets, number of open network connections. Like, that makes a ton of sense, hearing them kind of explain this out. But again, think about just how—not simple but, really, how simple of a problem this was: it was just this largely artificial limit set by the operating system from who knows when, long ago.
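For anyone who wants to see the kinds of limits Pete is listing on their own machine, here is a small Python sketch that reads a few of them on Linux. The discussion here does not say which specific limit Kinesis hit, so the settings below are simply common thread, process, and file-descriptor limits, chosen as an assumption for illustration. The helper names `read_int` and `thread_limits` are made up for this example.

```python
import resource
from pathlib import Path


def read_int(path: str):
    """Return the integer stored in a /proc file, or None if it is unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def thread_limits() -> dict:
    """Collect a few per-user and kernel-wide limits that bound thread counts."""
    # getrlimit returns (soft, hard); resource.RLIM_INFINITY means "no limit".
    nproc_soft, nproc_hard = resource.getrlimit(resource.RLIMIT_NPROC)
    nofile_soft, nofile_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return {
        "rlimit_nproc_soft": nproc_soft,      # processes/threads per user
        "rlimit_nproc_hard": nproc_hard,
        "rlimit_nofile_soft": nofile_soft,    # open files, which includes sockets
        "rlimit_nofile_hard": nofile_hard,
        "kernel_threads_max": read_int("/proc/sys/kernel/threads-max"),
        "kernel_pid_max": read_int("/proc/sys/kernel/pid_max"),
    }


for name, value in thread_limits().items():
    print(f"{name}: {value}")
```

Knowing these values ahead of time is the cheap part; the harder part, as the conversation suggests, is knowing how close a running fleet actually sits to them.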

Jesse: Yeah, and the fact that they found this particular contributing factor, four and a half hours after the first alarm went off. That's a huge shout-out to those engineers who were heads down, doing exploratory work for that long. And similar to Pete, I've been on the receiving end of this where you find the contributing factor—or one of the contributing factors—and you think to yourself, “Oh, thank God. Now I know what went wrong.” But it takes time to get there, and with these engineers, who were looking at multiple different streams of metrics, and alerts, and errors, to be able to find something, four and a half hours later, I know there's a lot of HugOps going around on Twitter when this is all happening, but I just want to plus one that because huge, huge props to the people who were focused on this for that entire period of time. That is a lot of time for your brain to be under this cognitive load, to be stressed out, trying to resolve this outage.

Corey: This episode is sponsored in part by ChaosSearch. Now their name isn’t in all caps, so they’re definitely worth talking to. What is ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes, not days or weeks. Click. Boom. Done. ChaosSearch is for you if you’re trying to get a handle on processing multiple terabytes, or more, of log and event data per day, at a disruptive price. One more thing, for those of you that have been down this path of disappointment before, ChaosSearch is a fully managed solution that isn’t playing marketing games when they say “fully managed.” The data lives within your S3 buckets, and that’s really all you have to care about. No managing of servers, but also no data movement. Check them out at chaossearch.io and tell them Corey sent you. Watch for the wince when you say my name. That’s chaossearch.io.

Pete: Now, when they identified that they were hitting this limit, they removed those extra systems because they didn't want to just blindly increase that limit without understanding the impact. And that was a phenomenally smart move. I have been at too many places where we hit an artificial limit, we'll just go and fire the command that will increase that limit to some unreasonably, impossibly high number. Because that's totally not setting a ticking time bomb for your future.

Jesse: [laugh]. Yeah. What could go wrong? What could possibly go wrong?

Pete: [laugh]. As it turns out, like, yeah sure, you won't hit the filesystem limit anymore, but then you’re going to have some subtle memory leak. It'll fail in just some new and interesting way.

Jesse: “That's Future Me's problem.”

Pete: Yeah, exactly. Or if you're like most people, and you bounce from job to job every two years, it's truly somebody else’s [00:17:44 crosstalk]. [laugh]. You're like two companies away at that point.

Jesse: Oh, god, it's so true.

Pete: Oh, I’m not saying that from experience or anything. So, long term, one of the solutions here is they're actually going to move Kinesis to fewer, yet larger hosts, so that they can run less, which solves the scaling challenge. And again, I really like this solution as well because they know how to operate Kinesis at a certain server count size. They know how the discovery happens, how the frontend systems talk to each other at a certain size. By reducing those number of systems to larger hosts, they kind of give themselves an ability to scale further because they know how to scale to a certain server count if that's their key point.

By reducing that number down and using those bigger hosts, they can always then scale back up again knowing where their limits are. By, again, not increasing that artificial limit on the Linux operating system—assuming they use Linux there, because, Amazon Linux—by not increasing that limit, they're not introducing a new unknown variable on how the system will react. They can leave the limit in place and just change into a model that they know should behave well. So, it really limits that unknown consequence of changing that limit.
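The trade-off Pete describes can be put into rough numbers. The sketch below is only back-of-the-envelope arithmetic under two assumptions: total capacity is measured in some abstract unit, and each host needs one thread per other host. The capacity figures and host sizes are invented for illustration and do not come from AWS.

```python
import math


def hosts_needed(total_capacity: int, capacity_per_host: int) -> int:
    """Hosts required to serve a given total capacity, rounding up."""
    return math.ceil(total_capacity / capacity_per_host)


def peer_threads_per_host(host_count: int) -> int:
    """Peer threads on each host if it keeps one thread per other host."""
    return max(host_count - 1, 0)


TOTAL_CAPACITY = 8_000  # hypothetical capacity units the fleet must serve

for label, per_host in (("smaller hosts", 4), ("larger hosts", 16)):
    count = hosts_needed(TOTAL_CAPACITY, per_host)
    print(f"{label}: {count} hosts, {peer_threads_per_host(count)} peer threads each")
```

Under these assumptions, hosts four times larger cut the per-host peer thread count to roughly a quarter, with the operating system limit left untouched. That is the headroom the conversation is pointing at.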

Jesse: Yeah, I think this is a really great way to look at it because they are able to see that there are multiple different levers that they could pull and manipulate in order to resolve this problem, but rather than tweaking a lever that is potentially going to open up a bunch of new problems down the line, they are specifically saying, “No, we're not going to touch that. We’re going to keep these OS limits in place, and we're specifically just going to move to systems that allow us to run more threads concurrently.” Which I think is a really great way to look at this.

Pete: So, there were some bugs that were found along the way. Obviously, it wasn't just Kinesis that had this problem; we mentioned this before. The first thing, I think, that was mentioned in the outline was that there was a bug that was surfaced in Cognito. Cognito uses Kinesis for analyzing usage patterns and access patterns for their customers, and it was having issues because it could not send that data off to Kinesis. But then also there were issues with CloudWatch metrics: they were being buffered locally in, actually, various services or just dropped entirely.

And that then causes anything that's dependent on those metrics to no longer work. And that's potentially a pretty huge issue. Like Auto Scaling, if you had an Auto Scaling event based on metrics that never arrived, that could have, and very likely did, cause many of the outages for consumers of these services.

Jesse: And this is part of what makes this outage so fascinating to me because we are talking about a very complex system here that has multiple moving parts, multiple services were involved, not just services from the AWS perspective, but services within the different systems of the Kinesis service, and one of the most important things is graceful degradation of these services so that in the future, we don't run into these issues as hard. So, maybe in the future, the Cognito service is able to continue to operate, even when it's seeing errors from the Kinesis API, or these other services, these AWS services are able to continue functioning at some degraded level even when they're seeing errors from upstream services that they depend on. And that's really important because that's one of the things that ultimately became bugs that were highlighted here, but also future improvements that we want to call out that are really great ways to think about how can we make this better in the future, not just in terms of preventing this from happening again, but how can we minimize this kind of impact in the future?
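One common way to get the graceful degradation Jesse describes is to treat a non-critical dependency as best-effort. The Python sketch below shows that pattern for a hypothetical login path that records usage analytics to an upstream stream. None of the names reflect how Cognito or Kinesis are actually built; `authenticate` and `record_usage` are stand-ins invented for this example.

```python
import logging
from typing import Callable

logger = logging.getLogger("degradation_sketch")


def authenticate(user: str, record_usage: Callable[[str], None]) -> bool:
    """Hypothetical login path where usage analytics are best-effort only."""
    authenticated = bool(user)  # stand-in for a real credential check

    try:
        # Non-critical side effect: send usage data to an upstream stream.
        record_usage(user)
    except Exception:
        # Degrade instead of failing: the login result must not depend on analytics.
        logger.warning("usage stream unavailable; continuing without analytics")

    return authenticated


def broken_stream(user: str) -> None:
    """Simulates an upstream service that is returning errors."""
    raise ConnectionError("upstream stream is down")


print(authenticate("alice", broken_stream))
```

The design choice is deciding up front which dependencies sit on the critical path and which can be dropped or retried later without affecting the caller.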

Pete: Exactly. And in some cases, this buffering of metrics that some services had, like Lambda, actually caused memory contention, until engineers identified and resolved it. In some cases, they actually added additional buffers—they even mentioned adding three hours’ worth of storage into CloudWatch’s local metric store that would then allow for services like Auto Scaling and such to be able to operate. I think, one change that they made, which, again, I kind of laugh at, just because it's so real.

Again, you want to think Amazon is this whole other level, and in scale they are, but they’re the same humans as we are doing the same type of work, and the change they did was to migrate CloudWatch into a separate partitioned frontend fleet, which is just incredibly common and oftentimes is the inevitable result of an outage. Take the most critical thing off of the, quote, “shared cluster” and move it into somewhere that's a little bit separate. I can't tell you how many times I've had outages where the answer was, move that really noisy client off of our Elasticsearch cluster and they get their own.

Jesse: Yeah. If they are going to be super noisy, let them have their own space to be noisy so that they're not impacting everybody else who needs the same services. If there's one client who specifically is going to be noisy, or needy, or high-compute-intensive, you put them in their own cluster, and maybe give them more compute resources so that they ultimately are able to do what they need to do without impacting everybody else.
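Going back to the three hours of local metric storage Pete mentioned, the idea can be sketched as a time-bounded local buffer that a consumer such as a scaling policy could read from when the central metrics service is unreachable. This is a minimal Python illustration under stated assumptions, not a description of CloudWatch internals: the class name, method names, and the assumption that timestamps arrive in non-decreasing order are all hypothetical.

```python
from collections import deque


class LocalMetricBuffer:
    """Keeps recent metric points locally, discarding anything past retention."""

    def __init__(self, retention_seconds: float = 3 * 3600):
        self.retention_seconds = retention_seconds
        self._points = deque()  # (timestamp, name, value), oldest first

    def __len__(self) -> int:
        return len(self._points)

    def record(self, timestamp: float, name: str, value: float) -> None:
        """Store a point; assumes timestamps are non-decreasing."""
        self._points.append((timestamp, name, value))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        """Drop points older than the retention window so memory stays bounded."""
        cutoff = now - self.retention_seconds
        while self._points and self._points[0][0] < cutoff:
            self._points.popleft()

    def recent(self, name: str, now: float, window_seconds: float) -> list:
        """Values for one metric within the last window_seconds."""
        self._evict(now)
        start = now - window_seconds
        return [v for (t, n, v) in self._points if n == name and t >= start]


buf = LocalMetricBuffer(retention_seconds=10)
buf.record(0, "cpu", 1.0)
buf.record(5, "cpu", 2.0)
buf.record(20, "cpu", 3.0)
print(len(buf), buf.recent("cpu", now=20, window_seconds=10))
```

Bounding the buffer by age is what keeps this from becoming the memory-contention problem described for unbounded buffering.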

Pete: Exactly. So, onto the summary. Obviously, we both have our hot takes, and we'll treat you to these hot takes now. But I think at the high level, as always, more monitoring, more alerting; these are things that are always needed. It's super hard to know what to monitor in advance. The greater observability that you have in your environment, that ability to have insight into what's happening, to be storing that data somewhere that's accessible—of course, if CloudWatch goes down, then maybe you have some problems there, so—but having more of this data, because even though you may not know what to monitor, trying to monitor as much as you are financially and technologically able to allows you to have that data there for answering the unknown-unknowns. A common topic in the observability world is trying to find those unknown-unknowns—those outliers—to get a quicker answer and a quicker resolution to those problems.

Jesse: Yeah, I think that unknown-unknowns are extremely important to think about, especially in observability, as you mentioned, Pete. If I could go back and teach my younger self anything, I would say, “Just be mindful that there are going to be unknown-unknowns.” And I think being mindful of that is critical because there are definitely folks in the monitoring space who believe that you need to monitor everything and have all the metrics so that you can always have the data that you need, but I think it's less about that and more about understanding what you are aware of and understanding some things you aren't aware of, or at least, understanding that there are things that you aren't aware of that could potentially come up and bite you in the butt and that you need to be able to have contingency plans for that.

Pete: Yeah, exactly. Wow. Well, this was a fascinating post-mortem outline that Amazon wrote up. I highly recommend that you all read through it. I think it's just great to see this level of detail. Outages are painful for everyone, but the amount of detail they gave that really explained the world in which they were operating and debugging this, I just thought it was incredibly fascinating to get that insight, kind of, behind the curtain.

Jesse: Yeah, we'll throw the link to the outage write-up in the [00:25:16 show notes], but I also wanted to highlight an [article by Ryan Frantz], who talked about this outage through the lens of Donella Meadows’ Systems Thinking and Practice. Kinesis is a really complex system in its own right. Even if this outage didn't impact any other systems, even if it was just Kinesis itself that was experiencing problems, the retrospective of just Kinesis itself having these problems is a fantastic example of complex systems failing. But then, when you add in all of these other strands to the web, that make the system even larger, even more complex—you have not just the microservices within Kinesis, but you have, now, other AWS services that rely on Kinesis—you've got lots of other moving parts to worry about and coordinate. And it's not just about the contributing factors or the quote-unquote, “root cause,” but about how all of these different components in the larger system can still function in some kind of degraded mode when the services that they rely on are unavailable. How can we keep the entire service web, so to speak, available and online, even when some of the components of the service web might be weaker, or some of the components may be gone altogether?

Pete: Yeah, exactly. All of this just speaks to the fact that the level of complexity that we operate within has been growing at an unknown rate over the past many decades. I mean, things are just so much more complex, and especially with the rise of microservices, it gets harder and harder to identify dependencies. You know, you see those Death Star graphs as well. It's crazy.

Awesome. I think that does it for us. If you have enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review, give it a five-star rating on your podcast platform of choice and tell us what your most painful outage was. Thanks again.

Announcer: This has been a HumblePod production. Stay humble.
