Networking in the Cloud Fundamentals, Part 6

Episode Summary

Join me as I continue my series on cloud fundamentals with a look at how things break in the cloud, the differences between computers breaking in data centers versus breaking in the cloud, why you need to check Twitter or ThousandEyes instead of the AWS status page to find out whether your cloud provider’s having a massive outage, what some of the more common outages in the cloud look like, why you should probably still be in the cloud despite the fact that things break, and more.

Episode Show Notes & Transcript

About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Transcript

Corey: Knock knock. Who's there? A DDoS attack. A DDoS a... Knock. Knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock.

Welcome to what we're calling Networking in the Cloud, episode six, How Things Break in the Cloud, sponsored by ThousandEyes. ThousandEyes recently launched their state of the cloud performance benchmark report that effectively lets you compare and contrast performance and other aspects between the five large cloud providers: AWS, Azure, GCP, Alibaba, and IBM Cloud. Oracle Cloud was not invited because we are talking about real clouds here. You can get your copy of this report at snark.cloud/realclouds, and they compare and contrast an awful lot of interesting things. One thing that we're not going to compare and contrast, though, because of my own personal beliefs, is the outages of different cloud providers.

Making people in companies (and companies are, by the way, composed of people) feel crappy about their downtime is mean, first off. Secondly, if companies are shamed for outages, it in turn makes it far likelier that they won't disclose having suffered an outage. And when companies talk about their outages in constructive, blameless ways, there are incredibly valuable lessons that we can all learn from them. So let's dive into this a bit.

If there's one thing that computers do well, better than almost anything else, it's break. And this is, and I'm not being sarcastic when I say this, a significant edge that Microsoft has when it comes to cloud. They have 40-some-odd years of experience in apologizing for software failures. That's not meant to be insulting to Microsoft; it's what computers do, they break. And being able to explain that intelligently to business stakeholders is incredibly important. They're masters at that. They also have a 20-year head start on everyone else in the space. What makes this interesting and useful is that in the cloud, computers break differently than people would expect them to in a non-cloud environment.

Once upon a time, when you were running servers in data centers, if you saw everything suddenly go offline, you had some options. You could call the data center directly to see if someone cut the fiber; in case you were unaware, the fiber optic cable's sole natural predator in the food chain is the mighty backhoe. So maybe something backhoed out some fiber lines, maybe the power is dead to the data center, maybe the entire thing exploded, burst into flames, and burned to the ground, but you could call people. In the cloud, it doesn't work that way. Here in the cloud, instead you check Twitter, because it's 3:00 AM and Nagios, the original call of duty, or PagerDuty calls you, because you didn't need that sleep anyway, telling you there is something amiss with your site. So when a large bond provider takes an outage and you're hanging out on Twitter at two in the morning, you can see DevOps Twitter come to life in the middle of the night as they chatter back and forth.

And incidentally, if that's you, understand a nuance of AWS availability zone naming. When people say things like, "us-east-1a is having a problem," and someone else says, "No, I just see us-east-1c having a problem," you're probably talking about the same availability zone. Those letters change, non-deterministically, between accounts. You can pull zone IDs, and those are consistent. By and large, that randomization was originally to avoid problems like everyone picking A, as humans tend to do, or C getting a reputation as the crappy one.
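
If you ever want to check this in your own accounts, here's a minimal sketch, assuming boto3 and AWS credentials are already configured, that prints the per-account letter names alongside the consistent zone IDs:

```python
# Minimal sketch: map the per-account AZ names (us-east-1a, us-east-1b, ...)
# to zone IDs (use1-az1, use1-az2, ...), which are consistent across accounts.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

for az in ec2.describe_availability_zones()["AvailabilityZones"]:
    # ZoneName is what people argue about on Twitter; ZoneId is the stable one.
    print(f'{az["ZoneName"]} -> {az["ZoneId"]}')
```

Run that in two different accounts and you'll often find that your us-east-1a and someone else's us-east-1c resolve to the same zone ID.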

So why would you check Twitter to figure out if your cloud provider's having a massive outage? Well, because honestly, the AWS status page is completely full of lies and gaslights you. It is as green as the healthiest Christmas tree you can imagine, even when things have been exploding for a disturbingly long period of time. If you visit the website stop.lying.cloud, you'll find a Lambda@Edge function that I've put there that cuts out some of the cruft, but it's not perfect. And the reason behind this, after I gave them a bit too much crap one day and got a phone call that started with, "Now you listen here," turns out to be that there are humans in the loop. They need to validate that there is in fact a systemic issue at AWS and what that issue might be, then finally come up with a way to report it that ideally doesn't get people sued, and manually update the status page. Meanwhile, your site's on fire. So the status page is a trailing indicator, not a leading one.

Alternately, you could always check ThousandEyes. That's right, this episode is sponsored by ThousandEyes. In addition to the report we mentioned earlier, you can think of them as the Google Maps of the internet, without the creepy privacy overreach issues. Just like you wouldn't necessarily want to commute during rush hour without checking where traffic is going to be and which route is faster, businesses rely on ThousandEyes to see the end-to-end paths their applications and services are taking in real time, to identify where the slowdowns are, where the outages are, and what's causing problems. They use ThousandEyes to see what's breaking where, and then, importantly, ThousandEyes shares that data directly with the offending service providers. Not just to hold them accountable, but also to get them to fix the issue fast, ideally before it impacts users. But on this episode, it already has.

So let's say that you don't have the good sense to pay for ThousandEyes, or you're not on Twitter for whatever reason, watching people flail around helplessly trying to figure out what's going on. Instead, you're now trying desperately to figure out whether this issue is the last deploy your team did or a global problem. The first thing people try to do in the event of an issue is, "Oh crap, what did we just change? Undo it." And often that is a knee-jerk response that can make things worse if it's not actually your code that caused the problem. Worse, it can eat up precious time at the beginning of an outage. If you knew that it was a single availability zone or an entire AWS region that was having a problem, you could instead be working to fail over to a different location, instead of wasting valuable incident response time checking Twitter or looking over your last 200 commits.
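
If you want something faster than scrolling Twitter, one low-tech option is to probe health endpoints you already expose in each region and compare the results. A minimal sketch, with entirely hypothetical example.com URLs standing in for whatever per-region checks you actually run:

```python
# Minimal sketch: hit a per-region health endpoint and compare results, to help
# separate "our last deploy broke it" from "an entire region is having a day".
# The example.com URLs are hypothetical placeholders.
import urllib.request

HEALTH_ENDPOINTS = {
    "us-east-1": "https://health.us-east-1.example.com/healthz",
    "us-west-2": "https://health.us-west-2.example.com/healthz",
}

for region, url in HEALTH_ENDPOINTS.items():
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{region}: HTTP {resp.status}")
    except Exception as exc:  # timeouts, DNS failures, connection resets, etc.
        print(f"{region}: unreachable ({exc})")
```

If one region answers happily and the other times out, that's a much better signal to start a failover conversation than your commit history is.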

Part of the problem, and the reason this is the way that it is, is that unlike the rusting computers in your data center currently being savaged by raccoons, things in the cloud break differently. You don't have the same diagnostic tools, you don't have the same level of visibility into what the hardware is doing, and the behaviors themselves are radically different. I have a half dozen tips and tricks on how to monitor remotely whether or not your data center's experiencing a problem, but they don't work in the cloud, because you're not allowed to break into us-east-1 and install your own hardware. Believe me, I've tried. I still have the scars to prove it. Instead, you have to deal with this problem of behaviors looking different.

For example, sometimes you can talk to one set of servers but the other is completely non-responsive to you, yet those two server sets can still talk to one another intermittently. So you wind up with each one of them at times thinking they're the only ones there. Or you can talk to both of them, but they can't talk to each other. There are different kinds of failures, and they all look slightly different. Occasionally, it looks like slow API responses: latencies are increasing. Well, that's an awfully nice way to say that suddenly your database just... doesn't. It often looks like a certain subset of systems that seem slow or intermittent. Remember as well that availability zones really are multiple buildings. It's not just one room with different racks being called different AZs, the way we used to do things badly in crappy data center land. It's super hard to take out 20 square blocks and cause multiple AZ outages at the same time. At least it is with that attitude.

So instead of automatically assuming that, "Well, it works for me on this other account, so things are fine," dig deeper into it. Often issues in one AZ have cascading effects, and you'll see other popular sites on the internet starting to have problems. Maybe it's not just you. The fact that this is sort of the state of the art for monitoring these issues is a separate problem. The real trouble comes in when people haven't changed their thinking to reflect this new cloud reality.

There's no better example of this than DR, or disaster recovery, exercises. Now, most ops folks, and I still sort of count myself as one, have tremendous levels of experience with disasters: planning for disasters, recovering from disasters, and in notable cases, causing disasters. The problem is that very often stories about how to handle disasters don't work in the real world. An easy example: you're running in us-east-1 and your disaster recovery approach is, "Oh, we're just going to spin up the site in us-west-2 in Oregon, great." Now, there are problems with that approach, but let's skip over a few of them and get to the interesting ones.

First, if you're doing this during a test and you spin up a bunch of us-west-2 instances or other services, great, that's probably going to work super well for the purposes of your test. The challenge, of course, is that when you're in the middle of an actual disaster, you are not the only person who has that strategy in mind for how they're going to handle it. So suddenly us-west-2 and other regions, and I don't mean to pick on Oregon in particular, are going to suffer from inrush issues. Very often that means that API calls to the cloud provider's control plane wind up becoming impacted and latencies start to increase. There have been scenarios in the past where it takes up to an hour to have instances come online after you've requested them. So if you need to have an active DR site ready to go, you have to pay for the ability to have those instances and other services already up and running.
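
If your plan assumes capacity already exists in the recovery region, it's worth verifying that continuously instead of discovering the gap mid-disaster. Here's a minimal sketch, assuming boto3 plus a hypothetical DR-Role=standby tagging convention and fleet size; substitute whatever you actually use:

```python
# Minimal sketch: count running warm-standby instances in the DR region and
# complain if the fleet is smaller than expected. The DR-Role tag and the
# expected count are hypothetical; substitute your own conventions.
import boto3

EXPECTED_STANDBY_COUNT = 4  # hypothetical fleet size for illustration

ec2 = boto3.client("ec2", region_name="us-west-2")
paginator = ec2.get_paginator("describe_instances")

running = [
    instance["InstanceId"]
    for page in paginator.paginate(
        Filters=[
            {"Name": "tag:DR-Role", "Values": ["standby"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    for reservation in page["Reservations"]
    for instance in reservation["Instances"]
]

if len(running) < EXPECTED_STANDBY_COUNT:
    print(f"DR standby shortfall: {len(running)}/{EXPECTED_STANDBY_COUNT} running")
```

Wire that into whatever alerting you already trust; the point is to find out before the disaster that your standby environment has quietly stopped existing.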

Secondly, if you're like most shops, you'll test your DR site every quarter or every year, and you'll find on the first pass that, "Oh, it didn't work, it broke immediately." So you go back, you fix the thing, you try it again, and it breaks differently. And after enough of this, you finally beat something together that works, and you call it done. You put the binder on the shelf where no one will ever read it again, and everything is just fine until the next commit breaks your DR plan again. The problem is that that's in the best of times, when there's no actual disaster. Trying to make that work in any reasonable way during a disaster in the middle of the night, when not everyone's firing on all cylinders, becomes a real problem.

I also strongly suggest that you don't approach business continuity planning, or BCP, the way that I did; it's why I stopped being invited to those meetings. The problem that we ran into was, "Okay, let's pretend for the sake of argument that San Francisco is no longer able to conduct business," to which my immediate response is, "Oh dear heavens, is my family okay?" "Yes, yes, your family's fine, everyone's fine, but magically we can't do any computer work." Okay, I struggle to identify with that, but all right, let's pretend I care that much about my job and not about my family. Cool. I understand everyone's family relationships are different, and for some folks that works.

All right, next step. Simultaneously, us-east-1 is completely unusable. "Okay, so let me get this straight: not only is San Francisco now magically not usable, but also roughly a hundred square miles of Northern Virginia is completely unusable. And at this point, I'm not hunkering down in a basement cowering, waiting for the end of days, because why exactly?" And the response was, "Just roll with it, it'll be fine. Now, we need to have a facility outside of the city for you to go to, in a different provider, with all the backups, so you can rehydrate this anew. And at the end of that project, we're going to be able to do this whenever we need to." At which point I stared at people for the longest time and said, "You get that we sell ads here, right? And furthermore, let's pretend that everything you say is true: us-east-1 is irreparably damaged, and I don't want to spend time with my family in a disaster like that because everyone's fine. Why do I still work here rather than going to make extortionate money as a consultant somewhere else that isn't prepared nearly as well as we are?" And then I wasn't invited to those meetings anymore.

One last angle that people tend to approach this stuff from is the idea that, well, the service needs an SLA, or service level agreement. Some AWS services have them, some do not. But they don't mean what you think they do. Route 53 famously has a 100% SLA. If they don't meet that, first, they owe you some small portion of your Route 53 bill, which, spoiler, is probably not a large pile of money. Secondly, because they've published that, everything else, including a number of AWS services themselves, almost certainly builds to that SLA. So it breaks, because everything breaks, it's what computers do, and they owe you some small pile of money, but the outage still impacts your site. No, you can't trust various SLA metrics as statements that services will never go down. You own your own availability. You can't outsource the responsibility for that to third parties, no matter how much you might want to.

It may sound like I'm suggesting that things in the cloud always break and that you shouldn't be in the cloud at all if you can't withstand an outage. I strongly disagree. There are reasons to stay with a cloud provider. First, they're going to diagnose and fix the problem with a far larger staff that is far better equipped to handle these issues than you'll be able to independently, in almost every case.

Secondly, if there's a massive disruption to a public cloud provider, then you're going to be in good company. The headlines are not going to be about your company's outage, they're going to be about the cloud provider. There's some reputational risk that gets mitigated as a direct result.

Finally, if all of that fails and you still go down and everyone makes fun of you for it, well, you can always go for consolation on Twitter.

This has been another episode of Networking in the Cloud. I'm cloud economist Corey Quinn. Thank you for joining us, and we'll talk soon. Thanks again to ThousandEyes for their sponsorship of this ridiculous podcast.

Announcer: This has been a HumblePod production. Stay humble.
