Episode Show Notes & Transcript
Corey: Hello, and welcome back to our Networking in the Cloud miniseries sponsored by ThousandEyes. That's right. ThousandEyes' State of the Cloud Performance Benchmark Report is now available for your perusal. It's really providing a lot of the baseline that we're taking all of the miniseries information from. It pointed us in a bunch of interesting directions and helps us tell stories that are actually, for a change, backed by data rather than pure sarcasm. To get your copy, visit snark.cloud/realclouds because it only covers real cloud providers. Thanks again to ThousandEyes for their ridiculous support of this shockingly informative podcast miniseries.
It's a basic fact of cloud that things break all the time. I've been joking for a while that a big competitive advantage that Microsoft brings to this space is that they have 40 years of experience apologizing for software failures, except that's not really a joke. It's true. There's something to be said for the idea of apologizing to both technical and business people about real or perceived failures being its own skillset, and they have a lot more experience than anyone in this space.
There are two schools of thought around how to avoid having to apologize for service or component failures to your customers. The first is to build super expensive but super durable things, and you can kind of get away with this in typical data center environments right up until you can't, and then it turns out that your SAN just exploded. You're really not diversifying with most SANs. You're just putting all of your eggs in a really expensive basket, and of course, if you're still hit with a power or networking outage, nothing can talk to the SAN, and you're back to square one.
The other approach is to come at it with a perspective of building in redundancy to everything and eliminating single points of failure. That's usually the better path in the cloud. You don't ever want to have a single point of failure if you can reasonably avoid it, so going with multiple everythings starts to make sense to a point. Going with a full on multi-cloud story is a whole separate kettle of nonsense we'll get to another time. But you realize at some point you will have single points of failure and you're not going to be able to solve for that. We still only have one planet going around one sun for example. If either of those things explode, well, computers aren't really anyone's concern anymore. However, betting the entire farm on one EC2 instance is generally something you'll want to avoid if at all possible.
In the world of AWS, there aren't data centers in the way that you or I would contextualize them. Instead, they have constructs known as availability zones, and those compose to form a different construct called regions. Presumably, other cloud providers have similar constructs over in non-AWS land, but we're focusing on AWS's implementation in this series, again, because they have a giant multi-year head start over every other cloud provider, and even that manifests in those other cloud providers comparing what they've built and how they operate to AWS. If that upsets you and you work at one of those other cloud providers, well, you should have tried harder. Let's dive into a discussion of data centers, availability zones, and regions today.
Take an empty warehouse and shove it full of server racks. Congratulations. You have built the bare minimum requirement for a data center at its most basic layer. Your primary constraint and why it's a lot harder than it sounds is power, and to a lesser extent, cooling. Computers aren't just crunching numbers, they're also throwing off waste heat. You've got to think an awful lot about how to keep that heat out of the data center.
At some point, you can't shove more capacity into that warehouse-style building just because you can't cool it if it's all running at the same time. If your data center's particularly robust, meaning you didn't cheap out on it, you're going to have different power distribution substations that feed the building from different lines that enter the building at different corners. You're going to see similar things with cooling as well, multiply redundant cooling systems.
One of the big challenges, of course, when dealing with this physical infrastructure is validating that what it says on the diagram is what's actually there in the physical environment. That can be a trickier thing to explore than you would hope. Also, if you have a whole bunch of systems sitting in that warehouse and you take a power outage, well, you have to plan for this thing known as inrush current.
Normally, it's steady state. Computers generally draw a known quantity of power. But when you first turn them on, if you've ever dealt with data center servers, the first thing they do is they power up everything to self-test. They sound like a jet fighter taking off as all the fans spin up. If you're not careful, and all these things turn on at once, you'll see a giant power spike that winds up causing issues, blowing breakers, maxing out consumption, so having a staggered start becomes a concern as well. Having spent too much time in data centers, I am painfully familiar with this problem of how you safely and sanely recover from site-wide events, but that's a bit out of scope, thankfully, because in the cloud, this is less of a problem.
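To make the staggered start idea concrete, here is a minimal sketch, not anything a real facility necessarily uses. It assumes a simplified model in which a server that is already running draws a fixed steady-state wattage and a server that is powering on briefly draws a higher inrush wattage. The function name and all of the wattage figures are hypothetical.

```python
def staggered_start_batches(server_count, steady_watts, inrush_watts, budget_watts):
    """Return batch sizes so that every power-on step stays within the budget.

    Simplified, hypothetical model: each running server draws steady_watts,
    and each server in the batch currently powering on draws inrush_watts.
    """
    if inrush_watts > budget_watts or server_count * steady_watts > budget_watts:
        raise ValueError("the power budget cannot support this fleet")

    batches = []
    running = 0
    while running < server_count:
        # Power left over after the servers that are already up.
        headroom = budget_watts - running * steady_watts
        batch = min(server_count - running, int(headroom // inrush_watts))
        if batch == 0:
            raise ValueError("no headroom left to start another server")
        batches.append(batch)
        running += batch
    return batches


# Made-up numbers: 20 servers, 400 W steady, 1000 W inrush, 10 kW budget.
print(staggered_start_batches(20, 400, 1000, 10000))
```

With those made-up numbers, turning everything on at once would ask for 20 kW against a 10 kW budget, which is the breaker-blowing spike described above. The batches shrink as more servers come up because the running machines eat into the headroom.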
Let's talk about the internet and getting connectivity to these things. This is the Networking in the Cloud podcast after all. You're ideally going to have multiple providers running fiber lines to that data center hoping to avoid fiber's natural predator, the noble backhoe. Now, ideally, all those fiber lines are going over different paths, but again, hard thing to prove, so doing your homework's important, but here's something folks don't always consider: If you have a hundred-gigabit Ethernet link to each computer, which is not cheap, but doable, and then you have 20 servers in a rack, each rack theoretically needs to be able to speak at least two terabits per second at all times to each other server in each other rack, and most of them can't do that. They wind up having bottlenecking issues.
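The arithmetic behind that two terabit figure is worth spelling out. This is a small sketch: the server count and link speed come from the example above, while the 400 Gbps rack uplink is a made-up number used only to show what a bottleneck looks like.

```python
def rack_demand_gbps(servers_per_rack, link_gbps):
    """Worst-case traffic a rack could offer if every server saturates its link."""
    return servers_per_rack * link_gbps


def oversubscription_ratio(servers_per_rack, link_gbps, uplink_gbps):
    """How many times the rack's potential demand exceeds its uplink capacity."""
    return rack_demand_gbps(servers_per_rack, link_gbps) / uplink_gbps


# 20 servers at 100 Gbps each, as in the example above.
demand = rack_demand_gbps(20, 100)
print(demand, "Gbps of potential demand per rack")

# Hypothetical 400 Gbps uplink out of the rack.
print(oversubscription_ratio(20, 100, 400), "to 1 oversubscription")
```

Any ratio above 1 means the servers in the rack can collectively ask for more bandwidth than the rack can carry to the rest of the building.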
As a result, when you have high-traffic applications speaking between systems, you need to make sure that they're aware of something known as rack affinity. In other words, are there bottlenecks between these systems, and how do you minimize those to make sure the crosstalk works responsibly? There are a lot of dragons in here, but let's hand-wave past all of it because we're talking about cloud here. The point of this is that there's an awful lot of nuance to running data centers, and AWS and other large cloud providers do a better job of it than you do. That's not me insulting your data center staff. That's just a fact. They have the scale and the staff and the expertise of running these things operationally that very few other companies are going to be able to touch.
Sure, if you're Facebook, you probably have some expertise in this as well, and a lot of this won't apply to you, but you know that already. If you're wondering whether what I'm talking about here applies to your environment, unless you know for a fact it doesn't, it does. Assume that. This impacts your approach in the cloud, the networking, durability, and the concept of blast radius, and the forms that AWS gives us that wrap these concepts for us are twofold, and I want to cover them today: regions and availability zones.
An availability zone, or A-Zee, or A-Zed if you're not in the United States, is effectively a set of data centers, and yes, that's plural. It's not just different racks with different power buses in the same room. AWS tries to guarantee that there is no shared power, network, or control plane between availability zones, but you can expect some issues to impact an entire availability zone as a result. Ergo, if you're building something important, you're going to want it to be in at least multiple availability zones.
To that end, here's a fun fact that trips up nearly everyone the first time they see it. If you have an AWS account, you might see that there's an outage in a particular availability zone, us-west-2a. Meanwhile, in my account, I see an outage in us-west-2c. Who's right in that scenario? Well, we both are because availability zone names aren't consistent between AWS accounts.
Relatively recently, about a year ago as of this recording, they announced a zone ID that is consistent between accounts, but people still don't talk about it in those terms. They're still talking with the old style region us-west-2 followed by a letter for the availability zone. You still have to disambiguate those back to zone IDs with an extra step. That also doesn't solve the problem for you because note that even with indirect issues that you're seeing, they can still impact other availability zones, even with a completely separate control plane because if you have things that are running in two availability zones for your application, and one of those availability zones drops off the internet for a while, you're suddenly seeing twice the load in the availability zone that's still working. You're also probably not the only customer that has planned for this and has built out in multiple availability zones, so other folks are going to be seeing the exact same type of behavior.
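For the extra disambiguation step, here is a minimal sketch of mapping an account's lettered zone names back to zone IDs. It assumes the boto3 library and working AWS credentials for the live lookup, which uses the EC2 DescribeAvailabilityZones call. The sample response and the specific name-to-ID pairings in it are invented for illustration, since the real pairings differ from account to account.

```python
def zone_name_to_id(response):
    """Map per-account zone names to the account-independent zone IDs."""
    return {
        zone["ZoneName"]: zone["ZoneId"]
        for zone in response["AvailabilityZones"]
    }


def fetch_zone_mapping(region_name):
    """Live lookup. Assumes boto3 is installed and credentials are configured."""
    import boto3  # imported lazily so the helper above works without it

    ec2 = boto3.client("ec2", region_name=region_name)
    return zone_name_to_id(ec2.describe_availability_zones())


# Invented sample shaped like a DescribeAvailabilityZones response.
sample_response = {
    "AvailabilityZones": [
        {"ZoneName": "us-west-2a", "ZoneId": "usw2-az2"},
        {"ZoneName": "us-west-2b", "ZoneId": "usw2-az1"},
        {"ZoneName": "us-west-2c", "ZoneId": "usw2-az3"},
    ]
}

mapping = zone_name_to_id(sample_response)
print(mapping["us-west-2a"])
```

Two accounts comparing notes during an outage would each run the lookup and compare zone IDs rather than letters.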
As a result, failures will then cascade and manifest as slow performance in the good availability zones, and it's super hard to plan for. It's also super hard to detect, which brings us back to our sponsor, ThousandEyes. ThousandEyes provides a global observer perspective on what's going on internet-wide with a bunch of different providers. It helps answer the question when one of these incidents hits of "is it my code, is it the last deployment that we did, or is there something global that's causing this problem?"
ThousandEyes provides that global observer perspective that helps you figure out immediately was it your code or was it something infrastructure-based because if it's not your code and it is infrastructure-based, suddenly, you can stop looking at everything you just shipped to production and instead look at mitigating this according to an established DR plan. Thanks again to ThousandEyes for sponsoring this. To learn more, visit thousandeyes.com, and tell them Corey sent you. They seem to like me for some reason that we really can't tell.
You finally build something that's in multiple availability zones, but as mentioned, that cascade effect can be a challenge, so this is where we get into the idea of multiple regions. A region is two or more availability zones, usually three or more, but there are some legacy stories behind that, and those are separated by very large distances. In the United States, for example, there's one in Oregon, one in Ohio, one in Virginia, and one, sort of, in Northern California. The challenge, of course, is building applications that work spanning multiple regions, and there are a couple of issues with this. If you're trying to tilt at the windmill of multi-cloud, I would strongly encourage you to start by going multiple region in a single provider first.
This removes a lot of the finicky bits of multi-cloud, like the services work differently, and if you pick the right regions, you'll have a one-to-one affinity between all of the different services, and that's awesome. Once you wind up getting multiple regions online, then you start to see a lot of the challenges with this approach. Different things experience latency very differently.
There's also data transfer cost to consider. Whether data is traversing between regions or between availability zones, it does incur an additional cost, so anything with a high replication factor is going to be of some concern. We'll talk specifically about data transfer costs in a future episode. Additionally, if you're in a single provider and going multiple regions, they do have dedicated links between their regions that usually wind up providing better performance and faster speeds than you're going to see traversing the general internet, but see the previous episode on Global Accelerator to figure out a little bit more about some of the caveats there.
One thing to also consider is that because AWS does have a severed control plane that does not extend to multiple regions, there are two things that this impacts. The first is that we have never yet seen a networking event that traverses more than one region. The counterpoint is that not all services are available in all regions, so make sure that you wind up selecting appropriate regions based upon their region service availability table.
Further, you're also going to want to make sure that the pricing aligns with it. The region in Northern California, for example, doesn't have nearly as many availability zones as the rest, and everything in that region does tend to cost more, so pay attention to that. There are two other regions that were announced, or region-like things that were announced recently at AWS re:Invent.
The first is the local zone, which is a different type of availability zone. They only have one so far. It is generally available in preview, which means that words no longer have the same meaning anymore, and it's an extension into Los Angeles of the region based in Oregon, us-west-2. This enables companies and other organizations in Los Angeles to have lower latency access to AWS resources when effectively tens of milliseconds or less matter for certain workloads.
It's fascinating, but it does suffer from a lack of durability that you're going to see in a fully baked region, so use it if you have to, but if you can avoid it, you're potentially saving yourself some ops burden down the road. They've also taken their Outposts, which are fundamentally just racks full of AWS equipment that you can now rent and put in your facility, and done some partnerships with cell companies. In the United States, they've started with Verizon, and they're exploring 5G and calling this AWS Wavelength. This is relevant if and only if you're looking at building 5G type applications in partnership with Verizon. Most folks aren't, so it's not going to be super relevant, but it is a type of global infrastructure to pay attention to.
Fundamentally, understanding the differences between regions and availability zones in AWS, or their equivalent in other providers, is going to be critical for planning for your DR type of tests. It's super unfortunate to wind up testing your DR plan and finding everything works, and then using it during an actual outage and discovering everything's slow and the provisioning takes forever because everyone has the same plan that you do. Take some time, make sure that you understand these region and availability zone concepts when you're building out your infrastructure plan, and ideally, everything goes a lot more smoothly for you.
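The capacity side of that planning comes down to simple arithmetic, sketched here under one big assumption: traffic is spread evenly across zones and shifts evenly onto the survivors when a zone drops out. The function names are hypothetical, and the sketch covers only your own load, not the other customers failing over into the same zones at the same time.

```python
def load_multiplier(total_zones, failed_zones=1):
    """Factor by which each surviving zone's load grows after a failure."""
    surviving = total_zones - failed_zones
    if surviving <= 0:
        raise ValueError("at least one zone must survive")
    return total_zones / surviving


def max_safe_utilization(total_zones, failed_zones=1):
    """Highest steady-state utilization that still fits after the failure."""
    return 1 / load_multiplier(total_zones, failed_zones)


for zones in (2, 3, 4):
    print(zones, "zones:", load_multiplier(zones), "x load on each survivor")
```

With two zones each survivor sees double the load, so anything running above half capacity in normal operation is already in trouble. Adding a third zone drops the multiplier to one and a half.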
That's all I've got to say on this particular topic. If you have questions, please feel free to ask them. On Twitter, I'm QuinnyPig. That's Q-U-I, double N, Y Pig, and I'll do my best to either answer them myself or point you to someone smart who can answer them more authoritatively. If you've enjoyed this podcast, please leave a five-star review in Apple Podcasts. If you've hated this podcast, please leave a five-star review in Apple Podcasts and a funny comment, so I have something to laugh at while crying. I'm cloud economist Corey Quinn, and I'll talk to you more next week.
Announcer: This has been a HumblePod production. Stay humble.