Networking in the Cloud Fundamentals, Part 1

Episode Summary

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to. Join me as I ramble on about why networking in the cloud doesn’t matter until it does, IPv4 and the emergence of IPv6, the anatomy of a network, how networks talk to one another, the difference between a network in a data center and a network in the cloud, why wire cutters are nature’s best firewall, and more.

Episode Show Notes & Transcript

UDP. I'd make a joke about it, but I'm not sure you'd get it. 

This episode is sponsored by ThousandEyes. Think of ThousandEyes as the Google Maps of the internet. Just like you wouldn't dare leave San Jose to drive to San Francisco without checking if 101 or 280 was faster (and yes, that's a very localized reference to the San Francisco Bay Area), businesses rely on ThousandEyes to see the end-to-end paths their apps and services are taking from their servers to their end users, to identify where the slowdowns are, where the pileups are hiding, and what's causing the issues. They use ThousandEyes to see what's breaking where and, importantly, they share that data directly with the offending service providers to hold them accountable and get them to fix the issue fast, ideally before it impacts end users. You'll be hearing a fair bit more about ThousandEyes over the next 12 weeks because Thursdays are now devoted to networking in the cloud. It's like Screaming in the Cloud, only far angrier.

We begin today with the first of 12 episodes. Episode one, the fundamentals of cloud networking. You can consider this the AWS Morning Brief, networking edition. So a common perception in the world of cloud today is that networking doesn't matter, and that perception is largely accurate. You don't have to be a network engineer the way that any reasonable systems or operations person did even 10 years ago, because in the cloud, the network doesn't matter at all until suddenly it does, at the worst possible time, and then everyone's left scratching their heads.

So let's begin with how networking works, because a computer in 2019 is pretty useless if it can't talk to other computers somehow. And for better or worse, Bluetooth isn't really enough to get the job done. Computers talk to one another over networks, basically by having a unique identifier. Generally, we call those IP addresses here in the path that this future has taken. In a different world, we would've gone with Token Ring and a whole bunch of other addressing protocols, but we didn't. Instead we went with IP, the unimaginatively named Internet Protocol, and specifically the current version of it, version four. We're not talking about IPv6 because, let's not kid ourselves, no one's really using that at scale despite everyone claiming that it's going to happen real soon now.

So there are roughly 4 billion IP addresses and change, and those are allocated throughout effectively the entire internet. When this stuff was built back when it was just defense institutions and universities on the internet, 4 billion seemed like stupendous overkill. Now it turns out that some people have 4 billion objects on their person that are talking to the internet and all chirping and distracting them at the same time when you're attempting to have a conversation with them.

So those networks are broken down into subnetworks or subnets, for lack of a better term. And they can range anywhere from a single IP address, which in CIDR, C-I-D-R parlance is a /32 to all 4 billion and change, which is a /0. Some common ones tend to be /24, which is 256 IP addresses, of which 254 are usable and you can expand that into 512 with a /23 and so on and so forth. The specific math isn't particularly interesting or important and it's super hard to describe without some kind of whiteboard. So smile, nod and move past that. So then you have all these different subnets. How do they talk to one another? I mean the easy way to think of it is, "Oh, I have one network, I plug it directly into another network and they can talk to each other."

Well, sure, in theory. In practice, it never works that way because those two networks are often not adjacent. They have to talk to something else and go through different hops to get from here to there to somewhere else, to somewhere else, and finally to the destination they care about. And when you take a look at the internet as being this network that spans the entire world, well, that turns into a super complicated problem because remember, the internet was originally designed to be something that could withstand a massive disruption, generally in terms of nuclear war, where effectively large percentages of the earth were no longer habitable. It had to be able to reroute around things, and routing is more or less how that wound up working.

The idea is that you could have different paths to get to the same destination, and that solves an awful lot. It's why the internet is as durable as it is, but it also explains why these things are terrible and why everyone is super quick to blame the network. One last thing to consider is network address translation. There are private IP address ranges that are not reachable over the general internet. Anything starting with a 10, for example: the entire 10/8 is considered private IP address space. Same with 192.168, anything in that range is as well, and anything between 172.16 and 172.20, give or take. If I'm wrong, don't at me. It's been a very long week. Translating those private IP addresses into public IP addresses is known as network address translation, or NAT. We're not going to get into the specifics of that at the moment, but just know that it exists.
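For anyone who does have a whiteboard handy, the subnet math and the private ranges above can be checked with Python's standard `ipaddress` module. This is a minimal sketch, not anything from the episode: the helper names `subnet_size` and `is_rfc1918` are made up for illustration. The three private blocks are the ones defined in RFC 1918, and the middle one is 172.16.0.0/12, which runs from 172.16 through 172.31.

```python
import ipaddress

def subnet_size(prefix: int) -> int:
    """Total number of IPv4 addresses in a network with this prefix length."""
    return ipaddress.ip_network(f"0.0.0.0/{prefix}").num_addresses

# The three private IPv4 blocks defined in RFC 1918.
RFC1918_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),   # 172.16.0.0 through 172.31.255.255
    ipaddress.ip_network("192.168.0.0/16"),
]

def is_rfc1918(address: str) -> bool:
    """True if the address falls inside one of the RFC 1918 private blocks."""
    ip = ipaddress.ip_address(address)
    return any(ip in block for block in RFC1918_BLOCKS)

# Usable hosts in a /24: hosts() leaves out the network and broadcast addresses.
usable_in_24 = sum(1 for _ in ipaddress.ip_network("10.0.1.0/24").hosts())

for prefix in (32, 24, 23, 0):
    print(f"/{prefix}: {subnet_size(prefix)} addresses")
print("usable hosts in a /24:", usable_in_24)
print("172.20.1.1 private?", is_rfc1918("172.20.1.1"))
print("172.32.0.1 private?", is_rfc1918("172.32.0.1"))
```

Each step down in prefix length doubles the address count, which is why a /23 holds 512 addresses where a /24 holds 256.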

Now, most of the traditional networking experience doesn't come from working in the cloud. It comes from working in data centers, a job that sucks, and some of the things that you learn doing that are tremendously impactful. They completely change how you view how computers work, and in the cloud, that knowledge becomes invaluable. So let's talk a little bit about what it looks like in the world of cloud, specifically AWS, because AWS had effectively five years of uninterrupted non-compete time where no one else was really playing with cloud. So by the time everyone else woke up, the patterns that AWS had established were more or less what other people were using. This is the legacy of Rip Van Winkling through five years of cloud. If you don't want me to talk about AWS and talk about a different company instead, that other company should have tried harder.

In the AWS context, they have something known as a virtual private cloud, or VPC, and planning out what your network looks like in those environments is relatively challenging because people tend to make some of the same mistakes here as they did in data centers. For example, something that has changed is that common wisdom in a data center was that anything larger than a /23, or a subnet that has 512 IP addresses in it, was a complete non-starter, because at that point that is a large enough subnet that your broadcast domain, or everything being able to talk to everything, is large enough that it was going to completely screw over your switch. It would get overwhelmed. You'd wind up with massive challenges and things falling over constantly, so having small subnets was critical.

Now in the world of cloud, that's not true anymore because broadcast storms aren't a thing that AWS and other reasonable cloud providers allow to happen. It winds up getting tamped down. There are rate limits. They do all kinds of interesting things that mean that this isn't really an issue. So if you want to have a massive flat network inside of a VPC, knock yourself out, you're not going to break anything, whereas if you're doing this in a data center, you absolutely will. So that's one of those things that needs to be adjusted as you start going from legacy on-premises environments into the world of cloud.

Another common network failure mode that hasn't changed is that putting subnets next to each other was kind of a thing. If you have a bunch of /24s, let's say a 10.0.1/24, anything ranging from 10.0.1.0 to 10.0.1.255 would be in that subnet, and people would naturally want to put a subnet right next to it, 10.0.2 and so on and so forth. The problem is that by packing them right next to each other, when one thing explodes, such as having a whole bunch more computers there than you thought, or hey, there's this container thing now where you're going to have a whole bunch of IP addresses tied to one computer, suddenly the entire pattern changes and you can't expand the subnet to make it bigger because you'll run into the next subnet up. Then you have to do data center moves, and this was freaking horrible.

Everyone hated it. No one liked it and nothing good came of it. So now, that's still a problem. You want to make sure that you can expand your subnet significantly without stomping into other ranges. Having to plan your network addresses inside of a VPC is still there. It's the sort of thing that you can do really easily and you think of it without even stopping for breath the second time you see it because the first time it happened to you, it leaves scars and you remember it.
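The packed-subnet problem is easy to see with a few lines of Python and the standard `ipaddress` module. This is only a sketch under made-up assumptions: the helper name `grow` and both address plans are hypothetical, picked to contrast subnets packed back to back with subnets that leave room to expand.

```python
import ipaddress

def grow(subnet: str, neighbors: list[str], new_prefix: int):
    """Widen `subnet` to `new_prefix` and report which neighbors it would stomp on."""
    grown = ipaddress.ip_network(subnet).supernet(new_prefix=new_prefix)
    conflicts = [n for n in neighbors if grown.overlaps(ipaddress.ip_network(n))]
    return grown, conflicts

# Packed plan: /24s sitting right next to each other.
packed_grown, packed_conflicts = grow(
    "10.0.2.0/24", ["10.0.1.0/24", "10.0.3.0/24"], new_prefix=23
)

# Spaced plan: each /24 starts on a /20 boundary, so it has room to grow.
spaced_grown, spaced_conflicts = grow(
    "10.0.16.0/24", ["10.0.0.0/24", "10.0.32.0/24"], new_prefix=20
)

print(packed_grown, "conflicts with", packed_conflicts)
print(spaced_grown, "conflicts with", spaced_conflicts)
```

In the packed plan, doubling 10.0.2.0/24 immediately collides with 10.0.3.0/24. In the spaced plan the same subnet can grow sixteen-fold without touching a neighbor, at the cost of leaving address space idle until it is needed.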

Something else that I love is that bad cables aren't a thing anymore in the world of cloud. How do you handle a cable? How do you crimp it appropriately? You can interview people working in data centers by asking them to tell you from memory what the pinout is on the Ethernet T568B standard, because there are eight wires inside of a Cat 5 cable and crimping them in the right order is incredibly important. Now, the answer is I don't care. I don't have to care. I don't know about these things, but you still remember them. For example, when you lose an entire day due to a bad cable, and it might be that the batch was bad, so then you wind up with these weird intermittent failures. The lesson you take from this is that when you throw away the bad cable, you cut it first with nature's best firewall, a wire cutter, because otherwise some well-meaning idiot, and yes, I've been that idiot, will take it out of the trash bag: "Oh, this cable looks fine. I'll put it back in the spare parts bin."

Then you're testing one bad cable with another bad cable and well it couldn't be the cable in that case and people go slowly mad. The pro move is to wind up having a network cable tester in the data center when you're building things out. Getting away from the hardware a bit because that's the whole point of cloud where we don't have to think about it anymore, you also have certain assumptions that get baked in inherently into anything that you're building when you're doing things on premises. You have to worry about things like cables failing. You don't generally think about that in terms of cloud. You have to worry about things like bottle-necking at the top of the rack switches, where you have a whole bunch of systems that are talking to each other at one gigabit per second and you have a 10 gigabit link between racks.

Well, okay. More than 10 servers talking to one another are not going to fit over that link, so you have to worry about bandwidth constraints. You have to worry about cables failing and, well, how are we going to fix that? How are we going to route over to a secondary set of cables? In terms of cloud, you generally don't have to think about most, if any, of that. Even things like drive failures wind up not being an issue, let alone cabling issues. It just sort of slips below the surface of awareness. Same story with routing inside of AWS. Route tables are relatively simplistic things in the world of cloud compared to any sort of routing situation that you have in the world of data centers. The reason behind that is that you're not having 15 AWS accounts all routing through each other to get from one end of your network to another.

If you are, for God's sake, stop it and do literally anything else other than what you're currently doing because it's awful. This seems like a good point to pause and talk a little bit more about ThousandEyes, which is not abjectly awful. They tend to focus on a couple of different things. The first is consumer digital experience, I think in terms of SaaS providers. They care about providing visibility to global network storage because consumers don't wait for anything anymore. I know I'm impatient and most people I know are too. If they're not, I've probably gotten impatient and stopped waiting around for them. If Netflix is slow, people move to Hulu. If Uber isn't loading, we'll take a Lyft. If Twitter's down, they're on to Facebook or going somewhere profoundly more racist if they can find it. So businesses who simply wouldn't exist without the internet absolutely rely on ThousandEyes to give them visibility into effectively every potential point of failure along the service delivery chain.

So when things break, because things always break, welcome to computers, they aren't wasting precious time in war rooms trying to figure out whose fault it is. The second type of customer that tends to bring ThousandEyes in is for the employee digital experience side of the house. We're all on Office 365, Salesforce, WebEx, Zoom, or other things that don't work, Zendesk, JIRA, GitHub, et cetera, et cetera. Because if employees can't get their job done, you're paying an awful lot of expensive people to sit around twiddling their thumbs and complaining about not being able to do work. So internal IT teams who manage massive SaaS deployments use ThousandEyes for visibility into what's breaking where. We're going to hear a lot more about ThousandEyes over the next 12 weeks or so, but check them out. My thanks to them for supporting this ridiculous run of my ridiculous podcast.

Let's talk about Network ACLs or NACLs. I don't care how you pronounce it. I'm sure one of us is wrong and I just don't care anymore. They're a terrible idea, and the reason they're a terrible idea, whether in the cloud or in an on-premises environment, is that people forget they're there. Amazon's guidance is to set the default NACL and never touch it again, and the problem there is that NACLs plus routing plus subnet groups plus security groups on top of that all replicate data center tiers of complexity. We really don't need that anymore. If you take a look at any on-premises environment that's migrated as a lift and shift to the cloud, you'll notice that they have a tremendously complicated AWS network that doesn't need to be that complicated. They're just replicating what they had in their rusted iron data centers in the world of cloud.

You don't need that. In the cloud, internal networks can be largely flat because security is no longer defined based upon what IP address something has, but by the roles that it assumes. You can move up the stack and get better levels of access control without having to depend on network borders being the thing that keeps your stuff safe. Something else to consider is that private versus public subnets aren't really a thing in the on-prem world. It's just a subnet that does or doesn't route to the internet. In the cloud, they're absolutely different because a private subnet has things in it that have no public IP, and things in public subnets do have public IP addresses, and that's how you accidentally expose your database to the entire internet. If you do have things in a private subnet that need to talk to the internet, that's where NAT comes in.

We mentioned that a little bit earlier in this episode. You used to have to run NAT instances yourself, which was annoying, and then AWS came out with the managed NAT gateway, which is of the freaking devil because it charges a $0.045-per-gigabyte data processing fee on everything passing through it. That's not data transfer. That is data processing because, once again, it's of the devil. If you have this at any significant point of scale, for God's sake, stop. I'll be ranting about this in a later episode in this mini-series, so we're going to put a pin in that for now before I blow the microphone and possibly some speakers. Lastly, something else that tends to get people in trouble is heartbeat protocols, where you have two routers in an on-premises environment, one active, one on standby. If you take a look at failure analysis, the most common cause of failures is not a router failing.

It's the thing that keeps them in check where they're talking to one another to ensure that both are still working, that thing fails. So then you have two routers vying for control. They aren't talking to one another anymore and it brings your entire site down. That's not a great plan. Consider not doing that in the world of cloud if you can avoid it.
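Circling back to the managed NAT gateway for a moment, here is a back-of-the-envelope sketch of what that data processing fee does at scale. The $0.045 per gigabyte rate is the one quoted in this episode and may have changed since. The function name and the monthly traffic figures are made up for illustration, and the gateway's hourly charge and ordinary data transfer charges are deliberately left out.

```python
from decimal import Decimal

# Rate quoted in the episode; treat it as an assumption and check current pricing.
NAT_PROCESSING_USD_PER_GB = Decimal("0.045")

def nat_processing_cost(gigabytes: int) -> Decimal:
    """Data processing fee only: no hourly charge, no data transfer charges."""
    return NAT_PROCESSING_USD_PER_GB * gigabytes

# Hypothetical monthly volumes pushed through the gateway, in gigabytes.
for label, gigabytes in [("1 TB", 1_000), ("100 TB", 100_000), ("1 PB", 1_000_000)]:
    print(f"{label}/month: ${nat_processing_cost(gigabytes):,.2f}")
```

The fee is linear in traffic, so it is invisible on a small workload and painful on a large one.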

So in conclusion, the network does matter, but if you do it right, it doesn't matter as much as it once did. That said, in the next 11 weeks, we're going to talk through exactly why and how it matters. I'm cloud economist Corey Quinn. Thank you for joining me and I'll talk to you about the network some more next week.
