Networking in the Cloud Fundamentals, Part 4

Episode Summary

Join me as I continue my series on cloud fundamentals with a high-level exploration of load balancers that includes what they do, how they work, how they prioritize requests (e.g., round robin and weighted round robin), the differences between load balancing in a region and load balancing on a global scale, how lots of redundancy is often a major driver of outages, how the right combination of AWS tools can support global load balancing, the five dimensions of a load balancer capacity unit, and more.

Episode Show Notes & Transcript

About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.


An IPv6 packet walks into a bar. Nobody talks to it.

Welcome back to what we're calling Networking in the Cloud, a 12-week networking extravaganza sponsored by ThousandEyes. You can think of ThousandEyes as the Google Maps of the internet. Just like you wouldn't dare leave San Jose to drive to San Francisco without checking to see if the 101 or the 280 was faster, businesses rely on ThousandEyes to see the end-to-end paths their apps and services are taking and for localized traffic stories that mean nothing to people outside of the Bay Area. This enables companies to figure out where the slowdowns are happening, where the pile-ups are and what's causing issues. They use ThousandEyes to see what's breaking where, and importantly they share that data directly with the offending service providers to hold them accountable in a blameless way and get them to fix the issue fast, ideally before it impacts their end users.

Learn more at And my thanks to them for sponsoring this ridiculous podcast mini-series.

This week we're talking about load balancers. They generally do one thing and that's balancing load, but let's back up. Let's say that you, against all odds, you have a website and that website is generally built on a computer. You want to share that website with the world, so you put that computer on the internet. Computers are weak and frail and often fall over invariably at the worst possible time. They're herd animals. They're much more comfortable together. And of course, we've heard of animals. We see some right over there.

So now you have a herd of computers that are working together to serve your website. The problem now of course, is that you have a bunch of computers serving your website. No one is going to want to go to to view your site. They want to have a unified address that just gets to wherever it has to happen. Exposing those implementation details to customers never goes well.

Amusingly, if you go to Deloitte, the giant consultancy's website, the entire thing lives at But I digress. Nothing says we're having trouble with digital transformation quite so succinctly.

So you have your special computer or series of computers now that live in front of the computers that are serving your website. That's where you wind up pointing to, or towards. Those computers are specialized and they're called load balancers because that's exactly what they do; they balance load, it says so right there on the tin. They pass out incoming web traffic to the servers behind the load balancer so that those servers can handle your website while the load balancer just handles being the front door that traffic shows up through.

This unlocks a world of amazing possibilities. You can now, for example, update your website or patch the servers without taking your website down with a back in five minutes sign on the front of it. You can test new deployments with entirely separate fleets of servers. This is often called a blue green deploy or a red black deploy, but that's not the important part of the story. But you can start bleeding off traffic to the new fleet and, "Oh my god, turn it off, turn it off, turn it off. We were terribly wrong. The upgrade breaks everything." But you can do that; turn traffic on, turn traffic off to certain versions and see what happens.

Load balancers are simple in concept but they're doing increasingly complicated things. For instance, you're a load balancer. How do you determine which of the 200 servers that you're in front of that all do the same thing because they have the same website and the same application code running on them, how do you determine which one of those receives the next incoming request?

There are a few patterns that are common. The first and maybe the simplest is called round robin. You'll also see this referred to as next in loop. Let's say you have four web servers. Your first request goes to server one. Your second request goes to server two. Then server three and server four, and the fifth request goes back to server one. It just rotates through the servers in order and passes out requests as they come in.
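The rotation described above fits in a few lines. This is a minimal sketch, not how any particular load balancer implements it; the server names and the `round_robin` helper are made up for illustration.

```python
from itertools import cycle


def round_robin(servers):
    """Return an endless iterator that walks the server list in order."""
    return cycle(servers)


# Hypothetical backend names, standing in for four web servers.
servers = ["server1", "server2", "server3", "server4"]
picker = round_robin(servers)

# Hand out five requests: the fifth wraps around to the first server.
first_five = [next(picker) for _ in range(5)]
print(first_five)
```

Note that nothing here looks at how busy a server is, which is exactly the weakness of plain round robin.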

This can work super well for some use cases, but it does have some challenges. For example, if one of those servers gets stuck or overloaded, piling more traffic onto it is very rarely going to be the right call. A modification of round robin is known as weighted round robin, which works more or less the same way, but it's smarter. Certain servers can get different percentages of the traffic.

Some servers, for example across a wide variety of fleets, can be larger than others and can consequently handle more load. Other servers are going to have a new version of your software or your website and you only want to test that on 1% of your traffic to make sure that there's nothing horrifying that breaks things because you'd fundamentally rather break things for 1% of your users than 100% of your users. Ideally you'd like to break things for 0% of your users, but let's keep this shit semi-real, shall we?
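One naive way to sketch weighted round robin is to repeat each server in the rotation according to its weight. The weights and names below are assumptions for illustration; production load balancers typically interleave the picks more smoothly rather than sending them in bursts like this.

```python
from itertools import cycle


def weighted_round_robin(weights):
    """weights maps server name -> positive integer weight.

    Each server appears in the rotation as many times as its weight.
    """
    expanded = [server for server, weight in weights.items() for _ in range(weight)]
    return cycle(expanded)


# Hypothetical fleet: 99 parts of traffic to the stable version, 1 part to the canary.
weights = {"stable": 99, "canary": 1}
picker = weighted_round_robin(weights)

one_cycle = [next(picker) for _ in range(100)]
print(one_cycle.count("stable"), one_cycle.count("canary"))
```

Over any full cycle of 100 requests, the canary sees exactly one of them, which is the 1% test described above.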

You can also go with the least loaded metric type of approach. Some smarter load balancers can query each backend server or service that they're talking to about its health and get back a metric of some kind. If you wire logic into your application where it says how ready it is to take additional traffic, load balancers can then start making intelligent determinations as to which server to drop traffic onto next.
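As a rough sketch of the least loaded idea: assume each backend reports a load figure between 0 and 1, and the load balancer picks the smallest. The metric, the numbers, and the `pick_least_loaded` name are all hypothetical; a real application would decide for itself what "ready for more traffic" means.

```python
def pick_least_loaded(reported_load):
    """reported_load maps server name -> load reported by that server (lower is better)."""
    return min(reported_load, key=reported_load.get)


# Made-up readings, as if each backend had just answered a health query.
reported_load = {"server1": 0.82, "server2": 0.35, "server3": 0.60}
choice = pick_least_loaded(reported_load)
print(choice)
```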

Probably one of the worst methods you can use to determine how to pass out traffic from a load balancer is random, which does exactly what you'd think because randomness isn't. There are invariably going to be clusters and hotspots, and the entire reason you have a load balancer is to not have to deal with hotspots: one server's overloaded and screaming while the one next to it is bored, wondering what the point of all of this is.

There are other approaches too that offer more deterministic ways of sending traffic over to specific servers. For example, taking the source IP address that a connection is coming from and hashing that. You can do the same type of thing with specific URLs where the hash of a given URL winds up going to specific backend services.
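Here is a minimal sketch of hashing a source IP to pick a backend. The `pick_by_hash` helper, the server names, and the choice of SHA-256 are assumptions for illustration; the same function works if you pass a URL instead of an IP address.

```python
import hashlib


def pick_by_hash(key, servers):
    """Map a key (source IP, URL, ...) to the same server every time."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Turn the first 8 bytes of the digest into an integer, then wrap it onto the list.
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


servers = ["server1", "server2", "server3", "server4"]
first = pick_by_hash("203.0.113.7", servers)
second = pick_by_hash("203.0.113.7", servers)
print(first == second)
```

One caveat with this simple modulo approach: adding or removing a server changes `len(servers)` and reshuffles most clients onto different backends.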

Why would you necessarily want to do that? Well, in an ideal world, each of those servers is completely stateless and each one can handle your request as well as any others. Here in the real world, things are seldom that clean. You'll find yourself very often with state living inside of your application. So if you have a backend server that handles your first request and then your next request goes to a different backend server, you could be prompted to log in again and that becomes really unpleasant for the end user experience.

The better approach generally is to abstract that login session into something else like ElastiCache or Redis or memcached or Route 53. But there are a lot of ways to skin that cat that are all off topic. But some sites do indeed use a hashing algorithm to deterministically drive the same connection to the same server. This is known incidentally as sticky sessions. The idea being that you want to make sure that you have the same server handling each request from a given client. It's not ideal, but being able to ensure that persistence is important to some workloads and I'm not going to sit here casting blame at all of them, just some of them, you know who you are.

And there are a few other approaches too that we're not going to go too far into. You can, for example, use least connections: whichever server currently has the fewest active connections, drive traffic there. That could cause problems when something has just been turned on and is just spinning up and suddenly it gets a bunch of traffic dropped on it before it's ready.
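A sketch of least connections, assuming the load balancer keeps its own count of open connections per server. The `LeastConnections` class and its method names are hypothetical.

```python
class LeastConnections:
    """Track open connections per server and send new ones to the quietest server."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self):
        # Pick the server with the fewest open connections (first one wins ties).
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call this when a connection to the server closes.
        self.active[server] -= 1


lb = LeastConnections(["server1", "server2"])
a = lb.acquire()
b = lb.acquire()
lb.release("server1")
c = lb.acquire()
print(a, b, c)
```

The spin-up problem mentioned above is visible here: a brand-new server starts at zero connections, so it wins every pick until it catches up.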

And of course the worst of all worlds is fastest to respond, where you send a connection request to all of the servers and the first one to respond winds up winning it. That is a terrific way to wind up incentivizing all your servers to compete against one another. Try that on employees and let me know how that one goes before trying it on computers.

Now, none of those approaches want to drive traffic to servers that are unhealthy, so they'll perform what are known as health checks. In other words, every 5 seconds or 30 seconds or however often it's configured, you will see a load balancer doing a health check on all of its listening instances. Now, the fun part there is those health checks show up in the logs as the load balancer tries to validate continually that those instances are ready to receive traffic. If it's polling for specific metrics about how ready an instance is, that can be a little heavier. But one of the more annoying parts is if you look at your server logs for a relatively un-trafficked site, you'll see that the vast majority of your log data ends up being load balancer health checks, which is not just annoying, but it also becomes super expensive if you're paying a service to ingest your logs.
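A health check pass can be sketched as filtering the server list through a check function. Everything here is an assumption for illustration: `healthy_servers` is a made-up helper, and `fake_check` stands in for whatever real probe (an HTTP request to a health endpoint, say) the load balancer would run on its configured interval.

```python
def healthy_servers(servers, check):
    """Return the servers whose check passes; a check that raises counts as a failure."""
    healthy = []
    for server in servers:
        try:
            ok = check(server)
        except Exception:
            ok = False
        if ok:
            healthy.append(server)
    return healthy


def fake_check(server):
    # Hypothetical stand-in for a real probe: server3 times out, server2 reports unhealthy.
    if server == "server3":
        raise TimeoutError("no response")
    return server != "server2"


in_rotation = healthy_servers(["server1", "server2", "server3"], fake_check)
print(in_rotation)
```

Whichever picking strategy is in use would then choose only from `in_rotation`. Real load balancers usually also want several consecutive failures or successes before changing a server's status, which this sketch leaves out.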

This message is sponsored by Splunk. I'm just kidding. It's sponsored by ThousandEyes who does not charge you for log ingest. In fact, they're not charging you at all for last week's state of the cloud performance benchmark report. We've talked about this in recent weeks, but it's still there. It is now public. You can get your own copy. We'll be talking about aspects of it in the coming weeks.

But they took five production tier clouds, AWS, Azure, GCP, Alibaba and IBM Cloud. Oracle Cloud was again not invited because they only tested this with real clouds. To get your free copy of the report, visit That's And my thanks once again to ThousandEyes for sponsoring this ridiculous mini-series podcast.

So that more or less covers how load balancing in a given region tends to work. Let's talk about global load balancing for a bit. Just because individual computers are fragile, individual data centers or individual cloud regions are also fragile in interesting ways. If you try and build a super redundant localized data center, very often the number one cause of outages is the redundancy stuff that you've built. Two different things are each convinced that they're now the one in charge and they wind up effectively yanking the rug out from each other. There's a whole series of failure modes there that are awful.

For things that are sufficiently valuable to the business you don't want to be dependent on any one facility or any one region. So the idea is you want to have something that balances load globally. Now, often you're going to use something like DNS or Anycast to wind up routing to various environments. Usually those environments are themselves offering up load balancers again that then in turn pass it out to individual servers.

The problem, of course, with doing anything on a localized basis that also works globally is that things like DNS or Anycast wind up being subject to lag. They can also be subject to caching depending on how they work. So you're not going to be able to quickly turn off a malfunctioning region, but you don't generally have to move as quickly for that as you do for a single malfunctioning server. So usually a mix of approaches is the right answer.

Let's talk specifically about what AWS offers in this space because once again, they are the 800 pound gorilla in the cloud space. If this offends you and you'd rather we talk about a different cloud provider, well, that cloud provider is welcome to start gaining market share to the point where they're the big player in the space and then we'll talk about them instead.

AWS does offer a few things at a global level. You can use CloudFront which is their CDN and that picks between a number of different origins based upon a variety of factors. Route 53's DNS offering when not being used as a database with my misuse approach offers interesting load balancing options as well. Global Accelerator can pick healthy end points where you can terminate your traffic. But after using all of those, once you hit a localized region, you probably want to use something else and the three most common options are all Amazon's Elastic Load Balancing offerings.

Now originally there was just one called Elastic Load Balancer, ELB, that these days is called ELB Classic, which is because AWS has problems with their marketing team when they try and call it ELB old and busted. It has some limitations. It only scales so far. It requires pre-warming, namely it needs to have load pass through it before it scales up and can handle traffic efficiently. Otherwise, if you drop a whole bunch of traffic on it when it's not prepared for this and hasn't been pre-warmed, its response is, "Oh, shit, load," and then TCP terminates on the floor. Everyone's having a bad day when that happens.

So AWS looked at this and saw that it was good and then thought, "Okay, how can we A, create better offerings and B, make the billing suck far more?" They came up with two different offerings. One was the ALB, or Application Load Balancer, and the other was the NLB, the Network Load Balancer. Those two things split the world and complicated the living hell out of the billing system because instead of the ELB Classic's charge model of per hour that you're running a load balancer and per gigabyte of traffic that passes through it, now the new versions, ALBs and NLBs, charge per hour and per load balancer capacity unit.

There are five dimensions that comprise a load balancer capacity unit: new connections per second, active connections per minute, sustained traffic over a period of time and a few others. The correct answer to what will this cost me to run behind an NLB or an ALB is nobody freaking knows, try it yourself and see.
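To make the shape of that billing model a little more concrete, here is a sketch. It assumes you are billed on whichever dimension you use the most of in a given hour, plus a flat hourly charge. Every number below is a placeholder, not an AWS price or an AWS capacity figure, and the dimension names are made up; check the current pricing page for real values.

```python
def lcus_consumed(usage, per_lcu):
    """Both dicts are keyed by dimension name.

    The LCU count is driven by the dimension with the highest usage ratio.
    """
    return max(usage[dim] / per_lcu[dim] for dim in per_lcu)


def hourly_cost(usage, per_lcu, hourly_rate, lcu_rate):
    """Flat hourly charge plus the per-LCU charge for that hour."""
    return hourly_rate + lcus_consumed(usage, per_lcu) * lcu_rate


# Placeholder capacity of one LCU per dimension (NOT real AWS figures).
per_lcu = {"new_conn_per_sec": 10, "active_conn_per_min": 1000, "gb_per_hour": 1}
# Placeholder usage for one hour.
usage = {"new_conn_per_sec": 5, "active_conn_per_min": 3000, "gb_per_hour": 0.5}

lcus = lcus_consumed(usage, per_lcu)
cost = hourly_cost(usage, per_lcu, hourly_rate=0.02, lcu_rate=0.01)
print(lcus, round(cost, 4))
```

In this made-up hour, active connections dominate at 3 LCUs, so the other two dimensions don't affect the bill at all. That is part of why the cost is so hard to predict up front.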

Now, NLBs are fascinating because they claim that they don't need to be pre-warmed, which is awesome. ALBs need a little bit, but both need way, way, way less than ELBs did. NLBs are network layer load balancers but also somehow manage to do TLS or SSL termination. The OSI model differentiates layer seven, the application layer, from layer four, which is what the NLB does, and says those are very different, but the OSI model is a lie that we tell children to confuse them because we thought it was funny.

What the NLB does is it dumps traffic streams onto various places and in turn lets the destination handle it. They now support UDP as opposed to just TCP like they did at launch, so update your mental model. But by and large, if you want to wind up handling everything on instances themselves and just need something to drop the traffic onto them, an NLB's a decent approach.

ALBs, Application Load Balancers, on the other hand do a bunch of things, but mostly they're used to play slap and tickle with HTTPS requests. They can terminate TLS too, just like NLBs can, because everything's confusing and horrible, but they do a lot more. Specific headers can cause specific routing behaviors. An ALB determines where to route traffic, not just based upon the things we've already talked about in the first part of this episode, but also through a whole bunch of different traffic rules.

You can have a whole bunch of different applications as a result living behind a single ALB and that's often not a terrible direction to go in just from a costing perspective. If you're spinning up a bunch of containerized workloads, you probably don't want to spin up 200 load balancers. Maybe you can do one load balancer and then just give it a bunch of rules that determine which application gets which traffic. It's something to consider in any case.
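The "one load balancer, a bunch of rules" idea can be sketched as an ordered rule list where the first match wins and a default catches everything else. This is an illustration of the concept, not the ALB API; the `route` function, the rule format, the hostnames, and the target names are all hypothetical.

```python
def route(host, path, rules, default_target):
    """Return the target of the first rule whose conditions all match the request."""
    for rule in rules:
        # A condition that is absent from the rule matches anything.
        host_ok = "host" not in rule or rule["host"] == host
        path_ok = "path_prefix" not in rule or path.startswith(rule["path_prefix"])
        if host_ok and path_ok:
            return rule["target"]
    return default_target


# Hypothetical rules, evaluated top to bottom.
rules = [
    {"host": "api.example.com", "target": "api-service"},
    {"path_prefix": "/images/", "target": "image-service"},
]

print(route("api.example.com", "/v1/users", rules, "web"))
print(route("www.example.com", "/images/cat.png", rules, "web"))
print(route("www.example.com", "/", rules, "web"))
```

Three applications share one front door here, which is the cost argument above: one load balancer with rules instead of one load balancer per application.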

Now, obviously specializing in this stuff goes way deeper than we have time to cover in a single episode, but fundamentally, load balancers are a simple concept that get deceptively deep, deceptively quickly. Welcome to the entire world of networking in the cloud.

That wraps up what I have to say today about load balancers. Join us next week where I make fun of the AWS Global Accelerator and its horrible contributions to climate change.

Thanks again to ThousandEyes for sponsoring this ridiculous podcast. I am cloud economist Corey Quinn based in San Francisco, fixing AWS bills both here and elsewhere, and I'll talk to you next week.

Announcer: This has been a HumblePod production. Stay humble.
