Corey: Welcome to the AWS Morning Brief miniseries, Networking In the Cloud, sponsored by ThousandEyes. ThousandEyes has released their cloud performance benchmark report for 2020. They effectively race the top five cloud providers. That's AWS, Google Cloud Platform, Microsoft Azure, IBM Cloud, and Alibaba Cloud, notably not including Oracle Cloud, because it is restricted to real clouds, not law firms. It winds up providing an unbiased, metric-based, third-party perspective on cloud performance as it relates to end user experience. So this comes down to what real users see, not arbitrary benchmarks that can be gamed. It talks about architectural and connectivity differences between those five cloud providers and how that impacts performance. It talks about AWS Global Accelerator in exhausting detail. It talks about the Great Firewall of China and what effect that has on cloud performance in that region, and it talks about why regions like Asia and Latin America experience increased network latency on certain providers. To get your copy of this fascinating and detailed report, visit snark.cloud/realclouds, because again, Oracle's not invited. That's snark.cloud/realclouds, and my thanks to ThousandEyes for their continuing sponsorship of this ridiculous podcast segment.
Now, let's say you go ahead and spin up a pair of EC2 instances, and as would never happen until suddenly it does, you find that those two EC2 instances can't talk to one another. This episode of the AWS Morning Brief's Networking in the Cloud Podcast focuses on diagnosing connectivity issues in EC2. It is something that people don't have to care about until suddenly they really, really do. Let's start with our baseline premise, that we've spun up an EC2 instance, and a second EC2 instance can't talk to it. How do we go about troubleshooting our way through that process?
The first thing to check, above all else, and this goes back to my grumpy Unix systems administrator days is: are both EC2 instances actually up?
Yes, the console says they're up. It is certainly billing you for both of those instances, I mean, this is the cloud we're talking about, and it even says that the monitoring checks, there are two by default for each instance, are passing. That doesn't necessarily mean as much as you might hope. If you go into the EC2 console, you can validate through the system logs that they booted successfully. You can pull a screenshot out of them. If everything else were working, you could use AWS Systems Manager Session Manager, and if you'll forgive the ridiculous name, that's not a half bad way to go about getting access to an instance. It spins up a shell session in your browser that you can use to poke around inside the instance, but that may or may not get you where you need to go. I'm assuming you're trying to connect to one of those instances or both of those instances and failing, so validate that you can get into both of those instances independently.
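As a minimal sketch of that first check, here's a helper that reads the response shape returned by EC2's `describe-instance-status` call and reports which instances are failing either of the two default status checks. The response dict below is hand-built for illustration; in practice it would come from `boto3`'s `ec2.describe_instance_status()`, and the helper name is my own.

```python
def failing_checks(response):
    """Given a describe-instance-status style response, return the IDs of
    instances where either the system or the instance status check is
    anything other than "ok"."""
    bad = []
    for status in response.get("InstanceStatuses", []):
        if (status["SystemStatus"]["Status"] != "ok"
                or status["InstanceStatus"]["Status"] != "ok"):
            bad.append(status["InstanceId"])
    return bad


# Hand-built sample in the same shape the EC2 API returns:
sample = {"InstanceStatuses": [
    {"InstanceId": "i-aaa",
     "SystemStatus": {"Status": "ok"},
     "InstanceStatus": {"Status": "ok"}},
    {"InstanceId": "i-bbb",
     "SystemStatus": {"Status": "ok"},
     "InstanceStatus": {"Status": "impaired"}},
]}
print(failing_checks(sample))  # ['i-bbb']
```

"Passing both checks" still only tells you the hypervisor and OS are responsive, not that your traffic can get through, which is why the rest of this checklist exists.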
Something else to check. Consider protocols. Very often, you may not have permitted SSH access to these things. Or maybe you can't ping them and you're assuming they're down. Well, an awful lot of networks block certain types of ICMP traffic, echo requests (type 8), for example. Otherwise, you may very well find that whatever protocol you're attempting to use isn't permitted all the way through. Note incidentally, just as an aside, that blocking all ICMP traffic is going to cause problems for your network. When packets are too large for a link along the path and need to be fragmented or resent at a smaller size, ICMP messages are how the endpoints learn about that; path MTU discovery depends on them. You'll see increased latency if you block all ICMP traffic, and it's very difficult to diagnose, so please, for the love of God, don't do that.
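Since ping is so often filtered, a plain TCP connect to the actual service port is a more honest liveness probe. A minimal sketch in Python's standard library (the function name is my own):

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Attempt a plain TCP handshake. Many networks filter ICMP echo,
    so connecting to the real service port is a better liveness probe
    than ping."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

If `tcp_port_open(host, 22)` succeeds while ping fails, the instance is up and ICMP is simply being dropped somewhere along the path.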
Something else to consider as you go down the process of tearing apart what could possibly be going on with these EC2 instances that can't speak to each other: try to connect to them via IP addresses rather than DNS names. I'm not saying the problem is always DNS, but it usually is DNS, and going by IP address removes a whole host of different problems that could be manifesting. Failed resolution, timeouts, bad records, et cetera, all fall by the wayside. When you have one system trying to talk to another by IP alone, suddenly there's an entire class of problems you don't have to think about.
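One quick way to split "DNS problem" from "network problem" is to look at what the name actually resolves to, then retry the connection against the raw IP. A small standard-library sketch (the helper name is illustrative):

```python
import socket


def resolve(name):
    """Return the set of IP addresses a hostname resolves to. If
    connecting to one of these raw IPs works when the name doesn't,
    your problem really is DNS."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return set()  # resolution itself failed
    return {info[4][0] for info in infos}
```

An empty result means resolution itself is broken; a non-empty result gives you concrete IPs to test against directly.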
Something else to consider in the wonderful world of AWS is network ACLs. The best practice around network ACLs is, of course, don't use them. Have an ACL that permits all traffic, and then do everything else further down the stack. The reason is that no one thinks about network ACLs when diagnosing these problems. So if this is the issue, you're going to spend a lot of time spinning around and trying to figure out what it is that's going on.
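Part of why network ACLs burn people is that their evaluation model differs from security groups: rules are checked in ascending rule-number order, the first match wins, there's an implicit deny-everything rule at the end, and they're stateless, so return traffic needs its own rule. A toy sketch of the rule-ordering behavior (the tuple format is my own simplification, not the real API shape):

```python
def nacl_verdict(rules, port):
    """Evaluate NACL-style rules: ascending rule number, first match on
    the port range wins, and the implicit final rule denies everything.
    `rules` is a list of (rule_number, action, from_port, to_port)
    tuples -- a simplification, not the real API shape."""
    for _num, action, lo, hi in sorted(rules):
        if lo <= port <= hi:
            return action
    return "deny"  # the implicit catch-all "*" rule


# A broad allow at rule 100 is shadowed by a deny at rule 90 for port 22:
rules = [(100, "allow", 0, 65535), (90, "deny", 22, 22)]
print(nacl_verdict(rules, 22))  # deny
print(nacl_verdict(rules, 80))  # allow
```

That shadowing behavior, where one low-numbered rule silently overrides a broad allow, is exactly the kind of thing nobody thinks to look for.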
The next more likely approach, and something to consider whenever you're trying to set up different ways of dividing traffic across various regimes of segmentation, is security groups. Security groups are fascinating, and the way that they interact with one another is not hugely well understood. Some people treat security groups like old school IP address restrictions, permitting anything in a given network, which you can express in CIDR notation, or C-I-D-R, depending on how you enjoy pronouncing or mispronouncing things, and sure, that works. But you can also say that members of a particular security group are themselves allowed to speak to this other thing. That, in turn, is extraordinarily useful, but it can also get extremely complex, especially when you have multiple security groups layering upon one another.
Assuming that you have multiple security group rules in place, any rule that allows the traffic takes precedence; security group rules are permissive only, so there's no such thing as a deny rule to conflict with. Note as well that there's a security group rule in place by default that allows all outbound traffic. If that's gotten removed, that could be a terrific reason why an instance is not able to speak to the larger internet.
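A toy model of that union-of-allows behavior: traffic gets through if any rule matches, whether by CIDR range or by source security group membership. The rule dicts here are a simplification of my own, not the real boto3 response shape:

```python
import ipaddress


def sg_allows(rules, port, source_ip=None, source_sg=None):
    """Return True if ANY rule permits traffic on `port` from the given
    source. Security group rules are purely permissive; there are no
    deny rules, so matching is a simple union."""
    for rule in rules:
        if not (rule["from_port"] <= port <= rule["to_port"]):
            continue
        # Match by CIDR range...
        if (source_ip is not None and "cidr" in rule
                and ipaddress.ip_address(source_ip)
                in ipaddress.ip_network(rule["cidr"])):
            return True
        # ...or by membership in a source security group.
        if source_sg is not None and rule.get("source_sg") == source_sg:
            return True
    return False


rules = [
    {"from_port": 22, "to_port": 22, "cidr": "203.0.113.0/24"},
    {"from_port": 0, "to_port": 65535, "source_sg": "sg-app"},
]
print(sg_allows(rules, 22, source_ip="203.0.113.10"))  # True
print(sg_allows(rules, 5432, source_sg="sg-app"))      # True
```

The group-references-group style is what makes layered security groups powerful and also what makes them hard to reason about by eyeball.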
One thing to consider when talking about the larger internet is what ThousandEyes does other than releasing cloud benchmark performance reports. That's right. They are a monitoring company that gives a global observer perspective on the current state of the internet. If certain providers are having problems, they're well positioned to be able to figure out who that provider is, where that provider is having the issue, and how that manifests, and then present that in real time to its customers. So if you have widely dispersed users and want to keep a bit ahead of what they're experiencing, this is not a bad way to go about doing it.
ThousandEyes provides a real time map, more or less, of the internet and its various problems, leading to faster time to resolution. You understand relatively quickly that it's a problem with the internet, not the crappy code that you've just pushed into production, meaning that you can focus your efforts on remediating the problem, where they can serve the customer better rather than diagnosing and playing the finger pointing game of, "Whose problem really is this?" To learn more, visit thousandeyes.com. That's thousandeyes.com, and my thanks to them for their continuing sponsorship of this ridiculous podcast.
Further things to consider when those two EC2 instances are unable to connect to each other. Are you using IPv4 or IPv6? IPv6 is increasingly becoming something of a standard across the internet, and when things can't speak over IPv6, they now manifest as broken. Adoption has largely stalled in North America for some networks, but for others, it's becoming increasingly visible and increasingly valuable. Make sure that if you are trying to communicate via IPv6, everything end to end is in fact working.
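A trivial but surprisingly useful check when address families get mixed: confirm which version each address you're testing actually is, since a v4-only path can't reach a v6-only endpoint. Python's standard library handles this directly:

```python
import ipaddress


def ip_version(addr):
    """Return 4 or 6 for a textual IP address; raises ValueError if the
    string isn't a valid address at all, which is also useful to learn."""
    return ipaddress.ip_address(addr).version


print(ip_version("10.0.1.5"))     # 4
print(ip_version("2001:db8::1"))  # 6
```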
Something else to consider when you're diagnosing these two instances that can't talk to each other. Can you get into each one of them individually? Can the instance that you can get into speak to the broader internet? Can it hit other instances? Effectively, what you're trying to solve for here is fault isolation: drilling down until you can figure out which one instance has the problem. It's unlikely that the problem applies to both instances at the same time.
Another thing to consider is more on the host side. Are both of these instances, for example, in the same region, in the same subnet of the same VPC? If they're supposed to be, are you sure that you've set them up that way?
Remember, private IP addressing can be the same in different VPCs in different regions. So if you think they're in the same region and they're not, that could be a terrific explanation of why they aren't actually able to speak; you'd have to communicate across the public internet and use public IP addresses rather than private IP addresses. Understanding exactly what IP addresses each of these instances has is going to be critical for figuring out why they're not speaking correctly.
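Since private ranges like 10.0.0.0/16 repeat across VPCs, two plausible-looking private IPs may not share a network at all. A quick sketch for sanity-checking both addresses against the subnet CIDR you believe they live in (helper name is my own):

```python
import ipaddress


def same_subnet(ip_a, ip_b, cidr):
    """Return True only if both addresses fall inside the given subnet
    CIDR. Identical-looking private IPs in different VPCs will not both
    match the subnet you think they share."""
    net = ipaddress.ip_network(cidr)
    return (ipaddress.ip_address(ip_a) in net
            and ipaddress.ip_address(ip_b) in net)


print(same_subnet("10.0.1.5", "10.0.1.9", "10.0.1.0/24"))  # True
print(same_subnet("10.0.1.5", "10.0.2.9", "10.0.1.0/24"))  # False
```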
A further thing to consider while you're poking around on the instances themselves: are there host-based firewalls? On Linux, you have iptables; on BSD and its derivatives, you have PF; but there are a bunch of different tools here for controlling, on a local host basis, how packets are handled.
Now, my best practice that I advise is don't do network controls with host-based firewalls. The reason is that it's very difficult to manage at large scale. It's challenging to remember that that's where it is, and as we go down this path, figuring out exactly what's causing these problems is challenging. It doesn't lead to a great outcome, and it adds work, and I don't think it's likely that this is going to be your problem, but it's certainly worth considering.
Something else to consider as well: is it possible that there's a bad route, where one of these instances does not have a proper route either to the internet or to the other subnet that you're attempting to speak to? This is largely handled for you by AWS, but by the time you've gotten to this level of the troubleshooting path, that's not necessarily guaranteed. It's something to consider. Is there a route that's missing? Is there a route that's incorrect? Can this thing talk to the broader internet, assuming it's supposed to be able to? If it can't, well, there's your potential problem. There's a sort of troubleshooting flowchart, mentally, that I go down when I start thinking about problems like this. It's not anything so formalized, but it's one of those, "What sorts of things cause these issues, and what would cause certain failure modes to manifest in a way that aligns with what I'm seeing?"
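VPC route tables pick the most specific matching prefix, so a stray specific route can quietly override your default route. A small sketch of that longest-prefix-match behavior (the dict format is a simplification of my own, not the real route table API shape):

```python
import ipaddress


def route_for(route_table, dest_ip):
    """Pick the route a VPC route table would use: the most specific
    (longest) matching prefix wins. `route_table` maps CIDR strings to
    targets. Returns None when nothing matches, i.e. no default route."""
    dest = ipaddress.ip_address(dest_ip)
    best = None
    for cidr, target in route_table.items():
        net = ipaddress.ip_network(cidr)
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, target)
    return best[1] if best else None


table = {"10.0.0.0/16": "local", "0.0.0.0/0": "igw-example"}
print(route_for(table, "10.0.4.2"))  # local
print(route_for(table, "8.8.8.8"))   # igw-example
print(route_for({"10.0.0.0/16": "local"}, "8.8.8.8"))  # None
```

The last line is the "missing route" failure mode from the episode: a subnet with no default route can talk locally but never reach the broader internet.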
I'm a big believer as well in spinning up another copy of an instance, because hey, it's not that expensive to spin one of those things for five minutes right next to the one that's broken, and see, "Okay, is it something that is afflicting this instance as well, or is it something that is happening globally?"
Depending on your use case, it may not be an appropriate way of solving the problem, but if I can replicate the problem with a very small test case, suddenly it's a lot easier for me to take what I've learned and explain it to someone else when asking for help. When you're asking someone to help solve a problem, saying, "I spun up an instance and attempted to connect on port 22, SSH, to another instance in the same subnet, and it doesn't work," gives you a very small, isolated problem case that you're likely to be able to get good help with, be it through a community support resource or even AWS support. Whereas if you have a complicated environment and you're only ever able to test in that environment, it becomes, "Well, I have this special series of instances spun up from a custom AMI." Yes, it's pronounced A-M-I. Do not let them call it Ah-mee; they're mispronouncing it if they do. "And it's not able to speak to this particular instance on this particular port when the stars align and we're seeing a certain level of traffic." If it has to be part of an existing larger environment, your troubleshooting is fundamentally going to be broken and bizarre in a whole host of different ways. It doesn't make it easier.
So I'm a big believer in getting down to the smallest possible test case, and again, because this is cloud resources, you can spin up effectively everything in a completely different account in not too much time. So sitting here spinning your wheels trying to diagnose network connectivity issues should not be the sort of thing that takes you days on end. You should be able to clear out an entire test suite of everything I've just described in an hour or two. It doesn't take that much work to spin up new resources. Back in the days of data centers, it took weeks to get things provisioned, so of course we weren't able to provision a whole new stack; a whole new series of switches and routers and cables and servers and VMs on those servers was just not going to happen.
Instead, we don't have to see that problem now. We can spin up the entire stack quickly, and that's, from my perspective at least, one of the most transformative aspects of the cloud. When you have a question that you're not sure how to answer, you can spin up a test case and see for yourself.
That's all I've got on our connectivity troubleshooting episode of the AWS Morning Brief Networking in the Cloud podcast miniseries segment. My thanks, as always, to ThousandEyes for their generous sponsorship of this ridiculous, ideally entertaining, but nonetheless educational podcast segment. I will talk to you more next week about networking in the cloud, and what that looks like and how things break. Thank you again for listening. If you've enjoyed this podcast, please leave a five star review on Apple Podcasts. If you've hated this podcast, please still leave a five star review on Apple Podcasts and tell me exactly what my problem is.
Announcer: This has been a HumblePod production. Stay humble.