A Cloud Economist is Born – The AlterNAT Origin Story

Episode Summary

Ben Whaley, Staff Software Engineer at Chime Financial, joins Corey to discuss his new solution, AlterNAT, which is designed to address egregious NAT Gateway costs. Ben explains how he was inspired to create AlterNAT by searching for the biggest impact he could have on his company’s AWS bill, how he combined legacy NAT instances with an automatic standby NAT Gateway to address reliability concerns, and describes his own journey to becoming a fellow cloud economist in Corey’s eyes. Ben also reveals why he’d consider his project a success even if it became irrelevant.

Episode Show Notes & Transcript

About Ben

Ben Whaley is a staff software engineer at Chime. Ben is co-author of the UNIX and Linux System Administration Handbook, the de facto standard text on Linux administration, and is the author of two educational videos: Linux Web Operations and Linux System Administration. He has been an AWS Community Hero since 2014. Ben has held Red Hat Certified Engineer (RHCE) and Certified Information Systems Security Professional (CISSP) certifications. He earned a B.S. in Computer Science from the University of Colorado Boulder.

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: Forget everything you know about SSH and try Tailscale. Imagine if you didn't need to manage PKI or rotate SSH keys every time someone leaves. That'd be pretty sweet, wouldn't it? With Tailscale SSH, you can do exactly that. Tailscale gives each server and user device a node key to connect to its VPN, and it uses the same node key to authorize and authenticate SSH.

Basically you're SSHing the same way you manage access to your app. What's the benefit here? Built-in key rotation, permissions as code, connectivity between any two devices, reduced latency, and there's a lot more, but there's a time limit here. You can also ask users to reauthenticate for that extra bit of security. Sounds expensive?

Nope, I wish it were. Tailscale is completely free for personal use on up to 20 devices. To learn more, visit snark.cloud/tailscale. Again, that's snark.cloud/tailscale.

Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn and this is an episode unlike any other that has yet been released on this august podcast. Let’s begin by introducing my first-time guest somehow because apparently an invitation got lost in the mail somewhere. Ben Whaley is a staff software engineer at Chime Financial and has been an AWS Community Hero since Andy Jassy was basically in diapers, to my level of understanding. Ben, welcome to the show.

Ben: Corey, so good to be here. Thanks for having me on.

Corey: I’m embarrassed that you haven’t been on the show before. You’re one of those people that slipped through the cracks and somehow I was very bad at following up slash hounding you into finally agreeing to be here. But you certainly waited until you had something auspicious to talk about.

Ben: Well, you know, I’m the one that really should be embarrassed here. You did extend the invitation and I guess I just didn’t feel like I had something to drop. But I think today we have something that will interest most of the listeners without a doubt.

Corey: So, folks who have listened to this podcast before, or read my newsletter, or follow me on Twitter, or have shared an elevator with me, or at any point have passed me on the street, have heard me complain about the Managed NAT Gateway and its egregious data processing fee of four-and-a-half cents per gigabyte. And I have complained about this for small customers because they’re in the free tier; why is this thing charging them 32 bucks a month? And I have complained about this on behalf of large customers who are paying the GDP of the nation of Belize in data processing fees as they wind up shoving very large workloads to and fro, which is I think part of the prerequisite requirements for having a data warehouse. And you are no different than the rest of these people who have those challenges, with the singular exception that you have done something about it, and what you have done is so, in retrospect, blindingly obvious that I am embarrassed the rest of us never thought of it.

Ben: It’s interesting because when you are doing engineering, it’s often the simplest solution that is the best. I’ve seen this repeatedly. And it’s a little surprising that it didn’t come up before, but I think it’s in some way, just a matter of timing. But what we came up with—and is this the right time to get into it, do you want to just kind of name the solution, here?

Corey: Oh, by all means. I’m not going to steal your thunder. Please, tell us what you have wrought.

Ben: We’re calling it AlterNAT, and it’s an alternative high-availability NAT solution. As everybody knows, NAT Gateway is sort of the default choice; it certainly is what AWS pushes everybody towards. But there is, in fact, a legacy solution: NAT instances. These were around long before NAT Gateway made an appearance. And like I said, they’re considered legacy, but with the help of lots of modern AWS innovations and technologies like Lambda, auto-scaling groups with max instance lifetimes, and the latest generation of enhanced-networking instances, it turns out that—maybe we can’t get quite as effective as a NAT Gateway, but we can save a lot of money and skip those data processing charges entirely by having a NAT instance solution with a failover NAT Gateway, which I think is kind of the key point behind the solution. So, are you interested in diving into the technical details?

Corey: That is very much the missing piece right there. You’re right. What we used to use was NAT instances. That was the thing that we used because we didn’t really have another option. And they had an interface in the public subnet where they lived and an interface hanging out in the private subnet, and they had to be configured to wind up passing traffic to and fro.

Well, okay, that’s great and all, but isn’t that kind of brittle and dangerous? I basically have a single instance as a single point of failure, and those were the early days, when individual instances did not have the level of availability and durability they do now. Yeah, it’s kind of awful, but here you go. I mean, the most galling part of the Managed NAT Gateway service is not that it’s expensive; it’s that it’s expensive but also incredibly good at what it does. You don’t have to think about this whole problem anymore, and as of recently, it also supports IPv6 to IPv4 translation.

It’s not that the service is bad. It’s that the service is stonkingly expensive, particularly at scale. And everything that we’ve seen before is either, oh, run your own NAT instances, or bend the knee and pay your money. And a number of folks have come up with different options along the lines of, this is ridiculous; just go ahead and run your own NAT instances.

Yeah, but what happens when I have to take it down for maintenance or replace it? It’s like, well, I guess you’re not going to the internet today. This has the, in hindsight, obvious solution: you keep a Managed NAT Gateway around—because the 32 bucks a month in hourly charges doesn’t actually matter at any point of scale when you’re doing this—but you use the NAT instance for day-in, day-out traffic, and the failover mode is simply that you use the expensive Managed NAT Gateway until the instance is healthy again, and then automatically change the route table back.

Ben: Yep. That’s exactly it. So, the auto-scaling NAT instance solution has been around for a long time, well before NAT Gateway was even released. You could have NAT instances in an auto-scaling group where the size of the group was one, and if the NAT instance failed, it would just replace itself. But this left a period—while the NAT instance was being swapped out—in which you’d have no internet connectivity.

So, the solution here is that when auto-scaling terminates an instance, it fails over the route table to a standby NAT Gateway, rerouting the traffic. So, there’s never a point at which there’s no internet connectivity, right? The NAT instance is running, processing traffic, and gets terminated after a configurable period of time—14 days, 30 days, whatever makes sense for your security strategy. It could even be never, right? You could choose to have your own maintenance window in which to do it.
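The route swap Ben just described is the heart of the design. As a rough illustration—this is a toy model, not AlterNAT’s actual code, and every resource ID below is invented—the reconcile step a health-check function might run looks something like this, with the route table modeled as a plain dict instead of real EC2 `ReplaceRoute` calls:

```python
# Sketch of the failover logic described above. The real AlterNAT
# project drives an actual EC2 route-table update from a Lambda; here
# the route table is a plain dict so the decision logic is
# self-contained. All resource IDs are invented placeholders.

NAT_INSTANCE_ENI = "eni-0aaa1111bbbb2222c"  # primary path: NAT instance
NAT_GATEWAY_ID = "nat-0ddd3333eeee4444f"    # failover path: standby gateway

def fail_over(route_table: dict) -> dict:
    """Point the default route at the standby NAT Gateway."""
    updated = dict(route_table)
    updated["0.0.0.0/0"] = {"nat_gateway_id": NAT_GATEWAY_ID}
    return updated

def fail_back(route_table: dict) -> dict:
    """Point the default route back at the (healthy) NAT instance."""
    updated = dict(route_table)
    updated["0.0.0.0/0"] = {"network_interface_id": NAT_INSTANCE_ENI}
    return updated

def reconcile(route_table: dict, instance_healthy: bool) -> dict:
    """What a health-check invocation would do each time it runs."""
    return fail_back(route_table) if instance_healthy else fail_over(route_table)
```

In the real project, the same decision drives an actual route-table update, and the auto-scaling max-instance-lifetime termination is just one more "unhealthy" signal that triggers the failover leg.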

Corey: And let’s face it, this thing is more or less sitting there as a network traffic router, for lack of a better term. There is no need to ever log into the thing and make changes to it until and unless there’s a vulnerability that you can exploit via somehow just talking to the TCP stack when nothing’s actually listening on the host.

Ben: You know, you can run your own AMI that has been pared down to almost nothing, and that instance doesn’t do much. It’s using just the Linux kernel to sit on two networks and pass traffic back and forth. It has a translation table that keeps track of the state of connections, so you don’t need to have any service running. To manage the system, we have SSM, so you can use Session Manager to log in, but frankly, you can just disable that; you almost never even need to get a shell. And that is, in fact, an option we have in the solution: disabling SSM entirely.

Corey: One of the things I love about this approach is that it is turnkey. You throw this thing in there and it’s good to go. And in the event that the instance becomes unhealthy, great, it fails traffic over to the Managed NAT Gateway while it terminates the old node and replaces it with a healthy one and then fails traffic back. Now, I do need to ask, what is the story of network connections during that failover and failback scenario?

Ben: Right, that’s the primary drawback of the solution, I would say: any established TCP connections that are on the NAT instance at the time of a route change will be lost. So, say you have—

Corey: TCP now terminates on the floor.

Ben: Pretty much. The connections are dropped. If you have an open SSH connection from a host in the private network to a host on the internet and the instance fails over to the NAT Gateway, the NAT Gateway doesn’t have the translation table that the NAT instance had. And not to mention, the public IP address also changes because you have an Elastic IP assigned to the NAT instance, a different Elastic IP assigned to the NAT Gateway, and so because that upstream IP is different, the remote host is, like, tracking the wrong IP. So, those connections, they’re going to be lost.
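Why those connections die is easiest to see with a toy model of the translation table Ben mentions (again, illustrative Python, not anything from the project): the connection state lives only on the NAT instance, and the standby gateway has both a different Elastic IP and an empty table.

```python
# Toy model of a NAT translation table. Outbound flows are rewritten to
# the NAT device's public IP and an allocated source port, and that
# mapping is kept so replies can be routed back. The table lives only on
# the device that built it, which is why a route-table failover (to a
# gateway with a different Elastic IP and an empty table) drops flows.

class ToyNat:
    def __init__(self, public_ip: str):
        self.public_ip = public_ip
        self._next_port = 50000
        self.table = {}  # (private_ip, private_port) -> public source port

    def translate_out(self, private_ip: str, private_port: int):
        """Return the (public_ip, public_port) the remote host will see."""
        key = (private_ip, private_port)
        if key not in self.table:
            self.table[key] = self._next_port
            self._next_port += 1
        return self.public_ip, self.table[key]

nat_instance = ToyNat("203.0.113.10")  # Elastic IP on the NAT instance
nat_gateway = ToyNat("203.0.113.99")   # different Elastic IP on the standby

# A private host opens a connection through the NAT instance:
seen_by_remote = nat_instance.translate_out("10.0.1.5", 43210)
# After failover, the gateway has no entry for this flow—and the remote
# host is tracking the instance's public IP anyway—so the connection dies.
```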

So, there are some use cases where this may not be suitable. We do have some ideas on how you might mitigate that—for example, using a maintenance window to schedule the replacement, or replacing less often so it doesn’t affect your workflow as much—but frankly, for many use cases, my belief is that it’s actually fine. In our use case at Chime, we found that it’s completely fine, and we didn’t actually experience any errors or failures. But there might be some use cases that are more sensitive, or less resilient to failure in the first place.

Corey: I would also point out that a lot of how software behaves is going to be a reflection of the era in which it was moved to cloud. Back in the early days of EC2, you had no real sense of reliability around any individual instance, so everything was written in a very defensive manner. These days, with instances automatically being able to flow among different hardware, so we don’t get instance interrupt notifications the way we once did on a semi-constant basis, it more or less presents as bulletproof, so a lot of people are writing software that’s a bit more brittle. But it’s always been a best practice that when a connection fails, okay, what happens at failure? Do you just give up and throw your hands in the air and shriek for help, or do you attempt to retry a few times, ideally backing off exponentially?

In this scenario, those retries will work. So, it’s a question of how well you have built your software. Okay, let’s say that you made the worst decisions imaginable, and if that connection dies, the entire workload dies. Okay, you have the option to refactor it to be a little bit better behaved, or alternately, you can keep paying the Managed NAT Gateway tax of four-and-a-half cents per gigabyte in perpetuity. I’m not going to tell you what decision to make, but I know which one I’m making.

Ben: Yeah, exactly. The cost savings potential of it far outweighs the potential maintenance troubles, I guess, that you could encounter. But the fact is, if you’re relying on Managed NAT Gateway and paying the price for doing so, it’s not as if there’s no chance for connection failure. NAT Gateway could also fail. I will admit that I think it’s an extremely robust and resilient solution. I've been really impressed with it, especially so after having worked on this project, but it doesn’t mean it can’t fail.

And beyond that, upstream of the NAT Gateway, something could in fact go wrong. Like, internet connections are unreliable, kind of by design. So, if your system is not resilient to connection failures, like, there’s a problem to solve there anyway; you’re kind of relying on hope. So, it’s a kind of a forcing function in some ways to build architectural best practices, in my view.

Corey: I can’t stress enough that I have zero problem with the capabilities and the stability of the Managed NAT Gateway solution. My complaints about it start and stop entirely with the price. Back when you first showed me the blog post that is releasing at the same time as this podcast—and you can visit that at alternat.cloud—you sent me an early draft of this, and what I loved the most was that your math was off because of an incomplete understanding of just how egregious the NAT Gateway charges are.

Your initial analysis said, “All right, if you’re throwing half a terabyte out to the internet, this has the potential of cutting the bill by”—I think it was $10,000 or something like that. It’s, “Oh no, no. It has the potential to cut the bill by an entire twenty-two-and-a-half thousand dollars.” Because this processing fee does not replace any egress fees whatsoever. It’s purely additive. If you forget to have a free S3 Gateway endpoint in a private subnet, every time you put something into or take something out of S3, you’re paying four-and-a-half cents per gigabyte on that, despite the fact that there’s no internet transit involved; it’s not crossing availability zones. It is simply a four-and-a-half-cent fee to retrieve something that has only cost you—at most—2.3 cents per month to store in the first place. Flip that switch, and that becomes completely free.

Ben: Yeah. I’m not embarrassed at all to talk about the lack of education I had around this topic. The fact is I’m an engineer primarily and I came across the cost stuff because it kind of seemed like a problem that needed to be solved within my organization. And if you don’t mind, I might just linger on this point and kind of think back a few months. I looked at the AWS bill and I saw this egregious ‘EC2 Other’ category. It was taking up the majority of our bill. Like, the single biggest line item was EC2 Other. And I was like, “What could this be?”

Corey: I want to wind up flagging that just because that bears repeating because I often get people pushing back of, “Well, how bad—it’s one Managed NAT Gateway. How much could it possibly cost? $10?” No, it is the majority of your monthly bill. I cannot stress that enough.

And that’s not because the people who work there are doing anything that they should not be doing or didn’t understand all the nuances of this. It’s because for the security posture that is required for what you do—you are at Chime Financial, let’s be clear here—putting everything in public subnets was not really a possibility for you folks.

Ben: Yeah. And not only that, but there are plenty of services that have to be on private subnets. For example, AWS Glue services must run in private VPC subnets if you want them to be able to talk to other systems in your VPC; like, they cannot live in a public subnet. So essentially, if you want to talk to the internet from those jobs, you’re forced into some kind of NAT solution. So, I dug into the EC2 Other category and started trying to figure out what was going on there.

There’s no way—natively—to look at what traffic is transiting the NAT Gateway. There’s no interface that shows you what’s going on or what the biggest talkers over that network are. Instead, you have to have flow logs enabled and parse those flow logs. So, I dug into that.

Corey: Well, you’re missing a step first, because in a lot of environments, people have more than one of these things, so you first get to do the scavenger hunt of, okay, I have a whole bunch of Managed NAT Gateways, and I need to go diving into CloudWatch metrics and figure out which are the heavy talkers. It’s usually one or two followed by a whole bunch of small stuff, but not always, so figuring out which VPC you’re even talking about is a necessary prerequisite.

Ben: Yeah, exactly. The data around it is almost missing entirely. Once you come to the conclusion that it is a particular NAT Gateway—like, that’s a set of problems to solve on its own—you have to go to the flow logs and figure out the biggest upstream IPs it’s talking to. Once you have the IP, it still isn’t apparent what that host is. In our case, we had all sorts of outside parties that we were talking to a lot, and it’s a matter of sorting by volume and figuring out, well, this IP—what does the reverse DNS say? Who is potentially the host there?

I actually had some wrong answers at first. I set up VPC endpoints to S3 and DynamoDB and SQS because those were some top talkers, and that was a nice way to gain some security and some resilience and save some money. And then I found, well, Datadog; that’s another top talker for us, so I ended up creating a nice private link to Datadog, which they offer for free, by the way, which is more than I can say for some other vendors. But then I found some outside parties; there wasn’t a nice private link solution available to us, and yet it was by far the largest volume. So, that’s what started me down this track: analyzing the NAT Gateway traffic myself by looking at VPC flow logs. Like, it’s shocking that there isn’t a better way to find that traffic.
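The flow-log analysis Ben describes boils down to "sum bytes per destination and sort." A minimal sketch—the field layout follows the default VPC flow log format, but the sample records themselves are invented:

```python
from collections import Counter

# Minimal top-talkers aggregation over VPC flow log records. Real records
# arrive via CloudWatch Logs or S3; these sample lines are invented, with
# fields in the default format: version account-id interface-id srcaddr
# dstaddr srcport dstport protocol packets bytes start end action status
records = [
    "2 123456789012 eni-abc 10.0.1.5 198.51.100.7 43210 443 6 10 900000 0 0 ACCEPT OK",
    "2 123456789012 eni-abc 10.0.1.5 198.51.100.7 43211 443 6 10 600000 0 0 ACCEPT OK",
    "2 123456789012 eni-abc 10.0.1.9 192.0.2.44 51000 443 6 5 100000 0 0 ACCEPT OK",
]

def top_talkers(lines):
    """Sum bytes per destination address, largest first."""
    totals = Counter()
    for line in lines:
        fields = line.split()
        dstaddr, nbytes = fields[4], int(fields[9])
        totals[dstaddr] += nbytes
    return totals.most_common()

# → [('198.51.100.7', 1500000), ('192.0.2.44', 100000)]
```

From there, as Ben says, it’s reverse DNS and guesswork to attach a name to each heavy-hitting IP.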

Corey: It’s worse than that, because VPC flow logs tell you where the traffic is going and in what volumes, sure, on an IP address and port basis. But okay, now you have a Kubernetes cluster that spans two availability zones. Okay, great. What is actually passing through that? So, you have one big application that just seems awfully chatty, you have multiple workloads running on the thing; what’s the expensive thing talking back and forth? The only way I’ve found to reliably get the answer to that is to talk to people about what those workloads are actually doing, and failing that, you’re going code spelunking.

Ben: Yep. You’re exactly right about that. In our case, it ended up being apparent because we have a set of subnets where only one particular project runs. And when I saw the source IP, I could immediately figure that part out. But if it’s a K8s cluster in the private subnets, yeah, how are you going to find it out? You’re going to have to ask everybody that has workloads running there.

Corey: And we’re talking about in some cases, millions of dollars a month. Yeah, it starts to feel a little bit predatory as far as how it’s priced and the amount of work you have to put in to track this stuff down. I’ve done this a handful of times myself, and it’s always painful unless you discover something pretty early on, like, oh, it’s talking to S3 because that’s pretty obvious when you see that. It’s, yeah, flip switch and this entire engagement just paid for itself a hundred times over. Now, let’s see what else we can discover.

That is always one of those fun moments because, first, customers are super grateful to learn that, oh, my God, I flipped that switch. And I’m saving a whole bunch of money. Because it starts with gratitude. “Thank you so much. This is great.” And it doesn’t take a whole lot of time for that to alchemize into anger of, “Wait. You mean, I’ve been being ridden like a pony for this long and no one bothered to mention that if I click a button, this whole thing just goes away?”

And when you mention this to your AWS account team, like, they’re solicitous, but they either have to present as, “I didn’t know that existed either,” which is not a good look, or, “Yeah, you caught us,” which is worse. There’s no positive story on this. It just feels like a tax on not knowing trivia about AWS. I think that’s what really winds me up about it so much.

Ben: Yeah, I think you’re right on about that as well. My misunderstanding about the NAT pricing was that data processing is additive to data transfer. I expected that when I replaced NAT Gateway with NAT instances, I would be substituting data transfer costs for NAT Gateway data processing costs. But in fact, NAT Gateway incurs both data processing and data transfer. NAT instances only incur data transfer costs. And so, this is a big difference between the two solutions.

Not only that, but if you’re egressing out of, say, your us-east-1 region and talking to another hosted service also within us-east-1—never leaving the AWS network—you don’t actually even incur data transfer costs. So, if you’re using a NAT Gateway, you’re still paying data processing.

Corey: To be clear, you do, but it is in most cases billed like cross-AZ traffic, at one penny egressing, and on the other side, that hosted service generally pays one penny ingressing as well. Don’t feel bad about that one. That was extraordinarily unclear, and the only reason I know the answer is that I got tired of getting stonewalled by people who later turned out not to know the answer, so I ran a series of experiments designed explicitly to find this out.

Ben: Right. As opposed to the five cents to nine cents that is data transfer to the internet. Add that to data processing on a NAT Gateway and you’re paying between nine-and-a-half and thirteen-and-a-half cents for every gigabyte egressed. And this is a phenomenal cost. At any kind of volume—if you’re doing terabytes to petabytes—this becomes a significant portion of your bill. And this is why people hate the NAT Gateway so much.
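Those per-gigabyte figures are easy to sanity-check. Using the on-demand rates mentioned in this conversation (internet egress tiers from nine cents down to five cents per gigabyte, plus the flat four-and-a-half-cent data processing fee), a quick back-of-the-envelope comparison:

```python
# Per-gigabyte cost of egressing through NAT, at the on-demand rates
# discussed in this conversation. Data transfer to the internet tiers
# from $0.09/GB down to $0.05/GB; NAT Gateway data processing is a
# flat $0.045/GB *on top of* transfer. NAT instances skip that fee.

TRANSFER_HIGH = 0.09       # $/GB, first internet egress tier
TRANSFER_LOW = 0.05        # $/GB, highest-volume egress tier
NAT_GW_PROCESSING = 0.045  # $/GB, additive to data transfer

def per_gb_nat_gateway(transfer_rate: float) -> float:
    """Total $/GB when traffic exits via a Managed NAT Gateway."""
    return transfer_rate + NAT_GW_PROCESSING

def per_gb_nat_instance(transfer_rate: float) -> float:
    """Total $/GB via a NAT instance: data transfer only."""
    return transfer_rate

# Gateway: $0.095-$0.135/GB; instance: $0.05-$0.09/GB.
# At a petabyte a month, the processing fee alone is roughly:
monthly_processing_fee = 1_000_000 * NAT_GW_PROCESSING  # ~ $45,000
```

That last line is the whole argument in one number: at petabyte scale, the processing fee by itself is tens of thousands of dollars a month.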

Corey: I am going to short-circuit an angry comment I can already see coming on this where people are going to say, “Well, yes. But it’s a multi-petabyte scale. Nobody’s paying on-demand retail price.” And they’re right. Most people who are transmitting that kind of data, have a specific discount rate applied to what they’re doing that varies depending upon usage and use case.

Sure, great. But I’m more concerned with the people who are sitting around dreaming up ideas for a company where I want to wind up doing some sort of streaming service. I talked to one of those companies very early on in my tenure as a consultant around the billing piece and they wanted me to check their napkin math because they thought that at their numbers when they wound up scaling up, if their projections were right, that they were going to be spending $65,000 a minute, and what did they not understand? And the answer was, well, you didn’t understand this other thing, so it’s going to be more than that, but no, you’re directionally correct. So, that idea that started off on a napkin, of course, they didn’t build it on top of AWS; they went elsewhere.

And last time I checked, they’d raised well over a quarter-billion dollars in funding. So, that’s a business that AWS would love to have on a variety of different levels, but they’re never going to even be considered because by the time someone is at scale, they either have built this somewhere else or they went broke trying.

Ben: Yep, absolutely. And we might just make the point there that while you can get discounts on data transfer, you really can’t—or it’s very rare to—get discounts on data processing for the NAT Gateway. So, any kind of savings you can get on data transfer would apply to a NAT instance solution, saving you four-and-a-half cents per gigabyte inbound and outbound over the NAT Gateway equivalent. So, you’re paying a lot for the benefit of a fully-managed service there. A very robust, nicely engineered, fully-managed service, as we’ve already acknowledged, but an extremely expensive solution for what it is, which is really just a proxy in the end. It doesn’t add any value to you.

Corey: The only way to make that more expensive would be to route it through something like Splunk or whatnot. And Splunk does an awful lot for what they charge per gigabyte, but it just feels like it’s rent-seeking in some of the worst ways possible. And what I love about this is that you’ve solved the problem in a way that is open-source, you have already released it in Terraform code. I think one of the first to-dos on this for someone is going to be, okay now also make it CloudFormation and also make it CDK so you can drop it in however you want.

And anyone can use this. I think the biggest mistake people might make in glancing at this is well, I’m looking at the hourly charge for the NAT Gateways and that’s 32-and-a-half bucks a month and the instances that you recommend are hundreds of dollars a month for the big network-optimized stuff. Yeah, if you care about the hourly rate of either of those two things, this is not for you. That is not the problem that it solves. If you’re an independent learner annoyed about the $30 charge you got for a Managed NAT Gateway, don’t do this. This will only add to your billing concerns.

Where it really shines is once you’re at, I would say, probably about ten terabytes a month, give or take, in Managed NAT Gateway data processing; that’s where it starts to make sense to consider this. The breakeven is around six or so, but there is value in not having to think about things. Once you get to that level of spend, though, it’s worth devoting a little bit of infrastructure time to something like this.

Ben: Yeah, that’s effectively correct. The total cost of running the solution, all-in: there are eight Elastic IPs and four NAT Gateways if, say, you’re in four zones—it could be less if you’re in fewer zones—so n NAT Gateways and n NAT instances, depending on how many zones you’re in, and I think that’s about it. And I said right in the documentation, if any of those baseline fees are a material number for your use case, then this is probably not the right solution. Because we’re talking about saving thousands of dollars. Any of these small numbers for NAT Gateway hourly costs or NAT instance hourly costs shouldn’t be a factor, basically.
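Ben’s point that the baseline fees should be immaterial can be sanity-checked with rough numbers. Everything below is an illustrative placeholder—especially the NAT instance hourly rate—rather than real pricing guidance:

```python
# Rough breakeven for an AlterNAT-style deployment: fixed monthly cost
# of standby NAT Gateways plus NAT instances, versus the $0.045/GB
# data processing fee you stop paying. All rates are placeholders;
# check current AWS pricing for real numbers.

HOURS_PER_MONTH = 730
ZONES = 4                   # one gateway + one instance per zone
NAT_GW_HOURLY = 0.045       # $/hr per standby NAT Gateway (placeholder)
NAT_INSTANCE_HOURLY = 0.05  # $/hr per NAT instance (placeholder)
PROCESSING_SAVED = 0.045    # $/GB no longer paid in data processing

def monthly_baseline() -> float:
    """Fixed cost of running standby gateways and NAT instances."""
    return ZONES * HOURS_PER_MONTH * (NAT_GW_HOURLY + NAT_INSTANCE_HOURLY)

def breakeven_gb() -> float:
    """Monthly NAT-processed traffic where avoided fees cover the baseline.

    Slightly pessimistic: a plain NAT Gateway setup pays the gateway
    hourly fee too, so the true breakeven is a bit lower than this.
    """
    return monthly_baseline() / PROCESSING_SAVED
```

Under these placeholder rates, the baseline is a few hundred dollars a month and the breakeven lands a little over six terabytes a month of NAT traffic—the same ballpark as the figure discussed earlier; bigger instances push the breakeven up, but at tens of terabytes the savings dwarf every hourly fee involved.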

Corey: Yeah, it’s like when I used to worry about costing my customers a few tens of dollars in Cost Explorer or CloudWatch or request fees against S3 for their Cost and Usage Reports. It’s yeah, that does actually have a cost, there’s no real way around it, but look at the savings they’re realizing by going through that. Yeah, they’re not going to come back and complaining about their five-figure consulting engagement costing an additional $25 in AWS charges and then lowering it by a third. So, there’s definitely a difference as far as how those things tend to be perceived. But it’s easy to miss the big stuff when chasing after the little stuff like that.

This is part of the problem I have with an awful lot of cost tooling out there. They completely ignore cost components like this and focus only on the things that are easy to query via API, of, oh, we’re going to cost-optimize your Kubernetes cluster when they think about compute and RAM. And, okay, that’s great, but you’re completely ignoring all the data transfer because there’s still no great way to get at that programmatically. And it really is missing the forest for the trees.

Ben: I think this is key to any cost reduction project or program that you’re undertaking. When you look at a bill, look for the biggest spend items first and work your way down from there, just because of the impact you can have. And that’s exactly what I did in this project. I saw that ‘EC2 Other’—slash NAT Gateway—was the big item, and I started brainstorming ways we could go about addressing it. And now that we’ve reduced this cost to effectively… nothing—extremely low compared to what it was—I have my next targets in mind: other new line items on our bill that we can start optimizing. But in any cost project, start with the big things.

Corey: You have come a long way around to answer a question I get asked a lot, which is, “How do I become a cloud economist?” And my answer is, you don’t. It’s something that happens to you. And it appears to be happening to you, too. My favorite part about the solution that you built, incidentally, is that it is being released under the auspices of your employer, Chime Financial, which is immune to being acquired by Amazon just to kill this thing and shut it up.

Because Amazon already has something shitty called Chime. They don’t need to wind up launching something else or acquiring something else and ruining it because they have a Slack competitor of sorts called Amazon Chime. There’s no way they could acquire you [unintelligible 00:27:45] going to get lost in the hallways.

Ben: Well, I have confidence that Chime will be a good steward of the project. Chime’s goal and mission as a company is to help everyone achieve financial peace of mind, and we take that really seriously. We even apply it to ourselves, and that was kind of the impetus behind developing this in the first place. You mentioned earlier that we have Terraform support already, and you’re exactly right. I’d love to have CDK, CloudFormation, and Pulumi support, and other kinds of contributions are more than welcome from the community.

So, if anybody feels like participating—if they see a feature that’s missing—let’s make this project the best that it can be. I suspect we can save many companies hundreds of thousands or millions of dollars. And this really feels like the right direction to go in.

Corey: This is easily a multi-billion dollar savings opportunity, globally.

Ben: That’s huge. I would be flabbergasted if that was the outcome of this.

Corey: The hardest part is reaching these people and getting them on board with the idea of handling this. And again, I think there’s a lot of opportunity for the project to evolve in the sense of different settings depending upon risk tolerance. I can easily see a scenario where, in the event of a disruption to the NAT instance, it fails over to the Managed NAT Gateway, but failback becomes manual, so you don’t have a flapping route table going back and forth, or a [hold 00:29:05] downtime or something like that. Because again, in that scenario, the failure mode is just, well, you’re paying four-and-a-half cents per gigabyte for a while until you figure out what’s going on, as opposed to the failure mode of disrupting connections on an ongoing basis, and for some workloads, that’s not tenable. This is absolutely, for the common case, the right path forward.

Ben: Absolutely. I think it’s an enterprise-grade solution, and the more knobs and dials we add to make it more robust or adaptable to different kinds of use cases, the better. But the best outcome here would actually be that the entire solution becomes irrelevant because AWS fixes the NAT Gateway pricing. If that happens, I will consider the project a great success.

Corey: I will be doing backflips like you wouldn’t believe. I would sing their praises day in, day out. I’m not saying reduce it to nothing, even. I’m not saying it adds no value. I would change the way that it’s priced because honestly, the fact that I can run an EC2 instance and be charged $0 on a per-gigabyte basis, yeah, I would pay a premium on an hourly charge based upon traffic volumes, but don’t meter per gigabyte. That’s where it breaks down.
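Corey’s complaint is easy to quantify. The sketch below uses illustrative us-east-1 list prices (roughly $0.045 per gigabyte processed plus $0.045 per hour for a Managed NAT Gateway); the NAT instance hourly rate is a hypothetical example, since the point is that the instance carries no per-gigabyte meter at all:

```python
# Rough monthly cost comparison: Managed NAT Gateway vs. a NAT instance.
# Prices are illustrative, not authoritative; check current AWS pricing.

HOURS_PER_MONTH = 730

def nat_gateway_cost(gb_processed: float,
                     per_gb: float = 0.045,
                     per_hour: float = 0.045) -> float:
    """Hourly charge plus the per-gigabyte data-processing meter."""
    return gb_processed * per_gb + HOURS_PER_MONTH * per_hour

def nat_instance_cost(per_hour: float = 0.086) -> float:
    """Flat hourly instance charge; no per-gigabyte processing fee."""
    return HOURS_PER_MONTH * per_hour
```

At 100 TB per month, the gateway’s data-processing fee alone is about $4,500, while the instance’s hourly cost stays flat in the tens of dollars, which is exactly why metering per gigabyte, rather than charging a traffic-tiered hourly premium, is where the pricing breaks down.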

Ben: Absolutely. And why is it additive to data transfer, also? Like, I remember first starting to use VPC when it was launched and reading about the NAT instance requirement and thinking, “Wait a minute. I have to pay this extra management and hourly fee just so my private hosts can reach the internet? That seems kind of janky.”

And Amazon established a norm here because Azure and GCP both have their own equivalent of this now. This is a business choice. This is not a technical choice. They could just run this under the hood and not charge anybody for it or build in the cost and it wouldn’t be this thing we have to think about.

Corey: I almost hate to say it, but Oracle Cloud does, for free.

Ben: Do they?

Corey: It can be done. This is a business decision. It is not a technical capability issue. Yes, it does incur cost to run these things; I understand that, and I’m not asking for things for free. I very rarely say that something is overpriced when I’m talking about AWS billing issues. I talk about it being unpredictable, I talk about it being impossible to see in advance, but the fact that it costs too much money is rarely my complaint. In this case, it costs too much money. Make it cost less.

Ben: If I’m not mistaken, GCP’s equivalent solution is the exact same price. It’s also four-and-a-half cents per gigabyte. So, that shows you that there are business games being played here. Like, Amazon could get ahead and do right by the customer by dropping this to a much more reasonable price.

Corey: I really want to thank you for taking the time to speak with me and for building this glorious, glorious thing. Where can we find it? And where can we find you?

Ben: alternat.cloud is going to be the place to visit. It’s on Chime’s GitHub, which will be released by the time this podcast comes out. As for me, if you want to connect, I’m on Twitter. @iamthewhaley is my handle. And of course, I’m on LinkedIn.

Corey: Links to all of that will be in the podcast notes. Ben, thank you so much for your time and your hard work.

Ben: This was fun. Thanks, Corey.

Corey: Ben Whaley, staff software engineer at Chime Financial, and AWS Community Hero. I’m Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry rant of a comment that I will charge you not only four-and-a-half cents per word to read, but four-and-a-half cents to reply because I am experimenting myself with being a rent-seeking schmuck.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.