Corey: This episode is sponsored in part by our friends at Fairwinds. Whether you’re new to Kubernetes or have some experience under your belt, and then definitely don’t want to deal with Kubernetes, there are some things you should simply never, ever do in Kubernetes. I would say, “run it at all.” They would argue with me, and that’s okay because we’re going to argue about that. Kendall Miller, president of Fairwinds, was one of the first hires at the company and has spent the last six years making the dream of disrupting infrastructure a reality while keeping his finger on the pulse of changing demands in the market, and valuable partnership opportunities. He joins senior site reliability engineer Stevie Caldwell, who supports a growing platform of microservices running on Kubernetes in AWS. I’m joining them as we all discuss what Dev and Ops teams should not do in Kubernetes if they want to get the most out of the leading container orchestrator by volume and complexity. We’re going to speak anecdotally of some Kubernetes failures and how to avoid them, and they’re going to verbally punch me in the face. Sign up now at fairwinds.com/never. That’s fairwinds.com/never.
Pete: Hello, and welcome to the AWS Morning Brief: Fridays From the Field. I am Pete Cheslock.
Jesse: I'm still Jesse DeRose.
Pete: We're still here. And you can also be here by sending us your questions at lastweekinaws.com/QA. We're continuing our Unconventional Guide to AWS Cost Management series, and today we're talking about moving data. It's not cheap, is it?
Jesse: No, it's definitely not cheap. It is expensive, and it's painful. And we're going to talk about why, today. And a reminder, if you haven't listened to some of the other episodes in this series, please go back and do so. Lots of really great information before this one and lots of really great information coming after this one. I'm really excited to dive in.
Pete: Yeah, look, they're all great episodes at the end of the day, right? They're just all fantastic.
Pete: If I do say so myself.
Jesse: All of the information is important; all of the information is individually important—I think that's probably the best way to put it. You can listen to all these episodes and implement maybe just a handful of things that work best for you; you can listen to all these episodes and implement all of them, all of the suggestions. There's lots of opportunities here.
Pete: If you do actually go and implement all of these suggestions, you really should go to lastweekinaws.com/QA and tell us about it. We'd be very curious to hear how it goes. But if you're struggling with any of these, just let us know as well. These are things that are measured in long periods of time.
It is rare that we run into engagements with clients where you can just click a box and save money. Now, don't get me wrong; there's a whole bunch of those, too. But if you want to just fundamentally improve how you're using the Cloud and how you're saving money, those projects are multi-year investments. All of this stuff takes a long time, and you've just got to manage those expectations appropriately.
And specifically around this topic, moving data, it is—as Jesse said—painful. It is expensive, especially in Amazon. They will charge you to move the tiniest bit of data literally everywhere, with, like, two minor exceptions. And it's just the worst. At Duckbill Group, we've kind of become these experts on data transfer and data storage costs, and on understanding the complexity around them. And I feel like a lot of times folks only think about the storage being the biggest driver of their spend.
Pete: You know, you never delete your data. But you put it all on S3, right, Jesse? Like that's a cheap place to put your data.
Jesse: Absolutely. Worthwhile. Put it in S3 standard storage, call it a day. I'm done, right?
Pete: Yeah, just do my little, like, wipe my hands, and go on, and we're good. Most people put it in standard storage, just like most people use gp2 EBS volumes; that's the standard everything. And that could be a big driver of cost, but more likely the larger driver (because it's a little bit more hidden, it's a little bit more spread around your entire bill) is the transferring of data, the moving data around. And I say moving specifically because there are some services that are charged via I/Os, via actually putting data into them or taking data out, not just the data transfer.
Jesse: I think it's also really important to call out that most companies that move into the Cloud don't realize that data transfer is something that AWS will charge you for, so I want to make that explicitly clear. As Pete mentioned, in almost every case moving data around, AWS will charge you for that versus in a data center environment where that's kind of hidden, that's not really explicitly a line item in your bill. And here, it absolutely is a line item in your bill and absolutely should be thought of as an important component to optimize.
Pete: Exactly. In the data center world, for any of the folks out there that are in data-center land, or maybe hybrid-cloud land, your networking costs are, I mean, largely a sunk cost. You've got your switches and your lines that run, maybe you get charged for the cross-connects and for transferring data to other areas and things like that. But within your racks, within your own secure domains, you don't have to really think about the cost of those network communications because it's already paid for. And you're definitely not charged at a per-gigabyte level like you are on Amazon.
Jesse: So, we talked about this a little bit before in a previous episode, when we talked about context is king. Context for your application infrastructure is really, really important; understanding how your application interacts with other applications within your cloud infrastructure ecosystem; how your data moves between workloads. All of these things are really, really important, and so specifically, when we talk about data transfer, it's really important to not just understand how your data is moved around, but why your data is moved around. So, we really like to suggest working with all of the teams within your organization. Again, product, potentially legal, maybe IT, to understand your data movement patterns and the business requirements for those data movement patterns.
Why does your data need to move multiple times within an availability zone? Why does it need to move between regions? Do you need to have data that is copied across multiple availability zones? Do you need that data to be cross-region? These are some examples of really important questions to ask to understand, do you need to continue transferring that data? Because the more you can optimize the way that that data is moving around within AWS, the less money you'll ultimately spend.
Pete: Yeah, and this ties into, again, as you've noticed, a recurring theme: all of these episodes in this series tie into one another; they build on top of each other in many ways. You can independently do these things, but they can compound and bring you bigger benefits. And so in a previous episode, we talked about your network architecture diagrams and how you could overlay costs, but you should overlay the data flows on top of there as well. Again, those data flows will have an inherent cost to them. And Jesse, I love that you pointed out talking to legal, because there are potentially risk and compliance requirements as it relates to your data and data transfer. Think about—
Pete: —GDPR and keeping data inside of certain regions, or, from the risk and compliance side, keeping your data actually in many regions, replicating it to other regions. And not that you shouldn't ever replicate your data, but I think what's important—and the biggest thing about a lot of this stuff that we're talking about—is providing knowledge on the cost of a decision. So, think about your business, and they've made this decision to replicate all their data into five availability zones. Okay, well, that will have a cost to it. If no one knows what that cost is, they can't make an updated decision.
So, when your boss is coming over your desk and screaming at you—well, not to your desk because COVID—but popping into your Zoom and, “Why is this bill so high?” And, “What is going on?” If you have that knowledge and say, “Well, here are the places where we spend our money, and these are the decisions, from a product risk perspective, that have driven these costs.” Any one of these decisions can be changed, right? Nothing is set in stone. They're all just different things that businesses have to think about.
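To put a rough number on "the cost of a decision," here is a minimal sketch of the arithmetic for the replication example. Everything in it is an assumption for illustration: the per-gigabyte rate, the monthly volume, and the simplification that every extra copy lives in a different availability zone. Swap in the real figures from your own bill.

```python
# Illustrative only: this rate is an assumed parameter, not a quoted AWS price.
ASSUMED_CROSS_AZ_RATE_PER_GB = 0.02  # hypothetical: charged on both sides of the transfer


def monthly_replication_cost(gb_written_per_month: float,
                             total_copies: int,
                             rate_per_gb: float = ASSUMED_CROSS_AZ_RATE_PER_GB) -> float:
    """Estimate the transfer cost of shipping each written GB to every extra copy."""
    if total_copies < 1:
        raise ValueError("total_copies must be at least 1")
    extra_copies = total_copies - 1  # the first copy is written locally
    return gb_written_per_month * extra_copies * rate_per_gb


if __name__ == "__main__":
    # Hypothetical workload: 50,000 GB written per month.
    for copies in (1, 3, 5):
        cost = monthly_replication_cost(50_000, copies)
        print(f"{copies} copies: ${cost:,.2f} per month in replication transfer")
```

The point is not the exact dollar figure but having a number to bring to the conversation when someone asks whether five copies are really required.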
Jesse: I think it's also really important to call out that as you and possibly your team or the individuals that you work with from other teams start to have these conversations, start also thinking about the best practices that you want to see within your organization for data transfer, whether that is specifically for the data transfer for a specific type of solution, like distributed stateful systems, like your Cassandra, your Kafka, make sure you can start to create best practices for these solutions. And really start to build communities of practice within your organization to decide how best to implement these best practices. So, for example, we worked with a client previously who did a lot of data compression on disk for their Cassandra cluster, but the data was essentially not compressed when it was moved between components of the cluster or between regions when it was replicating. And so there was all this data transfer that was just flying around uncompressed, costing a lot of money. And the team that was managing the cluster really wanted to get some best practices in place and the teams that were sending data to the cluster and reading data from the cluster wanted to get some best practices in place, but nobody really understood whose responsibility it was to put those best practices in place.
And this is a fantastic opportunity to build that community of practice together to make sure that everybody knows what those best practices are and build those best practices together to ultimately bring those costs down.
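As a small illustration of why uncompressed traffic is costly, the sketch below compares the bytes that would cross the wire with and without compression. It uses Python's built-in zlib purely because it ships with the language; it is not what the client's Cassandra cluster used, and the payload is a made-up, highly repetitive sample, so real ratios will differ.

```python
import json
import zlib

# Hypothetical payload: repetitive JSON records like a data pipeline might ship.
records = [
    {"host": f"web-{i % 10}", "status": 200, "path": "/api/v1/items"}
    for i in range(5000)
]
raw = json.dumps(records).encode("utf-8")

# zlib is a stand-in codec chosen for portability, not a recommendation.
compressed = zlib.compress(raw, level=6)


def transfer_ratio(original: bytes, packed: bytes) -> float:
    """Fraction of the original bytes that still cross the wire after compression."""
    return len(packed) / len(original)


if __name__ == "__main__":
    print(f"raw bytes:         {len(raw):,}")
    print(f"compressed bytes:  {len(compressed):,}")
    print(f"bytes on the wire: {transfer_ratio(raw, compressed):.1%} of original")
    # Compression must be lossless for this to be safe in a data pipeline.
    assert zlib.decompress(compressed) == raw
```

Since data transfer is billed per gigabyte, whatever fraction of bytes you stop sending is roughly the fraction of that line item you stop paying.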
Corey: This episode is sponsored in part by CircleCI. CircleCI is the leading platform for software innovation at scale. With intelligent automation and delivery tools, more than 25,000 engineering organizations worldwide—including most of the ones that you’ve heard of—are using CircleCI to radically reduce the time from idea to execution to—if you were Google—deprecating the entire product. Check out CircleCI and stop trying to build these things yourself from scratch, when people are solving this problem better than you are internally. I promise. To learn more, visit circleci.com.
Pete: Yeah, I remember at a previous company, we were handling large amounts of data. And I'm pretty sure we had compression enabled for some of our data pipeline activities and consuming data from a variety of sources, and I still remember the day that we migrated over to Snappy compression from whatever we were using prior. Just those graphs dropped—data transfer graphs—dropped by so much. We legit thought we broke something. But those graphs going down dramatically, luckily we didn't break anything, but watching those graphs go down so dramatically also meant my bill went down dramatically on data transfer.
And that's a really good point. People miss out on compression a lot. In some open source applications, it wasn't an option. Like, you couldn't do it. Looking at you Elasticsearch.
Now you can, I believe; in some newer versions, there's some compression there. Cassandra, I know, had some issues with that for a while. We see it a lot with things like Kafka as well: people are not configuring their consumers with it. So that's definitely something to look at. I always like to talk about some of the actual things that you could look at in your organization to improve your data transfer spend.
A couple of other places I like to call out as well is—especially when understanding data flows—is one of my favorite—and I'm using air quotes right now—“Favorite” services is the NAT gateway, the wonderful NAT gateway service, which kudos to the Amazon product owner there who must sleep on a bed full of $100 bills, and on their hundreds of millions of dollars of yacht because they are making an obscene amount of money on this service that does very little, in my mind; in my personal opinion. And we do look at a lot of Amazon bills, and there's a lot of folks spending millions of dollars a year for NAT gateways. So, you really have to ask yourself what is that service providing? There's a lot of folks online that probably talk about, “Well, you have to run your instances in a private VPC because of security reasons.” And sure, there's probably some reasons for that, but if you have some instances that are constantly communicating with the public internet, then those may actually be better off in a public subnet. You have security groups, right? Firewalls, these exist.
Jesse: I think it's also important to call out here that you can look at the actual data that is traversing your managed NAT gateway to decipher how much of it is internal AWS traffic to other AWS services like S3, DynamoDB, and other services that you can move to a VPC gateway instead. Now granted, some of those VPC gateways are free, some of those VPC gateways do charge you for the amount of time that they're run and the amount of data that is traversing that gateway, but it is absolutely worth running the numbers to see if the amount of data that you're sending to some of these internal AWS services can move to a VPC gateway, because we've seen clients save lots of money by keeping all that traffic internal rather than sending it out through a managed NAT gateway to the public internet and then back into AWS, through whatever AWS service gateway.
Pete: Yeah. And the two services that are actually free—you can go enable these right now—are for S3 and Dynamo. So, imagine these two scenarios. You have a server in a private VPC with a NAT gateway. That virtual private network, that VPC that that server is in is truly a secure network.
And when you're communicating with these other Amazon services, you are leaving. You're going to the public internet to talk to S3, which means you're traversing that NAT gateway. So, if you have a service that is, maybe, pushing a lot of big binary content to S3, you're going to get charged not only for normal data transfer costs if it crosses AZ boundaries, but you're going to also get the four and a half cents per gigabyte added fee on top of that. And you can literally avoid that entire fee by—this is one of those click-box things: you click a box and you enable an S3 endpoint to allow that service secure communication to S3.
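Here is a minimal sketch of that arithmetic, using the four and a half cents per gigabyte quoted above. The monthly volume and the share of traffic bound for S3 are hypothetical inputs, and the function names are made up for illustration; it ignores the NAT gateway's hourly charge and any other data transfer fees.

```python
# Per-GB NAT gateway processing fee as quoted in the episode; verify for your region.
NAT_PROCESSING_RATE_PER_GB = 0.045


def nat_processing_cost(gb_per_month: float,
                        rate_per_gb: float = NAT_PROCESSING_RATE_PER_GB) -> float:
    """Processing fee for data pushed through a NAT gateway in a month."""
    return gb_per_month * rate_per_gb


def savings_from_s3_endpoint(total_nat_gb: float, s3_share: float) -> float:
    """Fee avoided if the S3-bound share of NAT traffic uses a free S3 endpoint."""
    if not 0.0 <= s3_share <= 1.0:
        raise ValueError("s3_share must be between 0 and 1")
    return nat_processing_cost(total_nat_gb * s3_share)


if __name__ == "__main__":
    monthly_gb = 100_000  # hypothetical: GB per month through the NAT gateway
    s3_fraction = 0.6     # hypothetical: share of that traffic going to S3
    print(f"NAT processing fee: ${nat_processing_cost(monthly_gb):,.2f}")
    print(f"avoidable via S3 endpoint: "
          f"${savings_from_s3_endpoint(monthly_gb, s3_fraction):,.2f}")
```

The share of NAT traffic that is really S3 or DynamoDB traffic is the number worth finding out, because that portion of the fee goes to zero once the endpoint is in place.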
It's almost like they're trying to force people to not use NAT gateways. Maybe NAT gateway, the architecture inside Amazon is just so terrible, that they're actively charging an obscene amount of money to get people to not use it. But it's clearly not stopping anyone, you can tell that I'm very salty about NAT gateways. And just from my own personal experience, I remember I used to run my own NAT instances. That's what you did before NAT gateway was a thing.
You spun up an instance. All of mine were t class whatevers. Because, again, most of my instances weren't doing heavy communication out to the internet; I didn't need a ton of bandwidth, so I would spin up these t2s, I'd run my own NAT instances, they were inside an auto-scaling group. If they died, they came back, it was the easiest thing ever. And then one day NAT gateways came out and I thought, “Oh, well, it's a reason to just run less EC2. That's totally fine.” So, I flipped all those services over. And then in a few months, I'm like, “Why am I spending $10,000 a month for these NAT gateways?” I was spending, like, $500 a month for my t class instances. And so I moved everything back. I'm just like—it's annoying. It's my most hated service at Amazon.
Jesse: Thanks, I hate it.
Pete: [laugh]. So don't give them your money for it. You can solve this problem: set up those endpoints like we're talking about. We've seen so many clients save just tons of money by setting up those endpoints in their VPC to talk to other Amazon services. Or turn on flow logs for a couple of hours. Don't turn them on for a long time—
Pete: And surely do not send them to CloudWatch. Send them to S3. You can query this stuff in Athena; there are tons of posts on the Amazon blog and in the documentation that will teach you how to query those. And you can start looking to see, where are my things connecting to? Spoiler alert, if you're a Datadog customer, they're talking to Datadog.
Jesse: And I think that's a really quick, fun thing to point out, too, which is that some third-party software solutions that are also on AWS have their own PrivateLink gateways that you can configure and connect to, so that you don't have to send your data out through the public internet and then back in through their connectors; you can send the data directly to them through additional VPC gateways.
Pete: Yeah, those private links are just a great service. It will be a lot cheaper to send your data to those third-party vendors that way, and they're honestly more secure. One thing to keep in mind though, again—I love giving these actionable tips if at all possible—is if, let's say, your vendors can only take their connections in from us-east-1 and you’re in another region, you will have to pay to cross an AZ boundary, but in almost all scenarios—there's always some exceptions—it's still going to be cheaper to cross an AZ boundary and to send to the private link. Because again, NAT gateways are just so prohibitively expensive.
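To show the kind of question you'd be asking of those flow logs, here is a small local sketch in Python rather than Athena. It assumes records in the default version 2 flow log format (space-separated, fourteen fields) and sums bytes by destination address. The sample lines use made-up account, interface, and documentation-range IP values.

```python
from collections import Counter

# Field order of the default (version 2) VPC Flow Logs record format.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def top_destinations(lines, limit=3):
    """Sum bytes per destination address and return the heaviest talkers."""
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # not a default-format record
        record = dict(zip(FIELDS, parts))
        # Header rows and NODATA/SKIPDATA records carry no usable byte counts.
        if record["log_status"] != "OK" or not record["bytes"].isdigit():
            continue
        totals[record["dstaddr"]] += int(record["bytes"])
    return totals.most_common(limit)


# Hypothetical sample records for illustration only.
SAMPLE = [
    "2 123456789012 eni-0abc 10.0.1.5 203.0.113.10 44321 443 6 20 48000 1620000000 1620000060 ACCEPT OK",
    "2 123456789012 eni-0abc 10.0.1.6 203.0.113.10 44322 443 6 25 52000 1620000000 1620000060 ACCEPT OK",
    "2 123456789012 eni-0abc 10.0.1.5 198.51.100.7 44323 443 6 5 9000 1620000000 1620000060 ACCEPT OK",
    "2 123456789012 eni-0abc - - - - - - - 1620000000 1620000060 - NODATA",
]

if __name__ == "__main__":
    for address, total_bytes in top_destinations(SAMPLE):
        print(f"{address}: {total_bytes:,} bytes")
```

Once you know which destination addresses dominate, you can work out which ones belong to AWS services or vendors that could be reached through an endpoint instead of the NAT gateway.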
Jesse: Now that we've written a love letter to VPC gateways, I feel like, I want to spend a few minutes talking about the hidden costs of I/O before we wrap this up and send you out into the world to look at all of your data transfer bit by bit.
Pete: And become horrified.
Pete: So, we talk about moving the data, but when you move the data, you have to make, usually, an I/O; you have to make some sort of communication. And there are more and more services that are charging based on those I/Os, Aurora being probably the largest, I don't know, maybe most popular, Jesse, if you think of it that way.
Pete: These I/O costs, they can be hard to predict, but we've definitely seen scenarios where folks are ingesting data into their Aurora from S3, and all of those I/Os, all of those writes, they're going to get charged for: the I/Os, plus the storage, plus the engine. So, you have these three vectors that you're being charged on.
And so you really need to start thinking about these usage patterns. How are writes happening? How are reads happening? What is the size of those I/Os as well? You'll have to dive into the documentation to figure out how best to optimize because of how, again, they charge for these I/Os. But if you are constantly reloading data into an Aurora database, you're getting additionally charged for all of this data movement. The movement is causing these I/Os to occur.
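A rough sketch of how repeated reloads add up when a service bills per I/O request. The price per million requests and the workload numbers below are assumptions for illustration, not quoted Aurora pricing, and the sketch covers only the I/O vector, not storage or the engine.

```python
# Hypothetical rate for illustration; check current pricing for your region and engine.
ASSUMED_PRICE_PER_MILLION_IOS = 0.20


def io_cost(io_requests: int,
            price_per_million: float = ASSUMED_PRICE_PER_MILLION_IOS) -> float:
    """Cost of a given number of billed I/O requests."""
    return io_requests / 1_000_000 * price_per_million


def monthly_reload_cost(ios_per_reload: int, reloads_per_day: int,
                        days: int = 30) -> float:
    """I/O cost of reloading the same dataset on a fixed daily schedule."""
    return io_cost(ios_per_reload * reloads_per_day * days)


if __name__ == "__main__":
    ios_per_reload = 500_000_000  # hypothetical: I/Os generated by one full reload
    for reloads in (1, 4):
        cost = monthly_reload_cost(ios_per_reload, reloads)
        print(f"{reloads} reload(s) per day: ${cost:,.2f} per month in I/O charges")
```

The cost scales linearly with how often the data is reloaded, which is why the usage pattern matters as much as the size of the dataset.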
Jesse: Yeah, and this really makes the case for a data warehouse, which I hate the term—or data lake, whatever the hot new phrase is that all the kids are using. But it makes sense. You can keep your data in one place and set up access for different teams to be able to access different parts of that data. Data transfer in and out of S3 is free, and then you can use all of the query functionality of Athena or other tools within AWS to do all of the queries you need and all of the calculations you need with that data.
Pete: Yeah, I agree. Data warehouse, data lake. These are stupid names.
Jesse: [laugh]. I’m glad that that's the thing that you agree with me on.
Pete: [laugh]. Yeah, exactly. That's the only thing, Jesse. I actually had a joke at a conference I did. I hated these—I was basically preaching to this audience about how you should turn all of your monitoring into a data lake: your logs, your metrics, your traces, everything, and centralize it for query for analysis.
And I just hated the term so much that I just wanted to come up with something new that had equally no meaning, so I call it a bagel, a data bagel. And, again, it has no meaning, so what does it matter?
Jesse: So, is that like, a security around the outside, and then the massive security hole through the middle?
Pete: You know, we could spend a whole episode analyzing my nonsense. Not today, Satan. So it's true, though, right? Exploit the fact that you can transfer into and out of S3 from many or most Amazon services for free, and use that to your gain, use that to your benefit, to centralize as much data as possible into that one place. Some of the most mature organizations we work with push all of their data as much as possible into S3, and then will pull it into other services for query.
I mean, you could do ad hoc queries with Athena, but you can ingest it into a Redshift cluster and do some analysis. If you're a Snowflake user, obviously, they can suck your data right out of your S3 into Snowflake for analysis. That data transfer is free. Crossing an AZ boundary is not free. Again, think about your data flows.
Can you have instances pushing data to S3 where other instances can pull that data out? It's an incredible service. It's shocking that there's anything for free in Amazon, so when you can find those slight benefits, exploit them for the greatest gain possible.
All right, well, I thoroughly enjoyed just hating on NAT gateways and I could probably keep going for a long time. But we will save that for another time. If you did enjoy this podcast, please go to lastweekinaws.com/review, give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, go to lastweekinaws.com/review and still give us a five-star rating, but then also go to lastweekinaws.com/QA and give us your question, give us your feedback. We would love to answer them in a future episode. Thanks again.