Episode Summary

Join Pete and Jesse as they take a question from the field and talk about their experiences optimizing big data projects in the cloud. They touch upon how big data challenges are challenging whether you’re talking about terabytes or petabytes, the most popular services for big data projects in AWS, how people are essentially digital hoarders today and never throw any data out, why Pete believes more people should take advantage of Glacier Deep Archive, tricks for optimizing Parquet files, what the Kinesis outage meant for many Duckbill Group clients, why you may need to rethink your approach to compression, how Jesse thinks not enough clients use spot instances, and more.

Episode Show Notes & Transcript



Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.

Pete: Hello, and welcome to the AWS Morning Brief: Fridays From the Field. I'm Pete Cheslock.

Jesse: I'm Jesse DeRose.

Pete: We're back again. Hashtag Triple F.

Jesse: It's going to be a thing.

Pete: We're still trying to make it a thing. Desperately trying to make it a thing. Otherwise, we're just going to look like fools, Jesse, if it's not a thing.

Jesse: Oh now, I wouldn't want to look like a fool, you know, next to anybody else in my company.

Pete: [laugh]. It definitely seems to be the one trait you need to have to work at Duckbill: to be okay looking like a fool. So, we are midway through the Unconventional Guide to AWS Cost Optimizations, cost savings. And we have been sharing a link on most, if not all, of these recordings where you can send us feedback. And you can send us questions. And someone finally sent us a question. I think people are listening out there, Jesse. Isn't that great?

Jesse: We have one follower. Yay.

Pete: It's amazing. So, we are really happy that someone asked us a question. You can be the next person to ask us a question by going to lastweekinaws.com/QA. That's not our quality assurance site for testing new branding things and new products. QA is for ‘question and answer.’

So, go there, check it out, drop in a message, you can put your name there or not, it's totally fine. But this first question—well, first, I need to actually, I need to admit something. I'm lying right now. This question actually came in months ago. We saw it and thought that was a great question, we should answer it at some point. And then we forgot about it. So we're bringing it back up again, and I think it's relevant so I don't feel too bad about it.

Jesse: Yeah, we saw this question around the time that we started recording the entire Unconventional Guide series. And apologies to this listener. This is a very good question. We want to talk about it, so we are talking about it today. But it took a little bit of time for us to get to this.

Pete: But you know what? We made it. We're here.

Jesse: We’re here.

Pete: We're here. So, Nick Moore asked this great question. He said, “Hey, Pete and Jesse. Very much enjoying your Friday segment on the Morning Brief.” Thank you very much for that. “If possible, I'd like to hear you talk about your experiences with cost optimization for quote, ‘big data’ projects in the cloud, i.e., using platforms like Hadoop to process large and complex data, either using PaaS, like EMR, or IaaS. Is this something that your customers ask about often/at all? And how do or would you approach it? Thanks, again.”

Well, hey, this is a truly awesome question. And at a high level, many of our clients actually are pretty heavy users of various Amazon services for their, kind of, big data needs. And big data, it's all relative, right? I mean, to some companies, big data is in the hundreds of terabytes, to other companies it's in the hundreds of petabytes. It's totally relative, but at the end of the day, it's going to be a challenge, no matter how big of a company you are. Your big data challenges are always a challenge.

Jesse: You've got some kind of data science or data analytics work that you want to do with large data sets. That may be large datasets comparatively to the work that you're doing; that may be large data sets comparatively to the industry. Doesn't matter. Either way, it is big data projects, and there are many, many, many, many solutions out there.

Pete: What's interesting, too, is I think the reason this has grown in prevalence over the last year, with more of our clients using more of these services, is simply that the barrier to entry on these projects, on these engagements, is so low. You can get started on Amazon with some Athena and Glue, maybe some EMR, for just an incredibly low cost. And also, from a technical standpoint, it's not that challenging. I mean, as a good example, most reasonably technical people could take their cost and usage report and get it integrated into Athena using AWS Glue in minutes. I mean, without using CloudFormation. I mean just clicking through to set it up. And honestly, for some clients, their cost and usage reports are themselves a big data problem. If you're not storing them in Parquet, if you're actually storing them in CSV because you're a mad person, those could be hundreds of gigabytes a day in volume.

Jesse: Yeah. So, when we talk about big data tasks, there's a couple different services that we generally see folks using within AWS. We generally see S3, Kinesis, and most obviously, EMR.

Pete: Yeah, exactly. And we're seeing new services like Kinesis, expanding on Kinesis: Kinesis Firehose, when that came out; people are using that for some of their big data needs, especially when trying to stream data into S3. That's a really powerful feature that Firehose can do. And then, once it's in S3, the question that our clients often ask is, kind of, “What do I do with it now?” And if we dive into just S3, and you've got your data in S3, where are the kinds of places that we see unnecessary charges for data warehouse tasks? 

Jesse: Honestly, it's unfortunately both of the major places that you're going to be charged for S3: your storage costs and your requests.

Pete: So, what you're saying is that all S3 charges are unnecessary. [laugh].

Jesse: Just get rid of it. Just put all that on an EBS volume somewhere. Turn off your S3, you're solid.

Pete: Exactly. It is kind of funny, but it's true. I mean, there's ways to abuse both of those pricing models, whether it's storage or requests. The first place that we honestly see a lot of this is just people are data pack rats. And let's be honest; I'm one of them as well, I have a NAS setup at home with, like, 30 terabytes of hard drives on it. 

I don't throw anything away digitally. Turns out most of our other clients are the exact same way, and sadly, a lot of them use standard storage for S3, which we talk about often. It's common: you get started with the standard storage, that's a great place. But for big data tasks, it's often the wrong storage solution, especially for data that maybe has already been transformed and is stored in a more efficient format; maybe it's queried infrequently. There's two ways to solve this one. 

Obviously, intelligent tiering can be a big help to automatically tier your data to the right location. But another thing that you can do, if you're already running some EMR clusters, is set up a Spark task to automatically tier data to lower-cost locations really easily, and then you can avoid the intelligent tiering monitoring costs, kind of using infrastructure you already have. So, the key thing that I always like to point out is when you're done with the data, move it to cheaper storage or delete it. I mean, Glacier Deep Archive and deleting it are almost the same price. Like, that's how cheap Glacier Deep Archive is. If you're not sure if you're going to need it, or maybe compliance says you're going to need it, just deep archive it and move on with your day. But whatever you do, don't just leave it on standard.
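As a rough illustration of the storage point above, the sketch below compares monthly storage cost on Standard versus Glacier Deep Archive and builds a lifecycle rule that would move older objects there. It uses an S3 lifecycle rule rather than the Spark job Pete mentions, and the per-GB prices, the `processed/` prefix, the 90-day threshold, and the helper names are all assumptions for illustration. Check current AWS pricing, and note that Deep Archive also carries retrieval fees and a minimum storage duration that this ignores.

```python
import json

# Assumed list prices in USD per GB-month (illustrative only; verify current pricing).
STANDARD_PER_GB_MONTH = 0.023
DEEP_ARCHIVE_PER_GB_MONTH = 0.00099


def monthly_storage_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Return the monthly storage cost in USD for a given amount of data."""
    return size_gb * price_per_gb_month


def deep_archive_lifecycle(prefix: str, days: int) -> dict:
    """Build a lifecycle configuration that transitions objects under
    `prefix` to Glacier Deep Archive once they are `days` old."""
    rule_id = "deep-archive-" + prefix.strip("/").replace("/", "-")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [{"Days": days, "StorageClass": "DEEP_ARCHIVE"}],
            }
        ]
    }


if __name__ == "__main__":
    size_gb = 100 * 1024  # hypothetical 100 TB of already-transformed data
    print("Standard:    ", round(monthly_storage_cost(size_gb, STANDARD_PER_GB_MONTH), 2))
    print("Deep Archive:", round(monthly_storage_cost(size_gb, DEEP_ARCHIVE_PER_GB_MONTH), 2))
    print(json.dumps(deep_archive_lifecycle("processed/", 90), indent=2))
```

A dictionary shaped like this could be passed as the `LifecycleConfiguration` argument to boto3's `put_bucket_lifecycle_configuration`, which keeps the tiering policy in S3 itself instead of in a job you have to run.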

Jesse: Yeah, so then if you think about this from the request perspective, there's a lot of get and put requests when working with data in S3. Obviously, you are putting data into S3, you're pulling data out of S3, you're moving data around. We see this a lot, especially when folks use the Parquet file format. Now, again, we do recommend the Parquet file format, but there are ways to optimize how large your Parquet files are. So, for example, imagine your Parquet files are around 100 megabytes in size.

To complete a query, you need to make about 10 get requests to access 10 gigabytes of data. But if you right-size your Parquet files to about 500 megabytes to 1000 megabytes in size, you can cut those requests by 50 to 90 percent. And in many cases, we've seen clients implement this without any impact to their production workloads. So, keep in mind that it's not just about moving the data—getting and putting the data—it's about how often are you getting and putting the data? How large are those requests sizes?

Pete: Yeah, exactly. Because I know there's someone out there that's probably doing the math on what you just said, I think you meant 10 gets to access one gigabyte of data versus 10 gigabytes of data. Someone out there is doing the math on that. They're about to go to lastweekinaws.com/QA to tell us about it. But hopefully, I have preempted that.

Jesse: Yes, thank you, thank you.
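The request math in this exchange can be sketched in a few lines. This is a simplified model that assumes one GET per file, which real Parquet readers do not strictly follow (they often issue several ranged reads per file), and the function names are hypothetical. It uses Pete's corrected figure of roughly one gigabyte of data, rounded to 1000 megabytes.

```python
import math


def get_requests(dataset_mb: int, file_mb: int) -> int:
    """Number of files needed to hold `dataset_mb` of data, which equals
    the number of GETs under the one-GET-per-file assumption."""
    return math.ceil(dataset_mb / file_mb)


def reduction_pct(dataset_mb: int, old_file_mb: int, new_file_mb: int) -> float:
    """Percentage drop in GET requests when files grow from
    `old_file_mb` to `new_file_mb`."""
    old = get_requests(dataset_mb, old_file_mb)
    new = get_requests(dataset_mb, new_file_mb)
    return 100.0 * (old - new) / old


if __name__ == "__main__":
    dataset_mb = 1000  # roughly one gigabyte
    print(get_requests(dataset_mb, 100))          # 10 GETs with 100 MB files
    for new_size in (200, 500, 1000):
        print(new_size, reduction_pct(dataset_mb, 100, new_size))
```

Under these assumptions, 200 MB files halve the request count, and 500 MB to 1000 MB files cut it by 80 to 90 percent, which is in line with the range Jesse quotes.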

Pete: But of course, it's an important point that we've actually seen in some scenarios, the request costs exceed the storage cost for a bucket. That is what we would consider to be an outlier in spend, and it's not needed in a lot of cases. So, think about that. Think about your file sizes. And likely you can increase the size of it. 

And this is a good test as well. This is something you can try out. Try different file sizes, try different queries, see how it impacts performance. You could get some pretty dramatic cost savings by adjusting it, like you said, Jesse. So let's talk about Kinesis, which is great. It's, like, one of the best services. I really love it—well, I think most people thought it was the best service prior to that—

Jesse: Yeah…

Pete: —outage last year.

Jesse: Yeah…

Pete: A lot of folks that we spoke with, a lot of our clients, they were all in on Kinesis. And honestly, the outage has a few of them just maybe slowing, pumping the brakes a little bit, or rethinking their usage. So, that sentiment has changed a little bit. But we have seen a couple of places where Kinesis savings can be had by just looking a little bit closer on how you're using it. So what's one of the first things that we've seen, Jesse, around some of the Kinesis cost savings?

Jesse: One of the biggest things we've seen is Kinesis data duplication. You have data that needs to go to different places to be tweaked, analyzed, moved about this way and that, but ultimately, it's all coming from the same data source. If you have multiple Kinesis streams that all contain the same data, you're paying for each of those streams individually. You don't need to. Instead just have a single Kinesis stream with the enhanced fan-out option, which essentially allows you to have that single source of data, but then there could be multiple consumers that are receiving that data for their analysis purposes.

Corey: This episode is sponsored in part by CircleCI. CircleCI is the leading platform for software innovation at scale. With intelligent automation and delivery tools, more than 25,000 engineering organizations worldwide—including most of the ones that you’ve heard of—are using CircleCI to radically reduce the time from idea to execution to—if you were Google—deprecating the entire product. Check out CircleCI and stop trying to build these things yourself from scratch, when people are solving this problem better than you are internally. I promise. To learn more, visit circleci.com.

Jesse: And similar to S3, we've also seen really high storage costs in Kinesis streams. Think about reducing the retention for your non-critical streams. Think about, do you ultimately need all of the same amount of data that you're sending through each of these streams. In some environments, you may need all of the data in every stream possible, but in some environments, you might not. Some environments, you might be using maybe just a smaller subset of the data, so you don't need to move all of that data from place to place. 

And don't forget about compression; compression is something that we've seen many, many clients either not enable, forget to enable, or maybe they just don't have the best practices in place. Maybe it's something that nobody has stood up and said, “Hey, this is how we want to move our data, and this is important for us to optimize this spend.” Start doing that today. Be the first person in your team or your organization to say, “Let's put data compression on our Kinesis streams.” It will help you save money. Also consider that there are binary formats like Avro, Thrift, or Protocol Buffers. I mean, if you just end up shoving uncompressed JSON into Kinesis, you're just going to have a really, really bad time.

Pete: Yeah, and we've seen that, we have, and it's amazing when you move from an uncompressed JSON into a binary format how much that reduces it. What's important too, I always like to call out is the downstream effects of compression and basically moving away from uncompressed JSON, which is large data transferred over the wire. There's downstream effects too in network data transfer and I/Os and all those other good things. 
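To make the size difference concrete, here is a minimal sketch that encodes the same made-up event records three ways: newline-delimited JSON, gzip-compressed JSON, and a fixed-width binary layout. The record fields are invented for illustration, and Python's `struct` module stands in for a real binary format such as Avro, Thrift, or Protocol Buffers, which would need third-party libraries.

```python
import gzip
import json
import struct

# Hypothetical mapping from event name to a one-byte code.
EVENT_CODES = {"page_view": 1}

# Little-endian, no padding: uint32 user_id, uint8 event, uint16 latency, uint64 ts.
RECORD_LAYOUT = "<IBHQ"


def make_records(n: int) -> list:
    """Generate n synthetic event records."""
    return [
        {"user_id": i, "event": "page_view",
         "latency_ms": 120 + i % 50, "ts": 1_600_000_000 + i}
        for i in range(n)
    ]


def as_json(records: list) -> bytes:
    """Newline-delimited, uncompressed JSON."""
    return "\n".join(json.dumps(r) for r in records).encode("utf-8")


def as_gzip_json(records: list) -> bytes:
    """The same JSON payload, gzip-compressed."""
    return gzip.compress(as_json(records))


def as_packed(records: list) -> bytes:
    """Fixed-width binary encoding of each record."""
    return b"".join(
        struct.pack(RECORD_LAYOUT, r["user_id"], EVENT_CODES[r["event"]],
                    r["latency_ms"], r["ts"])
        for r in records
    )


if __name__ == "__main__":
    records = make_records(1000)
    for name, encode in (("json", as_json), ("gzip+json", as_gzip_json),
                         ("packed", as_packed)):
        print(name, len(encode(records)), "bytes")
```

The exact ratios depend heavily on the data, so treat the output as a prompt to measure your own payloads, not as a benchmark.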

So, for some workloads, though, oftentimes, we actually do recommend Kafka. Now, it depends on how you're using Kinesis. There are some Kinesis features that do not directly translate over to Kafka. But for a lot of folks, Kafka, even the managed service for Kafka, that MSK service, is highly recommended. Because when it comes to scaling Kinesis, the unit of scaling is the shard: you have to add a shard for every megabyte per second, or thousand records a second, that you need to handle.

And the downside of this is if you have bursty workloads. We ran into one client who had to scale their Kinesis out to support a pretty bursty workload that they needed to ensure that they were always accepting in this data. And within a few second time period, they were bursting up to thousands, tens of thousands, hundreds of thousands of records a second, but then sitting idle the rest of the time; you obviously have to scale out to support that. So, in that scenario, actually using Kafka is a better option because it can handle a lot more data through it without requiring as much kind of shard scaling and cost associated. It's just in the end cheaper. 
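The shard arithmetic Pete describes can be sketched as follows, using the per-shard limits he quotes (one megabyte per second or a thousand records per second). The function name and the example workload numbers are hypothetical.

```python
import math

# Per-shard ingest limits as stated in the discussion above.
MB_PER_SEC_PER_SHARD = 1.0
RECORDS_PER_SEC_PER_SHARD = 1000


def shards_needed(peak_mb_per_sec: float, peak_records_per_sec: int) -> int:
    """Shards required to absorb a peak, taking whichever limit
    (bytes or record count) is hit first."""
    by_bytes = math.ceil(peak_mb_per_sec / MB_PER_SEC_PER_SHARD)
    by_records = math.ceil(peak_records_per_sec / RECORDS_PER_SEC_PER_SHARD)
    return max(1, by_bytes, by_records)


if __name__ == "__main__":
    # Hypothetical bursty workload: tiny records, huge record count at peak.
    print(shards_needed(peak_mb_per_sec=0.5, peak_records_per_sec=100_000))  # 100
    # The same stream at its idle baseline.
    print(shards_needed(peak_mb_per_sec=0.1, peak_records_per_sec=10))       # 1
```

Because a stream must be sized for its peak, the gap between those two numbers is roughly the capacity that sits idle between bursts.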

And so the other interesting thing, too, is MSK does not charge for the replication side of things like if you were to run Kafka yourself. So, before you go out there and say, “I’m going to run my own Kafka,” definitely plan on network data transfer costs as well because I will tell you from personal experience, they will be larger than you expect. So EMR. Let's talk about the elephant in the room. That's a Hadoop joke out there.

Jesse: [laugh].

Pete: So let's talk about EMR. What's the first thing that most people are not doing with EMR, Jesse?

Jesse: They are not using spot instances. And now I know what you're going to say: “But wait. Spot instances, aren't those the things that will ultimately just die whenever AWS needs the resources back and my workloads may be interrupted at any time?”

Pete: Yeah, that sounds terrible. I don't like that.

Jesse: Yeah. That's thankfully not quite the case anymore. EMR has integrated with spot in amazing ways, and specifically, there are amazing new features called spot fleets, and spot blocks that can help you guarantee your spot instances for one to six hours. It's a slightly higher price, but it's still less than you'd be paying for on-demand EC2 instances. It is absolutely worth looking into. 

Pete: Yeah, this one's a great one. Also, instance fleets are great because you can essentially augment on-demand with spots, and when the spots go away—even if you didn't use a spot block, but if the spots went away, they’d just get augmented with on-demand again. That's pretty powerful stuff. So, when using a spot block, though, it means that if you have a series of jobs that you know finish within three hours, then go and set up a three-hour spot block. You will have those instances available for those three hours. 

You don't have to pay for all three hours; you're still charged the normal per-second billing that you would see, but it will not be pulled out from underneath you. And again, the less time that you can commit to, the better your discount. If you are okay with an immediate interruption, that's going to be the cheapest way to run spot. But obviously, these spot blocks are a great way to get some predefined tasks. And use spot blocks for your master, your core, and your task nodes as well. 

Because again, if you're using those instance fleets, EMR will provision with on-demand when a spot goes away. So that's a really big one that a lot of people are missing out on. But it's not the only one. We also find that people are not adequately monitoring their jobs and their workload resource usage. This is just—it sounds crazy to say that, “Oh, my. People are over-provisioning EC2 instances.” I'm shocked.

Jesse: [laugh].

Pete: Just—right? Shocked. 

Jesse: “I mean, I just set it and forget it, right?”

Pete: [laugh]. But it happens even more so on EMR. We’ve found clusters that might be running for 30 minutes when the job is only running for five minutes. I mean, there are CloudWatch metrics that you can grab to identify these idle clusters. And this is free money. I mean, this is click button, save money, which—

Jesse: Absolutely.

Pete: —we love to see. But what are a couple other practices that we've often recommended to our clients, Jesse?

Jesse: Some other highlights: set your average CPU utilization for your instances to about 80% for your jobs; that's a big one. That's the sweet spot we've seen where our clients get the best bang for their buck utilizing these cluster resources. Also, try to aim for runtimes between one and three hours. Again, this is where we've really seen that efficiency sweet spot. If you can run your jobs in less than that time, fantastic.

Because again, as Pete said, you will only be paying for spot instances as long as that spot instance is active and running your workload. But if you can schedule your runtime for roughly one to three hours, that seems to be the ultimate spot instance sweet spot that we've seen.

Pete: Yeah, exactly. The other thing to do is audit the jobs that you've created and make sure that your engineering folks are not messing around with the Spark executor memory setting in either your Python or Scala code. Honestly, most of the time that setting really never needs to be changed if you're using EMR. Instead, change the instance type you're running on. And this is where a little bit of research on instance type usage can really yield huge savings.

Most folks are not using the right EC2 instance types when they use EMR. They just kind of pick one at random and then they just roll with it. But another thing to do is check your job memory profiles and adjust the instance type to match. So, an example we like to give our clients is, let's assume there's a fake instance type, like an m1-fake with ten cores and 64 gigabytes of RAM. And we want to assume that we want to keep about four gigabytes of overhead for the OS.

That's just a rough number, it could be anything depending on the OS you're using, but that's just going to leave us about 60 gigs of RAM available. So, if your executor is set for 6 gigs of RAM, then you'll have each node with ten of ten cores used. You'll have one job per core, each using 6 gigabytes of RAM. That is a fully-loaded instance. That's what you want. 

That's the ideal scenario. But if someone were to change that setting, from 6 gigs to 10 gigs—which you don't want to mess with that setting—now you can only run six of those jobs on that host, which means six of your ten cores are in use. That's only 60 percent, 40 percent waste. 40 percent of that host is just sitting there unused. So that's why we often say to not mess around with those memory settings. You probably want to change your instance type first. So that's the 20-minute fast guide to how to save on EMR. Right, Jesse?
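Pete's m1-fake example can be worked through in code. This is a simplified sketch that assumes one core per executor and ignores the extra per-executor memory overhead that Spark on YARN reserves, so real numbers will be somewhat lower. The function names are hypothetical.

```python
def executors_per_node(cores: int, ram_gb: int, os_overhead_gb: int,
                       executor_mem_gb: int, cores_per_executor: int = 1) -> int:
    """How many executors fit on one node, limited by whichever of
    memory or cores runs out first."""
    usable_gb = ram_gb - os_overhead_gb
    by_memory = usable_gb // executor_mem_gb
    by_cores = cores // cores_per_executor
    return min(by_memory, by_cores)


def core_utilization(cores: int, ram_gb: int, os_overhead_gb: int,
                     executor_mem_gb: int, cores_per_executor: int = 1) -> float:
    """Fraction of the node's cores that end up doing work."""
    executors = executors_per_node(cores, ram_gb, os_overhead_gb,
                                   executor_mem_gb, cores_per_executor)
    return executors * cores_per_executor / cores


if __name__ == "__main__":
    # The fake instance from the discussion: 10 cores, 64 GB RAM, 4 GB kept for the OS.
    print(executors_per_node(10, 64, 4, executor_mem_gb=6))   # 10 executors
    print(executors_per_node(10, 64, 4, executor_mem_gb=10))  # 6 executors
    print(core_utilization(10, 64, 4, executor_mem_gb=10))    # 0.6
```

Running the same function across candidate instance shapes, with the executor memory held fixed, is one way to see which shape leaves the fewest cores idle.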

Jesse: Yeah. And if you have other questions about this, please reach out to us. You can hit us up at lastweekinaws.com/QA, but also feel free to tag us on the social medias, especially on Twitter. We are happy to continue to talk about this more.

Pete: Yeah, absolutely. If you enjoyed this podcast, please go to lastweekinaws.com/review, give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review. Give it a five-star rating on your podcast platform of choice and tell us what you love most about Kinesis other than the fact that it went down horrifically that one time.

Jesse: [laugh].

Pete: Thanks, everyone.

Announcer: This has been a HumblePod production. Stay humble.