Episode Summary
Join Pete and Jesse as they take two questions from the field about practical approaches to applying some of their previous teachings to real-world scenarios. Listen in to learn why Pete believes Compute Optimizer is criminally underused, why teams should have a dedicated individual focused on cloud spend optimization instead of asking an engineer to take it on as a side project, how cloud finance teams are finally starting to emerge and why that’s a good thing, how it’s amazing to see an AWS bill go down because of a cloud finance team’s efforts, why you should put as many guardrails in place in your cloud environment as you can, and more.
Episode Show Notes & Transcript
Links:
- Unconventional Guide: https://www.duckbillgroup.com/resources/unconventional-guide-to-aws-cost-management/
Transcript
Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.
Pete: Hello, and welcome to the AWS Morning Brief: Fridays From the Field. We’re back again, my name is Pete Cheslock.
Jesse: I’m Jesse DeRose. So, happy to be back in the studio after our whirlwind tour of the Unconventional Guide that I feel like we’ve been on for roughly as long as the pandemic’s been going on at this point; probably a little bit less. But lots of really great content there that we were happy to talk about, and I’m happy to be moving on to some other topics.
Pete: Yeah, absolutely. And we actually get to move on to one of our favorite topics, which is answering your questions. And it turns out, Jesse, there’s more than two people that listen to us. There’s a lot of you; there are dozens of you out there, and we love it.
Jesse: You like me. You really like me.
Pete: So, great. So, great to see. We’ve been getting tons of fantastic questions, a few of which we’re going to answer right now. You can also have your question answered by going over to lastweekinaws.com/QA and entering your question there. You can enter in your name, or you can leave it blank, or you could just put something funny there. Anything works. We’re happy to dive in deeper on any particular topic, again, whether it’s about this recent Unconventional Guide series or just something you’re curious about in your day-to-day cost management life.
Jesse: Today’s questions are really great because they ultimately get at the practical side of all of our recommendations. Because I feel like every single time I subscribe to one of those self-help books or blogs and I read all these really great short, sweet tidbits, I think to myself, “This is perfect. I’ll go apply this to everything in my life.” But then doing the actual work part is so much harder. Where do you even start with that first step once you’ve got the big picture grand idea? So, today we’ve got some really, really great questions, focusing on the best ways to get started on your cloud cost management journey. So, let’s start off with these questions.
First question is, “Could you cover some practical approaches to applying some of your Cost Management Guide? A lot of your suggestions sound simple on paper, but in practice, they become quite complicated.” So, true. Absolutely, absolutely a concern. “I’ve had some success pulling in a small group of subject matter experts together for short periods of time focusing on low risk, easy things to do. How have you approached actually doing this? What meetings do you set up? What do you take for notes? How do you document your savings? How do you find new opportunities?” That’s from Brian O. Brian O., that’s a really, really great question.
The other one that I want to add to this: “We’re a big AWS shop, and I’ve spent some time inside the AWS beast in the past, and I still struggle with multi-account, multi-region data transfer in general, but specifically with analyzing cost and usage. For example, if data transfer out goes up $25,000 in a month, how do you attribute that? How do you know where to apply that? How do you know what ultimately prompted that spend? Love how you work through these types of challenges. What is relatively easy at a single account level gets exponentially more complex with every account and region we function in.” So, true. And that’s from Todd. Thank you, Todd. In both cases, absolutely true.
There’s this really great idea that we can give you the really short and sweet things to think about, but taking those first steps to practically apply these ideas is tough, and it needs to scale over time. And not every practice does.
Pete: Yeah, these are great questions. I, kind of, am remembering that meme that was around for a while, which was, how to draw an owl. “First, draw two circles, and then, you know, you draw the rest of the owl.”
Jesse: Yeah.
Pete: And honestly, oftentimes, some of the stuff even that we say, Jesse, feels that way, and it’s not intended to come across that way. It’s just, we could bore you all with a multi-hour-long recording on some of these topics. I mean, we do this with our clients, and our clients pay for this pleasure [laugh] for us to put them to sleep with our soft tones of the cloud cost management world. But I think the reality is that it is complex, and there are unlikely to be quick wins in a lot of these places. One thing that we have found is, honestly, monitoring, visibility, I think all the cool kids are calling it observability now—
Jesse: [laugh].
Pete: —you know, I can’t believe I’m going to say this, but CloudWatch is actually probably one of the best cloud cost reduction tools that exist out there. There are so many services within AWS that you’re probably using today that, by default, report data to CloudWatch. And those statistics are potentially a huge place to identify resources that are over-provisioned and underused, idle resources, things like that. I can’t tell you how many times I will go into a client account, and one of the first places I go—after Cost Explorer—is probably CloudWatch. So, monitoring spend and monitoring what’s happening there is kind of a great way to get started on that cloud cost idea because you’re getting charged for everything that happens, so knowing what’s happening, and knowing how it’s changing over time, is a great way to start understanding and reducing it.
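As a rough illustration of Pete’s point about CloudWatch doubling as a cost-reduction tool, here is a minimal sketch (not The Duckbill Group’s actual tooling) that flags running EC2 instances with low average CPU utilization. It assumes boto3 is installed and AWS credentials and a region are configured; the lookback window and threshold are arbitrary placeholders:

```python
from datetime import datetime, timedelta

import boto3

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")

LOOKBACK_DAYS = 14          # arbitrary lookback window
CPU_THRESHOLD_PERCENT = 10  # flag instances averaging below this

paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            instance_id = instance["InstanceId"]
            # Pull the daily average CPU utilization that EC2 reports by default.
            stats = cloudwatch.get_metric_statistics(
                Namespace="AWS/EC2",
                MetricName="CPUUtilization",
                Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
                StartTime=datetime.utcnow() - timedelta(days=LOOKBACK_DAYS),
                EndTime=datetime.utcnow(),
                Period=86400,
                Statistics=["Average"],
            )
            datapoints = stats["Datapoints"]
            if not datapoints:
                continue
            avg_cpu = sum(d["Average"] for d in datapoints) / len(datapoints)
            if avg_cpu < CPU_THRESHOLD_PERCENT:
                print(
                    f"{instance_id} ({instance['InstanceType']}): "
                    f"avg CPU {avg_cpu:.1f}% over {LOOKBACK_DAYS} days"
                )
```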
Jesse: Yeah. And I think AWS is probably also using some of those CloudWatch metrics in their optimization recommendations that they make within their own optimization tooling. And it’s probably just not clearly defined or clearly outlined for AWS customers to be able to use the same metrics. So, I feel like if Compute Optimizer could quickly load or link to a graph that showed me low CPU utilization across a number of instances, that would be a really handy way for me to start using more of CloudWatch’s metrics.
Pete: Yeah, I think Compute Optimizer is, honestly, criminally underused out there. I don’t know why. And honestly, one of the other complaints is like, “Well, you can’t get memory statistics unless you have the CloudWatch agent.” Yes. So honestly, install the CloudWatch agent; have it report up the one or two memory metrics that Compute Optimizer needs to make a recommendation, and the cost will more than pay for itself.
And now you can even output those statistics to S3 and do some fun programmatic stuff with it. Put those outputs in front of the engineers that own those resources and be like, “Hey, yo. This thing says, change your i3.24xl. Could you move it to something a little bit more useful, like a t3.small?”
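Along the same lines, a hedged sketch of pulling Compute Optimizer’s EC2 rightsizing recommendations programmatically so they can be put in front of the engineers who own the resources. It assumes Compute Optimizer is already opted in for the account, and it omits pagination for brevity:

```python
import boto3

co = boto3.client("compute-optimizer")

# Fetch one page of EC2 instance recommendations (use nextToken to page through more).
resp = co.get_ec2_instance_recommendations(maxResults=100)

for rec in resp["instanceRecommendations"]:
    current = rec["currentInstanceType"]
    finding = rec["finding"]  # e.g. over-provisioned, under-provisioned, optimized
    options = rec.get("recommendationOptions", [])
    suggested = options[0]["instanceType"] if options else "n/a"
    print(f"{rec['instanceArn']}: {finding}, {current} -> {suggested}")
```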
Jesse: And these are just some practical applications for some of the specific metrics we’re talking about, but this is a practice that you might want to turn into a process, that you might want to turn into an ongoing amount of work. And in a lot of cases, we’ve seen this start as one engineer who’s really interested in understanding AWS, really passionate about the bill, maybe isn’t in a leadership or management role so maybe they don’t have a direct business requirement to optimize their spend, but they’re really, really interested in this work and they grow into a role where they are taking on more and more of this work. And that’s not scalable; that engineer is going to get burned out very, very quickly if they have a day-to-day role that is focused on development and they’re doing all of this optimization work, this cloud cost management work, on the side. We generally recommend at least one dedicated individual who starts building these dashboards, starts looking at some of these metrics, starts these conversations with teams, and ultimately grows that into a full team.
Pete: Right. I think the biggest thing that we’re seeing in the industry is actual cloud finance teams coming into existence at companies. It’s such a critical role, and it’s sad to see when people are like, “Argh, spend is out of control. We’re doubling year over year on spend and no one really seems to know why.” And honestly, it’s because no one cares about it; no one has any ownership of it. And, you know, we see it a lot, right? It’s like, “Well, everyone owns the Amazon bill.” That’s code for, “No one owns the Amazon bill.”
Jesse: Yeah.
Pete: But these cloud finance teams, and even the term cloud economist, as silly as it is, it’s centered in reality, which is we create financial models to understand spend and we dive into those numbers to make the usage make sense to folks like CFOs inside of companies. Yeah, there’s a couple of ways that we have seen some of this done at scale. One is kind of active monitoring: actively monitoring the spend based on really granular budgets and reporting on it as such. So, maybe you’re breaking these budgets out to be product-specific, or team-specific, or by business unit, or things like that, and then basically reaching out to these engineering teams. Because you are actively monitoring the spend on a recurring basis, you can reach out to those teams when their spend goes over a given threshold that you’ve put in place, or when you find some optimization opportunities.
You’re probably thinking to yourself, “Wow, I don’t have the time for that.” Yeah, but you need to create the time or you need to create the team for this. The companies that we work with who have a dedicated team around this are the ones that do the best. In some cases, we’ve seen that having a dedicated cloud finance team causes the bill to actually decrease over time, which, you’re thinking to yourself, “Wow. An Amazon bill that goes down? We so rarely see that.”
Even for us, our clients come to us, and we help them find optimizations. They’ll make those optimizations, but then they replace that spend with other investments. Usually, it’s new projects and new spend. But actually seeing the bill go down because of a dedicated effort of a team is still, again, amazing to see. The other side is we’ve seen more of a passive monitoring approach, running in the background, where you have a cloud platform team that provides abstractions and guardrails to the user.
So, you’re not really trying to actively stand in the way of users and what they’re able to do and reaching out to them in an ongoing way, but you’re abstracting away the complexity of the cloud and letting them basically live in a safe space that you are controlling for them. And that’s another way that you can kind of build in some of this cloud financial knowledge, where teams can get that visibility into what they’re spending and know: is this too high? Is it going out of a boundary? Is there a number that I need to keep inside of? I think these are important things; that level of visibility around cost and a team’s actual charges gets people to start thinking, “Well, hold on a second. We’re above budget.” Even though maybe it’s not a real budget: “We’re above our spend by 20%. We need to bring that down.” And you give them the tools they need and the dashboards to effect that change on their own.
Jesse: This idea of passive monitoring is really all about making the right thing the easy thing to do. If you, as a member of the cloud platform team or as a member of leadership who cares about cloud spend, want to make sure that teams are managing their spend in some capacity, maybe not actively or directly, then at some level make sure that there are guardrails in place that keep them within the boundaries of what they’re ultimately able to do, or what you ultimately want them to work on. This makes it a lot easier for them to not spin up an i3 instance that they don’t ultimately need; it makes it a lot easier for them to not deploy resources that are missing tags. Put as many guardrails in place as you can that keep the teams independently able to work within the space where they are building, and developing, and functioning, but that ultimately give them the opportunity to continue being independent and really thrive within whatever work they’re doing.
Pete: Yeah, the next thing is something we recommend to everyone. Actually, we recommend it before an engagement with The Duckbill Group even starts: you’ll get an onboarding document of things to do, and the thing we always recommend is to turn on the Cost and Usage Report. If you’re listening to this and you’re like, “What’s the Cost and Usage Report?” Well, boy, are you going to have fun learning about it, because it is a highly granular usage report of everything that you’ve ever done within Amazon, and it’s extremely powerful. The downside is that it can be hard to navigate; it takes a little time to learn.
But go turn it on; the cost is minimal; it’s the cost of storing this data in S3. Preferably when you turn this on, turn it on in Parquet format because it’ll allow you to query it with tools like Athena, or Tableau, or Looker, or—God forbid—SageMaker. And this tool, this Cost and Usage Report, lets you dive in at an extremely granular level, down to per-hour, per-resource visibility. That resource-level detail is something you have to enable, but again, I highly recommend enabling it. Because now you can go and find out, well, for SageMaker I’m seeing a growth in spend.
Well, which resource is it within SageMaker? You can break that down really granularly. So, Cost and Usage Report is another place that, again, if you’re not using this today, if you don’t have at least a SageMaker dashboard, which costs basically nothing—a couple of dollars a month—pointed at your Cost and Usage Report, you’re missing out on some really great ways to understand the changes in spend over time.
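For the Cost and Usage Report side, here is a hedged example of asking Athena which resources drive spend for a single service over the last 30 days. The database and table names (cur.cost_and_usage) and the results bucket are placeholders, and the column names assume the Athena-compatible Parquet CUR with resource IDs enabled:

```python
import boto3

athena = boto3.client("athena")

# Break one service's spend down by resource ID for the last 30 days.
QUERY = """
SELECT line_item_resource_id,
       SUM(line_item_unblended_cost) AS cost
FROM cur.cost_and_usage
WHERE line_item_product_code = 'AmazonSageMaker'
  AND line_item_usage_start_date >= date_add('day', -30, current_date)
GROUP BY line_item_resource_id
ORDER BY cost DESC
LIMIT 25
"""

response = athena.start_query_execution(
    QueryString=QUERY,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/cur/"},
)
print("Athena query started:", response["QueryExecutionId"])
```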
Announcer: If your mean time to WTF for a security alert is more than a minute, it’s time to look at Lacework. Lacework will help you get your security act together for everything from compliance service configurations to container app relationships, all without the need for PhDs in AWS to write the rules. If you’re building a secure business on AWS with compliance requirements, you don’t really have time to choose between antivirus or firewall companies to help you secure your stack. That’s why Lacework is built from the ground up for the Cloud: low effort, high visibility, and detection. To learn more, visit lacework.com.
Jesse: Another couple of really great options are the AWS Cost Anomaly Detection service and AWS Budgets. Both are free, which is absolutely fantastic. I highly recommend checking them out. AWS Cost Anomaly Detection, once enabled, will actually look for anomalies in your spend across different AWS services, across different cost attribution tags, across different cost categories. There’s a lot of opportunities here for you to see anomalous spend and act on it.
This can be shared with teams as soon as the anomaly occurs, through Slack notification or an email, or maybe you get email notifications on a weekly basis, or a monthly basis, or some kind of recurring basis, for all of the anomalies that you saw within a given time period. We recorded an episode about Cost Anomaly Detection a while back; highly recommend checking that episode out. It’s got a lot of really great features and recommendations for getting started.
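Cost Anomaly Detection can be set up entirely in the console, but as a rough sketch, doing the same thing via the Cost Explorer API looks something like this; the monitor name, subscription name, email address, and dollar threshold are all placeholders:

```python
import boto3

ce = boto3.client("ce")

# Watch each AWS service independently for anomalous spend.
monitor = ce.create_anomaly_monitor(
    AnomalyMonitor={
        "MonitorName": "per-service-anomalies",
        "MonitorType": "DIMENSIONAL",
        "MonitorDimension": "SERVICE",
    }
)

# Send a daily summary of anomalies with more than $100 of estimated impact.
ce.create_anomaly_subscription(
    AnomalySubscription={
        "SubscriptionName": "cost-anomaly-alerts",
        "MonitorArnList": [monitor["MonitorArn"]],
        "Subscribers": [{"Type": "EMAIL", "Address": "cloud-finance@example.com"}],
        "Frequency": "DAILY",
        "Threshold": 100.0,
    }
)
```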
The other one I mentioned is AWS Budgets. Again, if you’re not really sure where to start, try creating some budgets for your teams. Maybe look at the last six months of spend for each team, maybe look at spend across different tags, or team units, or business units, whatever makes the most sense for the way that your organization is set up, and create some budgets for those groups. These budgets could be for specific AWS services if you are a single team running within a single AWS account, or they could be as complex as multiple business units across multiple accounts in different parts of the organization. There are lots of great opportunities here for you to start to better understand your spend and get better visibility into your cloud spend.
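And a similar sketch for AWS Budgets: a monthly cost budget with an email alert at 80% of the limit. The budget name, dollar amount, and address are placeholders, and the commented-out filter shows one way a budget could be scoped to a cost allocation tag:

```python
import boto3

account_id = boto3.client("sts").get_caller_identity()["Account"]
budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId=account_id,
    Budget={
        "BudgetName": "team-platform-monthly",
        "BudgetType": "COST",
        "TimeUnit": "MONTHLY",
        "BudgetLimit": {"Amount": "25000", "Unit": "USD"},
        # Optionally scope the budget, e.g. to a cost allocation tag:
        # "CostFilters": {"TagKeyValue": ["user:team$platform"]},
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "platform-team@example.com"}
            ],
        }
    ],
)
```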
Pete: Yeah, absolutely. I think all of those are great tools that can really help you. And, Jesse, I know we’ve talked about this before. Even just monitoring your tagging: not like, “Oh, are we tagging 50% of our resources?” Rather, you want to monitor your untagged resources by spend. So, if 95% of your spend is tagged, you’re crushing it. That’s amazing. But that may only be 50% of your things.
So, I guess, care less about how many of your resources are tagged—because some of them just can’t be tagged, or are tagged in a painful way—but focus more on the money aspect of it. And that will lead you into the ability to start creating some governance strategies. And that term governance, it just—
Jesse: Oof.
Pete: —makes me feel gross. Yeah. Oh god, terrible word. But the [laugh] sad state of the world is, that’s what most companies we talk to need; they just don’t have it. The companies that we talk to are like, “Our spend is going up, and we’re not sure why.” Or, “How do we get our engineers to care about cost savings?” And things like that. You know, having a governance strategy, a way to react to those changes in spend in a, hopefully, automated way, is critical to helping control that spend.
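As a sketch of Pete’s earlier point about measuring untagged spend in dollars rather than by resource count, here is one way to do it with the Cost Explorer API; the tag key "team" is a placeholder for whatever cost allocation tag your organization actually uses:

```python
from datetime import date, timedelta

import boto3

ce = boto3.client("ce")

end = date.today()
start = end - timedelta(days=30)

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],
)

total = 0.0
untagged = 0.0
for period in resp["ResultsByTime"]:
    for group in period["Groups"]:
        cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
        total += cost
        # Resources with no value for the tag come back with an empty value ("team$").
        if group["Keys"][0].endswith("$"):
            untagged += cost

if total:
    print(f"Untagged spend: ${untagged:,.2f} ({untagged / total:.0%} of ${total:,.2f})")
```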
Jesse: This really gets to the heart of why cloud cost management is important. It could be important for different reasons for different parts of the organization. Account structure, tagging, all of these different things can be important for different parts of the organization for different reasons. And that’s fine. The important thing is to socialize those reasons why to all the different parts of the company so that everybody understands what’s at stake.
Everybody understands how they can collaborate and create these best practices together. This really dives into the idea of behaviors and systems. I know it sounds a little bit outside the vein of engineering work, and finance, and cloud cost management, but what kind of behaviors do you ultimately want to see within your teams? What kind of actions do you want to see your engineers taking? Do you want them to start thinking about cost in all of their architecture discussions?
Do you want them to review the budgets that you’ve created for them every month? Every week? During stand-up meetings? What kind of things do you ultimately want to see them doing on a regular basis that maybe they aren’t doing right now, that maybe would ultimately help the company succeed with all of this cloud cost management work that you’re creating? And again, going back to the idea of making the right thing the easy thing to do, how can you improve the existing technical systems that you have within your organization to make the right thing the easy thing to do?
How can you change your CI/CD pipelines? How can you change the tools that you’re using for cost visibility, like Looker, or Tableau, or SageMaker, or something else, such that teams can quickly and easily self-service the information that they need to make their decisions to go about their days, go about their work more easily?
Pete: So, Jesse, you’re saying that it’s a mixture of software and culture? Kind of sounds like DevOps a little bit, doesn’t it? [laugh].
Jesse: Yeah. Yeah, it kind of does.
Pete: Yeah, it kind of does. So, you know, I think all of that is to say, it’s hard work, it’s not going to come easy, but how would we get started? Like, when we enter into an engagement with one of our clients, we’re coming in as total outsiders and we’re trying to navigate through a company with complex communication structures, and maybe teams that are entrenched in different ways. How do we get started? Well, we dive in; we start with big numbers, right?
What are your top ten places your money goes, just by service? I’ll answer it for you. It’s probably EC2, S3, RDS, and then dealer’s choice for the last ones, maybe data transfer, maybe Lambda, if you’re really weird. And if Lambda is in your top five, you should absolutely give us a call because that should not be the case. [laugh].
But start with those big numbers; understand where the money is going. But then go to the next level in. Okay, within EC2, where is your spend going? Or the dastardly EC2 ‘other cost’ category; okay, where’s the money going? Is it in regional data transfer, which is also what’s called cross-AZ data transfer? Is it in your NAT gateways? Why?
That’s the next question. Why is the spend high in that area? You may not be able to understand because it may not be tagged—we find that a lot—but start asking questions. And that’s what we do: We start reaching out to technical folks within the company. We say, “Hey, we see you’ve got a high amount of usage on EMR, but the clusters are all running 24/7. They’re not scaling up and down as the jobs are happening. Who knows more about EMR?” And we just start asking questions. And we’re asking them, “Well, are you doing anything on the cost optimization side? Have you tried to do anything cost optimization-wise to reduce it and haven’t been able to? How does this infrastructure scale? Does it scale linearly with the number of users? Does it scale in a different way? Who are the consumers?”
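To make that EC2 ‘other cost’ drill-down concrete, here is a hedged sketch that breaks that bucket down by usage type (NAT gateway hours, cross-AZ data transfer, and so on) with the Cost Explorer API. The service name string is how ‘EC2 - Other’ typically appears in Cost Explorer; verify it against your own account before relying on it:

```python
from datetime import date, timedelta

import boto3

ce = boto3.client("ce")
end = date.today()
start = end - timedelta(days=30)

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    Filter={"Dimensions": {"Key": "SERVICE", "Values": ["EC2 - Other"]}},
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
)

rows = []
for period in resp["ResultsByTime"]:
    for group in period["Groups"]:
        cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
        rows.append((cost, group["Keys"][0]))

# Show the top usage types by cost, e.g. NAT gateway hours or regional data transfer bytes.
for cost, usage_type in sorted(rows, reverse=True)[:15]:
    print(f"${cost:>12,.2f}  {usage_type}")
```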
And then you kind of even go another level down to see: do you find anything that just looks odd? On one account for a client we were working with, I saw that VPC costs were just extremely high, much higher than I’ve ever seen before. What was interesting is that the cost was not data-transfer related; it was the pure number of endpoints that they had created, and that cost far outweighed any of their data transfer costs. It was a piece of technical debt that they were aware of, but given the structure of their multi-account setup, they just couldn’t do anything about it. But again, you’re looking for things like that. And you know that you are doing a good job if, essentially, you can get to the end of this process—which could take months or even years depending on your scale—and answer this question: if the customers, or users, or consumers of the applications on your cloud service increased by 200%, 500%, 1,000%, what would happen to your cloud spend? How would it change? That’s the end game you’re trying to get to. That’s unit economics, the unit economic model and forecasting, and now you’re a superhero because now you can answer a question that not a lot of people are able to answer about their cloud usage.
Jesse: I also want to add that, as you’re asking questions, you’re going to find teams that specifically will tell you, “We created this infrastructure in this way because security told us to,” or, “Because our business requirements say that we have an SLA that means we need to keep data for this amount of time at this level of availability.” And that’s totally fine. That doesn’t mean that you need to necessarily change those requirements. But now you might have a dollar amount for those business decisions. Now, you might ultimately be able to say, okay, our product SLA may say that we need to keep data for 90 days, but keeping data for 90 days, that business decision is costing us hundreds of thousands of dollars every month because of the sheer volume of data that we now have to keep. Is that something that we ultimately are okay with? And are we okay spending that much money every month to keep this business decision, or do we need to revisit that business decision? And that’s only something that you and your teams can decide for yourselves.
Pete: Awesome. These are great questions. You can also send us a question at lastweekinaws.com/QA. We would love to spend some time diving into it and just helping you out and helping you get through your day. That’s what we’re here for.
If you enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review. Give it a five-star review on your podcast platform of choice, and tell us what is your favorite EC2 instance to turn off for your engineers.
Jesse: [laugh].
Pete: Thanks, everyone.
Announcer: This has been a HumblePod production. Stay humble.