The Complexities of AWS Cost Optimization with Rick Ochs

Episode Summary

Rick Ochs, Principal Product Manager for AWS, joins Corey on Screaming in the Cloud to discuss the elephant in the Twitter feed: AWS cost optimization. Rick explains why recommendation accuracy is paramount for his team’s ability to instill trust in their customers, and how important it was to him that AWS prioritize cost optimization before he joined the team. Rick reveals the four major themes of cost optimization, as well as how his team is structured to address cost optimization at scale and from multiple angles so they can meet the needs of all their customers. Rick also discusses the psychological and environmental impacts of optimizing cloud spend.

Episode Show Notes & Transcript

About Rick

Rick is the Product Leader of the AWS Optimization team. He previously led the cloud optimization product organization at Turbonomic, and before that was the Microsoft Azure Resource Optimization program owner.



Transcript


Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.



Corey: This episode is sponsored in part by our friends at Chronosphere. Tired of observability costs going up every year without getting additional value? Or being locked into a vendor due to proprietary data collection, querying, and visualization? Modern-day containerized environments require a new kind of observability technology that accounts for the massive increase in scale and attendant cost of data. With Chronosphere, choose where and how your data is routed and stored, query it easily, and get better context and control. 100% open source compatibility means that no matter what your setup is, they can help. Learn how Chronosphere provides complete and real-time insight into ECS, EKS, and your microservices, wherever they may be, at snark.cloud/chronosphere. That’s snark.cloud/chronosphere.



Corey: This episode is brought to you in part by our friends at Veeam. Do you care about backups? Of course you don’t. Nobody cares about backups. Stop lying to yourselves! You care about restores, usually right after you didn’t care enough about backups. If you’re tired of the vulnerabilities, costs, and slow recoveries when using snapshots to restore your data, assuming you even have them at all living in AWS-land, there is an alternative for you. Check out Veeam, that’s V-E-E-A-M, for secure, zero-fuss AWS backup that won’t leave you high and dry when it’s time to restore. Stop taking chances with your data. Talk to Veeam. My thanks to them for sponsoring this ridiculous podcast.



Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. For those of you who’ve been listening to this show for a while, the theme has probably emerged, and that is that one of the key values of this show is to give the guest a chance to tell their story. It doesn’t beat the guests up about how they approach things, it doesn’t call them out for being completely wrong on things because honestly, I’m pretty good at choosing guests, and I don’t bring people on that are, you know, walking trash fires. And that is certainly not a concern for this episode.



But this might devolve into a screaming loud argument, despite my best efforts. Today, I’m joined by Rick Ochs, Principal Product Manager at AWS. Rick, thank you for coming back on the show. The last time we spoke, you were not here; you were at, I believe, Turbonomic.



Rick: Yeah, that’s right. Thanks for having me on the show, Corey. I’m really excited to talk to you about optimization and my current role and what we’re doing.



Corey: Well, let’s start at the beginning. Principal product manager. It sounds like one of those corporate titles that can mean a different thing in every company or every team that you’re talking to. What is your area of responsibility? Where do you start and where do you stop?



Rick: Awesome. So, I am the product manager lead for all of the AWS Optimization team. So, I lead the product team. That includes several other product managers who focus in on Compute Optimizer, Cost Explorer, right-sizing recommendations, as well as Reservation and Savings Plan purchase recommendations.



Corey: In other words, you are the person who effectively oversees all of the AWS cost optimization tooling and approaches to same?



Rick: Yeah.



Corey: Give or take. I mean, you could argue that oh, every team winds up focusing on helping customers save money. I could fight that argument just as effectively. But you effectively start and stop with respect to helping customers save money or understand where the money is going on their AWS bill.



Rick: I think that’s a fair statement. And I also agree with your comment that I think a lot of service teams do think through those use cases and provide capabilities, you know? There’s, like, S3 Storage Lens. You know, there’s all sorts of other products that do offer optimization capabilities as well, but as far as, like, the unified purpose of my team, it is unilaterally focused on how do we help customers safely reduce their spend and not hurt their business at the same time.



Corey: Safely being the key word. For those who are unaware of my day job, I am a partial owner of The Duckbill Group, a consultancy where we fix exactly one problem: the horrifying AWS bill. This is all that I’ve been doing for the last six years, so I have some opinions on AWS bill reduction as well. So, this is going to be a fun episode for the two of us to wind up, mmm, more or less smacking each other around, but politely because we are both professionals. So, let’s start with a very high level. How does AWS think about AWS bills from a customer perspective? You talk about optimizing it, but what does that mean to you?



Rick: Yeah. So, I mean, there’s a lot of ways to think about it, especially depending on who I’m talking to, you know, where they sit in our organization. I would say I think about optimization in four major themes. The first is how do you scale correctly, whether that’s right-sizing or architecting things to scale in and out? The second thing I would say is, how do you do pricing and discounting, whether that’s Reservation management, Savings Plan Management, coverage, how do you handle the expenditures of prepayments and things like that?



Then I would say suspension. What that means is turn the lights off when you leave the room. We have a lot of customers that do this and I think there’s a lot of opportunity for more. Turning EC2 instances off when they’re not needed if they’re non-production workloads or other, sort of, stateful services that charge by the hour, I think there’s a lot of opportunity there.



And then the last of the four methods is cleanup. And I think it’s maybe one of the lowest-hanging fruit, but essentially, are you done using this thing? Delete it. And there’s a whole opportunity of cleaning up, you know, IP addresses, unattached EBS volumes, sort of, these resources that hang around in AWS accounts and sort of get lost and forgotten as well. So, those are the four kind of major thematic strategies for how to optimize a cloud environment that we think about and spend a lot of time working on.
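
As a rough illustration of the suspension and cleanup themes Rick describes, here is a minimal boto3 sketch, assuming a hypothetical “Environment” tag convention for non-production instances; the DryRun flag keeps it from actually stopping anything.

```python
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

# Suspension: find running instances tagged as non-production and stop them.
# The "Environment: dev" tag convention here is a hypothetical example.
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:Environment", "Values": ["dev"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]
instance_ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]

if instance_ids:
    try:
        # DryRun=True only validates the request; remove it to actually stop instances.
        ec2.stop_instances(InstanceIds=instance_ids, DryRun=True)
    except ClientError as err:
        print(err)  # a DryRunOperation error here means the real call would succeed

# Cleanup: list EBS volumes that are not attached to anything.
for v in ec2.describe_volumes(
    Filters=[{"Name": "status", "Values": ["available"]}]
)["Volumes"]:
    print(f'{v["VolumeId"]}: {v["Size"]} GiB unattached, created {v["CreateTime"]}')
```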



Corey: I feel like there’s—or at least the way that I approach these things—that there are a number of different levels you can look at AWS billing constructs on. The way that I tend to structure most of my engagements when I’m working with clients is we come in and, step one: cool. Why do you care about the AWS bill? It’s a weird question to ask because most of the engineering folks look at me like I’ve just grown a second head. Like, “So, why do you care about your AWS bill?” Like, “What? Why do you? You run a company doing this?”



It’s no, no, no, it’s not that I’m being rhetorical and I don’t—I’m trying to be clever somehow and pretend that I don’t understand all the nuances around this, but why does your business care about lowering the AWS bill? Because very often, the answer is they kind of don’t. What they care about from a business perspective is being able to accurately attribute costs for the service or good that they provide, being able to predict what that spend is going to be, and also, yes, a sense of being good stewards of the money that has been entrusted to them via investors, public markets, or the budget allocation process of their companies, and make sure that they’re not doing foolish things with it. And that makes an awful lot of sense. It is rare at the corporate level that the stated number one concern is make the bills lower.



Because at that point, well, easy enough. Let’s just turn off everything you’re running in production. You’ll save a lot of money in your AWS bill. You won’t be in business anymore, but you’ll be saving a lot of money on the AWS bill. The answer is always deceptively nuanced and complicated.



At least, that’s how I see it. Let’s also be clear that I talk with a relatively narrow subset of the AWS customer totality. The things that I do are very much intentionally things that do not scale. Definitionally, everything that you do has to scale. How do you wind up approaching this in ways that will work for customers spending billions versus independent learners who are paying for this out of their own personal pocket?



Rick: It’s not easy [laugh], let me just preface that. The team we have is incredible and we spent so much time thinking about scale and the different personas that engage with our products and how they’re—what their experience is when they interact with a bill or the AWS platform at large. There’s also a couple of different personas here, right? We have a persona that focuses in on that cloud cost, the cloud bill, the finance, whether that’s—if an organization has created a FinOps organization, if they have a Cloud Center of Excellence, versus an engineering team that maybe has started to go towards decentralized IT and has some accountability for the spend that they attribute to their AWS bill. And so, these different personas interact with us in really different ways, whether that’s Cost Explorer, downloading the CUR, and taking a look at the bill.



And one thing that I always kind of imagine is somebody putting a headlamp on and going into the caves in the depths of their AWS bill and kind of like spelunking through their bill sometimes, right? And so, you have these FinOps folks and billing people that are deeply interested in making sure that the spend they do have meets their business goals, meaning this is providing high value to our company, it’s providing high value to our customers, and we’re spending on the right things, we’re spending the right amount on the right things. Versus the engineering organization that’s like, “Hey, how do we configure these resources? What types of instances should we be focused on using? What services should we be building on top of that maybe are more flexible for our business needs?”



And so, there’s really, like, two major personas that I spend a lot of time—our organization spends a lot of time wrapping our heads around. Because they’re really different, very different approaches to how we think about cost. Because you’re right, if you just wanted to lower your AWS bill, it’s really easy. Just size everything to a t2.nano and you’re done and move on [laugh], right? But you’re [crosstalk 00:08:53]—



Corey: Aw, t3 or t4.nano, depending upon whether regional availability is going to save you less. I’m still better at this. Let’s not kid ourselves. I kid. Mostly.



Rick: For sure. So t4.nano, absolutely.



Corey: T4g. Remember, now the way forward is everything has an explicit letter designator to define which processor company made the CPU that underpins the instance itself because that’s a level of abstraction we certainly wouldn’t want the cloud provider to take away from us any.



Rick: Absolutely. And actually, the performance differences of those different processor models can be pretty incredible [laugh]. So, there’s huge decisions behind all of that as well.



Corey: Oh, yeah. There’s so many factors that factor in all these things. It’s gotten to a point of you see this usually with lawyers and very senior engineers, but the answer to almost everything is, “It depends.” There are always going to be edge cases. Easy example of, if you check a box and enable an S3 Gateway endpoint inside of a private subnet, suddenly, you’re not passing traffic through a 4.5 cent per gigabyte managed NAT Gateway; it’s being sent over that endpoint for no additional cost whatsoever.
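
The box-check Corey mentions can also be done programmatically. A minimal sketch, assuming placeholder VPC and route table IDs, might look roughly like this:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a gateway endpoint for S3 so traffic from the private subnets'
# route tables goes over the endpoint instead of a managed NAT Gateway.
# The VPC and route table IDs below are placeholders for illustration.
response = ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
print(response["VpcEndpoint"]["VpcEndpointId"])
```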



Check the box, save a bunch of money. But there are scenarios where you don’t want to do it, so always double-checking and talking to customers about this is critically important. Just because the first time you make a recommendation that does not work for their constraints, you lose trust. And make a few of those and it looks like you’re more or less just making naive recommendations that don’t add any value, and they learn to ignore you. So, down the road, when you make a really high-value, great recommendation for them, they stop paying attention.



Rick: Absolutely. And we have that really high bar for recommendation accuracy, especially with right sizing, that’s such a key one. Although I guess Savings Plan purchase recommendations can be critical as well. If a customer over commits on the amount of Savings Plan purchase they need to make, right, that’s a really big problem for them.



So, recommendation accuracy must be above reproach. Essentially, if a customer takes a recommendation and it breaks an application, they’re probably never going to take another right-sizing recommendation again [laugh]. And so, this bar of trust must be exceptionally high. That’s also why, out of the box, the Compute Optimizer recommendations can be a little bit mild, a little tame, because the first order of business is do no harm; focus on the performance requirements of the application first because we have to make sure that the reason you build these workloads in AWS is served.



Now ideally, we do that without overspending and without overprovisioning the capacity of these workloads, right? And so, for example, like, if we make these right-sizing recommendations from Compute Optimizer, we’re taking a look at the utilization of CPU, memory, disk, network throughput, and IOPS, and we’re vending these recommendations to customers. And when you take that recommendation, you must still have great application performance for your business to be served, right? It’s such a crucial part of how we optimize and run long-term. Because optimization is not a one-time Band-Aid; it’s an ongoing behavior, so it’s really critical for that accuracy to be exceptionally high so we can build business process on top of it as well.
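
A minimal sketch of pulling those right-sizing recommendations from the Compute Optimizer API, assuming the account has already opted in; the response field names reflect the boto3 shape as best understood, and the first recommendation option is simply treated as the top-ranked one for brevity:

```python
import boto3

co = boto3.client("compute-optimizer", region_name="us-east-1")

# Pull EC2 right-sizing findings; with no instanceArns argument, the call
# covers the instances Compute Optimizer has analyzed in this account/region.
recs = co.get_ec2_instance_recommendations()

for rec in recs["instanceRecommendations"]:
    # Treat the first recommendation option as the top-ranked one for this sketch.
    top = rec["recommendationOptions"][0]
    print(
        rec["instanceArn"],
        rec["finding"],              # e.g. OVER_PROVISIONED, UNDER_PROVISIONED, OPTIMIZED
        rec["currentInstanceType"],
        "->",
        top["instanceType"],
    )
```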



Corey: Let me ask you this. How do you contextualize what the right approach to optimization is? What is your entire—there are certain tools that you have… by ‘you,’ I mean, of course, as an organization—have repeatedly gone back to and different approaches that don’t seem to deviate all that much from year to year, and customer to customer. How do you think about the general things that apply universally?



Rick: So, we know that EC2 is a very popular service for us. We know that sizing EC2 is difficult. We think about that optimization pillar of scaling. It’s an obvious area for us to help customers. We run into this sort of industry-wide experience where whenever somebody picks the size of a resource, they’re going to pick one generally larger than they need.



It’s almost like asking a new employee at your company, “Hey, pick your laptop. We have a 16 gig model or a 32 gig model. Which one do you want?” That person [laugh] making the decision on capacity, hardware capacity, they’re always going to pick the 32 gig model laptop, right? And so, we have this sort of human nature in IT of, we don’t want to get called at two in the morning for performance issues, we don’t want our apps to fall over, we want them to run really well, so we’re going to size things very conservatively and we’re going to oversize things.



So, we can help customers by providing those recommendations to say, you can size things up in a different way using math and analytics based on the utilization patterns, and we can provide and pick different instance types. There’s hundreds and hundreds of instance types in all of these regions across the globe. How do you know which is the right one for every single resource you have? It’s a very, very hard problem to solve and it’s not something that is lucrative to solve one by one if you have 100 EC2 instances. Trying to pick the correct size for each and every one can take hours and hours of IT engineering resources to look at utilization graphs, look at all of these types available, look at what is the performance difference between processor models and providers of those processors, are there application compatibility constraints that I have to consider? The complexity is astronomical.
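
To make the scale of that manual effort concrete, here is a minimal sketch of pulling just one of those inputs, two weeks of CPU utilization for a single instance, from CloudWatch; the instance ID is a placeholder, and memory metrics would additionally require the CloudWatch agent:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

end = datetime.now(timezone.utc)
start = end - timedelta(days=14)

# Hourly average and peak CPU for one instance over a two-week lookback.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=start,
    EndTime=end,
    Period=3600,
    Statistics=["Average", "Maximum"],
)

datapoints = sorted(stats["Datapoints"], key=lambda d: d["Timestamp"])
peak = max((d["Maximum"] for d in datapoints), default=0.0)
print(f"{len(datapoints)} hourly datapoints, peak CPU {peak:.1f}%")
```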



And then not only that, as soon as you make that sizing decision, one week later, it’s out of date and you need a different size. So, [laugh] you didn’t really solve the problem. So, we have to programmatically use data science and math to say, “Based on these utilization values, these are the sizes that would make sense for your business, that would have the lowest cost and the highest performance together at the same time.” And it’s super important that we provide this capability from a technology standpoint because it would cost so much money to try to solve that problem that the savings you would achieve might not be meaningful. Then at the same time… you know, that’s really from an engineering perspective, but when we talk to the FinOps and the finance folks, the conversations are more about Reservations and Savings Plans.



How do we correctly apply Savings Plans and Reservations across a high percentage of our portfolio to reduce the costs on those workloads, but not so much that dynamic capacity levels in our organization mean we all of a sudden have a bunch of unused Reservations or Savings Plans? And so, a lot of organizations that engage with us and we have conversations with, we start with the Reservation and Savings Plan conversation because it’s much easier to click a few buttons and buy a Savings Plan than to go institute an entire right-sizing campaign across multiple engineering teams. That can be very difficult, a much higher bar. So, some companies are ready to dive into the engineering task of sizing; some are not there yet. And they’re maybe a little earlier in their FinOps journey, or in building optimization technology stacks, or in achieving higher value out of their cloud environments, so they start with kind of the low-hanging fruit. It can vary depending on the company, size of company, technical aptitudes, skill sets, all sorts of things like that.



And so, those finance-focused teams are definitely spending more time looking at and studying what are the best practices for purchasing Savings Plans, covering my environment, getting the most out of my dollar that way. Then they don’t have to engage the engineering teams; they can kind of take a nice chunk off the top of their bill and sort of have something to show for that amount of effort. So, there’s a lot of different approaches to start in optimization.
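
For that finance-focused path, here is a minimal sketch of asking the Cost Explorer API for a Compute Savings Plans purchase recommendation; the term, payment option, and lookback window are illustrative choices rather than guidance from the conversation, and the summary fields are read defensively in case the response shape differs:

```python
import boto3

ce = boto3.client("ce", region_name="us-east-1")  # Cost Explorer is served from us-east-1

# Ask for a Compute Savings Plans purchase recommendation based on the
# last 30 days of usage; the term and payment choices here are illustrative.
rec = ce.get_savings_plans_purchase_recommendation(
    SavingsPlansType="COMPUTE_SP",
    TermInYears="ONE_YEAR",
    PaymentOption="NO_UPFRONT",
    LookbackPeriodInDays="THIRTY_DAYS",
)

summary = rec["SavingsPlansPurchaseRecommendation"].get(
    "SavingsPlansPurchaseRecommendationSummary", {}
)
print("Recommended hourly commitment:", summary.get("HourlyCommitmentToPurchase"))
print("Estimated monthly savings:", summary.get("EstimatedMonthlySavingsAmount"))
```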



Corey: My philosophy runs somewhat counter to this because everything you’re saying does work globally, it’s safe, it’s non-threatening, and then also really, on some level, feels like it is an approach that can be driven forward by finance or business. Whereas my worldview is that cost and architecture in cloud are one and the same. And there are architectural consequences of cost decisions and vice versa that can be adjusted and addressed. Like, one of my favorite party tricks—although I admit, it’s a weird party—is I can look at the exploded PDF view of a customer’s AWS bill and describe their architecture to them. And people have questioned that a few times, and now I have a testimonial on my client website that mentions, “It was weird how he was able to do this.”



Yeah, it’s real, I can do it. And it’s not a skill I would recommend cultivating for most people. But it does also mean that I think I’m onto something here, where there’s always context that needs to be applied. It feels like there’s an entire ecosystem of product companies out there trying to build what amounts to a better Cost Explorer that also is not free the way that Cost Explorer is. So, the challenge I see there is they all tend to look more or less the same; there is very little differentiation in that space. And in the fullness of time, Cost Explorer does—ideally—get better. How do you think about it?



Rick: Absolutely. If you’re looking at ways to understand your bill, there’s obviously Cost Explorer, and there’s the CUR; a very common approach is to take the CUR and put a BI front-end on top of it. That’s a common experience. A lot of companies that have chops in that space will do that themselves instead of purchasing a third-party product that does do bill breakdown and dissemination. There’s also the cross-charge, show-back, organizational breakdown and boundaries piece, because you have these super large organizations that have fiefdoms.



You know, if you have HR IT and sales IT, and [laugh] you know, product IT, you have all these different IT departments that are fiefdoms within your AWS bill and construct, whether they have different AWS accounts or, say, different AWS organizations sometimes, right, it can get extremely complicated. And some organizations require the ability to break down their bill based on those organizational boundaries. Maybe tagging works, maybe it doesn’t. Maybe they do that by using a third-party product that lets them set custom scopes on their resources based on organizational boundaries. That’s a common approach as well.
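
One small slice of that bill breakdown can be pulled straight from the Cost Explorer API. A minimal sketch with placeholder dates, grouping a month of spend by service; a hypothetical cost allocation tag is noted in the comment:

```python
import boto3

ce = boto3.client("ce", region_name="us-east-1")

# One month of unblended cost, grouped by service. Swap the GroupBy for
# {"Type": "TAG", "Key": "team"} (a hypothetical tag) to split spend along
# organizational boundaries instead.
resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2022-09-01", "End": "2022-10-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

for group in resp["ResultsByTime"][0]["Groups"]:
    service = group["Keys"][0]
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    if amount > 1:  # skip the long tail of sub-dollar services
        print(f"{service}: ${amount:,.2f}")
```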



We do also have our first-party solutions that can do that, like the CUDOS dashboard as well. That’s something that’s really popular and highly used across our customer base. It allows you to have kind of a dashboard and customizable view of your AWS costs and, kind of, split it up based on tags, organizational boundaries, account name, and things like that as well. So, you mentioned that you feel like the architectural and cost problem is the same problem. I really don’t disagree with that at all.



I think what it comes down to is some organizations are prepared to tackle the architectural elements of cost and some are not. And it really comes down to how does the customer view their bill? Is it somebody in the finance organization looking at the bill? Is it somebody in the engineering organization looking at the bill? Ideally, it would be both.



Ideally, you would have some of those skill sets that overlap, or you would have an organization that does focus in on FinOps or cloud operations as it relates to cost. But then at the same time, there are organizations that are like, “Hey, we need to go to cloud. Our CIO told us go to cloud. We don’t want to pay the lease renewal on this building.” There’s a lot of reasons why customers move to cloud, a lot of great reasons, right? Three major reasons you move to cloud: agility, [crosstalk 00:20:11]—



Corey: And several terrible ones.



Rick: Yeah, [laugh] and some not-so-great ones, too. So, there’s so many different dynamics that get exposed when customers engage with us that they might or might not be ready to engage on the architectural element of how to build hyperscale systems. So, many of these customers are bringing legacy workloads and applications to the cloud, and something like a re-architecture to use stateless resources or something like Spot, that’s just not possible for them. So, how can they take 20% off the top of their bill? Savings Plans or Reservations are kind of that easy, low-hanging fruit answer to just say, “We know these are fairly static environments that don’t change a whole lot, that are going to exist for some amount of time.”



They’re legacy, you know, we can’t turn them off. It doesn’t make sense to rewrite these applications because they just don’t change, they don’t have high business value, or something like that. And so, the architecture part of that conversation doesn’t always come into play. Should it? Yes.



The long-term maturity and approach for cloud optimization does absolutely account for architecture, thinking strategically about how you do scaling, what services you’re using, are you going down the Kubernetes path, which I know you’re going to laugh about, but you know, how do you take these applications and componentize them? What services are you using to do that? How do you get that long-term scale and manageability out of those environments? Like you said at the beginning, the complexity is staggering and there’s no one unified answer. That’s why there’s so many different entrance paths into, “How do I optimize my AWS bill?”



There’s no one answer, and every customer I talk to has a different comfort level and appetite. And some of them have tried suspension, some of them have gone heavy down Savings Plans, some of them want to dabble in right-sizing. So, every customer is different and we want to provide those capabilities for all of those different customers that have different appetites or comfort levels with each of these approaches.



Corey: This episode is sponsored in part by our friends at Redis, the company behind the incredibly popular open source database. If you’re tired of managing open source Redis on your own, or if you are looking to go beyond just caching and unlocking your data’s full potential, these folks have you covered. Redis Enterprise is the go-to managed Redis service that allows you to reimagine how your geo-distributed applications process, deliver, and store data. To learn more from the experts in Redis how to be real-time, right now, from anywhere, visit redis.com/duckbill. That’s R - E - D - I - S dot com slash duckbill.



Corey: And I think that’s very fair. I think that it is not necessarily a bad thing that you wind up presenting a lot of these options to customers. But there are some rough edges. An example of this is something I encountered myself somewhat recently and put on Twitter—because I have those kinds of problems—where originally, I remember this, that you were able to buy hourly Savings Plans, which again, Savings Plans are great; no knock there. I would wish that they applied to more services rather than, “Oh, SageMaker is going to do its own Savings Pla”—no, stop keeping me from going from something where I have to manage myself on EC2 to something you manage for me and making that cost money. You nailed it with Fargate. You nailed it with Lambda. Please just have one unified Savings Plan thing. But I digress.



But you had a limit, once upon a time, of $1,000 per hour. Now, it’s $5,000 per hour, which I believe in a three-year all-upfront means you will cheerfully add a $130 million purchase to your shopping cart. And I kept adding a bunch of them and then had a little over a billion dollars a single button click away from being charged to my account. Let me begin with: what’s up with that?
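
For context, the rough math behind that figure, treating the term as a flat three 365-day years:

```python
# $5,000/hour commitment over a three-year, all-upfront term.
hourly_commitment = 5_000
hours_in_three_years = 24 * 365 * 3   # 26,280 hours, ignoring leap days
total = hourly_commitment * hours_in_three_years
print(f"${total:,}")                  # $131,400,000 -- roughly the $130 million above
```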



Rick: [laugh]. Thank you for the tweet, by the way, Corey.



Corey: Always thrilled to ruin your month, Rick. You know that.



Rick: Yeah. Fantastic. We took that tweet—you know, it was tongue in cheek, but also it was a serious opportunity for us to ask a question of what does happen? And it’s something we did ask internally and have some fun conversations about. I can tell you that if you clicked purchase, it would have been declined [laugh]. So, you would have not been—



Corey: Yeah, American Express would have had a problem with that. But the question is, would you have attempted to charge American Express, or would something internally have gone, “This has a few too many commas for us to wind up presenting it to the card issuer with a straight face?”



Rick: [laugh]. Right. So, it wouldn’t have gone through and I can tell you that, you know, if your account was on a PO-based configuration, you know, it would have gone to the account team. And it would have gone through our standard process for having a conversation with our customer there. That being said, we are—it’s an awesome opportunity for us to examine what is that shopping cart experience.



We did increase the limit, you’re right. And we increased the limit for a lot of reasons that we sat down and worked through, but at the same time, there’s always an opportunity for improvement of our product and experience. We want to make sure that it’s really easy and lightweight to use our products, especially purchasing Savings Plans. Savings Plans are already kind of fraught with mental concern and risk of purchasing something so expensive and large that has a big impact on your AWS bill, so we don’t really want to add any more friction to the process necessarily, but we do want to build an awareness and make sure customers understand, “Hey, you’re purchasing this. This has a pretty big impact.” And so, we’re also looking at other ways we can kind of improve the Savings Plan shopping cart experience to ensure customers don’t put themselves in a position where you have to unwind or make phone calls and say, “Oops.” Right? We [laugh] want to avoid those sorts of situations for our customers. So, we are looking at quite a few additional improvements to that experience as well that I’m really excited about that I really can’t share here, but stay tuned.



Corey: I am looking forward to it. I will say the counterpoint to that is having worked with customers who do make large eight-figure purchases at once, there’s a psychology element that plays into it. Everyone is very scared to click the button on the ‘Buy It Now’ thing or the ‘Approve It.’ So, what I’ve often found is at that scale, one, you can reduce what you’re buying by half of it, and then see how that treats you, and then continue to iterate forward rather than doing it all at once, or reach out to your account team and have them orchestrate the buy. In previous engagements, I had a customer do this religiously, and at one point, the concierge team bought the wrong thing in the wrong region, and from my perspective, I would much rather have AWS apologize for that and fix it on their end, than have us, on the customer side, go, “Oh crap, oh, crap. Please be nice to us.”



Not that I doubt you would do it, but that’s not the nervous conversation I want to have in quite the same way. It just seems odd to me that someone would want to make that scale of purchase without ever talking to a human. I mean, I get it. I’m as antisocial as they come some days, but for that kind of money, I kind of just want another human being to validate that I’m not making a giant mistake.



Rick: We love that. That’s such a tremendous opportunity for us to engage and discuss with an organization that’s going to make a large commitment, that here’s the impact, here’s how we can help. How does it align to our strategy? We also do recommend, from a strategic perspective, those more incremental purchases. I think it creates a better experience long-term when you don’t have a single Savings Plan that’s going to expire on a specific day that all of a sudden increases your entire bill by a significant percentage.



So, making staggered monthly purchases makes a lot of sense. And it also works better for incremental growth, right? If your organization is growing 5% month-over-month or year-over-year or something like that, you can purchase those incremental Savings Plans that sort of stack up on top of each other and then you don’t have that risk of a cliff one day where one super-large SP expires and boom, you have to scramble and repurchase within minutes because every minute that goes by is an additional expense, right? That’s not a great experience. And so that’s, really, a large part of why those staggered purchase experiences make a lot of sense.



That being said, a lot of companies do their math and their finance in different ways. And single large purchases make sense to go through their process and their rigor as well. So, we try to support both types of purchasing patterns.



Corey: I think that is an underappreciated aspect of cloud cost savings and cloud cost optimization, where it is much more about humans than it is about math. I see this most notably when I’m helping customers negotiate their AWS contracts with AWS, where there are often perspectives such as, “Well, we feel like we really got screwed over last time, so we want to stick it to them and make them give us a bigger percentage discount on something.” And it’s like, look, you can do that, but I would much rather, if it were me, go for something that moves the needle on your actual business and empowers you to move faster, more effectively, and lead to an outcome that is a positive for everyone, versus the, well, we’re just going to be difficult on this one point because they were difficult on something last time. But ego is a thing. Human psychology is never going to have an API for it. And again, customers get to decide their own destiny in some cases.



Rick: I completely agree. I’ve actually experienced that. So, this is the third company I’ve been working at on Cloud optimization. I spent several years at Microsoft running an optimization program. I went to Turbonomic for several years, building out the right-sizing and savings plan reservation purchase capabilities there, and now here at AWS.



And through all of these journeys and experiences working with companies to help optimize their cloud spend, I can tell you that the psychological needle—moving the needle is significantly harder than the technology stack of sizing something correctly or deleting something that’s unused. We can solve the technology part. We can build great products that identify opportunities to save money. There’s still this psychological component: IT, for the last several decades, has gone through this maturity curve of, if it’s not broken, don’t touch it. Five-nines, Six Sigma, all of these methods of IT sort of rationalizing do no harm, don’t touch anything, everything must be up.



And it even kind of goes back several decades. Back when, if you rebooted a physical server, the motherboard capacitors would pop, right? So, there’s even this anti—or this stigma against even rebooting servers sometimes. The cloud really does away with a lot of that stuff because we have live migration and we have all of these, sort of, stateless designs and capabilities, but we still carry along with us this mentality of don’t touch it; it might fall over. And we have to really get past that.



And that means that the trust piece, we go back to the trust conversation, where we talk about how the recommendations must be incredibly accurate. You’re risking your job, in some cases; if you are a DevOps engineer, and your commitments on your yearly goals are uptime, latency, response time, load time, these sorts of things, these operational metrics, KPIs that you use, you don’t want to take a downsize recommendation. It has a severe risk of harming your job and your bonus.



Corey: “These instances are idle. Turn them off.” It’s like, yeah, these instances are the backup site, or the DR environment, or—



Rick: Exactly.



Corey: —something that takes very bursty but occasional traffic. And yeah, I know it costs us some money, but here’s the revenue figures for having that thing available. Like, “Oh, yeah. Maybe we should shut up and not make dumb recommendations around things,” is the human response, but computers don’t have that context.



Rick: Absolutely. And so, the accuracy and trust component has to be the highest bar we meet for any optimization activity or behavior. We have to circumvent or supersede the human aversion, the risk aversion, that IT is built on, right?



Corey: Oh, absolutely. And let’s be clear, we see this all the time where I’m talking to customers and they have been burned before because we tried to save money and then we took a production outage as a side effect of a change that we made, and now we’re not allowed to try to save money anymore. And there’s a hidden truth in there, which is auto-scaling is something that a lot of customers talk about, but very few have instrumented true auto-scaling because they interpret it as, we can scale up to meet demand. Because yeah, if you don’t do that, you’re dropping customers on the floor.



Well, what about scaling back down again? And the answer there is like, yeah, that’s not really a priority because it’s just money. We’re not disappointing customers, causing brand reputation damage, and we’re still able to take people’s money when that happens. It’s only money; we can fix it later. Covid shined a real light on a lot of this stuff just because there are customers that we’ve spoken to whose user traffic dropped off a cliff while infrastructure spend remained constant day over day.



And yeah, they believe, genuinely, they were auto-scaling. The most interesting lies are the ones that customers tell themselves, but the bill speaks. So, getting a lot of modernization traction from things like that was really neat to watch. But customers, I don’t think, necessarily intuitively understand most aspects of their bill because it is a multidisciplinary problem. It’s engineering, it’s finance, it’s accounting—which is not the same thing as finance—and you need all three of those constituencies to be able to communicate effectively using a shared and common language. It feels like we’re marriage counseling between engineering and finance, most weeks.



Rick: Absolutely, we are. And it’s important we get it right, that the data is accurate, that the recommendations we provide are trustworthy. If the finance team gets their hands on the savings potential they see out of right-sizing, takes it to engineering, and then engineering comes back and says, “No, no, no, we can’t actually do that. We can’t actually size those,” right, we have problems. And they’re cultural, they’re transformational. Organizations’ appetite for these things varies greatly and so it’s important that we address that problem from all of those angles. And it’s not easy to do.
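
Circling back to the auto-scaling point from a moment ago: one hedged sketch of the scale-down half is a target tracking policy on an Auto Scaling group with scale-in left enabled, so capacity comes back down when traffic does. The group name and target value below are placeholders:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Target tracking keeps average CPU near the target and, because scale-in
# is not disabled, removes instances again when traffic falls off.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="example-web-asg",   # placeholder group name
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
        "DisableScaleIn": False,  # the half that actually saves money
    },
)
```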



Corey: How big do you find the optimization problem is when you talk to customers? How focused are they on it? I have my answers, but that’s the scale of anec-data. I want to hear your actual answer.



Rick: Yeah. So, we talk with a lot of customers that are very interested in optimization. And we’re very interested in helping them on the journey towards having an optimal estate. There are so many nuances and barriers, most of them psychological like we already talked about.



I think there’s this opportunity for us to go do better at exposing the potential of what an optimal AWS estate would look like from a dollar and savings perspective. And so, I think it’s kind of not well understood. I think one of the biggest areas or barriers to companies really attacking the optimization problem with more vigor is this: if they knew that the potential savings they could achieve out of their AWS environment would really align their spend much more closely with the business value they get, I think everybody would go bonkers. And so, I’m really excited about us making progress on exposing that capability, or the total savings potential and amount. It’s something we’re looking into doing in a much more obvious way.



And we’re really excited about customers doing that on AWS where they know they can trust AWS to get the best value for their cloud spend, that it’s a long-term good bet because their resources that they’re using on AWS are all focused on giving business value. And that’s the whole key. How can we align the dollars to the business value, right? And I think optimization is that connection between those two concepts.



Corey: Companies are generally not going to greenlight a project whose sole job is to save money unless there’s something very urgent going on. What will happen is, as they iterate forward on the next generation of services or a migration of a service from one thing to another, they will make design decisions that benefit those optimizations. There’s low-hanging fruit we can find, usually of the form, “Turn that thing off,” or, “Configure this thing slightly differently,” that doesn’t take a lot of engineering effort to put in place. But, on some level, it is not worth the engineering effort it takes to do an optimization project. We’ve all met those engineers—speaking as one of them myself—who, left to our own devices, will spend two months just knocking a few hundred bucks a month off of our AWS developer environment.



We steal more than office supplies. I’m not entirely sure what the business value of doing that is, in most cases. For me, yes, okay, things that work in small environments work very well in large environments, generally speaking, so I learned how to save 80 cents here and that’s a few million bucks a month somewhere else. Most folks don’t have that benefit happening, so it’s a question of meeting them where they are.



Rick: Absolutely. And I think the scale component is huge, which you just touched on. When you’re talking about a hundred EC2 instances versus a thousand, optimization becomes kind of a different component of how you manage that AWS environment. And while, for single-decision recommendations to scale an individual server, the dollar amount might be different, the percentages are just about the same when you look at what it is to be sized correctly, what it is to be configured correctly. And so, it really does come down to priority.



And so, it’s really important to really support all of those companies of all different sizes and industries because they will have different experiences on AWS. And some will have more sensitivity to cost than others, but all of them want to get great business value out of their AWS spend. And so, as long as we’re meeting that need and we’re supporting our customers to make sure they understand the commitment we have to ensuring that their AWS spend is valuable, it is meaningful, right, they’re not spending money on things that are not adding value, that’s really important to us.



Corey: I do want to have as the last topic of discussion here, how AWS views optimization, where there have been a number of repeated statements that helping customers optimize their cloud spend is extremely important to us. And I’m trying to figure out where that falls on the spectrum from, “It’s the thing we say because they make us say it, but no, we’re here to milk them like cows,” all the way on over to, “No, no, we passionately believe in this at every level, top to bottom, in every company. We are just bad at it.” So, I’m trying to understand how that winds up being expressed from your lived experience, having solved this problem first outside, and then inside.



Rick: Yeah. So, it’s kind of like part of my personal story. It’s the main reason I joined AWS. And, you know, when you go through the interview loops and you talk to the leaders of an organization you’re thinking about joining, they always stop at the end of the interview and ask, “Do you have any questions for us?” And I asked that question to pretty much every single person I interviewed with. Like, “What is AWS’s appetite for helping customers save money?”



Because, like, from a business perspective, it kind of is a little bit wonky, right? But the answers were varied, and all of them were customer-obsessed and passionate. And I got this sense that my personal passion for helping companies have better efficiency of their IT resources was an absolute primary goal of AWS and a big element of Amazon’s leadership principle, be customer obsessed. 



Now, I’m not a spokesperson, so [laugh] we’ll see, but we are deeply interested in making sure our customers have a great long-term experience and a high-trust relationship. And so, when I asked these questions in these interviews, the answers were all about, “We have to do the right thing for the customer. It’s imperative. It’s also in our DNA. It’s one of the most important leadership principles we have: to be customer-obsessed.”



And it is the primary reason why I joined: because of that answer to that question. Because it’s so important that we achieve a better efficiency for our IT resources, not just for, like, AWS, but for our planet. If we can reduce consumption patterns and usage across the planet for how we use data centers and all the power that goes into them, we can talk about meaningful reductions of greenhouse gas emissions and the cost and energy needed to run IT business applications. And not only that, but since most all new technology that’s developed in the world seems to come out of a data center these days, we have a real opportunity to make a material impact on how much resource we use to build and use these things. And I think we owe it to the planet, to humanity, and I think Amazon takes that really seriously. And I’m really excited to be here because of that.



Corey: As I recall—and feel free to make sure that this comment never sees the light of day—you asked me before interviewing for the role and then deciding to accept it, what I thought about you working there and whether I would recommend it, whether I wouldn’t. And I think my answer was fairly nuanced. And you’re working there now and we still are on speaking terms, so people can probably guess what my comments took the shape of, generally speaking. So, I’m going to have to ask now; it’s been, what, a year since you joined?



Rick: Almost. I think it’s been about eight months.



Corey: Time during a pandemic is always strange. But I have to ask, did I steer you wrong?



Rick: No. Definitely not. I’m very happy to be here. The opportunity to help such a broad range of companies get more value out of technology—and it’s not just cost, right, like we talked about. It’s actually not about the dollar number going down on a bill. It’s about getting more value and moving the needle on how do we efficiently use technology to solve business needs.



And that’s been my career goal for a really long time, I’ve been working on optimization for, like, seven or eight, I don’t know, maybe even nine years now. And it’s like this strange passion for me, this combination of my dad taught me how to be a really good steward of money and a great budget manager, and then my passion for technology. So, it’s this really cool combination of, like, childhood life skills that really came together for me to create a career that I’m really passionate about. And this move to AWS has been such a tremendous way to supercharge my ability to scale my personal mission, and really align it to AWS’s broader mission of helping companies achieve more with cloud platforms, right?



And so, it’s been a really nice eight months. It’s been wild. Learning AWS culture has been wild. It’s a sharply diverging culture from where I’ve been in the past, but it’s also really cool to experience the leadership principles in action. They’re not just things we put on a website; they’re actually things people talk about every day [laugh]. And so, that journey has been humbling and a great learning opportunity as well.



Corey: If people want to learn more, where’s the best place to find you?



Rick: Oh, yeah. Contact me on LinkedIn or Twitter. My Twitter account is @rickyo1138. Let me know if you get the 1138 reference. That’s a fun one.



Corey: THX 1138. Who doesn’t?



Rick: Yeah, there you go. And it’s hidden in almost every single George Lucas movie as well. You can contact me on any of those social media platforms and I’d be happy to engage with anybody that’s interested in optimization, cloud technology, billing, anything like that. Or even not [laugh]. Even anything else, either.



Corey: Thank you so much for being so generous with your time. I really appreciate it.



Rick: My pleasure, Corey. It was wonderful talking to you.



Corey: Rick Ochs, Principal Product Manager at AWS. I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry comment, rightly pointing out that while AWS is great and all, Azure is far more cost-effective for your workloads because, given their lack of security, it is trivially easy to just run your workloads in someone else’s account.



Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.



Announcer: This has been a HumblePod production. Stay humble.



