In this episode, they talk about why you shouldn’t treat the cloud as your own data center, how running apps in the cloud the same way you’d run them in your own data center is the most expensive way to do it, why lift-and-shift is a solid strategy for getting into the cloud quickly (and where the strategy fails), what it’s like to actually manage Cassandra clusters, why you should leverage AWS as a data center and explore the endless number of tools that exist in the AWS ecosystem, and more.
- Forrest Brazeal article referenced: https://acloudguru.com/blog/engineering/the-lift-and-shift-shot-clock-cloud-migration
- Unconventional Guide: https://www.duckbillgroup.com/resources/unconventional-guide-to-aws-cost-management/
Corey: This episode is sponsored in part by our friends at Fairwinds. Whether you’re new to Kubernetes or have some experience under your belt, and then definitely don’t want to deal with Kubernetes, there are some things you should simply never, ever do in Kubernetes. I would say, “run it at all”; they would argue with me, and that’s okay because we’re going to argue about that. Kendall Miller, president of Fairwinds, was one of the first hires at the company and has spent the last six years making the dream of disrupting infrastructure a reality while keeping his finger on the pulse of changing demands in the market and valuable partnership opportunities. He joins senior site reliability engineer Stevie Caldwell, who supports a growing platform of microservices running on Kubernetes in AWS. I’m joining them as we all discuss what Dev and Ops teams should not do in Kubernetes if they want to get the most out of the leading container orchestrator by volume and complexity. We’re going to speak anecdotally of some Kubernetes failures and how to avoid them, and they’re going to verbally punch me in the face. Sign up now at fairwinds.com/never. That’s fairwinds.com/never.
Pete: Hello, and welcome to the AWS Morning Brief: Fridays From the Field.
Jesse: I like that. I feel like that's good. That's a solid way to start us off.
Pete: Triple F. I am Pete Cheslock.
Jesse: I'm Jesse DeRose.
Pete: #TripleF. We should get some, I don’t know, jackets made? Mugs?
Jesse: Lapel pins? I'm open. I've always wanted a Members Only jacket.
Pete: If Guy Fieri can call diners, drive-ins, and dives, “Triple D,” then we can definitely call this Triple F.
Jesse: We can definitely make this happen.
Pete: It's not my high school transcript, either, we're talking about here. Oh, well, we are back again, continuing our series on The Unconventional Guide to Cost Management with Episode Two: the Cloud is not your data center.
Jesse: Yeah, this one's gonna be a fun one. I feel like this is a topic that comes up a lot in conversations, sometimes with clients, sometimes with potential clients that are asking, “What kind of things do you see day-to-day? What are some of the big pain points that you see with your cost optimization work?” And so real quick backstory, make sure that you've listened to the previous few episodes to get some context for this segment that we're doing and get some framing for this Unconventional Guide work that we are discussing. But talking about using the Cloud as a data center, I have a lot of thoughts on this.
Pete: Well, hold on a second. Isn't the Cloud just someone else's data center?
Jesse: [laugh] I—yeah, you know, this is the same argument of serverless isn't actually serverless. It's just somebody else's computer.
Pete: [laugh]. Someone else's Docker container. But really, there's a lot of ways we're going with this one. But we're coming at it from, obviously, a cost management perspective. And the big, bold, unpopular opinion that we're gonna say is, the most expensive way to run an application in the Cloud, is by treating the Cloud as just another data center; it's going to cost you way more than it would cost to run in a normal data center. And this goes to the world of, in the early days of Cloud, people just raging online and in conferences about the Cloud, it's so expensive. And yes, it is so expensive, if you treat it like an antiquated data center.
Jesse: And really quick before you get your pitchforks out, there is this concept of ‘lift and shift’ that everybody likes to talk about or ‘technical transformation’ that everybody likes to talk about: moving from a data center into the Cloud, which a lot of people see as this movement where they just uproot everything from their local data center into AWS. And to be clear, we do recommend that. That is a solid strategy to get into the Cloud as fast as possible; just move those workloads over. But it is going to be expensive, and it's not what you ultimately want to stick with long term. So, that's ultimately the big thing to think about here.
Yes, lifting and shifting from your data center into the Cloud is absolutely worthwhile. But it creates this shot clock that's now running after your migration is complete, where if you don't move on to all of the services, and opportunities, and solutions that AWS provides that are native solutions, cloud-native solutions, managed solutions, you're going to end up spending a lot more money than you want.
Pete: Yeah, “The Lift And Shift Shot Clock,” that was a great blog post by Forrest from ACG, ACloudGuru. We'll include a link to that in the show notes. It talks about how not only do you have technical debt accruing as you lift and shift, but potentially the brain drain as people get sick of managing this hot mess that you've lifted and shifted over. That doesn't mean you shouldn't do it.
You absolutely should get into the Cloud, get into a singular vendor with your workloads as fast as possible so that you can then dedicate resources to refactoring all of that. Don't just forget about it and leave it behind. It's not going to end well for you. And you do have a time; the timer is running. So, when you're only using those core primitives—compute, object store, block store—yeah, you're going to have a pretty fixed cost on your cloud bill.
But to Jesse's point, there's a lot of other services. Some of those require an engineering effort. Some of those just involve correctly using an instance type or a storage location that is more specific to its access patterns. I mean, everything as basic as T class instances (for those services that maybe don't use a lot of CPU) to reminding yourself that there are multiple tiers of S3 storage. Even Intelligent Tiering will just tier it for you.
So, if you go and store everything on standard S3 storage and use GP2 volumes on EC2, yeah, it's gonna be expensive. And I know that because I look at a lot of Amazon bills, and Jesse does too, and we see the same thing. “Oh, you've got a really high bill.” “Yeah, we spend a lot on EC2.” It's like, “Oh, let me guess. A lot of, like, I3s and C5s and M5s and a ton of EBS, right?” And they give you all this optionality, and I think it's that choice which is so overwhelming for many folks moving to the Cloud. I mean, that's really the case. It's just, “What do I pick?” There's just so much.
Jesse: So, let's talk about ephemerality, especially in the world of compute. Ephemerality really means savings, in this context. When you think about workloads that maybe are intermittent workloads or request-based workloads, if you have peaks and valleys of demand, there's going to be times where that workload is extremely busy processing all of those requests. And then there's going to be times where there are no requests coming in, and your servers are sitting idle, and you're paying for all of that compute usage that's not doing anything. So, if you can move your compute resources towards ephemeral resources, when you think about spot instances, when you think about moving from EC2 to ECS or Fargate, you will end up only paying for the time that your workloads are actually running and are actually processing requests rather than 24/7.
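To put rough numbers on the idle-capacity point, here is a minimal sketch that compares a fleet billed around the clock with one billed only for the hours it is busy. The function names and any rate you pass in are assumptions for illustration, not real AWS prices, and the model ignores startup time and minimum billing increments.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def always_on_cost(hourly_rate, instances):
    """Monthly cost of a fixed fleet that runs 24/7, busy or not."""
    return hourly_rate * instances * HOURS_PER_MONTH


def ephemeral_cost(hourly_rate, busy_instance_hours):
    """Monthly cost when you pay only for instance-hours that serve requests."""
    return hourly_rate * busy_instance_hours


def idle_share(instances, busy_instance_hours):
    """Fraction of the always-on bill that pays for idle capacity."""
    return 1 - busy_instance_hours / (instances * HOURS_PER_MONTH)
```

Calling `idle_share(10, 2000)`, for example, reports how much of a ten-instance always-on bill goes to hours in which nothing was processed, given 2,000 busy instance-hours in the month.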
Pete: Yeah, I think we need to break away from this trope of, “Well, high CPU is bad.”—
Pete: —“Because anything less than a hundred percent CPU is waste.” Now, hold on a second, someone out there who runs a lot of stuff on the JVM says—
Jesse: Don't @ us, please.
Pete: Remember, you can go to lastweekinaws.com/QA and register your complaint there. So, I understand. You have to run this Java application and it is an unholy hot mess and you need to just put a whole bunch of memory in that box. It's just, “I need a lot of memory, so I need that big instance.”
Well, again, look at the CPU access patterns. That's what these T class instances are for. That's what they're for. That's what they're designed for is to take and let you have that memory you can allocate to heap for intermittent workloads. Try it out. Guess what? If it doesn't work, you can always move it to another instance. It exists, right? [laugh].
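One way to "try it out" on paper first is to replay a day of CPU samples against the burstable credit model. T class instances earn CPU credits at a baseline rate and spend them when CPU runs above it, and one credit is one vCPU at 100% for one minute. The sketch below is a simplified simulation under those rules; the baseline, starting balance, and credit cap are parameters you would look up for a specific instance size, and the function names are hypothetical.

```python
def simulate_credits(hourly_cpu_pct, vcpus, baseline_pct, start_credits, max_credits):
    """Return the CPU credit balance at the end of each hour.

    One credit is one vCPU running at 100% for one minute.
    """
    earned_per_hour = baseline_pct / 100 * vcpus * 60
    balance = float(start_credits)
    history = []
    for pct in hourly_cpu_pct:
        spent = pct / 100 * vcpus * 60
        # Credits accrue up to a cap and cannot go below zero.
        balance = max(0.0, min(max_credits, balance + earned_per_hour - spent))
        history.append(balance)
    return history


def stays_within_credits(hourly_cpu_pct, vcpus, baseline_pct, start_credits, max_credits):
    """True if the balance never reaches zero over the sampled hours."""
    history = simulate_credits(hourly_cpu_pct, vcpus, baseline_pct, start_credits, max_credits)
    return all(balance > 0 for balance in history)
```

If `stays_within_credits` returns `False` for your real utilization samples, that suggests the workload is not as intermittent as it looks and a fixed-performance instance may be the better fit.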
Jesse: I think this is again, getting to your point, Pete, that you mentioned before, which is there's such a wide variety of options within the realm of compute. Where do you begin?
Jesse: What do you want to start with? And most customers think, “Okay, my bare metal servers sitting in the data center had this amount of CPU and that amount of RAM, so I'm just going to spin up a bunch of servers that have the same thing.” That's not necessarily what you want. And that's not necessarily what you need.
Pete: Right. And I know, Jesse, you mentioned before all these higher-order services within Amazon. And when you look at the cost for those, oftentimes it can appear to be a lot more expensive.
Pete: And so you'd say to yourself, “Well hold on, I'm moving from these EC2 instances to Dynamo. Moving from my Cassandra cluster to Dynamo, this is going to be so much more expensive.” The trick to that is, again, you have to understand your usage patterns because especially on Dynamo, you can alternate between on-demand tables, that maybe don't cost you very much and are truly only charged for when you use them, versus provisioned tables. You can auto-scale up those tables and auto-scale them down again as needed. And that's truly revolutionary when you've dealt with managing a Cassandra cluster on EC2.
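The on-demand versus provisioned choice comes down to how spiky the traffic is, which a small comparison can make concrete. This is a sketch under loud assumptions: the prices are caller-supplied placeholders rather than real DynamoDB rates, one capacity unit is treated as one request per second, and the provisioned table is sized for peak with no auto-scaling. All function names are hypothetical.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def on_demand_monthly(requests_per_month, price_per_million_requests):
    """On-demand mode: pay per request actually made."""
    return requests_per_month / 1_000_000 * price_per_million_requests


def provisioned_monthly(capacity_units, price_per_unit_hour):
    """Provisioned mode: pay for reserved capacity every hour, used or not."""
    return capacity_units * price_per_unit_hour * HOURS_PER_MONTH


def cheaper_mode(requests_per_month, peak_requests_per_second,
                 price_per_million_requests, price_per_unit_hour):
    """Compare on-demand with a table provisioned for peak traffic."""
    on_demand = on_demand_monthly(requests_per_month, price_per_million_requests)
    provisioned = provisioned_monthly(peak_requests_per_second, price_per_unit_hour)
    return "on-demand" if on_demand < provisioned else "provisioned"
```

A table with a high peak but few total requests tends to come out as on-demand, while a table that runs near its peak all month tends to come out as provisioned. Auto-scaling narrows that gap, and this sketch deliberately leaves it out.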
Jesse: My heart goes out to everybody who has actually managed Cassandra clusters.
Pete: Scars everywhere. I wake up in cold sweats some nights remembering some of those Cassandra management issues. But outside of just my mental health for managing Cassandra, there's the overhead of all those systems, of all that EBS involved, network and data transfer. Goes back to the great story of, “I can tell you're running Cassandra by looking at your data transfer bill on AWS.”
And it's the people, too. I think most companies are very bad at the opportunity cost of managing their own databases. Sure, if your business is DataStax, and it's running a hosted Cassandra cluster for your clients, yeah, that's your core business model; you should be very good at that. But for most other people, maybe focus your time on your product and making it a lot better versus messing around with self-managed databases.
Jesse: This is one of the big opportunity trade-offs with AWS managed services. There's a lot of freebies, essentially, that AWS managed services provide, that you wouldn't get if you were running something from scratch on on-demand EC2 instances. And so for example, a lot of stateful distributed services require a lot of replication, for example, to keep data up to date and keep the cluster up to date. So, let's say you've got a cluster in one region, you deploy it across a couple AZs to keep availability high, and then you deploy it also in another region for some form of disaster recovery. Or maybe you've got an active-active application setup that requires Cassandra running in two different regions simultaneously. That's a lot of data that's replicating back and forth, just within one region and then across regions. Now, if you move on to one of AWS’s managed service solutions for the same workloads, a lot of that data transfer is free.
Pete: It’s free. It's free. It's crazy that Amazon would give anything away for free.
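The replication traffic described above can be estimated with simple arithmetic. The sketch below assumes every gigabyte written is copied once to each replica in another AZ and once to each replica in another region. The per-GB rates are caller-supplied placeholders, not real AWS data transfer prices, and the function names are hypothetical.

```python
def replication_transfer_gb(write_gb, replicas_other_az, replicas_other_region):
    """GB that cross AZ and region boundaries when each write goes to every replica."""
    return {
        "cross_az": write_gb * replicas_other_az,
        "cross_region": write_gb * replicas_other_region,
    }


def replication_transfer_cost(write_gb, replicas_other_az, replicas_other_region,
                              cross_az_rate, cross_region_rate):
    """Monthly data transfer cost of replication at the given per-GB rates."""
    gb = replication_transfer_gb(write_gb, replicas_other_az, replicas_other_region)
    return gb["cross_az"] * cross_az_rate + gb["cross_region"] * cross_region_rate
```

Passing rates of zero models the case where a managed service does not bill for that replication traffic, so the gap between the two results is the transfer cost you carry by running the cluster yourself.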
Corey: This episode is sponsored in part by CircleCI. CircleCI is the leading platform for software innovation at scale. With intelligent automation and delivery tools, more than 25,000 engineering organizations worldwide—including most of the ones that you’ve heard of—are using CircleCI to radically reduce the time from idea to execution to—if you were Google—deprecating the entire product. Checkout CircleCI and stop trying to build these things yourself from scratch, when people are solving this problem better than you are internally. I promise. To learn more, visit circleci.com.
Jesse: And this gets to the other point, too, about management overhead; all of these AWS managed services build in the cost of that management overhead. So, on its surface, these managed services are more expensive than straight up using an EC2 instance with your workloads, but when you start factoring in all of the other components, like engineering effort, data transfer, administrative overheads, these managed services start looking pretty good.
Pete: Yeah. Let's be really honest here. Open source is only free if your time is worth nothing. And many engineers have a very low value of their own time, and—you know because you want to solve problems, you want to make things better, you want to fix the thing in front of you. You can't envision a world where the thing in front of you isn't there anymore.
There's always something to fix. There's always something to make better. And so it's really a statement for the flexibility that these managed services give you. I think too, as well, around some of these (the Dynamo example is a great one), you can dynamically modify Dynamo tables to match workload needs. Consider Postgres, running Postgres on EC2, versus RDS [unintelligible] running it on Aurora.
The flexibility you have in dynamically adjusting the size of that engine, that pays for the added cost. We roughly see it's about 20 percent more expensive to run a database on RDS. That 20 percent allows me to very easily back it up and run it across multiple availability zones. Aurora has some multi-regional functionality; there are just so many features I don't have to think about. If someone says, “Hey, can you increase the size of this instance?”
I don't have to go into sweats thinking about, “Oh, my God, I don't want to accidentally nuke all this data.” I can update my Terraform or, God forbid, go into the Amazon Console and click-click, and just make bigger, right? I wish you could say I can't put a price on that, but maybe the price is 20 percent more. But then I can go do something else that is far more valuable to the success of the business.
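The "maybe the price is 20 percent more" trade-off can be framed as a break-even calculation. This is a minimal sketch using the rough 20 percent figure from the conversation as the default; the infrastructure cost and the engineer hourly cost are inputs you would supply, and the function names are hypothetical.

```python
def managed_premium(self_managed_infra_cost, premium_pct=20.0):
    """Extra monthly spend for the managed service over self-managed infrastructure."""
    return self_managed_infra_cost * premium_pct / 100


def break_even_hours(self_managed_infra_cost, engineer_hourly_cost, premium_pct=20.0):
    """Engineer-hours of operational work per month at which the premium pays for itself."""
    return managed_premium(self_managed_infra_cost, premium_pct) / engineer_hourly_cost
```

If backups, resizing, and failover for the self-managed database take more hours per month than `break_even_hours` returns, the managed option is already the cheaper one before counting what else that time could have been spent on.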
Jesse: Absolutely. So, when we talk about moving from data centers into the Cloud, and we talk about leveraging AWS as a data center, AWS has so many amazing features and opportunities for you, just waiting to be used to help you lower your bill. And yes, there is a little bit of lock-in, in terms of now you're using AWS native solutions that you can't easily move to, let's say, GCP or, God forbid, Azure. But it doesn't mean that you aren't getting amazing bang for your buck; it doesn't mean that there aren't other opportunities to use those same (or similar, I should say) services in different cloud providers, which is a whole topic in and of itself. But don't just use the default resources in AWS. Make sure that you migrate into AWS and then move into all of the amazing native solutions that AWS has to offer.
Pete: You know, everyone is always so scared of vendor lock-in. I feel like people have been preaching about vendor lock-in for decades now, that I've been in tech. And the reality to vendor lock-in as it relates to Amazon specifically is that sure, you could run everything on EC2, use no native services at all. But wait, didn't you use IAM for all of your authentication and access control? Whoops, didn't you use all that Terraform— which is very specific to the AWS APIs?
There is real work in actually moving off. So, then let's say you end up moving to GCP, and your entire engineering and ops team quits because they are all experts on Amazon, not GCP. They don't want to have to deal with that. Or as you move to, let's say Oracle or Azure, and then you’d just have an armed rebellion. That may be a little bit too on the nose given recent times.
But vendor lock-in is, I don't believe, as much of a thing as people give it to be. Vendor lock-in, you're locked into really all the decisions you make in general, doesn't mean you're locked in forever. If the business wants to change the type of database that they run underneath the hood, they can prioritize that over maybe growing revenue. But I think the more conversations you have about that with actual executives at a business, they say, eh, just keep running the thing; we need to grow revenue instead.
See, at a high level the biggest gain that you can see within a business to move quickly, to spend the least amount of money, realize that there are a lot of these services that will help you increase your ephemerality: Fargate, Lambda, you can run spot instances with ECS, and spot instances with Fargate, defined duration spot instances. If you're really scared about instances being just ripped away from underneath you, but you still want to save some money, you can just define a duration, say, “I want this server for one hour.” If you're running any sort of EMR Hadoop task, that's a great way to say, “Great, I'm just gonna run nonstop for an hour.” I don't have to worry about this host going away.
So, again, a lot of tools out there that exist. Don't let the initial dollar amount scare you, and really try to take a more holistic approach, and add in the engineering time as well. And maybe what else you could be working on. When you start thinking about some of the improvements that you can make in being more cloud-native, I think is what the kids call it nowadays. Right, Jesse?
Pete: Well, if you enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, still go to lastweekinaws.com/review, give us a five-star review, and then go ask us a question. Give us your feedback at lastweekinaws.com/QA. We will be pulling together those questions, and feedback, and hot takes, and warm takes, and even the cold takes, we're gonna read all of them. And we will answer those in a future episode as we talk about more of The Unconventional Guide to Cost Management. Thank you.
Announcer: This has been a HumblePod production. Stay humble.