- A Cloud Guru Blog post, Lift and Shift Shot Clock: https://acloudguru.com/blog/engineering/the-lift-and-shift-shot-clock-cloud-migration
- The Duckbill Group: https://www.duckbillgroup.com/
Corey: This episode is sponsored in part by Catchpoint. Look, 80 percent of performance and availability issues don’t occur within your application code in your data center itself. It occurs well outside those boundaries, so it’s difficult to understand what’s actually happening. What Catchpoint does is makes it easier for enterprises to detect, identify, and of course, validate how reachable their application is, and of course, how happy their users are. It helps you get visibility into reachability, availability, performance, reliability, and of course, absorbency, because we’ll throw that one in, too. And it’s used by a bunch of interesting companies you may have heard of, like, you know, Google, Verizon, Oracle—but don’t hold that against them—and many more. To learn more, visit www.catchpoint.com, and tell them Corey sent you; wait for the wince.
Pete: Hello, and welcome to the AWS Morning Brief: Whiteboard Confessional. I am again Pete Cheslock, not Corey Quinn. He is still out, so you're stuck with me for the time being. But not just me because I am pleased to have Jesse DeRose join me again today. Welcome back, Jesse.
Jesse: Thanks again for having me.
Pete: So, we are taking this podcast down a slightly different path. If you've listened to the last few that Jesse and I have run while Corey has been gone, we've been focusing on kind of deep-diving into some interesting, in some cases, new Amazon services. But today, we're actually not talking about any specific Amazon service. We're talking about another topic we're both very passionate about. And it's something we see a lot with our clients at The Duckbill Group: people treating the Cloud like a data center.
And what we know is that the Cloud, Amazon, these are not just data centers, and if you treat it like one, you're not actually going to save any money, you're not going to get any of the benefits out of it. And so there's an impact that these companies will face when they choose between something like cloud-native versus cloud-agnostic or a hybrid-cloud model as they adopt cloud services. So, let's start with a definition of each one. Jesse, can you help me out on this?
Jesse: Absolutely. So, a lot of companies today are cloud-native. They focus primarily on one of the major cloud providers when they initially start their business, and they leverage whatever cloud-native offerings are available within that cloud provider, rather than leveraging a data center. So, they pay for things like AWS Lambda, or Azure Functions, or whatever cloud offering Google's about to shut down next, rather than paying for a data center, rather than investing in physical hardware and spinning up virtual machines, they focus specifically on the cloud-native offerings available to them within their cloud provider.
Whereas cloud-agnostic is usually leveraged by organizations that already use data centers, so they're harder pressed to immediately migrate to the Cloud: the ROI is murkier, and there's definitely sunk costs involved. So, in some cases, they focus on the cloud-agnostic model where they leverage their own data centers and cloud providers equally, so that compute resources run virtual servers no matter where they are. Effectively, all they're looking for is some kind of compute resources to run all their virtual servers, whether that is in their own data center, or one of the various cloud providers, and then their application runs on top of that in some form.
Last but not least, the hybrid-cloud model can take a lot of forms, but the one we see most often is clients moving from their physical data centers to cloud services. And effectively, this looks like continuing to run static workloads in physical data centers or running monolith infrastructure in data centers, and running new or ephemeral workloads in the Cloud. So, this often translates to: the old and busted stays where it is, and new development goes into the Cloud.
Pete: Yeah, we see this quite a bit where a client will be running in their existing data centers, and they want all the benefits that the Cloud can give them, but maybe they don't want to really truly go all-in on the Cloud. They don't want to adopt some of the PaaS services because of fear of lock-in. And we're definitely going to talk about vendor lock-in because I think that is a super-loaded term that gets used a lot. Hybrid-cloud, too, is an interesting one because some people think that this is actually running across multiple cloud providers, and that's just something we don't see a lot of. And I don't think there are a lot of clients, or companies out there, running true multi-cloud, which I think is the term that you would really hear.
And the main reason I believe that not a lot of people are doing this, running a single application across multiple clouds, is that people don't talk about it at conferences. And at conferences, people talk about all the things that they do when in reality, it's wishful thinking. And yet no one is willing to talk about this kind of, oh, we're multi-cloud in, like, again, kind of, a singular application world. So, one thing we do see across these three, you know, models, at a high level, cloud-native, agnostic, hybrid-cloud, is that the spend is just dramatically different if you were to compare multiple companies across these different use cases. Jesse, what are some of the things that you've seen across these models that have impacted spend?
Jesse: I think first and foremost, it's really important to note that this is a hard decision to make from a business context because there's a lot of different players involved in the conversation. Engineering generally wants to move into the Cloud because that's what their engineers are familiar with. Whereas finance is familiar with an operating model that does not clearly fit the Cloud. Specifically, we're talking about CapEx versus OpEx: we're talking about capital expenditures versus operating expenditures. Finance comes from a mindset of capital expenditures, where they are writing off funds that are used to maintain, acquire, upgrade physical assets over time.
So, a lot of enterprise companies manage capital expenditure for all the physical hardware in their data centers. It's a very clear line item to say, “We bought this physical hardware; it's going to depreciate over time.” But moving into the Cloud, there is an operating expenditure model here instead, which focuses on ongoing costs for running a product because any cloud provider is going to charge you an on-demand price by default. You're rarely going to pay an upfront cost with any new service that you run in any cloud provider.
Pete: Yeah, I think that's a really good point, which is the model flips on its head, which is why it really trips up a lot of companies. In the old days, which for some companies really isn't that old, you have some servers that you purchased, maybe three to five years ago. From an accounting standpoint, they're fully depreciated, which means they're not really costing anything; they probably have no value. Which means they could sit idle, it doesn't cost the business anything. But if you spin up EC2—and if we use EC2 as an example, running an EC2 instance that is not at 100% CPU—that's not absolutely maxed out on its resources, whether it is CPU or memory—you are wasting money.
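The accounting contrast described above can be sketched in a few lines of Python. This is an illustration only, not something from the episode: it assumes straight-line depreciation for purchased hardware and a flat hourly on-demand rate, and the function names and every number in the usage lines are hypothetical placeholders rather than real AWS prices.

```python
def straight_line_book_value(
    purchase_price: float, useful_life_years: int, age_years: float
) -> float:
    """Remaining book value of purchased hardware under straight-line depreciation."""
    if useful_life_years <= 0:
        raise ValueError("useful_life_years must be positive")
    # Value declines evenly each year and never drops below zero.
    remaining_fraction = max(0.0, 1.0 - age_years / useful_life_years)
    return purchase_price * remaining_fraction


def idle_on_demand_spend(hourly_rate: float, hours: float, utilization: float) -> float:
    """Share of on-demand spend that paid for capacity nobody used."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return hourly_rate * hours * (1.0 - utilization)


# Hypothetical inputs: a 10,000-dollar server with a five-year life that is
# now five years old, and an instance billed at 0.10 per hour for a 730-hour
# month while averaging 25% utilization.
book_value = straight_line_book_value(10_000.0, 5, 5)
wasted = idle_on_demand_spend(0.10, 730, 0.25)
print(f"Book value of the old server: {book_value:.2f}")
print(f"On-demand spend on idle capacity this month: {wasted:.2f}")
```

The old server shows nothing on the books when it sits idle, while the idle share of an on-demand instance shows up on every monthly bill.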
And again, that's with the exception of T classes, and there's a lot of other interesting ways around it. But by running everything on EC2, which we see when companies either adopt the Cloud or just due to architecture reasons, there's a lot of hidden costs within there. There's a lot of waste within EC2 that you don't get when you architect for ephemerality, like when you can architect to use services like Lambda, Functions as a Service, or services like Fargate, where you can just run a container for a period of time. I think one of the bigger cost areas of EC2 comes down to operational overhead that no one ever thinks about; no one ever considers the people involved in operating and managing all of the complexity around your multiple EC2 servers.
And running your application on EC2, running a database on top of EC2, and then your application on top of that database: there are so many levels within there. And we see it a lot, too, as people start to, for some reason, deploy their own Kubernetes to a cloud provider and then deploy an application on top of it. They're just adding the complexity on top. But there's actually another thing, too, that Jesse, I know you dive into a lot and you see with our clients with some of the hidden costs on EC2. What is that?
Jesse: It's data transfer. And it's absolutely phenomenal to see because moving from a data center world, data transfer is completely free in most cases. The network traffic between physical servers in a single data center doesn't cost anything, and there may be some costs involved with bandwidth to and from a given data center, but for the most part that data transfer is free. Whereas, again, in the OpEx model with on-demand spend in the Cloud world, data transfer costs you, for lack of a better phrase.
Think about your Kafka workloads. Think about your Cassandra and MongoDB workloads. Think about any of your distributed managed services that are running in your ecosystem: those require lots of replication traffic in order to run effectively. And that traffic isn't free if you run the services on top of EC2 instances. You're going to be charged for every bit of traffic that runs between nodes within availability zones—or across availability zones, and across regions. So, you're paying for a lot of data transfer upfront for these services.
Pete: Yeah, I remember talking with an Amazon account manager many years ago, and they said to me, “Oh, I can just look at your bill and tell you if you're running Cassandra or not,” because in deploying Cassandra, you're going to have to replicate your writes across multiple availability zones (I mean, if you care about your data), and there's a cost to replicate across availability zones within Amazon. And that's just something that people don't think about when they're running in their own data center. Sending data out to the internet, sure, has a cost. You know, the peering traffic, and things of that nature. But once you're inside a rack of servers, or multiple data centers, even, if you own those connections, you can just send whatever data you want.
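A rough sketch of why replication traffic is visible on the bill, written in Python. It is a deliberate simplification under stated assumptions: each write lands on one replica in the local availability zone and is copied once to each of the other replicas in other zones, and it ignores reads, repairs, compression, and coordinator hops. The function name is hypothetical, and the per-GB price is a parameter you would fill in from your provider's current pricing page; the value in the usage lines is a placeholder, not a quoted rate.

```python
def monthly_cross_az_replication_cost(
    write_gb_per_day: float,
    replication_factor: int,
    price_per_gb: float,
    days: int = 30,
) -> float:
    """Estimate the monthly cross-zone transfer bill for replicated writes."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    # One replica is assumed to be local; the rest each receive a full copy.
    remote_copies = replication_factor - 1
    transferred_gb = write_gb_per_day * remote_copies * days
    return transferred_gb * price_per_gb


# Hypothetical workload: 500 GB of writes per day, three replicas spread
# over three zones, and a placeholder price of 0.02 per GB transferred.
estimate = monthly_cross_az_replication_cost(500.0, 3, 0.02)
print(f"Estimated monthly replication transfer cost: {estimate:.2f}")
```

With a replication factor of 1 the estimate drops to zero, which is the "if you care about your data" trade-off in numeric form.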
And on paper, when a lot of our clients look at the cost of Amazon managed services like Elasticsearch, they may look expensive. Amazon Elasticsearch, Amazon's DocumentDB, Amazon's Aurora, things like that; these may look expensive when you compare to, “Well, I could just run it on EC2 myself.” But the thing you're missing, and it's not clear in a lot of ways, is the replication traffic cost: for Amazon to replicate your data across multiple availability zones for durability reasons, there's just no cost there. It's essentially baked into what you're going to pay for data storage. And so, it's a cost that is hidden that people don't even think about.
Jesse: It's something that no engineering team that I have worked for, or talked to before in my career, has thought about. They very much focus on the upfront numbers that are on paper on each cloud provider’s website, and they don't factor these questions into their conversations: What is the overhead of data transfer? What is the overhead of engineering effort to manage this new infrastructure? They don't think about it when they look at new features or new product offerings, and all of those things need to be thought about when discussing a new product or a new feature offering. It's really important to make sure that cost is part of that conversation, and that cost is not just the price on the website, but the various components of the architecture that are going to combine to give you your overall architecture.
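One way to put all of those components into the same conversation is to add them up side by side. The Python sketch below is only an illustration of that idea: the `MonthlyCost` class, the `cheapest` helper, and every figure are hypothetical, chosen to show the shape of the comparison rather than to reflect any real bill.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class MonthlyCost:
    """Cost components for one way of running a service (all values hypothetical)."""

    list_price: float          # the number on the provider's website
    data_transfer: float       # replication and other transfer charges
    engineering_hours: float   # people time spent operating it
    hourly_engineering_rate: float

    def total(self) -> float:
        people = self.engineering_hours * self.hourly_engineering_rate
        return self.list_price + self.data_transfer + people


def cheapest(options: Mapping[str, MonthlyCost]) -> str:
    """Return the name of the option with the lowest all-in monthly cost."""
    return min(options, key=lambda name: options[name].total())


# Placeholder figures: self-managed looks cheaper on list price alone,
# but carries transfer charges and far more operational time.
options = {
    "self-managed on EC2": MonthlyCost(2000.0, 600.0, 40.0, 100.0),
    "managed service": MonthlyCost(3500.0, 0.0, 5.0, 100.0),
}
for name, cost in options.items():
    print(f"{name}: {cost.total():.2f}")
print(f"Lowest all-in cost: {cheapest(options)}")
```

Swapping in your own numbers may well flip the result; the point is only that the list price is one term among several.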
Corey: This episode is sponsored in part by our good friends over at ChaosSearch, which is a fully managed log analytics platform that leverages your S3 buckets as a data store with no further data movement required. If you're looking to either process multiple terabytes in a petabyte-scale of data a day or a few hundred gigabytes, this is still economical and worth looking into. You don't have to manage Elasticsearch yourself. If your ELK stack is falling over, take a look at using ChaosSearch for log analytics. Now, if you do a direct cost comparison, you're going to say, “Yeah, 70 to 80 percent on the infrastructure costs,” which does not include the actual expense of paying infrastructure people to mess around with running Elasticsearch themselves. You can take it from me or you can take it from many of their happy customers, but visit chaossearch.io today to learn more.
Pete: Let's move on to one of my favorite things ever. I love to talk to people about this, and mostly just rant about it. It's vendor lock-in. It's that term that you hear all the time that usually drives an ill-conceived architectural decision. “Oh, we can't do that. We don't want to be locked into that vendor.” But it’s o—
Jesse: I hate to break it to you, but almost no matter where you go, you've got vendor lock-in.
Pete: You're locked into so many decisions that you have no control over. Let's just say you get locked into Amazon, and you're on Amazon Web Services, and for some reason, you think you’re vendor-locked-in to Amazon Web Services. But what does your application run on? Does it run on Cassandra? You're locked into Cassandra. Does it run Mongo? Well, you've got some vendor lock-in there.
You could say, “Oh, well, these are open source solutions. I can change at any time.” Okay. I'll come back to you in two years, and I'm going to look at those existing databases that are still running Postgres, Cassandra, Mongo, whatever. You're locked into those things, and it's not a big deal. It's just something to keep in mind: don't throw around the vendor lock-in boogeyman and scare off any sort of reasonable improvement in your infrastructure.
Jesse: And it's also worth noting that if you are already on a specific cloud provider, and talk about moving to another cloud provider because you're worried about lock-in, talk to your engineers first because in most cases, they already have skill sets for whatever cloud provider you're on, and if you move to another cloud provider, they may or may not stay with you. The learning curve may be astronomically high for them to move all of your infrastructure to a new vendor.
Pete: Yeah. That is one of the biggest points. You're really locked in by your ability to hire the expertise for the specific cloud provider you're on. And if you have a lot of engineers who are experts in Amazon Web Services, and you go to them, like Jesse said, and say, “Yeah, we signed a deal with Azure and we're going to move there,” or, “We signed a deal with Oracle Cloud; we're going to move there,” my guess is, before you finish that sentence, half of those engineers are on job boards looking for their next move because for a lot of them, they have their own lock-in, right?
Their own sunk cost in learning all this stuff around Amazon. They may want to work in that ecosystem, so you could lose a lot of your engineers by the choice you make, depending on where you go. And this also counts for people who are moving into the Cloud. If you're in data centers now and you're moving to the Cloud, the choice you make isn't always who will give you the best deal. It’s, how can I retain my staff as well, right? That's a big part of the lock-in that, again, people don't even think about. No one thinks about the people side, Jesse, I don't understand it.
Jesse: It drives me crazy.
Pete: [laughs]. So, we've talked a little bit about, you know, those lift and shift, right. The enterprise folks that are running their data centers, they want to get in the Cloud, and the model you hear most often is lift and shift. Pick up your application as it exists and, kind of, drop it down there.
And those clients in many ways, follow this model we talked about, right? They go and spin this up on EC2, and they go and deploy their application. And that's fine, actually. That's a smart move, and that's actually a recommended move by most cloud providers. Just bring your service over; get it into the Cloud as soon as possible, then re-architect. But there's actually a great blog post out there from A Cloud Guru talking about the Lift and Shift Shot Clock. What was this concept that they talked about in that blog post, Jesse?
Jesse: The ‘lift and shift shot clock’ is something that I think every enterprise faces at some point, including every company that we've talked to that says they are either in a quote-unquote, “Hybrid model,” or in the enterprise data center space moving into the Cloud. The lift and shift shot clock is the time you have after you lift and shift, if you don't update your applications, before you lose your engineers. Effectively, you are counting down from that lift and shift, once you have moved all of your infrastructure into the Cloud, whether it's AWS or another provider, and you don't then migrate to the native services that that cloud provider offers.
If you don't make that move, you're effectively keeping your feet in two different places. You are focusing on data centers, and you are focusing on your cloud provider. And it becomes harder and harder for existing and new engineers to know, where do I deploy something? Do I deploy it to the data center? Do I deploy it to the Cloud?
We ran into this with a previous client where multiple different product offerings existed across both their data centers and the Cloud. And the teams that were managing these offerings, didn't know where to deploy their work because they didn't know, is everything supposed to be moving towards the Cloud? Or are we moving back to the data centers? Or are we splitting it between the two? There wasn't a clear business decision and roadmap saying, “Hey, this is the way that we need to move.”
It was really, really important for leadership to effectively point in one direction and continue to march forward. And not just leadership, but ultimately this needed to be a grassroots effort as well. It's really important for everybody in the company to be involved in this conversation to make sure that once a company decides to move from a data center into the Cloud, that they flesh out all parts of the migration and make that final step from lifting and shifting to cloud-native offerings.
Pete: Yeah, you can't just lift and shift over and 12, 18, 24 months later still have all those systems there. You need to start adopting all of the benefits of the Cloud: the ephemerality, and all the different PaaS services that are arguably providing you a service much better than you could build yourself. And once you get two, three, four years out, you're just going to have this drain of talent as people don't want to deal with the old busted thing that has just been carried over to the new environment. Really leverage those engineers to adopt those technologies. And there's a lot of benefit there.
Jesse: I think it's also really important to note that this takes work. No matter which direction you go, migrating into the Cloud and moving things into the Cloud, or moving them from, maybe, EC2 instances into managed service offerings, is going to take work; this is not something that is going to be extremely easy. But the reward is absolutely worth the effort. You're absolutely going to get benefits from migrating from software that is running on virtual machines into an ephemeral service like AWS Fargate, or Lambda Functions. You are absolutely going to get benefit from this and receive ROI. But it is going to take work.
Pete: Exactly. And the main reason that we mention this, that we recommend it, really all just comes down to having a single cloud provider. And we could probably fill another Whiteboard Confessional on why you should only have one cloud provider, but by choosing one single cloud provider, you remove a lot of the complexity that exists in trying to do multi-cloud, which no one really is doing, as we talked about earlier. But the biggest part is you actually have a much stronger position to negotiate for better discounts by having just one provider. By adopting, in Amazon's example, their PaaS services versus just EC2, you can negotiate for service-specific discounts that can actually make the cost of those PaaS services a lot more aggressive, and maybe the delta isn't as big as you were thinking.
If you're thinking to yourself, “Well, I'm negotiating a new discount program with Amazon,” or a new discount for this or that or whatever, “and I'm just going to go to them and say, ‘Well, I'm thinking that I might move all my infrastructure to Google or to Azure,’” that only works in some very rare cases. Unless you can actually move over your data and your workloads in, like, weeks (which most people really can't do; they don't have the capability of doing that), it's a really idle threat to say, “Oh, I’m going to move over to this other cloud provider.” It's just too much of a lift to actually accomplish.
So, because that lift is so high, that level of effort is so high, focus on trying to get the most out of what your cloud vendor is providing you. Whether it's Amazon, Azure, Google, whoever, try to adopt as many of their PaaS services as possible that can help you move a lot faster. You don't have to worry about scaling up Cassandra because you can just use a PaaS service. It's not all roses; there's definitely reasons why using those PaaS services could be a big pain to the environment: maybe you're losing some visibility, losing the ability to maybe run the latest version. But from at least a pure cost perspective, and when you think about the overhead of the people, it is a lot less expensive. You don't have to send those engineers off to go and run the databases themselves. And you can get a lot of other benefits from there as well.
Jesse: It's also worth noting that your cloud account team wants to have these conversations with you. If you show that you are invested in their platform, their provider, their service, they will absolutely invest in you as well. They will provide benefits, they will provide discounts, they will have engaging conversations with you to figure out the best ways that you can receive discounts based on the amount of traffic, or compute resources, or usage that you have on their platform. So, don't be afraid to reach out to your account team and start these conversations, especially if you are planning to move more resources into your cloud provider. They absolutely want to have this conversation with you and they are open to having this conversation with you.
Pete: Yeah, for those folks out there that paid big money for enterprise support, use it; it's there; you pay that money for a reason. Reach out to your account manager, your technical account teams. Does not matter the vendor you're with. All cloud vendors should have an account team, you know, especially if you have a reasonable amount of spend.
And like Jesse said, talk to them. They want your business and they want to help you. They want you to feel like you're getting value for what you spend. But what we can say is definitely adopt all of the great things that the Cloud provides. If you treat it like just another data center, you're just going to end up with a lot of inherent waste in the system.
Well, Jesse, thanks again for joining me for this rant about how the Cloud is not a data center.
Jesse: Thank you as always, for having me. I am always happy to rant about anything Cloud-related, especially in this context.
Pete: One thing that we always agree on is that ranting about Cloud is a lot of fun, especially when you spend so much time in it like we do. If you've enjoyed this podcast, please go to lastweekonaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekonaws.com/review, give it a five-star rating on your podcast platform of choice and tell Corey congrats on the new addition to his family. Hopefully, he'll be back in, I don't know, a few more weeks from his paternity leave. But until then you are stuck with us. Thank you.
Announcer: This has been a HumblePod production. Stay humble.