The Multi-Cloud Counterculture with Tim Bray

Episode Summary

Today’s guest, rumor has it, might have had something to do with the creation of “God’s true language, XML.” Tim Bray, a principal at Textuality Services, and Corey reconnect after Tim’s recent blog post on lock-in and multi-cloud, two subjects that are close to Corey’s heart and on which Tim’s opinions are fairly countercultural. Tim expands on his blog post, which in short states that multi-cloud is not as complicated anymore. His take: it is now a “reasonable” thing for companies to ponder. For Tim, it isn’t realistic for companies, especially larger ones, to not be multi-cloud. Tim and Corey go into the ins and outs of multi-cloud, tackling the people side of multi-cloud, and more!

Episode Show Notes & Transcript

About Tim
Timothy William Bray is a Canadian software developer, environmentalist, political activist and one of the co-authors of the original XML specification. He worked for Amazon Web Services from December 2014 until May 2020, when he quit due to concerns over the termination of whistleblowers. Previously he was employed by Google, Sun Microsystems and Digital Equipment Corporation (DEC). Bray has also founded or co-founded several start-ups such as Antarctica Systems.

Links Referenced:

Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.


Corey: This episode is sponsored in part by our friends at Vultr. Spelled V-U-L-T-R because they’re all about helping save money, including on things like, you know, vowels. So, what they do is they are a cloud provider that provides surprisingly high-performance cloud compute at a price that, while sure, they claim it’s better than AWS pricing, and when they say that, they mean it is less money. Sure, I don’t dispute that, but what I find interesting is that it’s predictable. They tell you in advance on a monthly basis what it’s going to cost. They have a bunch of advanced networking features. They have nineteen global locations and scale things elastically. Not to be confused with openly, because apparently elastic and open can mean the same thing sometimes. They have had over a million users. Deployments take less than sixty seconds across twelve pre-selected operating systems. Or, if you’re one of those nutters like me, you can bring your own ISO and install basically any operating system you want. Starting with pricing as low as $2.50 a month for Vultr cloud compute, they have plans for developers and businesses of all sizes, except maybe Amazon, who stubbornly insists on having something to scale all on their own. Try Vultr today for free by visiting vultr.com/screaming, and you’ll receive $100 in credit. That’s V-U-L-T-R.com slash screaming.


Corey: Couchbase Capella Database-as-a-Service is flexible, full-featured and fully managed with built-in access via key-value, SQL, and full-text search. Flexible JSON documents aligned to your applications and workloads. Build faster with blazing fast in-memory performance and automated replication and scaling while reducing cost. Capella has the best price performance of any fully managed document database. Visit couchbase.com/screaminginthecloud to try Capella today for free and be up and running in three minutes with no credit card required. Couchbase Capella: make your data sing.


Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. My guest today has been on a year or two ago, but today, we’re going in a bit of a different direction. Tim Bray is a principal at Textuality Services.


Once upon a time, he was a Distinguished Engineer slash VP at AWS, but let’s be clear, he isn’t solely focused on one company; he also used to work at Google. Also, there is scuttlebutt that he might have had something to do, at one point, with the creation of God’s true language, XML. Tim, thank you for coming back on the show and suffering my slings and arrows.


Tim: Oh, you’re just fine. Glad to be here.


Corey: [laugh]. So, the impetus for having this conversation is, you had a blog post somewhat recently (by which I mean, January of 2022) where you talked about lock-in and multi-cloud, two subjects near and dear to my heart, mostly because I have what I thought was a fairly countercultural opinion. You seem to have a very closely aligned perspective on this. But let’s not get too far ahead of ourselves. Where did this blog post come from?


Tim: Well, I advise a couple of companies, and one of them happens to be using GCP and the other happens to be using AWS, and I get involved in a lot of industry conversations, and I noticed that multi-cloud is a buzzword. If you go and type multi-cloud into Google, you get, like, a page of people saying, “We will solve your multi-cloud problems. Come to us and you will be multi-cloud.” And I was not sure what to think, so I started writing to find out what I would think. And I think it’s not complicated anymore. I think multi-cloud is a reality in most companies. I think that many mainstream, non-startup companies are really worried about cloud lock-in, and that’s not entirely unreasonable. So, it’s a reasonable thing to think about and it’s a reasonable thing to try and find the right balance between avoiding lock-in and not slowing yourself down. And the issues were interesting. What was surprising is that I published that blog piece saying what I thought were some kind of controversial things, and I got no pushback. Which was, you know, why I started talking to you and saying, “Corey, you know, does nobody disagree with this? Do you disagree with this? Maybe we should have a talk and see if this is just the new conventional wisdom.”


Corey: There’s nothing worse than almost trying to pick a fight, but no one actually winds up taking you up on the opportunity. That always feels a little off. Let’s break it down into two issues because I would argue that they are intertwined, but not necessarily the same thing. Let’s start with multi-cloud because it turns out that there’s just enough nuance to—at least where I sit on this position—that whenever I tweet about it, I wind up getting wildly misinterpreted. Do you find that as well?


Tim: Not so much. It’s not a subject I have really had too much to say about, but it does mean lots of different things. And so it’s not totally surprising that that happens. I mean, some people think when you say multi-cloud, you mean, “Well, I’m going to take my strategic application, and I’m going to run it in parallel on AWS and GCP because that way, I’ll be more resilient and other good things will happen.” And then there’s another thing, which is that, “Well, you know, as my company grows, I’m naturally going to be using lots of different technologies and that might include more than one cloud.” So, there’s a whole spectrum of things that multi-cloud could mean. So, I guess when we talk about it, we probably owe it to our audiences to be clear what we’re talking about.


Corey: Let’s be clear: from my perspective, the common definition of multi-cloud is whatever the person talking is trying to sell you at that point in time. If it’s a third-party dashboard, for example, “Oh, yeah, you want to be able to look at all of your cloud usage on a single pane of glass.” If it’s a certain (well, I guess, a certain, not a given) cloud provider, well, they understand if you go all-in on a cloud provider, it’s probably not going to be them, so they’re, of course, going to talk about multi-cloud. And if it’s AWS, where they are the 8000-pound gorilla in the space, “Oh, yeah, multi-cloud’s terrible. Put everything on AWS. The end.” It seems that most people who talk about this have a very self-serving motivation that they can’t entirely escape. That bias does reflect itself.


Tim: That’s true. When I joined AWS, which was around 2014, the PR line was a very hard line. “Well, multi-cloud, that’s not something you should invest in.” And I’ve noticed that the conversation online has become much softer. And I think one reason for that is that going all-in on a single cloud is at least possible when you’re a startup, but if you’re a big company, you know, an insurance company, a tire manufacturer, that kind of thing, you’re going to be multi-cloud, for the same reason that they already have COBOL on the mainframe and Java on the old Sun boxes, and Mongo running somewhere else, and five different programming languages.


And that’s just the way big companies are; it’s a consequence of M&A, it’s a consequence of research projects that succeeded, one kind or another. I mean, lots of big companies have been trying to get rid of COBOL for decades, literally, [laugh] and not succeeding in doing that. So…


Corey: It’s ‘legacy’ which is, of course, the condescending engineering term for, “It makes money.”


Tim: And works. And so I don’t think it’s realistic to, as a matter of principle, not be multi-cloud.


Corey: Let’s define our terms a little more closely because very often, people like to pull strange gotchas out of the air. Because when I talk about this, I’m talking about, like, when I speak about it off the cuff, I’m thinking in terms of where do I run my containers? Where do I run my virtual machines? Where does my database live? But you can also move in a bunch of different directions. Where do my Git repositories live? What Office suite am I using? What am I using for my CRM? Et cetera, et cetera. Where do you draw the boundary lines? Because it’s very easy to talk past each other if we’re not careful here.


Tim: Right. And, you know, let’s grant that if you’re a mainstream enterprise, you’re running your Office automation on Microsoft, and they’re twisting your arm to use the cloud version, so you probably are. And if you have any sense at all, you’re not running your own Exchange Server, so let’s assume that you’re using Microsoft Azure for that. And you’re running Salesforce, and that means you’re on Salesforce’s cloud. And a lot of other Software-as-a-Service offerings might be on AWS or Azure or GCP; they don’t even tell you.


So, I think probably the crucial issue that we should focus our conversation on is my own apps, my own software that is my core competence that I actually use to run the core of my business. And typically, that’s the only place where a company would and should invest serious engineering resources to build software. And that’s where the question comes, where should that software that I’m going to build run? And should it run on just one cloud, or—


Corey: I found that when I gave a conference talk on this, in the before times, I had to have an ever-lengthier section about, “I’m speaking in the general sense; there are specific cases where it does make sense for you to go in a multi-cloud direction.” And when I’m talking about multi-cloud, I’m not necessarily talking about Workload A lives on Azure and Workload B lives on AWS, through mergers, or weird corporate approaches, or shadow IT that, surprise, that’s not revenue-bearing. Well, I guess we have to live with it. There are a lot of different divisions doing different things and you’re going to see that a fair bit. And I’m not convinced that’s a terrible idea as such. I’m talking about the single workload that we’re going to spread across two or more clouds, intentionally.


Tim: That’s probably not a good idea. I just can’t see that being a good idea, simply because you get into a problem of just terminology and semantics. You know, the different providers mean different things by the word ‘region’ and the word ‘instance,’ and things like that. And then there’s the people problem. I mean, I don’t think I personally know anybody who would claim to be able to build and deploy an application on AWS and also on GCP. I’m sure some people exist, but I don’t know any of them.


Corey: Well, Forrest Brazeal was deep in the AWS weeds and now he’s the head of content at Google Cloud. I will credit him that he probably has learned to smack an API around over there.


Tim: But you know, you’re going to have a hard time hiring a person like that.


Corey: Yeah. You can count these people almost as individuals.


Tim: And that’s a big problem. And you know, in a lot of cases, it’s clearly the case that our profession is talent-starved (I mean, the whole world is talent-starved at the moment, but our profession in particular) and a lot of the decisions about what you can build and what you can do are highly contingent on who you can hire. And if you can’t hire a multi-cloud expert, well, you should not deploy, [laugh] you know, a multi-cloud application.


Now, having said that, I just want to dot this i here and say that it can be made to kind of work. I’ve got this one company I advise—I wrote about it in the blog piece—that used to be on AWS and switched over to GCP. I don’t even know why; this happened before I joined them. And they have a lot of applications and then they have some integrations with third-party partners which they implemented with AWS Lambda functions. So, when they moved over to GCP, they didn’t stop doing that.


So, this mission-critical latency-sensitive application of theirs runs on GCP that calls out to AWS to make calls into their partners’ APIs and so on. And works fine. Solid as a rock, reliable, low latency. And so I talked to a person I know who knows over on the AWS side, and they said, “Oh, yeah sure, you know, we talked to those guys. Lots of people do that. We make sure, you know, the connections are low latency and solid.” So, technically speaking, it can be done. But for a variety of business reasons—maybe the most important one being expertise and who you can hire—it’s probably just not a good idea.


Corey: One of the areas where I think is an exception case is if you are a SaaS provider. Let’s pick a big easy example: Snowflake, where they are a data warehouse. They’ve got to run their data warehousing application in all of the major clouds because that is where their customers are. And it turns out that if you’re going to send a few petabytes into a data warehouse, you really don’t want to be paying cloud egress rates to do it because it turns out, you can just bootstrap a second company for that much money.


Tim: Well, Zoom would be another example, obviously.


Corey: Oh, yeah. Anything that’s heavy on data transfer is going to be a strange one. And there’s being close to customers; gaming companies are another good example of this, where a lot of the game servers themselves will be spread across a bunch of different providers, just purely based on latency metrics around what is close to certain customer clusters.


Tim: I can’t disagree with that. You know, I wonder how large a segment that is; I think you’re talking about core technology companies. Now, of the potential customers of the cloud providers, how many of them are core technology companies, like the kind we’re talking about, who have such a need, and how many are people who just want to run their manufacturing and product design and stuff? And for those, buying into a particular cloud is probably a perfectly sensible choice.


Corey: I’ve also seen regulatory stories about this. I haven’t been able to track them down specifically, but there is a pervasive belief that one interpretation of UK banking regulations stipulates that you have to be able to get back up and running within 30 days on a different cloud provider entirely. And also, they have the regulatory requirement that I believe the data remain in-country. So, that’s a little odd. And honestly, when it comes to best practices and how you should architect things, I’m going to take a distinct backseat to legal requirements imposed upon you by your regulator. But let’s be clear here, I’m not advising people to go and tell their auditors that they’re wrong on these things.


Tim: I had not heard that story, but you know, it sounds plausible. So, I wonder if that is actually in effect, which is to say, could a huge British banking company in fact do that? Could they, in fact, decamp from Azure and move over to GCP or AWS in 30 days? Boy.


Corey: That is what one bank I spoke to over there was insistent on. A second bank I spoke to in that same jurisdiction had never heard of such a thing, so I feel like a lot of this is subject to auditor interpretation. Again, I am not an expert in this space. I do not pretend to be—I know I’m that rarest of all breeds: A white guy with a microphone in tech who admits he doesn’t know something. But here we are.


Tim: Yeah, I mean, I imagine it could be plausible if you didn’t use any higher-level services, and you just, you know, rented instances and were careful about which version of Linux you ran and were just running a bunch of Java code, which actually, you know, describes the workload of a lot of financial institutions. So, it should be a matter of getting… all the right instances configured and the JVM configured and launched. I mean, there are no… architecturally terrifying barriers to doing that. Of course, to do that, it would mean you would have to avoid using any of the higher-level services that are particular to any cloud provider and basically just treat them as people you rent boxes from, which is probably not a good choice for other business reasons.


Corey: Which can also include things as seemingly low-level as load balancers, just based upon different provisioning modes, failure modes, and the rest. You’re probably going to have a more consistent experience running HAProxy or nginx yourself to do it. But Tim, I have it on good authority that this is the old way of thinking, and that Kubernetes solves all of it. And through the power of containers and powers combining and whatnot, that frees us from being beholden to any given provider and our workloads are now all free as birds.


Tim: Well, I will go as far as saying that if you are in the position of trying to be portable, probably using containers is a smart thing to do because that’s a more tractable level of abstraction that does give you some insulation from, you know, which version of Linux you’re running and things like that. The proposition that configuring and running Kubernetes is easier than configuring and running [laugh] a JVM on Linux [laugh] is unsupported by any evidence I’ve seen. So, I’m dubious of the proposition that operating at the Kubernetes level, at the [unintelligible 00:14:42] level… you know, there’s good reasons why some people want to do that, but I’m dubious of the proposition that that really makes you more portable in an essential way.


Corey: Well, you’re also not the target market for Kubernetes. You have worked at multiple cloud providers, and I feel like the real advantage of Kubernetes is for people who happen to want to pretend that they do, so they can act out a sort of cosplay of being their own cloud provider by running all the intricacies of Kubernetes. I’m halfway kidding, but there is an uncomfortable element of truth to that in some of the conversations I’ve had with some of its more, shall we say, fanatical adherents.


Tim: Well, I think you and I are neither of us huge fans of Kubernetes, but my reasons are maybe a little different. Kubernetes does some really useful things. It really, really does. It allows you to take n VMs, and pack m different applications onto them in a way that takes reasonably good advantage of the processing power they have. And it allows you to have different things running in one place with different IP addresses.


It sounds straightforward, but that turns out to be really helpful in a lot of ways. So, I’m actually kind of sympathetic with what Kubernetes is trying to be. My big gripe with it is that I think that good technology should make easy things easy and difficult things possible, and I think Kubernetes fails the first test there. I think the complexity that it involves is out of balance with the benefits you get. There’s a lot of really, really smart people who disagree with me, so this is not a hill I’m going to die on.


Corey: This is very much one of those areas where reasonable people can disagree. I find the complexity to be overwhelming; it has to collapse. At this point, finding someone who can competently run Kubernetes in production is a bit hard to do and they tend to be extremely expensive. You aren’t going to find a team of those people at every company that wants to do things like this, and they’re certainly not going to be able to find it in their budget in many cases. So, it’s a challenging thing to do.


Tim: Well, that’s true. And another thing is that once you step onto the Kubernetes slope, you start looking at Istio and Envoy and [fabric 00:16:48] technology. And we’re talking about extreme complexity squared at that point. But you know, here’s the thing: back in 2018, I think it was, in his keynote, Werner said that the big goal is that all the code you ever write should be application logic that delivers business value, which, you know, rep…


Corey: Didn’t CGI say the same thing? Didn’t, like, isn’t there, like, a long history dating back longer than I believe either of us has been alive of, “With this, all you’re going to write is business logic”? That was the Java promise. That was the Google App Engine promise. Again, and again, we’ve had that carrot dangled in front of us, and it feels like the reality with Lambda is, the only code you will write is not necessarily business logic, it’s getting the thing to speak to the other service you’re trying to get it to talk to because a lot of these integrations are super finicky. At least back when I started learning how this stuff worked, they were.


Tim: People understand where the pain points are and are indeed working on them. But I think we can agree that if you believe in that as a goal (which I still do; I mean, we may not have got there, but it’s still a worthwhile goal to work on), we can agree that wrangling Istio configurations is not such a thing; it’s not [laugh] directly value-adding business logic. To the extent that you can do that, I think serverless provides a plausible way forward. Now, you can be all cynical about, “Well, I still have trouble making my Lambda talk to my other thing.” But you know, I’ve done that, and I’ve also deployed JVM on bare metal kind of thing.


You know what? I’d rather do things at the Lambda level. I really rather would. Because capacity forecasting is a horribly difficult thing, we’re all terrible at it, and the penalties for being wrong are really bad. If you under-specify your capacity, your customers have a lousy experience, and if you over-specify it, and you have an architecture that makes you configure for peak load, you’re going to spend bucket-loads of money that you don’t need to.


Corey: “But you’re then putting your availability in the cloud providers’ hands.” “Yeah, you already were. Now, we’re just being explicit about acknowledging that.”


Tim: Yeah. Yeah, absolutely. And that’s highly relevant to the current discussion because if you use the higher-level serverless functions, if you decide, okay, I’m going to go with Lambda and Dynamo and EventBridge and that kind of thing, well, that’s not portable at all. I mean, the APIs are totally idiosyncratic for AWS and GCP’s equivalent, and Azure’s… what do they call it? Permanent functions or something-or-other functions. So yeah, that’s part of the trade-off you have to think about. If you’re going to do that, you’re definitely not going to be multi-cloud in that application.


Corey: And in many cases, one of the stated goals for going multi-cloud is that you can avoid the downtime of a single provider. People love to point at the big AWS outages and say, “See? They were down for half a day.” And there is a societal question of what happens when everyone is down for half a day at the same time, but in most cases, from what I’m seeing, you’re, instead of getting rid of a single point of failure, introducing a second one. If either one of them is down, your application’s down, so you’ve doubled your outage surface area.


On the rare occasions where you’re able to map your dependencies appropriately, great. Are your third-party critical providers all doing the same? If you’re an e-commerce site and Stripe processes your payments, well, they’re public about being all-in on AWS. So, if you can’t process payments, does it really matter that your website stays up? It becomes an interesting question. And those are the ones that you know about, let alone the third, fourth-order dependencies that are almost impossible to map unless everyone is as diligent as you are. It’s a heavy, heavy lift.


Tim: I’m going to push back a little bit. Now, for example, this company I’m advising that’s running on GCP and calling out to Lambda is in that position if either GCP or Lambda goes off the air. On the other hand, if you’ve got somebody like Zoom, they’re probably running parallel full stacks on the different cloud providers. And if you’re doing that, then you can at least plausibly claim that you’re in a good place because if Dynamo has an outage (and everything relies on Dynamo), then you shift your load over to GCP or Oracle [laugh] and you’re still on the air.
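
[Editor's sketch, not part of the conversation: the two cases described here, one workload that needs two providers at once versus parallel full stacks on separate providers, can be illustrated with simple availability arithmetic. This assumes provider failures are independent, which real outages often are not, and the 99.9% figure and the function names are hypothetical, chosen only for illustration.]

```python
def availability_all_required(*availabilities):
    """Availability of a workload that is up only when EVERY dependency is up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def availability_any_sufficient(*availabilities):
    """Availability when ANY one of several full, independent stacks can serve."""
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)  # probability that this stack is down
    return 1.0 - all_down


# Hypothetical per-provider availability; not a published SLA.
single_provider = 0.999

# One application spanning two clouds: both must be up.
spanning_two_clouds = availability_all_required(single_provider, single_provider)

# Parallel full stacks on two clouds: either one is enough.
parallel_full_stacks = availability_any_sufficient(single_provider, single_provider)

print(f"single provider:      {single_provider:.6f}")
print(f"spanning two clouds:  {spanning_two_clouds:.6f}")
print(f"parallel full stacks: {parallel_full_stacks:.6f}")
```

Under these assumptions the spanning design comes out slightly worse than a single provider, while the parallel design comes out better, but only if failover actually works when it is needed and shared dependencies such as authentication or payments are duplicated too.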


Corey: Yeah, but what is up as well? Because Zoom loves to sign me out on my desktop whenever I log into it on my laptop, and vice versa, and I wonder if that authentication and login system is also replicated full-stack to everywhere it goes, and what the fencing on that looks like, and how the communication between all those things works? I wouldn’t doubt that it’s possible that they’ve solved for this, but I also wonder how thoroughly they’ve really tested all of that, too. Not because I question them any; just because this stuff is super intricate as you start tracing it down into the nitty-gritty levels of the madness that consumes all these abstractions.


Tim: Well, right, that’s a conventional wisdom that is really wise and true, which is that if you have software that is alleged to do something like allow you to get going on another cloud, unless you’ve tested it within the last three weeks, it’s not going to work when you need it.


Corey: Oh, it’s like a DR exercise: The next commit you make breaks it. Once you have the thing working again, it sits around as a binder, and it’s a best guess. And let’s be serious, a lot of these DR exercises presume that you’re able to, for example, change DNS records on the fly, or be able to get a virtual machine provisioned in less than 45 minutes—because when there’s an actual outage, surprise, everyone’s trying to do the same things—there’s a lot of stuff in there that gets really wonky at weird levels.


Tim: A related similar exercise, which is people who want to be on AWS but want to be multi-region. It’s actually, you know, a fairly similar kind of problem. If I need to be able to fail out of us-east-1 (well, God help you, because if you need to, everybody else needs to as well), but you know, would that work?


Corey: Before you go multi-cloud, go multi-region first. Tell me how easy it is because then you have full-feature parity, presumably, between everything; it should just be a walk in the park. Send me a postcard once you get that set up and I’ll eat a bunch of words. And it turns out, basically, no one does.


Tim: Mm-hm.


Corey: Another area of lock-in around a lot of this stuff, and I think one that makes it very hard to go multi-cloud, is the security model and how that interfaces with various aspects. In many cases, I’m seeing people doing full-on network overlays. They don’t have to worry about the different security group models and VPCs and all the rest. They can just treat everything as a node sitting on the internet, and the only thing it talks to is an overlay network. Which is terrible, but that seems to be one of the only ways people are able to build things that span multiple providers with any degree of success.


Tim: Well, that is painful because, much as we all like to scoff and so on, in the degree of complexity you get into there, it is the case that your typical public cloud provider can do security better than you can. They just can. It’s a fact of life. And if you’re using a public cloud provider and not taking advantage of their security offerings, infrastructure, that’s probably dumb. But if you really want to be multi-cloud, you kind of have to, as you said.


In particular, this gets back to the problem of expertise because it’s hard enough to hire somebody who really understands IAM deeply and how to get that working properly; try and find somebody who can understand that level of thing on two different cloud providers at once. Oh, gosh.


Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.


Corey: Another point you made in your blog post was the idea of lock-in, of people being worried that going all-in on a provider was setting them up to be, I think Oracle is the term that was tossed around where once you’re dependent on a provider, what’s to stop them from cranking the pricing knobs until you squeal?


Tim: Nothing. And I think that is a perfectly sane thing to worry about. Now, in the short term, based on my personal experience working with, you know, AWS leadership, I think that it’s probably not a big short-term risk. AWS is clearly aware that most of the growth is still in front of them. You know, the amount of all of it that’s on the cloud is still pretty small and so the thing to worry about right now is growth.


And they are really, really genuinely, sincerely focused on customer success and will bend over backwards to deal with the customers’ problems as they are. And I’ve seen places where people have negotiated a huge multi-year enterprise agreement based on Reserved Instances or something like that, and then realized, oh, wait, we need to switch our whole technology stack, but you’ve got us by the RIs, and AWS will say, “No, no, it’s okay. We’ll tear that up and rewrite it and get you where you need to go.” So, in the short term, between now and 2025, would I worry about my cloud provider doing that? Probably not so much.


But let’s go a little further out. Let’s say it’s, you know, 2030 or something like that, and at that point, you know, Andy Jassy has decided to be a full-time sports mogul, and Satya Nadella has gone off to be a recreational sailboat owner or something like that, and private equity operators come in and take very significant stakes in the public cloud providers, and get a lot of their guys on the board, and you have a very different dynamic. And you have something that starts to feel like Oracle where their priority isn’t, you know, optimizing for growth and customer success; their priority is optimizing for a quarterly bottom line, and…


Corey: Revenue extraction becomes the goal.


Tim: That’s absolutely right. And this is not a hypothetical scenario; it’s happened. Most large companies do not control the amount of money they spend per year to have desktop software that works. They pay whatever Microsoft’s going to say they pay because they don’t have a choice. And a lot of companies are in the same situation with their database.


They don’t get to budget their database budget. Oracle comes in and says, “Here’s what you’re going to pay,” and that’s what you pay. You really don’t want to be in that situation with your cloud, and that’s why I think it’s perfectly reasonable for somebody who is doing cloud transition at a major financial or manufacturing or service provider company to have an eye to this. You know, let’s not completely ignore the lock-in issue.


Corey: There is a significant scale with enterprise deals and contracts. There is almost always a contractual provision that says if you’re going to raise a price with any cloud provider, there’s a fixed period of time of notice you must give before it happens. I feel like the first mover there winds up getting soaked because everyone is going to panic and migrate in other directions. I mean, Google tried it with Google Maps for their API, and not quite Google Cloud, but also scared the bejesus out of a whole bunch of people who were, “Wait. Is this a harbinger of things to come?”


Tim: Well, not in the short term, I don’t think. And I think, you know, Google Maps [is absurdly 00:26:36] underpriced. That’s a hellishly expensive service. And it’s supposed to pay for itself by, you know, advertising on maps. I don’t know about that.



I would see that as the exception rather than the rule. I think that it’s reasonable to expect cloud prices, nominally at least, to go on decreasing for at least the short term, maybe even the medium term. But that’s—can’t go on forever.


Corey: It also feels to me, having looked at an awful lot of AWS environments, that if there were to be some sort of regulatory action or some really weird outage for a year that meant that AWS could not onboard a single new customer, their revenue year-over-year would continue to increase purely by organic growth because there is no forcing function that turns the thing off when you’re done using it. In fact, they can migrate things around to hardware that works, they can continue billing you for the things sitting there idle. And there is no governance path on that. So, on some level, winding up doing a price increase is going to cause a massive company focus on fixing a lot of that. It feels on some level like it is drawing attention to a thing that they don’t really want to draw attention to from a purely revenue extraction story.


When CentOS walked back their ten-year support line to two years, suddenly—and with an idea that it would drive [unintelligible 00:27:56] adoption. Well, suddenly, a lot of people looked at their environment, saw they had old [unintelligible 00:28:00] they weren’t using. And massively short-sighted, massively irritated a whole bunch of people who needed that in the short term, but by the renewal, we’re going to be on to Ubuntu or something else. It feels like it’s going to backfire massively, and I’d like to imagine the strategists of whoever takes the reins of these companies are going to be smarter than that. But here we are.


Tim: Here we are. And you know, it’s interesting you should mention regulatory action. At the moment, there are only three credible public cloud providers. It’s not obvious that Google’s really in it for the long haul, as last time I checked, they were claiming to maybe be breaking even on it. That’s not a good number, you know? You’d like there to be more than that.


And if it goes on like that, eventually, some politician is going to say, “Oh, maybe they should be regulated like public utilities,” because they kind of are, right? And I would think that anybody who did get into Oracle-izing would be—you know, accelerate that happening. Having said that, we do live in the atmosphere of 21st-century capitalism, and growth is the God that must be worshiped at all costs. Who knows? It’s a cloudy future. Hard to see.


Corey: It really is. I also want to be clear, on some level, that with Google’s current position, if they weren’t taking a small loss at least, on these things, I would worry. Like, wait, you’re trying to catch AWS and you don’t have anything better to invest that money into than just, “Well, time to start taking profits from it”? So, I can see both sides of that one.


Tim: Right. And as I keep saying, I’ve already said once during this slot, you know, the total cloud spend in the world is probably on the order of one or two hundred billion per annum, and global IT is in multiple trillions. So, [laugh] there’s a lot more space for growth. Years and years’ worth of it.


Corey: Yeah. The challenge, too, is that people are worried about this long-term strategic point of view. So, one thing you talked about in your blog post is the idea of using hosted open-source solutions. Like, instead of using Kinesis, you’d wind up using Kafka or instead of using DynamoDB you use their managed Cassandra service—or as I think of it Amazon Basics Cassandra—and effectively going down the path of letting them manage this thing, but you then have a theoretical exodus path. Where do you land on that?


Tim: I think that speaks to a lot of people’s concerns, and I’ve had conversations with really smart people about that who like that idea. Now, to be realistic, it doesn’t make migration easy because you’ve still got all the CI and CD and monitoring and management and scaling and alarms and alerts and paging and et cetera, et cetera, et cetera, wrapped around it. So, it’s not as though you could just pick up your managed Kafka off AWS and drop a huge installation onto GCP easily. But at least, you know, your data plane APIs are the same, so a lot of your code would probably still run okay. So, it’s a plausible path forward. And when people say, “I want to do that,” well, it does mean that you can’t go all serverless. But it’s not a totally insane path forward.


Corey: So, one last point in your blog post that I think a lot of people think about only after they get bitten by it is the idea of data gravity. I alluded earlier in our conversation to data egress charges, but my experience has been that where your data lives is effectively where the rest of your cloud usage tends to aggregate. How do you see it?


Tim: Well, it’s a real issue, but I think it might perhaps be a little overblown. People throw the term petabytes around, and people don’t realize how big a petabyte is. A petabyte is just an insanely huge amount of data, and the notion of transmitting one over the internet is terrifying. And there are lots of enterprises that have multiple petabytes around, and so they think, “Well, you know, it would take me 26 years to transmit that, so I can’t.”


And they might be wrong. The internet’s getting faster all the time. Did you notice? I’ve been able to move some—for purely personal projects—insane amounts of data, and it gets there a lot faster than you’d think. Secondly, in the case of AWS Snowmobile, we have an existence proof that you can do exabyte-ish scale data transfers in the time it takes to drive a truck across the country.
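[Editor’s note: To put the transfer-time worry in rough numbers, here is a minimal back-of-the-envelope sketch. It is not from the conversation: the link speeds are illustrative assumptions, a petabyte is taken as 10^15 bytes, and the function name `transfer_days` is hypothetical. Real transfers would likely be slower because of protocol overhead and shared links, which the `efficiency` parameter is meant to approximate.]

```python
# Rough estimate of how long a bulk transfer takes over a sustained link.
# All link speeds below are assumed values chosen for illustration only.

PETABYTE = 10**15  # bytes, decimal definition


def transfer_days(size_bytes: float, link_bits_per_second: float,
                  efficiency: float = 1.0) -> float:
    """Return the days needed to move size_bytes over a link.

    efficiency is the assumed fraction of the nominal link rate that is
    actually achieved (1.0 means the full rate, with no overhead).
    """
    seconds = size_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    for gbps in (1, 10, 100):
        days = transfer_days(PETABYTE, gbps * 10**9)
        print(f"1 PB at {gbps:>3} Gbit/s: {days:6.1f} days")
```

[Under these assumptions, one petabyte takes roughly 93 days at a sustained 1 Gbit/s and under a day at 100 Gbit/s, so the answer depends heavily on the bandwidth an organization can actually sustain.]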


Corey: Inbound only. Snowmobiles are not, at least according to public examples, valid for exodus.


Tim: But you know, this is the kind of place where regulatory action might come into play if what the people were doing was seen to be abusive. I mean, there’s an existence proof you can do this thing. But here’s another point. So, suppose you have, like, 15 petabytes—that’s an insane amount of data—deployed in your corporate application. So, are you actually using that to run the application, or is a huge proportion of that stuff just logs and data gathered of various kinds that’s being used in analytics applications and AI models and so on?


Do you actually need all that data to actually run your app? And could you in fact, just pick up the stuff you need for your app, move it to a different cloud provider from there and leave your analytics on the first one? Not a totally insane idea.


Corey: It’s not a terrible idea at all. It comes down to the idea as well of when you’re trying to run a query against a bunch of that data, do you need all the data to transit or just the results of that query, as well? It’s a question of, can you move the compute closer to the data as opposed to the data to where the compute lives?


Tim: Well, you know, a lot of those people who have those huge data pools have it sitting on S3, and a lot of it migrated off into Glacier, so it’s not as if you could get at it in milliseconds anyhow. I just ask myself, “How much data can anybody actually use in a day? In the course of satisfying some transaction requests from a customer?” And I think it’s not petabytes. It just isn’t.


Now, there are—okay, there are exceptions. There’s the intelligence community, there’s the oil drilling community, there are some communities who genuinely will use insanely huge seas of data on a routine basis, but you know, I think that’s kind of a corner case, so before you shake your head and say, “Ah, they’ll never move because of the data gravity,” you know… you need to prove that to me and I might be a little bit skeptical.


Corey: And I think that is probably a very fair request. Just tell me what it is you’re going to be doing here to validate the idea that is in your head because the most interesting lies I’ve found customers tell isn’t intentionally to me or anyone else; it’s to themselves. The narrative of what they think they’re doing from the early days takes root, and never mind the fact that, yeah, it turns out that now that you’ve scaled out, maybe development isn’t 80% of your cloud bill anymore. You learn things and your understanding of what you’re doing has to evolve with the evolution of the applications.


Tim: Yep. It’s a fun time to be around. I mean, it’s so great; right at the moment lock-in just isn’t that big an issue. And let’s be clear—I’m sure you’ll agree with me on this, Corey—is if you’re a startup and you’re trying to grow and scale and prove you’ve got a viable business, and show that you have exponential growth and so on, don’t think about lock-in; just don’t go near it. Pick a cloud provider, pick whichever cloud provider your CTO already knows how to use, and just go all-in on them, and use all their most advanced features and be serverless if you can. It’s the only sane way forward. You’re short of time, you’re short of money, you need growth.


Corey: “Well, what if you need to move strategically in five years?” You should be so lucky. Great. Deal with it then. Or, “Well, what if we want to sell to retail as our primary market and they hate AWS?”


Well, go all-in on a provider; probably not that one. Pick a different provider and go all in. I do not care which cloud any given company picks. Go with what’s right for you, but then go all in because until you have a compelling reason to do otherwise, you’re going to spend more time solving global problems locally.


Tim: That’s right. And we’ve never actually said this probably because it’s something that both you and I know at the core of our being, but it probably needs to be said that being multi-cloud is expensive, right? Because the nouns and verbs that describe what clouds do are different in Google-land and AWS-land; they’re just different. And it’s hard to think about those things. And you lose the capability of using the advanced serverless stuff. There are a whole bunch of costs to being multi-cloud.


Now, maybe if you’re existentially afraid of lock-in, you don’t care. But for I think most normal people, ugh, it’s expensive.


Corey: Pay now or pay later, you will pay. Wouldn’t you ideally like to see that dollar go as far as possible? I’m right there with you because it’s not just the actual infrastructure cost that’s expensive, it costs something far more dear and expensive, and that is the cognitive expense of having to think about both of these things, not just how each cloud provider works, but how each one breaks. You’ve done this stuff longer than I have; I don’t think that either of us trusts a system that we don’t understand the failure cases for and how it’s going to degrade. It’s, “Oh, right. You built something new and awesome. Awesome. How does it fall over? What direction is it going to hit, so what side should I not stand on?” It’s based on an understanding of what you’re about to blow holes in.


Tim: That’s right. And you know, I think particularly if you’re using AWS heavily, you know that there are some things that you might as well bet your business on because, you know, if they’re down, so is the rest of the world, and who cares? And, other things, eh, maybe a little chancier. So, understanding failure modes, understanding your stuff, you know, the cost of sharp edges, understanding manageability issues. It’s not obvious.


Corey: It’s really not. Tim, I want to thank you for taking the time to go through this, frankly, excellent post with me. If people want to learn more about how you see things, and I guess how you view the world, where’s the best place to find you?


Tim: I’m on Twitter, just @timbray T-I-M-B-R-A-Y. And my blog is at tbray.org, and that’s where that piece you were just talking about is, and that’s kind of my online presence.


Corey: And we will, of course, put links to it in the [show notes 00:37:42]. Thanks so much for being so generous with your time. It’s always a pleasure to talk to you.


Tim: Well, it’s always fun to talk to somebody who has shared passions, and we clearly do.


Corey: Indeed. Tim Bray, principal at Textuality Services. I’m Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry comment that you then need to take to all of the other podcast platforms out there purely for redundancy, so you don’t get locked into one of them.


Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.


Announcer: This has been a HumblePod production. Stay humble.


Corey: Couchbase Capella Database-as-a-Service is flexible, full-featured and fully managed with built in access via key-value, SQL, and full-text search. Flexible JSON documents aligned to your applications and workloads. Build faster with blazing fast in-memory performance and automated replication and scaling while reducing cost. Capella has the best price performance of any fully managed document database. Visit couchbase.com/screaminginthecloud to try Capella today for free and be up and running in three minutes with no credit card required. Couchbase Capella: make your data sing.

Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. My guest today has been on a year or two ago, but today, we’re going in a bit of a different direction. Tim Bray is a principal at Textuality Services.

Once upon a time, he was a Distinguished Engineer slash VP at AWS, but let’s be clear, he isn’t solely focused on one company; he also used to work at Google. Also, there is scuttlebutt that he might have had something to do, at one point, with the creation of God’s true language, XML. Tim, thank you for coming back on the show and suffering my slings and arrows.

Tim: Oh, you’re just fine. Glad to be here.

Corey: [laugh]. So, the impetus for having this conversation is, you had a blog post somewhat recently—by which I mean, January of 2022—where you talked about lock-in and multi-cloud, two subjects near and dear to my heart, mostly because I have what I thought was a fairly countercultural opinion. You seem to have a very closely aligned perspective on this. But let’s not get too far ahead of ourselves. Where did this blog post come from?

Tim: Well, I advise a couple of companies and one of them happens to be using GCP and the other happens to be using AWS and I get involved in a lot of industry conversations, and I noticed that multi-cloud is a buzzword. If you go and type multi-cloud into Google, you get, like, a page of people saying, “We will solve your multi-cloud problems. Come to us and you will be multi-cloud.” And I was not sure what to think, so I started writing to find out what I would think. And I think it’s not complicated anymore. I think that multi-cloud is a reality in most companies. I think that many mainstream, non-startup companies are really worried about cloud lock-in, and that’s not entirely unreasonable. So, it’s a reasonable thing to think about and it’s a reasonable thing to try and find the right balance between avoiding lock-in and not slowing yourself down. And the issues were interesting. What was surprising is that I published that blog piece saying what I thought were some kind of controversial things, and I got no pushback. Which was, you know, why I started talking to you and saying, “Corey, you know, does nobody disagree with this? Do you disagree with this? Maybe we should have a talk and see if this is just the new conventional wisdom.”

Corey: There’s nothing worse than almost trying to pick a fight, but no one actually winds up taking you up on the opportunity. That always feels a little off. Let’s break it down into two issues because I would argue that they are intertwined, but not necessarily the same thing. Let’s start with multi-cloud because it turns out that there’s just enough nuance to—at least where I sit on this position—that whenever I tweet about it, I wind up getting wildly misinterpreted. Do you find that as well?

Tim: Not so much. It’s not a subject I have really had too much to say about, but it does mean lots of different things. And so it’s not totally surprising that that happens. I mean, some people think when you say multi-cloud, you mean, “Well, I’m going to take my strategic application, and I’m going to run it in parallel on AWS and GCP because that way, I’ll be more resilient and other good things will happen.” And then there’s another thing, which is that, “Well, you know, as my company grows, I’m naturally going to be using lots of different technologies and that might include more than one cloud.” So, there’s a whole spectrum of things that multi-cloud could mean. So, I guess when we talk about it, we probably owe it to our audiences to be clear what we’re talking about.

Corey: Let’s be clear, from my perspective, the common definition of multi-cloud is whatever the person talking is trying to sell you at that point in time is, of course, what multi-cloud is. If it’s a third-party dashboard, for example, “Oh, yeah, you want to be able to look at all of your cloud usage on a single pane of glass.” If it’s a certain—well, I guess, certain not a given cloud provider, well, they understand if you go all-in on a cloud provider, it’s probably not going to be them so they’re, of course, going to talk about multi-cloud. And if it’s AWS, where they are the 8000-pound gorilla in the space, “Oh, yeah, multi-cloud’s terrible. Put everything on AWS. The end.” It seems that most people who talk about this have a very self-serving motivation that they can’t entirely escape. That bias does reflect itself.

Tim: That’s true. When I joined AWS, which was around 2014, the PR line was a very hard line. “Well, multi-cloud, that’s not something you should invest in.” And I’ve noticed that the conversation online has become much softer. And I think one reason for that is that going all-in on a single cloud is at least possible when you’re a startup, but if you’re a big company, you know, an insurance company, a tire manufacturer, that kind of thing, you’re going to be multi-cloud, for the same reason that they already have COBOL on the mainframe and Java on the old Sun boxes, and Mongo running somewhere else, and five different programming languages.

And that’s just the way big companies are, it’s a consequence of M&A, it’s a consequence of research projects that succeeded, one kind or another. I mean, lots of big companies have been trying to get rid of COBOL for decades, literally, [laugh] and not succeeding in doing that. So—

Corey: It’s ‘legacy’ which is, of course, the condescending engineering term for, “It makes money.”

Tim: And works. And so I don’t think it’s realistic to, as a matter of principle, not be multi-cloud.

Corey: Let’s define our terms a little more closely because very often, people like to pull strange gotchas out of the air. Because when I talk about this, I’m talking about—like, when I speak about it off the cuff, I’m thinking in terms of where do I run my containers? Where do I run my virtual machines? Where does my database live? But you can also move in a bunch of different directions. Where do my Git repositories live? What Office suite am I using? What am I using for my CRM? Et cetera, et cetera? Where do you draw the boundary lines because it’s very easy to talk past each other if we’re not careful here?

Tim: Right. And, you know, let’s grant that if you’re a mainstream enterprise, you’re running your Office automation on Microsoft, and they’re twisting your arm to use the cloud version, so you probably are. And if you have any sense at all, you’re not running your own Exchange Server, so let’s assume that you’re using Microsoft Azure for that. And you’re running Salesforce, and that means you’re on Salesforce’s cloud. And a lot of other Software-as-a-Service offerings might be on AWS or Azure or GCP; they don’t even tell you.

So, I think probably the crucial issue that we should focus our conversation on is my own apps, my own software that is my core competence that I actually use to run the core of my business. And typically, that’s the only place where a company would and should invest serious engineering resources to build software. And that’s where the question comes, where should that software that I’m going to build run? And should it run on just one cloud, or—

Corey: I found that when I gave a conference talk on this, in the before times, I had to have an ever-lengthier section about, “I’m speaking in the general sense; there are specific cases where it does make sense for you to go in a multi-cloud direction.” And when I’m talking about multi-cloud, I’m not necessarily talking about Workload A lives on Azure and Workload B lives on AWS, through mergers, or weird corporate approaches, or shadow IT that—surprise—that’s not revenue-bearing. Well, I guess we have to live with it. There are a lot of different divisions doing different things and you’re going to see that a fair bit. And I’m not convinced that’s a terrible idea as such. I’m talking about the single workload that we’re going to spread across two or more clouds, intentionally.

Tim: That’s probably not a good idea. I just can’t see that being a good idea, simply because you get into a problem of just terminology and semantics. You know, the different providers mean different things by the word ‘region’ and the word ‘instance,’ and things like that. And then there’s the people problem. I mean, I don’t think I personally know anybody who would claim to be able to build and deploy an application on AWS and also on GCP. I’m sure some people exist, but I don’t know any of them.

Corey: Well, Forrest Brazeal was deep in the AWS weeds and now he’s the head of content at Google Cloud. I will credit him that he probably has learned to smack an API around over there.

Tim: But you know, you’re going to have a hard time hiring a person like that.

Corey: Yeah. You can count these people almost as individuals.

Tim: And that’s a big problem. And you know, in a lot of cases, it’s clearly the case that our profession is talent-starved—I mean, the whole world is talent-starved at the moment, but our profession in particular—and a lot of the decisions about what you can build and what you can do are highly contingent on who you can hire. And if you can’t hire a multi-cloud expert, well, you should not deploy, [laugh] you know, a multi-cloud application.

Now, having said that, I just want to dot this i here and say that it can be made to kind of work. I’ve got this one company I advise—I wrote about it in the blog piece—that used to be on AWS and switched over to GCP. I don’t even know why; this happened before I joined them. And they have a lot of applications and then they have some integrations with third-party partners which they implemented with AWS Lambda functions. So, when they moved over to GCP, they didn’t stop doing that.

So, this mission-critical latency-sensitive application of theirs runs on GCP that calls out to AWS to make calls into their partners’ APIs and so on. And works fine. Solid as a rock, reliable, low latency. And so I talked to a person I know who knows over on the AWS side, and they said, “Oh, yeah sure, you know, we talked to those guys. Lots of people do that. We make sure, you know, the connections are low latency and solid.” So, technically speaking, it can be done. But for a variety of business reasons—maybe the most important one being expertise and who you can hire—it’s probably just not a good idea.

Corey: One area that I think is an exception case is if you are a SaaS provider. Let’s pick a big easy example: Snowflake, where they are a data warehouse. They’ve got to run their data warehousing application in all of the major clouds because that is where their customers are. And it turns out that if you’re going to send a few petabytes into a data warehouse, you really don’t want to be paying cloud egress rates to do it because it turns out, you can just bootstrap a second company for that much money.

Tim: Well, Zoom would be another example, obviously.

Corey: Oh, yeah. Anything that’s heavy on data transfer is going to be a strange one. And there’s being close to customers; gaming companies are another good example on this where a lot of the game servers themselves will be spread across a bunch of different providers, just purely based on latency metrics around what is close to certain customer clusters.

Tim: I can’t disagree with that. You know, I wonder how large a segment that is. I think you’re talking about core technology companies. Now, of the potential customers of the cloud providers, how many of them are core technology companies, like the kind we’re talking about, who have such a need, and how many are people who just want to run their manufacturing and product design and stuff? And for those, buying into a particular cloud is probably a perfectly sensible choice.

Corey: I’ve also seen regulatory stories about this. I haven’t been able to track them down specifically, but there is a pervasive belief that one interpretation of UK banking regulations stipulates that you have to be able to get back up and running within 30 days on a different cloud provider entirely. And also, they have the regulatory requirement that I believe the data remain in-country. So, that’s a little odd. And honestly, when it comes to best practices and how you should architect things, I’m going to take a distinct backseat to legal requirements imposed upon you by your regulator. But let’s be clear here, I’m not advising people to go and tell their auditors that they’re wrong on these things.

Tim: I had not heard that story, but you know, it sounds plausible. So, I wonder if that is actually in effect, which is to say, could a huge British banking company, in fact, do that? Could they, in fact, decamp from Azure and move over to GCP or AWS in 30 days? Boy.

Corey: That is what one bank I spoke to over there was insistent on. A second bank I spoke to in that same jurisdiction had never heard of such a thing, so I feel like a lot of this is subject to auditor interpretation. Again, I am not an expert in this space. I do not pretend to be—I know I’m that rarest of all breeds: A white guy with a microphone in tech who admits he doesn’t know something. But here we are.

Tim: Yeah, I mean, I imagine it could be plausible if you didn’t use any higher-level services, and you just, you know, rented instances and were careful about which version of Linux you ran and were just running a bunch of Java code, which actually, you know, describes the workload of a lot of financial institutions. So, it should be a matter of getting… all the right instances configured and the JVM configured and launched. I mean, there are no… architecturally terrifying barriers to doing that. Of course, to do that, it would mean you would have to avoid using any of the higher-level services that are particular to any cloud provider and basically just treat them as people you rent boxes from, which is probably not a good choice for other business reasons.

Corey: Which can also include things as seemingly low-level as load balancers, just based upon different provisioning modes, failure modes, and the rest. You’re probably going to have a more consistent experience running HAProxy or nginx yourself to do it. But Tim, I have it on good authority that this is the old way of thinking, and that Kubernetes solves all of it. And through the power of containers and powers combining and whatnot, that frees us from being beholden to any given provider and our workloads are now all free as birds.

Tim: Well, I will go as far as saying that if you are in the position of trying to be portable, probably using containers is a smart thing to do because that’s a more tractable level of abstraction that does give you some insulation from, you know, which version of Linux you’re running and things like that. The proposition that configuring and running Kubernetes is easier than configuring and running [laugh] JVM on Linux [laugh] is unsupported by any evidence I’ve seen. So, I’m dubious of the proposition that operating at the Kubernetes-level at the [unintelligible 00:14:42] level, you know, there’s good reasons why some people want to do that, but I’m dubious of the proposition that really makes you more portable in an essential way.

Corey: Well, you’re also not the target market for Kubernetes. You have worked at multiple cloud providers and I feel like the real advantage of Kubernetes is people who happen to want to protect that they do so they can act as a sort of a cosplay of being their own cloud provider by running all the intricacies of Kubernetes. I’m halfway kidding, but there is an uncomfortable element of truth to that to some of the conversations I’ve had with some of its more, shall we say, fanatical adherents.

Tim: Well, I think you and I are neither of us huge fans of Kubernetes, but my reasons are maybe a little different. Kubernetes does some really useful things. It really, really does. It allows you to take n VMs, and pack m different applications onto them in a way that takes reasonably good advantage of the processing power they have. And it allows you to have different things running in one place with different IP addresses.

It sounds straightforward, but that turns out to be really helpful in a lot of ways. So, I’m actually kind of sympathetic with what Kubernetes is trying to be. My big gripe with it is that I think that good technology should make easy things easy and difficult things possible, and I think Kubernetes fails the first test there. I think the complexity that it involves is out of balance with the benefits you get. There’s a lot of really, really smart people who disagree with me, so this is not a hill I’m going to die on.
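The "take n VMs and pack m applications onto them" idea can be illustrated with a toy bin-packing sketch. This is not how the Kubernetes scheduler works; it is a minimal first-fit-decreasing heuristic, and the app names, CPU figures, and the `pack_first_fit_decreasing` helper are all hypothetical.

```python
def pack_first_fit_decreasing(app_cpu, vm_capacity):
    """Place apps (name -> CPU demand) onto equal-sized VMs.

    Toy heuristic: take apps from largest to smallest and put each one on
    the first VM that still has room, opening a new VM when none does.
    """
    vms = []  # each VM is {"free": remaining CPUs, "apps": [app names]}
    by_size = sorted(app_cpu.items(), key=lambda kv: kv[1], reverse=True)
    for name, cpu in by_size:
        if cpu > vm_capacity:
            raise ValueError(f"{name} needs {cpu} CPUs, more than one VM offers")
        for vm in vms:
            if vm["free"] >= cpu:
                vm["free"] -= cpu
                vm["apps"].append(name)
                break
        else:
            # No existing VM had room, so start a new one.
            vms.append({"free": vm_capacity - cpu, "apps": [name]})
    return vms


# Hypothetical workload: four apps, VMs with 4 CPUs each.
apps = {"api": 3, "worker": 2, "cron": 1, "cache": 2}
placement = pack_first_fit_decreasing(apps, vm_capacity=4)
for index, vm in enumerate(placement):
    print(index, vm["apps"], "free:", vm["free"])
```

With these made-up numbers the four apps fit on two fully used VMs rather than four mostly idle ones, which is the utilization benefit being described.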

Corey: This is very much one of those areas where reasonable people can disagree. I find the complexity to be overwhelming; it has to collapse. At this point, finding someone who can competently run Kubernetes in production is a bit hard to do, and they tend to be extremely expensive. You aren’t going to find a team of those people at every company that wants to do things like this, and they’re certainly not going to be able to find it in their budget in many cases. So, it’s a challenging thing to do.

Tim: Well, that’s true. And another thing is that once you step onto the Kubernetes slope, you start looking at Istio and Envoy and [fabric 00:16:48] technology. And we’re talking about extreme complexity squared at that point. But you know, here’s the thing is, back in 2018 I think it was, in his keynote, Werner said that the big goal is that all the code you ever write should be application logic that delivers business value, which you know rep—

Corey: Didn’t CGI say the same thing? Didn’t—like, isn’t there, like, a long history, dating back longer than I believe either of us has been alive, of, “With this, all you’re going to write is business logic”? That was the Java promise. That was the Google App Engine promise. Again, and again, we’ve had that carrot dangled in front of us, and it feels like the reality with Lambda is, the only code you will write is not necessarily business logic, it’s getting the thing to speak to the other service you’re trying to get it to talk to because a lot of these integrations are super finicky. At least back when I started learning how this stuff worked, they were.

Tim: People understand where the pain points are and are indeed working on them. But I think we can agree that if you believe in that as a goal—which I still do; I mean, we may not have got there, but it’s still a worthwhile goal to work on. We can agree that wrangling Istio configurations is not such a thing; it’s not [laugh] directly value-adding business logic. To the extent that you can do that, I think serverless provides a plausible way forward. Now, you can be all cynical about, “Well, I still have trouble making my Lambda talk to my other thing.” But you know, I’ve done that, and I’ve also deployed JVM on bare metal kind of thing.

You know what? I’d rather do things at the Lambda level. I really rather would. Because capacity forecasting is a horribly difficult thing, we’re all terrible at it, and the penalties for being wrong are really bad. If you under-specify your capacity, your customers have a lousy experience, and if you over-specify it, and you have an architecture that makes you configure for peak load, you’re going to spend bucket-loads of money that you don’t need to.
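The capacity trade-off described here can be made concrete with a small sketch. Every number below is invented for illustration: the hourly demand, the unit prices, and the assumption that pay-per-use costs twice as much per unit as provisioned capacity.

```python
def fixed_capacity_cost(hourly_demand, capacity, unit_price):
    """Cost when you pay for `capacity` units every hour, used or not."""
    return len(hourly_demand) * capacity * unit_price


def unserved_demand(hourly_demand, capacity):
    """Total demand above capacity, i.e. load that got a lousy experience."""
    return sum(max(0, demand - capacity) for demand in hourly_demand)


def pay_per_use_cost(hourly_demand, unit_price):
    """Cost when you pay only for the units actually consumed."""
    return sum(hourly_demand) * unit_price


# Hypothetical load: three quiet hours and one spike.
demand = [10, 10, 10, 100]

peak_sized = fixed_capacity_cost(demand, capacity=100, unit_price=1)
under_sized = fixed_capacity_cost(demand, capacity=10, unit_price=1)
dropped = unserved_demand(demand, capacity=10)
on_demand = pay_per_use_cost(demand, unit_price=2)  # assumed 2x unit premium

print("sized for peak:", peak_sized)
print("sized for average hour:", under_sized, "with unserved load:", dropped)
print("pay per use:", on_demand)
```

In this made-up case, sizing for peak costs more than paying per use even at double the unit price, while sizing for the quiet hours is cheap but drops most of the spike. Real workloads and price ratios will differ.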

Corey: “But you’re then putting your availability in the cloud providers’ hands.” “Yeah, you already were. Now, we’re just being explicit about acknowledging that.”

Tim: Yeah. Yeah, absolutely. And that’s highly relevant to the current discussion because if you use the higher-level serverless functions, if you decide, “Okay, I’m going to go with Lambda and Dynamo and EventBridge and that kind of thing,” well, that’s not portable at all. I mean, the APIs are totally idiosyncratic for AWS and GCP’s equivalent, and Azure’s—what do they call it? Permanent functions or something-or-other functions. So yeah, that’s part of the trade-off you have to think about. If you’re going to do that, you’re definitely not going to be multi-cloud in that application.

Corey: And in many cases, one of the stated goals for going multi-cloud is that you can avoid the downtime of a single provider. People love to point at the big AWS outages and say, “See? They were down for half a day.” And there is a societal question of what happens when everyone is down for half a day at the same time, but in most cases, what I’m seeing is that instead of getting rid of a single point of failure, you’re introducing a second one. If either one of them is down, your application’s down, so you’ve doubled your outage surface area.

On the rare occasions where you’re able to map your dependencies appropriately, great. Are your third-party critical providers all doing the same? If you’re an e-commerce site and Stripe processes your payments, well, they’re public about being all-in on AWS. So, if you can’t process payments, does it really matter that your website stays up? It becomes an interesting question. And those are the ones that you know about, let alone the third, fourth-order dependencies that are almost impossible to map unless everyone is as diligent as you are. It’s a heavy, heavy lift.

Tim: I’m going to push back a little bit. Now, for example, this company I’m advising that’s running on GCP and calling out to Lambda is in that position; if either GCP or Lambda goes off the air, they’re down. On the other hand, if you’ve got somebody like Zoom, they’re probably running parallel full stacks on the different cloud providers. And if you’re doing that, then you can at least plausibly claim that you’re in a good place because if Dynamo has an outage—and everything relies on Dynamo—then you shift your load over to GCP or Oracle [laugh] and you’re still on the air.
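The difference between the two architectures in this exchange (one application that depends on two providers, versus parallel full stacks on each) can be sketched with basic availability arithmetic. The 99.9% figure is an assumption, and the formulas assume provider failures are independent and that failover actually works, neither of which is guaranteed in practice.

```python
def serial_availability(*parts):
    """Availability when every part must be up at once
    (e.g. an app on one provider calling out to another)."""
    result = 1.0
    for availability in parts:
        result *= availability
    return result


def parallel_availability(*stacks):
    """Availability when any one independent full stack being up is enough."""
    all_down = 1.0
    for availability in stacks:
        all_down *= 1.0 - availability
    return 1.0 - all_down


HOURS_PER_YEAR = 8760
provider = 0.999  # assumed availability of each provider, for illustration

dependent = serial_availability(provider, provider)
redundant = parallel_availability(provider, provider)

for label, value in [("single", provider), ("both required", dependent),
                     ("either suffices", redundant)]:
    downtime_hours = (1.0 - value) * HOURS_PER_YEAR
    print(f"{label}: {value:.6f} available, ~{downtime_hours:.2f} h/year down")
```

Under these assumptions, requiring both providers roughly doubles expected downtime relative to a single provider, while two genuinely independent full stacks cut it by orders of magnitude. The catch raised in the conversation is whether the stacks really are independent and tested.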

Corey: Yeah, but what is up as well because Zoom loves to sign me out on my desktop whenever I log into it on my laptop, and vice versa, and I wonder if that authentication and login system is also replicated full-stack to everywhere it goes, and what the fencing on that looks like, and how the communication between all those things works? I wouldn’t doubt that it’s possible that they’ve solved for this, but I also wonder how thoroughly they’ve really tested all of that, too. Not because I question them any; just because this stuff is super intricate as you start tracing it down into the nitty-gritty levels of the madness that consumes all these abstractions.

Tim: Well, right, that’s a conventional wisdom that is really wise and true, which is that if you have software that is alleged to do something like allow you to get going on another cloud, unless you’ve tested it within the last three weeks, it’s not going to work when you need it.

Corey: Oh, it’s like a DR exercise: The next commit you make breaks it. Once you have the thing working again, it sits around as a binder, and it’s a best guess. And let’s be serious, a lot of these DR exercises presume that you’re able to, for example, change DNS records on the fly, or be able to get a virtual machine provisioned in less than 45 minutes—because when there’s an actual outage, surprise, everyone’s trying to do the same things—there’s a lot of stuff in there that gets really wonky at weird levels.

Tim: A related, similar exercise is people who want to be on AWS but want to be multi-region. It’s actually, you know, a fairly similar kind of problem. If I need to be able to fail out of us-east-1—well, God help you, because if you need to, everybody else needs to as well—but you know, would that work?

Corey: Before you go multi-cloud go multi-region first. Tell me how easy it is because then you have full-feature parity—presumably—between everything; it should just be a walk in the park. Send me a postcard once you get that set up and I’ll eat a bunch of words. And it turns out, basically, no one does.

Tim: Mm-hm.

Corey: Another area of lock-in around a lot of this stuff, and I think that makes it very hard to go multi-cloud is the security model of how does that interface with various aspects. In many cases, I’m seeing people doing full-on network overlays. They don’t have to worry about the different security group models and VPCs and all the rest. They can just treat everything as a node sitting on the internet, and the only thing it talks to is an overlay network. Which is terrible, but that seems to be one of the only ways people are able to build things that span multiple providers with any degree of success.

Tim: Well, that is painful because, much as we all like to scoff and so on, in the degree of complexity you get into there, it is the case that your typical public cloud provider can do security better than you can. They just can. It’s a fact of life. And if you’re using a public cloud provider and not taking advantage of their security offerings, infrastructure, that’s probably dumb. But if you really want to be multi-cloud, you kind of have to, as you said.

In particular, this gets back to the problem of expertise because it’s hard enough to hire somebody who really understands IAM deeply and how to get that working properly, try and find somebody who can understand that level of thing on two different cloud providers at once. Oh, gosh.

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.

Corey: Another point you made in your blog post was the idea of lock-in, of people being worried that going all-in on a provider was setting them up to be, I think Oracle is the term that was tossed around where once you’re dependent on a provider, what’s to stop them from cranking the pricing knobs until you squeal?

Tim: Nothing. And I think that is a perfectly sane thing to worry about. Now, in the short term, based on my personal experience working with, you know, AWS leadership, I think that it’s probably not a big short-term risk. AWS is clearly aware that most of the growth is still in front of them. You know, the amount of all of it that’s on the cloud is still pretty small and so the thing to worry about right now is growth.

And they are really, really genuinely, sincerely focused on customer success and will bend over backwards to deal with the customer’s problems as they are. And I’ve seen places where people have negotiated a huge multi-year enterprise agreement based on Reserved Instances or something like that, and then realize, “Oh, wait, we need to switch our whole technology stack, but you’ve got us by the RIs,” and AWS will say, “No, no, it’s okay. We’ll tear that up and rewrite it and get you where you need to go.” So, in the short term, between now and 2025, would I worry about my cloud provider doing that? Probably not so much.

But let’s go a little further out. Let’s say it’s, you know, 2030 or something like that, and at that point, you know, Andy Jassy has decided to be a full-time sports mogul, and Satya Nadella has gone off to be a recreational sailboat owner or something like that, and private equity operators come in and take very significant stakes in the public cloud providers, and get a lot of their guys on the board, and you have a very different dynamic. And you have something that starts to feel like Oracle where their priority isn’t, you know, optimizing for growth and customer success; their priority is optimizing for a quarterly bottom line, and—

Corey: Revenue extraction becomes the goal.

Tim: That’s absolutely right. And this is not a hypothetical scenario; it’s happened. Most large companies do not control the amount of money they spend per year to have desktop software that works. They pay whatever Microsoft’s going to say they pay because they don’t have a choice. And a lot of companies are in the same situation with their database.

They don’t get to budget their database budget. Oracle comes in and says, “Here’s what you’re going to pay,” and that’s what you pay. You really don’t want to be in that situation with your cloud, and that’s why I think it’s perfectly reasonable for somebody who is doing cloud transition at a major financial or manufacturing or service provider company to have an eye to this. You know, let’s not completely ignore the lock-in issue.

Corey: There is a significant scale with enterprise deals and contracts. There is almost always a contractual provision that says if you’re going to raise a price with any cloud provider, there’s a fixed period of time of notice you must give before it happens. I feel like the first mover there winds up getting soaked because everyone is going to panic and migrate in other directions. I mean, Google tried it with Google Maps for their API, and not quite Google Cloud, but also scared the bejesus out of a whole bunch of people who were, “Wait. Is this a harbinger of things to come?”

Tim: Well, not in the short term, I don’t think. And I think you know, Google Maps [is absurdly 00:26:36] underpriced. That’s a hellishly expensive service. And it’s supposed to pay for itself by, you know, advertising on maps. I don’t know about that.

I would see that as the exception rather than the rule. I think that it’s reasonable to expect cloud prices, nominally at least, to go on decreasing for at least the short term, maybe even the medium term. But that’s—can’t go on forever.

Corey: It also feels to me, having looked at an awful lot of AWS environments, that if there were to be some sort of regulatory action or some really weird outage for a year that meant that AWS could not onboard a single new customer, their revenue year-over-year would continue to increase purely by organic growth because there is no forcing function that turns the thing off when you’re done using it. In fact, they can migrate things around to hardware that works, they can continue billing you for the things sitting there idle. And there is no governance path on that. So, on some level, winding up doing a price increase is going to cause a massive company focus on fixing a lot of that. It feels on some level like it is drawing attention to a thing that they don’t really want to draw attention to from a purely revenue extraction story.

When CentOS walked back their ten-year support line to two years, suddenly—and with an idea that it would drive [unintelligible 00:27:56] adoption. Well, suddenly, a lot of people looked at their environment, saw they had old [unintelligible 00:28:00] they weren’t using. And it was massively short-sighted; it massively irritated a whole bunch of people who needed that in the short term, but by the renewal, we’re going to be on to Ubuntu or something else. It feels like it’s going to backfire massively, and I’d like to imagine the strategist of whoever takes the reins of these companies is going to be smarter than that. But here we are.

Tim: Here we are. And you know, it’s interesting you should mention regulatory action. At the moment, there are only three credible public cloud providers. It’s not obvious that Google’s really in it for the long haul, as last time I checked, they were claiming to maybe be breaking even on it. That’s not a good number, you know? You’d like there to be more than that.

And if it goes on like that, eventually, some politician is going to say, “Oh, maybe they should be regulated like public utilities,” because they kind of are, right? And I would think that anybody who did get into Oracle-izing would—you know, accelerate that happening. Having said that, we do live in the atmosphere of 21st-century capitalism, and growth is the God that must be worshiped at all costs. Who knows. It’s a cloudy future. Hard to see.

Corey: It really is. I also want to be clear, on some level, that with Google’s current position, if they weren’t taking a small loss at least, on these things, I would worry. Like, wait, you’re trying to catch AWS and you don’t have anything better to invest that money into than just, “Well, time to start taking profits from it”? So, I can see both sides of that one.

Tim: Right. And as I keep saying, I’ve already said once during this slot, you know, the total cloud spend in the world is probably on the order of one or two hundred billion per annum, and global IT is in multiple trillions. So, [laugh] there’s a lot more space for growth. Years and years’ worth of it.

Corey: Yeah. The challenge, too, is that people are worried about this long-term strategic point of view. So, one thing you talked about in your blog post is the idea of using hosted open-source solutions. Like, instead of using Kinesis, you’d wind up using Kafka or instead of using DynamoDB you use their managed Cassandra service—or as I think of it Amazon Basics Cassandra—and effectively going down the path of letting them manage this thing, but you then have a theoretical exodus path. Where do you land on that?

Tim: I think that speaks to a lot of people’s concerns, and I’ve had conversations with really smart people about that who like that idea. Now, to be realistic, it doesn’t make migration easy because you’ve still got all the CI and CD and monitoring and management and scaling and alarms and alerts and paging and et cetera, et cetera, et cetera, wrapped around it. So, it’s not as though you could just pick up your managed Kafka off AWS and drop a huge installation onto GCP easily. But at least, you know, your data plane APIs are the same, so a lot of your code would probably still run okay. So, it’s a plausible path forward. And when people say, “I want to do that,” well, it does mean that you can’t go all serverless. But it’s not a totally insane path forward.

Corey: So, one last point in your blog post that I think a lot of people think about only after they get bitten by it is the idea of data gravity. I alluded earlier in our conversation to data egress charges, but my experience has been that where your data lives is effectively where the rest of your cloud usage tends to aggregate. How do you see it?

Tim: Well, it’s a real issue, but I think it might perhaps be a little overblown. People throw the term petabytes around, and people don’t realize how big a petabyte is. A petabyte is just an insanely huge amount of data, and the notion of transmitting one over the internet is terrifying. And there are lots of enterprises that have multiple petabytes around, and so they think, “Well, you know, it would take me 26 years to transmit that, so I can’t.”

And they might be wrong. The internet’s getting faster all the time. Did you notice? I’ve been able to move some—for purely personal projects—insane amounts of data, and it gets there a lot faster than you’d think. Secondly, in the case of AWS Snowmobile, we have an existence proof that you can do exabyte-ish scale data transfers in the time it takes to drive a truck across the country.
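The "how long would it take to transmit a petabyte" question is plain arithmetic, sketched below. The link speeds are illustrative assumptions, a petabyte is taken as 10^15 bytes, and the `transfer_days` helper is hypothetical; real transfers also lose time to protocol overhead and throttling, which the `efficiency` parameter only crudely models.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(petabytes, gigabits_per_second, efficiency=1.0):
    """Days needed to move `petabytes` over a link of the given speed.

    `efficiency` is the assumed fraction of the link actually achieved.
    """
    bits = petabytes * 1e15 * 8
    seconds = bits / (gigabits_per_second * 1e9 * efficiency)
    return seconds / SECONDS_PER_DAY


# One petabyte over a few assumed sustained link speeds.
for gbps in (0.01, 1, 10, 100):
    days = transfer_days(1, gbps)
    print(f"{gbps:>6} Gbit/s: {days:10.1f} days")
```

Under these assumptions, a multi-decade estimate for one petabyte corresponds to a sustained rate of roughly 10 Mbit/s, while a sustained 10 Gbit/s link moves the same petabyte in a bit over nine days, which is the gap between the fear and the current reality being pointed at here.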

Corey: Inbound only. Snowmobiles are not—at least according to public examples—valid for exodus.

Tim: But you know, this is the kind of place where regulatory action might come into play if what the people were doing was seen to be abusive. I mean, there’s an existence proof you can do this thing. But here’s another point. So, suppose you have, like, 15 petabytes—that’s an insane amount of data—deployed in your corporate application. So, are you actually using that to run the application, or is a huge proportion of that stuff just logs and data gathered of various kinds that’s being used in analytics applications and AI models and so on?

Do you actually need all that data to actually run your app? And could you in fact, just pick up the stuff you need for your app, move it to a different cloud provider from there and leave your analytics on the first one? Not a totally insane idea.

Corey: It’s not a terrible idea at all. It comes down to the idea as well of when you’re trying to run a query against a bunch of that data, do you need all the data to transit or just the results of that query, as well? It’s a question of, can you move the compute closer to the data as opposed to the data to where the compute lives?

Tim: Well, you know and a lot of those people who have those huge data pools have it sitting on S3, and a lot of it migrated off into Glacier, so it’s not as if you could get at it in milliseconds anyhow. I just ask myself, “How much data can anybody actually use in a day? In the course of satisfying some transaction requests from a customer?” And I think it’s not petabyte. It just isn’t.

Now, there are—okay, there are exceptions. There’s the intelligence community, there’s the oil drilling community, there are some communities who genuinely will use insanely huge seas of data on a routine basis, but you know, I think that’s kind of a corner case, so before you shake your head and say, “Ah, they’ll never move because of the data gravity,” you know… you need to prove that to me and I might be a little bit skeptical.

Corey: And I think that is probably a very fair request. Just tell me what it is you’re going to be doing here to validate the idea that is in your head because the most interesting lies I’ve found customers tell isn’t intentionally to me or anyone else; it’s to themselves. The narrative of what they think they’re doing from the early days takes root, and never mind the fact that, yeah, it turns out that now that you’ve scaled out, maybe development isn’t 80% of your cloud bill anymore. You learn things and your understanding of what you’re doing has to evolve with the evolution of the applications.

Tim: Yep. It’s a fun time to be around. I mean, it’s so great; right at the moment lock-in just isn’t that big an issue. And let’s be clear—I’m sure you’ll agree with me on this, Corey—is if you’re a startup and you’re trying to grow and scale and prove you’ve got a viable business, and show that you have exponential growth and so on, don’t think about lock-in; just don’t go near it. Pick a cloud provider, pick whichever cloud provider your CTO already knows how to use, and just go all-in on them, and use all their most advanced features and be serverless if you can. It’s the only sane way forward. You’re short of time, you’re short of money, you need growth.

Corey: “Well, what if you need to move strategically in five years?” You should be so lucky. Great. Deal with it then. Or, “Well, what if we want to sell to retail as our primary market and they hate AWS?”

Well, go all-in on a provider; probably not that one. Pick a different provider and go all in. I do not care which cloud any given company picks. Go with what’s right for you, but then go all in because until you have a compelling reason to do otherwise, you’re going to spend more time solving global problems locally.

Tim: That’s right. And we’ve never actually said this probably because it’s something that both you and I know at the core of our being, but it probably needs to be said that being multi-cloud is expensive, right? Because the nouns and verbs that describe what clouds do are different in Google-land and AWS-land; they’re just different. And it’s hard to think about those things. And you lose the capability of using the advanced serverless stuff. There are a whole bunch of costs to being multi-cloud.

Now, maybe if you’re existentially afraid of lock-in, you don’t care. But for I think most normal people, ugh, it’s expensive.

Corey: Pay now or pay later, you will pay. Wouldn’t you ideally like to see that dollar go as far as possible? I’m right there with you because it’s not just the actual infrastructure costs that’s expensive, it costs something far more dear and expensive, and that is the cognitive expense of having to think about both of these things, not just how each cloud provider works, but how each one breaks. You’ve done this stuff longer than I have; I don’t think that either of us trust a system that we don’t understand the failure cases for and how it’s going to degrade. It’s, “Oh, right. You built something new and awesome. Awesome. How does it fall over? What direction is it going to hit, so what side should I not stand on?” It’s based on an understanding of what you’re about to blow holes in.

Tim: That’s right. And you know, I think particularly if you’re using AWS heavily, you know that there are some things that you might as well bet your business on because, you know, if they’re down, so is the rest of the world, and who cares? And, other things, eh, maybe a little chance here. So, understanding failure modes, understanding your stuff, you know, the cost of sharp edges, understanding manageability issues. It’s not obvious.

Corey: It’s really not. Tim, I want to thank you for taking the time to go through this, frankly, excellent post with me. If people want to learn more about how you see things, and I guess how you view the world, where’s the best place to find you?

Tim: I’m on Twitter, just @timbray T-I-M-B-R-A-Y. And my blog is at tbray.org, and that’s where that piece you were just talking about is, and that’s kind of my online presence.

Corey: And we will, of course, put links to it in the [show notes 00:37:42]. Thanks so much for being so generous with your time. It’s always a pleasure to talk to you.

Tim: Well, it’s always fun to talk to somebody who has shared passions, and we clearly do.

Corey: Indeed. Tim Bray, principal at Textuality Services. I’m Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry comment that you then need to take to all of the other podcast platforms out there purely for redundancy, so you don’t get locked into one of them.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.
