The Controversy of Cloud Repatriation With Amy Tobey of Equinix

Episode Summary

Amy Tobey, Senior Principal Engineer at Equinix, joins Corey to dive into the controversial idea of cloud repatriation and the complexity of running data centers. Amy explains how communication matters when it comes to effecting change at a macro level, and the reasoning behind her position that building something from scratch should almost always take place in the cloud. Amy gives some details about Equinix Metal, Equinix’s bare metal offering, and Corey and Amy discuss the power of storytelling in the context of building and working with tech.

Episode Show Notes & Transcript

About Amy

Amy Tobey has worked in tech for more than 20 years at companies of every size, working with everything from kernel code to user interfaces. These days she spends her time building an innovative Site Reliability Engineering program at Equinix, where she is a principal engineer. When she's not working, she can be found with her nose in a book, watching anime with her son, making noise with electronics, or doing yoga poses in the sun.



Links Referenced:



Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: Welcome to Screaming in the Cloud, I’m Corey Quinn, and this episode is another one of those real profiles in shitposting type of episodes. I am joined again from a few months ago by Amy Tobey, who is a Senior Principal Engineer at Equinix, back for more. Amy, thank you so much for joining me.

Amy: Welcome. To your show. [laugh].

Corey: Exactly. So, one thing that we have been seeing a lot over the past year, and you struck me as one of the best people to talk about this from a what-you’re-seeing-out-in-the-wilderness perspective, has been the idea of cloud repatriation. It started off with something that came out of Andreessen Horowitz toward the start of the year about the trillion-dollar paradox: how, at a certain point of scale, repatriating to a data center is the smart and right move. And oh, my stars, did that ruffle some feathers for people.

Amy: Well, I spent all this money moving to the cloud. That was just mean.

Corey: I know. Why would I want to leave the cloud? I mean, for God’s sake, my account manager named his kid after me. Wait a minute, how much am I spending on that? Yeah—

Amy: Good question.

Corey: —there is that ever-growing problem. And there have been the examples that people have given: Dropbox classically did a cloud repatriation exercise, and then there’s a second example that no one can ever name. And it seems like, okay, this might not necessarily be the direction that the industry is going. But I also tend to not be completely naive when it comes to these things. And I can see repatriation making sense on a workload-by-workload basis.

What that implies is that yeah, but a lot of other workloads are not going to be going to a data center. They’re going to stay in a cloud provider, who would like very much if you never read a word of this to anyone in public.

Amy: Absolutely, yeah.

Corey: So, if there are workloads repatriating, it would occur to me that there’s a vested interest on the part of every major cloud provider to do their best to, I don’t know if saying suppress the story is too strongly worded, but it is directionally what I mean.

Amy: They aren’t helping get the story out. [laugh].

Corey: Yeah, it’s like, “That’s a great observation. Could you maybe shut the hell up and never make it ever again in public, or we will end you?” Yeah. You’re Amazon. What are you going to do, launch a shitty Amazon Basics version of what my company does? Good luck. Have fun. You’re probably doing it already.

But the reason I want to talk to you on this is a confluence of a few things. One, as I mentioned back in May when you were on the show, I am incensed and annoyed that we’ve been talking for as long as we have, and somehow I never had you on the show. So, great. Come back, please. You’re always welcome here. Secondly, you work at Equinix, which is, effectively—let’s be relatively direct—it is functionally a data center as far as how people wind up contextualizing this. Yes, you have higher level—

Amy: Yeah, I guess people contextualize it that way. But we’ll get into that.

Corey: Yeah, from the outside. I don’t work there, to be clear. My talking points don’t exist for this. But I think of oh, Equinix. Oh, that means you basically have a colo or colo equivalent. The pricing dynamics are radically different; it looks a lot closer to a data center in my imagination than it does to a traditional public cloud. I would also argue that if someone migrates from AWS to Equinix, that would be viewed—arguably correctly—as something of a repatriation. Is that directionally correct?

Amy: I would argue incorrectly. For Metal, right?

Corey: Ah.

Amy: So, Equinix is a data center company, right? Like, that’s what everybody knows us as. Equinix Metal is a bare metal primitive service, right? So, it’s a lot more of a cloud workflow, right, except that you’re not getting the rich services that you get in a technically full cloud, right? Like, there’s no RDS; there’s no S3, even. What you get is bare metal primitives, right? With a really fast network that isn’t going to—

Corey: Are you really a cloud provider without some ridiculous machine-learning-powered service that’s going to wind up taking pictures, perform incredibly expensive operations on it, and then return something that’s more than a little racist? I mean, come on. That’s not—you’re not a cloud until you can do that, right?

Amy: We can do that. We have customers that do that. Well, not specifically that, but um—

Corey: Yeah, but they have to build it themselves. You don’t have the high-level managed service that basically serves as, functionally, bias laundering.

Amy: Yeah, you don’t get it in a box, right? So, a lot of our customers are doing things that are unique, right, that maybe don’t exactly fit into the cloud well. And it comes back down to a lot of Equinix’s roots, which is—we talk about going into the cloud, and it’s this kind of abstract environment we’re reaching for, you know, up in the sky. And it’s like, we don’t know where it is, except we have regions that—okay, so it’s in Virginia. But the rule of real estate applies to technology as often as not, which is location, location, location, right?

When we’re talking about a lot of applications, a challenge that we face, say in gaming, is that the latency from the customer, so that last mile to your data center, can often be extremely important, right, so a few milliseconds even. And a lot of, like, SaaS applications, the typical stuff that really the cloud was built on, 10 milliseconds, 50 milliseconds, nobody’s really going to notice that, right? But in a gaming environment or some very low latency application that needs to run extremely close to the customer, it’s hard to do that in the cloud. They’re building this stuff out, right? Like, I see, you know, different ones [unintelligible 00:05:53] opening new regions but, you know, there’s this other side of the cloud, which is, like, the edge computing thing that’s coming alive, and that’s more where I think about it.

And again, location, location, location. The speed of light is really fast, but as most of us in tech know, if you want to go across from the East Coast to the West Coast, you’re talking about 80 milliseconds, on average, right? I think that’s what it is. I haven’t checked in a while. Yeah, that’s just basic fundamental speed of light. And so, if everything’s in us-east-1—and this is why we do multi-region, sometimes—the latency from the West Coast isn’t going to be great. And so, we run the application on both sides.

Corey: It has improved though. If you want to talk old school things that are seared into my brain from over 20 years ago, every person who’s worked in data centers—or in technology, as a general rule—has a few IP addresses seared. And the one that I’ve always had on my mind was 130.111.32.11. Kind of arbitrary and ridiculous, but it was one of the two recursive resolvers provided at the University of Maine where I had my first help desk job.

And it lives on-prem, in Maine. And generally speaking, I tended to always accept that no matter where I was—unless I was in a data center somewhere—it was about 120 milliseconds. And I just checked now; it is 85 and change from where I am in San Francisco. So, the internet or the speed of light have improved. So, good for whichever one of those it was. But yeah, you’ve just updated my understanding of these things. All of which is to say, yes, latency is very important.
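A quick back-of-the-envelope check of those coast-to-coast numbers, as a minimal Python sketch; the fiber path length and the two-thirds-of-c propagation factor are illustrative assumptions rather than measured values:

    SPEED_OF_LIGHT_KM_PER_S = 300_000
    FIBER_VELOCITY_FACTOR = 0.67   # light in fiber travels at roughly 2/3 of c (assumption)
    fiber_path_km = 6_000          # assumed east-to-west fiber route, longer than the great-circle distance

    one_way_ms = fiber_path_km / (SPEED_OF_LIGHT_KM_PER_S * FIBER_VELOCITY_FACTOR) * 1_000
    round_trip_ms = 2 * one_way_ms
    print(f"Propagation alone: ~{round_trip_ms:.0f} ms round trip")
    # Roughly 60 ms before any routing, queueing, or serialization delay, which is
    # why measured coast-to-coast RTTs in the 70-85 ms range are close to the physical floor.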

Amy: Right. Let’s forget repatriation, to be really honest. Even the Dropbox case or any of them, right? Like, there’s an economic story here that I think all of us that have been doing cloud work for a while see pretty clearly, that maybe not everybody’s seeing who’s thinking from an on-prem kind of situation, which is that—you know, and I know you do this all the time, right—you don’t just look at the cost of the data center and the servers and the network, the technical components, the bill of materials—

Corey: Oh, lies, damned lies, and TCO analyses. Yeah.

Amy: —but there’s all these people on top of it, and the organizational complexity, and the contracts that you’ve got to manage. And it’s this big, huge operation that is incredibly complex to do well, and that is almost nobody’s core business.

So, the way I look at this, right, and the way I even talk to customers about it is, like, “What is your produ—” and I talk to people internally about it this way, too. It’s like, “What are you trying to build?” “Well, I want to build a SaaS.” “Okay. Do you need data center expertise to build a SaaS?” “No.” “Then why the hell are you putting it in a data center?” Like we—you know, and speaking for my employer, right, like, we have Equinix Metal right here. You can build on that and you don’t have to do the most complex part of this, at least in terms of, like, the physical plant, right? Like, getting a bare metal server available, we take care of all of that. Even at the primitive level, where we sit, it’s higher level than, say, colo.

Corey: There’s also the question of economics as it ties into it. It’s never just a raw cost-of-materials type of approach. Like, my original job in a data center was basically to walk around and replace hard drives, and apparently, to insult people. Now, the cloud has taken one of those two aspects away, and you can follow my Twitter account and figure out which one of those two it is, but what I keep seeing now is there is value to having that task done, but in a cloud environment—and Equinix Metal, let’s be clear—that has slipped below the surface level of awareness. And well, what are the economic implications of that?

Well, okay, you have a whole team of people at large companies whose job it is to do precisely that. Okay, we’re going to upskill them and train them to use cloud. Okay. First, not everyone is going to be capable or willing to make that leap from hard drive replacement to, “Congratulations and welcome to JavaScript. You’re about to hate everything that comes next.”

And if they do make that leap, their baseline market value—by which I mean what the market is willing to pay for them—will approximately double. And whether they wind up being paid more by their current employer or they take a job somewhere else with those skills and get paid what they are worth, the company still has that economic problem. Like it or not, you will generally get what you pay for; that is the reality of it. And as companies are thinking about this—well, what gets into the TCO analysis and what doesn’t—I have yet to see one where the outcome was not predetermined. They’re less “let’s figure out in good faith whether it’s going to be more expensive to move to the cloud, or move out of the cloud, or just burn the building down for insurance money.” The outcome is generally the one that the person who commissioned the TCO analysis wants. So, when a vendor is trying to get you to switch to them and they do one for you, yeah. And I’m not saying they’re lying, but there’s so much judgment that goes into this. What do you include and what do you not include? That’s hard.

Amy: And there’s so many hidden costs. And that’s one of the things that I love about working at a cloud provider: I still get to play with all that stuff and, like, I get to see those hidden costs, right? Like you were talking about the person who goes around and swaps out the hard drives. Or early in my career, right, I worked with someone whose job it was, every day, to go into the data center, swap out the tapes, you know, and do a few other things around there and, like, take care of the billing system. And that was a job where it was kind of going around and stewarding a whole bunch of things that kept the whole machine running, but most people outside of being right next to the data center didn’t have any idea that stuff even happened, right, that went into it.

And so, like you were saying, like, when you go to do the TCO analysis—I mean, I’ve been through this a couple of times prior in my career, where people will look at it and go, like, “Well, of course we’re not going to list—we’ll put, like, two headcount on there.” And it’s always a lie because it’s never just two headcount. It’s never just the network person, or the SRE, or the person who’s racking the servers. It’s also, like, finance has to do all this extra work, and there’s all the logistics work, and there is just so much stuff that is really hard to include. Not only do people leave it out, but it’s also just really hard for people to grapple with the complexity of all the things it takes to run a data center, which is, like, one of the most complex machines on the planet, any single data center.

Corey: I’ve worked in small-scale environments, maybe a couple of mid-sized ones, but never the type of hyperscale facility that you folks have, which I would say is if it’s not hyperscale, it’s at least directionally close to it. We’re talking thousands of servers, and hundreds of racks.

Amy: Right.

Corey: I’ve started getting into that, on some level. Now, I guess when we say ‘hyperscale,’ we’re talking about AWS-size things where, oh, that’s a region and it’s going to have three dozen data center facilities in it. Yeah, I don’t work in places like that because honestly, have you met me? Would you trust me around something that’s that critical infrastructure? No, you would not, unless you have terrible judgment, which means you should not be working in those environments to begin with.

Amy: I mean, you’re like a walking chaos exercise. Maybe I would let you in.

Corey: Oh, I bring my hardware destruction aura near anything expensive and things are terrible. It’s awful. But as I look at the cloud, regardless of cloud, there is another economic element that I think is underappreciated, and to be fair, this does, I believe, apply as much to Equinix Metal as it does to the public hyperscale cloud providers that have problems with naming things well. And that is, when you are provisioning something as a customer of one of these places, you have an unbounded growth problem. When you’re in a data center, you are not going to just absentmindedly sign an $8 million purchase order for new servers—you know, a second time—because then that means you eventually run out of power, space, and places to put things, and you have to go find more somewhere.

Whereas in cloud, the only limit is basically your budget where there is no forcing function that reminds you to go and clean up that experiment from five years ago. You have people with three petabytes of data they were using for a project, but they haven’t worked there in five years and nothing’s touched it since. Because the failure mode of deleting things that are important, or disasters—

Amy: That’s why Glacier exists.

Corey: Oh, exactly. But that failure mode of deleting things that should not be deleted is disastrous for a company, whereas if you leave them there, well, it’s only money. And there’s no forcing function to do that, which means you have this infinite growth problem with no natural limit slash predator around it. And that is the economic analysis that I do not see playing out basically anywhere. Because oh, by the time that becomes a problem, we’ll have good governance in place. Yeah, pull the other one. It has bells on it.
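To put a rough number on that three-petabyte example, here is a minimal sketch; the per-gigabyte prices are assumptions based on published list pricing and should be checked against current rates:

    DATA_GB = 3_000_000                  # roughly 3 PB expressed in gigabytes
    HOT_STORAGE_PER_GB_MONTH = 0.023     # assumed list price for standard object storage, USD
    DEEP_ARCHIVE_PER_GB_MONTH = 0.00099  # assumed list price for deep-archive storage, USD

    hot_monthly = DATA_GB * HOT_STORAGE_PER_GB_MONTH
    archive_monthly = DATA_GB * DEEP_ARCHIVE_PER_GB_MONTH
    print(f"Hot storage:  ~${hot_monthly:,.0f} per month")
    print(f"Deep archive: ~${archive_monthly:,.0f} per month")
    # At these assumed rates, the forgotten experiment bills on the order of
    # $69,000 a month until someone notices -- the unbounded-growth problem above.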

Amy: That’s the funny thing, right, is a lot of the early drive in the cloud was those of us who wanted to go faster and we were up against the limitations of our data centers. And then we go out and go, like, “Hey, we got this cloud thing. I’ll just, you know, put the credit card in there and I’ll spin up a few instances, and ‘hey, I delivered your product.’” And everybody goes, “Yeah, hey, happy.” And then like you mentioned, right, and then we get down the road here, and it’s like, “Oh, my God, how much are we spending on this?”

And then you’re in that funny boat where you have both. But yeah, I mean, like, that’s just a typical engineering problem, where, you know, we have to deal with our constraints. And the cloud has constraints, right? Like when I was at Netflix, one of the things we would do frequently is bump up against instance limits. And then we’d go talk to our TAM and be like, “Hey, buddy. Can we have some more instance limit?” And then take care of that, right?

But there are some bounds on that. Of course, on the cloud provider side—you know, if I have my cloud provider shoes on, I don’t necessarily want to set those limits too low because it’s a business; the business wants to hoover up all the money. That’s what businesses do. So, I guess it’s just a different constraint that is maybe much too easy to knock down, right? Because as you mentioned, in a data center or in a colo space, if I outgrow my cage and I’ve filled up all the space I have, I have to either order more space from my colo provider or expand to the cloud, right?

Corey: The scale I was always at, the limit was not the space because I assure you, with enough shoving all things are possible. Don’t believe me? Look at what people are putting in the overhead bin on any airline. Enough shoving, you’ll get a Volkswagen in there. But it was always power constraints that I dealt with. And it’s like, “Eh, they’re just being conservative.” And then the whole room dies.

Amy: You want blade servers? Because that’s how you get blade servers, right? That movement was about bringing the density up and putting more servers in a rack. You know, there was some management stuff and [unintelligible 00:16:08], but a lot of it was just about, like, you know, I remember—I’m picturing it, right—

Corey: Even without that, I was still power constrained because you have to remember, a lot of my experiences were not in, shall we say, data center facilities that you would call, you know, good.

Amy: Well, that brings up a fun thing that’s happening, which is that the power envelope of servers is still growing. The newest Intel chips, especially the ones they’re shipping for hyperscale and stuff like that, with the really high core counts and the faster clock speeds, you know, these things are pulling, like, 300 watts. And they also have to egress all that heat. And so, that’s one of the places where we’re doing some innovations—I think there’s a couple of blog posts out about it—around, like, liquid cooling or multimode cooling. And what’s interesting about this from a cloud or data center perspective is that the tools and skills and everything have to come together to run, you know, this year’s or next year’s servers, where we’re pushing thousands of kilowatts into a rack. Thousands; one rack, right?

The bar to actually bootstrap and run this stuff successfully is rising again, compared to the days when I could take my pizza box servers, right—and I worked at a gaming company a long time ago, right, and they would just, like, stack them on the floor. It was just a stack of servers. Like, they were in between the rails, but they weren’t screwed down or anything, right? And they would network them all up. Because basically, like, the game would spin up on the servers and if they died, they would just unplug that one and leave it there and spin up another one.

It was like you could just stack stuff up and, like, be slinging cables across the data center and stuff back then. I wouldn’t do it that way now, but when you add, say, liquid cooling and some of these, like, extremely high power situations into the mix, now you need to have—for example, if you’re using liquid cooling, you don’t want that stuff leaking, right? And so, as good as the pressure fittings and blind mating and all this stuff that’s coming around gets, you still have that element of additional training, and skill, and possibility for mistakes.

Corey: The thing that I see as I look at this across the space is that, on some level, it’s gotten harder to run a data center than it ever has been before. Because again, another reason I wanted to have you on this show is that you do not carry a quota. Although you do often carry the conversation when you have boring people around you, but quotas, no. You are not here selling things to people. You’re not actively incentivized to get people to see things a certain way.

You are very clearly an engineer in the right ways. I will further point out, though, that you do not sound like an engineer, by which I mean the stereotype that will basically belittle people, in many cases, in the name of being technically correct. You’re a human being with a frickin’ soul. And believe me, it is noticed.

Amy: I really appreciate that. If somebody’s just listening and hearing my voice and my name, right—like, I have a low voice. And in most of my career, I was extremely technical, like, to the point where, you know, if something was wrong technically, I would fight to the death to get the right technical solution, and maybe not see the complexity around the decisions and why things were the way they were, in the way I can today. And that’s changed how I sound. It’s changed how I talk. It’s changed how I look at and talk about technology as well, right? I’m just not that interested in Kubernetes, because I’ve kind of started looking up the stack in this kind of pursuit.

Corey: Yeah, when I say you don’t sound like an engineer, I am in no way shape or form—

Amy: I know.

Corey: —alluding in any respect to your technical acumen. I feel the need to clarify that statement for people who might be listening, and say, “Hey, wait a minute. Is he being a shithead?” No.

Amy: No, no, no.

Corey: Well, not the kind you’re worried I’m being anyway; I’m a different breed of shithead and that’s fine.

Amy: Yeah, I should remember that other people don’t know we’ve had conversations that are deeply technical, that aren’t on air, that aren’t context anybody else has. And so, like, I bring that deep technical knowledge, you know, the ability to talk about PCI Express, and kilovolts [unintelligible 00:19:58] rack, and top-of-rack switches, and network topologies, all of that together now. But what’s really fascinating is that where the really big impact is—for reliability, for security, for quality, the things that I as a person am driven by—products are cool, but, like, I like them to be reliable; that’s the part that I like—really comes down to more leadership, and business acumen, and understanding the business constraints, and then being able to get heard by an audience that isn’t necessarily technical, that doesn’t necessarily understand the difference between PCI, PCI-X, and PCI Express. There’s a difference between those. It doesn’t mean anything to the business, right? So, when we want to go and talk about why we’re doing, for example, a multi-region deployment of our application, if I come in and say, “Well, because we want to use Raft,” that’s going to fall flat, right?

The business is going to go, “I don’t care about Raft. What does that have to do with my customers?” Which is the right question to always ask. Instead, when I show up and say, “Okay, what’s going on here is we have this application sitting in a single region—or in a single data center or whatever, right? I’m using region because that’s probably what most of the people listening understand—you know, so if I put my application in a single region and it goes down, our customers are going to be unhappy. We have the alternative to spend, okay, not a little bit more money, probably a lot more money, to build a second region, and the benefit we will get is that our customers will be able to access the service 24x7, and it will always work and they’ll have a wonderful experience. And maybe they’ll keep coming back and buy more stuff from us.”

And so, when I talk about it in those terms, right—and it’s usually more nuanced than that—then I start to get the movement at the macro level, right, in the systemic level of the business in the direction I want it to go, which is for the product group to understand why reliability matters to the customer, you know? For the individual engineers to understand why it matters that we use secure coding practices.

[midroll 00:21:56]

Corey: Getting back to the reason I said that you are not quota-carrying and you are not incentivized to push things in a particular way: often we’ll meet zealots, and I’ve never known you to be one. You have always been a strong advocate for doing the right thing, even if it doesn’t directly benefit any given random employer that you might have. And as a result, one of the things that you’ve said to me repeatedly is if you’re building something from scratch, for God’s sake, put it in the cloud. What is wrong with you? Do that. As for the idea of building it yourself on low-level, underlying primitives: for almost every modern SaaS-style workload, there’s no reason to consider doing something else in almost any case. Is that a fair representation of your position on this?

Amy: It is. I mean, the simpler version, right, is, “Why the hell are you doing undifferentiated lifting?” Right? Things that don’t differentiate your product—why would you do it?

Corey: The thing that this has empowered, then, is I can build an experiment tonight—I don’t have to wait for provisioning and signed contracts and do all the rest. I can spend 25 cents and get the experiment up and running. If it takes off, though, it has changed how I move going forward as well, because there’s not that difference in the way that there was back when we were in data centers. I’m going to try an experiment; I’m going to run it on this, I don’t know, crappy Raspberry Pi or my desktop or something under my desk somewhere. And if it takes off and I have to scale up, I’ve got to do a giant migration to real enterprise-grade hardware. With cloud, you are getting all of that out of the box, even if all you’re doing with it is something ridiculous and nonsensical.

Amy: And you’re often getting, like, ridiculously better service. So, 20 years ago, if you and I sat down to build a SaaS app, we would have spun up a Linux box somewhere in a colo, and we would have spun up Apache, MySQL, maybe some Perl or PHP if we were feeling frisky. And the availability of that would be what one machine could do, what we could handle in terms of one MySQL instance. But today, if I’m spinning up a new stack for the same kind of SaaS, I’m probably going to deploy it into an ASG, and I’m probably going to have some kind of high-availability database behind it—and I’m going to use Aurora as an example—because, like, in terms of availability, even if I’m building it up myself with the very best kit available in databases, it’s going to be really hard to hit the same availability that Aurora does, because Aurora is not just a software solution, it’s also got a team around it that stewards that 24/7. And it continues to evolve on its own.
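For a concrete picture of that modern default stack, here is a minimal boto3 sketch; the identifiers, instance class, credentials, and the pre-existing launch template are all hypothetical, and a real deployment would add networking, secrets handling, and error checking:

    import boto3

    rds = boto3.client("rds")
    autoscaling = boto3.client("autoscaling")

    # Managed Aurora cluster: replication, failover, and patching are handled by
    # the provider's team rather than by us.
    rds.create_db_cluster(
        DBClusterIdentifier="saas-demo-cluster",       # hypothetical name
        Engine="aurora-mysql",
        MasterUsername="admin",
        MasterUserPassword="change-me",                # use a secrets manager in practice
    )
    rds.create_db_instance(
        DBInstanceIdentifier="saas-demo-writer",       # hypothetical name
        DBClusterIdentifier="saas-demo-cluster",
        DBInstanceClass="db.r6g.large",
        Engine="aurora-mysql",
    )

    # App tier in an auto-scaling group, so capacity follows load instead of being
    # capped at what one hand-built machine could handle.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="saas-demo-asg",                          # hypothetical name
        LaunchTemplate={"LaunchTemplateName": "saas-demo-template"},   # assumed to already exist
        MinSize=2,
        MaxSize=10,
        VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",           # hypothetical subnet IDs
    )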

And so, like, the base, when we start that little tiny startup, instead of being that one machine, we’re actually starting at a much higher level of quality, and availability, and even security sometimes, because of these primitives that are available. And I probably should extend on the thought of undifferentiated lifting, right, and come back to the colo or the edge story, which is that there are still some little edge cases, right? Like, I think for SaaS, duh, right? Like, go straight to the cloud. But there are still some really interesting things where there’s, like, hardware innovations, where they’re doing things with GPUs and stuff like that.

Where the colo experience may be better because you’re trying to do, like, custom hardware, in which case you are in a colo. There are businesses doing some really interesting stuff with custom hardware that’s behind an application stack. What’s really cool about some of that, from my perspective, is that some of that might be sitting on, say, bare metal with us, and maybe the front-end is sitting somewhere else. Because the other thing Equinix does really well is this product we call Fabric, which lets us basically do peering with any of the cloud providers.

Corey: Yeah, the reason, I guess, I don’t consider you a quote-unquote, “cloud,” is first and foremost rooted in the fact that you don’t have a bandwidth model that is free ingress and criminally expensive to send it anywhere that isn’t to you folks. Like, are you really a cloud if you’re not just gouging the living piss out of your customers every time they want to send data somewhere else?

Amy: Well, I mean, we like to say we’re part of the cloud. And really, that’s actually my favorite feature of Metal is that you get, I think—

Corey: Yeah, this was a compliment, to be very clear. I’m a big fan of not paying 1998 bandwidth pricing anymore.

Amy: Yeah, but this is the part where I get to do a little bit of, like, showing off for Metal a little bit, in that, like, when you buy a Metal server, there’s different configurations, right, but, like, I think the lowest one, you have dual 10 Gig ports to the server that you can get either in a bonded mode so that you have a single 20 Gig interface in your operating system, or you can actually do L3 and you can do BGP to your server. And so, this is a capability that you really can’t get at all on the other clouds, right? This lets you do things with the network, not only the bandwidth, right, that you have available. Like, you want to stream out 25 gigs of bandwidth out of us, I think that’s pretty doable. And the rates—I’ve only seen a couple of comparisons—are pretty good.

So, this is like where some of the business opportunities, right—and I can’t get too much into it, but, like, this is all public stuff I’ve talked about so far—which is, that’s part of the opportunity there is sitting at the crossroads of the internet, we can give you a server that has really great networking, and you can do all the cool custom stuff with it, like, BGP, right? Like, so that you can do Anycast, right? You can build Anycast applications.

Corey: I miss the days when that was a thing that made sense.

Amy: [laugh].

Corey: I mean that in the context of, you know, with the internet and networks. These days, it always feels like the network engineering has slipped away within the cloud because you have overlays on top of overlays, and it’s all abstractions that are living out there right up until suddenly you really need to know what’s going on. But it has abstracted so much of this away. And that, on some level, is the surprise people are often in for when they wind up outgrowing the cloud for a workload and wanting to move it someplace that doesn’t, you know, ride them like naughty ponies for bandwidth. And they have to rediscover things that we’ve mostly forgotten about.

I remember having to architect significantly around the context of hard drive failures. I know we’ve talked about that a fair bit as a thing, but yeah, it’s spinning metal, it throws off heat and if you lose the wrong one, your data is gone and you now have serious business problems. In cloud, at least AWS-land, that’s not really a thing anymore. The way EBS is provisioned, there’s a slight tick in latency if you’re looking at just the right time for what I think is a hard drive failure, but it’s there. You don’t have to think about this anymore.

Migrate that workload to a pile of servers in a colo somewhere, guess what? Suddenly your reliability is going to decrease. Amazon, and the other cloud providers as well, have gotten to a point where they are better at operations than you are at your relatively small company with your nascent sysadmin team. I promise. There is an economy of scale here.

Amy: And it doesn’t have to be good or better, right? It’s just simply better resourced—

Corey: Yeah.

Amy: Than most anybody else can hope to be. Amazon can throw a billion dollars at it and never miss it. In most organizations out there, you know, and especially most of the enterprises, people are scratching and trying to get resources wherever they can, right? They’re all competing for people, for time, for engineering resources, and that’s one of the things that gets freed up when you just basically bang an API and you get the thing you want. You don’t have to go through that kind of old-world internal process that is usually slow and often painful.

Just because they’re not resourced as well; they’re not automated as well. Maybe they could be. I’m sure most of them could, in theory be, but we come back to undifferentiated lifting. None of this helps, say—let me think of another random business—Claire’s, whatever, like, any of the shops in the mall, they all have some kind of enterprise behind them for cash processing and all that stuff, point of sale, none of this stuff is differentiating for them because it doesn’t impact anything to do with where the money comes in. So again, we’re back at why are you doing this?

Corey: I think that’s also the big challenge as well, when people start talking about repatriation and talking about this idea that they are going to—oh, the cloud is too expensive; we’re going to move out. And they make the economics work. Again, I do firmly believe that, by and large, businesses do not intentionally go out and make poor decisions. I think when we see a company doing something inscrutable, there’s always context that we’re missing, and I think, as a general rule of thumb, that these companies do not hire people who are fools. And there are always constraints that they cannot talk about in public.

My general position as a consultant, and ideally as someone who aspires to be a decent human being, is that when I see something I don’t understand, I assume that there’s simply a lack of context, not that everyone involved in this has been foolish enough to make giant blunders that I can pick out in the first five seconds of looking at it. I’m not quite that self-confident yet.

Amy: I mean, that’s a big part of, like, the career progression into above senior engineer, right: you don’t get to sit in your chair and go, like, “Oh, those dummies,” right? You actually have—I don’t know about ‘have to,’ but, like, the way I operate now, right—I remember in my youth, I used to be like, “Oh, those business people. They don’t know nothing. Like, what are they doing?” You know, it’s goofy what they’re doing.

And then now I have a different mode, which is, “Oh, that’s interesting. Can you tell me more?” The feeling is still there, right? Like, “Oh, my God, what is going on here?” But then I get curious, and I go, “So, how did we get here?” [laugh]. And you get that story, and the stories are always fascinating, and they always involve, like, constraints, immovable objects, people doing the best they can with what they have available.

Corey: Always. And I want to be clear that very rarely is it the right answer to walk into a room, look at the architecture, and say, “All right, what moron built this?” Because invariably, you’re going to be asking that question of said moron. And it doesn’t matter how right you are, they’re never going to listen to another thing out of your mouth again. And have some respect for what came before, even if it’s potentially the wrong answer. Well, great. “Why didn’t you just use this service to do this instead?” “Yeah, because this thing predates that by five years, jackass.”

There are reasons things are the way they are, if you take any architecture in the world and tell people to rebuild it greenfield, almost none of them would look the same as they do today because we learn things by getting it wrong. That’s a great teacher, and it hurts. But it’s also true.

Amy: And we’ve got to build, right? Like, that’s what we’re here to do. We can’t just kind of cycle, waiting for the perfect technology, the right choices. And again, to come back to it—the people who built it at the time used—you know, often we can fault people for this—used the things they knew or the things that were nearby, and they made it work. And that’s kind of amazing sometimes, right?

Like, I’m sure you see architectures frequently, and I see them too, probably less frequently, where you just go, how does this even work in the first place? Like how did you get this to work? Because I’m looking at this diagram or whatever, and I don’t understand how this works. Maybe that’s a thing that’s more a me thing, like, because usually, I can look at a—skim over an architecture document and be, like, be able to build the model up into, like, “Okay, I can see how that kind of works and how the data flows through it.” I get that pretty quickly.

And it comes back to that, like, just, again, asking, “How did we get here?” And then the cool part about asking how did we get here is it sets everybody up in the room, not just you as the person trying to drive change, but the people you’re trying to bring along—the original architects, the original engineers. When you ask how did we get here, you’ve started them on the path to coming along with you in the future, which is kind of cool. That storytelling mode, again, is so powerful at almost every level of the stack, right? And that’s why, like, when we were talking about how technical I get, again, like, I’m just not that interested in, like, are you Little Endian or Big Endian? How did we get here is kind of cool. You built a Big Endian architecture in 2022? Like, “Ohh. [laugh]. How did we do that?”

Corey: Hey, leave me to my own devices, and if I need to build something super quickly to get it up and running, well, what I’m going to do, for a lot of answers, is going to look an awful lot like the traditional three-tier architecture that I was running back in 2008. Because I know it, it works well, and I can iterate rapidly on it. Is it a best practice? Absolutely not. But given the constraints, sometimes it’s the fastest thing to grab. “Well, if you built this in serverless technologies, it would run at a fraction of the cost.” It’s, “Yes, but if I run this thing the way that I’m running it now, it’ll be $20 a month, and it’ll take me two hours instead of 20. And what exactly is your time worth, again?” It comes down to the better economic model of all these things.
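That time-versus-bill trade-off is easy to make explicit; in this minimal sketch, only the $20-a-month figure and the 2-versus-20-hour estimate come from the conversation, while the serverless bill and the hourly rate are illustrative assumptions:

    familiar_monthly_bill = 20.0    # the "boring" three-tier setup, from the conversation
    serverless_monthly_bill = 5.0   # assumed cheaper serverless bill (illustrative)
    extra_build_hours = 20 - 2      # 20 hours serverless vs. 2 hours the familiar way
    hourly_rate = 100.0             # assumed loaded cost of an engineer-hour

    extra_build_cost = extra_build_hours * hourly_rate
    monthly_savings = familiar_monthly_bill - serverless_monthly_bill
    print(f"Break-even after ~{extra_build_cost / monthly_savings:.0f} months")
    # About 120 months at these numbers: the cheaper-to-run architecture doesn't pay
    # back the extra build time within any reasonable planning horizon.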

Amy: Any time you’re trying to make a case to the business, the economic model is always going to go further. Just a general tip for tech people, right? Like, if you can make the better economic case and you go to the business with an economic case that is clear, businesses listen to that. They’re not going to listen to us go on and on about distributed systems.

Somebody in finance trying to make a decision about, like, do we go and spend a million bucks on this, that’s not really the material thing. It’s like, well, how is this going to move the business forward? And how much is it going to cost us to do it? And what other opportunities are we giving up to do that?

Corey: I think that’s probably a good place to leave it because there’s no good answer. We can all think about that until the next episode. I really want to thank you for spending so much time talking to me again. If people want to learn more, where’s the best place for them to find you?

Amy: Always Twitter for me, MissAmyTobey, and I’ll see you there. Say hi.

Corey: Thank you again for being as generous with your time as you are. It’s deeply appreciated.

Amy: It’s always fun.

Corey: Amy Tobey, Senior Principal Engineer at Equinix Metal. I’m Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you’ve enjoyed this podcast please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry comment that tells me exactly what we got wrong in this episode in the best dialect you have of condescending engineer with zero people skills. I look forward to reading it.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.

