Episode Show Notes & Transcript
Show Highlights:
(00:19) Intro
(01:09) From Imposter Syndrome
(06:34) Honest Community Feedback
(09:29) EKS Versus ECS Debate
(21:32) Home Lab Reality Check
(22:40) Build vs Buy Long Game
(28:04) Focus on Core Business
(34:35) Uptime Tradeoffs and Standards
(39:41) Networking and IPv6 Debate
(41:28) Wrap Up and Where to Find
Links:
Ahmed's LinkedIn: https://www.linkedin.com/in/ahmedbebars
Sponsored by:
duckbillhq.com
Transcript
Ahmed: It's the idea of build versus buy and all of that kind of stuff. It comes to a point where, sure, this system is unstable, but unstable in a way where you don't have to invest all of the resources keeping up the uptime, all of the operational stuff, all of those things.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn, and I am joined today by a man of many talents. Ahmed Bebars is a principal engineer at the New York Times. He's an AWS Container Hero, a Cloud Native ambassador, and a prolific public speaker. Ahmed, welcome to the show.
Ahmed: Thank you, Corey, for having me.
I'm excited to see what we're gonna dive into. I know that you have a lot of questions, so I'm looking forward to hearing some of them.
Corey: We'll start with the directly insulting one, I suppose. You're an AWS Hero, you're a Cloud Native ambassador. What got you down the path of, you know what I should do?
That's right: volunteer work for giant entities that frankly could afford to pay people to do this, if you really think about it the right way. I mostly kid; Lord knows I've spent enough time in the community myself. How do you wind up there?
Ahmed: Yeah, to be honest, I didn't know I was gonna end up there.
A few years ago, when I started my journey, when I came to the United States, I was like, sure, I'll try to solve a couple of problems in a couple of organizations here and there. And then all of a sudden, after some time, in 2019, I had my first public speaking opportunity.
It struck me, because I had always thought that I don't know enough to share. That was really the tipping point for me. Everything I do, I'd think, yeah, everyone knows that, everyone knows that. Until that moment. Then I went and gave my first talk, and a lot of people didn't know what I was going to talk about, and they liked it, and they said, this is great content.
So from there, I started to say, if some people don't know about some things that I'm doing, why am I not sharing? At least I'll have it out there. And that ended up being: sure, I contribute to many open source communities, I can teach people how to get there. And then all of these things came along.
Sure, there's an ambassador program for the CNCF. Can I apply and see how I can explore the world from that space? It gives me a great opportunity. AWS Hero is kind of a different story, because you get picked, but I've been doing a lot of work with AWS, so that's why I was picked.
But what's really my interest here is to share more of what I have done, what I've heard about, what I have seen work better, in my opinion, and see if that helps anyone in the ecosystem.
Corey: It feels like you fall prey to the same trap that many of us do. Lord knows I still have to talk myself out of this, where I have this internalized perception that if I know something, it's commonly known; everyone basically knows this. But if I don't know something, that's the hard stuff, that's the interesting piece. And it's never true. Similarly, I've found that making a talk more broadly accessible to a larger number of people has never been the wrong decision, because everything is new to someone.
We live in a big world and a big space.
Ahmed: You nailed it. It's that concept in your head, where you draw a circle and keep circling around inside it: everyone knows this. But how many people do I actually talk to? Usually it's not a lot. You talk to someone, and, oh yeah, I know this feature.
You talk to someone else, I know this feature. But when you look at the bigger picture, a lot of people don't know. And sometimes, even if you talk about the same topic over and over, some people may hear it this time who didn't the other times. So it's worth sharing the same content in different ways, in different formats.
What I also have seen resonate with people is that I'm not selling anything. No one has to listen to me, because I'm solving a problem. It comes from me being an end user: I tried something, and I'm sharing my thoughts. I'm not pushing you to buy my software; I'm telling you my software works.
I tested something, I tried it, it works. You wanna use it, you wanna listen to it, you wanna correct me: it's community work. This is the feedback loop I'm going for. But also, what I learned from that is, by contributing, people might tell me, oh, but have you looked into this? And that opens a whole can of worms: oh, you know, I didn't look into this.
Let me look into it. And after many of my talks, people have said, sure, that was a great talk and all of that kind of stuff, but have you looked into that? We tried this before and it didn't work. And that struck me as a great conversation, to know: I didn't look into it. Let me try.
And then I start to look into it, and then it becomes a bigger thing.
Corey: It's why I love conferences and the rest of the community, where I'll talk to someone. The most recent thing that still irritates me that I went this long without knowing about is Atuin, A-T-U-I-N. It's an incredibly awesome shell history tool that syncs between machines.
I discovered it, installed it everywhere, and cannot go back to using the built-in nonsense, given how ephemeral most of my stuff tends to be. It's these weird things where, oh, well, why not build this tool? And there's downsides to that too. After I first built out my original overly wrought newsletter publication system, someone said, well, why didn't you just use Curated.co?
But why didn't I use what, now? Because I didn't know it existed. This would've been handy several months ago. Yeah, there are always ways to do it, and talking to people and getting the real skinny on what people think about how something works is incredibly valuable.
Ahmed: Yeah. This is usually how most of my learning has been over the years, and that got me to a space where, you know what? I experienced it, let's talk about it, let's see how it goes. Is it bad or good? It solved a problem in my own experience. And sometimes it's also interesting to share failures, because you wanna tell people what you tried that didn't work out. You don't want them to fall into that trap.
So either way, I'm learning something. And usually I try, as much as I can, to set my talks around an experience I have been through, a real story. I don't wanna just bring a topic and talk about it. Sure, I can talk about Kubernetes, I can talk about AWS, I can talk about anything, but I usually try to pick topics around a problem I tried to solve, or a situation I've been in. That gives me more, I don't wanna say credibility, but more of a sense that
I'm in it. Usually I don't bake much into the talk; I give them a real story about what exactly happened.
Corey: I mean, something I find is that documentation falls down terribly when it just tries to be a list of here's all the features, or here's an API reference. The thing I'm trying to do is never well documented in those, so I like experience reports. If I'm gonna build a to-do list app, to use an overdone example, great: I wanna know how you used the tool to do it, what your steps were, how it wound up looking as you drove to an outcome. I also deeply appreciate the community stuff, especially the Heroes folks in the AWS world, because you are not beholden to AWS the way an AWS employee is.
If an AWS employee talks about aspects of AWS being complete crap, they're likely not going to be an AWS employee for very long. Whereas the rest of the community, we talk about this, because it does have sharp edges. These things are painful. How do you split the difference there? Because on some level, it feels weird to go speak at a company's conference, use their platform, and then use that to drag them.
I mean, I have a personal policy of not making people regret inviting me to things, so I'm not gonna crap on them at their own conference. But I do sometimes feel like I have to strike a balance.
Ahmed: Yeah. The balance is always about being honest and showing what the real value of something is.
I always come on many social media platforms and say, that didn't work for me, that wasn't the right intention. And there are meetings and spaces for the things I should say directly. I've been saying for years, and a lot of people know this about me, that the AWS user experience has been clunky the whole time.
They didn't master it.
Corey: That is such a flattering way to put it.
Ahmed: Yeah. In a way, it's been ridiculous how many times I've seen, oh, I have to go all the way around. I go and talk with service teams sometimes and say, you know what, why do we have this three times on the same page?
Why are they reliable on one thing but not on something else? That's where the balance always comes in. I wanna give them feedback, and I want it to be critical, but I want it to be, I don't wanna say in a nice way, but honest feedback.
I don't wanna embrace something that other vendors have done for years and say, this is great. I would say, it's great you have this now, but...
Corey: What took so long? Yeah.
Ahmed: Yeah, it took a long time to get to something like that, but there are some innovations I have seen in spaces that deserve it.
And like I said, I'm not beholden to anything, because I don't sell anything, so I'm not obligated to any of that. If I say I don't like AWS, what's gonna happen? Am I not gonna use AWS in the next three years? Sure, yeah. I don't work for the company.
It's my honest opinion. I think that's what I should be doing, because I'm not doing this for AWS specifically. I know the intricacies of AWS in some cases, but I'm doing this for the people. If someone asks my opinion today: we have this debate all the time. Funny story, and you probably know this, because you have seen a lot.
When I talk to other people in the AWS container Hero space, I have a favorite: I favor EKS. Some others favor ECS. You should see the debate when we trash each other's services sometimes. Oh no, ECS is better than EKS.
I say, no, EKS is better, and all of that kind of stuff. At the end of the day, it's a fun situation, comparing things, but at least we have honest opinions about where exactly each use case fits. We laugh about it, we kid about it a lot, but if you ask me one day
which of the services to use, I'll tell you: if you are a small company, Kubernetes is probably not the right fit for you. Whether you're an AWS shop or not is irrelevant. If you are using containers, there have to be characteristics and criteria behind your decision. So there's the fun talk, all the jabs that we have, but there are also the technology decisions that you need to make.
And those are situation-based. It's not, hey, go all-in on containers and EKS, because that doesn't work. I have never seen a solution that fits all.
Corey: At Duckbill, we are building our product on top of ECS, which makes sense for our scale and current constraints. We have a path forward, and, boom,
surprise sponsorship. That's right, this show is sponsored by duckbillhq.com, my employer. We have a platform now. Rather than just handling the consulting side of the world, as we have historically, with contract negotiations for large entities, debating with AWS what the future might hold for both parties,
now we have software that we are systematizing part of this in. If that sounds relevant to what you're doing, please check us out at duckbillhq.com. And also, we are hiring. And also, also: Ahmed, one of the best parts about this timing is that yesterday, for the first time since it first came out, I spun up an EKS cluster, because I'm building a bunch of weird projects that I want to throw at a wall.
All of my customers use EKS in some way, shape, or form. It's time for me to use it, and it's gotten, from what I can tell, slightly better. It only took 10 minutes to spin up the EKS cluster, instead of the 25 it took when I tried several years ago. So it's improving bit by bit. What's your take on it?
Ahmed: Let's not talk about the startup time for the cluster.
That has been a dilemma for a while. Why does this take forever? I have seen the architecture for the control plane behind the scenes, and I still don't get why it takes so long, because others have done it and it seems to be working for them. So I'm not sure where this is coming from.
Corey: I'm certain there are reasons, and good reasons, for it. And honestly, how often do you spin up or tear down your production cluster? Oh, I don't, but that's kind of the point. In development, when I'm testing my infrastructure stuff, I want to smoke-test it in a test account, and that adds a tremendous burden to how long it takes to run through those tests.
Please fix it. That's why I care.
Ahmed: Yeah, exactly. I went through this use case, and I agree with you: how often do you recreate a Kubernetes cluster? Sure, not too often. But when I need it for testing, when I need to mimic something, when I'm doing a demo, I have to wait 10 minutes for a cluster to come up.
But let's talk about the good things. Let me tell you why I like Kubernetes in general: the generality of it. It's a common pattern across multiple cloud providers. I can get that flavor on AWS, I can get this flavor elsewhere, and providers change a lot of the things behind the scenes; sure, instances, how they talk to each other, all of that kind of stuff. But at the end of the day,
it's a Deployment, it's a Pod, it's a container. It's all shared. I can get a similar flavor onto my machine to test, which is relevant to what you're saying. So in my CI, I can spin up some Kubernetes thing on Docker, whatever ecosystem, to test something with.
The problem is when you start having disparities: solutions where some things work in the cloud but other things work locally. It becomes a tangling effect, and sometimes you cannot test the same stuff. So that's where we have to come up with mocks, mock APIs. Oh, now I have to call the EKS API,
now I have to get my Pod Identity, all of that kind of stuff. This is the sad part of it. The good part: there are more capabilities coming into services like this. One of the things I really embraced when I saw it is the concept of managed add-ons, or not add-ons,
they call them managed services now, whatever, but the concept of a managed Argo CD. I've seen Argo run there, and its other controllers might not be included, but that's a good option, and there are other community and open source projects that could run the same way. It takes the complexity of running the control plane out.
Love that. That's a great idea. It removes burden, if you know how to run things like that. So from that perspective, it seems like it's growing. Does it do a better job than other providers? Maybe, maybe not; I'd have to put them in a benchmark to see what they can do. There are use cases I hear about that are
obviously interesting to me. I dunno, do I need to run a hundred thousand nodes on a single cluster? Never. Never had this use case in my entire life.
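A minimal sketch of the spin-up-a-local-cluster-in-CI workflow Ahmed describes, using kind (Kubernetes in Docker); the cluster name and manifest path are hypothetical, and a k3d or minikube equivalent would look much the same:

```sh
# Create a throwaway cluster inside the CI runner's Docker daemon.
kind create cluster --name ci-smoke

# Apply the manifests under test and wait for them to become ready.
kubectl apply -f deploy/    # hypothetical path to the manifests
kubectl wait --for=condition=Available deployment --all --timeout=120s

# Run whatever smoke tests you have against the cluster, then tear it down.
kind delete cluster --name ci-smoke
```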
Corey: Well, yes. If you're trying to get the AWS bill high score, how else are you planning on doing it?
Ahmed: Sure. I don't have that kind of money to spend in my account.
Corey: Oh, good lord. You never do it with your own money. That's what employers are for, or someone else, or clients in the consulting world. I digress. I've been running a test Kubernetes cluster at home for two years. I had to build a conference talk out of it, because I mouthed off on the internet seven years ago and said no one's gonna care about Kubernetes.
So I ended up having to give a talk called Terrible Ideas in Kubernetes. But I found it useful, where now I can just write random nonsense, or find it somewhere on GitHub, in a container; I can throw it onto the cluster, access it over my Tailscale network, and I can just have a bunch of heterogeneous things running.
Unfortunately, I've become a victim of my own success, in that some of my team have seen some of the tools I've built that are useful for what they're doing, even though they're not coupled to client data, because, my God. Like, I built a great image manipulator for marketing purposes. It has some advantages.
And they said, great, can I get a copy of that? Alright, time to build an internal cluster for this, but we're gonna do it right. And by right, I mean enterprisey: we are doing GitOps the whole way with Argo CD, and we're using OpenTofu, because Terraform gets really weird at scale. It is a wildly overbuilt solution for a single container at the moment.
But something I found about these clusters is they never tend to stay single-tenant for long. You start adding things to it, and in the fullness of time it becomes really straightforward to start launching a bunch of internal corporate tools, which is handy. But given the teething exercise of getting up and running with it, I'm glad this is not critical path for anything right now,
'cause I don't know it well enough to support it.
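A minimal sketch of what the Argo CD side of that GitOps setup can look like: one Application resource pointing the cluster at a Git repo. The repo URL, names, and paths here are hypothetical, not Duckbill's actual configuration:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: internal-tools              # hypothetical app name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/infra.git   # hypothetical repo
    targetRevision: main
    path: apps/internal-tools
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD runs in
    namespace: internal-tools
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert out-of-band changes back to the Git state
```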
Ahmed: It is not. I recall the days when I thought, oh, do I have to manage an enterprise cluster for many use cases? Do I have to run all of the kubeadm commands, join instances together, and do all of that in the cloud? I was like, yeah, I don't wanna.
Corey: Because it depends on what laptop you ran it from.
Oh, and then you talk to someone: oh no, no, you're only supposed to run that from the CI/CD system. That would've been terrific to put on the warning label.
Ahmed: Yeah. So all of that solves a problem. The whole cloud solves the problem of not having to care about hardware. But do I run my own stuff?
Sure, yeah. My entire home automation system runs on K3s. That's where I'm running it.
Corey: I'm running K3s myself. Home Assistant?
Ahmed: Yeah, Home Assistant. Yeah.
Corey: So I do not have that running on the cluster, because that has gotten sizable and logic-heavy enough that I got an HP Mini PC that I put the whole thing on.
And again, with my wife, we've definitely proven the old trope that when you have a couple and one of you is really into IoT, the dynamic is that one of you loves the fact that you're living in the future, and the other one thinks the house is haunted. It's great.
Ahmed: Yeah, that's exactly where I'm at right now.
And I can tell you, I spent a few days where my wife would call me and say, hey, the house lights are not turning on. I was like, um, I dunno what's happening. And she said, all of a sudden it's not working. I was like, yeah, you probably have to restart the cluster somehow.
Go unplug it, plug it in again, and it'll work. Sure, yeah, that works. But now I'm haunted by my own clusters. I have to set them up, sometimes I have to upgrade them and do all of the work around them. But to be honest, it works. This is one of the things you said: setting it up the first time was a complex story. I had to get everything set up on my end.
I have seen how complex it is; I had to bake images and do all of that to get a small cluster running in my home. So imagine running this at enterprise scale: I'd have to bake my images, do all the work to get this. Now it's easier. Now it's a couple of clicks and you get a cluster up. That was a
cool thing to have.
Corey: I just discovered a few weeks ago, from the person who wrote Atuin, Ellie, that K3s has a built-in registry mirror that is distributed across the nodes, which is awesome. I can stop pulling the same image again and again, which is freaking wonderful.
Ahmed: I didn't know about that.
Corey: It's a server command-line argument: Spegel, S-P-E-G-E-L. It is built into K3s. You pass the server a command-line parameter and you're done.
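A sketch of what enabling this looks like, based on my reading of the K3s embedded registry mirror docs; verify the option names against your K3s version before relying on them:

```yaml
# /etc/rancher/k3s/config.yaml on each node
# (equivalent to passing `--embedded-registry` to `k3s server`)
embedded-registry: true

# /etc/rancher/k3s/registries.yaml
# The "*" entry asks K3s to mirror every upstream registry through the
# Spegel peer-to-peer cache, so nodes pull images from each other
# instead of re-fetching the same layers from upstream.
mirrors:
  "*":
```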
Ahmed: Okay, I actually will look this up. You see, that's why: I talk to you, I learn something, and I'm gonna go implement it. Probably my lights will not work tonight, but that's okay.
It's for the greater good.
Corey: That's another trick. I switched all of the light switches I was using over to Lutron, which is a little on the expensive side, but it's also what a lot of the smart home contractors build out. And what I love about them is, if you don't hook it up to anything, it acts like a normal light switch.
And when the system fails, it works like a normal light switch: you push the button, the lights turn on, and suddenly I get yelled at less.
Ahmed: I actually like this idea more. I ended up on that trend, though not for all of my lights, because I used the Hue lights before, and those switches were, let's say, interesting.
But the Lutron: this office is running on a Lutron switch, and it actually allows me to do three-way switches and different things, all of that kind of stuff, to mimic a normal environment. And when wifi is working and everything is stable, when the cloud is running, it runs beautifully from a remote perspective. You know, it's a balance between
what I need to do day to day and how I test things, depending on what I'm actually aiming for. To be honest, my cluster is running up there and I barely touch it. Most of the time I don't need to touch it, because it's working; the odd upgrade,
all of that kind of stuff, but it's running effectively, doing what I need it to do. And when I need to throw a container on it, that's the easy thing: just log into it, throw a container on, get out, and it's all working. So yeah.
Corey: All my config lives in a Git repo that I just run kubectl against for the home stuff.
I haven't GitOps'd it yet, but it means that when I tear down the cluster and rebuild it, as I have to every year and a half or so 'cause it gets wonky, it's pretty easy to get the stuff I care about back up and running.
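That repo-of-manifests pattern is about as simple as cluster management gets; a minimal sketch, with a hypothetical repo URL and layout:

```sh
# Rebuild day: clone the repo of manifests and re-apply everything.
git clone https://example.com/homelab-manifests.git   # hypothetical repo
cd homelab-manifests

# kubectl apply is idempotent, and -R recurses through per-app directories,
# so one command restores everything the repo declares.
kubectl apply -R -f apps/
```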
Ahmed: I have a backup, so I didn't GitOps it either. It's not what I would normally do for my cloud stuff.
Corey: Yeah. For the home stuff, it's like, I'll run my own RSS aggregator. Terrific. Awesome. If that breaks, it's annoying and I have to get it back up and running, but none of my business stuff goes down, nothing breaks. This is a different RSS system than the one that feeds the newsletter; that stuff all lives in AWS, like a grownup might put something there.
It's also strange sometimes to look at the monitoring for this and realize that my 11-node cluster, which is all plugged into the same power strip, has had better uptime for a month than GitHub Actions. And, all right, that's unfortunate, but okay. There's the other side of it too: when it goes down, no one's coming to save me.
I've gotta get it up and running myself, not just wait for a vendor to fix it for me. It's a mixed bag. I don't know that there's necessarily one right way for this; it's just the reality of it. We've forgotten, on some level, how to run hardware ourselves.
Corey: This episode is sponsored by my own company, Duckbill. Having trouble with your AWS bill? Perhaps it's time to renegotiate a contract with them. Maybe you're just wondering how to predict what's going on in the wide world of AWS. Well, that's where Duckbill comes in to help. Remember, you can't duck the Duckbill bill, which I am reliably informed by my business partner is absolutely not our motto.
To learn more, visit duckbillhq.com.
Ahmed: To be honest, it's a debate. It's a debate that I've been in for years. So let's talk about it from a software perspective in general, not just cloud. There are a lot of solutions out there. You'll see, oh, this solution provides me A, B, and C. Oh, but I can build A, B, C, and D.
Sure, you can build it, but the problem is not in the building anymore. The problem is a few years out: how you maintain it, how you keep it up and running, how you do all of that kind of work.
Corey: It used to be day two problems. Great. Now it's like day 50.
Ahmed: This is where I start to think about it from a business perspective too, just a business mindset.
What happens when the person who maintains a system leaves, or a team does, or the technology gets old, or you have to upgrade it, or you have to run instances,
Corey: Or you leave it running in AWS for more than a year, in which case now it's extended support, which costs six times more for the cluster. Why? Because screw you. Another year goes past, then they will blind-upgrade you at a time of their choosing, not yours. So you're just kicking the can down the road, gaining nothing by it, and paying through the nose for it. That doesn't seem the most customer-obsessed.
Ahmed: That part was always interesting. The extended support was always interesting, because they are trying to balance how to keep it sustainable for the team, or whatever team is managing it. That's my perspective. But also, 6x is a lot. It's a big number.
When you see something like that, you're always ready to ask why. Then again, if you look at a couple of clusters, you don't pay much. For example, the EKS control plane: you don't pay a lot of money for it. It's still money, but you don't pay a lot. I would have a heart attack, in some way, if that applied to my nodes or something like that, which is gonna be more complicated, and then we're gonna have a different conversation.
But again, the idea of build versus buy and all of that kind of stuff: it comes to a point where, sure, this system is unstable, but unstable in a way where you don't have to invest all of the resources keeping up the uptime, all of the operational stuff, all of those things. When I run something in my home, I understand the risk of it
not working. What I have built there is not critical to my life. My lights are still gonna turn on and off; it's just that I can't say, hey Alexa, turn on the light. That's all the impact, here or there. But when I run a system and I have to maintain it, there's a lot of operational overhead I have to spend maintaining the system, maintaining the infrastructure, maintaining everything behind that.
So when someone asks, should I build versus buy, I always tend to ask: what do you have? Are you building a gigantic system where you wanna do everything yourself? I would rather lean on open source. What I have seen in my career, in some way, is that the majority of tech problems have been
solved in some way. You're gonna find solutions that get you 50, 60% of the way. Don't rebuild it. If Kubernetes works for you 80%, don't try to rebuild it.
Corey: That's the rise-of-AI problem right there in a nutshell: well, I could just build my own custom solution out of spare parts, and that'll work. And it will,
mostly, for the exact use case you've defined and tested. As soon as the requirements change, now you have a problem to work with. And for weird back-of-house single-purpose apps, I do that all the time. But for stuff that matters, of course I'm paying vendors. I pay for Notion at work with a smile on my face, for a bunch of reasons that should be obvious.
I built my own newsletter publication system, and rebuilt it finally the way I wanted at the start of this year, with the lessons learned. This is the third generation of that system. It's much better than the previous generations, but I'm sure I'm gonna tear it down and replace it in a few years with something else.
And that's okay. It's understanding where the right approach is. Someone had a tweet a while back that it's interesting that Anthropic, as a company, uses ADP for payroll instead of vibe-coding their own. And the answer is: because they're not insane. They understand that you're not just paying for a piece of software; you're paying for understanding the nuances of
payroll law in a bunch of different jurisdictions in which you operate, keeping up to date with legal changes, and not having the Department of Labor kick your door off its hinges three days after you miss a payroll run. It's the right move. It's not just the software; it's understanding the business context of what you're trying to do.
Ahmed: A hundred percent. It's not because you can do it that you should do it. There's a big gap there. If you wanna try something out, if I wanna build something really quickly, have a demo, all of that kind of stuff, sure, go build it, try it, do whatever you want. But when I think about long-term sustainability, not everything is rebuildable.
Not because I can, I should. And this is where I stand: if you solved the problem, and I look at your solution and it fits, why not? Why not use it and add to it, or bake it into my way of thinking, rather than saying, oh, it doesn't do all of the ten things I need it to do?
But it does eight, it does seven, it does five. And there are ten other people looking at it. Because think about it: if I rely on software, pick any project in the open source ecosystem, it's never a single person using it. Many people use it, so someone has an interest in keeping it going. But if you build your own software,
it's your responsibility alone. That's the thing you have to maintain.
Corey: And saying yes to something means saying no to something else. Take your day job: you work at the New York Times. The New York Times does a bunch of different things. Officially, I suppose, you're a news outlet. Personally, I think your job is to employ history's greatest monster, whoever it is that organizes and runs the Connections puzzle every day, which vexes me like you would not freaking believe, because I don't think in the right
frame of reference sometimes. But through none of those perspectives is the answer to what the New York Times does: that's right, you're a database company, you should build your own database. No. That is not where the value is. You have a website, so you should build and run your own web servers? That's something a fool would say.
If I'm dealing with a bank: handling the money, ensuring compliance, making sure the money is there when you say it is, that's the key job. An airline's job is to get people and planes and cargo from place to place. It is not to push the boundaries of computer science. Companies tend to lose sight of this, especially when engineers, in some cases, get carried away with resume-driven development.
Ahmed: Yeah. That's where scope and focus and specialty are things anyone should look into. I would rather spend my time in my area of expertise, what I'm good at, how I'm doing it. If I'm an engineer and I wanna do a design, sure, I can do something quickly, but I don't necessarily have all of that understanding of how design works.
Again, not because I can, I should. You should get to a point where you have an SME; that's why they call it a subject matter expert, in some way. It's where people have studied things. If I'm asking for a serverless opinion, in any way, I'm gonna go ask a serverless person who dealt with this
in a real production system, who knows when it breaks, who knows what the bad things about it are. A lot of people, when we talk, say serverless is great, you can spin up. Sure, you can spin up. Have you ever run a serverless architecture that has a thousand functions? Let's talk about how you govern all of them working together.
That's a different story. The story I saw in a demo: sure, a Lambda function, or any function in any cloud system, runs in a matter of seconds; ship a container and it pops up. Great. Now let's talk about how to govern this in a bigger system. That's a different story. So again, back to my point: when I give a talk, I talk about my experience, I talk about the things that I explored,
because I have knowledge in that area, I have an understanding, rather than just losing focus on what I'm trying to do.
Corey: Right. I'm a former SRE, and I have a radically different perspective on environments depending on where they are. I was always considered one of the more stodgy, conservative, curmudgeonly types when it came to things like databases and file systems,
because mistakes there are going to show. But in my test environment? Ah, I have good backups of all the stuff I care about. Yeah, I'll go nuts. We'll do the bleeding-edge alpha thing. Oh, I guess that's why it's not GA yet. Whoops, roll back. I am fine with throwing things over the wall. I have a dedicated AWS account, with no access to data, with an EC2 box in it upon which runs Claude Code in full-permissions mode.
It has an instance role that gives it administrative access to the entire AWS environment. It is called Superfund, because it is both toxic and expensive, and the only worst-case blast radius here is that it spikes my AWS bill, which I can handle if that's what it comes down to. Honestly, if I call in begging for forgiveness to the AWS billing department, it'll become a company-wide holiday.
Ladies and gentlemen, we got him. It'll be great.
Ahmed: I think the separation of concerns works in a lot of cases. You need to understand where your competency is at.
Corey: right? Like, well, why not just do that in your production environment with like the customer database?
Because I'm not insane. Thank you for asking.
Ahmed: Exactly. That's why you should always think about segregation. Some people will say, yeah, let's ship it to production. Have you tested it before?
Oh, I tested it. But I can tell you a funny story. Some of the environments we shipped to, in multiple places, in multiple environments: people are like, have you tested that before? Yes, I tested it. Have you tested at the same scale? No, they only tested in one specific environment.
I was like, why? You should always test with the same parameters.
I worked at a company before where we were doing some deployments, and for one of them we said, oh, the code is ready, shipped, everything is cool. It was a small company, so all of a sudden we shipped it to production, and it doesn't work. It doesn't work because you don't have the same parameters you're running this with. You tested one thing with curl, shipping a single API call, and then you're running it in production with 50,000 requests.
Have you done this?
Corey: Oh, testing is real.
Ahmed: Yeah, meeting the standard expectations. All of these things are actually things that you have to think about, or...
Corey: Even canary deployments, because at some point of scale you cannot test at the same scale. Facebook gave a lot of talks about this back when they had reasonable approaches to things. Because they didn't have a spare billion users to run in the dev environment, they started off by having the developer run it themselves, then a small gated list of internal test users,
and then, effectively, something like nine concentric circles from the individual developer out to the entire Facebook user base. It was scaled and measured, and they monitored the heck out of it: ooh, we're starting to see errors increase, let's dial that back. Which worked super well for Facebook, and would work terribly for Stripe, because every 500 error they get means someone didn't get paid.
And that's a worse outcome than, oh, the cat picture didn't load fast enough.
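The mechanical core of that concentric-circle rollout is usually stable percentage bucketing; this is a generic sketch of the shape of it, not Facebook's actual system:

```python
import hashlib

def in_rollout(user_id: str, percent: int) -> bool:
    """Deterministically place a user in the first `percent` of 100 buckets.

    The hash is stable, so a user who is in the 5% ring stays included
    as the ring widens to 25% and then 100%.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent

# Widen the ring while watching error rates; dial back if they climb.
for ring in (1, 5, 25, 100):
    enabled = sum(in_rollout(f"user-{i}", ring) for i in range(10_000))
    print(f"{ring}% ring -> {enabled} of 10,000 users")
```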
Ahmed: Yeah, that's basically what they care about. Some companies care about every single transaction. And one of the things you mentioned here, for Stripe for example: we don't even know which transaction will fail.
It might be a very expensive transaction, and failing it will cause the business to lose a lot of money. But if someone hit like on my comment on Facebook and it didn't work, I'm not gonna get too offended.
Corey: Right. And there's reputational damage too. I try to buy your book for $30.
I send it and the transaction fails. Both you and I are gonna be upset with that, and, depending on how technical I am, and definitely for you, that's gonna flavor our impression of Stripe. That's not great. It has to work. It must. Whereas with other use cases, the restrictions are different. The product that we're building is for business back-of-house users.
Yes, we would like the site to be up when people are attempting to use it, but in the event that the site is down for an update for 20 minutes and has the maintenance page up, it is not disastrous. It is not critical path for serving their customers that day. And I can see a future in which that potentially changes,
in which case our approach to uptime and responsibility and maintenance windows will change with it; we're going to be very cognizant of the needs there. But not everything needs to be hyperscale. Not everything needs five nines of uptime. Understand the use case and the problem you're trying to solve for.
Rather than doing engineering fantasy, build the thing that fits, that solves the problem.
Ahmed: The thing about uptime sometimes strikes me, because of some use cases I have seen in my past. I'd talk to someone, in a consulting opportunity or anywhere, and ask, what's your system uptime?
And they'd say, oh, it's five nines. I was like, great. What are you using? And they'd say, this API, and that API, and that API, and ten to one, one of them is, oh, that's not five nines. And it's critical for you? Yeah, but my system is five nines. I was like,
how does that even work? The backing systems that you're using are not five nines, but you claim five nines. It's a very complex world.
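Ahmed's complaint is just arithmetic: for hard serial dependencies, availabilities multiply, so one 99.9% API caps the whole system well below five nines. A quick sketch with made-up numbers:

```python
# Made-up SLAs for illustration only.
own = 0.99999                       # the "five nines" claim
deps = [0.999, 0.9995, 0.9999]      # three hard third-party dependencies

composite = own
for a in deps:
    composite *= a                  # serial dependencies multiply

minutes_per_year = 365 * 24 * 60
print(f"composite availability: {composite:.5%}")                   # ~99.839%
print(f"expected downtime: {(1 - composite) * minutes_per_year:.0f} min/year")
# Five nines allows ~5 minutes of downtime per year; this stack allows ~846.
```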
Corey: If you sincerely care about five nines of uptime on a service, you need to be in multiple regions to do it. Arguably multiple providers, though I could be convinced otherwise. You cannot take third-party dependencies, because, look, I can test in my account what happens if none of my stuff can reach S3, hypothetically, but I cannot test
what happens if my third-party vendor dependencies can't reach S3, or their third-party dependencies can't reach S3. The only way you test that is by S3 going down, which fortunately is not a common occurrence. But if you're serious about this-must-stay-up-at-all-times, you have to own so much of that availability piece yourself.
Ahmed: Exactly. And you have to think about every piece that you put in. At the end of the day, to me, technology, like anything else, is an architecture, a puzzle. It's trade-offs somewhere. You get something, you lose something.
It's not, hey, you have to reach some global optimum. We all strive for the perfect system overall, but sure, you wanna build it, you wanna own it: get your data center, get your stuff, make sure you have redundancy, all of that kind of stuff, at a cost. Or it goes the other way, at a cost.
It's one or the other, and you build for what you need. That's exactly what you have at the end of the day. What I'm looking for is the right balance. I prefer sometimes to say, here are some Lego blocks, and you build whatever fits you.
You want it tall, you want it short, you want it wide, you want it large: what you need is what you build, based on your requirements. And here's the thing: your requirements usually change over time, because this is what we have seen. Today you built, as you said, for a single app; tomorrow, ten apps; the day after tomorrow, a hundred apps.
Do you have that scale? Do you want something you throw away and re-architect every couple of days, or do you want something pluggable? That's why finding the right patterns, what other people have found, matters. I think a lot of people spend their time on Kubernetes figuring out, how did they make that work?
Corey: It's a spectrum. Like anything, there are trade-offs. There are decisions you should make early on so they will not hamstring you in the future. Almost every hyperscaler has had this problem before: we're just gonna build a small thing for back-of-house stuff. Great. We're gonna use the local time zone for the database entries.
No, no, no, no, no. Talk to anyone who was at Google in the first decade and a half, use the phrase Google Standard Time, and watch them flinch, because that was very painful to fix after the fact. It makes everything so much harder. So everything I build these days, even my dev box, sits there running in UTC.
If I want to know what time it is locally, great, my user account can change that at the presentation layer. Awesome. But the system itself must be UTC.
Ahmed: That's where standards come in. UTC is a common frame that everyone agrees on, and you convert based on your needs. Because I'm in, whatever, New Jersey, and someone else is in California, and we all know what the pattern is. But if I start to ingest data into my database coming from all the local zones... I have seen it in apps: I go into an app, I look at it, and the last user visited this app tomorrow.
I was like, what? What is today? It's absurd, but someone visited the app tomorrow. How does this even happen? Because they inserted the time zone from their local machine into the system, and now you've got the wrong representation, because now it's this edge case.
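The convention they're both describing fits in a few lines; a minimal sketch in Python, where the zone names are just examples:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9

# Store and compare timestamps in UTC only.
created_at = datetime.now(timezone.utc)

# Convert at the presentation layer, per user preference.
for zone in ("America/New_York", "America/Los_Angeles"):
    print(zone, created_at.astimezone(ZoneInfo(zone)).isoformat())

# The anti-pattern: a naive local timestamp loses its offset, which is
# how "last visited: tomorrow" ends up in your database.
naive_local = datetime.now()  # naive, implicitly local; don't store this
```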
Corey: the X-K-C-D-R-S-S feed always goes into the past for whatever reason.
By about, I think eight and a half hours from your UTC time, something is not right, so it always pops up. I had to scroll back to find it yesterday when I was building out that EKS cluster with open tofu, I had Claude code do most of the, uh, to terraform slash tofu code, and I had to correct it where it's first, it put it in the 10.0.
Great. That's gonna conflict with something somewhere because everyone uses that. In this case, the staging environment. Great. Put it somewhere else. Then it built a bunch of, uh, for the subnets slash 20 fours right next to each other. No, because when you run more than 255 containers, which can happen, sorry, 253 That's right.
You've broadcast a network as well, that, that. Oh and the dns, which you can't get rid of inside of the subnet two, so that drops it. Two more. Great point being is at above a certain threshold, you have to renumber and that is painful build room to expand without having to move things around and you'll be much happier for it.
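The subnet arithmetic is easy to check with Python's standard ipaddress module. The CIDRs below are hypothetical, and note that AWS reserves five addresses per subnet (network, router, DNS, one for future use, broadcast), so a /24 there nets 251 usable:

```python
import ipaddress

# Raw /24 math: 256 addresses, minus network and broadcast.
subnet = ipaddress.ip_network("10.42.1.0/24")   # hypothetical CIDR
print(subnet.num_addresses)                     # 256
print(len(list(subnet.hosts())))                # 254 (251 in an AWS VPC subnet)

# Leaving room to grow: carve a few generous /20s out of the VPC
# instead of tightly packed /24s you'd have to renumber later.
vpc = ipaddress.ip_network("10.42.0.0/16")
for s in list(vpc.subnets(new_prefix=20))[:3]:
    print(s)   # 10.42.0.0/20, 10.42.16.0/20, 10.42.32.0/20
```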
Ahmed: Yeah, that is a problem I've had to solve on many occasions, and there's always a solution. It's, use a secondary CIDR. It's just like...
Corey: Just use IPv6. It's like, the grownups are speaking, please.
Ahmed: Sure. Yeah. This strikes me as one of the most bogus standards situations I have seen so far.
Not for any particular reason except that we can't yet agree on it. I think we are still living in the IPv4 world. We want more, but we cannot get more, and not everything supports IPv6, so we are stuck on IPv4. What's your take on it?
Corey: IPv6 is gonna save us all. They've been saying this since I was a child, and they'll be selling it to my grandkids as well.
It's not a problem that is top of mind for anyone except the salt-of-the-earth folks who keep the internet moving. So we are going to continue to ignore it until we can't anymore. And then, don't worry, the AI will fix it.
Ahmed: Yeah, AI will fix a lot of things, until the robots can't get IPs and we're all stuck in that world.
Corey: Exactly. So I wanna thank you for taking the time to speak with me. If people wanna learn more about what you're up to and how you view the world and catch your next conference talk, where's the best place for them to find you?
Ahmed: The best place is LinkedIn. That's where I usually stay most up to date. If anyone wants to hit me up over email, they will find my links and all of my contact info there.
If you need anything, or just want to check where I'm going next, like KubeCon in Amsterdam next month, where I'm doing all of the things: sure, yeah, LinkedIn is the way.
Corey: I wanna say that may be the first time in history the phrase "LinkedIn is the best place" has ever been uttered, because it is certainly a place. I'm there a lot more myself.
And we will, of course, put links to that in the show notes. Ahmed, thank you so much for being so generous with your time. I deeply appreciate it.
Ahmed: Thank you, Corey. I really appreciate you having me here, and I'm looking forward to seeing you at many in-person events.
Corey: Oh, I'll be there. Ahmed Bebars, principal engineer at the New York Times, AWS Container Hero, and Cloud Native ambassador.
We're just stacking up the accomplishments these days. I am Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry, insulting comment that I won't ever see, because that podcast platform of choice runs on somebody's home
K3s cluster.