Working on the Whiteboard from the Start with Tim Banks

Episode Summary

Tim Banks is back again! Now on board at The Duckbill Group as a Principal Cloud Economist, Tim joins Corey for a rare third appearance. Despite the unusual interview process via “Screaming” appearances, Tim is here to explain exactly what a Principal Cloud Economist is and does. Tim’s insights are very useful, which is why he is back as a team member! Tim and Corey go into the details of Tim’s new job title and how a background in engineering is fundamental to working well in that role. They also discuss Tim’s slightly diverging philosophy of building in resilience, security, and cost on “the whiteboard.” Tim explains how doing so keeps you from accruing debt that must be paid later, and why that practice matters for cost optimization, observability, and more. Tune in for the details!

Episode Show Notes & Transcript

About Tim
Tim’s tech career spans over 20 years through various sectors. Tim’s initial journey into tech started as a US Marine. Later, he left government contracting for the private sector, working both in large corporate environments and in small startups. While working in the private sector, he honed his skills in systems administration and operations for large Unix-based datastores.

Today, Tim leverages his years in operations, DevOps, and Site Reliability Engineering to advise and consult with clients in his current role. Tim is also a father of five children, as well as a competitive Brazilian Jiu-Jitsu practitioner. Currently, he is the reigning American National and 3-time Pan American Brazilian Jiu-Jitsu champion in his division.

Links:

Twitter: https://twitter.com/elchefe
The Duckbill Group: https://duckbillgroup.com

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is sponsored in part by Honeycomb. When production is running slow, it's hard to know where problems originate: is it your application code, users, or the underlying systems? I’ve got five bucks on DNS, personally. Why scroll through endless dashboards, while dealing with alert floods, going from tool to tool to tool that you employ, guessing at which puzzle pieces matter? Context switching and tool sprawl are slowly killing both your team and your business. You should care more about one of those than the other, which one is up to you. Drop the separate pillars and enter a world of getting one unified understanding of the one thing driving your business: production. With Honeycomb, you guess less and know more. Try it for free at Honeycomb.io/screaminginthecloud. Observability, it’s more than just hipster monitoring.

Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. Periodically, I have a whole bunch of guests come on up, second time. Now, it’s easy to take the naive approach of assuming that it’s because it’s easier for me to find a guest if I know them and don’t have to reach out to brand new people all the time. This is absolutely correct; I’m exceedingly lazy. But I don’t have too many folks on a third time, but that changes today.

My guest is Tim Banks. I’ve had him on the show twice before, both times it led to really interesting conversations around a wide variety of things. Since those episodes, Tim has taken the job as a principal cloud economist here at The Duckbill Group. Yes, that is probably the strangest interview process you can imagine, but here we are. Tim, thank you so much for joining me both on the show and in the business.

Tim: My pleasure, Corey. It was definitely an interesting interview process, you know, but I was glad to be here. So, I’m happy to be here a third time. I don’t know if you get a jacket like you do in Saturday Night Live, if you host, like, a fifth time, but we’ll see. Maybe it’s a vest. A cool vest would be nice.

Corey: We can come up with something. Effectively, it can be like reverse hangman where you wind up getting a vest and every time you come on after that you get a sleeve, then you get a second sleeve, and then you get a collar, and we can do all kinds of neat stuff.

Tim: I actually like that idea a lot.

Corey: So, I’m super excited to be able to have this conversation with you because I don’t normally talk a lot on this show about what cloud economics is because my guest usually is not as deep into the space as I am, and that’s fine; people should never be as deep into this space as I am, in the general sense, unless they work here. Awesome. But I do guest on other shows, and people ask me all kinds of questions about AWS billing and cloud economics, and that’s fine, it’s great, but they don’t ask the questions about the space in the same way that I would and the way that I think about it. So, it’s hard for me to interview myself. Now, I’m not saying I won’t try it someday, but it’s challenging. But today, I get to take the easy path out and talk to you about it. So Tim, what the hell is a principal cloud economist?

Tim: So, a principal cloud economist is a cloud computing expert, both in architecture and practice, who looks at cloud cost in the same way that a lot of folks look at cloud security, or cloud resilience, or cloud performance. So, the same engineering concerns you have about making sure that your API stays up all the time, or to make sure that you don’t have people that are able to escape containers or to make sure that you can have super, super low response times, are the same engineering fundamentals that I look at when I’m trying to find a way to reduce your AWS bill.

Corey: Okay. When we say cloud cost and cloud economics, the natural picture that leaps to mind is, “Oh, I get it. You’re an Excel jockey.” And sometimes, yeah, we all kind of play those roles, but what you’re talking about is something else entirely. You’re talking about engineering expertise.

And sure enough, if you look at the job postings we have for roles on the team from time to time, we have not yet hired anyone who does not have an engineering and architecture background. That seems odd to folks who do not spend a lot of time thinking about the AWS bill. I’m told those people are what is known as ‘happy.’ But here we are. Why do we care about the engineering aspect of any of this?

Tim: Well, I think first and foremost because what we’re doing in essence, is still engineering. People aren’t putting construction paper up on [laugh] AWS; sometimes they do put recipes up on there, but it still involves working on a computer, and writing code, and deploying it somewhere. So, to have that basic understanding of what it is that folks are doing on the platform, you have to have some engineering experience, first and foremost. Secondly, the fact of the matter is that most cost optimization, in my opinion, can be done on the whiteboard, before anything else, and really I think should be done on the whiteboard before anything else. And so the Excel aspect of it is always reactive. “We have now spent this much. How much was it? Where did it go?” And now we have to figure out where it went.

I like to figure out and get a ballpark on how much something is going to cost before I write the first line of code. I want to know, hey, we have a tier here, we’re using this kind of storage, it’s going to take this kind of instance types. Okay, well, I’ve got an idea of how much it’s going to cost. And I was like, “You know, that’s going to be expensive. Before we do anything, is there a way that we can reduce costs there?”

And so I’m reverse engineering that on already deployed workloads. Or when customers want to say, “Hey, we were thinking about doing this, and this is our proposed architecture,” I’m going to look at it and say, “Well, if you do this and this and this and this, you can save money.”
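Editor's note: the whiteboard estimate Tim describes can be sketched in a few lines of Python. This is a minimal illustration, not The Duckbill Group's method. The `Tier` class, the tier names, and every price below are hypothetical placeholders; real rates vary by provider, region, and instance type.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # common approximation: 8,760 hours / 12 months


@dataclass
class Tier:
    """One box on the whiteboard: identical instances plus attached storage."""
    name: str
    instance_count: int
    hourly_rate_usd: float      # hypothetical per-instance hourly price
    storage_gb: float
    storage_rate_usd_gb: float  # hypothetical per-GB monthly price

    def monthly_cost(self) -> float:
        compute = self.instance_count * self.hourly_rate_usd * HOURS_PER_MONTH
        storage = self.storage_gb * self.storage_rate_usd_gb
        return compute + storage


def estimate(tiers: list[Tier]) -> dict[str, float]:
    """Return the ballpark monthly cost per tier, plus a total."""
    costs = {t.name: round(t.monthly_cost(), 2) for t in tiers}
    costs["total"] = round(sum(t.monthly_cost() for t in tiers), 2)
    return costs


if __name__ == "__main__":
    # Hypothetical two-tier proposal; all numbers are made up for illustration.
    proposed = [
        Tier("web", 4, 0.10, 100, 0.08),
        Tier("database", 2, 0.50, 1000, 0.10),
    ]
    for name, cost in estimate(proposed).items():
        print(f"{name:>10}: ${cost:,.2f}/month")
```

Swapping one tier's instance count or rate and re-running gives the "is there a way we can reduce costs there?" comparison before any application code exists.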

Corey: So, it sounds like you and I have a bit of a philosophical disagreement in some ways. One of my recurring talking points has always been that, “Oh, by and large, application developers don’t need to think overly much about cloud cost. What they need to know generally fits on an index card.” It’s, okay, big things cost more than small things; if you turn something on, it will never get turned off and will bill you in perpetuity; data transfer has some weird stuff; and if you store data, you pay for data, like, that level of baseline understanding. When I’m trying to build something out my immediate thought is, great, is this thing possible?

Because A, I don’t always know that it is, and B, I’m super bad at computers so for me, it may absolutely not be, whereas you’re talking about baking cost assessments into the architecture as a day one type of approach, even when sketching ideas out on the whiteboard. I’m curious as to how we diverge there. Can you talk more about your philosophy?

Tim: Sure. And the reason I do that is because, as most folks that have an engineering background in cloud infrastructure will tell you, you want to build resilience in, on the whiteboard. You certainly want to build performance in, on the whiteboard, right? And security folks will tell you you want to do security on the whiteboard. Because those things are hard to fix after they’re deployed.

As soon as they’re deployed, without that, you now have technical debt. If you don’t consider cost optimization and cost efficiency on the whiteboard, and then you try and do it after it’s deployed, you not only have technical debt, you may have actual real debt.

Corey: One of the comments I tend to give a lot is that architecture and cost are the same thing in the world of cloud. And I think that we might be in violent agreement, as Liz Fong-Jones is fond of framing it, where I am acutely aware of aspects of cost and that does factor into how I build things on the whiteboard—let’s also be very clear, most of the things that I build are very small scale; the largest cost by a landslide is the time I spend building it—in practice, that’s an awful lot of environments; people are always more expensive than the AWS environment they’re working on. But instead, it’s about baking in the assumptions and making sure you’re not coming up with something that is going to just be wasteful and horrible out of the gate, and I guess part of that also is the fact that I am at a level of billing understanding that I sort of absorbed these concepts intrinsically. Because to me, there is no difference between cost and architecture in an environment like this. You’re right, there’s always an inherent trade-off between cost and durability. On the one hand, I don’t like that. On the other, it feels like it’s been true forever and I don’t see a way out of it.

Tim: It is inescapable. And it’s interesting because you talk about the level of an application developer or something like that, like what is your level of concern, but retroactively, we’ll go in for cost optimization houses—and I’ve done this as far back as when I was working at AWS as a TAM—and I’ll ask the question to an application developer or database administrator, and I’m like, “Why do you do this? Why do you have a string value for something that could be a Boolean?” And they’ll ask, “Well, what difference does that make?” Well, it makes a big difference when you’re talking about cycles for CPU.

You can reduce your CPU consumption on a database instance by changing a string to a Boolean, you need fewer instances, or you need a less powerful instance, or you need less memory. And now you can run a less expensive instance for your database architecture. Well, maybe for one node it’s not the biggest difference, but if you’re talking about something that’s multi-AZ and multi-node, I mean, that can be a significant amount of savings just by making one simple change.
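Editor's note: a rough way to see the string-versus-Boolean point is to compare the raw payload size of the same flag stored both ways. This Python sketch only counts bytes of the values themselves. Real database engines add their own per-value overhead, and the CPU effect Tim mentions depends on the engine and the queries, so treat the ratio as illustrative. The function names are made up for this example.

```python
def flag_bytes_as_string(values: list[bool]) -> int:
    """Bytes needed if each flag is stored as the UTF-8 text 'true' or 'false'."""
    return sum(len(("true" if v else "false").encode("utf-8")) for v in values)


def flag_bytes_as_boolean(values: list[bool]) -> int:
    """Bytes needed if each flag is stored as a single byte (0 or 1)."""
    return len(bytes(values))


if __name__ == "__main__":
    # One million hypothetical rows, half true and half false.
    flags = [True, False] * 500_000
    as_text = flag_bytes_as_string(flags)
    as_bool = flag_bytes_as_boolean(flags)
    print(f"as string:  {as_text:,} bytes")
    print(f"as boolean: {as_bool:,} bytes")
    print(f"ratio:      {as_text / as_bool:.1f}x")
```

Multiply that difference across replicas in a multi-AZ, multi-node deployment and the "one simple change" starts to show up in memory and instance sizing.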

Corey: And that might be the difference right there. I didn’t realize that, offhand. It makes sense if you think about it, but just realizing that I’ve made that mistake on one of my DynamoDB tables. It costs something like seven cents a month right now, so it’s not something I’m rushing to optimize, but you’re right, expand that out by a factor of a million or so, and we’re talking serious money, and then that sort of optimization makes an awful lot of sense. I think that my position on it is that when you’re building out something small scale as a demo or a proof of concept, spending time on optimizations like this is not the best use of anyone’s time or brain sweat, for lack of a better term. How do you wind up deciding when it’s time to focus on stuff like that?

Tim: Well, first, I will say that—I daresay that somewhere in the 80% of production workloads are just—were the POC, [laugh] right? Because, like, “It worked for this to get funding, let’s run it,” right?

Corey: Let they who does not have a DynamoDB table in production with the word ‘test’ or ‘dev’ in it cast the first stone.

Tim: It’s certainly not me. So, I understand how some of those decisions get made. And that’s why I think it’s better to think about it early. Because as I mentioned before, when you start something and say, “Hey, this works for now,” and you don’t give consideration to that in the future, or consideration for what it’s going to be like in the future, and when you start doing it, you’ll paint yourself into corners. That’s how you get something like static values put in somewhere, or that’s how you get something like, well, “We have to run this instance type because we didn’t build in the ability to be more microservice-based or stateless or anything like that.”

You’ve seen people that say, “Hey, we could save you a lot of money if you can move this thing off to a different tier.” And it’s like, “Well, that would be an extensive rewrite of code; that’d be very expensive.” I daresay that’s the main reason why most AS/400s are still being used right now is because it’s too expensive to rewrite the code.

Corey: Yeah, and there’s no AWS/400 that they can migrate to. Yet. Re:Invent is nigh.

Tim: So, I think that’s why, even at the very beginning, even if you were saying, “Well, this is something we will do later.” Don’t make it impossible for you to do later in your code. Don’t make it impossible for you to do later in your architecture. Make things as modular as possible, so that way you can say, “Hey”—later on down the road—“Oh, we can switch this instance type.” Or, “Here’s a new managed service that we can maybe save money on doing this.”

And you allow yourself to switch things out, or turn different knobs, or change the way you do things, and give yourself more options in the future, whether those options are for resilience, or those options are for security, or those options are for performance, or they’re for cost optimizations. If you make binding decisions earlier on, you’re going to have debt that’s going to build up at some point in the future, and then you’re going to have to pay the piper. Sometimes that piper is going to be AWS.

Corey: One thing that I think gets lost in a lot of conversations about cloud economics—because I know that it happened to me when I first started this place—where I am planning to basically go out and be the world’s leading expert in AWS cost analysis and understanding and optimization. Great. Then I went out into the world and started doing some of my first engagements, and they looked a lot less like far-future cost attribution projections and a lot more like, “What’s a reserved instance?” And, “We haven’t bought any of those in 18 months.” And, “Oh, yeah, we shut down an entire project six months ago. We should probably delete all the resources, huh?”

The stuff that I was preparing for at the high end of the maturity curve are great and useful and terrific to have conversations about in some very nuanced depth, but very often there’s a walk before you can run style of conversation where, okay, let’s do the easy stuff first before we start writing a whole bunch of bespoke internal stuff that maps your business needs to the AWS bill. How do you, I guess, reconcile those things where you’re on the one hand, you see the easy stuff and on the other, you see some of the just the absolutely challenging, very hard, five-years-of-engineering-effort-style problems on the other?

Tim: Well, it’s interesting because I’ve seen one customer very recently who has brilliant analyses as to their cost; just well-charted, well-tagged, well-documented, well—you know, everything is diagrammed quite nicely and everything like that, and they’re very, very aware of their costs, but they leave test instances running all weekend, you know, and their associated volumes and things like that. And that’s a very easy thing to fix. That is a very, very low-hanging fruit. And so sometimes, you just have to look at where they’re spending their efforts where sometimes they do spend so much time chasing those hard to do things because they are hard to do and they’re exciting in an engineering aspect, and then something as simple as, “Hey, how about we delete these old volumes?” It just isn’t there.

Or, “How about we switch your S3 bucket storage type?” Those are easy, low-hanging fruits, and you would be surprised how sometimes they just don’t get that. But at the same time, sometimes customers have, like, “Hey, we could knock this thing out, we knock this thing out,” because it’s Trusted Advisor. Every AI cost optimization recommendation you can get will tell you these five things to do, no matter who you are or where you are, but they don’t do the conceptual things like understanding some of the principles behind cost optimization and cost optimization architecture, and proactive cost optimization versus reactive cost optimization. So, you’re doing very conceptual education and conversations with folks rather than the, “Do these five things.” And I’ve not often found a customer that you have to do both on; it’s usually one or the other.

Corey: It’s funny that you made that specific reference to that example. One of my very first projects—not naming names. Generally, when it comes to things like this, you can tell stories or you can name names; I bias for stories—I was talking to a company who was convinced that their developer environments were incredibly overwrought, expensive, et cetera, and burning money. Okay, great. So, I talked about the idea of turning those things off at night or between test runs, deleting volumes to snapshot, and restore them on a schedule when people come in in the morning because all your developers sit in the same building in the same time zones. Great. They were super on board with the idea, and it was going to be a little bit of work, but all right, this was in the days before the EC2 Instance Scheduler, for example.

But first, let’s go ahead and do some analysis. This is one of those early engagements that really reinforced my idea of, yeah, before we start going too far down the rabbit hole, let’s double-check what’s going on in the account. Because periodically you encounter things that surprise people. Like, “What’s up with those Australia instances?” “Oh, we don’t have anything in that region.” “I believe you’re being sincere when you say this, however, the API generally doesn’t tell lies.”

So, that becomes a, oh, security incident time. But looking at this, they were right; they had some fairly sizable developer instances that were running all the time, but doing some analysis, their developer environment was 3% of their bill at the time and they hadn’t bought RIs in a year-and-a-half. And looking at what they were doing, there was so much easier stuff that they could do to generate significant savings without running the potential of turning a developer environment off at night in the middle of an incident or something like that. Given the risk factor and effort, it was easier to just do the easy stuff, then do another pass and look at the deep stuff. And to be clear, they weren’t lying to me; they weren’t wrong.

Back when they started building this stuff out, their developer environments were significantly large and were a significant portion of their spend. And then they hit product-market fit, and suddenly their production environment had to scale significantly in a short period of time. Which, yay, cloud. It’s good at that. Then it just became such a small portion that developer environments weren’t really a thing. But the narrative internally doesn’t get updated very often because once people learn something, they don’t go back to relearn whether or not it’s still true. It’s a constant mistake; I make it myself frequently.
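Editor's note: the comparison in this story can be put into numbers. The 3% developer-environment share comes from the conversation. Everything else in this Python sketch (the monthly bill, the off-hours schedule, the share of spend covered by reserved instances, and the discount) is a hypothetical assumption chosen only to show the shape of the analysis.

```python
def max_savings_from_scheduling(monthly_bill, dev_share, off_hours_fraction):
    """Upper bound on savings from turning dev environments off when idle."""
    return monthly_bill * dev_share * off_hours_fraction


def savings_from_commitment(monthly_bill, covered_share, discount):
    """Savings from covering part of the bill with reserved instances."""
    return monthly_bill * covered_share * discount


if __name__ == "__main__":
    bill = 100_000           # hypothetical monthly bill in USD
    dev_share = 0.03         # from the story: dev was 3% of the bill
    # Assumed schedule: on for 12 hours on 5 weekdays, off otherwise.
    off_fraction = (168 - 5 * 12) / 168
    covered = 0.5            # hypothetical share of spend covered by RIs
    discount = 0.30          # hypothetical RI discount versus on-demand

    scheduling = max_savings_from_scheduling(bill, dev_share, off_fraction)
    commitment = savings_from_commitment(bill, covered, discount)
    print(f"dev scheduling, at most: ${scheduling:,.0f}/month")
    print(f"reserved instances:      ${commitment:,.0f}/month")
```

Under these assumptions the scheduling project tops out at roughly $1,900 a month, while the purchase that needs no engineering work is worth about $15,000 a month, which is the ordering Corey describes.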

Tim: I think it’s interesting, there are things that we really need to put into buckets as far as what’s an engineering effort and what’s an administrative effort. And when I say ‘administrative effort,’ I mean if I can save money with a stroke of a pen, well, that’s going to be pretty easy, and that’s usually going to be RIs; that’s going to be EDPs, or PPAs or something like that, that don’t require engineering effort. It just requires administrative effort, I think RIs being the simplest ones. Like, “Oh, all I have to do is go in here and click these things four times and I’m going to save money?” “Well, let’s do that.”

And it’s surprising how often people don’t do that. But you still have to understand that, and whether it’s RIs or whether it’s a savings plan, it’s still a commitment of some kind, but if you are willing to make that commitment, you can save money with no engineering effort whatsoever. That’s almost free money.

Corey: So, much of what we do here comes down to psychology, in many ways, more than it does math. And a lot of times you’re right, everything you say is right, but in a large-scale environment, go ahead and click that button to buy the savings plan or the reserved instance, and that’s a $20 million purchase. And companies will stall for months trying to run a different series of analyses on this and what if this happens, what if that happens, and I get it because, “Yeah, I’m going to click this button that’s going to cost more money than I’ll make in my lifetime,” that’s a scary thing to do; I get it. But you’re going to spend the money, one way or the other, with the provider, and if you believe that number is too high, I get it; I am right there with you. Buy half of them right now and then you can talk about the rest until you get to a point of being comfortable with it.

Do it incrementally; it’s not an all-or-nothing, you-have-one-shot-to-make-the-buy decision. Take pieces out of it that make sense. You know you’re probably not going to turn off your database cluster that handles all of production in the next year, so go ahead and go for it; it saves some money. Do the thing that makes sense. And that doesn’t require deep-dive analytics; that requires, on some level, someone who’s seen a lot of these before who gets what customers are going through. And honestly, empathy, in many respects, becomes one of those powerful things that we can apply to our customer accounts.

Tim: Absolutely. I mean, people don’t understand that decision paralysis about making those commitments costs you money. You can spend months doing analysis, but those months doing analysis, you’re going to spend 30, 40, 50, 60, 70% more on your EC2 instances or other compute than you would otherwise, and that can be quite significant. But it’s one of those cases where we talk about psychology around perfect being the enemy of good. You don’t have to make the perfect purchase of RIs or savings plans and have that so tuned perfectly that you’re going to get one hundred percent utilization and zero—like, you don’t have to do that.

Just do something. Do a little bit. Like you said, buy half; buy anything; just something, and you’re going to save money. And then you can run analysis later on, while you’re saving money [laugh] and get a little better and tune it up a little more and get more analysis on and maybe fine-tune it, but you don’t actually ever need to have it down to the penny. Like, it never has to be that good.
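Editor's note: the cost of decision paralysis is easy to put a number on. The Python sketch below compares two purchase timelines over twelve months: waiting three months for a "perfect" analysis, versus buying half of the eventual commitment immediately and the rest after the same three months. The monthly spend, the discount, and the coverage levels are all hypothetical assumptions, and the function name is made up for this example.

```python
def spend_over_horizon(monthly_on_demand, discount, coverage_by_month):
    """Total spend, given the share of usage covered by a commitment each month."""
    total = 0.0
    for covered in coverage_by_month:
        total += monthly_on_demand * (1 - covered * discount)
    return total


if __name__ == "__main__":
    monthly = 200_000   # hypothetical on-demand spend per month, USD
    discount = 0.30     # hypothetical commitment discount
    target = 0.8        # hypothetical eventual coverage after analysis

    # Option A: analyze for three months, then buy everything.
    wait_then_buy = [0.0] * 3 + [target] * 9
    # Option B: buy half now, top up after the same three months.
    half_now = [target / 2] * 3 + [target] * 9

    a = spend_over_horizon(monthly, discount, wait_then_buy)
    b = spend_over_horizon(monthly, discount, half_now)
    print(f"wait, then buy: ${a:,.0f}")
    print(f"buy half now:   ${b:,.0f}")
    print(f"cost of waiting: ${a - b:,.0f}")
```

With these made-up inputs, the three months of waiting cost about $72,000 more than buying half up front, and both paths end at the same coverage.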

Corey: At some point, one of the value propositions we have for our customers has always been that we tell you when to stop focusing on saving money because there’s a theoretical cap of a hundred percent of the cloud bill that you can save, but you can make so much more than that by launching the right feature to the right market a little sooner; focus on that. Be responsible stewards of the money that’s invested with you, but by and large, as a general piece of guidance, at some point, stop cutting and go back to doing the thing that makes your company work. It’s not all about saving money at all costs for almost all of us. It is for us, but we’re sort of a special case.

Tim: Well, it’s a conversation I often have. It’s like, all right, are you trying to save money on AWS or are you trying to save money overall? So, if you’re going to spend $400,000 worth of engineering effort to save $10,000 on your AWS bill, that doesn’t make no sense. So—[laugh]—

Corey: Right. There has to be a strategic reason to do things like that—

Tim: Exactly.

Corey: —and make sure you understand the value of what you’re getting for this. One reason that we wind up charging the way that we do—and we’ve gotten questions on this for a while—has been that we charge a fixed fee for what we do on engagements. And similarly—people have asked this, but haven’t tied the two things together—you talk about cost optimization, but never cost-cutting. Why is that? Is that just a negative term?

And the answer has been no, they’re aligned. What we do focuses on what is best for the customer. Once that fixed fee is decided upon, every single thing that we say is what we would do if we were in the customer’s position. There are times we’ll look at what they have going on and say, “Ah, you really should spend more money here for resiliency, or durability,” or, “Okay, that is critical data that’s not being backed up. You should consider doing that.”

It’s why we don’t take percentages of things because, at that point, we’re not just going with the useful stuff, it’s, well we’re going to basically throw the entire kitchen sink at you. We had an early customer and I was talking to their AWS account manager about what we were going to be doing and their comment was, “Oh, saving money on AWS bills is great, make sure you check the EBS snapshots.” Yeah, I did that. They were spending 150 bucks a month on EBS snapshots, which is basically nothing. It’s one of those stories where if, in the course of an hour-long meeting, I can pay for that entire service, by putting a quarter on the table, I’m probably not going to talk about it barring [laugh] some extenuating circumstances.

Focus on the big things, not the things that worked in a different environment with a different account and different constraints. It’s hard to context switch like that, but it gets a lot easier when it is basically the entirety of what we do all day.

Tim: The difference I draw between cost optimization and cost-cutting is that cost optimization is ensuring that you’re not spending money unnecessarily, or that you’re maximizing your dollar. And so sometimes we get called in there, and we’re just validation for the measures they’ve already done. Like, “Your team is doing this exactly right. You’re doing the things you should be doing. We can nitpick if you want to; we’re going to save you $7 a year, but who cares about that? But y’all are doing what you should be doing. This is great. Going forward, you want to look for these things and look for these things and look for these things. We’re going to give you some more concepts so that you are cost-optimized in the future.” But it doesn’t necessarily mean that we have to cut your bill. Because if you’re already spending efficiently, you don’t need your bill cut; you’re already cost-optimized.

Corey: “Oh, we’re not going to nitpick on that, you’re mostly optimized there.” It’s like, “Yeah, that workload’s $140 million a year and rising; please, pick nits.” At which point, “Okay, great.” That’s the strategic reason to focus on something. But by and large, it comes down to understanding what the goals of clients are. I think that is widely misunderstood about what we do and how we do it.

The first question I always ask when someone does outreach of, “Hey, we’d like to talk about coming in here and doing a consulting engagement with us.” “Great.” I always like to ask the quote-unquote, “Foolish question” of, “Why do you care about the AWS bill?” And occasionally I’ll get people who look at me like I have two heads of, “Why wouldn’t I care about the AWS bill?” Because there are more important things to care about for the business, almost certainly.

Tim: One of the things I try and do, especially when we’re talking about cost optimization, especially trying to do something for the right now so they can do things going forward, it’s like, you know, all right, so if we cut this much from your bill—if you just do nothing else, but do reserved instances or buy a savings plan, right, you’re going to save enough money to hire four engineers. Think about what four engineers would do for your overall business? And that’s how I want you to frame it; I want you to look at what cost optimization is going to allow you to do in the future without costing you any more money. Or maybe you save a little more money and you can shift it; instead of paying for your AWS bill, maybe you can train your developers, maybe you can get more developers, maybe you can get some ProServ, maybe you can do whatever, buy newer computers for your people so they can do—whatever it is, right? We’re not saying that you no longer have to spend this money, but saying, “You can use this money to do something other than give it to Jeff Bezos.”

Corey: This episode is sponsored in part by Liquibase. If you’re anything like me, you’ve screwed up the database part of a deployment so severely that you’ve been banned from touching anything that remotely sounds like SQL, at at least three different companies. We’ve mostly got code deployments solved for, but when it comes to databases we basically rely on desperate hope, with a roll back plan of keeping our resumes up to date. It doesn’t have to be that way. Meet Liquibase. It is both an open source project and a commercial offering. Liquibase lets you track, modify, and automate database schema changes across almost any database, with guardrails to ensure you’ll still have a company left after you deploy the change. No matter where your database lives, Liquibase can help you solve your database deployment issues. Check them out today at liquibase.com. Offer does not apply to Route 53.

Corey: There was an article recently, as of the time of this recording, where Pinterest discussed what they had disclosed in one of their regulatory filings which was, over the next eight years, they have committed to pay AWS $3.2 billion. And in this article, they have the head of engineering talking to the reporter about how they’re thinking about these things, how they’re looking at things that are relevant to their business, and they’re talking about having a dedicated team that winds up doing a whole bunch of data analysis and running some analytics on all of these things, from piece to piece to piece. And that’s great. And I worry, on some level, that other companies are saying, “Oh, Pinterest is doing that. We should, too.” Yeah, for the course of this commitment, a 1% improvement is $32 million, so yeah, at that scale I’m going to hire a team of data scientists, too, look at these things. Your bill is $50,000 a month. Perhaps that’s not worth the effort you’re going to put into it, barring other things that contribute to it.

Tim: It’s interesting because we will get folks that will approach us that have small accounts—very small, small spend—and like, “Hey, can you come in and talk to us about this whatever.” And we can say very honestly, “Look, we could, but the amount of money we’re going to charge you is going to—it’s not going to be worth your while right now. You could probably get by on the automated recommendations, on the things that are already out there on the internet that everybody can do to optimize their bill, and then when you grow to a point where now saving 10% is somebody’s salary, that’s when it, kind of, becomes more critical.” And it’s hard to say what point that is in anyone’s business, but I can say sometimes, “Hey, you know what? That’s not really what you need to focus on.” If you need to save $100 a month on your AWS bill, and that’s critical, you’ve got other concerns that are not your AWS bill.
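To put rough numbers on the comparison Corey and Tim are drawing, here is a minimal sketch of the arithmetic. The $3.2 billion commitment and the $50,000 monthly bill come from the conversation; annualizing the smaller bill over 12 months and the helper name `savings` are assumptions made for illustration.

```python
# Back-of-the-envelope math on what a given savings percentage is worth.
# Dollar figures are the ones quoted in the conversation; everything else
# (function name, 12-month annualization) is an illustrative assumption.

def savings(total_spend: float, percent: float) -> float:
    """Dollar value of saving `percent` percent of `total_spend`."""
    return total_spend * percent / 100

pinterest_commitment = 3_200_000_000   # $3.2 billion over eight years
small_monthly_bill = 50_000            # $50,000 a month
small_annual_bill = small_monthly_bill * 12

pinterest_one_percent = savings(pinterest_commitment, 1)
small_one_percent = savings(small_annual_bill, 1)
small_ten_percent = savings(small_annual_bill, 10)

print(f"1% of the Pinterest commitment:      ${pinterest_one_percent:,.0f}")
print(f"1% of a $50k/month bill, per year:   ${small_one_percent:,.0f}")
print(f"10% of a $50k/month bill, per year:  ${small_ten_percent:,.0f}")
```

On these numbers a 1% improvement pays for a dedicated team at Pinterest's scale, while at $50,000 a month even a 10% improvement is closer to Tim's "somebody's salary" threshold than to a team budget.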

Corey: So, back when you were interviewing to work here, one of the areas of focus that you kept bringing up was the concept of observability, and my response to this was, “Ah, hell. Another one.” Because let’s be clear, Mike Julian—my business partner and our CEO—has written a book called Practical Monitoring, and apparently what we learned from this is as soon as you finish writing a book on the topic, you never want to talk about that topic ever again, which yeah, in hindsight makes sense. Why do you care about observability when you’re here to look at cloud costs?

Tim: Because cloud costs is another metric, just like you would use for performance, or resilience, or security. You do real-time monitoring to see if somebody has compromised the system, you do real-time monitoring to see if you have bad performance, if response times are too slow. You do real-time monitoring to know if something has gone down and then you need to make adjustments, or that the automated responses you have in response to that downtime are working. But cloud costs, you send somebody a report at the end of the month. Can you imagine, if you will—just for a second—if you got a downtime report at the end of the month, and then you can react to something that has gone down?

Or if you get a security report at the end of the month, and then you can react to the fact that somebody has your root keys? Or if you get [laugh] a report at the end of the month that said, “Hey, the CPU on this one was pegged. You should probably scale up.” That’s outrageous to anybody in this industry right now. But why do we accept that for cloud cost?

Corey: It’s worse than that. There are a number of startups that talk about, “Oh, real-time cloud cost monitoring.” Okay, the only way you’re going to achieve such a thing is if you build an API shim that interprets everything that you’re telling your cloud control plane to do, taking cost metrics out of it, and then passing it on to the actual cloud control plane. Otherwise, you’re talking about it showing up in the billing record in—ideally, eight hours; in practice, several days, or you’re talking about the CloudTrail events, which is not holistic but gives you some rough idea, but it’s also in some cases, 5 to 20 minutes delayed. There’s no real-time way to do this without significant disruption to what’s going on in your environment.

So, when I hear about, “Oh, we do real-time bill analysis.” Yeah, it feels—to be very direct—you don’t know enough about the problem space you’re working within to speak intelligently about it because anyone who’s played in this space for a while knows exactly how hard it is to get there. Now, I’ve talked to companies that have built real-time-ish systems that take that shim approach and act sort of as a metadata sidecar ersatz billing system that tracks all of this so they can wind up intercepting potentially very expensive configuration mistakes. And that’s great. That’s also a bit beyond a lot of folks today, but it’s where the industry is going. But there is no way to get there today, short of effectively intercepting all of those calls, in a way that is cohesive and makes sense. How do you square that circle given the complete lack of effective tooling?

Tim: Honestly, I’m going to point that right back at the cloud provider because they know how much you’re spending, real-time. They know exactly how much you spend in real-time. They’ve figured it out. They have the buckets, they have APIs for it internally. I’m sure they do; it would make no sense for them not to. Without giving anything away, I know that when I was at AWS, I knew how much they were spending, almost real-time.

Corey: That’s impressive. I wish that existed. My never having worked at AWS perspective on it is that they, of course, have the raw data effective immediately, or damn close to it, but the challenge for the billing system is distilling and summarizing and attributing all of that in a reasonable timeframe; it is an exabyte-scale problem. I’ve talked to folks there who have indicated it is comfortably north of a petabyte in raw data per day. And that was a couple of years ago, so one can only imagine as the footprint has increased, so has all of this.

I mean, the billing system is fundamentally magic from the outside. I’m not saying it’s good magic, but it is magic, and it’s something that is unappreciated, that every customer uses, and is one of those areas that doesn’t get the attention it deserves. Because, let’s be clear, here, we talk about observability; the bill is still the only thing that AWS offers that gives you a holistic overview of everything running in your account, in one place.

Tim: What I think is interesting is that you talk about this, the scale of the problem and that it makes it difficult to solve. At the same time, I can have a conversation with my partner about kitty litter, and then all of a sudden, I’m going to start getting ads about kitty litter within minutes. So, I feel like it’s possible to emit cost as a metric like you would CPU or disk. And if I’m going to look at who’s going to do that, I’m going to look right back at AWS. The fun part about that, though, is I know from AWS’s business model, that if that’s something they were to emit, it would also cost you, like, 25 cents per call, and then you would actually, like, triple your cloud costs just trying to figure out how much it costs you.

Corey: Only with 16 other billing dimensions because of course it would. And again, I’m talking about stuff, because of how I operate and how I think about this stuff, that is inherently corner case, or [vertex 00:31:39] case in many cases. But for the vast majority of folks, it’s not the, “Oh, you have this really weird data transfer paradigm between these two resources,” which yeah, that’s a problem that needs to be addressed in an awful lot of cases because data transfer pricing is bonkers, but instead it’s the, “Huh. You just spun up a big cluster that’s going to cost $20,000 a month.” You probably don’t need to wait a full day to flag that.

And you also can’t put this on the customer in the sense of, “Oh, just set some budget alarms, that’s great. That’s the first thing you should do in a new AWS account.” “Well, jackhole, I’ve done an awful lot of first things I’m supposed to do in an AWS account, in my dedicated test account for these sorts of things. It’s been four months, I’m not done yet with all of those first things I’m supposed to do.” It’s incredibly secure, increasingly expensive, and so far all it runs is a single EC2 instance that is mostly there just so that everything else doesn’t error out trying to divide by zero.

Tim: There are some things that are built-in. If I stand up an EC2 instance and it goes down, I’m going to get an alert that this instance terminated for some reason. It’s just going to show up informationally.

Corey: In the console. You’re not going to get called about it or paged about it, unless—

Tim: Right.

Corey: —you have something else in the business that will, like a boss that screams at you at two o’clock in the morning. This is why we have very little that’s production-facing here.

Tim: But if I know that alert exists somewhere in the console, that’s easy for me to write a trap for. That’s easy for me to write, say hey, I’m going to respond to that because this call is going to come out somewhere; it’s going to get emitted somewhere. I can now, as an engineer, write a very easy trap that says, “Hey, pop this in the Slack. Send an alert. Send a page.”

So, if I could emit a cost metric, and I could say, “Wow. Somebody has spun up this thing that’s going to cost X amount of money. Someone should get paged about this.” Because if they don’t get paged about this and we wait eight hours, that’s my month’s salary. And you would do that if your database server went down; you would do that if someone rooted that database server; you would do that if the database server was [bogging 00:33:48] you to scale up another one. So, why can’t you do that if that database server was all of a sudden costing you way more than you had calculated?
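The trap Tim describes can be sketched in a few lines, assuming a cost metric were actually emitted at launch time. That assumption is the whole point of his argument, since no such real-time metric is described as existing today. Everything here is hypothetical: the `LaunchEvent` record, the hourly rates, and the thresholds are made up for illustration, and the function returns a routing decision rather than calling any real Slack or paging API.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # common approximation: 8,760 hours a year / 12


@dataclass
class LaunchEvent:
    """Hypothetical record emitted when someone creates a resource."""
    account: str
    resource: str
    hourly_rate_usd: float  # assumed to be known at launch time
    count: int = 1


def projected_monthly_cost(event: LaunchEvent) -> float:
    """Project what the resource costs if it is left running all month."""
    return event.hourly_rate_usd * event.count * HOURS_PER_MONTH


def route_alert(event: LaunchEvent,
                slack_threshold: float = 500.0,
                page_threshold: float = 5_000.0) -> str:
    """Decide how loudly to react: 'ignore', 'slack', or 'page'."""
    cost = projected_monthly_cost(event)
    if cost >= page_threshold:
        return "page"
    if cost >= slack_threshold:
        return "slack"
    return "ignore"


# Illustrative events; the rates are invented, not real price-list entries.
tiny = LaunchEvent("test", "small-instance", hourly_rate_usd=0.01)
medium = LaunchEvent("dev", "database", hourly_rate_usd=1.00)
cluster = LaunchEvent("dev", "big-cluster", hourly_rate_usd=2.75, count=10)

for event in (tiny, medium, cluster):
    print(event.resource, round(projected_monthly_cost(event), 2),
          route_alert(event))
```

The design choice worth noting is that the decision is made on projected cost at launch, not on accrued cost after the fact, which is what lets it behave like a downtime or security alert instead of an end-of-month report.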

Corey: And there’s a lot of nuance here because what you’re talking about makes perfect sense for smaller-scale accounts, but even in some of the very large accounts where we’re talking hundreds of millions a year in spend, you can set compromised keys up on GitHub, put them in Pastebin, whatever, and then people start spinning up Bitcoin miners everywhere. Great. It takes a long time to materially move the needle on that level of spend; it gets lost in the background noise. I lose my mind when I wind up leaving a managed NAT gateway running and it costs me 70 bucks a month in my $5 a month test account. Yeah, but you realize you could basically buy an island and it gets lost in the AWS bill at some of the high watermarks for some of these larger accounts.

“Oh, someone spun up a cluster that’s going to cost $400,000 a year?” Yeah, do I need to re-explain to you what a data science team does? They light money on fire in return for questionable returns, as a general rule. You knew that when you hired them; leave them alone. Whereas someone in their developer account does this, yeah, you kind of want to flag that immediately.

It always comes down to rules and context. But I’d love to have some templates ready to go of, “I’m a starving student, please alert me anytime it looks like I might possibly exceed the free tier,” or better yet, “Don’t let me, and if I do, it’s on you and you eat the cost.” Conversely, it’s, “Yeah, this is a Netflix sub-account or whatnot. Maybe don’t bother me for anything whatsoever because freedom and responsibility is how we roll.” I imagine that’s what they do internally on a lot of their cloud costing stuff because freedom and responsibility is ingrained in their culture. It’s great. It’s the freedom from having to think about cloud bills and the responsibility for paying it, of the cloud bill.
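The "rules and context" templates Corey is asking for could look something like the following minimal sketch. The profile names, dollar limits, and the `decide` function are all hypothetical, chosen to mirror the examples in the conversation; this is not an existing AWS feature.

```python
# Hypothetical per-account templates: the same projected spend gets a
# different reaction depending on who the account belongs to.
PROFILES = {
    # "Don't let me" exceed the free tier: anything above $0 is blocked.
    "student": {"limit_usd": 0.0, "hard_stop": True},
    # A developer account: flag anything unusual, but do not block it.
    "developer": {"limit_usd": 100.0, "hard_stop": False},
    # A data science team: large clusters are expected, so the bar is high.
    "data-science": {"limit_usd": 50_000.0, "hard_stop": False},
    # "Freedom and responsibility": never bother the account owner.
    "hands-off": {"limit_usd": None, "hard_stop": False},
}


def decide(profile_name: str, projected_monthly_usd: float) -> str:
    """Return 'allow', 'alert', or 'block' for a projected monthly cost."""
    profile = PROFILES[profile_name]
    limit = profile["limit_usd"]
    if limit is None or projected_monthly_usd <= limit:
        return "allow"
    return "block" if profile["hard_stop"] else "alert"


# A cluster costing $400,000 a year is roughly $33,333 a month.
cluster_monthly = 400_000 / 12

for name in PROFILES:
    print(f"{name:>12}: {decide(name, cluster_monthly)}")
```

With these made-up limits, the same $400,000-a-year cluster is blocked for the student, flagged for the developer, and allowed without comment for the data science team and the hands-off sub-account.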

Tim: Yeah, we will get internally alerted if things are [laugh] up too long, and then we will actually get paged, and then our manager would get paged, [laugh] and it would go up the line, if you leave something running that’s too expensive for too long. So, there is a system there for it.

Corey: Oh, yeah. The internal AWS systems for employees are probably my least favorite AWS service, full stop. And I’ve seen things posted about it; I believe it’s called Isengard, for spinning up internal accounts and the rest—there’s a separate one, I think, called Conduit, but I digress—that you spin something up, and apparently if it doesn’t wind up—I don’t need you to comment on this because you worked there and confidentiality is super important, but to my understanding it’s, great, it has a whole bunch of formalized stuff like that and it solves for a whole lot of nifty features that bias for the way that AWS focuses on accounts and how they view security and the rest. And, “Oh, well, we couldn’t possibly ship this to customers because it’s not how they operate.” And that’s great.

My problem with this internal provisioning system is it isolates and insulates AWS employees from the real pain of working with multiple accounts as a customer. You don’t have to deal with the provisioning process of Control Tower or whatnot; you have your own internal thing. Eat your own dog food, gargle your own champagne, whatever it takes to wind up getting exposure to the pain that hits customers and suddenly you’ll see those things improve. I find that the best way to improve a product is to make the people building it live with the painful parts.

Tim: I think it’s interesting that the stance is, “Well, it’s not how the customers operate, and we wouldn’t want the customers to have to deal with this.” But at the same time, you have to open up, like, 100 accounts if you need more than a certain number of S3 buckets. So, they are very comfortable with burdening the customer with a lot of constraints, and they say, “Well, constraints drive innovation.” Certainly, this is a constraint that you could at least offer and let the customers innovate around that.

Corey: And at least define who the customer is. Because yeah, “I’m a Netflix sub-account,” is one story, “I’m a regulated bank,” is another story, and, “I’m a student in my dorm room, trying to learn how this whole cloud thing works,” is another story. From risk tolerance, from a data protection story, from a billing surprise story, from a, “I’m trying to learn what the hell this is, and all these other service offerings you keep talking to me about confuse the hell out of me; please streamline the experience.” There’s a whole universe of options and opportunity that isn’t being addressed here.

Tim: Well, I will say it very simply like this: we’re talking about a multi-trillion dollar company versus someone who, if their AWS bill is too high, they don’t pay rent; maybe they don’t eat; maybe they have other issues, they don’t—medical bill doesn’t get paid; child care doesn’t get paid. And if you’re going to tell me that this multi-trillion dollar company can’t solve for that so that doesn’t happen to that person and tells them, “Well, if you come in afterwards, after your bill gets there, maybe we can do something about it, but in the meantime, suffer through this.” That’s not ethical. Full stop.

Corey: There are a lot of things that AWS gets right, and I want to be clear that I’m not sitting here trying to cast blame and say that everything they’re doing is terrible. I feel like every time I talk about billing in any depth, I have to throw this disclaimer in. Ninety to ninety-five percent of what they do is awesome. It’s just the missing piece that is incredibly painful for customers, and that’s what I spend most of my time focusing on. It should not be interpreted to think that I hate the company.

I just want them to do better than they are, and what they’re doing now is pretty decent in most respects. I just want to fix the painful parts. Tim, thank you for joining me for a third time here. I’m certain I’ll have you back in the somewhat near future to talk about more aspects of this, but until then, where can people find you slash retain your services?

Tim: Well, you can find me on Twitter at @elchefe. If you want to retain my services, which you would be very, very happy to have, you can go to duckbillgroup.com and fill out a little questionnaire, and I will magically appear after an exchange of goods and services.

Corey: Make sure to reference Tim by name just so that we can make our sales team facepalm because they know what’s coming next. Tim, thank you so much for your time; it’s appreciated.

Tim: Thank you so much, Corey. I loved it.

Corey: Principal Cloud Economist here at The Duckbill Group, Tim Banks. I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice, wait at least eight hours—possibly as many as 48 to 72—and then leave a comment explaining what you didn’t like.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.
