Join me as I continue a new series called Whiteboard Confessional by examining an all-too-common problem: having to scale a database when it’s too late. In this episode, I touch upon the underlying reason many developers don’t think about their database until they’re forced to, what some of the primary drivers of latency are, the easiest (and priciest) way to scale a database, what you can do to avoid this whole problem altogether from the outset, my advice on how to save months of work down the road, how often this problem rears its ugly head in applications, and more.
Episode Show Notes & Transcript
About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.
Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semipolite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best name for your staging environment is “theory”. Because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.
But first… On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying. CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than replicating massive clusters that you have to care and feed for yourself, instead, you now get to focus on just storing data, treating it like you normally would other S3 data and not replicating it, storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.
So I’m going to deviate slightly from the format that I’ve established so far on these Friday morning whiteboard confessional stories, and talk instead about a pattern that has tripped me and others up more times than I care to remember. So it’s my naive hope that by venting about this for the next 10 minutes or so, I will eventually be able to encounter an environment where someone hasn’t made this particular mistake. And what mistake am I talking about? Well, as with so many terrifying architectural patterns, it goes back to databases. You decide that you’re going to write a small toy application. You’re probably not going to turn this into anything massive. And in all honesty, baby seals will probably get more hits than whatever application you’re about to build will. So you don’t really think too hard about what your database structure is going to look like. You spin up a database, you define the database endpoint inside the application, and you go about your merry way. Now, that’s great. Everything’s relatively happy, and everything we just described will work. But let’s say that you hit that edge or corner case where this app doesn’t fade away into obscurity. In fact, this turns out to have some legs; the thing that you’re building now has attained business viability or is at least seeing enough user traffic that it now has to worry about load.
So you start taking a look at this application because you get the worst possible bug report six to eight months later: it’s slow. Where do you start looking when something is slow? Well, personally, I start looking at the bar, because that is a terribly obnoxious problem to have to troubleshoot. There are so many different ways that latency can get injected into an application. You discover the person reporting the slowness is on the other side of the world with a satellite internet connection that they’re apparently trying to set up to the satellite with a tin can and a piece of very long string. There are a lot of failure states here that you get to start hunting down. The joys of latency hunting. But in many cases, the answer is going to come down to, oh, that database that you defined is now no longer up to the task. You’re starting to bottleneck on that database. Now, you can generally buy your way out of this problem by scaling up whatever database you’re using. Terrific, great, it turns out that you can just add more hardware, which in a time of cloud, of course, just means more money and a bit of downtime while you scale the thing up, but that gets you a little bit further down the road. Until the cycle begins to rinse and repeat, and it turns out that instances only get so large when it comes to powering databases. Also, they’re not exactly inexpensive. Now, I would name exact sizes of what those databases might look like. But this is AWS, they’re probably going to release at least five different instance families and sizes by the time I finish recording this, and it doesn’t get published until the end of the week. So instead, there is an alternative here, and it doesn’t take much from an engineering or design perspective when you’re building out one of these silly toy apps that will never have to scale. What is that fix, you might wonder? Terrific question. Let me tell you in just a minute.
In the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast.
So this is a pattern that increasingly, modern frameworks are recommending, but a number of them don’t. And I’m not going to name names, because I don’t want to wind up in a slap and tickle fight around which frameworks are good versus which frameworks are crappy. You can all make your own decisions around that. But the pattern that makes sense for this is even when you’re beginning with a toy app, go ahead and define two database endpoints, one for reads, and one for writes. Invariably, this is going to solve a whole host of problems with most database technologies. If you take a look at most applications, and yes, I know there are going to be exceptions to this, they tend to bottleneck on reads. If you have just a single database or database cluster, then all of the read traffic gets in the way of being able to write to that. That includes things that don’t actually need to be in line with the rest of what the application is doing. If you can have a read replica that’s used for business analytics, great. Your internal business teams can beat the living crap out of that database replica without damaging anything that’s in the critical path of serving users. And the writes can then go specifically to the primary node, which is generally where the writes have to happen. Now, yes, depending on your database technology, there’s going to be a whole story around multi-primary architectures, and how that’s going to wind up manifesting. But those tend to be a bit more edge case, and by the time you’re into those sorts of weeds, you know it already.
The point here is that if you look at most applications, they are read-bound. So being able to scale from a single primary to a whole bunch of replicas means that you can have those reads hitting a fleet of systems and, depending upon replication delays, be getting near real-time results from those nodes, without overburdening the single node that can take writes. You wouldn’t think that this would be that big of a deal when it comes to architectural patterns, but I’ve seen so many different environments and so many applications fall victim to this. “Well, it seems like an early optimization,” you might say, naively, as I once did. “What if we just make that change later when the time comes?” Well, by the time an application is sophisticated and dealing with enough traffic to the point where you are swamping the capacity of a modern database server, at that point, you don’t have one or two database queries within your application, there are hundreds or thousands. And because there was never any setup that differentiated between a read endpoint and a write endpoint, a lot of queries tend to assume that they can do both. In some cases in the same query. It means that there’s an awful lot of refactoring pain that’s going to come out of this.
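As a rough illustration of the read/write split described above, here is a minimal Python sketch of a router that always hands writes to the primary and spreads reads across replicas in round-robin order. The class name, method names, and hostnames are hypothetical and are not from the episode; a real setup would more likely lean on a driver, proxy, or managed reader endpoint, and reads served this way can lag behind the primary by the replication delay.

```python
import itertools


class EndpointRouter:
    """Hypothetical helper: one endpoint for writes, a rotating pool for reads."""

    def __init__(self, primary, replicas=()):
        self.primary = primary
        # With no replicas yet, reads simply fall back to the primary.
        self._read_pool = itertools.cycle(list(replicas) or [primary])

    def for_write(self):
        # Writes always go to the single node that accepts them.
        return self.primary

    def for_read(self):
        # Reads rotate through the replicas, one endpoint per call.
        return next(self._read_pool)


# Hostnames below are placeholders, not real endpoints.
router = EndpointRouter(
    "primary.db.example.internal",
    ["replica-1.db.example.internal", "replica-2.db.example.internal"],
)
```

Because callers ask for `for_read()` or `for_write()` rather than a single shared endpoint, adding replicas later only changes the list passed to the router, not the queries themselves.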
“Well hang on,” you might very reasonably say, “what if you don’t want to spin up twice as many database servers for those crappy toy apps, of which baby seals get more hits?” Great. I’m not suggesting you start spending more money on databases you don’t need. I don’t work for a cloud provider. I am not incentivized to sell you things like that. I would say in that scenario, great, you can have just that single database, because it usually will work. But if you refer to it in the application by two different endpoints that you set as variables at the beginning, one for your read endpoint and one for your write endpoint, it forces good behavior from the beginning and it saves you, in some cases, months of work down the road, trying to refactor this out in ways that are painful, difficult, and worst of all, from this perspective, expensive. Remember, I’m a cloud economist. My entire philosophy is around optimizing cloud spend. This is not something that is going to cost you a lot of money up front. But it is a form of technical debt that you very often don’t realize that you’re dealing with. I wish I could say this was just me looking for a random item from my past to talk about, but it’s not. This is something I have seen again, and again, and again, and again, to the point where I can almost quote chapter and verse of terrifying things that people tell me where this winds up being the root cause. People don’t do it to be malicious, they don’t do it out of ignorance. They do it by, in most cases, just assuming that this app is likely never going to get that big. And they’re right. It won’t. Until one day it does, and then I’m here ranting into a microphone yelling at you about proper database architectures. And given that we’ve already established that my favorite database in the world is Amazon’s Route 53, if I’m lecturing you about database architecture, something has gone very, very, very wrong. And yet, here we are.
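One way to picture the "two endpoints, one database" idea from the passage above is the Python sketch below. It is only an illustration under stated assumptions: the environment variable names `DB_WRITE_ENDPOINT` and `DB_READ_ENDPOINT`, the `toy_app.db` default, and the `users` table are all invented for this example, and SQLite stands in for whatever database the application actually uses.

```python
import os
import sqlite3
from typing import Mapping, NamedTuple


class Endpoints(NamedTuple):
    read: str
    write: str


def load_endpoints(env: Mapping[str, str] = os.environ) -> Endpoints:
    # DB_WRITE_ENDPOINT and DB_READ_ENDPOINT are hypothetical variable names.
    write = env.get("DB_WRITE_ENDPOINT", "toy_app.db")
    # Until a replica exists, reads default to the same database as writes.
    read = env.get("DB_READ_ENDPOINT", write)
    return Endpoints(read=read, write=write)


def add_user(endpoints: Endpoints, name: str) -> None:
    # Anything that modifies data connects through the write endpoint.
    conn = sqlite3.connect(endpoints.write)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("CREATE TABLE IF NOT EXISTS users (name TEXT)")
            conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
    finally:
        conn.close()


def list_users(endpoints: Endpoints) -> list:
    # Anything that only queries data connects through the read endpoint.
    conn = sqlite3.connect(endpoints.read)
    try:
        rows = conn.execute("SELECT name FROM users ORDER BY name").fetchall()
        return [row[0] for row in rows]
    finally:
        conn.close()
```

On day one both endpoints resolve to the same database, so nothing extra is provisioned. If a replica shows up later, setting the read variable is a configuration change rather than a hunt through hundreds of queries.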
Thank you for listening to me rant about proper separation of database endpoints. We’ll go back to real world stories next week. But ideally at this point, we have just saved someone from making a terrible, terrible mistake that they will not realize they’re making for months or years. I’m Cloud Economist Corey Quinn. This is the Whiteboard Confessional. Thank you for listening.
Announcer: This has been a HumblePod production.