Join me as I continue a new series called Whiteboard Confessional by exploring a time in a previous life when Amazon ElastiCache for Redis caused an outage that led to drama, what it was like to work for someone who can be described as a “metaphor-spewing poet,” how every event and issue makes sense in retrospect, why you should never schedule important maintenance on a weekend, how Amazon ElastiCache for Redis works, the four contributing factors that led to the outage in question, why blameless post mortems are only blameless if you have that kind of culture driven from the top, and more.
About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.
Corey Quinn: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best name for your staging environment is “theory”. Because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.
On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying: CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than replicating massive clusters that you have to care and feed for yourself, instead, you now get to focus on just storing data, treating it like you normally would other S3 data and not replicating it, storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.
When you walk through an airport (assuming that people still go to airports in the state of pandemic in which we live), you’ll see billboards saying, “I love my slow database, says no one ever.” This is an ad for Redis. And the unspoken implication is that everyone loves Redis. I do not. In honor of the recent release of Global Datastore for Amazon ElastiCache for Redis, today I’d like to talk about that time ElastiCache for Redis helped cause an outage that led to drama. This was a few years back, and I worked at a B2B company. B2B, of course, means business-to-business; we were not dealing direct-to-consumer. I was a different person then, and it was a different time. Specifically, the time was late one Sunday evening, and my phone rang. This was atypical because most people didn’t have that phone number. At this stage of my life, my default answer when my phone rang was, “Sorry, you have the wrong number.” If I wanted phone calls, I’d have taken out a personals ad. Even worse, when I answered the call, it was work. Because I ran the ops team, I was pretty judicious in turning off alerts for anything that wasn’t actively harming folks. If it wasn’t immediately actionable and causing trouble, then there was almost certainly an opportunity to be able to fix it later during business hours. So, the list of things that could wake me up was pretty small. As a result, this was the first time that I had been called out of hours during my tenure at this company, despite having spent over six months there at this point. So who could possibly be on the phone but my spineless coward of a boss, a man who spoke only in metaphor? We certainly weren’t social friends, because who can be friends with a person like that?
“What can I do for you?” “As the roses turn their faces to the sun, so my attention turned to a call from our CEO. There’s an incident.” My response was along the lines of, “I’m not sure what’s wrong with you, but I’m sure it’s got a long name, it is incredibly expensive to fix.” Then I hung up on him and dialed into the conference bridge. It seemed that a customer had attempted to log into our website recently and had gotten an error page, and this was causing some consternation. Now, if you’re used to a B2C or business-to-consumer environment, that sounds a bit nutty because you’ll potentially have millions of customers. If one person hits an error page, that’s not CEO level of engagement. One person getting that error is, sure it’s still not great, but it’s not the end of the world. I mean, Netflix doesn’t have an all hands on deck disaster meeting when one person has to restart a stream. In our case, though, we didn’t have millions of customers, we had about five and they were all very large businesses. So, when they said jump, we were already mid-air. I’m going to skip past the rest of that phone call in the evening because it’s much more instructive to talk about this with the clarity lent by the sober light of day the following morning. And the post mortem meeting that resulted from it. So, let’s talk about that. After this message from our sponsor.
In the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast.
So, in hindsight, what happened makes sense, but at the time, when you’re going through an incident, everything’s cloudy and you’re getting conflicting information, and it’s challenging to figure out exactly what the heck happened. As it turns out, there were several contributing factors, specifically four of them. And here’s the gist of what those four were.
Number one, we used Amazon ElastiCache for Redis. Really, we were kind of asking for trouble. Two, as tends to happen with managed services like this, there was a maintenance event that Amazon emailed us about. Given that we weren’t completely irresponsible, we braved the deluge of marketing to that email address, and I’d caught this and scheduled it in the maintenance calendar. In fact, we specifically were allowed to schedule when that maintenance took place. So, we scheduled it for a weekend. In hindsight: mistake. When you’re having maintenances like this happen, you want to make sure that they take place when there are people around to keep an eye on things.
Three, the maintenance was supposed to be invisible. The way that Amazon ElastiCache for Redis works is you have clusters, and you have a primary and you have a replica. The way that they do maintenances is they wind up updating the replica half of the cluster, they then fail the cluster over so the replica gets promoted to primary, then they update the old primary, which then hangs out as the replica. This had happened successfully, or so we thought, the day before on Saturday, a full day before our customer got the error page that started this exercise. What had really happened was that we’d misconfigured the application to point to the actual primary cluster member rather than the floating endpoint that always redirects to the current primary within that cluster. So, when the maintenance hit, and the primary then became the replica, we were suddenly unknowingly having the application talk to an instance that was read-only. So, it would still work for anything that was read based. It wasn’t until it tried to write something that all kinds of problems arose.
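For readers following along in the transcript, the failover behavior described above can be sketched as a toy simulation. This is not the ElastiCache API and not the application from the story. The `Node` and `Cluster` classes, the node names, and the keys are all hypothetical, and a shared dictionary stands in for replication. It only illustrates why a client pinned to one cluster member keeps reading fine after a failover but starts failing on writes, while a client that goes through the floating primary endpoint does not.

```python
class ReadOnlyError(Exception):
    """Raised when a write reaches a node that is currently a replica."""


class Node:
    def __init__(self, name, role, data):
        self.name = name
        self.role = role  # "primary" or "replica"
        self.data = data  # a shared dict stands in for replication

    def get(self, key):
        # Reads are served by primaries and replicas alike.
        return self.data.get(key)

    def set(self, key, value):
        # Only the current primary accepts writes.
        if self.role != "primary":
            raise ReadOnlyError(f"{self.name} is a read-only replica")
        self.data[key] = value


class Cluster:
    def __init__(self):
        shared = {}
        self.nodes = [
            Node("node-a", "primary", shared),
            Node("node-b", "replica", shared),
        ]

    def primary_endpoint(self):
        """Floating endpoint: resolves to whichever node is primary right now."""
        return next(n for n in self.nodes if n.role == "primary")

    def failover(self):
        """Swap roles, as the maintenance does once the replica is updated."""
        for n in self.nodes:
            n.role = "replica" if n.role == "primary" else "primary"


def try_write(node, key, value):
    """Attempt a write and report the outcome as a short string."""
    try:
        node.set(key, value)
        return "ok"
    except ReadOnlyError:
        return "read-only"


cluster = Cluster()
pinned = cluster.primary_endpoint()  # the misconfiguration: resolved once, kept forever

before = try_write(pinned, "session:1", "alice")
cluster.failover()
read_after = pinned.get("session:1")
pinned_write_after = try_write(pinned, "session:2", "bob")
floating_write_after = try_write(cluster.primary_endpoint(), "session:2", "bob")

print(before, read_after, pinned_write_after, floating_write_after)
```

After the failover, the pinned client can still read the earlier session but its write is rejected, while the write sent through the floating endpoint succeeds.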
And that led to contributing factor four. Because reads still worked, our standard monitoring didn’t pick this up. We didn’t have a synthetic test that simulated a login. As a result, the first indication that something was even slightly amiss showed up in the logs when the customer got that failed page 15 minutes before my metaphor-spewing poet boss called me. So, when explaining this to the business stakeholders during the post mortem, we got to educate them in the art of blamelessness, which you’d think would be a terrific opportunity for someone whose only real skill is spewing metaphor, but of course, he didn’t decide to step up to that plate. Again, terrible boss. So, someone from the product org was sitting there saying, “What you’re telling me is that someone on your team misconfigured—” Okay, slow down, Hasty Pudding. We’re not blaming anyone for this. There were contributing factors, not a root cause. And this is fundamentally a learning opportunity with a lot of areas for improvement. “Okay, so some unnamed engineer screwed up and—” And we went round and round. Normally, an effective boss would have stepped in here, but remember, he only spoke in metaphor. Defending his staff wasn’t speaking in metaphor, so he, of course, chose to remain silent. As it turns out, and as anyone who knows me can attest, I have a few different skills, but a skill that I’m terrible at is shutting up. It turns out that blameless post mortems are only blameless if you have that culture driven from the top, because everyone roundly agreed at the end of that meeting that the way that it devolved was certainly my fault.
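The missing synthetic test mentioned above might look something like the following minimal sketch. It is an assumption-laden stand-in, not what the team eventually built: a real probe would exercise the whole login flow, whereas this one only checks that a write followed by a read round-trips. The function `synthetic_write_check`, the key prefix, and the two stub stores are hypothetical names made up for illustration. The store only needs `set` and `get` methods. A real Redis client that returns bytes from `get` would need the comparison adjusted.

```python
import uuid


def synthetic_write_check(store, key_prefix="synthetic:login"):
    """Write a unique value and read it back; True only if both steps work.

    A read-only check would have passed during the incident, so the write
    is the part that matters here.
    """
    key = f"{key_prefix}:{uuid.uuid4()}"
    token = uuid.uuid4().hex
    try:
        store.set(key, token)
        return store.get(key) == token
    except Exception:
        # Any failure (including a read-only rejection) counts as unhealthy.
        return False


class HealthyStore:
    """Hypothetical stub: accepts reads and writes."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class ReadOnlyStore(HealthyStore):
    """Hypothetical stub: behaves like a replica and rejects writes."""

    def set(self, key, value):
        raise RuntimeError("write rejected: this node is read-only")


healthy_result = synthetic_write_check(HealthyStore())
read_only_result = synthetic_write_check(ReadOnlyStore())
print(healthy_result, read_only_result)
```

Run on a schedule and wired to paging, a check of this shape fails as soon as writes start being rejected, rather than a day later when a customer tries to log in.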
Now, let’s be clear. This was my team that was responsible for the care, feeding, and configuration of this application, and therefore it was my responsibility. Who had misconfigured it is not the relevant part of the story, and even now, I still maintain that it’s not. There were a number of mistakes that were made across the board, but the buck does stop with me. And there was a chain of events that led to this outage. Our monitoring was insufficient for something this sensitive. An error like that in the logs should have paged me before I got a walking metaphor calling me manually. We should have been testing that whole login flow with synthetic tests, and we should ideally have caught the misconfiguration of pointing the application to the cluster member rather than the cluster itself. But really, the biggest mistake we made across the board was almost certainly using Amazon ElastiCache for Redis. How using something else would have avoided this, I couldn’t possibly begin to say, but when in doubt, as is always a best practice, blame Amazon.
This has been another episode of the Whiteboard Confessional. I hope you’ve enjoyed this podcast. If so, please leave a five-star review on Apple Podcasts. If you’ve hated this podcast, please leave a five-star review on Apple Podcasts while remembering to check your Redis cluster endpoints.
Announcer: Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on Twitter at @quinnypig and let me know what I should talk about next time.
Announcer: This has been a HumblePod production. Stay humble.