Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best name for your staging environment is “theory”. Because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.
This episode is sponsored in part by ParkMyCloud, fellow worshipers at the altar of turn that [BLEEP] off. ParkMyCloud makes it easy for you to ensure you're using public cloud like the utility it's meant to be. Just like water and electricity, you pay for most cloud resources when they're turned on, whether or not you're using them. Just like water and electricity, keep them away from the other computers. Use ParkMyCloud to automatically identify and eliminate wasted cloud spend from idle, oversized, and unnecessary resources. It's easy to use and start reducing your cloud bills. Get started for free at parkmycloud.com/screaming.
Welcome to the AWS Morning Brief’s Whiteboard Confessional series. I am Cloud Economist Corey Quinn, and today's topic is going to be slightly challenging to talk about. One of the core tenets that we've always had around technology companies and working with SRE, or operations-type organizations is, full stop, you do not make fun of other people's downtime because today it's their downtime, and tomorrow it's yours. It's important. That's why we see the hashtag #HugOps on Twitter start to, well, not trend. It's not that well known but definitely happens fairly frequently when there's a well-publicized multi-hour outage that affects a company that people are familiar with.
So, what we're going to talk about is an outage that happened several weeks ago for IBM Cloud. I want to point out some failings on IBM’s part but this is in the quote-unquote, “Sober light of day.” They are not currently experiencing an outage. They've had ample time to make public statements about the cause of the outage. And I've had time to reflect a little bit on what message I want to carry forward, given that there are definitely lessons for the rest of us to learn. HugOps is important, but it only goes so far, and at some point, it's important to talk about the failings of large companies and their associated response to crises so the rest of us can learn.
Now, I'm about to dunk on them fairly hard, but I stand by the position that I'm taking, and I hope that it's interpreted in the constructive spirit that I intend it to. For background, IBM Cloud is IBM's purported hyperscale cloud offering. It was effectively stitched together from a variety of different acquisitions, most notable among them SoftLayer. I've had multiple consulting clients who are customers of IBM Cloud over the past few years, and their experience has been, to put it politely, a mixed bag. In practice, the invective that they would lob against it would be something worse.
Now, a month ago, something strange happened to IBM Cloud. Specifically, it went down. I don't mean that a service started having problems in a region. That tends to happen to every cloud provider, and it's important that we don't wind up beating them up unnecessarily for these things. No, IBM Cloud went down. And when I say that IBM Cloud went down, I mean, the entire thing effectively went off the internet. Their status page stopped working, for example. Every resource that people had inside of IBM Cloud was reportedly down. And this was relatively unheard of in the world of global cloud providers.
Azure and GCP don't have the same isolated network boundary per region that AWS has, but even in those cases, we tend to see far more frequently rolling outages rather than global outages affecting everything simultaneously. It's a bit uncommon. What's strange is that their status page was down. Every point of access you had into looking at what was going on with IBM Cloud was down. Their Twitter accounts fell silent, other than pre-scheduled promotional tweets that were set to go out. It looked for all the world like IBM had just decided to pack up early, turn everything off on the way out of the office, and enjoy the night off.
That obviously isn't what happened, but it was notable in that there was no communication for the first hour or so of the outage, and this was causing people to go more than a little bonkers. One of the pieces that was interesting to me, while this was happening, since it was impossible to get data out of this for anything substantive or authoritative, was I pulled up their marketing site. Now, the marketing site still worked—apparently, it does not live on top of IBM Cloud—but it listed a lot of their marquee customers and case studies. I went through a quick sampling, and American Airlines was the only site that had a big outage notification on the front of it. Everything else seemed to be working.
So, either the outage was not as widespread as people thought, or a lot of their marquee customers are only using them for specific components. Either one of those is compelling and interesting, but we don't have a whole lot of data to feed back into the system to draw reasonable conclusions. Their status page itself, as I mentioned, was down, and that's super bad. One of the early things you learn when running a large-scale system of any kind is that the thing that tells you, and the world, that you're down cannot have a dependency on any of the things that you are personally running. The AWS status page had this problem, somewhat hilariously, during the S3 outage a few years ago, when they had trouble updating what was going on due to that outage. I would imagine that's no longer the case, but one does wonder.
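The rule about status pages can be checked mechanically. What follows is only a minimal sketch, assuming you keep some inventory of which component depends on which: it walks that dependency graph and reports any monitored system the status page relies on, directly or indirectly. The component names, the `DEPENDENCIES` map, and the `MONITORED` set are hypothetical placeholders for illustration, not anything IBM or AWS actually runs.

```python
from collections import deque


def transitive_dependencies(graph, start):
    """Return every component reachable from `start` via dependency edges."""
    seen = set()
    queue = deque(graph.get(start, ()))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(graph.get(node, ()))
    return seen


def shared_fate(graph, status_component, monitored):
    """Return the monitored components the status page depends on,
    directly or indirectly. An empty set means no shared fate was found."""
    return transitive_dependencies(graph, status_component) & set(monitored)


# Hypothetical inventory: each key depends on the components in its list.
DEPENDENCIES = {
    "status-page": ["cdn", "status-db"],
    "status-db": ["cloud-network"],  # the hidden dependency we want to catch
    "cdn": [],
    "cloud-network": [],
}

# Hypothetical set of systems the status page is supposed to report on.
MONITORED = {"cloud-network", "cloud-compute"}

violations = shared_fate(DEPENDENCIES, "status-page", MONITORED)
for component in sorted(violations):
    print(f"status-page shares fate with {component}")
```

A check like this is only as good as the inventory behind it, so it complements, rather than replaces, actually hosting the status page somewhere else.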
And most damning, and the reason I bring this up, is that the following day, they posted the following analysis on their site: “IBM is focused on external network provider issues as the cause of the disruption of IBM Cloud services on Tuesday, June 9th. All services have been restored. A detailed root cause analysis is underway. An investigation shows an external network provider flooded the IBM Cloud network with incorrect routing, resulting in severe congestion of traffic, and impacting IBM Cloud services, and our data centers. Mitigation steps have been taken to prevent a recurrence. Root cause analysis has not identified any data loss or cybersecurity issues. End of message.”
Corey: This episode is sponsored in part by ChaosSearch. Now their name isn’t in all caps, so they’re definitely worth talking to. What is ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes, not days or weeks. Click. Boom. Done. ChaosSearch is for you if you’re trying to get a handle on processing multiple terabytes, or more, of log and event data per day, at a disruptive price. One more thing, for those of you that have been down this path of disappointment before, ChaosSearch is a fully managed solution that isn’t playing marketing games when they say “fully managed.” The data lives within your S3 buckets, and that’s really all you have to care about. No managing of servers, but also no data movement. Check them out at chaossearch.io and tell them Corey sent you. Watch for the wince when you say my name. That’s chaossearch.io.
Now, my problem with that is it focuses on the idea of a single root cause, which most of the folks in the human factors part of the internet will tell you is never a true statement. In fact, J. Paul Reed, a friend of the podcast, the newsletter, and occasionally me, will angrily shake his fist. He hates that almost as much as he does the five whys. But the point here is that if a single provider messing up their network announcements can take down 80 of your cloud data centers for hours, and you're unable to communicate with the outside world, yeah, that's obviously bad, but you have failed on a multitude of different levels at building a robust system that can withstand that kind of disruption.
Perhaps most damning of all from those customers that I mentioned earlier who have a presence on IBM Cloud, they were texting with their account managers, because the account managers had no access to any internal systems. Reportedly, the corporate VPN was not working. My thesis is, therefore, that given that everyone was remote, no one was on site, everything was single-tracking through a corporate VPN that itself was subject to this disruption, and now there was no one able to log in to send a message authoritatively on behalf of IBM. All of their traditional tweets have been done through an enterprise social media client called Sprinklr with no e in it because social media. Enterprise. Ehh. And surprisingly, all of the developer advocates that I know of—I checked their feed during this outage—they were completely silent.
So, it was clear that no one was authorized to communicate about the outage, and silence when a customer is in pain is one of the worst things you can do. Explain to them that you're aware of this, you're focusing on it, you will have updates to them on a cadence. That is what breeds trust. No one expects a system to never go down, but they do have significant expectations around what is going to be done in the wake of outages. So, things that I take away from this would be, if it were me, that it's important to have ways into the network for specific folks that aren't tracked through the same things that are potentially going to go down in the event of a network disruption; you need to have a crisis communications plan for social media and other formats; when the corporate VPN is down, you can't bottleneck through there; and most importantly, you absolutely cannot blame arbitrary third-party misconfiguration mistakes (which are, let's face it, what the internet is built on top of) for a global multi-hour outage if you expect to be taken seriously in the world of cloud providers.
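The point about giving customers updates on a cadence can be made concrete. This is a rough sketch, under the assumption that an incident record holds a start time and the timestamps of any public updates posted so far; the 30-minute cadence, the function name `is_update_overdue`, and the sample times are all hypothetical choices for illustration.

```python
from datetime import datetime, timedelta


def is_update_overdue(now, incident_start, update_times, cadence):
    """Return True if more than `cadence` has passed since the last public
    update, or since the incident began if nothing has been posted yet."""
    last_word = max(update_times) if update_times else incident_start
    return now - last_word > cadence


# Hypothetical promise: a public update at least every 30 minutes.
CADENCE = timedelta(minutes=30)

# Arbitrary sample incident, not taken from any real outage timeline.
start = datetime(2020, 1, 1, 12, 0)
updates = [datetime(2020, 1, 1, 12, 10)]

print(is_update_overdue(datetime(2020, 1, 1, 12, 30), start, updates, CADENCE))
print(is_update_overdue(datetime(2020, 1, 1, 13, 0), start, updates, CADENCE))
```

Whatever raises the alarm when an update is overdue has to run outside the systems that are down, for the same reason the status page does.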
In the wake of this, barring further communication, I have no choice but to nominate IBM Cloud for the Oxymoron of the Year. I know it seems harsh, but there are so many missteps and failings here that it is apparent that IBM is not willing to have a good-faith public conversation about this, instead hoping to sweep it under the carpet and hope that no one brings it up ever again. That's not how we improve. We all make mistakes. We all take outages. AWS will periodically have full-on analyses of what broke. Google has some of the best in the world that I've ever seen when they take outages for various things. Microsoft has turned explaining outages to businesses into an art form, and they have 40 years of experience doing it. They are polished almost to an annoying degree. From IBM we've gotten only silence, stonewalling, and blaming others, and viewed through the lens of picking a responsible cloud provider, I am seriously doubting IBM and their capability to service the market at this time, barring further self-reflection and the results of that communicated in public.
This has been the Whiteboard Confessional version of the AWS Morning Brief. I am Cloud Economist Corey Quinn. If you've enjoyed this podcast, please leave a five-star review on Apple Podcasts. Whereas if you've hated this podcast, you almost certainly worked for IBM, and are not allowed to use Apple products anyway.
Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on Twitter at @quinnypig and let me know what I should talk about next time.
Announcer: This has been a HumblePod production. Stay humble.