Status Paging You

Last week The Register did an analysis piece on the AWS Status Page that heavily quoted me. This is a good thing; I’m a big fan of seeing my name in print, and that goes double for a publication that played no small part in my decision to enter the technology field professionally over two decades ago. AWS then decided to muddy the waters, first by giving an annoyingly ill-considered quotation for the article, and then on Monday by releasing a shiny new status page.

Let me start by covering my position on cloud provider status pages.

Status Pages are Hard

I want to call out that AWS and its peers have a very hard problem: effectively communicating when a service is down to the public. “The public” in this case includes customers, those customers’ users, the media, and to some extent competitors. This inherently means that a single communications page has a variety of different audiences, many of whom want different things from it.

It’s under-appreciated by folks who haven’t had the opportunity to work at significant scale that running a service at global cloud provider scale is very much unlike running a WordPress blog. With a single computer, it’s pretty easy to figure out if it’s working or not. When we’re talking millions of machines, it stops being a question of “is the service up or down” and starts instead becoming a discussion around “how down is it?”

We know that us-east-1 is composed of at least a hundred different facilities that are stitched together to form six Availability Zones. If one of those buildings suddenly loses all power or internet connectivity, how should that be communicated? The services that lived on the computers inside of that building are of course down or unreachable, but… that only means that a subset of customers using a given service inside of one AZ are going to be affected. A significant majority of customers using that service in the same AZ aren’t going to notice that anything has happened whatsoever.
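To put rough numbers on “how down is it?”, here is a back-of-the-envelope sketch. It takes the “at least a hundred” facilities and six Availability Zones from above and then assumes, purely for illustration, that facilities are split evenly between AZs and that workloads are spread evenly across facilities. Neither assumption is an AWS figure, and the function name is made up for this example.

```python
def affected_fraction(total_facilities: float, failed_facilities: float) -> float:
    """Share of workloads hit when some facilities fail.

    Assumes workloads are spread evenly across facilities, which is an
    illustrative simplification rather than how any real region is laid out.
    """
    if total_facilities <= 0:
        raise ValueError("total_facilities must be positive")
    if not 0 <= failed_facilities <= total_facilities:
        raise ValueError("failed_facilities must be between 0 and total_facilities")
    return failed_facilities / total_facilities


REGION_FACILITIES = 100  # "at least a hundred", per the paragraph above
AZ_COUNT = 6             # six Availability Zones, per the paragraph above

# Assumption: facilities are divided evenly between AZs.
facilities_per_az = REGION_FACILITIES / AZ_COUNT

region_share = affected_fraction(REGION_FACILITIES, 1)
az_share = affected_fraction(facilities_per_az, 1)

print(f"One building down: {region_share:.1%} of the region, {az_share:.1%} of its AZ")
```

Under those assumptions a single dark building is a very bad day for a small slice of customers and a non-event for everyone else, which is exactly why one green or red dot per service per region struggles to tell the story.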

The AWS status page has received criticism for being a sea of green dots; let’s say that AWS’s critics had their way and it reflected every issue that they detected. A sea of red dots would be equally unhelpful, though significantly more alarming.

The Puzzling AWS Response

I get that an article that takes the framing of “your status page is garbage” isn’t likely to delight any provider, and I do think that The Register’s article was less than charitable to the challenges of doing a status page super well (can anyone at this scale?), but AWS’s response is just odd to me.

First, the AWS status page’s reputation is… not terrific. Historically it’s been slow to update, although in recent times the time it’s taken to reflect the situation for major outages has been notably better. It’s incredibly frustrating as a customer to be trapped in the “is it the provider, or is it something in my code” divide; both are causing outages, but the best way to resolve them differs significantly based upon which side of that question the answer lives within.

AWS offering a somewhat salty response as a formal PR quotation is just confounding:

“Third parties speculating on AWS availability almost always get it wrong… Just this week, Downdetector walked back its own false reporting by saying, ‘we do not believe there was a widespread service issue on AWS’s platform.’ The AWS Service Health Dashboard (SHD) is the only reliable source of AWS availability data, providing customers with timely and accurate information on AWS services and regions. It is not connected to our Service Level Agreements (SLAs) in any way. Our SHD provides more details and transparency on service availability than any other cloud provider.”

Let’s dissect that a bit.

First, “third parties speculating on AWS availability” is basically what we’re all left with when the status page has historically been glacially slow to update. One of the first things I do when I start seeing strange behavior is check Twitter to see if there’s something going on that’s larger than just my crappy code. I want to talk to other customers and see if they’re noticing anything funky too; there’s nothing wrong with that.

Second, I wouldn’t call what Downdetector does “false reporting.” They measure sentiment in a variety of places; when a bunch of people start asking Twitter “is AWS down?” then there’s a significant upswell in the same kind of sentiment you see when there’s an actual outage. One of the facets of AWS’s success is that “a major website or two going down” equates to “AWS is having issues,” just because AWS has become increasingly indistinguishable from the rest of the internet’s infrastructure. Pulling a “fake news” dismissal of a good faith analysis of reasonable signal just isn’t the level of professionalism I’ve come to expect from AWS’s corporate comms.
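To illustrate why a sentiment spike is a reasonable signal rather than “false reporting,” here is a minimal sketch of the general idea: compare the current volume of “is AWS down?” chatter against a recent baseline. This is not Downdetector’s actual method, which I have no visibility into; the function name, the threshold, and the per-minute counts are all hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_spike(history: Sequence[float], current: float, threshold: float = 3.0) -> bool:
    """Return True when `current` is more than `threshold` standard
    deviations above the mean of `history` (a plain z-score test).

    Hypothetical helper for illustration; not any vendor's real algorithm.
    """
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # A perfectly flat baseline: any increase at all stands out.
        return current > baseline
    return (current - baseline) / spread > threshold


# Made-up counts of "is AWS down?" posts per minute.
quiet_minutes = [4, 6, 5, 7, 5, 6]

print(is_spike(quiet_minutes, 7))   # an ordinary minute
print(is_spike(quiet_minutes, 60))  # everyone is suddenly asking
```

Note what this does and does not measure: a spike means a lot of people are asking the question at once, not that AWS is actually down. That is a limitation of the signal, not dishonesty on the part of whoever reports it.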

Third, referring to their own lethargic service health dashboard as “the only reliable source of AWS availability data” means that “reliable” is doing a lot of heavy lifting. There are many other sources that people consult. To that end…

Fourth, I agree with AWS that any customer who’s using the public status page as their SLA monitoring is desperately in need of a better monitoring / observability story. The idea that “we won’t update the status page because it’ll make us look bad to our enterprise customers” factors into AWS’s decision to publish an outage is something of a joke. It’s a rare customer that won’t notice an outage just because AWS didn’t mention it; the reason people care about these things in the first place is that their stuff stops working, and their own internal graphs show this in ways that are roughly visible from orbit. I’ve never yet seen AWS offer “well, we didn’t update our status page” as a reason for not granting SLA credits for an outage. If they did, I’d want a meeting invite just to join you in laughing them out of the room.

Fifth, isn’t this a weird thing to still be touchy about? Here in 2022, nobody worth paying attention to is using reliability as a determining factor in their selection of cloud provider. The top three, and arguably the next eight as well, all have reliability that’s going to far surpass what you’ll be able to achieve on-premises; “it’s fine, check the box and move on” is the baseline reaction to various reliability questions during procurement’s analysis of which direction to take.

Last, I find it a bit odd that they gave such a full-throated defense of their status page (something only its progenitor could love) just a business day or so before completely replacing it with something new.

The New AWS Status Page

If you visit status.aws.amazon.com you’ll see a shiny new status page; a rough approximation of the old one lives at stop.lying.cloud, which I’ll have to either update to reflect the new design or deprecate entirely. I do wonder how many non-shitpost scraping tools for the status page broke similarly when this was rolled out.

I think the modern refresh of the status page has been a long time coming (although it still uses the ancient box logo as a favicon; I really think that branding had more character but the war has been lost and it’s time for an update).

One thing that’s very odd and leads to a fair bit of my own skepticism about the new status page is that it presents itself radically differently based entirely upon whether or not you’re logged in to an AWS account (and presumably, whether there are things broken that affect that account, though as of my publication deadline we haven’t yet experienced a significant enough service disruption to meaningfully tease this apart). I suspect that this is going to be where we see the wheels fall off when outages start to occur. “There’s an outage on the AWS status page” vs. “no there isn’t” is likely to be fairly wild as far as fostering misunderstandings go, provided that there isn’t a clearly marked delineator for folks on both sides.

They’ve also done a fair bit with respect to modern web design; the HTML that lives at the heart of the page is significantly smaller and better constructed. Historically the status page’s raw HTML (no images included!) came in at over 9 megabytes. That is a RIDICULOUS amount of text, as evidenced by both how long it took to render and the sheer tedium of scrolling down to see the service you’re inquiring about.

They’ve also stopped hiding their event history (this is good!), exposed it to your entire organization if you’ve configured it appropriately (this is great!), and even given you links to their documentation on how to integrate with their Health API (this is fantastic except for the part where it’s only available to Business tier support or higher).
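For the curious, here is a rough sketch of what that Health API integration can look like. It assumes boto3’s `health` client and its `describe_events` call; the helper name `open_issues` is my own invention, and the function accepts any object with a compatible `describe_events` method so you can try it without credentials. Treat it as a starting point under those assumptions rather than a recommended implementation.

```python
from typing import Any, Dict, Iterator, List, Optional


def open_issues(health_client: Any, services: Optional[List[str]] = None) -> Iterator[Dict[str, Any]]:
    """Yield currently open 'issue' events from the AWS Health API.

    `health_client` is assumed to behave like boto3.client("health").
    `open_issues` itself is a hypothetical helper, not part of any AWS SDK.
    """
    event_filter: Dict[str, Any] = {
        "eventStatusCodes": ["open"],
        "eventTypeCategories": ["issue"],
    }
    if services:
        event_filter["services"] = services

    kwargs: Dict[str, Any] = {"filter": event_filter}
    while True:
        page = health_client.describe_events(**kwargs)
        yield from page.get("events", [])
        token = page.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token  # keep paging until the API stops handing back tokens


# Usage against a real account (requires boto3, credentials, and a
# Business-or-higher support plan; otherwise the call is rejected):
#
#   import boto3
#   client = boto3.client("health", region_name="us-east-1")
#   for event in open_issues(client, services=["EC2"]):
#       print(event["service"], event["region"], event["eventTypeCode"])
```

Taking the client as a parameter is a deliberate choice: it keeps the paging logic testable with a stub, which matters when the real API sits behind a support-plan paywall.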

I’ll suspend judgment until we get a feel for how outages are reflected, but my biggest challenge with the new status page is that by calling it “AWS Health Dashboard,” you’re going to confuse the living hell out of any healthcare AWS customers who themselves confuse “Health” for “Medical.”

My Takeaway

So that’s a fair bit of ground to have covered. I think I can best sum it all up as “status pages are hard,” “AWS gets it more right than they do wrong,” “AWS PR needs to not only speak up more, but do so intelligently,” and “the new status page is a win.” We’ll know more just as soon as a significant portion of an AWS region falls over.