AWS published its analysis of last week’s us-east-1 outage, and it raises more questions than it answers. I understand why they wanted to get it out when they did (late on a Friday during one of the worst cybersecurity flaps in years), to avoid excessive attention. But after reading it, I’m unconvinced that AWS itself truly understands the outage, or that we as customers know what to do about it.
Some of the concerns are relatively banal. Services that the write-up mentions as impacted showed green on the status page both during and after the incident. AWS doesn’t appear to understand every aspect of what went wrong, but they know it was triggered by an internal networking autoscaling event. As a result, they explicitly state that they’re going to stop autoscaling their internal network until that’s corrected. So on the one hand, good for them on being responsible. On the other … damn.
As we unfortunately discovered last week, you cannot have a multi-region failover strategy on AWS that features AWS’s us-east-1 region. Too many things apparently single-track through that region for you to be able to count on anything other than total control-plane failure when that region experiences a significant event. A clear example of this is Route 53’s impairment: “Route 53 APIs were impaired from 7:30 AM PST until 2:30 PM PST preventing customers from making changes to their DNS entries, but existing DNS entries and answers to DNS queries were not impacted during this event.” Read another way, “we didn’t violate our SLA, but if you were using Route 53 for DNS, you could make no changes to where traffic was directed for seven hours.” As of this writing, Amazon.com itself doesn’t use Route 53 for its public DNS, choosing instead to use both UltraDNS and Oracle’s Dyn. Yes, the same Oracle they castigate on stage from time to time. I’ve yet to hear of a single disaster recovery plan that would survive intact if you could make no DNS changes during an event.
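To make the control-plane/data-plane distinction concrete, here’s a minimal sketch of the kind of Route 53 record change a failover plan depends on. The zone ID, hostname, and IP are hypothetical; existing lookups kept resolving during the event, but a call like this would have failed for those seven hours.

```python
# Sketch: the kind of Route 53 control-plane call that was impaired for
# seven hours. Zone ID, record name, and IP below are hypothetical.

def failover_change_batch(record_name: str, new_ip: str, ttl: int = 60) -> dict:
    """Build the ChangeBatch that repoints an A record -- a control-plane
    operation, which is exactly what was unavailable during the event."""
    return {
        "Comment": "DR failover: repoint traffic at the standby region",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": new_ip}],
                },
            }
        ],
    }

# With boto3 this would be submitted roughly as follows (needs real
# credentials and a real zone, so it's left commented out):
# import boto3
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="Z0000000EXAMPLE",  # hypothetical hosted zone ID
#     ChangeBatch=failover_change_batch("app.example.com.", "203.0.113.10"),
# )
```

If your runbook’s first step is “repoint DNS at the standby region,” every step after it was blocked until the Route 53 APIs came back.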
“Don’t use Route 53 for public records” is an unfortunate takeaway we’re left with from this experience.
Let’s also address the giant problem in the room that exists in the form of AWS SSO, or “Single Sign On.” The “Single” in the name is a heck of a clue; you can configure it in exactly one region. From their documentation comes this gem: “AWS Organizations only supports one AWS SSO Region at a time. If you want to make AWS SSO available in a different Region, you must first delete your current AWS SSO configuration. Switching to a different Region also changes the URL for the user portal.”
To frame that slightly differently, if there’s an outage in the region that contains your SSO configuration, you’d better have another way into the account if you’d like to do anything in your cloud environment.
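One common “another way in” is a break-glass IAM user with credentials kept in a sealed vault; IAM is a global service rather than one pinned to your SSO region. The policy below is my own deliberately blunt illustration of that pattern, not anything AWS prescribes, and the MFA guardrail is an assumption about how you’d want to constrain it.

```python
import json

# Sketch of a "break-glass" fallback identity: a long-lived IAM user
# (IAM is global, not tied to the SSO region) whose credentials live in
# a sealed vault. This policy is a blunt illustration, not a claim of
# least privilege; the MFA condition is a hypothetical guardrail.
BREAK_GLASS_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "BreakGlassAdmin",
            "Effect": "Allow",
            "Action": "*",
            "Resource": "*",
            # Require MFA even in an emergency.
            "Condition": {"Bool": {"aws:MultiFactorAuthPresent": "true"}},
        }
    ],
}

print(json.dumps(BREAK_GLASS_POLICY, indent=2))
```

The point isn’t this exact policy; it’s that if SSO lives in the impaired region, something that doesn’t live there has to be able to open the door.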
Liz Fong-Jones from Honeycomb reported significant issues with KMS and SSM during the event; the AWS analysis makes no mention of these services being impacted. Liz is far from the only person who noticed degradation of these services (she just happens to be one of the folks I trust implicitly when it comes to understanding what’s broken!), so I don’t believe that this is some sort of fever-dream or a weird expression of one company’s software architecture. I’m left with the unfortunate reality that AWS either does not know about or does not disclose all of its various service interdependencies.
DynamoDB and S3 gateway endpoints in subnets were impacted; some folks had to resort to using the dreaded Managed NAT Gateways, with their 4.5 cents per gigabyte data processing fee. For some, that added up to significant cost. If you were one of those customers, reach out to your AWS Account Manager for a concession on these charges. You shouldn’t have to eat fees you paid to work around a service degradation. If they say no, please let me know; I’d be very interested to hear how customer obsession plays out in the wake of this mess.
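To see why that fee stings, here’s some back-of-the-envelope arithmetic using the 4.5 cents per gigabyte processing fee above plus the NAT Gateway hourly charge (also 4.5 cents in us-east-1 as of this writing); the traffic volume is a hypothetical.

```python
# Back-of-the-envelope: what falling back from a (free) gateway endpoint
# to a Managed NAT Gateway costs. Rates are us-east-1 list prices as of
# this writing; the traffic volume below is hypothetical.
NAT_PROCESSING_PER_GB = 0.045  # data processing fee, per GB
NAT_HOURLY = 0.045             # per NAT gateway, per hour

def nat_fallback_cost(gb_processed: float, hours: float, gateways: int = 1) -> float:
    """Cost of pushing S3/DynamoDB traffic through NAT during the event."""
    return round(gb_processed * NAT_PROCESSING_PER_GB
                 + hours * NAT_HOURLY * gateways, 2)

# 20 TB pushed through one NAT gateway over a 7-hour impairment:
print(nat_fallback_cost(20_000, 7))  # 900.32
```

Nearly a thousand dollars to route around a degradation in a service you were already paying for is exactly the sort of thing a concession should cover.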
I’ve previously said that before you go multi-cloud you should go multi-region. I stand by that advice. However, it sure would be swell if AWS didn’t soak customers with ridiculous data transfer fees to move data between regions as well as between availability zones within the same region. To review: data from the internet into AWS is free; moving data between availability zones and regions starts at 2 cents per gigabyte and increases significantly from there. Data to the internet from AWS is significantly more expensive. Viewed through this lens, AWS’s exhortations to build applications across regions and availability zones are less an encouragement to ensure application durability and more of a ham-fisted sales pitch. Unfortunately there’s really no lesson to take from this; we’re stuck with the understanding that there is always a trade-off between cost and durability, and AWS is going to milk customers like cows to achieve significant reliability.
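The durability tax is easy to put a number on. A rough sketch, using the 2 cents per gigabyte starting rate quoted above (some region pairs cost more) and a hypothetical replication volume:

```python
# Rough cost of continuously replicating data to standby regions at the
# 2 cents/GB inter-region starting rate; the volume is hypothetical.
INTER_REGION_PER_GB = 0.02  # starting rate; some region pairs cost more

def monthly_replication_cost(gb_per_day: float, regions: int = 1) -> float:
    """Monthly transfer bill for replicating to N other regions."""
    return round(gb_per_day * 30 * INTER_REGION_PER_GB * regions, 2)

# Replicating 500 GB/day to one standby region:
print(monthly_replication_cost(500))  # 300.0
```

That $300 a month buys no new features and serves no customers; it’s purely the price of following AWS’s own multi-region advice.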
To be very clear on my position here: AWS does a hell of a lot better job than you or I would at running our own infrastructure. They’re fanatical about reliability and about protecting it. But there’s something about this outage and its analysis that really, really rubs me the wrong way. Trust is everything when it comes to cloud providers, and there’s frankly enough wrong with Amazon’s public outage analysis to make me question exactly how far it can be trusted.