
S3’s Durability Guarantees Aren’t What You Think

04.21.2021

For those unfamiliar, AWS’s Simple Storage Service (or “S3”) is an object store.

This is different from a file store or a block store because if you try to use it as one of those, a bunch of people will come out of the woodwork to yell at you for a variety of reasons.

Object storage is the most economical storage approach across multiple providers; the cloud collectively “took a vote” and object stores won.

Today, I want to highlight some of the guarantees AWS makes around your data.

Availability

Availability describes the idea of “can I access the data that’s stored in S3?” Except for that one time in 2017 when S3 went down in us-east-1, the availability of the service has been exceptional, far better than what AWS is held to by its published SLAs for the service.

A 99.9% uptime guarantee means that the service can be down for as much as 1m26s a day, 43m49s a month, or 8h45m56s per year before they’re in violation.
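If you want to check that arithmetic yourself, here is a minimal sketch. It assumes an average Gregorian year of 365.2425 days and treats a "month" as one twelfth of that; the names `downtime_budget` and `PERIODS` are mine, not anything AWS publishes.

```python
from datetime import timedelta

# Assumption: average Gregorian year; a "month" is one twelfth of it.
YEAR = timedelta(days=365.2425)
PERIODS = {"day": timedelta(days=1), "month": YEAR / 12, "year": YEAR}


def downtime_budget(availability_pct: float, period: timedelta) -> timedelta:
    """Longest total outage allowed within `period` at the given availability."""
    return period * (1 - availability_pct / 100)


for name, period in PERIODS.items():
    print(f"{name}: {downtime_budget(99.9, period)}")
```

Swap in a different percentage to see how quickly each extra nine shrinks the budget.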

In practice, if S3 were down anywhere near that much on a consistent basis, the internet would have largely made other plans for where data was going to live. This is all fine and good—because it’s not where I’m taking this post today.

Durability

Durability describes the idea of “I have stored my data inside of S3; how much of that data have you lost?”

This is understandably a very different risk model than availability. “I can’t reach my data” is one thing; “my data is lost and gone forever” is quite another.

Compared to the 99.9% availability guarantee, S3’s durability design target numbers are the stuff of legend / absurdity: 99.999999999%. Amazon states that “if you store 10,000,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000 years.”
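Amazon's example follows directly from the headline number. A rough sketch, assuming the figure is read as each object's independent chance of surviving one year (my reading, and my variable names):

```python
from fractions import Fraction

# S3's published design target, kept as an exact fraction to avoid float noise.
DURABILITY_PCT = Fraction("99.999999999")

# Assumption: durability is the per-object probability of surviving one year.
annual_loss_probability = 1 - DURABILITY_PCT / 100


def years_per_lost_object(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    expected_losses_per_year = object_count * annual_loss_probability
    return 1 / expected_losses_per_year


print(years_per_lost_object(10_000_000))
```

For 10,000,000 objects this comes out to 10,000 years, matching Amazon's statement.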

This is well beyond the risk level of “gravity randomly stops working.”

Let’s unpack that

A few things are worth noting.

First, this number presumably comes from spreading each object across many different hard drives; I suspect S3 uses something similar to Reed-Solomon erasure coding.

In effect, you can imagine it as “your file is mathematically broken into 100 parts; any 70 of those parts are sufficient to reconstitute your file.” Those parts are then strewn across different drives, locations, etc. Your data is only lost if more than 30 of those parts are gone at the same time, and the tiny probability of that many drives failing simultaneously is what gives that ridiculous durability number.
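To get a feel for why that yields so many nines, here is a toy calculation using the illustrative 100-part, any-70 scheme. The 1% per-part failure chance is a made-up number, and the model assumes every part fails independently; neither reflects how S3 is actually built.

```python
from math import comb


def loss_probability(total_parts: int, needed_parts: int, part_failure_p: float) -> float:
    """Chance that too few parts survive, assuming independent failures."""
    max_tolerable = total_parts - needed_parts
    # Binomial tail: sum over every count of failed parts beyond what we can tolerate.
    return sum(
        comb(total_parts, failed)
        * part_failure_p**failed
        * (1 - part_failure_p) ** (total_parts - failed)
        for failed in range(max_tolerable + 1, total_parts + 1)
    )


# Hypothetical: each part has a 1% chance of being lost before it can be repaired.
HYPOTHETICAL_PART_FAILURE_P = 0.01
print(loss_probability(100, 70, HYPOTHETICAL_PART_FAILURE_P))
```

Even with that pessimistic 1% figure, the result is far smaller than one in a hundred billion. The catch is the independence assumption: anything that takes out many drives at once breaks it.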

Second, AWS offers no formal SLA around data durability in S3. Okay, fine. It’s clear the service works; reliable reports of S3 losing data are nonexistent. Let me save you some time if you’re trying to negotiate your AWS contract to include custom SLAs: You won’t get them.

Third, and most relevant to your planning: their durability target applies to a bunch of different S3 storage classes, including One Zone-Infrequent Access. This storage class saves money by storing your data in only a single availability zone, and thus wouldn’t survive the destruction of that AZ.

Let’s go back and restate that: Moving all of the hard drives that contain your data closer together (in some cases, into a single building) does not impact the durability metrics. From this, we can surmise that disasters aren’t included in the calculation of S3’s durability.

What this means for you

In hindsight, this makes perfect sense.

The collapse of government and the takeover of an AWS region by roving gangs of bandits is orders of magnitude more likely than 0.000000001%. A bad push to the S3 control plane that doesn’t get caught and eats all of your data is also more likely.

Realistically, winning the lottery while being struck by a meteorite and lightning at the same time as you’re being abducted by aliens is more likely than that. All of the AWS regions can burn down eight times in a row and it’s still more likely.

What this means here in reality is that using S3 doesn’t remove the need for backups.

Let me put that more bluntly

If you trust AWS’s durability guarantees for important data, awesome. You should; they’re good numbers.

But what are the odds of someone in your organization (accidentally or otherwise) deleting critical data from the wrong bucket or misconfiguring a lifecycle policy to do it for them?

Backing your data up elsewhere is a great idea if your business is going to have a hard time existing without it. “Another region” is a good idea; “another AWS account that nobody from the first account has access to” is a better one; and “another cloud provider entirely” is the easiest one to explain to your Risk Management people.

And so…

I’m not suggesting that AWS is doing anything wrong by citing these numbers.

I just know that I’ve spoken to folks who have used the durability numbers as evidence that they don’t need to worry about backing things up.

And that frankly scares the hell out of me. So I implore you: Don’t do that!
