An AWS Database Safari

06.17.2020

When we talk about vendor lock-in, one of the most common stories we see is about databases. The database you pick to hold your data is something you’re going to be using for a good long while; migrations are painful, expensive, time-consuming, and, in some cases, barely possible.

Amazon themselves ran on top of Oracle for a long time. Despite extreme incentive to get off of it as fast as possible, it took them years, plus inventing their own database engine, to do it successfully.

If you take a look at AWS’s offerings for databases, you’ll see … a lot of options. Let’s ignore their RDS databases, of which there are many. (There are five database engines to choose from, and their Aurora variants speak two of those engines’ languages fluently. Then we add in two more for Aurora’s Serverless options, bringing this thing we’re handwaving away to nine database options already.) Why does AWS have so many database options?

The simple answer is that different databases support different use cases. After all, “every AWS product is for somebody, and no AWS product is for everybody.”

Picking a database is a “one-way door,” as Amazonians like to call it. It’s painful and annoying to migrate databases even between different versions of the same engine. When you pick a database, you’re making a commitment—whether you know it or not.

Let’s start simple with…

Amazon Redshift

Redshift is an opinionated version of PostgreSQL that’s designed for data warehouse projects. In other words, it’s meant for relational workloads that are likely to hit petabyte scale. Don’t let the pricing fool you; you’re not going to run just one of these suckers.

Amazon Athena

More cost-effective is Athena. This uses a variant of the Presto engine for running SQL queries against data that lives in S3. It’s way, way, way cheaper to store data inside of S3 than it is to shove it onto disks attached to relational data stores, and this even works well for ad-hoc querying, too.

The challenge is that query performance and response latency are unpredictable, meaning you may not want to have this hooked up to anything interactive in a web form. As Redshift begins to speak to S3 more effectively, the lines between Redshift and Athena blur.

SimpleDB

You might think SimpleDB has been discontinued. You’d be wrong. It’s not in the AWS console because it never has been. Until a couple of weeks ago, there was an AWS job posting for the SimpleDB team in Chennai. The service is still there, and AWS gives every appearance that it’s not going anywhere. That said, unless you’re already using it, you probably don’t want to start. Instead, the best guidance is to consider…

DynamoDB

DynamoDB is an interesting take on a NoSQL database. If you know what your queries are going to look like in advance, it’s hard to beat. It offers a key-value store but can also masquerade as a document store (more about those in a bit).

It’s inexpensive when configured properly, its responsiveness is impressive, and you have no compute infrastructure to manage. But you do need to make your peace with the unfortunate fact that since it is proprietary, anything you’re using in DynamoDB is unlikely to work anywhere else. A migration off of AWS therefore means you’re also rearchitecting your DynamoDB data stores—and the applications that interface with them.
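To make "know your queries in advance" a bit more concrete, here is a minimal pure-Python sketch of a table addressed by a partition key and a sort key. It is a toy model, not the DynamoDB API: the `ToyTable` class, its `put`, `query`, and `scan` methods, and the `USER#` / `ORDER#` key format are all assumptions made up for illustration.

```python
class ToyTable:
    """Toy model of a table addressed by (partition key, sort key)."""

    def __init__(self):
        # partition key -> {sort key -> item}
        self._partitions = {}

    def put(self, pk, sk, item):
        self._partitions.setdefault(pk, {})[sk] = item

    def query(self, pk, sk_prefix=""):
        # Cheap path: jump straight to one partition, then filter by sort key prefix.
        rows = self._partitions.get(pk, {})
        return [rows[sk] for sk in sorted(rows) if sk.startswith(sk_prefix)]

    def scan(self, predicate):
        # Expensive path: a question the keys were not designed for
        # has to look at every item in every partition.
        return [
            item
            for rows in self._partitions.values()
            for item in rows.values()
            if predicate(item)
        ]


table = ToyTable()
table.put("USER#alice", "ORDER#2020-05-01", {"total": 30})
table.put("USER#alice", "ORDER#2020-06-17", {"total": 12})
table.put("USER#bob", "ORDER#2020-06-01", {"total": 99})

# The access pattern we planned for: "Alice's orders in June 2020".
june_orders = table.query("USER#alice", "ORDER#2020-06")

# The access pattern we did not plan for: "all orders over 25".
big_orders = table.scan(lambda item: item["total"] > 25)
```

The planned question touches one partition; the unplanned one touches everything. That asymmetry is roughly why the key design has to follow the queries rather than the other way around.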

Amazon Neptune

Neptune is a managed graph database, which is an accurate answer that tells you absolutely nothing. Graph databases are great at returning results that highlight relationships. “This user is friends with the following users” is the canonical example, because basically anything else requires 80 pages to explain. My position is, and steadfastly remains, this: “If you need a graph database, you almost certainly know it, and Neptune is on the table; otherwise, move along.”
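For the canonical "friends" example, the kind of question a graph database is built to answer looks something like the sketch below. This is plain Python with a breadth-first search over an adjacency map, not Neptune or any graph query language; the `friends` data and the `within_hops` helper are hypothetical.

```python
from collections import deque

# Hypothetical social graph: user -> set of direct friends.
friends = {
    "ana": {"bo", "cy"},
    "bo": {"ana", "dee"},
    "cy": {"ana"},
    "dee": {"bo", "eli"},
    "eli": {"dee"},
}


def within_hops(graph, start, max_hops):
    """Return {user: hops} for everyone reachable from start in max_hops or fewer."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        if seen[user] == max_hops:
            continue  # do not expand past the hop limit
        for friend in graph.get(user, ()):
            if friend not in seen:
                seen[friend] = seen[user] + 1
                queue.append(friend)
    del seen[start]  # the starting user is not their own friend
    return seen


friends_of_friends = within_hops(friends, "ana", 2)
```

In a relational database, each extra hop tends to mean another self-join; in a graph model, it is just a larger `max_hops`.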

Amazon QLDB

QLDB, or Quantum Ledger Database, is a database engine that arose from the question “What if we needed a blockchain but without all of the hype-driven nonsense that makes blockchains ridiculous for most use cases?”

In other words, if you need a ledger-style database but can trust a central authority, QLDB is for you. If you can’t trust a central authority, Amazon Managed Blockchain might be a better answer. But let’s face it: At that point, you’re almost certainly past trusting a cloud provider, aren’t you?
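The "ledger with a trusted central authority" idea can be sketched as an append-only list in which each entry's hash covers the previous entry's hash, so rewriting history is detectable. This is a generic hash-chain illustration, not QLDB's API or a description of its internals; `ToyLedger` and its methods are hypothetical names.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash, data):
    payload = json.dumps(data, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode()).hexdigest()


class ToyLedger:
    """Append-only log where each entry commits to everything before it."""

    def __init__(self):
        self.entries = []

    def append(self, data):
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS
        self.entries.append(
            {"data": data, "prev": prev_hash, "hash": _digest(prev_hash, data)}
        )

    def verify(self):
        # Recompute the chain from the start; any edited entry breaks it.
        prev_hash = GENESIS
        for entry in self.entries:
            if entry["prev"] != prev_hash:
                return False
            if entry["hash"] != _digest(prev_hash, entry["data"]):
                return False
            prev_hash = entry["hash"]
        return True


ledger = ToyLedger()
ledger.append({"account": "a", "delta": 100})
ledger.append({"account": "a", "delta": -40})
ok_before = ledger.verify()

# Quietly rewrite history, then check again.
ledger.entries[0]["data"]["delta"] = 1_000_000
ok_after = ledger.verify()
```

There is no consensus protocol here, which is the point: one party owns the log, and everyone else only needs a way to check that it has not been altered.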

Amazon Timestream

While we’re delving into the realm of fantasy, let’s look at Timestream, their take on a time series database. This isn’t an objectively nutty thing to want; my criticism of it comes from the fact that it was announced at re:Invent (AWS’s own version of Cloud Next) 2018, and over a year and a half later, it has yet to enter either public preview or general availability. Time series databases (like InfluxDB) are great at storing and querying data over time: metrics, logs, and events, frequently at incredibly high volume. This is a big deal not just in application monitoring but also in the world of IoT. Devices in the field reporting vast quantities of data back to a central point will often look for something in the time series space.
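The bread-and-butter operation for this kind of data is rolling raw points up into time buckets. Here is a minimal sketch in plain Python, assuming points arrive as `(timestamp_in_seconds, value)` pairs; the `downsample` function and the sample readings are hypothetical, and nothing here reflects Timestream's actual interface.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - ts % bucket_seconds  # round down to the bucket edge
        buckets[bucket_start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Hypothetical temperature readings from one device.
readings = [(0, 20.0), (30, 22.0), (60, 21.0), (95, 23.0), (130, 30.0)]
per_minute = downsample(readings, 60)
```

A real time series engine does this at far higher volume and usually ages out or compacts old raw points, but the shape of the query is the same.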

Amazon ElastiCache

ElastiCache has two variants (Redis and Memcached), both of which serve as in-memory data stores. This means incredibly quick response times are available, since the query never has to touch the disk. It’s thus commonly used for keeping session data around.
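As a rough illustration of the session-data case, the sketch below has two web servers sharing one in-memory store with expiring keys. It is plain Python standing in for the cache, not the Redis or Memcached client API; `SessionStore`, `WebServer`, and the 30-minute TTL are assumptions for the example, and `restart` models the no-persistence case where nothing was written to disk.

```python
import time


class SessionStore:
    """Stand-in for a shared in-memory cache with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: behave as if it was never there
            return None
        return value

    def restart(self):
        # Nothing was persisted, so a restart loses every session.
        self._data.clear()


class WebServer:
    def __init__(self, name, store):
        self.name = name
        self.store = store

    def login(self, session_id, user):
        self.store.set(session_id, user, ttl_seconds=1800)

    def whoami(self, session_id):
        return self.store.get(session_id)


store = SessionStore()
web1, web2 = WebServer("web1", store), WebServer("web2", store)

web1.login("sess-123", "alice")
seen_by_web2 = web2.whoami("sess-123")  # a different server sees the same session

store.restart()
after_restart = web2.whoami("sess-123")  # gone: the user has to log in again
```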

A very common use case is having multiple web servers behind a load balancer, but sharing the session data so that users don’t have to log in again every time the load balancer gives them to a different server. The risk, of course, is that since the data isn’t persisted to disk, you’re one power outage away from data loss. Speaking of data loss…

Amazon DocumentDB (with MongoDB compatibility)

MongoDB is a storied database that achieves some great things. Unfortunately, it also likes to emphasize performance at the potential cost of data integrity and historically has done so by burying some very important caveats deep within its documentation.

That said, what makes DocumentDB interesting to me is that AWS does no real marketing of the service past talking about how compatible with MongoDB 3.6 it is. My takeaway from that message is this: “If you want to run MongoDB in an AWS environment, consider this.” Based upon MongoDB’s community interactions, I don’t want to run it at all, so I don’t spend much time paying attention to DocumentDB, either. If you’re less judgmental of MongoDB than I am, this is worth a gander.

Amazon Keyspaces (for Apache Cassandra)

Similarly, the AWS documentation for Keyspaces falls far short. It spends most of its energy talking about how compatible with Cassandra it is—albeit in ways that sound suspiciously like DynamoDB.

It’s worth noting that one of the Dynamo paper authors later went on to build Cassandra. If I wanted to use a managed service but still have a theoretical database exodus strategy I could fall back to, I’d consider this as my first stop on the path.

Amazon Route 53

Lastly, we come to my favorite database: Route 53.

You might argue that DNS isn’t a database; I would argue that an eventually consistent world-spanning key-value store that (in this case) offers a 100% uptime SLA is awfully hard to see as anything other than a database. I will accept that I’m in the minority (for now!), but I will highlight that an awful lot of what people are misusing S3 for could just as easily be done well with Route 53.

This concludes my survey of AWS’s database offerings. Unfortunately, this will rapidly go out of date; there are multiple job postings on AWS’s site looking for people to work on “unreleased database products,” so it’s pretty clear that I may have to revisit this topic after re:Invent (AWS’s own version of Cloud Next) unless I want to watch it age badly.
