Between rising data security concerns and the ballooning regulatory burden imposed by new privacy laws, it’s more important than ever to develop a comprehensive data retention strategy.

But knowing when to hold ’em and knowing when to fold ’em when it comes to data isn’t as straightforward as it sounds. Even relatively small organizations can manage huge amounts of data at varying levels of value and sensitivity, and it’s common for data to be stored in multiple formats and locations. For many organizations, just charting existing data flows can be a herculean task. 

Let’s look at what data retention is, some high-level guidance for developing your data retention policy, and how to apply what you’ve learned in an AWS environment. 

What is data retention?

Data retention is the practice of preserving data for a specific period of time to meet technical, business, or regulatory requirements. Any time you save data to a file, you’re technically retaining it — but the term “data retention” usually refers to the deliberate, systematic ways in which you store, use, and delete data.

A data retention policy seeks to answer questions such as:

  • How long do you retain access logs? 
  • What happens to your customers’ data after they churn? 
  • When are you required to delete a user’s personal information? 
  • What are your obligations when data crosses geographic boundaries? 

When should you retain data?

The simplest and best data retention strategy is the most obvious: Don’t store any data. If data doesn’t exist, it doesn’t require a flow diagram, doesn’t cost you any money, isn’t subject to regulation, and never needs to be deleted. 

This may sound like a joke, but I mean it. For every new data inflow, you should begin with the assumption that you won’t retain anything for more than 24 hours.

In most situations, discarding all the data won’t be an acceptable end state — most data has some kind of short-term value. But it’s a great default position; it forces you to qualify the data and weigh the costs and benefits of keeping it. This strategy is similar to the principle of least privilege: Start with the absolute minimum and only expand scope where necessary.

Qualifying data can be tricky, but a few key questions can help. The answers won’t always come easily, and they’ll depend on the particulars of your organization and use cases. Remember, too, that as data ages the answers often change, so your strategy should encompass both the initial state of the data and its changing value and liability over time. Keep your use cases and long-term needs in mind as you work through these questions to decide whether, and for how long, to preserve data.

Is it useful? 

Besides immediate business needs, you may want to retain data for auditing, troubleshooting, compliance, or redundancy. If data isn’t useful, you should not keep it.

How much will it cost to store? 

Storing data indefinitely might be handy sometimes, but it guarantees a perpetually inflating storage bill. What is the data worth to your business, and what is the ROI on storing it? You sometimes need to make an educated guess about the current and future value of your data.

How often do you need to access it? 

Very often, data has a hot-warm-cold lifecycle. At first it’s accessed very frequently, then less frequently, then almost never. If this is the case for your data, its retention flow may be complex. For example, the data may move between database nodes, storage classes, or even physical systems. This flow can also greatly impact cost considerations.

Is it replaceable? 

Sometimes data is derived from other data, so re-creating or adjusting it (say, because of a bug in your roll-up query) requires the raw source data. In these cases, you may want to keep both the source data and the derived data so mistakes can be remedied on a predictable time horizon. Also consider catastrophes: how much data would you need to restore your service to working order? Once data is deleted, you can’t get it back, so assess your minimum data durability when creating a retention strategy.

Is it subject to regulatory requirements? 

If so, this trumps all other considerations. Many types of data have specific legal requirements for retention — for example, health care data, financial data, and employment records. This could be the hardest question you’ll face, and the answer may contain a lot of acronyms: HIPAA, FERPA, FLSA, SOX … all your favorites!

Is the data subject to regional privacy laws? 

If your data contains anything close to personally identifiable information, it may be subject to laws like the EU’s General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA), and Brazil’s LGPD (again, so many acronyms). These laws sometimes prescribe deleting personal data as soon as possible. This can be tricky, so we’ll cover privacy regulation in more detail.

Data retention and privacy regulation

“Personal data” can be broadly defined as anything that can identify a unique individual or can be considered a profile of an anonymous individual. Given the growth of privacy regulation over the last few years, you should assume that any personal data you store will soon be subject to regulation, if it isn’t already. Sometimes these privacy obligations can be mitigated by user agreements and other mechanisms, but you should assume that’s not the case unless somebody wearing an expensive tie tells you otherwise.

Privacy regulations generally require that you treat personal data like a scalding hot potato. You should only collect it when given permission. You should only store it for as long as necessary. You should provide copies of it when requested, and you should delete it when asked. … OK, maybe that’s not exactly how you treat potatoes, but it’s too late for me to back out of this simile, so just roll with it.

These requirements have serious implications for how and where you store data. Whatever your data storage solution, you need the ability to query, delete, or alter data without denting the integrity of your audit or reporting systems. Backing an existing system into these requirements can be extremely painful, but the penalties for flouting privacy laws could be worse. Ideally, you should have a committee at your organization that keeps an eye on privacy regulations and identifies current and future concerns regarding your data.

Data retention in AWS

There are four main practical considerations when designing data retention flows in AWS:

  1. Moving or expiring data. How can data be moved to cheaper storage tiers, and how can it be deleted? This addresses cost concerns and, sometimes, regulatory requirements.
  2. Encryption and access. How is data encrypted, and how is access controlled? Beyond security best practices, some regulatory frameworks prescribe specific encryption and access controls.
  3. Querying and modification. How can you find and modify individual records if required? This addresses business needs and privacy regulation requirements.
  4. Backups and redundancy. How much data needs to be backed up, how durable do backups need to be, and how can backups be automated? This is sometimes a regulatory consideration, but most often, it will be about business continuity, disaster recovery, and auditing.

If you organize your data well, querying, altering, and destroying it will be much simpler, no matter which technology you use. It may make sense to organize data by customer, or in a time-series, or by organizational unit, or along some other dimension. When you’re deciding how to organize your data, try to consider its entire lifecycle — not just its immediate use.

There are some common services you may use to store data in AWS. Let’s look at how each one handles expiring data, managing access, querying individual records, and automating redundancy.

S3

S3 provides an amazing feature for managing data retention: lifecycle rules. Lifecycle rules can help you manage storage costs by automatically transitioning objects to cheaper storage classes after they hit a certain age. Storage classes are priced based on data availability and durability, and they work really well for hot-warm-cold retention schemes. Once objects are no longer needed, your lifecycle rules can “expire,” or delete, them. IMO, “delete” is the cheapest and best storage tier.
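
As a rough sketch, here’s how a hot-warm-cold rule might look with boto3 (the bucket name, prefix, and day thresholds are hypothetical; tune them to your own data’s lifecycle):

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical hot-warm-cold flow: Infrequent Access at 30 days,
    # Glacier at 90 days, deleted at 365 days.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-log-bucket",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "hot-warm-cold",
                    "Status": "Enabled",
                    "Filter": {"Prefix": "logs/"},
                    "Transitions": [
                        {"Days": 30, "StorageClass": "STANDARD_IA"},
                        {"Days": 90, "StorageClass": "GLACIER"},
                    ],
                    "Expiration": {"Days": 365},
                }
            ]
        },
    )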

Encryption and access are also concerns when you’re storing any kind of data, and both are relatively simple to manage in S3. Bucket policies, public access settings, and IAM policies help control access to data, while encryption controls let you enable automatic encryption using your own key or AWS’ key.
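
If you want encryption at rest without thinking about it on every upload, default bucket encryption is a one-time call. A minimal sketch, assuming a customer-managed KMS key (the bucket name and key ARN below are placeholders):

    import boto3

    s3 = boto3.client("s3")

    # Turn on default encryption with a customer-managed KMS key.
    s3.put_bucket_encryption(
        Bucket="example-log-bucket",
        ServerSideEncryptionConfiguration={
            "Rules": [
                {
                    "ApplyServerSideEncryptionByDefault": {
                        "SSEAlgorithm": "aws:kms",
                        "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
                    }
                }
            ]
        },
    )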

Complying with regulatory requests (for example, a GDPR data purge request) can be a bit of a pain in S3. Right off the bat, crawl any buckets that contain personal data with AWS Glue. Once data is crawled, it can be easily queried via Amazon Athena. If you’re looking to identify the exact object that a specific record exists in, you just need to include “$path” in your list of selected columns. Very handy!
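
Here’s a rough sketch of what that query might look like, kicked off from boto3; the database, table, column names, and results bucket are all stand-ins for whatever your Glue crawler and environment actually produce:

    import boto3

    athena = boto3.client("athena")

    # Find which S3 objects contain records for a given user.
    athena.start_query_execution(
        QueryString="""
            SELECT "$path", user_id, email
            FROM access_logs
            WHERE user_id = '12345'
        """,
        QueryExecutionContext={"Database": "crawled_logs"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )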

Manually editing objects isn’t very scalable. So AWS offers a Find and Forget tool that’s designed specifically for complying with data erasure requests. It’s a bit of a beast, but it seems to do the job well and would be relatively easy to customize if you have unusual needs.

Redundancy is important in any data retention scheme, and bucket replication makes this trivial in S3. Enabling replication automatically syncs data from a source bucket to a destination bucket, preferably in another region. But remember that your lifecycle, access, and encryption policies should also apply to your backups!
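
Setting up replication itself is a single API call once versioning is enabled on both buckets and S3 has an IAM role it can assume for the copy. A sketch, with placeholder bucket names and role ARN:

    import boto3

    s3 = boto3.client("s3")

    # Replicate everything in the source bucket to a replica bucket in another
    # region. Versioning must be enabled on both buckets.
    s3.put_bucket_replication(
        Bucket="example-log-bucket",
        ReplicationConfiguration={
            "Role": "arn:aws:iam::111122223333:role/example-replication-role",
            "Rules": [
                {
                    "ID": "replicate-everything",
                    "Status": "Enabled",
                    "Priority": 1,
                    "Filter": {},
                    "DeleteMarkerReplication": {"Status": "Disabled"},
                    "Destination": {"Bucket": "arn:aws:s3:::example-log-bucket-replica"},
                }
            ],
        },
    )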

Also note that since AWS CloudTrail is backed by S3, all of the above applies to CloudTrail logs as well. Many hybrid storage solutions, like AWS Storage Gateway, are also backed by S3. In cases where on-prem data exists, you need to consider both local and remote copies of data.

CloudWatch

CloudWatch handles all kinds of log data, including VPC flow logs, application output, custom metrics, and logs from an endless list of AWS services. By default, logs in CloudWatch never expire. But thankfully, CloudWatch provides simple retention settings, which enable the automatic deletion of logs after a certain period of time.
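
Setting a retention period is a one-liner per log group. For example (the log group name and 30-day window below are just placeholders):

    import boto3

    logs = boto3.client("logs")

    # Keep this (hypothetical) log group's data for 30 days, then let
    # CloudWatch delete it automatically.
    logs.put_retention_policy(
        logGroupName="/aws/lambda/example-function",
        retentionInDays=30,
    )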

Encryption is relatively simple in CloudWatch: You can specify a KMS key, and CloudWatch automatically handles encryption and decryption. Access is managed by IAM roles and policies.
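
A quick sketch of attaching a key to a log group, assuming the key policy already lets the CloudWatch Logs service principal use it (both names below are placeholders):

    import boto3

    logs = boto3.client("logs")

    # Encrypt a log group with a customer-managed KMS key.
    logs.associate_kms_key(
        logGroupName="/aws/lambda/example-function",
        kmsKeyId="arn:aws:kms:us-east-1:111122223333:key/example-key-id",
    )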

Complying with regulatory requests in CloudWatch seems basically impossible. There isn’t a good way to delete, selectively delete, or alter data from CloudWatch without deleting the entire log stream. So as a rule, don’t keep anything resembling personal data in there!

CloudWatch is magically redundant, and you basically just have to trust that AWS won’t lose your data. It is possible to export CloudWatch data to S3, so if necessary, you could implement your own redundancy mechanism.
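
If you go that route, an export task might look something like this sketch (the log group, bucket, and prefix are hypothetical, and the destination bucket’s policy has to allow CloudWatch Logs to write to it):

    import time
    import boto3

    logs = boto3.client("logs")

    # Export the last 24 hours of a log group to S3. Timestamps are
    # milliseconds since the epoch.
    now_ms = int(time.time() * 1000)
    logs.create_export_task(
        logGroupName="/aws/lambda/example-function",
        fromTime=now_ms - 24 * 60 * 60 * 1000,
        to=now_ms,
        destination="example-log-archive-bucket",
        destinationPrefix="cloudwatch-exports",
    )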

Databases

A few databases natively support time-to-live (TTL) functionality, which automatically expires data based on a property you set when a record is created. Amazing, right?! Databases with TTL features include Redis, DynamoDB, and DocumentDB (which is MongoDB-compatible). For these databases, simply set an expiration attribute when you write the record, and it’ll automagically be deleted when the time is right.
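
For DynamoDB, for example, that means enabling TTL on the table and writing an epoch timestamp into the designated attribute. A sketch with hypothetical table and attribute names:

    import time
    import boto3

    dynamodb = boto3.client("dynamodb")

    # Tell DynamoDB which attribute holds the expiration timestamp.
    dynamodb.update_time_to_live(
        TableName="example-sessions",
        TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
    )

    # Write an item that expires roughly 90 days from now; DynamoDB deletes it
    # some time after that timestamp passes.
    dynamodb.put_item(
        TableName="example-sessions",
        Item={
            "session_id": {"S": "abc123"},
            "expires_at": {"N": str(int(time.time()) + 90 * 24 * 60 * 60)},
        },
    )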

For databases that don’t support TTL, you can employ a time-series table structure, which lets you easily drop tables that are no longer relevant. In Amazon Elasticsearch Service (now OpenSearch Service), Index State Management lets you define rules to delete entire indexes on a schedule.
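
For the time-series table approach, the cleanup job can be as simple as a scheduled script that drops tables older than your retention window. Here’s a rough sketch, assuming monthly PostgreSQL tables named like events_2023_01 and a psycopg2 connection (every name here is hypothetical):

    from datetime import datetime, timedelta

    import psycopg2

    RETENTION_MONTHS = 13
    cutoff = datetime.utcnow() - timedelta(days=RETENTION_MONTHS * 31)

    conn = psycopg2.connect("dbname=analytics")  # hypothetical connection string
    with conn, conn.cursor() as cur:
        cur.execute("SELECT tablename FROM pg_tables WHERE tablename LIKE 'events_%'")
        for (table,) in cur.fetchall():
            # The suffix encodes the month, e.g. events_2023_01
            month = datetime.strptime(table[len("events_"):], "%Y_%m")
            if month < cutoff:
                cur.execute(f'DROP TABLE IF EXISTS "{table}"')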

Most AWS database offerings support encryption and fine-grained access controls. Many support access management via IAM, but in some cases, you’ll need to deal with whatever native user management system the database includes.

Databases are usually easy to query and update (that’s what they’re for!), so regulatory requests or customer data purges should be a breeze with most any DB service. However, for data that isn’t time-series, you may have to invent a bespoke solution for purging data after it reaches a certain age, after a customer churns, or in response to a user request.

Most AWS database services offer redundancy, either via multi-region replicas or AWS Backup.

Other places data is stored

You undoubtedly have data stored in other places — maybe you have logs stored on long-lived EC2 instances, or a SQLite database sitting on an EFS volume. Again, the key things to consider are: 

  • What data exists?
  • What encryption and access limits are in place?
  • How is data accessed and deleted?
  • How is data backed up?

If you can’t answer one or more of these questions, consider improving the data flow or moving the data to another store altogether.

How to get started on a data retention strategy

Developing a good data retention strategy requires a solid understanding of the nature, format, sensitivity, and useful lifecycle of your data. A good place to start is to create an inventory and diagram for all your data inflows and outflows. Once this inventory exists, you can develop a retention strategy for each persistent storage node and data stream. This can be an illuminating exercise, whether you’re planning future systems or evaluating existing ones. You may realize that you’re storing a lot more data than you need to be — or a lot less!

AWS provides some tools to help identify sensitive data and ensure compliance with various security and regulatory standards. For example, Amazon Macie helps you identify potentially sensitive data stored in S3, which can be a huge help when trying to get a handle on existing datasets. Though not strictly related to retention, AWS Security Hub analyzes your environment against a number of compliance frameworks and delivers multiple recommendations related to data access, backup, and retention. This is especially useful for things like PCI compliance.

Beyond retention, the modern regulatory and privacy environment requires a comprehensive strategy for search, manipulation, and deletion of source data. Again, this begins with looking at your data flow diagrams, identifying potentially sensitive data, and ensuring it can be easily queried, managed, and expired. This may require tools and processes that you didn’t previously employ, and in some cases (such as with S3), it can be a non-trivial task to implement. But knowing what needs to be done is a crucial step.

Once your existing data flows are documented and you have the tools in place for managing retention, you need to create a formal data retention policy describing your organization-wide practices. Though we’ve mostly been talking about AWS, your policy would ideally dovetail with your organization’s broader data governance policies covering things like emails and physical records. Whatever your policy, anticipate that new data streams will constantly be created. Any team creating or managing data needs to understand the policy’s implications and how best to implement it.

This kind of organizational rigor isn’t easy to accomplish, and it’s even harder to back into. It can be a real pain just to document all of the data inflows and outflows for existing systems. As control of infrastructure and operations is increasingly democratized and decentralized (e.g., via NoOps and similar concepts), it’s becoming more difficult to keep tabs on your data streams. So in addition to implementing specific solutions for your existing data, create a suite of tools and policies that can be easily understood and applied by any person or team that generates data.

Reaching an optimal data retention process

Data retention will always be a moving target. But with a solid understanding of your data ecosystem and a working knowledge of AWS data storage services and features, you can design a framework that enables your organization to successfully balance business needs and regulatory requirements with each new data stream.