AWS has been making a lot of noise about generative AI, emphasizing risk mitigation and the need for control over your data.

Unlike its competitors, AWS doesn’t train its models on “the entire internet, regardless of various intellectual property restrictions.” This is laudable! (Though unlike the other large cloud providers, AWS currently doesn’t offer indemnification against intellectual property infringement claims when using its AI services, but that’s not the bone I wish to pick today.)

What I want to draw your attention to is the hidden catch AWS clearly hopes you won’t notice: It is training its AI services on your usage of a subset of its own cloud services.

AWS’s data paradox

AWS has long treated your data as sacrosanct, and I’ve found it very hard to argue with the company on this point. They don’t snoop into your S3 buckets to see what data you’re hosting, nor do they tailor the AWS customer console experience to individual customers in any meaningful way (even when they perhaps should!).

This position supports AWS’s leadership principles of Earning Trust and Customer Obsession. AWS even specifically commits to Transparency on its Responsible AI page, defining it as “Communicating information about an AI system so stakeholders can make informed choices about their use of the system.”

The truth is that AWS is training its AI models on your use of a subset of its services. Moreover, it’s been doing this for quite a while.

As per AWS’s Service Terms, this hoovering up of your data for AWS’s own AI training applies to Amazon CodeGuru Profiler, Amazon CodeWhisperer Individual, Amazon Comprehend, Amazon Lex, Amazon Polly, Amazon Rekognition, Amazon Textract, Amazon Transcribe, and Amazon Translate.

Fortunately, this is clearly disclosed when you first use these services.

Ha ha! I am, of course, joking.

Rather than being presented to you front and center, so you can make an informed decision, this little nugget about AI training on your usage is buried deep within the terms of service. A quick spot check of several of these services shows that this disclosure isn’t presented to the user in any meaningful way. AWS also suffers an overdose of irony by stating in its terms of service, “You will not, and will not allow any third-party to, use the AI Services to, directly or indirectly, develop or improve a similar or competing product or service.”

Even should you be fine with volunteering your data to a $1.252 trillion company, you probably want to make sure that you notify your own customers that some of this data will be processed outside of the regions your account operates within. This is very much “check with your attorneys if this might be a problem for your business” territory.

The process to opt out of AWS’s AI model training

OK, OK. You’re aware of the issue now, and you realize you don’t want to let AWS get free training from your use of their services, because there’s very clearly value there for someone, and it’s most unlikely they’re going to give you that value for free in return for your training data. Or maybe you’re worried about it taking that data outside of the region you thought it was bound within. So, like most sensible companies, you want to opt out of this.

Good thing there’s a clearly marked organizationwide opt-out switch in the console.

Ha ha! I am, of course, joking again.

If you want to opt your organization out of AWS’s AI training, you first have to enable AI opt-out policies in your org, which is a switch flip in the console.
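If you’d rather script that switch flip than click around, here’s a minimal boto3 sketch. One assumption of mine, not anything AWS spells out for you here: you’re running this with credentials for the organization’s management account.

```python
# Minimal sketch, assuming boto3 and credentials for the organization's
# management account. This flips the same switch as the console: it
# enables the AI services opt-out policy type on the org's root.
import boto3

org = boto3.client("organizations")

# Assumes a single root, which is the common case.
root_id = org.list_roots()["Roots"][0]["Id"]

org.enable_policy_type(
    RootId=root_id,
    PolicyType="AISERVICES_OPT_OUT_POLICY",
)
```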

Next, Amazon has modified its own management policy language, so you have to go look some stuff up unless you want to paste random things in all william-nilliam. You need to craft your own policy; Amazon gives an example in which you opt out of everything org-wide, but it’s polluted with “helpful” annotations, which means you cannot copy and paste it as is. Here it is, all cleaned up:

```json
{
  "services": {
    "@@operators_allowed_for_child_policies": ["@@none"],
    "default": {
      "@@operators_allowed_for_child_policies": ["@@none"],
      "opt_out_policy": {
        "@@operators_allowed_for_child_policies": ["@@none"],
        "@@assign": "optOut"
      }
    }
  }
}
```

AWS helpfully gives additional opt-out examples, as if there exists a universe in which you’d want to give Amazon access to some of your AI service usage for free, but not all of it. This is a great example of busywork for whoever was tasked with creating these examples that will never, ever be used.

As of this writing, you can opt out of the above-mentioned services, as well as: Supply Chain by Amazon, Amazon Chime SDK Voice Analytics, Amazon DataZone, Amazon Connect, Amazon Fraud Detector, Amazon GuardDuty, Amazon QuickSight Q, and Amazon Security Lake. None of these services is explicitly named in section 50.3 of the Amazon service terms, which makes me wonder why they were left out. I’ll give the benefit of the doubt and assume simple oversight, but then again, lawyers aren’t exactly known for missing things.

“Now, that wasn’t so bad …,” you might be thinking.

Slow down there, hasty pudding. You are nowhere near done.

Next, you have to take that policy you crafted and attach it to your organization’s root OU. “What the hell is that?” asks the obviously small minority of readers who don’t have a Directory Services background (that minority of folks who didn’t grow up having nightmares about LDAP, which rounds to “nearly everybody”). You get to hunt around in the console a bit to figure out how this works; alternatively, you can use a handy Terraform module or Python script if that’s more your style.
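If the Python route appeals to you, here’s a rough boto3 sketch of the create-and-attach dance, reusing the cleaned-up policy from above. The policy name and description are placeholders I made up, not anything AWS mandates:

```python
# Rough sketch: create the opt-out policy from the cleaned-up JSON
# above and attach it to the organization's root.
import json

import boto3

org = boto3.client("organizations")

policy_document = {
    "services": {
        "@@operators_allowed_for_child_policies": ["@@none"],
        "default": {
            "@@operators_allowed_for_child_policies": ["@@none"],
            "opt_out_policy": {
                "@@operators_allowed_for_child_policies": ["@@none"],
                "@@assign": "optOut",
            },
        },
    }
}

policy = org.create_policy(
    Name="org-wide-ai-opt-out",  # placeholder name of my own invention
    Description="Opt the entire org out of AI service data usage",
    Type="AISERVICES_OPT_OUT_POLICY",
    Content=json.dumps(policy_document),
)

root_id = org.list_roots()["Roots"][0]["Id"]

org.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId=root_id,
)
```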

Lastly, you have to validate that the assembly of all the various policies in your org does the actual thing that you want, as the complexity of this policy language means that the interplay might not work out exactly the way you expect.
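The closest thing I’ve found to a sanity check is asking Organizations what effective policy it computes for a given member account, which at least shows you what the policy engine thinks everything resolves to. A minimal sketch, with a placeholder account ID:

```python
# Ask Organizations for the effective AI opt-out policy it computes
# for a specific member account, then print the resolved document.
import boto3

org = boto3.client("organizations")

effective = org.describe_effective_policy(
    PolicyType="AISERVICES_OPT_OUT_POLICY",
    TargetId="111122223333",  # placeholder account ID
)

print(effective["EffectivePolicy"]["PolicyContent"])
```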

Finally, a few hefty steps and much research time later, you’ve opted your org out of AWS’s AI training. Probably. Beyond reading back that effective policy document, I don’t see a way to validate that it’s actually doing what you think it’s doing.

A departure from AWS principles

Why is this not a simple switch flip in the console? It clearly could be, which makes me wonder if this whole rigmarole is intentional. AWS’s approach here, far from being customer-obsessed, trustworthy, or transparent, seems mired in obfuscation and self-interest. To me, it feels decidedly underhanded.

As it stands, AWS may be using your data to train its AI models, and you may have unwittingly consented to it. If you wish to prevent or stop this, be prepared to jump through a series of complex hoops.