Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best name for your staging environment is “theory”. Because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.
Corey: This episode is brought to you by Trend Micro Cloud One™. A security services platform for organizations building in the Cloud. I know you're thinking that that's a mouthful because it is, but what's easier to say? “I'm glad we have Trend Micro Cloud One™, a security services platform for organizations building in the Cloud,” or, “Hey, bad news. It's going to be a few more weeks. I kind of forgot about that security thing.” I thought so. Trend Micro Cloud One™ is an automated, flexible all-in-one solution that protects your workflows and containers with cloud-native security. Identify and resolve security issues earlier in the pipeline, and access your cloud environments sooner, with full visibility, so you can get back to what you do best, which is generally building great applications. Discover Trend Micro Cloud One™ a security services platform for organizations building in the Cloud. Whew. At trendmicro.com/screaming.
Welcome. I am Cloud Economist Corey Quinn, and this is the AWS Morning Brief: Whiteboard Confessional. One of the nice things about how I do business is that I don't actually know, when I record these episodes, who is going to be sponsoring them. Today, I'm going to talk about secrets management. The reason I bring this up is that should whatever sponsor has landed the ad slot for this week be talking about a different way of handling secrets management, you should of course disregard everything I'm about to say, and buy their product and/or service instead. That said, let's talk about secrets management and how it can be done in some of the most appalling ways imaginable.
There are a depressing number of you listening to this where, if I were to steal your laptop, you potentially would not have hard drive encryption turned on, so I could just pull things off of your system. That said, most modern operating systems do this by default now, so that's less of a threat. Now, let's pretend that I wind up instead surmounting an almost impossible barrier. That's right, getting a corrupted browser extension onto your system that somehow has access to poke around in your user's home directory.
Think for a second about what I might find. Would I find, oh, I don't know, SSH keys that would grant me access to your production environment? Well, that wouldn't be that big of a problem because there's no possible way I would know what hosts they go for unless I look at the known_hosts file sitting right next to your SSH keys. But even that's a little esoteric because that's not something I would ever do at grand scale. Let’s instead consider what happens if I poke around in the usual spots and find long-lived IAM credentials, or whatever your cloud provider of choice’s equivalent is, which I believe is IAM in most cases unless you're using IBM Cloud, in which case, it's probably an old-timey skeleton key that is physically tied to your laptop.
Now, the reason this becomes a common pattern is because it's honestly pretty convenient. You're going to need to be able to access production environments or your cloud environment, and have permissions that are generally granted to you, and security is always juxtaposed with convenience. And invariably, convenience tends to win out. Sure, you can mandate the use of multi-factor authentication for those credentials to get into production, but that means you have to type in a code or press a button on a YubiKey, or something else. That fundamentally means you're going to be spending a lot more time pressing buttons or digging out passphrases than you're going to spend getting into production in a hurry.
So, we make trade-offs; we cheat; it's human nature. And of course, once you get into your production environment, things are rarely better. It seems that you have a choice. You can either have the same password shared absolutely everywhere within an environment, or you have these incredibly secure key management systems, but in return it becomes virtually impossible to rotate credentials. We've seen this before, and we've talked about this before: look at what happens when someone leaves a job unexpectedly, and suddenly the credential rotation causes four site outages in the next two days.
There's always a trade-off here. And the problem is that these elaborate multi-step secret retrieval processes that people can deploy are no stronger than their weakest link. I've talked about it in an early episode, but probably one of the most bizarre I've ever seen was for regulated data, where in order to start the database server, it required a long key that was cut into pieces, and then we needed to have multiple staff contribute and turn their key like we were launching a freakin’ nuclear missile from a submarine. And it worked, sure, but at the same time, it meant that to restart a server, you needed at least two people nearby, and that became a little nutty. Let's also ignore for a minute the fact that this was just for encrypting the data at rest.
Once the service was running, it was loaded into RAM. There was no real guarantee that this was going to be any more secure than anything else. And let's face it, we're living in an era now where people stealing the server out of our cloud-hosted environment is not the primary or secondary or tertiary threat modeling that anyone has to do. For better or worse, you can give an awful lot of crap to the cloud providers, but they've pretty much solved the ‘someone rams a truck into the side of the building, throws a rack into the back of said truck, and peels off into the night’ problem. Except IBM Cloud. So, what are some patterns that work for this? Great question. But first:
Corey: This episode is sponsored in part by ChaosSearch. Now their name isn’t in all caps, so they’re definitely worth talking to. What is ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes, not days or weeks. Click. Boom. Done. ChaosSearch is for you if you’re trying to get a handle on processing multiple terabytes, or more, of log and event data per day, at a disruptive price. One more thing, for those of you that have been down this path of disappointment before, ChaosSearch is a fully managed solution that isn’t playing marketing games when they say “fully managed.” The data lives within your S3 buckets, and that’s really all you have to care about. No managing of servers, but also no data movement. Check them out at chaossearch.io and tell them Corey sent you. Watch for the wince when you say my name. That’s chaossearch.io.
Hopefully, that ad was not about secrets management. Again, if it was, please disregard everything I'm saying and buy that product or service instead. Now, there are tools out there that will solve this problem for you. HashiCorp Vault is a good example. And over in the world of AWS, you have a couple of options. You have Systems Manager Parameter Store, which is free but has a long-winded, stupid name, or Secrets Manager, which does exactly what it says on the tin, but costs 40 cents per secret per month. The question is, is it worth 40 cents per secret for you to wind up avoiding a stupid name? It certainly is for me. Snark aside, one key differentiator that I'm a fan of is that Secrets Manager lets you invoke a Lambda function during credential rotation, which means you can teach it how to talk to any arbitrary database system you've got, run some script that winds up updating the credential, and then, effectively, it is push-button, rotate credential globally.
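To make the rotation-Lambda idea a little more concrete, here is a minimal sketch in Python of the four steps Secrets Manager walks a rotation function through (createSecret, setSecret, testSecret, finishSecret). It is not taken from the episode. It assumes the secret is a JSON document with a `password` field, and `set_password` and `test_password` are hypothetical callbacks standing in for whatever talks to your database. Retry and idempotency handling is left out to keep it short.

```python
import json
import secrets


def _pending(client, secret_id, token):
    """Fetch the not-yet-live version that createSecret stored under this token."""
    response = client.get_secret_value(
        SecretId=secret_id, VersionId=token, VersionStage="AWSPENDING"
    )
    return json.loads(response["SecretString"])


def rotate(event, client, set_password, test_password):
    """Handle one rotation step.

    event         -- the dict a rotation Lambda receives
                     (keys: Step, SecretId, ClientRequestToken)
    client        -- a Secrets Manager client, or a stand-in with the same methods
    set_password  -- hypothetical callback: apply the pending credential to the database
    test_password -- hypothetical callback: prove the pending credential can log in
    """
    secret_id = event["SecretId"]
    token = event["ClientRequestToken"]
    step = event["Step"]

    if step == "createSecret":
        # Copy the live secret, swap in a fresh password, and stage it as pending.
        current = json.loads(
            client.get_secret_value(SecretId=secret_id, VersionStage="AWSCURRENT")[
                "SecretString"
            ]
        )
        pending = dict(current, password=secrets.token_urlsafe(32))
        client.put_secret_value(
            SecretId=secret_id,
            ClientRequestToken=token,
            SecretString=json.dumps(pending),
            VersionStages=["AWSPENDING"],
        )
    elif step == "setSecret":
        set_password(_pending(client, secret_id, token))
    elif step == "testSecret":
        test_password(_pending(client, secret_id, token))
    elif step == "finishSecret":
        # Promote the pending version by moving the AWSCURRENT label onto it.
        stages = client.describe_secret(SecretId=secret_id)["VersionIdsToStages"]
        current_version = next(v for v, s in stages.items() if "AWSCURRENT" in s)
        if current_version != token:
            client.update_secret_version_stage(
                SecretId=secret_id,
                VersionStage="AWSCURRENT",
                MoveToVersionId=token,
                RemoveFromVersionId=current_version,
            )
    else:
        raise ValueError(f"unknown rotation step: {step}")
```

In a real Lambda, `client` would presumably be `boto3.client("secretsmanager")`, and the handler would pass its `event` straight through. Taking the client as a parameter means the flow can be exercised against a fake before it goes anywhere near production.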
This gets into the larger-scale pattern that the things that are scary or dangerous, that scare the heck out of people, are exactly the sort of things you should do more of. “Well, we haven't rotated our passwords or certificates in three years because the last time we did, it caused an outage,” is almost always the wrong direction to go in. The better approach for sensible human beings is, “Ooh, that was difficult and painful. How do we do that enough so that, A) it becomes routine, and B) it becomes something that we can build automation around, so it's less fight the wolf to a standstill and more push the button?” This incidentally is one of the dangerous parts historically about SSL certificates having these incredibly long expiration times. In fact, a number of browsers are now not going to honor certificates with expiry periods of longer than one year, and that's kind of a good thing. Let's Encrypt, the free certificate authority, only issues certificates with 90 days of validity, which means you're basically forced to automate this away, which is great.
Otherwise, in the olden days, we had these five-year validity windows for certificates, so by the time a certificate expired, A) the people who'd set it up were long gone, and frankly, working with OpenSSL command lines in the first place was always a question mark, and B) these certificates had spread so far within the organization, by hand, that no one knew where all of them lived. And the way we found out was when these certificates expired, invariably at five in the morning on a weekend, when we could least afford the downtime or a person to look at this. Honestly, every time you try and pull up a website that has an expired certificate, you sort of shake your head and wonder who dropped what ball. It certainly doesn't give you any degree of confidence in their technical competence. Frankly, I disregard blog posts I read when I'm confronted with a certificate error. If I was confronted by an expired cert to log into my bank, I've got to say, it's painful, but I would probably find a new bank.
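One hedged way to avoid finding out at five in the morning is to ask each endpoint how long its certificate has left. The sketch below is an illustration, not something from the episode. It uses only the Python standard library, and the function names `days_until_expiry` and `fetch_not_after` are made up for this example.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Whole days between `now` and a certificate's notAfter string.

    `not_after` uses the format getpeercert() returns,
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    if now is None:
        now = datetime.now(timezone.utc)
    return (expires - now).days


def fetch_not_after(host, port=443, timeout=5.0):
    """Connect to host:port over TLS and return the certificate's notAfter string.

    The default context verifies the certificate, so an already-expired
    certificate raises ssl.SSLCertVerificationError instead of returning.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Something like `days_until_expiry(fetch_not_after("example.com")) < 30` could then feed an alert. The 30-day threshold is arbitrary, and this only covers certificates that are reachable over the network, not the ones that were copied around by hand.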
So, production is one beast, but your laptops are another. One pattern that I'm a big fan of that kind of works with both is the idea of forcing credential rotation on a cadence. Some tools, like aws-vault, will do this in the background automatically. What I'm a big fan of in the world of EC2 is using instance roles because those are automatically rotated credentials that have a validity window of less than a day. So, if something gets compromised, there's a very limited window of validity during which time they can cause damage, as opposed to, let's face it, on your laptop, where your IAM key pairs and SSH keys probably are damn near old enough to vote, for some of you.
So, in conclusion, take a look at what your risk exposure is with credentials. Understand that there's a spectrum of good ways to solve this, bad ways to solve this, and despite what anyone tells you about how awesome their approach is, invariably there is someone in their environment who is doing it completely wrong.
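As a rough first pass at finding those laptop keys, the sketch below reads the text of an AWS shared credentials file and flags profiles holding long-term keys. It is an illustration, not something from the episode. It relies on the documented convention that long-term IAM access key IDs start with `AKIA` while temporary ones start with `ASIA`, and the function name `long_lived_profiles` is made up for this example.

```python
import configparser


def long_lived_profiles(credentials_text):
    """Return the profile names whose access key ID looks long-term.

    `credentials_text` is the content of an AWS shared credentials file,
    which uses INI syntax with one section per profile.
    """
    # Interpolation is disabled so a stray "%" in a value cannot break parsing.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(credentials_text)

    flagged = []
    for profile in parser.sections():
        key_id = parser.get(profile, "aws_access_key_id", fallback="")
        if key_id.startswith("AKIA"):  # long-term key, not a temporary session
            flagged.append(profile)
    return flagged
```

You might feed it the contents of `~/.aws/credentials`. It says nothing about how old a key is. For that you would need something like the `CreateDate` that IAM's list-access-keys call reports.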
This has been the AWS Morning Brief: Whiteboard Confessional. I'm Cloud Economist Corey Quinn. And if you've enjoyed this podcast, please leave a five-star review on Apple Podcasts. Whereas if you've disliked this podcast, please instead leave a five-star review on Apple Podcasts and a copy of your latest certificate pair.
Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on Twitter at @quinnypig and let me know what I should talk about next time.
Announcer: This has been a HumblePod production. Stay humble.