Firewalls, Zombies, and Cloud Permissions Security with Sandy Bird

Episode Summary

On this Featured Guest episode of Screaming in the Cloud, Corey is joined by Sandy Bird, Co-Founder and CTO of Sonrai Security. The two discuss the current state of cloud permissions security, and Sandy details the company’s breakthrough Cloud Permissions Firewall which promises fast and scalable cloud least privilege all with one click. Corey and Sandy also talk about bunk AWS tools in this space, the insanely high “zombie” population in the cloud, and how Sonrai works for companies of all sizes.

Episode Show Notes & Transcript

Highlights:

(00:00) Welcome to Screaming in the Cloud with Corey Quinn

(00:50) Sponsored Ad

(01:32) Exploring Sonrai Security's Mission and Challenges

(03:38) Introducing the Cloud Permissions Firewall Concept

(05:59) Comparing Cloud Providers' Permissions Models

(09:49) Sponsored Ad

(10:12) Addressing the Zombie Identity Problem

(16:44) Scaling Solutions for Different Company Sizes

(20:10) Navigating Cloud Security Challenges

(23:38) Innovative Approaches to Permission Management

(25:27) Optimizing Permission Requests with Statistics

(27:04) Improving Cloud Security with Permissions on Demand

(35:15) Concluding Thoughts and Contact

About Sandy: 

Sandy Bird is the co-founder and CTO of Sonrai Security, helping enterprises protect their data by securing cloud identities and access. Sandy was the co-founder and CTO of Q1 Labs, which was acquired by IBM in 2011. At IBM, Sandy became the CTO for the global security business and worked closely with research, development, marketing and sales to develop new and innovative solutions to help the IBM Security business grow to ~$2B in annual revenue. He is a trusted and experienced cloud security expert.

Links referenced: 

Sonrai Security Website:  https://sonrai.co/screaming-cloud 
Free 14-Day Trial:  https://sonrai.co/screaming-trial

Transcript

Sandy Bird: Our existing customer base, the people that really cared about least privilege, were like large financials. They actually had the staff to put in place to kind of monitor that you're actually getting to least privilege and cut the tickets, and they could afford the extra developer time to do it right. And so those customers, we found as a pattern, not only cared about least privilege, they were really good at writing, we'll use the example of AWS SCPs, Azure Policy, things like that, to basically block the undesired activity.

Corey Quinn: Welcome to Screaming in the Cloud. I'm Corey Quinn. And this promoted guest episode is brought to us by our friends at Sonrai Security. Also brought to us is their own co-founder and CTO, Sandy Bird. Sandy, thank you for joining me.

Sandy Bird: Thanks for having me, Corey.

Corey Quinn: Do you know what's more old school than blowing on a Nintendo cartridge to make it work? Manually creating individual policies to achieve least privilege in your cloud. Leave old habits in the past and lock down access to sensitive permissions and services without disrupting DevOps with a single click. With the Cloud Permissions Firewall, you can easily restrict excessive permissions from human and machine identities, quarantine unused identities, and restrict specific regions and unused services with the click of a button.

Start a 14-day free trial for Sonrai Cloud Permissions Firewall at sonrai.co/screaming. That's S-O-N-R-A-I dot C-O slash screaming.

So, taking it from the top, I suppose: I don't believe I'd heard of Sonrai before you had reached out. What is it you folks do over there?

Sandy Bird: Yeah, for the last five years, four to five years, we have focused on getting identities that are in AWS, Azure, or GCP to least privilege.

So you can think about that as looking at the history of what they do, generating a better policy, applying that policy to that particular identity, and now it's at least privilege. Um, we've learned a lot in four years. That probably is, in some ways, and I hate to say this, a fool's errand, because you have so many identities that doing them one at a time, unless you have some way to completely automate that and trust the automation, is almost impossible.

Corey Quinn: So effectively, you take existing permission sets in various cloud accounts and then prune them down to least privilege, to a minimum viable privilege, so that people can do their roles, but they don't just have casual access to things that they don't need. Is that directionally correct?

Sandy Bird: That was, again, I always call that, you know, as you build these companies, Sonrai 1.0, right? And it was our thesis, and I had this great thesis. It was: because cloud logged everything (it doesn't actually log everything, but let's pretend it logs most things), we would be able to look at every resource and get these perfect policies for it.

And then over time, we adapted those policies to make them a little less restrictive. It's really annoying when you're using the console and somebody is taking away every single thing you've never done before, and you browse around the console and everything is broken when you get there. That's not such a great experience.

So we made kind of those, we'll call them least restrictive, least privileged policies. But we came to this kind of conclusion about a year ago as we would monitor our customers. We had this great customer that was super successful at this. They had built it into their process: they put Jira tickets in for people, they fixed their Terraform, they would test it in UAT, and then they would roll it to production and be like, oh, we were super successful.

We measured, you know, that timing, and it was like, over a 10-month period, they fixed 2,000 or 3,000 identities. And you know, that's pretty successful until you realize they generated more than 2,000 identities in that same period. And then you're like, this isn't working.

Corey Quinn: We're getting more and more efficient at pushing this boulder up the hill continuously.

Sandy Bird: It is, right? And so a year ago, we took this kind of flip-it-on-its-head model and said, there has to be a better way to do this. And so we created this thing called a Cloud Permissions Firewall. I'm curious, Corey, what do you think of that name? Not knowing even exactly what it does yet, what do you think of the name?

Corey Quinn: Oh, I think it's brilliant marketing, because you're not going to be able to get into RSA unless you have a firewall to sell someone. I mean, that's basically their entire schtick. So, great, I'd call basically anything a firewall if it gets me access to people I need to market to. It's great. Um, it also, based upon what I'm thinking off the top of my head, helps explain something sort of esoteric, which is effectively identity as perimeter, which is what we're talking about here, and explains it to people who still think in terms of firewalls. See again, RSA.

Sandy Bird: And you've kind of hit it on the head. It was a really touchy topic around here as we were naming it because part of it, as you say, is this very old school name, which by the way is even older than networks, right? We have firewalls between apartment buildings and we have firewalls in our car.

But the reality is, it's a very old kind of term. And we didn't know if people would be able to make this bridge into this, as you say, identity as the new firewall world. But when we started thinking about it more, as we kind of built this new model, it really has flipped on its head to be a deny first model for identity.

But only for the most sensitive permissions. If you actually took every single permission, there's, I think it's up to like 43,000 permissions across those three main cloud providers now. It's insane. And it grows every day. Literally every day, there's more permissions. Um, if you did that and tried to protect all 43,000 of them, everybody would be using something new at some point in time and it would just be super annoying.

However, we actually did this piece of work to find all the really sensitive stuff: create a URL, you know, copy a snapshot to another place, like the things that actually leak data, poke holes in your world, you know, destroy the cloud, these types of things. We got it down to about 3,000 permissions across those three clouds, plus or minus a few.

And when we looked at it that way, we could flip the model and we could say, now we can build deny first for those 3,000 permissions. And if you have the other ones, they're not as restrictive. You should go back to our old model, build least privileged policies for it. But if you don't get to it, we can take most of the risk out of this by protecting the 3,000 centrally. So it's a different model, super effective, super fast to get done. You know, you can get it done in a week versus 10 months.
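Editor's sketch: the deny-first idea Sandy describes could look roughly like the following for AWS, assuming it is expressed as a service control policy with exemptions keyed on `aws:PrincipalArn`. The three actions and the `network-admin` role are illustrative placeholders, not Sonrai's actual list or output.

```python
import json

# Illustrative sample only; NOT Sonrai's actual list of sensitive permissions.
SENSITIVE_ACTIONS = [
    "ec2:CreateInternetGateway",
    "ec2:ModifySnapshotAttribute",
    "s3:PutBucketPolicy",
]


def build_deny_first_scp(actions, exempt_principal_arns):
    """Build an SCP-style document that denies `actions` unless the caller is exempt."""
    statement = {
        "Sid": "DenySensitiveUnlessExempt",
        "Effect": "Deny",
        "Action": sorted(actions),
        "Resource": "*",
    }
    if exempt_principal_arns:
        # Principals matching these ARN patterns are not subject to the deny.
        statement["Condition"] = {
            "ArnNotLike": {"aws:PrincipalArn": sorted(exempt_principal_arns)}
        }
    return {"Version": "2012-10-17", "Statement": [statement]}


policy = build_deny_first_scp(
    SENSITIVE_ACTIONS,
    ["arn:aws:iam::*:role/network-admin"],  # hypothetical exempt role
)
print(json.dumps(policy, indent=2))
```

The point of the shape is that one central statement covers the sensitive subset, so individual identity policies do not have to be rewritten first.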

Corey Quinn: I have a lot of thoughts on the idea of permissions in cloud and least privilege. There are two almost diametrically opposed philosophies, at least the last time I dug into this in any depth: AWS and GCP.

By default, nothing in AWS can talk to anything, full stop, whereas in GCP everything within a project generally can speak to everything within that project until you start isolating things down. And security purists love to turn up their nose at the Google approach, but I think it is the better way to start.

Otherwise, you wind up with what everyone does in AWS. You try and just give it the permissions it needs, and then something doesn't work, and you expand it a bit, and it still doesn't work, and you try and expand it yet again, it still doesn't work, and then you just give it full access to do things with a "to do: fix this later," and the to-do hangs around longer than any five employees in your company.

Sandy Bird: I think you've nailed it on the head, and I'll bleed Azure in here too, just to really mess the world up, right? So in AWS you have this expanding-of-the-wildcard problem. You don't know what the permissions are underneath them. And so people, just as an example, give it EC2 star or Lambda star, whatever they need to get the thing done. Then they find out they need a PassRole, so they add, you know, IAM to it.

Corey Quinn: You also need to be able to talk to CloudTrail logs, or they won't be able to charge you out the wazoo. Like, that's right. Great.

Sandy Bird: Right, exactly. So you have all of this massive permission set in AWS. One thing that's neat about the AWS model, though, is it is a deny first model.

And so if you can get a deny somewhere in that path of your identity, you can deny something. And no matter how many times other things grant it, it will still be denied. As you say, GCP is a little different than that. It has these kind of very open projects, right? And we always, uh, we pick on people for their service accounts that can act as anything else, including all the other service accounts in the project.

But it is still a deny first model, and about a year and a half ago, maybe a little longer ago, um, GCP put a binding in that's very special that allows you to create a deny on a permission, and you can actually build exemptions around that using different principals that they have. And so you can actually get a pretty good deny first model in GCP.

But as you say, at least it starts in a usable form where things aren't like open to the entire GCP cloud. They're at least limited to that project, so it's a little bit better in some ways. Although, sometimes I equate a project to an account in AWS and, uh, you know, again, we can talk about how open those are.

Azure is really backwards. Azure is an allow first model. So no matter how many denies you have and how many policies you've written, if anywhere in Azure there's one statement that says you're allowed to do it, you're allowed to do it. And, um, so you have to think completely differently in Azure when you go to correct these things, because you can still create a deny first model, but you have to understand all the inheritance and everything for doing that.

And depending on where you're putting the rules, there are other things that don't work. Anyway.

Corey Quinn: I've been beating the drum for ages that Azure security is deeply flawed across a variety of different levels. I wasn't even aware of this, and I'll just add it to the pile at this point. Although I will give them credit.

They're the most cost-effective cloud, just because of how easy it is to run your stuff on someone else's account.
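Editor's sketch: the AWS behavior Sandy describes, where a deny anywhere in the path wins no matter how many statements grant the action, can be modeled in a few lines. This is a deliberately simplified model (actions only, no resources, conditions, or policy types) and is not AWS's real evaluation engine; it covers AWS only, not the GCP or Azure behavior discussed above.

```python
from fnmatch import fnmatchcase


def matches(action, patterns):
    """True if `action` matches any pattern; `*` wildcards, case-insensitive."""
    action = action.lower()
    return any(fnmatchcase(action, p.lower()) for p in patterns)


def evaluate(action, statements):
    """Simplified decision: an explicit deny always wins, otherwise any allow, otherwise implicit deny."""
    allowed = False
    for statement in statements:
        if not matches(action, statement["Action"]):
            continue
        if statement["Effect"] == "Deny":
            return "Deny"  # no number of allows can override this
        allowed = True
    return "Allow" if allowed else "ImplicitDeny"


# The wildcard pattern from the conversation: EC2 star plus PassRole, with one central deny.
statements = [
    {"Effect": "Allow", "Action": ["ec2:*"]},
    {"Effect": "Allow", "Action": ["iam:PassRole"]},
    {"Effect": "Deny", "Action": ["ec2:CreateInternetGateway"]},
]

print(evaluate("ec2:RunInstances", statements))
print(evaluate("ec2:CreateInternetGateway", statements))
print(evaluate("s3:GetObject", statements))
```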

Sandy Bird: Yeah, well, there you go. We could spend a lot of time on it, but I'm going to go on a side tangent about our experience with discovery in Azure. So we were spending time building this particular project, and we were looking for ways to basically... wait, I think we're going to talk about zombies later. I love zombies. We're going to talk about zombies and cleaning zombies up. But we were trying to find ways to make sure that we would know if something happened in Azure that was denied.

What we discovered was, almost nothing in Azure that is denied is logged to their centralized logging. It shows up on the screen of the person who is denied: in their console you'll get a deny, or in the SDK you'll get the deny. But then when you go to look at the activity logs, no matter what you turn on, the diagnostic logs, all these things, it isn't there.

It's not every permission, but it's huge numbers of them, which is really interesting, for sure. Anyway, side tangent. We could go down that one for a while.

Corey Quinn: Um, I want to go to the zombie thing that you're talking about, because I suspect I may have a real-world story that is germane to this, if you're going to the same place I think you're going with it. But please, tell me more.

Sandy Bird: Yeah, so we were doing our research in building this flip-the-model-on-its-head approach and doing this Cloud Permissions Firewall instead of this, you know, let's fix every identity. And one of the statistics we started looking at was how many of these identities are completely unused.

So they have permissions attached to them. Some of them are really sensitive. Some of them are benign. But they just have something attached to them and they're sitting in the cloud and they're completely unused. And we took a, a large chunk of our customers, big, big, you know, enterprise customers that have, you know, thousands of accounts and then little small customers that have 10 or 15.

And there was this interesting stat that the longer you were in cloud, the more of these identities you had, which we nicknamed zombies, that were sitting there with all these permissions that weren't used. And it's really scary when you start looking at companies that were in cloud for more than five years, so they had history.

It was like 75 percent of the identities kicking around were unused. That high. It was insane how high it was. Some were worse, actually. That was an average. So it's really scary. It's pretty bad, actually, and all that stuff, of course, opens up risk in the environment.
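Editor's sketch: the zombie statistic can be computed from last-used timestamps per identity. The identity names, the data, and the 90-day default window are assumptions for illustration; the conversation does not say what window Sonrai uses.

```python
from datetime import datetime, timedelta, timezone


def find_zombies(last_used, now, window_days=90):
    """Return identities never used, or not used within the last `window_days` days."""
    cutoff = now - timedelta(days=window_days)
    return sorted(
        name for name, seen in last_used.items() if seen is None or seen < cutoff
    )


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
# Hypothetical identities; None means no recorded use at all.
last_used = {
    "ci-deploy": now - timedelta(days=2),
    "old-report": now - timedelta(days=400),
    "legacy-etl": None,
    "tmp-migration": now - timedelta(days=200),
}

zombies = find_zombies(last_used, now)
share = len(zombies) / len(last_used)
print(zombies, f"{share:.0%}")
```

The window is the sensitive parameter here: too short and infrequent jobs get flagged, too long and stale identities linger.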

Corey Quinn: Well, so does closing them, and that's the challenge I have around this, because depending on what your sampling window is, there are things that only run once a quarter, for example. So if it's not at least 90 days, you're going to catch some of those things out, and then you have some very frantic, very upset business people wondering why something isn't working.

But the one that I care about the most, from the old-world IT ops side of the world, is the break glass scripts, the things that you have sitting somewhere that don't normally run in the course of business. I have one now in my personal account where, for my dev box, everything is on my Tailscale network.

On the off chance that that isn't working for whatever reason, I can hit a Lambda endpoint with a pre-stored key, and all that does is change the security group to open up port 22, so I can SSH into the thing with an actual credential and continue from there. That is something that I don't think I've ever used, other than when I built it and tested it.

It's easy for something like this to view that as, oh, you don't need this around, and you're right, until suddenly I will very much need that in some weird networking circumstance, and it won't work. How do you avoid that trap?

Sandy Bird: Look, Corey, I think you nailed the last four years of my life. We have this great CIEM solution, Cloud Infrastructure Entitlements Management, another acronym by Gartner.

And, um, we've been trying to get people to clean up these zombie identities forever. And there's really kind of, you said two ways, and there's actually a bit of a third, which is part of your first solution, which is: the break glass accounts are never supposed to be used, as you said, and we should never get rid of them.

Corey Quinn: That's also a small handful of them, though, to be clear, as opposed to the huge amount of things that got spun up as detritus of other things.

Sandy Bird: Exactly, and I would argue, you really should know what they are. The second part is more like your solution, though, where, you know, you have another team that's built something, it might be a yearly report, it's named really weird, and you, as the cloud ops person at the top of the infrastructure, are like, I don't know what that thing is. I'd have to go find the team, I'd have to ask them questions, whatever.

And then part of that same scenario is your example, where you have this kind of complex configuration set up where, if I hit this Lambda endpoint, then that will do something which changes this, and there may be a resource group on that that trusts that Lambda function. And so it's this encompassing workload. It can get worse if it's an IAM user, which maybe you shouldn't be doing, but it has an access key, you know, cut on it.

And if you delete any of those things, what happens is not only do you delete the identity or the IAM user, you delete the access key, so you've lost the key material. You've removed all of the permissions, and now that identity that's trusted through some trust relationship on some other resource doesn't exist anymore.

And so if you had to put it back, you wouldn't even know how to do it. You know, you wouldn't have the original state.

Corey Quinn: This is the guidance I give customers when they're talking about, we don't think this thing's being used, but we're not sure. How do we find out? And it's like, well, if you turn it off and no one knows what it is and something breaks, that's going to be challenging.

And that's because there's really no "warn instead of reject" on a lot of these things. Great. Let's change security groups so nothing can talk to it and leave it there for some period of time. Check the instance role. Is it doing anything during that sampling period? And at some point, then go ahead and stop it without terminating it and let it go another period.

And then there's the scream test. When you block access to it, who screams? That's on some level sounds like what you're talking about.

Sandy Bird: It is exactly it. And so what we did was we basically said, that's fine. Let's leave all the permissions intact. And then basically short circuit it using a deny star, so it doesn't work anymore.

And what we did was, we have this, um, second part of our product, which we call Permissions on Demand. And what that does is it listens for the wake-up. So if it sees an attempt to be used after nine months, it sends a message via ChatOps: Slack, Teams, email if you're into that sort of thing, which maybe I am, but everyone else likes Slack.

You get this message that says, hey, this thing just tried to wake up. Do you want to reanimate the zombie? And if you do, you hit, yes, I want to reanimate it. The thing tries again and it's going to work. You could interrupt something, as you say, by turning it off. Who screamed? But you give this person screaming the ability to approve and turn the thing back on.

And then after some period of time, hopefully you do become comfortable and say, this thing's really not used. You should move it away. But you do have to put in the exemptions for, like, the break glass accounts, right? You know what those are. So we have in our product this way that you can actually put them in as exemptions.

And of course they will never get blocked. But I actually think one of the most powerful parts of the product is being able to remove that. Because what we find is that they show up in these lateral movement chains. So, you know, this identity can get to this identity, which then can get to this unused identity, and then it can do all kinds of havoc.

And by actually short circuiting them, they no longer laterally move through them.
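Editor's sketch: the quarantine-and-reanimate flow Sandy describes can be modeled as a small state machine. The class, method names, and identities are hypothetical; this models only the control flow (block, record the wake-up, approve, retry), not Sonrai's implementation or any chat integration.

```python
from dataclasses import dataclass, field


@dataclass
class Quarantine:
    exempt: set = field(default_factory=set)        # e.g. known break glass identities
    quarantined: set = field(default_factory=set)   # short-circuited, permissions left intact
    pending: list = field(default_factory=list)     # wake-up attempts awaiting approval

    def quarantine(self, identity):
        """Short-circuit an unused identity unless it is explicitly exempt."""
        if identity not in self.exempt:
            self.quarantined.add(identity)

    def attempt(self, identity, action):
        """Return True if the call goes through; otherwise record the wake-up attempt."""
        if identity in self.quarantined:
            self.pending.append((identity, action))  # this is where a chat message would be sent
            return False
        return True

    def approve(self, identity):
        """Reanimate the zombie: lift the block so the retry succeeds."""
        self.quarantined.discard(identity)
        self.pending = [p for p in self.pending if p[0] != identity]


q = Quarantine(exempt={"break-glass-ssh"})
q.quarantine("break-glass-ssh")   # ignored: exempt identities are never blocked
q.quarantine("yearly-report")

first_try = q.attempt("yearly-report", "s3:GetObject")
q.approve("yearly-report")
second_try = q.attempt("yearly-report", "s3:GetObject")
print(first_try, second_try)
```

Because the identity and its attached permissions are never deleted in this model, approving is a single state change rather than a rebuild from lost key material.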

Corey Quinn: Do you know what's more old school than blowing on a Nintendo cartridge to make it work? Manually creating individual policies to achieve least privilege in your cloud. Leave old habits in the past and lock down access to sensitive permissions and services without disrupting DevOps with a single click.

With the Cloud Permissions Firewall, you can easily restrict excessive permissions from human and machine identities, quarantine unused identities, and restrict specific regions and unused services with the click of a button. Start a 14-day free trial for Sonrai Cloud Permissions Firewall at sonrai.co/screaming. That's S-O-N-R-A-I dot C-O slash screaming.

It seems like it's one of those fun places that you can get lost in if you're not careful. It feels like this is something that works super well for certain scales of company. Because this sounds great even on my own test account, which is awesome. I can see it working at small to medium scale.

What I start to wonder is, at enterprise scale, where in some cases I have clients spending hundreds of millions a year across thousands of accounts, at that point it's so diffuse that it becomes difficult to reason about any of these things in any holistic way. Is there a sweet spot that you've found it's resonating with, or is this one of those rarities that actually does apply to theoretically every cloud customer?

Sandy Bird: How we came up with the solution is kind of interesting. And sometimes you have to get beat up a lot to figure out where you need to be in these things. And so we had our existing customer base. The people that really cared about least privilege were like large financials. They actually had the staff to put in place to kind of monitor that you're actually getting to least privilege and cut the tickets, and they could afford the extra developer time to do it right.

And so those customers, we found as a pattern, not only cared about least privilege, they were really good at writing, we'll use the example of AWS SCPs, Azure Policy, things like that, to basically block the undesired activity. But they probably had a team of people doing that. When we went to our customer base that was, we'll call them large-scale cloud, but not as highly governed or as highly mature, it was typically, you know, a team of four people that ran the whole cloud infrastructure, and they were responsible for everything end to end.

They didn't have the cycles to put into monitoring to get to least privilege. They didn't have the people to write SCPs. They didn't have that. And so the cloud was kind of a mess, a growing mess as it went. And so when we were building the solution, we were trying not to build it for that highly governed company with seven people writing SCPs. You know, they just knew what to do and they were doing it well.

We were trying to write it for that team that was, man, we're understaffed. We've got to get to least privilege from, you know, whatever compliance regime we're under. We're supposed to get to least privilege, but we can't do it. This gave them a way to get there fast and easy and didn't disrupt anything.

Because we have this option where we find all of the exemptions based on the history. We put those in automatically. And then you really only have to worry about day plus one, where you use permissions on demand. It's been interesting actually building the product and exposing it back to some of those larger, highly governed companies.

And what we found was, they too struggle with SCPs, because if you look at SCPs, there's SCP space limits. There's the number of them you can attach. There's all these weird constraints you have to work around. And some of the stuff we had to do to solve those problems is actually even applicable to them.

So by no means is this a for-everybody solution. If you're the purist and you can afford the staff to get to least privilege, I would agree, you should do that. That's the perfect way to do it. However, for the people that can't do that and can't achieve it, this is a much better solution at scale. What's neat about this is you can start in one account.

You can monitor the whole org and say, I'm just going to start in development in this one area, and then you can kind of work your way up through it. You don't have to do it on day one. And we've built the SCP scaling such that it works, you know, across thousands of accounts or across 10 accounts, whichever you happen to have.
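Editor's sketch: deriving exemptions from history, as Sandy describes, amounts to finding which identities have actually used sensitive permissions in the observed log window. The event tuples, identity names, and action sample below are invented for illustration; real input would come from the cloud provider's audit logs.

```python
from collections import defaultdict

# Illustrative sample only; not a real sensitive-permission list.
SENSITIVE = {"ec2:CreateInternetGateway", "s3:PutBucketPolicy"}


def derive_exemptions(history, sensitive_actions):
    """Map each identity to the sensitive actions it was observed using."""
    used = defaultdict(set)
    for identity, action in history:
        if action in sensitive_actions:
            used[identity].add(action)
    return {identity: sorted(actions) for identity, actions in used.items()}


# Hypothetical (identity, action) pairs read back from the log window.
history = [
    ("net-admin", "ec2:CreateInternetGateway"),
    ("net-admin", "ec2:DescribeInstances"),
    ("web-app", "s3:GetObject"),
    ("infra-ci", "s3:PutBucketPolicy"),
    ("infra-ci", "s3:PutBucketPolicy"),
]

exemptions = derive_exemptions(history, SENSITIVE)
print(exemptions)
```

Identities that never appear in the result hold no observed need for the sensitive actions, so a central deny would not interrupt what they have been doing.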

Corey Quinn: That's a neat approach. On some level, on paper, it sounds like if you use just the lens of AWS, they have a few offerings that make what you do irrelevant. They have the IAM Access Analyzer, which in turn now can generate policies based upon what you actually use. And that would be awesome and would basically be, like, well, why would I ever need to use what you've built?

Except for the fact it doesn't freaking work, or it works, but it doesn't go far enough. Where, oh, we saw that this role used the DynamoDB write table option. Okay, great. Can you tell me which table it was writing to? No! Go guess! Then what's the point? Like, you don't get to be specific enough.

Like, what I would love to see is something that auto-generates a policy of, okay, based upon our observed behavior during the capture window, you're able to write to the following S3 keys. Like, okay, great, let's back that up a little bit, give it a prefix or a bucket or something. But yeah, that's the direction.

Let me broaden it, because otherwise you wind up in the hell that I'm still in with one of my CodeBuild roles that does deployments, where it has full access to spin things up in a given account. To be clear, this is for my newsletter stuff. This is not for my production stuff touching client data. Different universes here.

But yeah, it still has full access, because every time I've tried to dial it in, it's a problem. First it has the ongoing updates to things when it does deployments, and that's one permission set, but it needed a separate permission set entirely to provision those things in the first place, the first time I ran it.

So there's a question of, great, how do I dial those in? It's okay to discard those extra permissions now, but every time I thought I had it working, I'd make one small change and boom, I'm back to square one. So I gave up.
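Editor's sketch: the "back that up a little bit, give it a prefix" step Corey asks for can be approximated by collapsing observed S3 object keys to a fixed prefix depth. The bucket and key names are hypothetical, and this is not something IAM Access Analyzer is claimed to do; it only illustrates the generalization.

```python
def generalize_s3_keys(observed, depth=1):
    """Collapse observed (bucket, key) pairs into S3 ARN patterns at a given prefix depth."""
    patterns = set()
    for bucket, key in observed:
        parts = key.split("/")
        if len(parts) > depth:
            # Keep the first `depth` path segments and wildcard the rest.
            prefix = "/".join(parts[:depth])
            patterns.add(f"arn:aws:s3:::{bucket}/{prefix}/*")
        else:
            # Key is too shallow to generalize; keep it exact.
            patterns.add(f"arn:aws:s3:::{bucket}/{key}")
    return sorted(patterns)


# Hypothetical writes observed during a capture window.
observed = [
    ("newsletter-assets", "issues/2024/01.html"),
    ("newsletter-assets", "issues/2024/02.html"),
    ("newsletter-assets", "index.html"),
]

resources = generalize_s3_keys(observed)
print(resources)
```

A larger `depth` yields tighter patterns that break more often on new keys; a smaller one is looser but survives the "one small change" Corey mentions.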

Sandy Bird: Yep. And it's a common pattern. Again, I lived the last four years of my life before this particular new product thinking I wanted to be that purist too, right?

I want to get everything absolutely perfect. And then after looking at these customers struggle with, you know, some of these accounts are huge, right? You know, 50,000-plus identities that have sensitive permissions and haven't used them in the last, whatever. We failed, and we had to get to a better model.

The only way to do that was to start with a smaller subset. We couldn't do every single permission, right? But you could do the sensitive ones. And by doing the sensitive ones, you could remove that.

Corey Quinn: Oh, everything you have, the important stuff gets buried under an avalanche of random trivia.

Sandy Bird: And I also think what's interesting about, you know, you look at your problem and when you're looking at, you know, the DynamoDB tables or the S3 keys or the prefixes, I don't even think half the people know, like, you have to turn on extra auditing even to see that stuff.

And not every service on Amazon even supports that auditing. So, you know, doing it is super hard.

Corey Quinn: Oh, and you need data events to get that logged in CloudTrail. I've done the numbers on this. The API call to read an object in S3 will show up there in the CloudTrail data events, and it will cost 20 times as much as the API call for the read.

Which, okay, but that's not gonna solve every problem for everyone. I understand that there's value in security and some things should be paid for, but I firmly believe that providers should not be charging extra for things that only they can provide. If they want to go head to head on, we'll ingest Syslog and do these analytics things, yeah, by all means, charge away for that.

Because I have a half dozen options and, honestly, I still like awk with occasional grep tied to it. And that gets me surprisingly far, especially if I sprinkle in some Perl. And that's great. Other times you can send it to Datadog or Splunk, if you have a spare princess lying around that you can ransom back to whatever kingdom she's from.

Awesome. Like, that's an open playing field, but I've got to pay for these CloudTrail audit events because there's no second option for me to pay someone for that.

Sandy Bird: No, it's amazing when you look at the amount, and you know, there's other quirks in there too, as we're talking to your audience, right?

If you run two CloudTrails, now you really, really get burned, because you only get one for free and the second one costs more, and then you're storing the data twice.

Corey Quinn: You ever see CloudTrail paid events? It's usually a sign that something's misconfigured. Very occasionally, especially at finance institutions, I see security teams want an unadulterated CloudTrail that no one else can see, for whatever reason, and they refuse to share it onward.

Cool. I can tell you down to the penny what that cost you last month. Great. Make your own decisions. I'm here to advise. I'm not here to make decisions for you. You clearly have context I don't. That's the nature of respecting your customers' businesses. But it's frustrating to see that misconfiguration. It feels like a tax on not knowing this one weird piece of trivia about AWS.

Sandy Bird: Yeah. Anyway, again, this all led us down this path to say, if we're going to try to fix this for the average customer, right, that doesn't have the team from those large financials that can justify four CloudTrails for some reason, we had to do it in a way that was, you know, click a button, make the thing happen.

And the way that that, you know, worked for us was we did a lot of statistics across these sensitive permissions. So we did a lot of work figuring out what those 3,000 sensitive permissions were. When we looked across our customers, we did throw a few of them out that were called a lot, and called by disparate identities.

So it would have been a lot of different identities doing this thing. And we said, well, that's too many Permissions on Demand requests. You'd have to approve it too many times. And we kind of graded ourselves to say, anything we're going to put in this list has to be something that's called rarely. It can be called a lot by a single identity.

But the number of unique identities that call it in a period of time needs to be somewhat, um, small. It's not that it's never called at all, but it can't have a hundred different identities in one account calling it. And that was kind of this guiding light to us to say, okay, well, you know what? You create your internet gateway one time, and you never really touch it again.

So that's a sensitive permission. You know, you do things like, um, you know, you would say, well, decrypt must be sensitive. Uh, decrypt gets called by everything in your cloud, so we can't use decrypt as a sensitive permission. Right. And so you use this as your guiding light to figure out what these are and what's not. Azure has some crazy ones, by the way.

There's stuff in Azure that allows you to take, like, a file system off a running VM and make it a URL on the internet.

Corey Quinn: New database just dropped, I guess.

Sandy Bird: Yeah, exactly. And so, you know, you have to look at these permissions and know what's sensitive and what's not. And, uh, anyway, so we spent a lot of time on that.

It was a, it was a fun exercise for sure.
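The selection rule Sandy describes (a permission may be called often, but only by a small number of distinct identities per account) can be sketched roughly as follows. This is a minimal illustration, not Sonrai's implementation: the function name, the tuple layout of the events, and the threshold of 5 are all assumptions, since the conversation only says the count needs to be "somewhat small".

```python
from collections import defaultdict

# Hypothetical threshold: the conversation only says the number of unique
# callers per account must be "somewhat small".
MAX_UNIQUE_CALLERS = 5


def low_fanout_actions(events, candidates, max_callers=MAX_UNIQUE_CALLERS):
    """Keep candidate actions that are not called by many identities in any account.

    events: iterable of (account_id, identity, action) tuples from some time window.
    """
    callers = defaultdict(set)  # (account, action) -> identities that called it
    for account, identity, action in events:
        if action in candidates:
            callers[(account, action)].add(identity)

    # An action is dropped if any single account has too many distinct callers.
    too_noisy = {
        action for (_, action), ids in callers.items() if len(ids) > max_callers
    }
    return set(candidates) - too_noisy


# kms:Decrypt is called by 100 different roles; the gateway call is made
# 50 times, but always by the same identity.
events = [("111", f"role-{i}", "kms:Decrypt") for i in range(100)]
events += [("111", "terraform", "ec2:CreateInternetGateway")] * 50
kept = low_fanout_actions(events, {"kms:Decrypt", "ec2:CreateInternetGateway"})
print(kept)
```

With this toy data, the decrypt permission is thrown out for having too many distinct callers, while the internet gateway permission stays on the list even though it is called many times.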

Corey Quinn: I imagine it would have to be. It's, how do you wind up then handling the provisioning of permissions that need to exist all the time? Because part, an aspect of what you do, to my understanding, is the concept of permissions on demand.

Sandy Bird: And so what we do is, and this is back to those statistics, which are so interesting across this. When we looked at what gets provisioned that has sensitive permissions, and we'll use your AWS example because we've used them before, like EC2 star, Lambda star, like, I couldn't figure out how to get it to work, so I gave it a bunch of services with star, it started to work and I moved on.

So in those scenarios, in every one of those services, whether it's Lambda or EC2 or whatever it happened to be, CloudTrail, um, there were some number of sensitive permissions there. EC2 has, you know, 40-some of them. You can go down to Lambda; it has, I think, 15. Like, every one of these has some number of them.

And so we said, okay, we are trying to solve the 92 percent that don't use them. What are we going to do with the 8 percent that do use them? Right? And so what we did was, when we initially onboard into the account, we find that centralized cloud org trail and we read backwards in time, just like IAM Access Analyzer does.

We find all of the identities that would use it and have used it, and we suggest those as exemptions. But we tell you the last time they were used: was it used three months ago, or was it used yesterday? Right? So you get some history in that and we build that exemption list so that when you hit the protect button and it removes the 92 percent and leaves the eight, the eight are already there.

So you don't have to go and approve those ones. They, you know, you previously approved them by giving them EC2 star. We just said they can continue to do it. But the 92 percent are now off. They don't get to use sensitive permissions anymore. But they can continue to work like they always did because they don't use sensitive permissions anyway, so all their regular workloads work.
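As a rough sketch of the exemption list described above: walk the trail history, keep only sensitive actions, and record for each identity when it last used each one. The function name and the tuple layout are assumptions for illustration; real trail records carry far more fields.

```python
from datetime import datetime


def suggest_exemptions(events, sensitive_actions):
    """Map each identity to the sensitive actions it has used and the last use time.

    events: iterable of (timestamp, identity, action) tuples read from trail history.
    """
    exemptions = {}
    for ts, identity, action in events:
        if action not in sensitive_actions:
            continue  # ordinary calls never need an exemption
        used = exemptions.setdefault(identity, {})
        if action not in used or ts > used[action]:
            used[action] = ts  # keep the most recent use
    return exemptions


history = [
    (datetime(2024, 1, 5), "terraform-role", "ec2:CreateInternetGateway"),
    (datetime(2024, 3, 1), "terraform-role", "ec2:CreateInternetGateway"),
    (datetime(2024, 3, 2), "app-role", "s3:GetObject"),
]
exemptions = suggest_exemptions(history, {"ec2:CreateInternetGateway"})
for identity, used in exemptions.items():
    for action, last_used in used.items():
        print(identity, action, "last used", last_used.date())
```

Identities that never touched a sensitive action (the 92 percent in the conversation) simply do not appear in the result, and the last-used date lets a reviewer tell "three months ago" from "yesterday".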

However, as soon as they try to use one, so if something in the 92 percent all of a sudden tries to create an internet gateway, which is suspicious in itself, but it does it, um, we hook on that and we know that that deny just happened. And we have this approval tree, which basically says you can set up for any different zone.

We'll use accounts in AWS and projects in GCP. Like who's the owner of that that has to approve this? That team gets notified in a Slack app or a Teams app. Hey, this Terraform role just woke up and tried to create an internet gateway. Do you want to allow that to happen? They hit approve. We make a slight change in the cloud.

Think about ABAC access. Um, all of a sudden that now can do it. And if they run the Terraform again, it'll work. And the idea is, is the team that's doing the notifications and the approvals can be the same team for self approval, or it can be escalated up one level.
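Sandy only hints at the mechanism ("think about ABAC access"), so the following is a guess at the general shape rather than a description of what Sonrai deploys. One way to express "denied until approved" with attribute-based access control in AWS is a deny statement conditioned on a principal tag. The tag key `sensitive-ec2-approved` and the function name are hypothetical; the `aws:PrincipalTag/...` condition key and the `StringNotEquals` operator are standard IAM policy elements.

```python
import json


def deny_unless_tagged(actions, tag_key, tag_value):
    """Build a policy document that denies `actions` unless the caller carries a tag."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenySensitiveUnlessApproved",
                "Effect": "Deny",
                "Action": sorted(actions),
                "Resource": "*",
                # A principal without the tag also satisfies StringNotEquals,
                # so untagged callers are denied as well.
                "Condition": {
                    "StringNotEquals": {f"aws:PrincipalTag/{tag_key}": tag_value}
                },
            }
        ],
    }


# Hypothetical tag key, used only for illustration.
policy = deny_unless_tagged(
    {"ec2:CreateInternetGateway"}, "sensitive-ec2-approved", "true"
)
print(json.dumps(policy, indent=2))
```

Under this assumption, the "slight change in the cloud" on approval would amount to tagging the role, after which rerunning the Terraform succeeds without editing the policy itself.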

Corey Quinn: In your dev account, you should be able to approve yourself to do almost anything.

There should be, like, larger SCPs that stop you from things. But other than that, yeah. Whereas in production, what you're able to do should be highly constrained in most typical scaled-out companies. Like, it's going to be a bit different at Twitter for Pets, the two-person startup, versus, uh, you know, a large bank.

There, the, who can do what and the risk blast radius is going to be somewhat distinct. But, you know, begin as you mean to go on.
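The approval tree from this exchange (self-approval in a dev account, escalation one level up in production) could be modeled along these lines. The `Zone` type, its field names, and the channel names are all hypothetical; the conversation only says each account or project has an owner who is notified in Slack or Teams.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    """An approval zone, e.g. an AWS account or a GCP project (hypothetical model)."""

    name: str
    owner_channel: str       # chat channel of the team that owns the zone
    escalation_channel: str  # channel one level up from the owning team
    self_approve: bool       # True: the requesting team approves its own requests


def approval_channel(zone: Zone) -> str:
    """Pick where a permissions-on-demand request for this zone gets sent."""
    return zone.owner_channel if zone.self_approve else zone.escalation_channel


dev = Zone("dev", "#team-payments", "#cloud-security", self_approve=True)
prod = Zone("prod", "#team-payments", "#cloud-security", self_approve=False)
print(approval_channel(dev), approval_channel(prod))
```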

Sandy Bird: So, again, it gives you this great starting point. You get everything kind of locked down in a hurry, and then, because you can get the permissions back very quickly, it's literally, and if it's in self approval, it's literally Slack message approved, run the thing again.

Um, it doesn't create much friction for the dev team, so they kind of like it. It's unlike the, we had one customer as a design partner that was like, I love this story. Everybody here has contributor in Azure because the process is the same for getting contributor as it is for getting any least privileged role.

So why would you ask for anything less, right? You know, and you, so you created this, um, friction for getting any access. And so now it's so hard, everybody just asks for more than they need. And what this does is allows you, you can provision it with more, but until you get that really low friction approval, you won't be able to use it.

Corey Quinn: I might have accidentally discovered upon a source of confirming some anecdata I was curious about. Someone attempts to do something in an AWS account. Their role does not let them do it. The approval pops up. Um, but first off, what is the time lag on that rejection hitting? Because historically, CloudTrail was racing, you know, the fossil record, and it's gotten better, but not perfect.

Sandy Bird: Yep, and you cannot use, like, as an example, let's say you're writing to a centralized bucket somewhere, and you were to look in that CloudTrail for these events, it's way too delayed. They say it can be as bad as 20 minutes. It's not that bad, but it is bad.

Corey Quinn: It used to be. It's not anymore.

Sandy Bird: Yeah, it's, it's still bad.

And so you have to use other mechanisms in the cloud to hook on these things. Used to be CloudWatch. There's EventBridge. And so there's ways that you can hook onto these very special events earlier in the cycle before they're ever written. And so you have to find other ways to hook them. You can't actually do it using the standard CloudTrail mechanisms.
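A handler on the receiving end of such a hook might filter for denied sensitive calls roughly like this. It is a sketch under assumptions: the event shape follows the CloudTrail record that EventBridge wraps in a `detail` object (`eventSource`, `eventName`, `errorCode`, `userIdentity.arn` are real CloudTrail fields), but the set of error codes is illustrative because services report denials differently, and the function name is hypothetical.

```python
# Assumption: denial error codes vary by service. EC2, for example, reports
# "Client.UnauthorizedOperation" while many others report "AccessDenied".
DENIED_CODES = {"AccessDenied", "Client.UnauthorizedOperation"}


def denied_sensitive_call(event, sensitive_actions):
    """Return (principal_arn, action) if the event is a denied sensitive call, else None."""
    detail = event.get("detail", {})
    if detail.get("errorCode") not in DENIED_CODES:
        return None
    # "ec2.amazonaws.com" -> "ec2". This shortcut is not exact for every
    # service, since some event sources differ from the IAM action prefix.
    service = detail.get("eventSource", "").split(".")[0]
    action = f"{service}:{detail.get('eventName', '')}"
    if action not in sensitive_actions:
        return None
    return detail.get("userIdentity", {}).get("arn"), action


sample_event = {
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {
        "eventSource": "ec2.amazonaws.com",
        "eventName": "CreateInternetGateway",
        "errorCode": "Client.UnauthorizedOperation",
        "userIdentity": {
            "arn": "arn:aws:sts::111122223333:assumed-role/terraform/session"
        },
    },
}
hit = denied_sensitive_call(sample_event, {"ec2:CreateInternetGateway"})
print(hit)
```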

Otherwise, that delay is way too long. We, um, again, when all of the stars align, they happen in like four seconds. When all the stars don't align, it's still under a minute, so it's very fast.

Corey Quinn: Which is fair, otherwise you have a 10 minute cycle time every time someone thinks that it's the permissions thing, but no, no, they just have the wrong endpoint or something.

Uh, the second question I have for you is, okay, they get denied. The, uh, they click the approve button, which I assume hits the API more or less, uh, more or less synchronously, and then it winds up enabling that on that role. From that being time zero, how long does it take until the change is actually reflected in the role and the thing can go through?

Is it an atomic transaction? Is there a replication delay? And if so, how long is that delay?

Sandy Bird: That actually happens super, super quick. Um, you know, think about things going through SQS queues and stuff. It's super fast. It happens in...

Corey Quinn: Okay, so there is a delay, but it's not massive. Okay, not massive, because I've often wondered, when I do things that look like they should be working and they're not:

okay, well, maybe there's some IAM replication lag going on here. And I have never found that to be true that I'm aware of. I'm sure there's been once or twice, especially in far-flung regions, but yeah, the big problem has been, no, no, I'm just bad at computering.

Sandy Bird: It is. It's interesting in AWS.

In the development community, there's a lot of talk about, uh, eventually consistent.

Corey Quinn: It puts the eventually in the eventual consistency.

Sandy Bird: Yeah, exactly. And so there is some of that in AWS. Generally it happens within that 10 to 15 seconds, right? I've had scenarios where it's slightly outside that range.

It's generally not with things like, you know, we'll say, adding a policy to a role and attaching it. It usually doesn't take more than 10 seconds ever for that to be effective. Um, so, you know, in some really crazy busy account, maybe it hits 15 seconds or something, but they're, they're pretty good.

Corey Quinn: I suspect there would probably be a larger latency delay if you're using this to manage IAM roles on an AWS Outpost.

Sandy Bird: Oh yeah, I would think so, yeah.
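Given the 10 to 15 second propagation window mentioned a moment earlier, a caller that has just been approved might retry a denied operation for a short while before giving up. This is a generic retry sketch, not anything Sonrai or AWS ships; the function name and defaults are assumptions, with six attempts three seconds apart chosen so the total wait is about 15 seconds.

```python
import time


def retry_after_approval(operation, is_denied, attempts=6, delay_seconds=3.0,
                         sleep=time.sleep):
    """Call `operation` until it stops being denied or the attempts run out.

    is_denied: callable that says whether an exception is a permission denial.
    sleep: injectable so the wait can be stubbed out in tests.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as exc:
            # Re-raise anything that is not a denial, and the final denial too.
            if not is_denied(exc) or attempt == attempts - 1:
                raise
            sleep(delay_seconds)
```

Only denials are retried, which matters for the case Corey raises: if the real problem is a wrong endpoint rather than permissions, the error surfaces immediately instead of after a long wait.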

Corey Quinn: There's got to be a sync and caching story, because if you yank the cable out of the back of it, it still needs to be able to authenticate and do its thing. Whether that's constant online or batched updates, I don't know. They haven't given me one of them to play with yet, because one of the requirements is enterprise support, which I might be able to talk around.

And yeah, there's a loading dock attached to my house, which I'm having some trouble with.

Sandy Bird: Yeah, exactly. You'll need to roll a rack in there somewhere, so. We used to put, you know, racks, small racks under our desks, didn't we? That was, that was something from 10 years ago. I don't think we do that anymore.

Corey Quinn: Oh, yeah.

Uh, people still do with Mac minis for the build servers. They usually call them Bruno because when the auditor comes around, you do not talk about Bruno. But the benefit now is that with their, with the EC2 Mac instances, yes, it winds up being, uh, hundreds of bucks a month. But, okay, fine. For a build server that I can now treat like everything else?

Cheap at twice the price. Security is one of those fun things. I guess I have to wonder though, how do you avoid being the inherent scapegoat for every time something doesn't work in a cloud account, which is all the time. Because you're now the department of literal no here, where you say no, you cannot do that thing.

How do you avoid being the constant blame target?

Sandy Bird: Yeah, there was, uh, when we were going through this, there were two parts to that. One is, we were, you know, the people putting least privilege policies in for years anyway, so we learned a lot. As an example, if you were to compare our least privilege policy with the IAM Access Analyzer one, we are not quite so restrictive as they are, because the reality is we know that even humans and workloads have things that are somewhat similar that they do, that they don't do all the time.

So you should put that stuff together. So we got better at not being the no as often. But that doesn't solve the problem. You're still the department of no, because you're absolutely denying something that somebody is now trying to do and they need to do to get their job done. And so when we did this, the reason this permissions on demand uses this ChatOps method is that it's communicating back to the team that's doing the work instantly.

You know, within seconds, you're getting notified: the thing you just did, we said no to. And if you are the approver, press this button and you can continue. Or if you're not the approver, here's where it went; you need to talk to Joe. A huge part of the solution was making sure that that whole cycle, end to end, from the time that you were actually denied in the cloud and you're now sitting there staring at an error message, to the point of you getting notified in your other channel and somebody hitting approve, could happen in, we set a goal for ourselves, less than one minute. Has to be.

That whole cycle must be less than one minute's interruption in your day. And now, again, if you talk about that big bank with the large level approval in the prod account, I agree that the approver may not press the button in a minute, but, again, uh, you'll probably have to ask some questions.

But the, the actual software time end to end really is less than a minute. And so, that's how we got out of it. We, um, you have to say no sometimes. We are security people. That's what we do, and we do it for the right reasons. But if you can get the workflow where the work happens quickly and they can get out of jail quickly, it doesn't become an impediment to them.

And um, we did a few other tricks too in the sensitive permissions to do some groupings to say, if you're using this one, you're going to use these other four too, so we might as well give them back to you at the same time. So we did some tricks there where you didn't keep running into the same block over and over and over in these workloads.
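The grouping trick can be pictured as a lookup that widens one approved permission into the set that tends to travel with it. The bundle below is purely illustrative: the action names are real EC2 actions, but which permissions Sonrai actually groups together is not stated in the conversation, and the function name is hypothetical.

```python
# Hypothetical grouping: real EC2 action names, illustrative bundle.
PERMISSION_GROUPS = [
    {
        "ec2:CreateInternetGateway",
        "ec2:AttachInternetGateway",
        "ec2:DetachInternetGateway",
        "ec2:DeleteInternetGateway",
    },
]


def expand_approval(action):
    """Return every permission to grant when `action` is approved."""
    for group in PERMISSION_GROUPS:
        if action in group:
            return set(group)  # copy, so callers cannot mutate the group
    return {action}  # ungrouped permissions are granted on their own


granted = expand_approval("ec2:CreateInternetGateway")
print(sorted(granted))
```

Approving the first denied call then unblocks the related ones in the same step, so a workload does not hit a fresh denial on each subsequent call.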

Corey Quinn: I really like the story about what you've built. If people want to learn more, where's the best place for them to find you?

Sandy Bird: Look, we have, and this is another big change for Sonrai Security. We were typically selling to these large banks and financials, and we were very much an enterprise security sell at that point.

Corey Quinn: But nowadays you start at a hundred bucks a month. So that's not unreasonable.

Sandy Bird: Completely different. We decided to say, look, we want to help people with 10 accounts. And so pricing's on the website. There's a free 14-day trial for anyone that wants to try it. And by the way, 14 days gives you enough to onboard it.

We will see all of your history that you've done before. We'll find your exceptions, so in monitor mode, you can see it. And you can try it in a dev account and get the permissions on demand working all in that 14 days. You'll know if it works. It's awesome. Um, and so super easy to do that from the website.

There's a click through demo on the website. I always say the sales guys have a block in it somewhere in the middle where they make you put your email in it if you're on the 14th click, which is sort of annoying, but well, but they're sales people, so that's what they should do.

Corey Quinn: That's what tagged email addresses are for.

Sandy Bird: Yeah, exactly. Exactly. Plus something on your Google thing. So it's super easy. Just come to the website. All the great stuff there. There's some good blog content there on the sensitive permissions and how we did that and lots of identity stuff.

Corey Quinn: Awesome. And we'll of course put a link to that in the show notes.

Thank you so much for taking the time to speak with me about this. I really appreciate it.

Sandy Bird: Thank you, Corey. It's been great.

Corey Quinn: Sandy Bird, co-founder and CTO of Sonrai Security. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you enjoyed this podcast, please leave a five-star review on your podcast platform of choice.

Whereas if you hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry, insulting comment that I will no doubt use as a database of sorts because your podcast platform of choice almost certainly did not pay attention to least privilege.
