Chaos Engineering for Gremlins with Jason Yee

Episode Summary

Jason Yee is the director of advocacy at Gremlin, an enterprise-grade chaos engineering platform. Prior to this role, he worked as a senior technical evangelist at Datadog, a community manager for ops, performance and security at O’Reilly Media, a software engineer at MongoDB, and a senior developer at OpenSourcery, among other positions.

Join Corey and Jason as they talk about what Gremlin is and what a director of advocacy does, making chaos engineering more accessible for the masses, how it’s hard to calculate ROI for developer advocates, how developer advocacy and DevRel changes from one company to the next, why developer advocates need to focus on meaningful connections, why you should start chaos engineering as a mental game, qualities to look for in good developer advocates, the Break Things On Purpose podcast, and more.

Episode Show Notes & Transcript

About Jason

Jason Yee is Director of Advocacy at Gremlin where he helps companies build more resilient systems by learning from how they fail. He also leads the internal Chaos Engineering practices to make Gremlin more reliable. Previously, he worked at Datadog, O’Reilly Media, and MongoDB. His pandemic-coping activities include drinking whiskey, cooking everything in a waffle iron, and making craft chocolate.

Links:

Break Things On Purpose podcast: https://www.gremlin.com/podcast/
Twitter: https://twitter.com/gitbisect

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is sponsored by ExtraHop. ExtraHop provides threat detection and response for the Enterprise (not the starship). On-prem security doesn’t translate well to cloud or multi-cloud environments, and that’s not even counting IoT. ExtraHop automatically discovers everything inside the perimeter, including your cloud workloads and IoT devices, detects these threats up to 35 percent faster, and helps you act immediately. Ask for a free trial of detection and response for AWS today at extrahop.com/trial.

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.

Corey: Jason, thanks for joining me.

Jason: Thanks for having me, Corey.

Corey: So, you’re one of those people that we’ve always passed at conferences and other events, sort of like ships in the night. We hang out in group settings, but strangely, for whatever reason, despite traveling in the same circles for years now, we’ve never really sat down and had an in-depth conversation with each other, to the point where I feel like both of us are sort of wondering on some level, “Does he just not like me?” It’s been one of those items for me of, I want to catch up with Jason at some point and learn what makes him tick. And then the pandemic happened. Well, no more. Thank you for talking to me.

Jason: Yeah. And again, thanks for having me. I’ve always felt the same way. We’re always at these speaker dinners, or just hanging out with friends, and for some reason, I’m, like, at one end of the table, and you’re at the other. And we’ve just never had this opportunity.

Corey: Exactly. Because you actually do a lot of good in the community, and I’m usually at the kids’ table. Which is, frankly, what happens, and honestly, it’s the right call. But you and I, I guess, are aligned in a few weird and interesting ways. And, well, let’s talk about what you do. You’re the Director of Advocacy at Gremlin. What is Gremlin, first off, and then what does a Director of Advocacy really do?

Jason: So, Gremlin is a chaos engineering platform, or a reliability platform as we’re trying to sell it now. Because we started out doing chaos engineering, so some of the folks that were doing chaos engineering back at Netflix and back at Amazon, decided, most people aren’t Netflix, most people aren’t Amazon; let’s build something that everybody can use. So, Kolton and Forni, our founders, got together, they started this up. And the idea is really, how can we help people make things more reliable? And obviously, chaos engineering is one of those ways, so that’s what they started off with.

And we’ve got a platform that really just makes that easy and safe to do. So, the second question about what is Director of Advocacy? I know you like to make fun of AWS naming, and I feel like it is sort of a weird, nonsense name because it doesn’t actually explain anything. But essentially, it’s developer relations. So, I have the task of talking to all sorts of folks who aren’t customers—really, just anybody in tech—about chaos engineering and why they should be doing it, and how to make applications and systems more reliable.

And then, aside from that, I also get to interact with our customers and help them out. So, I’m a combination of customer success or success engineer slash support slash the advocate side is advocating for their needs within the organization. So, when they make a product request, I pass that on, see what we can do about that. So, it’s sort of a mishmash of all these different roles.

Corey: I want to draw a bit of a parallel from that DevRel slash advocacy slash evangelism universe to the sysadmin world, where we then started calling ourselves DevOps, and that led to an enormous schism around whether DevOps is a job title or not. “No, but it pays a lot better, so yes.” Then SRE. “Well, you’re not a real SRE,” and the rest. It comes down to quibbling over the definition of terms instead of, you know, doing work. And I feel like, on some level, the whole DevRel space has, in some respects, gotten twisted around something that resembles the same axle. Is that unfair?

Jason: No, that’s absolutely correct. There is that question of what is DevRel? How do you define it? And part of that is how do I justify my job? And on top of that, how did—at least pre-pandemic, how do I justify the company spending tens of thousands, if not hundreds of thousands of dollars, not only for my salary but to fly me around the world to get on stage and say things.

Corey: Right. And it looks, from a distance, an awful lot like, okay, you cost as much as an engineer, you don’t write any code to make what we do any better. Your expense budget is about the same as your salary in some cases, and then you travel far away to what looks like a giant party to hang out with your friends. And you get on stage and say, “I work at company X. Thanks. They’re great. Now, for the next 45 minutes, let’s talk about the right standing desk for you.” And it becomes a very difficult sell internally. And for a group that prides itself on advocating for its company, they don’t often seem to do as good of a job advocating for themselves internally.

Jason: Absolutely. There’s always the discussion of KPIs. How do we measure the impact of what developer evangelism, DevRel does? And it’s a hard thing, partly because every company is a little bit different. Because nobody’s really defined this, DevRel often is very fluid and just fills in the cracks of whatever a company needs.

So, for some companies that might be doing support, right? I’ve heard people being called DevRel, and they literally are just on forums all day answering questions, or writing documentation, or speaking. So, it’s really just this nebulous thing of whatever a company needs.

Corey: It becomes almost this weird expression, in some respects, of marketing. Of course, a lot of DevRel folks will scramble at the objection, “Oh, we are not in marketing.” And that’s always said with a very sneering tone towards marketing because those people are terrible. I argue that marketing is, A) wildly misunderstood, B) incredibly valuable, and C) where DevRel in many respects finds its spiritual home because it’s very hard to tie your marketing budget as a company to definable results and do attribution effectively, but there’s clear value to the company in things that can’t necessarily be measured, or at least not without a heck of a lot of work. That is the piece, in many respects, that DevRel is missing. But the first thing that they want to make clear is that we don’t work for marketing. It’s a very weird feeling.

Jason: It’s very weird because as I explain that DevRel often is filling in the cracks and is very fluid, that’s because my personal perspective of DevRel is inclusive. I try to get involved in as many teams as I can, so I’m constantly working with engineering, and with marketing, and with customer success, and really everybody. And then on the flip side, you have people that define it by what it’s not. I’m not marketing, I’m not this. And you end up cutting yourself off.

Corey: And neither are you an accountant, but I didn’t ask if you were, so yeah.

Jason: But at the same time, you’re not an accountant, but you should have some sort of notion of what the finances of the company are because that gives you some sort of indication on whether you’re going to get laid off, for one, but also just for the success of the company. And I think maybe it’s just the engineering mindset that I’ve had from being an engineer: you take everything and you try to learn everything that you can and put it together. And so, for me, that comes from having experience working in marketing, having experience working in engineering; how can I put these things that I know together to solve a problem? So, rather than saying, “I’m not marketing,” I’m going to ignore that because as you mentioned, marketing’s super valuable, especially the way that they’ve done data-driven marketing now. It used to be like the Mad Men days: you’d throw up a billboard, and who knows if it works, but you paid a bunch of money for it. And now they’re so data-driven, and everything’s tracked. And, yeah, you may not be able to directly connect a few things, but you get a much better sense of where your value is, and where your time should be spent.

Corey: Absolutely. And you can get, I don’t know, 80% of the way there, and then the last 20% will drive you mad, so at some point, you just shrug, give up, and that’s okay. Similar in many respects to an AWS bill. It just becomes such a weird process to explore. And from a certain lens, when you have those cross-cutting functional types who are doing DevRel, they start to sound almost like enthusiastic amateurs in the various disciplines that they bring together.

“Yes, I’m an engineer, but not as deep on the engineering side, as some of my colleagues who do engineering 40 hours a week and then some.” “Oh, we're part of product.” But strangely, to work in product you usually have significant experience and training in how to conduct user experience studies and user interviews, whereas an awful lot of the DevRel input back to product is ‘word on the street style’ stuff.

Jason: Yeah. And both are extremely valuable. It’s obviously very valuable to have that process of doing user studies and actually getting that hard data, but as we all know, that word on the street and what’s the general vibe of folks at a conference or folks at a meetup really informs things that usually don’t get asked in those formal user studies.

Corey: Completely. And telling stories from my own world, back when I was, you know, having a real job and able to be fired by a whole bunch of different people—and was—there was the constant justification story of why should you go to that conference and speak? Why would we spend that money? Why shouldn’t it just be a personal thing that you take vacation for? Now that I own the company, it’s a different story because I know that when I go out and participate in the community, good things happen, but I don’t have the need anymore to justify it, other than to myself and possibly to my business partner.

There are very real stories that I’ve looked at here where I go to a conference, I start talking to someone, we keep in touch, they wind up changing companies, we continue to talk, suddenly, they have an AWS bill problem, and now they become a customer. Yeah, it turns out that’s super hard to predict when you’re looking at flight prices to go to that conference in the first place. And there are many other conferences that nothing came out of it, I think, but you never really know.

Jason: Yeah. One of the nice things about my job and one of the reasons that I joined Gremlin was the idea that chaos engineering is still pretty new. And so in my past experience with DevRel, it very much was your exact experience; how has what you said on stage or the introduction of our brand to an audience made an impact? And since chaos engineering has been so new, I’ve gotten to take a little bit of a step back from that. Obviously, I want people to get Gremlin or to try Gremlin, but even if folks just try chaos engineering and have a better understanding of it, that’s a big goal of my job. That means that I win if you try chaos engineering, even if that’s with an open-source tool. So, that’s one of the reasons that I’m super happy about where I’m at right now in terms of DevRel is, I get to be DevRel for an entire practice, rather than just a company.

Corey: And, on some level, you get to define what success and failure looks like among your team. But turn it around for a second; how do you wind up articulating the value and story of what you do to the larger business? Because I’ve seen the approach if you can’t measure DevRel that way—regardless of what that way is—and it’s always this, don’t ask us for metrics. Don’t ask us to really, functionally, be accountable for much. And from a business strategic point of view, where you’re not deeply involved with aspects of what that leads to, “Okay, so it rounds to zero, and wow, I’m spending an awful lot of money on something that doesn’t really add any value. I could spend that money on things that do instead.” And then you see a bunch of negative things happen. Like, as soon as there’s a layoff or a downturn, that entire group winds up getting decimated in some cases, even when, in reality, that’s the thing that should be invested in the most.

Jason: Absolutely, yeah. One of the things that I’ve always loved is people talk about metrics. And yes, we definitely get that from the marketing side. And so I do have metrics on things like how many workshops we run. And those people are obviously, we capture those leads, they go through the marketing funnel, et cetera, et cetera.

But then there’s the idea of how many engineers out there have those same metrics? We always complain about you shouldn’t count the number of lines of code because that’s stupid. You shouldn’t count all these other things. But generally, most engineering teams are working off of quarterly OKRs or some sort of time period, what those goals are and the product that they’re going to ship. And so I’ve tried to adopt the same thing in every DevRel organization that I’ve been in, is what are the high-level goals?

And if you can get leadership to buy off on those, for example, we’re currently working on an online learning platform. We don’t have tight metrics about how many people should be registered and complete the course and be certified yadda, yadda, but we have a good sense that if we build this, it’s going to be very beneficial in a number of ways. And leadership agrees, and they’ve bought off on that, and they’ve signed their names to it. And so for us, what does success look like in terms of this is actually implementing that and shipping it.

Corey: It’s a really strange and really powerful thing, but you take a look at so many different companies who have done well and companies that haven’t done well, and the way that they engage not just with the ecosystem, but with the community specifically, in many cases seems to be the path that it follows. I mean, not to pick on them unnecessarily, but Chef had a wonderful community; they engaged absolutely flawlessly, from what I could tell, even when I didn’t agree with people or particularly like them in some cases, the people who worked at Chef almost demanded respect, and it was pretty clear, even as someone who didn’t use it myself, that they were a force to be reckoned with. And then they wind up effectively losing a lot of the people that made it special, the community moved on, they sold it to a company no one had ever heard of, and now it’s one of those, oof, they deserved a better end. Maybe that’s unfair, but that is the perception.

Jason: Yeah, I would say the same thing sort of happened with Puppet: the idea that they built a nice community, and back to my point of, like, you have a project, you work on shipping that, you don’t really track those numbers. That’s what I saw from both communities, Chef and Puppet: they had these strong communities, they were doing things, and the goal was the community. And I don’t know, I haven’t talked to Nathan, I haven’t talked to folks at Puppet, but I suspect that they weren’t simply about, like, what’s the total number of people that we would say are in our community? There was a value on, we want to do this thing and we have a sense of the quality of the community, and how much people just are engaged, and interested, and want to help each other.

Corey: The piece that also gets lost as well is companies are out there to turn a profit. And building a vibrant open-source community who loves your open-source offering but aren’t in a position to either champion or purchase the thing is often viewed as a complete waste of time by the business. So, they in turn, then pivot business models and do things that insult or alienate the community, and suddenly are perplexed by the massive groundswell of negative publicity they get, of people actively advocating that companies not use them. And their position is somewhat understandable in a form of, “What the hell is this? You weren’t spending money on us before. Now, you’re still not spending money on us, but you hate us. What gives?” Community is a weird thing to wrap your arms around.

Jason: Absolutely. I would say it’s hard to wrap your arms around it when you’re not valuing the relationship. It’s like any relationship where you have ulterior motives. If you can’t actually connect with people, it’s never going to go right.

Corey: No. And it also can’t be self-serving, or seem to be self-serving—spoiler, the best way to make sure you’re not perceived a certain way is to not actually be that way—we take a look at Last Week in AWS, my newsletter, it is explicitly aimed at people who want to keep up with what’s going on in the world of AWS, which is fair. It is not aimed at people who have a big AWS bill and don’t know what to do about it. And sure I reference periodically in that newsletter what I do, but it’s not a sales piece. It’s not every week hammering home, buy whatever it is I’m selling because that’s how you alienate and lose the audience.

I’ve always felt that by being top-of-mind for the problem and reminding people I exist every week with something that’s useful and ideally a bit funny, then, when they have that expensive problem, they’ll think of me. That was my theory four years ago, and I’m still here, so apparently, it wasn’t completely off base.

Jason: Yeah, well, that works, right, because nobody wants to subscribe to a newsletter to hear about the service. If they knew they needed your service, they would just buy your service. So, what’s the value of the newsletter? What’s the value that you’re offering to people? And that is, well, the fact that there’s so much freaking news about AWS every week that it does require a newsletter.

Similarly for me, what’s the value? Well, if people knew that they needed Gremlin, they would just come talk to me. But they don’t. They were concerned about the needs that they have, about how do I build a more reliable application, “My stuff’s always breaking. I’m having too many incidents. I’ve done everything that I can think of. What’s next.” So, it’s just offering that.

Corey: If your mean time to WTF for a security alert is more than a minute, it's time to look at Lacework. Lacework will help you get your security act together for everything from compliance service configurations to container app relationships, all without the need for PhDs in AWS to write the rules. If you're building a secure business on AWS with compliance requirements, you don't really have time to choose between antivirus or firewall companies to help you secure your stack. That's why Lacework is built from the ground up for the Cloud: low effort, high visibility and detection. To learn more, visit lacework.com.

Corey: And let’s be very clear here, you have a much harder challenge than I do. Because it turns out that you don’t need to be deep into the weeds of corporate finance, to understand the concept of wasting money on the AWS bill might not be the best thing in the world. Once you get more into the nuances, you start to realize, “Oh, being able to predict the AWS bill sounds super awesome, too.” But none of those are a particularly heavy lift, whereas, “Wow, your site is crappy and falls over a lot. Have you considered breaking it on purpose?” Sounds deranged the first time someone hears it.

Jason: Absolutely, yeah. That’s the number one thing that I hear all the time is—and people joke about it. I don’t need chaos engineering; I do regular deploys.

Corey: That sounds almost like someone was sitting in a blameless postmortem and got carried away trying to keep it blameless because otherwise, it was going to be their fault, and accidentally invented an entire field.

Jason: Yeah, yeah. I mean, it’s definitely blameless if everybody is causing things to break; then we all share the blame. It is a funny thing. It’s a tricky thing to sell to people, and I think it’s tricky because we have these misconceptions about what that actually means, the idea of breaking things on purpose. And we’re trying to move away from that because the breaking really isn’t the goal.

And oftentimes, you’re not actually even breaking things; you’re stressing them out or you’re simulating things, so nothing’s really broken. But once you start thinking of it as that idea of I’m going to test my assumptions, right? I think that things work this way, but I don’t know, I’m not super confident that it actually will do that. And we do that all the time when we’re developing applications or infrastructure. I set things up, I’m pretty sure that it’s going to work a certain way.

Documentation says that this app works this way. Does it actually do that? Well, I can either find out when it doesn’t do that at some random point, or I can actually try to force it to act in that way, or to encounter that bad environment that I’m a little suspect about. And so we do this all the time with other things. And oftentimes, we’ll do this just mentally as, “What would happen if—” and you kind of play it out in your mind.

And that’s actually a great way to start with chaos engineering, rather than actually doing it, just that mental game. “What do you think would happen if this goes wrong?” Play that out in your head. Cool. Once you’re comfortable with that, and you’re like, “I think this is what my next steps would be. I’m pretty sure there’s documentation here, or I’ve gone and checked and made sure that there are docs, or runbooks, or whatever,” why not give it a try?
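The "mental game" Jason describes can be written down as a hypothesis and checked against a simulated failure before anything real is touched. The sketch below is only an illustration: the names `ClientConfig`, `simulated_request_ms`, and `hypothesis_holds`, along with the timeout, retry, and budget numbers, are hypothetical and are not part of Gremlin or any other tool. It asks one question: if the database gets slow, does the user still hear back within the time budget?

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ClientConfig:
    """Hypothetical client settings for calling a database."""
    timeout_ms: int    # per-attempt timeout
    max_attempts: int  # total attempts, including the first one


def simulated_request_ms(db_latency_ms: int, cfg: ClientConfig) -> Tuple[int, bool]:
    """Return (time spent in ms, whether the request succeeded).

    This is a paper simulation: nothing sleeps and nothing is called.
    """
    if db_latency_ms <= cfg.timeout_ms:
        # The first attempt answers in time.
        return db_latency_ms, True
    # Every attempt times out, so the full timeout is spent each time.
    return cfg.timeout_ms * cfg.max_attempts, False


def hypothesis_holds(db_latency_ms: int, cfg: ClientConfig, budget_ms: int) -> bool:
    """Hypothesis: even with a slow database, we respond within budget_ms."""
    spent_ms, _succeeded = simulated_request_ms(db_latency_ms, cfg)
    return spent_ms <= budget_ms


cfg = ClientConfig(timeout_ms=500, max_attempts=3)
print(hypothesis_holds(50, cfg, budget_ms=1000))    # healthy database
print(hypothesis_holds(2000, cfg, budget_ms=1000))  # injected latency spike
```

With these made-up numbers the second check fails, because three attempts at 500 ms each add up to 1,500 ms. That is the kind of surprise the exercise is meant to surface before a real experiment confirms it.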

Corey: It’s one of those areas where, what have you got to lose? I mean, as you just said, your site breaks all the time anyway, before you even touch its stability. What happens if the database’s latency just suddenly goes through the roof? What happens if suddenly all of us-east-1 is hard down? In many cases the answer is, we don’t really care about our website anymore because the world is not going to care about the internet not working that day, in the context of what we do. In other shops, yeah, that matters, and we kind of still need the power grid to work.

So, there’s a definite question of what failure modes are worth planning for and what aren’t, but even going through that exercise is fantastic. I used to do things like that from a sysadmin perspective, asking companies when I was asked to build out a mail server. “Great, how much downtime is acceptable?” And they said, “Absolutely none.” I said, “Great. I’ll need a budget of $20 billion to start, and when that runs out, I’ll come back for more.” And they said, “Wait, what are you talking about?”

And we said, “Oh, now we’re negotiating with the business.” And it turned out what they really meant was, “It would be nice if the mail server worked during business hours most of the time.” And, “Oh, okay. I can do that for slightly less.” And it really just came down to what do you value? What is important to your business?

Jason: Yeah. How much reliability do you need? Although one of the key things that I always point out is, a lot of times people are like, “Oh, you don’t need 99.9% reliability; you could probably get by with less than 90 because people aren’t using your application at night, they’re not using it on the weekends, yadda, yadda.” The other problem with that, though, is you rarely control when those outages happen.

So sure, if it happens in the middle of the night, and nobody’s using it, great. Just keep sleeping. As you start to work on this, though, there is the idea of it could happen at any time, so let’s actually test things to ensure that if it happens at the least opportune time, things actually work the way that we expect.
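To put numbers on the reliability targets mentioned above, here is a small arithmetic sketch. The function name is made up for illustration, and the business-hours figure assumes a 40-hour work week and an outage that is equally likely at any hour, which real outages may not be.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted in a period at a given availability target."""
    return period_hours * (1 - availability_percent / 100)


for target in (99.9, 99.0, 90.0):
    print(f"{target}% -> {allowed_downtime_hours(target):.2f} hours/year")

# Assumption: 5 days x 8 hours of business time out of a 168-hour week,
# and an outage that starts at a uniformly random hour.
business_hours_fraction = (5 * 8) / (7 * 24)
print(f"Chance a random outage starts in business hours: {business_hours_fraction:.0%}")
```

A 99.9% target allows under nine hours of downtime a year, while 90% allows about 876 hours. Under the uniform assumption, roughly one in four outages would still start during business hours, which is Jason's point about not getting to choose the timing.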

Corey: And that’s an incredibly valuable thing. See, you’re already convincing me on this. And clearly, you’re very effective at that advocacy role. How do you hire and how do you determine who’s a great fit? Because I’m imagining that bringing someone in, in an advocate role, and their position being, “Oh, at no point, can you ever measure me on any context, and just assume that what I’m doing is amazing and great.”

That becomes a hard thing to do. When I was talking to companies about possibly doing evangelist style roles, years ago, I asked, “How will you know if I’m being successful in this job?” And one of the answers was, “Well, you speak at a certain number of tier-one conferences a year.” “Cool, what are those?” And, they listed off a bunch and cool, there’s only one in that list that I’m not scheduled to speak at this year, so do I get a raise?

People try and aim at the wrong thing in their quest to articulate what they really value, but what they really value is hard to measure. So, how do you evaluate people on a basis of are they doing what they should be doing, or are there ways that they can be coached to improve, or are they just not effective in the role at all?

Jason: Yeah. Well, I think you mentioned two great things, are they doing what they’re supposed to be doing? And it comes back to every quarter, we’re laying out the goals of what do we want to accomplish this quarter? And we make them achievable, so hopefully, by the end of the quarter, you’ve achieved this thing that not only the team, but senior leadership has decided is a good thing for the company. And to that point, if it’s not, if we do that thing and nothing happens, and it’s—or it’s bad for the company, at least we can say, “Hey, senior leadership, you are the people that thought this was a good idea, too.” But that said, we try not to do the blame. We try to iterate on things and experiment a lot. Especially at Gremlin, we’re all about experimentation, so we’re constantly trying things. But ultimately, it’s are you getting this thing done that we’ve agreed that we’re going to get done?

But you also mentioned that second thing about growth. I think that’s something that I always look for with anybody, whether that’s DevRel or engineering. I want people that are interested enough in the job that they want to do it well. There’s something about it that they really love or they’re really into, and they want to master that. And so part of my goal as a leader is trying to help people along that path of what do you find interesting? For example, last year, we were working on those tiers, as we’re trying to figure out what does it actually look like. Because we’re a really small team at Gremlin, and so as I’m starting to consider how do I promote people?

What are the various, like, levels or tiers of going from an advocate, to a senior advocate, to whatever is beyond that? So, I asked the team, really, “What do you think that would look like? What do you think the next level for your career is? What is the thing that you want to master?” Because ultimately, people have more investment when they’re choosing their destination and they’re choosing their direction.

And so if I can help people do that, just define what’s the next thing that you want to tackle? What do you think mastery or the next level of your career looks like? How can we help you get there? So, that’s what I aim for.

Corey: For better or worse, it seems to be working. I remember back when Gremlin was a rando startup idea a couple of people had, and now I’m starting to see you folks basically everywhere.

Jason: Yeah. Again, we’ve got a small team, but it’s a great team. So, Ana Medina has been on the team, actually, before I joined, but she’s been doing a fantastic job and she has been working on a lot of our educational outreach. And then Pat Higgins on the team actually started on the engineering side. So, he was one of our front-end engineers; he’s been working on a lot of really great tools.

He helped me restart the Break Things On Purpose podcast. So, we’re into season two of that now—and by the way, we should have you on that show as well. But yeah, we’re doing a lot of fun stuff, and folks are happy. So, try to keep them challenged, and we’ll see what’s next.

Corey: Yeah, I’m really looking forward to seeing how the story continues to evolve. It’s a fascinating field that went from, “That is ridiculous,” to, “Oh, that’s great but it would not apply to what I do,” to, in my case, it actually would not help me in any way with what I do because it turns out, well, what if an AWS region goes down and you can’t produce your newsletter the usual way? Oh, I’ll write it by hand that way because suddenly I have a much bigger story to talk about that week.

Jason: I am curious, though, speaking of having you on the podcast. Oftentimes, we talk about reliability, and having never had to deal with AWS bills because they always go to somebody else in finance, I am curious how reliability ties into the cost of what you’re paying for AWS? Because I can imagine things like—a common thing that we hear about is, “I’m moving a lot of stuff to Lambdas.” Like, great. Serverless. It’s cool, it’s hot. How is that charged?

Corey: Right.

Jason: Obviously, by time.

Corey: Oh, yeah.

Jason: So, if it’s charged by how long something takes, what if your latency goes up? What if your resources are constrained? How does this actually affect things? And how does that impact how you think about reliability not just from a is it up or down? How’s my customer looking at it? But maybe from what your AWS bill looks like?
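Jason's question about latency and a time-billed service can be made concrete with a rough cost model. This is a sketch, not a billing calculator: the function name and workload numbers are hypothetical, the default prices are illustrative values that should be checked against current AWS Lambda pricing for your region and architecture, and the free tier is ignored.

```python
def lambda_monthly_cost(invocations: int,
                        avg_duration_ms: float,
                        memory_mb: int,
                        price_per_gb_second: float = 0.0000166667,  # illustrative
                        price_per_million_requests: float = 0.20    # illustrative
                        ) -> float:
    """Estimate monthly cost for a function billed on duration times memory."""
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    compute_cost = gb_seconds * price_per_gb_second
    request_cost = (invocations / 1_000_000) * price_per_million_requests
    return compute_cost + request_cost


# Hypothetical workload: 10 million invocations a month at 512 MB.
healthy = lambda_monthly_cost(10_000_000, avg_duration_ms=200, memory_mb=512)
degraded = lambda_monthly_cost(10_000_000, avg_duration_ms=800, memory_mb=512)

print(f"healthy dependency:  ${healthy:.2f}")
print(f"slow dependency:     ${degraded:.2f}")
```

In this model the request charge stays flat, but the compute charge scales linearly with duration, so a downstream slowdown that quadruples average duration roughly quadruples the compute portion of the bill even though nothing is "down."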

Corey: I love where you’re going with that. And it’s the conversation everyone loves to have that’s about three levels beyond where most companies actually are. Easy example that sounds like something in the distant past, but it’s very real today: I want to store data in multiple availability zones for durability purposes and making sure that we are reliably up. Well, every time a gigabyte crosses an availability zone boundary, that costs two cents. And then you have to pay to store it twice.

So, there’s a question of how much is having multiple sets of that data worth? And the cloud-native answer to that is, “Oh, put it in S3. There’s no cross-charges there. Their durability is ridiculous, and you can access it a whole bunch of different ways, provided your application supports it.” But that’s not a fit for everything.

And you find that saving money and being reliable are, at some point, completely at odds with each other. And this is, incidentally, why we don’t do this as a tool; we do it as a consulting engagement. There are times where, for business purposes, you will want to spend more on reliability. Because saving money that accidentally takes your company down for a month is not money you should be saving.
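Corey's multi-AZ example can be sketched as simple arithmetic. This is a rough illustration only: the two-cents-per-gigabyte transfer figure comes from the conversation, while the function name, the data volume, and the storage price are hypothetical placeholders.

```python
# Hypothetical sketch of the replication trade-off described above.
# Only the $0.02/GB cross-AZ figure comes from the conversation;
# the storage price and data volume are assumed for illustration.

def replication_cost(data_gb, transfer_per_gb, storage_per_gb_month, copies=2):
    """Return (one-time transfer cost, monthly storage cost) for keeping `copies` copies."""
    transfer = data_gb * transfer_per_gb * (copies - 1)  # each extra copy crosses a boundary once
    storage = data_gb * storage_per_gb_month * copies    # every copy is stored and billed
    return transfer, storage

CROSS_AZ_PER_GB = 0.02           # figure quoted in the conversation
ASSUMED_STORAGE_PER_GB = 0.10    # assumed monthly storage price, illustration only

transfer, storage = replication_cost(1000, CROSS_AZ_PER_GB, ASSUMED_STORAGE_PER_GB)
_, single_copy_storage = replication_cost(1000, CROSS_AZ_PER_GB, ASSUMED_STORAGE_PER_GB, copies=1)

print(f"transfer: ${transfer:.2f}")
print(f"storage per month: ${storage:.2f} (vs ${single_copy_storage:.2f} for one copy)")
```

The point of the sketch is the shape of the bill rather than the totals: the second copy adds a transfer charge and doubles storage, and whether that is worth it is the business question Corey describes.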

Jason: Yeah.

Corey: Now, the real fun thing I want to see from Gremlin one of these days from an implementation perspective is, just for fun, we’re going to run a chaos injection experiment where we decide to cancel the credit card tied to the account and then also remove the increasingly frantic alerts from your email when that happens, and see how long it takes you to realize the giant single point of failure that no one really thinks about existing, but absolutely does.

Jason: So, I am curious, for folks that are listening who are engaged with the chaos engineering community, or at least follow Corey’s newsletter and have seen updates, AWS has announced their own chaos engineering tool, the Fault Injection Simulator, which, to Corey’s point about poorly named things, actually isn’t a simulator. It does inject real faults, so maybe the S should be “Service.” One of the faults it can inject, though, is API throttling, which essentially could simulate the idea of, “You haven’t paid your bill; we’re turning things off.” So, Gremlin is working with the AWS folks, and we’re trying to figure out great ways that we can work together so that people can use both Gremlin and AWS FIS. So, I’ll let you know if that becomes a thing, and maybe we can get some API access to billing as well.

Corey: I’d love to see it. Please keep me looped in. Thanks so much for taking the time to basically go all over the world of DevRel and probably make some lifelong enemies in the process. If people want to hear more about what you have to say, where can they find you?

Jason: Yeah, I’m on Twitter. My Twitter handle is @gitbisect—and by the way, if anybody tweets about Git bisect, it is a fantastic tool, fantastic utility within Git—oftentimes, I will respond. But that’s where to find me on Twitter. Otherwise, you can find me on [unintelligible 00:31:30] podcast, Break Things On Purpose. It’s available in all the platforms.

Corey: Excellent. We will, of course, put links to that in the show notes. Thanks so much for taking the time to speak with me. I really do appreciate it.

Jason: Yeah, thanks, again. It’s been long overdue, and I’m glad we finally made it happen.

Corey: Awesome. Jason Yee, Director of Advocacy at Gremlin, I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you hated this podcast, please leave a five-star review on your podcast platform of choice, along with a comment saying that the best thing to test breaking in production is your DevRel team.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

This has been a HumblePod production. Stay humble.
