- Last Week In AWS Twitter: https://twitter.com/lastweekinaws
Transcript
Corey: This episode is sponsored in part by Catchpoint. Look, 80 percent of performance and availability issues don’t occur within your application code in your data center itself. It occurs well outside those boundaries, so it’s difficult to understand what’s actually happening. What Catchpoint does is makes it easier for enterprises to detect, identify, and of course, validate how reachable their application is, and of course, how happy their users are. It helps you get visibility into reachability, availability, performance, reliability, and of course, absorbency, because we’ll throw that one in, too. And it’s used by a bunch of interesting companies you may have heard of, like, you know, Google, Verizon, Oracle (but don’t hold that against them), and many more. To learn more, visit www.catchpoint.com, and tell them Corey sent you; wait for the wince.
Pete: Hello, and welcome to AWS Morning Brief. I’m Pete Cheslock. I'm still here; Corey is still not. I'm sorry. But don't worry, I'm here again with Jesse DeRose. Welcome back yet again, Jesse.
Jesse: Thank you for having me back. I have to say for all our listeners, I'm sorry I have not watched the entire Step Up trilogy and all the other breakdancing movies we talked about last time. It is still on my todo list. But fear not, it will happen. We will talk about this again.
Pete: Well, that actually brings a really good point, which is we need to make a correction from our last podcast. We talked about how Breakin' 2: Electric Boogaloo was the sequel for Breakin’, and I had incorrectly thought that Breakin’—the first one—also had ‘Electric Boogaloo’ in the name. It turns out I lack the ability to read an article on Wikipedia. There was a very carefully placed period in that sentence which, as our listeners probably know, delineates one sentence from another. So, no: Breakin' one, it was just called Breakin’. It was not Breakin’: Electric Boogaloo. I’m—just have no ability to read anything on Wikipedia, apparently.
Jesse: I still feel like this is a missed opportunity for the first one in the franchise to be Breakin’: Electric Boogalone.
Pete: [laughs]. Almost as bad as Electric Boogalee, but—
Jesse: It's up there.
Pete: —that's for another podcast. Anyway, we are talking today, not about breakdancing movies from the 1980s, we are actually talking about a little bit of a different change in our normal conversation, not necessarily around Amazon-specific technologies, but around fostering change within an organization, and some of the worst ways that we have seen change kind of implemented into an organization. Fostering change, it's important in any organization in general—and maybe we're a little biased; we spend so much of our time dealing with cost savings and cost optimization, but it really is so much more important when you deal with over-reaching cost optimization and, kind of, management strategy within a company.
Jesse: Yeah, I feel like there's this massive disconnect between a lot of companies, where leadership has this really, really heavy incentive—or really, really heavy goal to better understand and manage cloud costs, and the individual contributors or the underlying engineering teams just don't have the same focus. And that's not to say that they don't care about costs, so much as maybe they have other roadmap items that they're working on or other tasks that have been prioritized before cost optimization projects. So, there really seems to be this disconnect to think about cost optimization more thoroughly throughout all levels of an organization. And it ultimately makes us think about how do you go about making that change because it seems like the best way to instill the importance of cloud cost optimization and management across a company is by instilling it in the company's culture. So, today, I really want to focus on what are some of the ways that we can get the entire company to care about cost optimization and management, the same way that leadership might care about cost optimization and management. Or alternatively, if this is an individual contributor that cares, how they can get the rest of the company to care about these things and vice versa.
Pete: Yeah, that's a really good point. And we deal with a whole swath of different companies and different people at those companies, where it's kind of amazing to see how some people just inherently really care about what's being spent. And it could be for various reasons. Maybe these are people that may not have any connection to the bill or paying the bill, but more just—they just—I mean, myself, I am this person. I just hate waste. I hate waste in all parts of my life, but I really hate waste in my Amazon bill because finding out that I didn't have to spend $10,000 last month on all of those API list requests on S3 due to that bug, it just—it cuts up my soul.
Jesse: And it's really rare to find people in any organization, whether it's a client that we're working with or an organization that you work in, that are super, super invested in that kind of cost optimization work. But when you find them—I was working with one recently at one of our clients who described themselves as a super nerd about cost optimization work. And that's perfect. That's what we want. We want somebody who nerds out over this stuff, and really passionately cares about, what's it going to cost for us to make changes?
Pete: Yeah. I mean, we are two people who have focused our careers on caring about how much people spend on their bill. We're cost nerds. It's fine. It's okay to say it.
Jesse: I accept this term. I accept.
Pete: [laughs]. So, before we get to some of the good ways that we've seen to get people to care about this stuff, we want to talk about some of the worst practices we've seen. And this is broader than just cost management. This really is, what are some of the worst ways that we have been a part of seeing a company just try to effect change, whether you're a startup that's trying to pivot to the next thing, make it to the next funding round; or maybe you're an enterprise and you're just trying to go digitally native, cloud-native, multi-cloud, or something like that. The technology is not your challenge. The technology is not the reason why you're not going to accomplish your goal. It's always going to be the people and getting them to care about it. So, what are some ways, Jesse, that you've seen that have been particularly grinding to you?
Jesse: Yeah, if we're going to talk about incentivizing practices, I think that the big one that we need to talk about is gamifying the system where the leadership or management sets some kind of goal to say, “We want all of our IT team’s support tickets to be closed within 48 hours.” So, that's a great goal to set; that's a lovely SLA goal to work towards, but if you just set that goal blanketly, for your team, they're going to gamify the system hard. They are going to end up closing tickets as soon as they send a response, rather than waiting for the issue to be resolved or not. I've experienced this multiple times, and it drives me absolutely batty and doesn't solve the underlying problem, which is faster and higher quality support for the customer.
Pete: I'm in this picture, and I don't like it.
Pete: It's true, though. One of my first, we'll call it ‘real’, jobs ever in the tech space was actually support. I was support for a SaaS product, and one of the metrics we tracked was time to close; time to resolution. And there were no incentives on this, I just was really competitive. And I would send a response and I’d close the ticket. And for people who have worked in support before, you'll know that people react to a ticket being closed differently.
If you've ever opened a ticket with Amazon, for example, and when they respond to you, they close the ticket, and you're like, “Whoa, whoa, hold on a second.” I think they've gotten better in more recent time where they'll leave it open for a predetermined period of time, and then they'll close it automatically to obviously hit their stats. But, like, the time to close was always very questionable to me, and applying some sort of financial incentive around that, I mean, you're just going to create just the worst from people.
Jesse: And I would much rather that ticket close after a certain amount of inactivity. I would rather get the passive-aggressive automated email saying, “Hey, we haven't seen a response from you. Do you still have this issue? Do you still want us to work on this? Can you update this ticket?” versus the, “I sent you one response and now I'm going to close this ticket. Thank you for playing. Goodbye.”
Pete: Yeah, exactly. I think there are real reasons to close a ticket out. I think for most folks, you've got to think about what is the metric that matters to you? And in many cases, at a previous company that I had worked at, instead of trying to aim for tickets being closed in a certain period of time, it was time to first response. It was how quickly we were able to respond to that client, not necessarily get to resolution, because software support is way too complex to really nail down how quickly you can resolve an issue.
You could run into something that might take actual software engineering work to happen in the background. You might have a code change has to go out, and maybe that code change has to go in through your scrum cycle. And there's two more weeks plus some QA time and whatever. I mean, it's so hard to balance that out. So, like most things—and I think you'll find when we talk about them today is—you kind of have to understand what you're trying to incentivize for, not just applying random incentives or gamifying the system.
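To make the two ideas in this exchange concrete, here is a minimal Python sketch of measuring time to first response instead of time to close, plus closing a ticket only after a stretch of inactivity. The `Ticket` fields, the function names, and the seven-day window are all hypothetical; the episode does not describe any particular ticketing system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Ticket:
    # Hypothetical ticket record; field names are illustrative only.
    opened_at: datetime
    first_response_at: Optional[datetime] = None
    last_activity_at: Optional[datetime] = None


def time_to_first_response(ticket: Ticket) -> Optional[timedelta]:
    """How long the customer waited for the first reply (None if no reply yet)."""
    if ticket.first_response_at is None:
        return None
    return ticket.first_response_at - ticket.opened_at


def should_auto_close(
    ticket: Ticket,
    now: datetime,
    inactivity_window: timedelta = timedelta(days=7),  # assumed window
) -> bool:
    """Close only after a period of silence, not right after the first reply."""
    last_activity = ticket.last_activity_at or ticket.opened_at
    return now - last_activity >= inactivity_window
```

Tracking the first function rewards a fast reply, while the second keeps the ticket open until the customer has had a fair chance to say the problem is still there.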
Jesse: Yeah, and I think it's also important to note that it's not just about what you're trying to incentivize for, but how best to incentivize—which we'll talk about some of the better ways to incentivize in a minute—but it's also important to think about, there are positive and negative ways to reinforce your goals. So, positive reinforcement, generally speaking, is going to be much more proactive, rewarding somebody who does the right thing, and negative reinforcement is more going to shame somebody who does the wrong thing, or punish somebody who does the wrong thing. Nobody wants to be punished. Nobody wants to be called out on the carpet for something that they did, whether it was intentional or accidental, and so it's a lot harder for organizations to get their employees to do what they want. It's harder for organizations to get their employees to care about cost optimization, and care about these other metrics if all they're doing is being negatively punished for not hitting the metric or not achieving the goal.
Pete: I will share one story that I don't think it's a negative reinforcement, but I guess I'll let our listeners and you, Jesse, be the judge of me on this one. But at a previous company, I created a tool that allowed you to connect into different servers within our environment because at some point you're probably going to have to log into a server—even if it's in the Cloud—and look at a log file or debug something. Like, you're just always going to be there. And we made a change to how you would connect, and functionally this was the difference between, like, dashes and underscores in how you connect it to a thing. So, I was deprecating a certain way of connecting, and so I use a helpful motivator, which is Clippy. Clippy is the mascot, I guess, of Office. The Microsoft Office mascot.
Pete: And so what would happen is, if you typed in this command incorrectly using, like, a deprecated command, Clippy would pop up and say, “Hey, did you actually mean to type this instead?” And then it would pause for a second, correct your mistake, and then send off the command.
Jesse: Oh, my God.
Pete: And I didn't wait long enough to be super annoying. And it was just more like, “Hey, just a reminder, you should stop using the tool this way.” And it was the best way I can think to let a whole wide swath of people know. But then I took it one step further, and I had really, truly honest implications for this one, which I had it report a StatsD metric to a Grafana dashboard every time you did that because your username was associated with that. So, I ended up with—
Jesse: [laughs]. Oh no.
Pete: —a dashboard that showed who was doing it wrong. Now, in my defense, I actually used that dashboard to go to those people and just say, “Oh, hey, like, is there a way you're using this tool that you're actually running into this? How can I help you? How can I make my software better?” But when someone found this dashboard, they actually brought it up in one of our on-call meetings and thought it was a lot more negative than it really was intended to. So, if you do create a dashboard, add some context to it. Make sure that people know the purpose of that. I really did not think it was that bad.
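A rough Python sketch of the kind of wrapper Pete describes might look like the following. It is not his actual tool: the metric name, the direction of the fix (underscores rewritten to dashes), and the StatsD address are assumptions. The payload uses the plain StatsD counter format `<name>:<value>|c`, sent over UDP.

```python
import socket
from typing import Tuple


def normalize_command(command: str) -> Tuple[str, bool]:
    """Return the corrected command and whether the deprecated form was used.

    Assumption: the deprecated syntax used underscores and the new one dashes.
    """
    corrected = command.replace("_", "-")
    return corrected, corrected != command


def statsd_counter(metric: str) -> bytes:
    """Build a plain StatsD counter payload that increments `metric` by one."""
    return f"{metric}:1|c".encode("ascii")


def report_deprecated_use(username: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP packet to a StatsD daemon (address is an assumption)."""
    payload = statsd_counter(f"connect_tool.deprecated_syntax.{username}")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


def run(command: str, username: str) -> str:
    """Correct a deprecated command, remind the user, and record the event."""
    corrected, was_deprecated = normalize_command(command)
    if was_deprecated:
        print(f"It looks like you meant: {corrected}")
        report_deprecated_use(username)
    return corrected
```

Putting the username in the metric name is what makes a per-person dashboard possible, which is also why such a dashboard needs a stated purpose next to it.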
Jesse: The intention was definitely there. The intention was so so good. Sadly, it was just taken out of context.
Pete: [laughs]. Well, then, of course, because of our hilarious—or so we thought were hilarious—jokes internally, we then started using Clippy for a bunch of different things. And anytime we deprecated something, Clippy came back again, and—
Jesse: Oh no.
Pete: We made fun of it, we made it a fun thing, but what you definitely don't want to do—and this is where that negative reinforcement is, is be publicly shaming engineers, employees, on a dashboard for things. That's one of the important parts of this is I never shared this dashboard publicly, and was like, “These five people are doing it wrong.” But I have seen scenarios where people have used those types of dashboards to rank their employees. You see it a lot in sales-type organizations, they are motivated far more with stick than carrot, I think.
Corey: This episode is sponsored in part by our good friends over at ChaosSearch, which is a fully managed log analytics platform that leverages your S3 buckets as a data store with no further data movement required. If you're looking to either process multiple terabytes in a petabyte-scale of data a day or a few hundred gigabytes, this is still economical and worth looking into. You don't have to manage Elasticsearch yourself. If your ELK stack is falling over, take a look at using ChaosSearch for log analytics. Now, if you do a direct cost comparison, you're going to say, “Yeah, 70 to 80 percent on the infrastructure costs,” which does not include the actual expense of paying infrastructure people to mess around with running Elasticsearch themselves. You can take it from me or you can take it from many of their happy customers, but visit chaossearch.io today to learn more.
Jesse: Yeah, I think that's the important thing to call out here to distinguish, that ultimately, your intention was good, and you ultimately were trying to use this as a way to discover who might have been making those mistakes and help them, versus somebody who might be publicly sharing and then shaming these people. Because if somebody in the company, whether it's leadership, whether it's a team, whoever tries to shame others based on these kinds of leaderboard metrics, people are just going to lean into that harder and make a joke out of it entirely, about how many times can I make this mistake to get the number one spot on this leaderboard? Even though it was a negative leaderboard, per se, you're still going to try to lean into that harder if someone's going to continue to make a joke out of it, or try to make it something serious, when it clearly was meant to help people rather than shame anybody.
Pete: I mean, this is exactly why we don't pay people for lines of code, right? It's these arbitrary metrics that just don't have a lot of meaning in the real world. So, all right. We've gone through a couple, and we could fill this whole episode easily with all of the terrible ways and worst practices we've seen, or even worst practices that I've created for people that have worked for me.
But let's talk about the good things. What are some of the good ways that we have found to get people to actually care? And in this scenario, I'm going to specifically kick us off talking about, again, the cost optimization side of things. How do you get people to care about that? Because if you think about it, in a lot of ways, if someone were to come to me and say, “I need you to cut the spend on a particular service,” and I know that that could impact the availability, well, guess what?
If I'm on call, I'm not going to spend—and, you know, save the company money that's going to cause me more pain, right? That's a really bad way of coming at it. And so, maybe I'm going to share my concerns about that. Hopefully, I work with a team that actually listens to it, but there has to be a balance. So, from the manager side—as being a previous manager and managing a team—can you strike a balance between the carrot and the stick?
Now, one of the things that I had done with a good amount of success was to add, kind of, more of the human aspect to cost savings. And it was more face-to-face time with people (again, back when we could be face-to-face, which feels like a lifetime ago, pre-COVID). But it was really trying to connect with the engineers at a personal level for what they were building, how they were using Amazon, to understand what they were trying to accomplish. So, maybe I would go in and say, “Wow, I'm looking at a series of these C5 extra large instances, and their CPU is pretty much idle.” I can go to that engineer (based on some tags that we have, so there's an owner, maybe) and talk to them and say, “Hey, based on this workload, I actually think we can move over to T class instances. What do you know about this service that I don’t?” Now granted, maybe every once in a while, I might be like, “Yeah, be a real shame if anything happened to those C5 extra larges there.”
Pete: But, you know, it was trying to be a little bit more personal and do that. And because of my love of saving money, I developed a nickname at the company called Captain COGS.
Jesse: Oh my God.
Pete: COGS is short for ‘cost of goods sold’ because that was the metric that we cared about internally at the business; it was cost of goods sold. We were a non-profitable startup, so that's kind of a financial metric people care about. But what was interesting is that by sharing that that's the thing we cared about and different ways that engineers could help improve that number, people actually did start to care about it.
Jesse: Yeah, I think that's a really important point because what you're fundamentally getting at there is building this culture of trust and empowerment. You are trusting that the person who spun up the C5 class instances knew what they were doing when they deployed them, and you're asking them to share their context, asking, “Hey, do you have more information, more context about this than I do?” And in a lot of cases, they'll come back to you and say, “Oh, yeah, this was for a business requirement that we had to do X, Y, and Z.” Or maybe, “This was the only thing available or only thing powerful enough at the time,” or maybe, “The workload was higher at the time when they deployed the C5 class instances.” And so now that the workload has slowed down, they can move to something cheaper and better for the workload.
But ultimately, you're trusting the other person and you're empowering them to make these decisions. And I think that's honestly what this is all about. It's honestly about sharing what needs to happen with everybody. It is bringing the work individually to the people who are actually doing the work. It is sharing those goals, sharing all that information and those details with the people who are doing the work, and it creates this culture of psychological safety to make mistakes and own up to them.
It's the space where you can ask those clear questions of, “Hey, did you mean to spin up that I3 instance or was that an accident? Do we actually need all this storage in io1 EBS volumes or is there something else that we can use instead?” And you're ultimately empowering them; you're empowering them to make better-informed decisions in the company's best interest. You're empowering them to participate in and shape the practices and the processes that allow them to be mindful of cost during every part of the engineering process: feature development, forecasting, architectural decisions, all of it.
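The "find the idle instances, then go talk to the owner" approach from this part of the conversation could be sketched in Python roughly as below. The record layout, the `owner` tag key, and the 5 percent CPU threshold are assumptions for illustration. The sketch deliberately works on plain dictionaries and does not call any AWS API.

```python
from collections import defaultdict
from typing import Dict, List


def idle_candidates_by_owner(
    instances: List[dict],
    cpu_threshold_percent: float = 5.0,  # assumed cutoff for "pretty much idle"
) -> Dict[str, List[str]]:
    """Group low-CPU instances by their owner tag so each owner can be asked for context."""
    by_owner: Dict[str, List[str]] = defaultdict(list)
    for instance in instances:
        if instance["avg_cpu_percent"] < cpu_threshold_percent:
            # Untagged instances are grouped under "unknown" rather than dropped.
            owner = instance.get("tags", {}).get("owner", "unknown")
            by_owner[owner].append(instance["instance_id"])
    return dict(by_owner)


fleet = [
    {"instance_id": "i-aaa", "avg_cpu_percent": 1.2, "tags": {"owner": "search-team"}},
    {"instance_id": "i-bbb", "avg_cpu_percent": 64.0, "tags": {"owner": "search-team"}},
    {"instance_id": "i-ccc", "avg_cpu_percent": 0.4, "tags": {}},
]
print(idle_candidates_by_owner(fleet))
```

The output is a list of people to have a conversation with, not a list of instances to resize automatically; the owner may know something about the workload that the CPU graph does not show.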
Pete: The important part of that, too, I think, is that context about what these costs are. You know, these are all movable levers. You can cut the cost on your Amazon usage to zero. You can just turn everything off, right?
Now, will your customers be happy? Probably not. But you can move these levers, these are all changeable levers. They're all going to be within the bounds of the business, within the context of what needs to be done. One of the really big success points that I had was working closer with product teams and the engineering teams to break down the cost of each individual feature.
So, if I had 10 features within a product, and I could break them out to say, you know, feature one represents 30 percent of our total bill. And I can work with product to then give them that insight because honestly, the most amazing thing would happen. The product teams would say, “Wait a second. That's our least used feature.” It almost always happened. It's like, the one that costs the most was the least used.
Pete: And while I know that none of these changes are going to happen right away, by just dropping that little nugget to a product person, it will start to fester in their mind. And eventually, as that company matured, the cost of specific features started to show up in product planning sessions. When they would decide to refactor or change different features and things, how much of a total spend that represented would pop up. And it was that kind of democratization of that context across the business that enabled it.
Jesse: It's data-driven decision making. It's giving the necessary data to all of the players involved so that they can make informed decisions about business goals and product releases and feature releases and optimization efforts, full well knowing what making those changes might ultimately cost to the company.
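As a small illustration of the per-feature breakdown discussed here, the following Python sketch computes each feature's share of the bill and ranks features by cost per use. The feature names and all figures are invented, and how costs get attributed to a feature in the first place (tags, accounts, or something else) is outside the sketch.

```python
from typing import Dict, List


def cost_share_percent(monthly_cost_by_feature: Dict[str, float]) -> Dict[str, float]:
    """Each feature's cost as a percentage of the total bill."""
    total = sum(monthly_cost_by_feature.values())
    if total == 0:
        return {name: 0.0 for name in monthly_cost_by_feature}
    return {
        name: round(100 * cost / total, 1)
        for name, cost in monthly_cost_by_feature.items()
    }


def rank_by_cost_per_use(
    monthly_cost_by_feature: Dict[str, float],
    monthly_uses_by_feature: Dict[str, int],
) -> List[str]:
    """Feature names ordered from most to least expensive per use."""

    def cost_per_use(name: str) -> float:
        uses = monthly_uses_by_feature.get(name, 0)
        # A feature nobody uses is treated as infinitely expensive per use.
        return float("inf") if uses == 0 else monthly_cost_by_feature[name] / uses

    return sorted(monthly_cost_by_feature, key=cost_per_use, reverse=True)


# Hypothetical numbers, for illustration only.
costs = {"search": 3000.0, "export": 6000.0, "alerts": 1000.0}
uses = {"search": 50000, "export": 200, "alerts": 10000}
print(cost_share_percent(costs))
print(rank_by_cost_per_use(costs, uses))
```

In this made-up data the feature with the largest share of the bill is also the least used, which is the pattern Pete says product teams kept discovering.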
Pete: Yeah, I think the other item that I always think about here is just getting people to care about how much of their stack that they're about to deploy is going to cost is hard because of that missing context. I know there's plugins for tools like Terraform that'll tell you, “You're about to provision something that's $300 right now—” or, “$5,000 right now.” Well, is that a lot of money? Like, $5,000?
Like, yeah, that's a ton of money. But what if that represented a fraction of a fraction of a percent of your total bill, right? What if that was such a small rounding error? Or what if that represented 50 percent of your bill? It's the context that matters. And that's kind of what's missing for a lot of those.
Even when trying to price out a brand new service, trying to cost-model something out. On paper, it could look wildly expensive, but in relation to maybe your engineering efforts, well sure, this is going to cost us $10,000 a year, but we're going to be able to get back, like, a whole engineering resource who doesn't have to deal with the broken thing anymore, right? Those trade-offs and those decisions, they have a lot more impact when you grab that data, and when you really ask those questions.
Jesse: Absolutely. And I think that's so critical to be able to look at all of the pros and cons of any business decision, both in terms of the actual costs of building something that AWS or your cloud provider might charge you, and then looking at the hidden costs in terms of engineering effort; in terms of other resources, whether it's a third party resource like another monitoring or observability tool, or other infrastructure resources. It's important to look at all those resources together in order to make your decision.
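The point about context can be shown with a little arithmetic. This Python sketch puts a new charge in proportion to the total bill and weighs a service's yearly cost against the engineering time it frees up. The function names, the hours saved, and the hourly rate are hypothetical; only the $5,000 and $10,000 figures come from the conversation.

```python
def percent_of_bill(cost: float, total_bill: float) -> float:
    """How large a cost is relative to the total bill, as a percentage."""
    return 100 * cost / total_bill


def net_annual_savings(
    service_cost_per_year: float,
    engineer_hours_saved_per_year: float,
    loaded_hourly_rate: float,
) -> float:
    """Value of the engineering time recovered minus what the service costs."""
    return engineer_hours_saved_per_year * loaded_hourly_rate - service_cost_per_year


# The same $5,000 looks very different on two hypothetical bills.
print(percent_of_bill(5_000, 2_000_000))
print(percent_of_bill(5_000, 10_000))

# A $10,000-a-year service that gives back an assumed 500 engineer hours at $75 an hour.
print(net_annual_savings(10_000, 500, 75))
```

Neither number is a decision on its own, but showing both next to the raw dollar amount gives an engineer the context that a bare price estimate leaves out.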
Pete: Well, I think we could fill plenty of additional Whiteboard Confessional podcasts with more of these, worst of the worst and best of the best ways that you've seen on fostering change in your organization. Shoot us a message on Twitter @lastweekinaws and we'd love to hear. What have you seen work really well? What have you seen that has not worked as well? All right, Jesse, again, thank you for joining me, otherwise, it would just be myself, talking to myself, just me, myself and I.
Jesse: If it was just you talking to yourself, we never would have gotten off the Breakin’ rant that started last week, and we would still be here talking about breakdancing movies.
Pete: It'd be 30 minutes in and I would be misquoting Wikipedia articles.
Jesse: I'm happy to help—well, actually you—anytime.
Pete: [laughs]. Thanks again, Jesse. So, if you've enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review and give it a five-star rating on your podcast platform of choice, and tell us some of the worst ways that you have seen change done in an organization. Thanks again.
Announcer: This has been a HumblePod production. Stay humble.