- Unconventional Guide to AWS Cost Management: https://www.duckbillgroup.com/resources/unconventional-guide-to-aws-cost-management/
- Trash Taxi: https://trash.taxi
Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.
Pete: Hello, and welcome to the AWS Morning Brief: Fridays From the Field. I’m Pete Cheslock.
Jesse: I’m Jesse DeRose. [laugh].
Pete: Hashtag #FFF. Not my grades in high school; that is Fridays From the Field.
Jesse: We will make it a thing. It’s going to happen.
Pete: It’s going to happen. We’re going to do our best to use the hashtag triple-F as much as possible. So, if you have any questions for us, just again, reminder, you can go to lastweekinaws.com/QA as we talk more about our Unconventional Guide to AWS Cost Management. Please give us your feedback, ask us some questions, we’ll answer those in a future episode. Today, we’re expanding on tagging. Because it’s so thrilling to talk about tagging some more, Jesse.
Jesse: We know that you have struggled to fall asleep at night listening to our podcasts. So, we wanted to do a very special episode just for you, to talk more about tagging. Let’s move into our NPR voices. [silky-smooth voice] Hello, and thank you for listening.
Pete: [buttery-smooth voice] Sponsorship of this—no, I’m just kidding. We’re not—we leave that work to Corey.
Pete: So, today is really about how to win friends and influence DevOps, and it’s all about continual tagging improvement.
Jesse: We talked about the importance of tagging, and one of the things that’s really important to tagging is identifying a tagging strategy, and then building and developing that tagging strategy over time. Your tagging strategy is going to change over time; that is the nature of the beast. Your organization is going to change over time, therefore your organization’s needs are going to change over time, and the tagging strategy and the tagging needs are going to change over time, as well.
Pete: Exactly. You’re going to build new products; you’re going to grow, hopefully; you’re going to add additional Amazon accounts; you can make acquisitions; you could get sold to another business. There’s just so many things that are going to happen, they’re going to change. It’s just inevitable. So, how do you continue this process of tagging, and this is, I think, a really important discussion because when you start that process, you take that first step and you start investing in tagging, the best way to get those—you know, that compound interest on all of the return value that you’re putting into tagging, is by making it a long term, continual process. And I’m not talking about, like, “Well, you know, we do a little thing every month, and it’ll be good by, I don’t know, maybe a month or two, next quarter. And then we’ll be done.”
Pete: And that doesn’t work. The best companies that we’ve seen that have really knocked this out of the park have turned this into just a multi-year endeavor. It is going to take you a long time to reach just, like, the pinnacle of tagging, having that ability to allocate just down to the penny of your Amazon spend is going to take a long time. So, manage those expectations appropriately that this is not an overnight fix.
Jesse: So ultimately, at this point, you’ve tagged all of your resources; you’ve built this policy. The next thing to really think about is, why? Because in a lot of cases, a lot of engineers are going to ask you this very question. Why should we tag this information? Why should we tag these resources?
And you’re going to need an answer that’s more than just, “Well, finance wants this information,” or, “Product wants this information,” or, “The engineering leadership team wants this information.” What you’re getting at with tagging is cost attribution. So, at a really high level, for those who aren’t familiar, cost attribution is the process of identifying, aggregating, and assigning your AWS spend to your organization’s teams, your business units, your products, however you want to slice-and-dice that data, whatever different tags you might be leveraging within your tagging policy. So, it’s really about where is your AWS spend going, along these different lines of the different things that finance cares about, that engineering cares about, that product cares about, that IT or security cares about. So, it’s not just about tagging your resources so that everything’s tagged, but it’s about leveraging that information to understand, where are your costs going?
Pete: I think that also gives companies a great KPI—Key Performance Indicator for the non-business folks. But it’s a good metric. A good way to track your success with tagging is to basically answer this question: what percentage of spend is tagged? Not number of resources because there are some resources that simply don’t have a cost that have the ability to be tagged. So, tracking tagged by a percentage of resources is, for the most part, not useful.
Pete: But tracking what percentage of your spend is tagged—and specifically tags that are enabled as cost allocation tags, which is something that you need to make sure you set up—but by tracking that spend, that KPI, that’s how you can start to understand how good of a job you’re doing at this. Now, again, we’re obviously focused on tags as a cost attribution strategy. But the reality is, is that’s the main use of them on Amazon, specifically. The main use of tags, again, that we see is so people can understand where the money’s going.
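The KPI Pete describes can be sketched in a few lines of Python. This is a minimal illustration, not how Cost Explorer computes anything: the line-item shape, the `team` tag key, and the dollar figures are all made-up assumptions standing in for whatever billing export you actually have.

```python
from decimal import Decimal


def tagged_spend_percentage(line_items, tag_key):
    """Return the share of total cost (as a percentage) on items where tag_key has a value."""
    total = sum((Decimal(item["cost"]) for item in line_items), Decimal("0"))
    if total == 0:
        return Decimal("0")
    tagged = sum(
        (
            Decimal(item["cost"])
            for item in line_items
            # An empty tag value is treated as untagged.
            if item.get("tags", {}).get(tag_key)
        ),
        Decimal("0"),
    )
    return (tagged / total * 100).quantize(Decimal("0.1"))


# Hypothetical billing line items; the costs are strings so Decimal stays exact.
line_items = [
    {"resource": "i-aaa", "cost": "750.00", "tags": {"team": "search"}},
    {"resource": "i-bbb", "cost": "200.00", "tags": {}},
    {"resource": "vol-ccc", "cost": "50.00", "tags": {"team": ""}},
    {"resource": "sg-ddd", "cost": "0.00", "tags": {}},
]

spend_pct = tagged_spend_percentage(line_items, "team")

# The same data measured by resource count, for contrast.
tagged_count = sum(1 for item in line_items if item.get("tags", {}).get("team"))
resource_pct = 100 * tagged_count / len(line_items)

print(spend_pct)     # 75.0
print(resource_pct)  # 25.0
```

In this made-up data set only one resource in four is tagged, yet three quarters of the spend is attributed, which is why the spend-based number is the one worth tracking.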
Jesse: Yeah. AWS even calls them out as user-defined cost allocation tags. For example, if you want to log into Cost Explorer and see where your spend is going among different products, among different teams, among different business units, you need to make sure that those tags that you’re leveraging are enabled as cost allocation tags in Cost Explorer. So, that’s a really important footnote to call out.
Pete: Yeah, to that point: if you do enable your cost allocation tags, there’s maybe some default ones that Amazon will enable for you, but you’ll have to enable any of your own custom ones. Those take effect going forward; they’re not retroactive. So, if you want to understand which tag is costing you a certain amount of money, make sure to go and enable that as soon as you possibly can because you’re not going to be able to look back at Cost Explorer and see what historical spend was.
Jesse: And this is also a reason why we mention that it’s never, ever too early to start the tagging conversation and tagging your resources because the sooner you start tagging your resources, the more historical data you will have for that particular tag.
Pete: Yeah, and that ability to have historical data will help you forecast, which someone’s going to ask you for in the future if they haven’t already, is to try to forecast and plan for the future spend. And historical spend is a great way of understanding that. But again, our focus is on the cost side of things. But there’s a lot of other great usage for tags in Amazon: security, access control, things like that are really useful. I’ve seen companies use tags as a way of marking hosts that were in compliance: you can mark an EC2 host as in compliance when it was scanned, tag it appropriately, and then security teams can use those tags.
Super bonus if you can align those two needs of cost attribution plus your security needs because that’s a really great way of incentivizing those engineers to make sure things are tagged really well. Another really interesting open-source product that came out of a previous company—I swear, this is not a joke. It’s an open-source product called Trash Taxi—
Jesse: Oh, my God. [laugh].
Pete: It has a fantastic logo. You can go to trash dot taxi and check out the very awesome logo there. A former colleague of mine built this. This is actually a phenomenal use of tagging as a way to identify assets that are running that were, maybe, manually logged into. So, of course, like, “You should never log into your EC2.” That’s the thing that people say. But everyone logs into their EC2 at some point. You got to look at something, you’re debugging, like, you just need to get on the host if you’re running something on EC2, and that’s fine. And at a previous company, that was an okay thing.
But what if, after someone logged in we could mark that host with a tag and say that this is essentially a tainted host. Because if someone logged in, then theoretically, they could have made a change that falls outside of a normal configuration management run. It’s now different than the others. So, what if we could mark it and then, later on, we can go and just terminate it, let the auto-scaling group replace it. So, that is, again, a really interesting use of tagging.
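The mark-then-replace idea can be sketched roughly as follows. This is not Trash Taxi's actual implementation: the `tainted` tag key, the companion tag names, and the instance records are hypothetical, and the real tagging and termination calls to AWS are deliberately left out.

```python
from datetime import datetime, timezone

TAINT_TAG = "tainted"  # hypothetical tag key, not taken from Trash Taxi


def taint_tags(username, now=None):
    """Build the tags to attach to a host right after someone logs in."""
    now = now or datetime.now(timezone.utc)
    return {
        TAINT_TAG: "true",
        "tainted-by": username,
        "tainted-at": now.isoformat(),
    }


def instances_to_terminate(instances):
    """Return the IDs of instances carrying the taint tag."""
    return [
        instance["id"]
        for instance in instances
        if instance.get("tags", {}).get(TAINT_TAG) == "true"
    ]


# Hypothetical fleet: someone logged into i-002 to debug something.
fleet = [
    {"id": "i-001", "tags": {"team": "search"}},
    {"id": "i-002", "tags": {"team": "search", **taint_tags("pete")}},
    {"id": "i-003", "tags": {}},
]

print(instances_to_terminate(fleet))  # ['i-002']
```

A scheduled job could feed that list to whatever terminates instances and let the auto-scaling group bring up clean replacements.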
Other uses that we’ve seen has to do with service discovery. I know in the earlier days, tagging APIs were a little dodgy. I’m—I’ve had outages due to tags going down, which sounds like a crazy thing. But tags are another great way of driving some of your service discovery needs. And again, great way to align with your needs of cost attribution.
Jesse: Yeah, and I think that there are a lot of different ways that you can make sure that these tags are applied, or use these tags for your policy work for identifying service discovery, all these things that Pete just talked about. There are tools like Cloud Custodian, for example, that I used in the past. I used to work for Capital One, and we had extensive use of Cloud Custodian, and as soon as we deployed any resources in any AWS account that didn’t quite fit the framework of the EC2 instances that we were allowed to use, that didn’t live in the correct availability zones or regions that we were allowed to deploy in, or maybe didn’t have the right tags associated with them—how meta can you get? That we maybe didn’t have the owner tag associated with it, or we didn’t have all the necessary tags for our policy associated with that. Cloud Custodian would automatically tag that resource as non-compliant, and potentially send us an email or some kind of message so that we knew that we had resources associated with our team that were non-compliant.
And then after a certain amount of time, if we took no action, those resources would be shut down automatically. Which I don’t necessarily recommend for production—we didn’t actually do that in production, we just stopped the instances, but you ultimately had this opportunity to really clearly enforce your tagging policies.
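The flow Jesse describes (flag the resource, notify the owner, act after a grace period) can be sketched as a small decision function. This is plain Python written for illustration, not Cloud Custodian's policy language: the required tag keys, the three-day grace period, and the `flagged_at` field are all assumptions.

```python
from datetime import datetime, timedelta, timezone

REQUIRED_TAGS = {"owner", "team", "environment"}  # hypothetical tagging policy
GRACE_PERIOD = timedelta(days=3)                  # hypothetical grace period


def missing_tags(resource):
    """Return the required tag keys that are absent or empty, sorted by name."""
    present = {key for key, value in resource.get("tags", {}).items() if value}
    return sorted(REQUIRED_TAGS - present)


def decide_action(resource, now):
    """Pick the next enforcement step for one resource."""
    if not missing_tags(resource):
        return "ok"
    flagged_at = resource.get("flagged_at")
    if flagged_at is None:
        # First time this resource is seen out of compliance.
        return "flag-and-notify"
    if now - flagged_at >= GRACE_PERIOD:
        # Stopping rather than terminating, as in the episode.
        return "stop"
    return "wait"


now = datetime(2021, 1, 10, tzinfo=timezone.utc)
untagged = {"id": "i-123", "tags": {"team": "payments"}}
overdue = {**untagged, "flagged_at": now - timedelta(days=5)}

print(missing_tags(untagged))       # ['environment', 'owner']
print(decide_action(untagged, now)) # flag-and-notify
print(decide_action(overdue, now))  # stop
```

Keeping the decision separate from the side effects (tagging, emailing, stopping) makes it easy to run the same check in report-only mode before turning enforcement on.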
Corey: This episode is brought to you in part by our friends at FireHydrant where they want to help you master the mayhem. What does that mean? Well, they’re an incident management platform founded by SREs who couldn’t find the tools they wanted, so they built one. Sounds easy enough. No one’s ever tried that before. Except they’re good at it. Their platform allows teams to create consistency for the entire incident response lifecycle so that your team can focus on fighting fires faster. From alert handoff to retrospectives and everything in between, things like, you know, tracking, communicating, reporting: all the stuff no one cares about. FireHydrant will automate processes for you, so you can focus on resolution. Visit firehydrant.io to get your team started today, and tell them I sent you because I love watching people wince in pain.
Pete: Yeah, you often hear these terms, “It’s a governance solution.” It’s a—
Pete: —kind of a gross term. Yeah, it doesn’t sound good. Sounds very enterprise-y, like ‘governance.’ But it’s really what it is. You’re trying to enforce this policy in what I would either call the stick or carrot model.
You’re either going to give some negative feedback, like terminating that instance or maybe some positive feedback, positive reinforcement, by emailing that person and saying, “Hey, just a friendly reminder, I’m going to terminate your instance.”
Pete: But tools like Cloud Custodian are really powerful because they can help you keep an eye on things when you can’t. Like you can’t always watch everything that’s happening—and hopefully, you’re allowing your engineers the ability to build it and run it themselves on Amazon—this governance solution will ensure that people are, kind of, doing the right things and then fixing it—I think that’s an important one—when they’re not.
Jesse: Yeah, again, this gets back to the idea of, you want your systems to be as streamlined as possible, as easy to use as possible for your developers, to make the right thing the easy thing. So, for example, if you want all these tags applied to your resources, you want your deployment pipelines to be able to add these tags as easily as possible. But someone’s going to forget, someone’s going to log into production and spin up that I3 instance for whatever, quote-unquote, “Testing purposes,” but then you now have a streamlined way to very clearly say, “Hi. I know that this was probably meant to be spun up in a different environment, or maybe you meant to tear this down as soon as you were done testing with it, but this isn’t compliant with our tagging policies, this isn’t compliant with maybe our business policies. This resource either needs to be moved, tagged, some action needs to take place.”
So, there’s really awesome opportunities to provide very streamlined, easy-to-understand messages to your engineers to say, “Hey, this doesn’t align with our tagging policies. This doesn’t align with our business policies. We need you to take some action.” And very clearly give them the action to take place, whether that is adding tags, whether that is tearing it down. Give them very, very easy opportunities to understand how to resolve the problem and how to move forward.
Pete: Yeah, exactly. I mean, there’s really two main strategies here: you set up continuous monitoring, which you’re probably doing already with, like, host monitoring and metrics monitoring, but set up a continuous monitoring with a governance solution, something—either home-built or maybe a Cloud Custodian or other more commercial SaaS products that exist out there—and notify those teams when they fall out of compliance. But then also start introducing those more aggressive approaches, in maybe pre-production accounts first, but over time by terminating those resources, or automatically resizing them if someone used the wrong instance, or whatever. I mean, this stuff exists via these APIs. And you could do maybe both of these. Porque no los dos? Why not both?
Jesse: [laugh]. I will say a lot of the research that we’ve seen suggests that the positive reinforcement incentives work better than the negative reinforcement. So, in a lot of cases, if you can champion the teams that are tagging all of their resources—or sorry, I should say tagging all of their taggable usage—for example, that’s going to ultimately lead other teams to say, “Well, hey, if they’re getting appreciated for their work, if they’re getting rewards for tagging all of their usage, I want to be like them. I want to tag my usage as well.” And that’s ultimately going to be more impactful, more beneficial than if you start punishing teams for not tagging the resources. But to be clear, there’s definitely going to be teams that just don’t care. So, you’re going to need a little bit of both. There’s going to need to be some opportunities for hard love.
Pete: Yeah. I mean, I would even consider providing cash incentives, right? Because these engineers—let’s say you’re identifying, via tagging, that some teams are maybe a higher percentage of tagging or are maintaining their spend better than others, and other teams are maybe a little bit more wasteful or leaving things running. Provide cash incentives for teams to take action. And this kind of goes back into a more broader discussion when you talk about the tags and how you’ve maybe used them for your budgeting purposes, whether it’s a chargeback or a showback.
But by making those numbers a thing, those budgets that teams can work towards—again, another metric you can track—consider savings. If you see an opportunity for a team to reduce their spend, incentivize them with some money. That’s a great incentive for a lot of teams.
Jesse: Yeah. Or if you’re able to, maybe you can incentivize them with additional days off, additional PTO days, or something else, maybe a group activity, and I shuddered as soon as I said those words, but maybe there’s something that you can do to incentivize them that really speaks to the team’s interests.
Pete: Yeah, you could do that. You could also do what I used to do at a previous company as in, just secretly replace everyone’s hosts with T class instances.
Pete: I’m not even joking. It’s like, “[velvety-soft voice] We’ve secretly replaced his host with a T2. Let’s see if he notices.”
Pete: Spoiler alert: they did notice, but very rarely did we ever have to move it back. We used our monitoring, and our monitoring of performance was our biggest cost management tool to identify those savings. Now, Amazon has resource groups, you can do AWS Config Managed Rules, Cloud Custodian, like Jesse mentioned. And you can even set up some interesting CloudWatch event-driven Lambda functions that can actually go and apply tags or identify things that are missing tags, after the fact. There’s just countless tools out there that can really help you with this.
And always, again, remember that this is continual; it’s always happening. If you can make it a little bit better every day, two, three years from now you’re going to look back and realize, “Wow, we are much further than we were when we started because we did a little bit at a time.”
Jesse: And then this ultimately allows you to make data-driven decisions based on all of this information, whether as Pete mentioned before, that’s forecasting, whether that’s a security or a compliance discussion, whether that’s a cost discussion, you now have all of this historical data for your AWS usage that you can leverage to really understand what you might ultimately be capable of in the future.
Pete: Exactly. So, if you’ve enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review, give it a five-star rating on your podcast platform of choice, and tell us how many t2.micros you deployed for your engineers. Again, don’t forget, please give us your feedback. Any questions as well, you can go to lastweekinaws.com/QA. If you have questions about some of these things or want some additional insights, we’re going to be taking some time in some future episodes to talk more about those. Thanks again, everyone.
Announcer: This has been a HumblePod production. Stay humble.