Join Pete Cheslock and Jesse DeRose as they take the reins of the Whiteboard Confessional podcast with an examination of the hot-off-the-presses AWS Cost Anomaly Detection service. Pete and Jesse do a deep dive of the new service and talk about Pete’s rule for gauging a company’s ability to do machine learning, the best part about the AWS Cost Anomaly Detection service, how AWS customers can help AWS train the algorithm and improve the service, why the walkthrough tour that AWS provides for the service is awesome, how to determine what notification threshold to use for AWS Cost Anomaly Detection, why it’s better to have too many alerts than not enough, and more.
Episode Show Notes & Transcript
About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.
Corey: This episode is sponsored in part by Catchpoint. Look, 80 percent of performance and availability issues don’t occur within your application code in your data center itself. It occurs well outside those boundaries, so it’s difficult to understand what’s actually happening. What Catchpoint does is makes it easier for enterprises to detect, identify, and of course, validate how reachable their application is, and of course, how happy their users are. It helps you get visibility into reachability, availability, performance, reliability, and of course, absorbency, because we’ll throw that one in, too. And it’s used by a bunch of interesting companies you may have heard of, like, you know, Google, Verizon, Oracle—but don’t hold that against them—and many more. To learn more, visit www.catchpoint.com, and tell them Corey sent you; wait for the wince.
Pete: Hello and welcome to the AWS Morning Brief: Whiteboard Confessional. Corey is still not back. Of course, he did just leave for paternity leave, so we will see him in a few weeks. So, you're stuck with me, Pete Cheslock, until then. But luckily, I am joined again by Jesse DeRose. Jesse, thanks again for joining me today.
Jesse: Thank you for having me. You know, I have to say I love recording from home. I can't see the look in our listeners’ eyes as they glaze over while we're talking. It's absolutely fantastic.
Pete: It's fantastic. It's like a conference talk, but there's no questions at the end. It's the best thing ever.
Jesse: Yeah, absolutely. I love it.
Pete: All right. Well, we had so much fun last week talking about a new service. Although it turns out it was new to us. It was the AWS Detective—or Amazon Detective. There's still some debate about what the actual official name of that service is. For some reason, I thought that service came out in the summertime, but it turns out it was earlier in the year. So, still a great service, AWS Detective—or Amazon Detective, whichever way you go with that one—but we had such a fun time talking about a new service that we had the opportunity of testing out an actual brand new service. This was a service that was just announced last Friday. And that's the AWS Cost Anomaly Detection service. Jesse, what is this service all about?
Jesse: So, you likely would notice if your AWS spend spiked suddenly, but only the really, really mature organizations would be able to tell immediately which service spiked. Like, if it's one of your top five AWS Services by spend, you'd probably be able to know that it's spiked, you'd probably be able to see that easily in either your billing statement or in Cost Explorer. But what if you're talking about a spike in a much smaller amount of spend, that's still important to you, but it's a service that you don't spend a ton of money on: it's a service that is not a large percentage of your bill. Let's say you use Workspace, and you only spend $20 a month on Workspace. You ultimately do want to know if that spend spikes 100 percent or 200 percent, but overall, that's only maybe $20 on your bills. So, that's not something to see very easily unless it spikes exponentially.
So, the existing solutions for this problem require a lot of hands-on work to build a solution. You either need to know what your baseline spend is in the case of AWS Budgets, or you need to perform some kind of manual analysis via custom spreadsheets or business intelligence tools. But AWS Cost Anomaly Detection kind of gets rid of a lot of those things. It allows you to look at anomalous spend as a first-class citizen within AWS.
Pete: Yeah, the other trick too, with this anomalous spending—and I've gotten really good at learning how to spell ‘anomaly’ because I've always spelled it very wrong my entire life, but in just writing the preparatory material for this, the number of times I spelled anomaly has really solved that problem for me. Now, sometimes those mature organizations, they might see that anomalous spend, maybe the day after, maybe the week after, but I've been a part of organizations that only see that spend when the bill comes. That's actually pretty common. You're not an outlier if you only identify these outliers in spend when your bill arrives. And that outlier in spend could be something like, “Wow, we changed a script, and we're doing a bunch of list requests, and wow, where'd that $8,000 come from?” or, “We're testing out Amazon Aurora and we did a lot of IOs last weekend, and our estimated bill is going to be $20,000.” Those are all things that if you're not a crazy person who's so in love with your bill that you look at it every day, you're going to miss that, right? You're just going to wait for the invoice. That's what happens to everyone, right, Jesse?
Jesse: Absolutely. Yeah, it has been really fascinating for us to see this pattern again and again, honestly, with some of the clients that we worked with, but also within the companies that I've worked with over the years. It's just not something that is highly thought about until finance sees the bill at the end of the month or after the end of the month, and then it becomes a retroactive conversation, or a retrospective to figure out what happened. And that's not the best way to think about this.
Pete: Yeah, exactly. I mean, the best way to save money on your bill—something we see every day—is to avoid the charge, right? Avoid those extra charges. And the way you can do that is to know of an anomaly in advance. So, one of the best parts of this feature—I can't believe it, we've made it nearly five minutes into this conversation without calling out the most impressive part of Anomaly Detection—is the fact that it's all ML-powered. Now, I know what you're thinking, that you just cringed when I said ML, it's machine learning. And I cringe whenever a company markets based on machine learning. And the rule that I have is, you need to tell me how many PhDs are on your staff before I believe you can actually do machine learning.
Pete: In the Amazon case, as it turns out, I could guess that they hire quite a few PhDs, so I feel like I'm going to give them a pass on this one.
Jesse: I feel like this is going to be a fun, over-under conversation of how many PhDs were on the team that put this service together, or built the machine learning component of AWS Cost Anomaly Detection.
Pete: I'll tell you what. It's got to be more than most SaaS services that market towards machine learning.
Pete: Now, we got an insight into this from the product manager in advance; got to check it out, which was great. And then we learned that this is just there. It's in your account right now; you can go into your Amazon account, and enable Cost Anomaly Detection right now, and the best feature is it costs nothing. There's no charge for this. Now, there's some alerting you can set up, so there's charges for SNS or other notifications, but this service will alert you—anomalies, it will let you know of anomalies in your spend, you don't have to pay for it.
So, the best advice I can give you right now is that you should, at the conclusion of this wonderful podcast, go and enable this service and go and see what you find out. Now, there's a bunch of caveats, and we'll talk about some of those caveats. I think one of the things that we learned, which was pretty interesting, is that it will only currently let you know of spikes in spend, anomalies that are an increase in spend. And I'm sure that we can have a really long conversation with an actual PhD about why it's hard to identify dips in spend versus spikes in spend, but if you think about it, this is their initial release. And it's very clear, it's an initial release because it has the tag beta in it right in the UI. That was pretty interesting. Jesse, how many services have you seen ever pop up with specifically a beta tag within Amazon?
Jesse: I have seen plenty of services that have the preview tag, I've never seen an AWS service that has the beta tag, I feel like this is a new evolution of AWS Services that we are seeing for the very first time.
Pete: I think that's great. I mean, I have worked at companies, I've had a fight with product people about should we put a beta tag, should we not? What does that mean? What are we communicating? But I think in this scenario, it's perfect, right?
This is a new service. It's beta; it's free of any charge, other than the notifications that you might have set up. They really want users of this, and when we talk a little bit more about this, I think you'll see why, and how you can review these anomalies and report back to Amazon, how accurate those anomalies are. You can help actually improve the algorithm, which is pretty powerful if you think about it.
Jesse: I think that's one of the best features of this being in beta is that because it uses machine learning at its core, there is so much to be learned from many, many AWS customers enabling this now and training the model on their data and giving AWS the opportunity to continue to hone its model for each individual customer and for new customers.
Pete: Yeah, exactly. So, when you go into Cost Explorer—which is where you'll find this within the Amazon console—there'll be Anomaly Detection, you can find it and has the helpful beta tag for you to call it out. When you hop into it the first time, it actually has a self-guided tour: a pop up the first time you're going in there, it'll walk you through—it’s not an entirely complex application, but it will still walk you through what does each section mean? How do you set this up? How do you start using this? And kudos. Seriously, non-sarcastic kudos to the Amazon product teams and engineering teams for building that in. We should see more of those types of tours.
I know that Jesse and I both had issues with the AWS Detective service in that it just dumps you in and you're like, “I have nowhere to go.” This was like, not—it wasn't long, maybe—what—four-step tour, but it definitely explained where you are, and what you need to do for this to provide value to you. So, kudos for that; hopefully, we see that in a lot more services. So, as you go through, it'll start explaining, kind of, the Overview Section, the Anomalies Section, and the Alerting Section. It's really basic, and those are the three main areas. And it claims that it will automatically alert you for changes in your overall spend. It will create these anomalies for any type of anomaly that it thinks exists. But then there's these alerts. This is, like, these custom monitors you create. And there's a whole slew of custom monitors. Jesse, what did you find as we dove into the custom monitors that you can create for anomalies?
Jesse: This was definitely one of the most interesting features for me. It's something that I have definitely seen with a lot of customers, but have not really actively thought about until I saw it here. So, again, kudos to the team who built this product for creating multiple different custom monitors. The easiest way to get started with Anomaly Detection is enabling AWS Services Monitoring. So, effectively, you are looking at each AWS service individually and saying, “Does the spend for this particular service go up or down? Has my EC2 spend gone up or down? Has my S3 spend gone up or down?”
Or in this case, has it only gone up at the current state of affairs. But there's other opportunities as well, which I'm really excited about. We can create a custom monitor that looks at a specific set of linked accounts. So, if you have separated your AWS accounts into multiple different accounts based on business units, based on different application environments, or other criteria, you can create monitors that allow you to specifically alert on anomalous spend in production, or anomalous spend in development, or anomalous spend for one particular business unit, or one particular team depending on how you've sliced and diced your AWS account structure. So, there's lots of opportunities here to specifically focus on the things that you care about, and spend less time worrying about the components that you don't care about.
You can also use AWS Cost Categories or cost allocation tags as your monitor to slice and dice anomalous spend based on specific tags that you have created. So, if you've created tags for different products or tags for different teams, maybe you're not at the point where you're ready to break AWS spend out into different linked accounts, but you still want to alert on anomalous spend in different teams or different business units, you have that opportunity here with AWS Cost Categories and cost allocation tags. So, right out of the gate, not only is AWS telling you that they will look at your overall bill and alert you to anomalous spikes, but they're giving you multiple different vectors to alert on anomalous spend, which is really, really fantastic.
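As a rough illustration of the monitor types Jesse lists (AWS services, linked accounts, cost allocation tags, and Cost Categories), here is a minimal Python sketch that only builds the monitor definitions as plain dictionaries. The helper names are hypothetical, and the field names follow my understanding of the Cost Explorer `CreateAnomalyMonitor` request shape; treat them as assumptions and check the AWS documentation before relying on them.

```python
# Hypothetical helpers. The dictionary shapes are an assumption based on the
# Cost Explorer CreateAnomalyMonitor request; nothing here calls AWS.

def services_monitor(name):
    """Watch each AWS service individually (the 'AWS Services' option)."""
    return {
        "MonitorName": name,
        "MonitorType": "DIMENSIONAL",
        "MonitorDimension": "SERVICE",
    }


def linked_accounts_monitor(name, account_ids):
    """Watch a specific set of linked accounts, e.g. production only."""
    return {
        "MonitorName": name,
        "MonitorType": "CUSTOM",
        "MonitorSpecification": {
            "Dimensions": {"Key": "LINKED_ACCOUNT", "Values": list(account_ids)}
        },
    }


def tag_monitor(name, tag_key, tag_values):
    """Watch spend carrying a given cost allocation tag, e.g. team=payments."""
    return {
        "MonitorName": name,
        "MonitorType": "CUSTOM",
        "MonitorSpecification": {
            "Tags": {"Key": tag_key, "Values": list(tag_values)}
        },
    }


def cost_category_monitor(name, category_name, category_values):
    """Watch spend that falls into a given AWS Cost Category."""
    return {
        "MonitorName": name,
        "MonitorType": "CUSTOM",
        "MonitorSpecification": {
            "CostCategories": {"Key": category_name, "Values": list(category_values)}
        },
    }
```

With boto3 installed and credentials configured, one of these dictionaries could presumably be passed as the `AnomalyMonitor` argument to the Cost Explorer client's `create_anomaly_monitor` call; the account IDs, tag keys, and category names are placeholders you would supply yourself.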
Corey: This episode is sponsored in part by ChaosSearch. Now their name isn’t in all caps, so they’re definitely worth talking to. What is ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes, not days or weeks. Click. Boom. Done. ChaosSearch is for you if you’re trying to get a handle on processing multiple terabytes, or more, of log and event data per day, at a disruptive price. One more thing, for those of you that have been down this path of disappointment before, ChaosSearch is a fully managed solution that isn’t playing marketing games when they say “fully managed.” The data lives within your S3 buckets, and that’s really all you have to care about. No managing of servers, but also no data movement. Check them out at chaossearch.io and tell them Corey sent you. Watch for the wince when you say my name. That’s chaossearch.io.
Pete: As we went through the setup process, selecting on what we wanted to monitor, at least for our accounts, was pretty easy. We just said Amazon services. But if we had a lot of accounts, or if we wanted to break it out like Jesse said, we could have selected on one or the other. But one of the first custom options that you have to fill out is something called an alert threshold. And the way that Amazon explains this area is a threshold is not the same as an anomaly.
So, anomalies, those are things that are detected via machine learning. And those are things that happen, completely separate to the monitors that you create, the monitors are the things that are going to notify you. And it even says, for example, you could set a zero dollar threshold alert of every anomaly, even if the cost impact is $1. And even though that sentence is written extremely poorly in the documentation, what I think they're trying to get to is if you wanted to get alerted for every anomaly that Amazon identifies via their ML, you put a threshold of zero. So, this, kind of, begs the question of, “Well, what dollar amount do I put here?”
And that is a big question, and it's going to be different for every organization. Maybe you want to look at your total spend, you spend $100 a month, do you want to be notified of a 10 percent swing? Put $10 in there, do you want to be notified of a 1 percent swing? Put $1 in there. So, I think it's best to think of this as what's the percentage spike that you would raise an eyebrow at if you were looking through Cost Explorer on your own. And that's the threshold you put there.
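The arithmetic Pete describes, turning "the percentage swing I would raise an eyebrow at" into the dollar figure the alert threshold asks for, can be sketched in a few lines of Python. The function name is hypothetical and this is only the back-of-the-envelope calculation from the conversation, not anything the service does for you.

```python
def dollar_threshold(monthly_spend, percent_swing):
    """Convert a percentage swing of monthly spend into a dollar threshold."""
    if monthly_spend < 0 or percent_swing < 0:
        raise ValueError("spend and percentage must be non-negative")
    return monthly_spend * percent_swing / 100


# The examples from the episode: $100 a month in total spend.
ten_percent = dollar_threshold(100, 10)  # 10.0 dollars
one_percent = dollar_threshold(100, 1)   # 1.0 dollar
```

A percent swing of zero gives a zero dollar threshold, which matches the "alert on every anomaly" setting discussed earlier.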
Now, granted, this is again for alerting you, and there are a lot of different ways to alert. You can use SNS to alert via countless options: send it to a Lambda, send it to a chatbot. But you can also summarize these alerts, so if you want every individual alert, send it through SNS; that's actually your only option. If you wanted daily or weekly summaries, you can identify right in the monitor alert and put in those email addresses you want, and those will just get created for you. So, that's really the only hard part of setting this up is identifying what the threshold anomaly you want to get identified. And my guess is, as we see some anomalies, that's a threshold that we can adjust later if we're getting alerted too often or not enough.
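The split Pete describes, individual alerts only through SNS and daily or weekly summaries through email, could be captured in a small Python helper that builds the alert subscription definition. This is a sketch under assumptions: the helper name is hypothetical, the field names follow my understanding of the Cost Explorer `CreateAnomalySubscription` request, and newer API versions may prefer a threshold expression over the plain `Threshold` number. Nothing here calls AWS.

```python
def build_subscription(name, monitor_arns, threshold_usd, frequency, addresses):
    """Build an alert subscription definition as a plain dictionary.

    Individual (immediate) alerts go to SNS topic ARNs; daily or weekly
    summaries go to email addresses, as described in the episode.
    """
    frequency = frequency.upper()
    if frequency == "IMMEDIATE":
        subscriber_type = "SNS"
    elif frequency in ("DAILY", "WEEKLY"):
        subscriber_type = "EMAIL"
    else:
        raise ValueError("frequency must be IMMEDIATE, DAILY, or WEEKLY")
    if threshold_usd < 0:
        raise ValueError("threshold must be non-negative")
    return {
        "SubscriptionName": name,
        "MonitorArnList": list(monitor_arns),
        "Threshold": float(threshold_usd),
        "Frequency": frequency,
        "Subscribers": [
            {"Type": subscriber_type, "Address": address} for address in addresses
        ],
    }
```

Starting with `threshold_usd=0` and raising it later mirrors the "noisy first, tune later" approach; the monitor ARNs, topic ARNs, and email addresses are placeholders.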
Jesse: I totally agree. And I think one of the big things to highlight there is this may be something that you can set up now at zero dollars to alert on everything and fine-tune later. Let the alert be noisy until you find the right threshold. I would rather have more alerts than too few alerts, and this may end up being something that you have a business conversation about with engineering leadership and finance to understand what is the threshold that they want to be notified of for these alerts. Is it something that we create individual alerts for the individual teams so that the teams can receive individual alerts, but maybe engineering or finance leadership only cares about percentage point increase in spend, or maybe they care about a certain dollar amount? Those things are important items to discuss as you're enabling these alerts. But ultimately, you can start with a very noisy alert and then change it later.
Pete: Yeah, always remember, too, Jesse and I are former operators. We fully understand alert fatigue. And at this point, we're just not sure of how many anomalies we're going to see. We actually configured ours for an SNS notification; it's going to go to the Amazon chatbot service, and that's going to dump into our Slack channel. So, this is actually a part one of two Whiteboard Confessional. We're going to let this run for a few weeks and see what we get back.
Additionally, we're going to reach out to the Amazon product owners on this application because it wasn't all roses. Getting this set up, we actually did run into issues which, honestly, I was a little surprised by, only because there really aren't a lot of configuration options. But we did run into some issues. What was the first issue we ran into?
Jesse: The first issue that we ran into was actually one of the most interesting and actually took the most time to troubleshoot because the alert message that we received, when we tried to create our first monitor very generically said, “This monitor cannot be created.” It's that beautiful mix of telling me something that I already clearly know, and I don't know how to fix it. It's not giving me that information that tells me what's the next step I need to do in order to fix this. And we actually spent, I want to say 10, maybe 15 minutes, poking around different settings with our SNS topic, poking around different settings with permissions. We also found out that, ultimately, there are a certain number of permissions that you need to enable on your SNS topic ahead of time, otherwise Anomaly Detection won't let you write to that topic. So, there's little things like that that we ran into.
But ultimately, what we found was that there was already a monitor in our account for AWS Services, and we were trying to create a monitor for AWS Services. And this service does not let you create two monitors with the same custom monitor type right now. So, off the top of my head, I thought to myself, “Sure. I get that. It's trying to de-duplicate as much as possible.”
But one thing to think about is that there are different use cases for the same type of monitor. So, for example, you might want your team that is focusing on this cost optimization work or this anomalous spend work to receive individual alerts in an SNS topic and maybe go to a Slack channel. But maybe engineering leadership or finance wants to know, at a high level, what are those alerts on a daily basis or a weekly basis. And so ultimately, there's different alerts that may come out of the same monitor. So, there's definitely opportunities here to improve the service over time to allow some of these things to be fleshed out, which I think is part of what—you know, this is a service in beta, so we weren't expecting it to be perfect, and so this is also something that we completely understand and we expect there to be slight rough edges. But also at the same time, this wasn't a huge loss in any way. The service is still absolutely usable, and we highly, highly recommend it. I highly, highly recommend it.
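On the SNS permissions Jesse mentions: the topic needs an access policy that lets the service publish to it before a monitor can write there. A minimal Python sketch of such a policy follows. The function name and statement ID are hypothetical, and the service principal `costalerts.amazonaws.com` is what I understand the AWS documentation to specify; treat it as an assumption and confirm it before use.

```python
import json


def cost_anomaly_topic_policy(topic_arn):
    """Build an SNS access policy letting Cost Anomaly Detection publish alerts.

    Assumption: the service principal is costalerts.amazonaws.com.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCostAnomalyDetectionPublish",  # hypothetical label
                "Effect": "Allow",
                "Principal": {"Service": "costalerts.amazonaws.com"},
                "Action": "SNS:Publish",
                "Resource": topic_arn,
            }
        ],
    }


def policy_json(topic_arn):
    """Serialize the policy for pasting into the topic's access policy editor."""
    return json.dumps(cost_anomaly_topic_policy(topic_arn), indent=2)
```

Calling `policy_json` with your own topic ARN (a placeholder here) produces a document you can merge into the topic's existing access policy rather than replacing it outright.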
Pete: Yeah, it's also important to note that neither Jesse nor I spent any time reading the documentation in advance because we wanted to represent the average Amazon user, which is, “I’m going to go in and I'm going to start clicking around,” because that's what we do. We go in and we just try to figure it out. So, we didn't actually read the documentation, and I don't know if maybe the documentation has various caveats in there. So, I'm going to apologize in advance if it turns out that, “Oh, yeah. This is well documented that it cannot do that.”
It's the disconnect, of course, though, that the documents live in a completely different area, and also the error message could have been a little bit more helpful to simply say, “You cannot do this,” right? That would be a lot easier because as Jesse said, once I was debugging IAM permissions for the SNS topic, I kind of thought, “Wow, what has happened here?” I thought we were just going to click three or four times and make this magic happen. So, yeah, that being said, could be a caveat; we didn't read the documentation, but it's definitely something to keep in mind and, honestly, I hope it's a feature that they build for, and they bring to the surface in the future because I would like that ability. I would like an email to go to one team versus another team, maybe real-time alerts going into another location. There's just different ways that I might want to be notified. And of course, I could probably code up something with Lambda, but honestly, that just feels like a little bit of a cop-out. So, well, Jesse, what were your kind of final thoughts about this product? I personally really thought that this is a really impressive initial release of a really interesting product, and it's free. But what were your thoughts?
Jesse: Absolutely. This is a product that may still be in beta, but it already has a lot of polish on it, there's a lot of really great value-add to this product, and if it's free, there's no reason not to use it right now. Even if you set your alerts at a high threshold—so maybe you're not getting email notifications regularly—just to see what kind of information the service shows you in terms of anomalous spend, I highly recommend enabling this service, just seeing what information it comes up with. And one thing that we didn't potentially talk about in too much detail, but I think is really important to note: because it is a machine learning model, you have the opportunity to train the model. So, when you receive alerts, you will receive a notification that says was this alert actually useful? Was this spend actually anomalous? And you can train the model to say, “Yes, the spend was anomalous,” or, “No the spend was not anomalous.” And help the model get better and better understand your spend, better understand any new customer’s spend. So, at the end of the day, for a free feature with minimal configuration setup, I highly recommend it.
Pete: Absolutely. It's free. It's free. Just go turn it on. Well Jesse, thanks again for joining me, for helping me deep dive into yet another Amazon product. Join us in a few weeks, I think, is our hope that we'll have some anomalies. That we'll see some alerts get sent to us but also, too, I think we're going to reach out to the account managers of—or the product owners of this service and get some clarification. Maybe learn a little bit more: what did we miss that we really should have been talking more about, and share that as well.
Thanks again, Jesse. Again, really appreciate it. If you enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review, give it a five-star rating on your podcast platform of choice and tell Corey that you still miss his hot takes. This is Whiteboard Confessional. Thanks again.
Announcer: This has been a HumblePod production. Stay humble.