AWS Morning Brief
Infrastructure Code Smell (aka Who Microwaved the Fish?)
Episode Summary
Join Pete and Jesse as they continue the Unconventional Guide to AWS Cost Savings with a look at “code smell,” where the term comes from, and what it means. They also touch upon the important role context plays in understanding costs and usage impacts, how you’re eventually going to have to rearchitect your application when you achieve scale and how that should influence your thinking, why you should run proof of concept projects when you’re not sure how much something is going to cost in the cloud, how lifting and shifting can actually increase costs, an easy way to make sure you’re not storing data unnecessarily, why you should consider implementing lifecycle policies for data, why Pete loves intelligent tiering, and more.
Episode Show Notes and Transcript
Transcript

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.

Pete: Hello, and welcome to the AWS Morning Brief. I’m Pete Cheslock.

Jesse: I'm Jesse DeRose.

Pete: Fridays From the Field, Jesse. We're back again.

Jesse: Back, back, back again.

Pete: I always say that when I rage quit computers, it would be fun to be a farmer. And so maybe this is a little trial run “Fridays From the Field.” I'm just out in the field.

Jesse: So basically, what I'm hearing is that you are the old man out in the field, yelling at the clouds as they go by.

Pete: Well, I work from home pretty much all the time now, partly as part of Duckbill but also due to COVID. I do yell at the squirrels who constantly tear up my yard. I've now turned into that person.

Jesse: [laugh]. Oh, oh, Pete, I'm so sorry.

Pete: Those squirrels. I hate them. So we're back again, talking about the Unconventional Guide to AWS Cost Savings. And this time, we're talking about ‘infrastructure code smell.’

Jesse: Ooh, fun one.

Pete: I like to equate this to: who brought the fish for lunch and microwaved it?

Jesse: I always understood that at a deep core level, but didn't really think about it until I actually did microwave fish one day, and I regret everything.

Pete: Don't do it. I'm telling you, folks, don't do it. You can bring tuna fish in. I guess that's fine. That's a little bit better. If it's packed in oil, it actually is a lot less smelly. Should we do a food podcast? No, I’m just kidding. [laugh].

Jesse: [laugh].

Pete: So, ‘code smell.’ I do want to bring this one up because I actually had a bit of a TIL—today I learned—moment with code smell. This term was actually coined by Kent Beck, a writer in the agile software movement, while he was working with Martin Fowler, a noted author on programming. They coined the phrase ‘code smell’ in the book Refactoring.

Jesse: I did not know this.

Pete: Yeah. You know, you kind of hear a term and just accept it without really understanding why. But the way it was defined in the book is that code smell is a surface indication that usually corresponds to a deeper problem in the system. So, it is what it sounds like: something smells. Something doesn't seem good here. And it can take a lot of forms. You most often hear it in software engineering but, guess what? Software engineering has expanded to manage our infrastructure, right?

Jesse: Mm-hm, absolutely. Yeah, it's not just about—or I should say, infrastructure smell is not just about wasted resources. It's really thinking about all of those one-off hacks that got you this far. So, that one time that you couldn't deploy something into production, so you just said, “You know what? I'm just going to log into the console and spin up that instance, and then call it a day, and close the change order, and be done with it so I don't have to worry about it. Maybe I'll open a ticket to see if I can figure out what happened in the deployment pipeline, but I'm not going to worry about it.” All those little things that you did along the way that probably aren't the best practices you should be following, and that you ultimately want everybody else to be following.

Pete: Yeah, and I'm looking at you, software infrastructure manager, who is still running an m1.medium in production. That's code smell. 

Jesse: Oof.

Pete: Anyway. Just don't use the m1.mediums. Let them go away. But, Jesse, you're right. It's not just those hacks and one-offs. It's kind of back to the context. It's the how. How you're doing certain things with these Amazon resources, right?

Jesse: Yeah. And I think that's a really important caveat to call out, because there is always a balance between premature optimization and waste. I struggle with this one a lot. My brain automatically thinks, “Well, if I'm going to do this, I'm going to do it the right way the first time, the streamlined, automated way, so that I can have it all set up on the very first go, set it and forget it, and walk away.” But in most cases, that's not how it works.

Pete: Yeah, that is a complicated topic that I've struggled with as well. I've worked for predominantly unprofitable startups. We have a burn rate: there's only a certain amount of money in the bank, you divide it by what your spend is, and that's when you're out of money. It doesn't necessarily mean the company is out of business, but it could mean that all that sweet equity that you have no chance of actually turning into real cash has even less of a chance of turning into real cash. So, we often in the startup world make those decisions where we try to just get it done in what we hope is the best way possible. Again, we'll regret it two or three years later, but—

Jesse: Regardless of the way you set it up the first time, we will regret it two or three years later.

Pete: It's so true. Even if you say, “I'm going to set this up in the best way possible,” things change, and scale breaks everything eventually. So, in a couple of years, you're just going to be doing things in a different way—for better or worse—than you were. And it's kind of all for naught, in many cases.

Jesse: One of my favorites that I see is application logs that are pushed into CloudWatch because you want to be able to see all of your logs or all of your metrics in CloudWatch. But then those same logs and metrics are also being sent off to Kinesis for analysis, to Splunk for analysis, to Datadog, or insert other third-party vendor here. So effectively, all you're doing is using CloudWatch as a queue on the way to somewhere else. And CloudWatch isn't cheap. CloudWatch Logs are expensive.

Pete: Exactly. This is one of my most frustrating painful-to-see, dare I say anti-pattern of Amazon usage is, partly Amazon to blame on this one because they do make it so easy to get your logs into CloudWatch. It's a default option. If you turn on flow logs, you can have your flow logs go to CloudWatch. God forbid you do that, because your bill will be horrific in short order. But a lot of those services also have the ability to push to S3, as well. So, highly recommend, unless you're using CloudWatch for log analysis, push your logs to S3. In a previous episode, we talked about the data bagel, right, Jesse?

Jesse: Oh, the data bagel. My favorite.

Pete: Push all your data into the singular location—S3. It is very cheap, in many cases, free to do so, and avoid all of this kind of data duplication by sending it to a bunch of different places.
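Pete's suggestion—send flow logs straight to S3 instead of CloudWatch—can be sketched with boto3. This is a minimal sketch: the VPC ID and bucket ARN are hypothetical placeholders, and the actual API call is left commented out because it needs AWS credentials.

```python
# Sketch: sending VPC Flow Logs straight to S3 instead of CloudWatch Logs.
# The VPC ID and bucket ARN below are placeholders -- substitute your own.
flow_log_params = {
    "ResourceIds": ["vpc-0abc1234"],  # hypothetical VPC ID
    "ResourceType": "VPC",
    "TrafficType": "ALL",
    "LogDestinationType": "s3",  # the key choice: s3, not cloud-watch-logs
    "LogDestination": "arn:aws:s3:::example-flow-log-bucket",
}

# With credentials configured, applying it would look like:
#   import boto3
#   boto3.client("ec2").create_flow_logs(**flow_log_params)
```

The important line is `LogDestinationType`: the default console flow sends logs to CloudWatch Logs, and this is where you opt into the much cheaper S3 destination.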

Jesse: And I think it's important to note that this can happen with any product initiative. It's not just the old stuff that you spun up back in the day, and you go back to look at that one line of infrastructure code or that one m1 instance, and you think to yourself, “Oh, no, what idiot spun this up? I can't believe we still have this m1 instance running. Who did this?” And if you go look at the tags—which of course you put tags on this thing—you find out, you did it.

Pete: It's me.

Jesse: Whoops.

Pete: Just at me next time, Jesse.

Jesse: Yeah. So, it is important to think about this, not just for the old infrastructure, but also for the new infrastructure that you're going to be building. Consider the cost and usage impacts before you start building. This kind of overlaps, again, with a concept from a previous episode where we talked about context is king. When you look at your application architecture and your infrastructure diagrams, think about all of the components that you're actually going to need to run your workloads, whether that is the actual compute resources, whether that is the databases, whether that is the logging structures. All of these components are important things to think about before you deploy.

Pete: I feel like this is the astronaut meme, where there's the astronaut with a gun holding on to the other astronaut. He's just looking at Earth going, “It's all context, isn't it?”

Jesse: [laugh].

Pete: Always has been. 

Jesse: Yeah.

Pete: True. It's true, though, right? I think that's a really great point. Some of the most mature organizations that we work with actually bring us in to review architecture planning documents as they're building services, to better understand in advance what the cost impact of those new product initiatives would be. They might have a thing—and this speaks a lot to lift and shift, which I know we've talked about many times in the past—where you've lifted and shifted your workloads over and now you're trying to improve upon them.

Part of that improvement should actually reduce your costs. Right? Now, not always. Sometimes you're just having a better user experience, and less downtime, and less PagerDuty paging you, but if you can also do all of those things, plus save some cash, that money could be invested in other interesting projects.

Jesse: And it's also worth thinking about standardizing some of these procedures, documenting them, maybe creating grassroots efforts or communities of practice around these procedures, ideas, and norms. Because if you are running into these issues, it's likely that somebody else in the company is running into them, too, especially if you work at a large enterprise. And now you have the opportunity to bring multiple minds together to brainstorm, build these best practices together, build off of one another, and really help each other move forward.

Corey: This episode is sponsored in part by CircleCI. CircleCI is the leading platform for software innovation at scale. With intelligent automation and delivery tools, more than 25,000 engineering organizations worldwide—including most of the ones that you’ve heard of—are using CircleCI to radically reduce the time from idea to execution to—if you were Google—deprecating the entire product. Check out CircleCI and stop trying to build these things yourself from scratch, when people are solving this problem better than you are internally. I promise. To learn more, visit circleci.com.

Pete: If you're not sure how much something is going to cost, either, it's a great opportunity to run some proof of concept workloads to determine it. 

Jesse: Absolutely.

Pete: We were working with one client who was going to be doing a large batch processing job, and they wanted it to be as cheap as possible. Well, obviously, you want to use spot instances for things like that—things that can handle interruption—but even then, they really had no way of gauging the cost. What was the cost to the business? So, the business has this thing they want to do. They want to do this large batch processing, and they're saying, “Well, what is it going to cost us? We want to invest a certain amount of money in doing this, but if it's too much, maybe we don't want to do this.”

And so what this client did was run a series of these batch processes, continuously optimizing the code, until it was as optimized as they believed they could make it in the time available. Then they moved on to the infrastructure side—right-sizing instances, extending spot usage, all of the things that can give them the closest possible estimate for a defined processing window. And then they can use that to forecast: this is the approximate cost, plus or minus some flexible difference in spend, and they can have confidence that their executive leadership is making the right decision. So, Jesse, what are some helpful tips we've seen that folks can actually go out and use right now to improve some of the smellier bits of their infrastructure?
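The forecasting approach described above boils down to simple arithmetic: measure one proof-of-concept run, then extrapolate with a margin for spot-price drift. A sketch, with every number invented purely for illustration:

```python
# Rough cost forecast from a proof-of-concept batch run.
# All numbers here are invented for illustration.
poc_runtime_hours = 6.0     # measured runtime of one PoC batch
instances_used = 20         # spot instances in the PoC fleet
spot_price_per_hour = 0.12  # observed average spot price (hypothetical)

# Cost of a single batch run, as measured in the PoC.
poc_cost = poc_runtime_hours * instances_used * spot_price_per_hour

runs_per_month = 30  # say the business wants a nightly batch
margin = 0.25        # +/- 25% to cover spot-price and runtime drift

monthly_estimate = poc_cost * runs_per_month
low = monthly_estimate * (1 - margin)
high = monthly_estimate * (1 + margin)
print(f"~${monthly_estimate:.2f}/month (range ${low:.2f}-${high:.2f})")
```

The margin is the “flexible difference in spend” Pete mentions: spot prices and runtimes move, so leadership gets a range rather than a single false-precision number.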

Jesse: There's so many. To start off, think about log data retention and snapshot lifecycles. Think about how long you actually need to keep your log data, your database snapshots, and your EBS volume snapshots. And again, this may be a conversation with legal or with IT to understand those requirements—do we need to keep this data for some period of time for legal purposes?—and then build your snapshots accordingly.

I remember there was one client we worked with who had really large—I think it was CloudWatch spend, or really large VPC spend, I forget which. And when I dug in a little bit further, I realized they had VPC flow logs enabled for one of their VPCs, which you should absolutely do if you want to investigate the data flowing through that VPC for a period of time. But, one, they never turned it off. And, two, they never set a data lifecycle policy. So, that data just kept going up, and up, and up, growing larger and larger on their AWS bill at the same time.

So, a really quick way to make sure that you're not just storing data unnecessarily: look at those lifecycle policies, and see if you actually need to retain all this data as long as you've got it. If not, you can start getting rid of it earlier. Lifecycle policies are fantastic because you can set them and forget them.
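Jesse's set-it-and-forget-it tip can be sketched as an S3 lifecycle configuration via boto3. The bucket name, prefix, and retention windows below are hypothetical; your legal and compliance requirements should drive the real numbers.

```python
# Sketch: an S3 lifecycle policy that transitions log data to cheaper
# storage and eventually expires it. Prefix and retention windows are
# hypothetical -- confirm with legal/compliance before deleting anything.
lifecycle_config = {
    "Rules": [
        {
            "ID": "expire-old-logs",
            "Filter": {"Prefix": "logs/"},   # only applies to log objects
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},     # delete after one year
        }
    ]
}

# With credentials configured, applying it would look like:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-log-bucket",
#       LifecycleConfiguration=lifecycle_config)
```

Once applied, S3 enforces the transitions and deletion automatically—exactly the set-and-forget behavior described above.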

Pete: Exactly. The Amazon services allow you to set those lifecycle policies; you don't really have to think about it. But yeah, Jesse, to your point, be sure you talk with some additional folks before you start deleting data, as you could violate some of your SOC 2-related compliance needs if you are not retaining the right amount of data. Something else that I think is missed a lot with the clients we speak with is Compute Optimizer.

Compute Optimizer sits within EC2 and analyzes your CPU usage. Now, you'll need the CloudWatch agent installed to get memory recommendations, so its value might be a little limited if you're memory-heavy but light on CPU—I guess in that case, you probably want t-class instances—but it now includes EBS recommendations as well; that's a recent addition. And EBS is probably one of the places with the greatest cost gains for a business. I mean, if you're running a lot of EBS, odds are you're running a lot of gp2, but guess what?

The workload that you're running is probably better off on a different volume. Even Amazon says gp2 is general purpose: it's where you start, so you can understand and analyze the workload, then move it to a more appropriate volume—sc1, st1. These are significantly cheaper volumes that can still meet your I/O needs. And it's awesome that Compute Optimizer now includes those EBS recommendations. The beauty of those recommendations? These volumes can be modified on the fly. You go right into the UI: click box, save money.

Jesse: It's amazing.

Pete: And again, any of those times that there's a click box, save money, that is just a great feeling. And the fact that Amazon can analyze for you and make these recommendations, now you have this confidence. And guess what, if you screw up and you accidentally move something to a sc1 volume, and it's not performing as well, you can change it again. I think they're modifiable up to once every six hours. So, definitely check that out. I think that is a big win for a lot of folks.
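The click-box-save-money change Pete describes can also be made programmatically. A minimal sketch of switching a volume type with boto3's `modify_volume`, using a hypothetical volume ID; the call itself is commented out since it needs credentials:

```python
# Sketch: changing an EBS volume type in place, following a Compute
# Optimizer EBS recommendation. The volume ID is a placeholder.
# Note: a volume can only be modified roughly once every six hours,
# so if sc1/st1 underperforms, you can change it back after that window.
modify_params = {
    "VolumeId": "vol-0abc1234",  # hypothetical volume
    "VolumeType": "st1",         # e.g. gp2 -> st1 for sequential workloads
}

# With credentials configured:
#   import boto3
#   boto3.client("ec2").modify_volume(**modify_params)
```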

Jesse: I think it's also worth noting, you talked about EBS tiers, it's also worth noting that S3 has tiers as well. And we've talked about this in previous episodes, but stop using S3 standard storage for all of your data storage. There are definitely use cases for S3 standard storage, but there are also lots of use cases for the other S3 storage tiers as well, especially infrequent access, or possibly intelligent tiering. There's great use cases for the archive tiers that were released recently, and then also for Glacier as well for maybe some of that data retention that we talked about earlier. So, move that data to the appropriate tier so that you're not spending as much money on it as if it all lives in the standard tier. The standard storage tier is great; it's essentially the general-purpose tier of S3, but there are other tiers that you can leverage as well.

Pete: Yeah, that's a great point. I love intelligent tiering. For most workloads we see, it should essentially be the default storage class, because it's rare that an Amazon service can passively and automatically save money for your application. And that's what intelligent tiering will do. Now granted, if you are storing lots of small files, the monitoring costs of intelligent tiering could actually be prohibitive, so keep that in mind.

But it gives you the ability to set those archive tiers, Jesse, like you said before, and you can configure them based on specific timing. So, maybe you do eventually want things to go to Glacier, but after six months instead of 120 days. You can modify and adjust that, and it will make the tiering decisions for you and move things into different places as needed. It is definitely a game-changer service that more folks should be using. And it just happens, right? It just happens in the background, which is fantastic.
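An Intelligent-Tiering archive configuration along the lines Pete describes—moving untouched objects into the archive tier after six months—might look like this with boto3. The bucket and configuration names are placeholders, and the API call is commented out since it requires credentials.

```python
# Sketch: an S3 Intelligent-Tiering configuration that archives objects
# not accessed for six months (180 days). Names are hypothetical.
tiering_config = {
    "Id": "archive-after-six-months",
    "Status": "Enabled",
    "Tierings": [
        # Objects untouched for 180 days move to the archive access tier.
        {"Days": 180, "AccessTier": "ARCHIVE_ACCESS"},
    ],
}

# With credentials configured, applying it would look like:
#   import boto3
#   boto3.client("s3").put_bucket_intelligent_tiering_configuration(
#       Bucket="example-bucket",
#       Id=tiering_config["Id"],
#       IntelligentTieringConfiguration=tiering_config)
```

After this is in place, the background movement Pete mentions is exactly what happens: S3 tracks access patterns and shifts objects between tiers without any further action on your part.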

All right. Well, hopefully, those tips are helpful to you. As a reminder, you can always go to lastweekinaws.com/QA if you have questions, or maybe there's a service where you have a lot of spend that you're just not sure how to improve. We'd love to read those questions and take a shot at answering them.

But if you did enjoy this podcast, please go to lastweekinaws.com/review. Give it a five-star rating on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review and give it a five-star rating but then also tell Corey that you want to see him back on Fridays From the Field. Maybe we could have him as a special guest, Jesse. What do you think?

Jesse: Oh, that would be fun.

Pete: You know, he can come and visit his former podcast.

Jesse: We can show him all that we've built from his empire.

Pete: [laugh]. “Look. Look at what we've built for you, Corey.”
Jesse: “Look, we've built a data bagel, just for you.”

Pete: Enjoy your bagel. Thanks, everyone. Buh-bye.

Announcer: This has been a HumblePod production. Stay humble.