Creating “Quinntainers” with Casey Lee

Episode Summary

Casey Lee is the CTO at Gaggle, which actually is saving lives. While loads of companies make the claim, in this case it rings rather true. Gaggle sells software to school districts that use the software to help protect their students by looking for indicators of bullying, self-harm, and a litany of other challenges facing young people in today’s world. Casey expands on the 6 million and growing students that they and their software are working to protect. Corey and Casey also share the story of their serendipitous encounter at re:Invent. From helping Gaggle save on their AWS bills, they then dive into Casey’s area of expertise: containers. And at the end of it all, they land on “quinntainers” and the 18th way to run containers on AWS.

Episode Show Notes & Transcript

About Casey

Casey spends his days leveraging AWS to help organizations improve the speed at which they deliver software. With a background in software development, he has spent the past 20 years architecting, building, and supporting software systems for organizations ranging from startups to Fortune 500 enterprises.

Links Referenced:

“17 Ways to Run Containers in AWS”: https://www.lastweekinaws.com/blog/the-17-ways-to-run-containers-on-aws/
“17 More Ways to Run Containers on AWS”: https://www.lastweekinaws.com/blog/17-more-ways-to-run-containers-on-aws/
kubernetestheeasyway.com: https://kubernetestheeasyway.com
snark.cloud/quinntainers: https://snark.cloud/quinntainers
ECS Chargeback: https://github.com/gaggle-net/ecs-chargeback
twitter.com/nektos: https://twitter.com/nektos

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is sponsored by our friends at Revelo. Revelo is the Spanish word of the day, and it’s spelled R-E-V-E-L-O. It means “I reveal.” Now, have you tried to hire an engineer lately? I assure you it is significantly harder than it sounds. One of the things that Revelo has recognized is something I’ve been talking about for a while, specifically that while talent is evenly distributed, opportunity is absolutely not. They’re exposing a new talent pool to, basically, those of us without a presence in Latin America via their platform. It’s the largest tech talent marketplace in Latin America with over a million engineers in their network, which includes—but isn’t limited to—talent in Mexico, Costa Rica, Brazil, and Argentina. Now, not only do they wind up spreading all of their talent on English ability, as well as, you know, their engineering skills, but they go significantly beyond that. Some of the folks on their platform are hands down the most talented engineers that I’ve ever spoken to. Let’s also not forget that Latin America has high time zone overlap with what we have here in the United States, so you can hire full-time remote engineers who share most of the workday as your team. It’s an end-to-end talent service, so you can find and hire engineers in Central and South America without having to worry about, frankly, the colossal pain of cross-border payroll and benefits and compliance because Revelo handles all of it. If you’re hiring engineers, check out revelo.io/screaming to get 20% off your first three months. That’s R-E-V-E-L-O dot I-O slash screaming.

Corey: Couchbase Capella Database-as-a-Service is flexible, full-featured and fully managed with built in access via key-value, SQL, and full-text search. Flexible JSON documents aligned to your applications and workloads. Build faster with blazing fast in-memory performance and automated replication and scaling while reducing cost. Capella has the best price performance of any fully managed document database. Visit couchbase.com/screaminginthecloud to try Capella today for free and be up and running in three minutes with no credit card required. Couchbase Capella: make your data sing.

Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. My guest today is someone that I had the pleasure of meeting at re:Invent last year, but we’ll get to that story in a minute. Casey Lee is the CTO with a company called Gaggle, which is—as they frame it—saving lives. Now, that seems to be a relatively common position that an awful lot of different tech companies take. “We’re saving lives here.” It’s, “You show banner ads and some of them are attack platforms for JavaScript malware. Let’s be serious here.” Casey, thank you for joining me, and what makes the statement that Gaggle saves lives not patently ridiculous?

Casey: Sure. Thanks, Corey. Thanks for having me on the show. So Gaggle, we’re an ed-tech company. We sell software to school districts, and school districts use our software to help protect their students while the students use the school-issued Google or Microsoft accounts.

So, we’re looking for signs of bullying, harassment, self-harm, and potentially suicide from K-12 students while they’re using these platforms. They will take the thoughts, concerns, emotions they’re struggling with and write them in their school-issued accounts. We detect that and then we notify the school districts, and they get the students the help they need before they can do any permanent damage to themselves. We protect about 6 million students throughout the US. We ingest a lot of content.

Last school year, over 6 billion files, about an equal number of emails ingested. We’re looking for concerning content and then we have humans review the stuff that our machine learning algorithms detect and flag. About 40 million items had to go in front of humans last year, which resulted in about 20,000 of what we call PSSes. These are Possible Student Situations where students are talking about harming themselves or harming others. And that resulted in what we like to track as lives saved. 1400 incidents last school year where a student was dealing with suicide ideation, they were planning to take their own lives. We detect that and get them help within minutes before they can act on that. That’s what Gaggle has been doing. We’re using tech, solving tech problems, and also saving lives as we do it.

Corey: It’s easy to lob a criticism at some of the things you’re alluding to, the idea of oh, you’re using machine learning on student data for young kids, yadda, yadda, yadda. Look at the outcome, look at the privacy controls you have in place, and look at the outcomes you’re driving to. Now, I don’t necessarily trust the number of school administrations not to become heavy-handed and overbearing with it, but let’s be clear, that’s not the intent. That is not what the success stories you have alluded to are about. I’ve got to say I’m a fan, so thanks for doing what you’re doing. I don’t say that very often to people who work in tech companies.

Casey: Cool. Thanks, Corey.

Corey: But let’s rewind a bit because you and I had passed like ships in the night on Twitter for a while, but last year at re:Invent something odd happened. First, my business partner procrastinated at getting his ticket—that’s not the odd part; he does that a lot—but then suddenly ticket sales slammed shut and none were to be had anywhere. You reached out with a, “Hey, I have a spare ticket because someone can’t go. Let me get it to you.” And I said, “Terrific. Let me pay you for the ticket and take you to dinner.”

You said, “Yes on the dinner, but I’d rather you just look at my AWS bill and don’t worry about the cost of the ticket.” “All right,” said I. I know a deal when I see one. We grabbed dinner at the Venetian. I said, “Bust out your laptop.” And you said, “Oh, I was kidding.” And I said, “Great. I wasn’t. Bust it out.”

And you went from laughing to taking notes in about the usual time that happens when I start looking at these things. But how was your recollection of that? I always tend to romanticize some of these things. Like, “And then everyone in the restaurant just turned, stopped, and clapped the entire time.” Maybe that part didn’t happen.

Casey: Everything was right up until the clapping part. That was a really cool experience. I appreciate you walking through that with me. Yeah, we’ve got lots of opportunity to save on our AWS bill here at Gaggle, and in that little bit of time that we had together, I think I walked away with no more than a dozen ideas for where to shave some costs. The most obvious one, the first thing that you keyed in on, is we had RIs coming due that weren’t really well-optimized and you steered me towards savings plans. We put that in place and we’re able to apply those savings plans not just to our EC2 instances but also to our serverless spend as well.

So, that was a very worthwhile and cost-effective dinner for us. The thing that was most surprising though, Corey, was your approach. Your approach to how to review our bill was not what I thought at all.

Corey: Well, what did you expect my approach was going to be? Because this always is of interest to me. Like, do you expect me to, like, whip a portable machine learning rig out of my backpack full of GPUs or something?

Casey: I didn’t know if you had, like, some secret tool you were going to hit, or if nothing else, I thought you were going to go for the Cost Explorer. I spend a lot of time in Cost Explorer, that’s my go-to tool, and you wanted nothing to do with Cost Exp—I think I was actually pulling up Cost Explorer for you and you said, “I’m not interested. Take me to the bills.” So, we went right to the billing dashboard, you started opening up the invoices, and I thought to myself, “I don’t remember the last time I looked at an AWS invoice.” I just, it’s noise; it’s not something that I pay attention to.

And I learned something, that you get a real quick view of both the cost and the usage. And that’s what you were keyed in on, right? And you were looking at things relative to each other. “Okay, I have no idea about Gaggle or what they do, but normally, for a company that’s spending x amount of dollars in EC2, why is your data transfer cost the way it is? Is that high or low?” So, you’re looking for kind of relative numbers, but it was really cool watching you slice and dice that bill through the dashboard there.

Corey: There are a few things I tie together there. Part of it is that this is sort of a surprising thing that people don’t think about but start with big numbers first, rather than going alphabetically because I don’t really care about your $6 Alexa for Business spend. I care a bit more about the $6 million, or whatever it happens to be at EC2—I’m pulling numbers completely out of the ether, let’s be clear; I don’t recall what the exact magnitude of your bill is and it’s not relevant to the conversation.

And then you see that and it’s like, “Huh. Okay, you’re spending $6 million on EC2. Why are you spending 400 bucks on S3? Seems to me that those two should be a little closer aligned. What’s the deal here? Oh, God, you’re using eight petabytes of EBS volumes. Oh, dear.”

And just, it tends to lead to interesting stuff. Break it down by region, service, and use case—or usage type, rather—is what shows up on those exploded bills, and that’s where I tend to start. It also is one of the easiest things to wind up having someone throw into a PDF and email my way if I’m not doing it in a restaurant with, you know, people clapping standing around.
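For readers who want to try the "big numbers first" approach Corey describes, here is a minimal sketch in Python. It assumes you have already exported bill line items as (service, region, usage type, cost) tuples; the items and dollar figures below are invented for illustration, much like the numbers in the conversation, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical line items: (service, region, usage type, cost in USD).
# None of these figures come from a real bill.
LINE_ITEMS = [
    ("Alexa for Business", "us-east-1", "Devices", 6.00),
    ("EC2", "us-east-1", "BoxUsage:m5.large", 4200.00),
    ("EC2", "us-east-1", "DataTransfer-Out-Bytes", 1800.00),
    ("EC2", "us-west-2", "EBS:VolumeUsage.gp2", 950.00),
    ("S3", "us-east-1", "TimedStorage-ByteHrs", 400.00),
]


def rank_by_spend(items, key_index):
    """Sum cost per key (0=service, 1=region, 2=usage type), largest first."""
    totals = defaultdict(float)
    for item in items:
        totals[item[key_index]] += item[3]
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


def share_of_total(items, service):
    """Fraction of the whole bill attributable to one service."""
    total = sum(item[3] for item in items)
    return sum(item[3] for item in items if item[0] == service) / total


if __name__ == "__main__":
    # Read the bill from the biggest number down, not alphabetically.
    for service, cost in rank_by_spend(LINE_ITEMS, 0):
        print(f"{service:<20} ${cost:>10,.2f}")
```

Ranking the same items with `key_index` 1 or 2 gives the region and usage-type breakdowns mentioned above, and `share_of_total` is one simple way to ask whether two services look out of proportion to each other.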

Casey: [laugh]. Right.

Corey: I also want to highlight that you’ve been using AWS for a long time. You’re a Container Hero; you are not bad at understanding the nuances and depths of AWS, so I take praise from you around this stuff as valuing it very highly. This stuff is not intuitive, it is deeply nuanced, and you have a business outcome you are working towards that invariably is not oriented day in day out around, “How do I get these services for less money than I’m currently paying?” But that is how I see the world and I tend to live in a very different space just based on the nature of what I do. It’s sort of a case study in the advantage of specialization. But I know remarkably little about containers, which is how we wound up reconnecting about a week or so before we did this recording.

Casey: Yeah. I saw your tweet; you were trying to run some workload—container workload—and I could hear the frustration on the other end of Twitter when you were shaking your fist at—

Corey: I should not tweet angrily, and I did in this case. And, eh, every time I do I regret it. But it played well with the people, so that does help. I believe my exact comment was, “‘me: I’ve got this container. Run it, please.’ ‘Google Cloud: Run. You got it, boss.’ AWS has 17 ways to run containers and they all suck.”

And that’s painting with an overly broad brush, let’s be clear, but that was at the tail end of two or three days of work trying to solve a very specific, very common, business problem, that I was just beating my head off of a wall again and again and again. And it took less than half an hour from start to finish with Google Cloud Run and I didn’t have to think about it anymore. And it’s one of those moments where you look at this and realize that the future is here, we just don’t see it in certain ways. And you took exception to this. So please, let’s dive in because 280 characters of text after half a bottle of wine is not the best context to have a nuanced discussion that leaves friendships intact the following morning.

Casey: Nice. Well, I just want to make sure I understand the use case first because I was trying to read between the lines on what you needed, but let me take a guess. My guess is you got your source code in GitHub, you have a Dockerfile, and you want to be able to take that repo from GitHub and just have it continuously deployed somewhere in Run. And you don’t want to have headaches with it; you just want to push more changes up to GitHub, Docker Build runs and updates some service somewhere. Am I right so far?

Corey: Ish, but think a little further up the stack. It was in service of this show. So, this show, as people who are listening to this are probably aware by this point, periodically has sponsors, which we love: We thank them for participating in the ongoing support of this show, which empowers conversations like this. Sometimes a sponsor will come to us with, “Oh, and here’s the URL we want to give people.” And it’s, “First, you misspelled your company name from the common English word; there are three sublevels within the domain, and then you have a complex UTM tagging tracking co—yeah, you realize people are driving to work when they’re listening to this?”

So, I’ve built a while back a link shortener, snark.cloud, because is it the shortest thing in the world? Not really, but it’s easily understandable when I say that, and people hear it for what it is. And that’s been running for a long time as an S3 bucket full of redirects, behind CloudFront. So, I wind up adding a zero-byte object with a redirect parameter on it, and it just works.
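As a rough illustration of the "zero-byte object with a redirect parameter" idea, the Python sketch below builds the parameters for an S3 `put_object` call using the `WebsiteRedirectLocation` setting. It assumes the bucket is served through S3 static website hosting, which is where that redirect setting takes effect; the bucket name, slug, and helper names are hypothetical, and this is not necessarily how snark.cloud was actually wired up.

```python
def redirect_object_params(bucket, slug, target_url):
    """Build put_object parameters for an empty object that redirects to target_url."""
    return {
        "Bucket": bucket,
        "Key": slug.lstrip("/"),                 # object key is the short slug
        "Body": b"",                             # zero-byte body; only the metadata matters
        "WebsiteRedirectLocation": target_url,   # sent as x-amz-website-redirect-location
    }


def publish_redirect(bucket, slug, target_url):
    """Upload the redirect object. Needs AWS credentials and the boto3 package."""
    import boto3  # third-party; imported lazily so the helper above runs without it

    s3 = boto3.client("s3")
    s3.put_object(**redirect_object_params(bucket, slug, target_url))


if __name__ == "__main__":
    # Hypothetical values, printed rather than uploaded.
    print(redirect_object_params("example-redirect-bucket", "/quinntainers",
                                 "https://example.com/some/long/path"))
```

Separating the parameter builder from the upload keeps the part that can be checked locally apart from the part that talks to AWS.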

Now, the challenge that I have here as a business is that I am increasingly prolific these days. So, anything that I am not directly required to be doing, I probably shouldn’t necessarily be the one to do it. And care and feeding of those redirect links is a prime example of this. So, I went hunting, and the things that I was looking for were, obviously, do the redirect. Now, if you pull up GitHub, there are hundreds of solutions here.

There are AWS blog posts. One that I really liked and almost got working was Eric Johnson’s three-part blog post on how to do it serverlessly, with API Gateway, and DynamoDB, no Lambdas required. I really liked aspects of what that was, but it was complex, I kept smacking into weird challenges as I went, and front end is just baffling to me. Because I needed a front end app for people to be able to use here; I need to be able to secure that because it turns out that if you just have a, anyone who stumbles across the URL can redirect things to other places, well, you’ve just empowered a whole bunch of spam email, and you’re going to find that service abused, and everyone starts blocking it, and then you have trouble. Nothing lasts the first encounter with jerks.

And I was getting more and more frustrated, and then I found something by a Twitter engineer on GitHub, with a few creative search terms, who used to work at Google Cloud. And what it uses as a client is it doesn’t build any kind of custom web app. Instead, as a database, it uses not S3 objects, not Route 53—the ideal database—but a Google sheet, which sounds ridiculous, but every business user here knows how to use that.

Casey: Sure.

Corey: And it looks for the two columns. The first one is the slug after the snark.cloud, and the second is the long URL. And it has a TTL of five seconds on cache, so make a change to that spreadsheet, five seconds later, it’s live. Everyone gets it, I don’t have to build anything new, I just put it somewhere the relevant people can access it, I gave them a tutorial and a giant warning on it, and everyone gets that. And it just works well. It was, “Click here to deploy. Follow the steps.”
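The two-column sheet plus a five-second cache can be sketched roughly as follows in Python. This is not the code of the tool Corey used; `fetch_rows` is a hypothetical stand-in for whatever reads the (slug, long URL) rows from the spreadsheet, and the class name is invented for the example.

```python
import time


class SheetBackedLinks:
    """Slug -> long URL lookups backed by a two-column sheet, cached for a short TTL."""

    def __init__(self, fetch_rows, ttl_seconds=5.0, clock=time.monotonic):
        self._fetch_rows = fetch_rows    # callable returning [(slug, long_url), ...]
        self._ttl = ttl_seconds
        self._clock = clock              # injectable so the TTL can be tested
        self._cache = {}
        self._expires_at = None          # None means nothing has been loaded yet

    def lookup(self, slug):
        """Return the long URL for slug, reloading the sheet once the TTL has passed."""
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._cache = dict(self._fetch_rows())
            self._expires_at = now + self._ttl
        return self._cache.get(slug)


if __name__ == "__main__":
    rows = [("quinntainers", "https://example.com/quinntainers")]  # hypothetical row
    links = SheetBackedLinks(lambda: rows)
    print(links.lookup("quinntainers"))
```

With this shape, an edit to the sheet shows up on the first lookup after the cache expires, which matches the "five seconds later, it’s live" behavior described above.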

And the documentation was a little, eh, okay, I had to undo it once and redo it again. Getting the domain registered was getting—ported over took a bit of time, and there were some weird SSL errors as the certificates were set up, but once all of that was done, it just worked. And I tested the heck out of it, and cold starts are relatively low, and the entire thing fits within the free tier. And it is reminiscent of the magic that I first saw when I started working with some of the cloud providers’ services, years ago. It’s been a long time since I had that level of delight with something, especially after three days of frustration. It’s one of those, “This is a great service. Why are people not shouting about this from the rooftops?” That was my perspective. And I put it out on Twitter and oh, Lord, did I get comments. What was your take on it?

Casey: Well, so my take was, when you’re evaluating a platform to use for running your applications, how fast it can get you to Hello World is not necessarily the best way to go. I just assumed you’re wrong. I assumed of the 17 ways AWS has to run containers, Corey just doesn’t understand. And so I went after it. And I said, “Okay, let me see if I can find a way that solves his use case, as I understand it, through a quick tweet.”

And so I tried App Runner; I saw that App Runner does not meet your needs because you have to somehow get your Docker image pushed up to a repo. App Runner can take an image that’s already been pushed up and deploy it for you or it can build from source but neither of those were the way I understood your use case.

Corey: Having used App Runner before via the Copilot CLI, it is the closest as best I can tell to achieving what I want. But also let’s be clear that I don’t believe there’s a free tier; there needs to be a load balancer in front of it, so you’re starting with 15 bucks a month for this thing. Which is not the end of the world. Had I known at the beginning that all of this was going to be there, I would have just signed up for a bit.ly account and called it good. But here we are.

Casey: Yeah. I tried Copilot. Copilot is a great developer experience, but it also is just pulling together tons of—I mean just trying to do a Copilot service deploy, VPCs are being created and tons of IAM roles are being created, code pipelines, there’s just so much going on. I was like 20 minutes into it, and I said, “Yeah, this is not fitting the bill for what Corey was looking for.” Plus, it doesn’t solve the way I understood your use case, which is you don’t want to worry about builds, you just want to push code and have new Docker images get built for you.

Corey: Well, honestly, let’s be clear here, once it’s up and running, I don’t want to ever have to touch the silly thing again.

Casey: Right.

Corey: And that so far has been the case, after I forked the repo and made a couple of changes to it that I wanted to see. One of them was to render the entire thing case insensitive because I get that one wrong a lot, and the other is I wanted to change the permanent 301 redirect to a temporary 302 redirect because occasionally, sponsors will want to change where it goes in the fullness of time. And that is just fine, but I want to be able to support that and not have to deal with old cached data. So, getting that up and running was a bit of a challenge. But the way that it worked was following the instructions in the GitHub repo.

The developer environment that spun up in Google’s Cloud Shell was just spectacular. It prompted me for a few things and it told me step by step what to do. This is the sort of thing I could have given a basically non-technical user, and they would have had success with it.
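The two changes Corey mentions making to his fork, case-insensitive slugs and a temporary 302 instead of a permanent 301, might look something like this minimal Python sketch. It is an illustration only, not the fork’s actual code; `build_redirect` and the sample links are hypothetical.

```python
from http import HTTPStatus


def build_redirect(path, links):
    """Return (status, headers) for a request path, given a slug -> URL mapping."""
    # Lowercase both the request and the stored slugs so /Quinntainers and
    # /quinntainers resolve to the same entry.
    normalized = {slug.lower(): url for slug, url in links.items()}
    slug = path.strip("/").lower()
    target = normalized.get(slug)
    if target is None:
        return HTTPStatus.NOT_FOUND, {}
    # 302 Found is temporary, so clients should not cache the destination the
    # way they may with 301 Moved Permanently. That keeps later edits effective.
    return HTTPStatus.FOUND, {"Location": target}


if __name__ == "__main__":
    sample = {"Quinntainers": "https://example.com/quinntainers"}  # hypothetical entry
    print(build_redirect("/QUINNTAINERS", sample))
```

The choice of 302 over 301 is the piece that addresses the "old cached data" concern: a permanent redirect invites browsers to remember the old destination after a sponsor changes it.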

Casey: So, I tried it as well. I said, “Well, okay, if I’m going to respond to Corey here and challenge him on this, I need to try Cloud Run.” I had no experience with Cloud Run. I had a small example repo that loosely mapped what I understood you were trying to do. Within five minutes, I had Cloud Run working.

And I was surprised anytime I pushed a new change, within 45 seconds the change was built and deployed. So, here’s my conclusion, Corey. Google Cloud Run is great for your use case, and AWS doesn’t have the perfect answer. But here’s my challenge to you. I think that you just proved why there’s 17 different ways to run containers on AWS, is because there’s that many different types of users that have different needs and you just happen to be number 18 that hasn’t gotten the right attention yet from AWS.

Corey: Well, let’s be clear, like, my gag about 17 ways to run containers on AWS was largely a joke, and it went around the internet three times. So, I wrote a list of them on the blog post of “17 Ways to Run Containers in AWS” and people liked it. And then a few months later, I wrote “17 More Ways to Run Containers on AWS” listing 17 additional services that all run containers.

And my favorite email that I think I’ve ever received in feedback was from a salty AWS employee, saying that one of them didn’t really count because of some esoteric reason. And it turns out that when I’m trying to make a point of you have a sarcastic number of ways to run containers, pointing out that well, one of them isn’t quite valid, doesn’t really shatter the argument, let’s be very clear here. So, I appreciate the feedback, I always do. And it’s partially snark, but there is an element of truth to it in that customers don’t want to run containers, by and large. That is what they do in service of a business goal.

And they want their application to run which is in turn to serve as the business goal that continues to abstract out into, “Remain a going concern via the current position the company stakes out.” In your case, it is saving lives; in my case, it is fixing horrifying AWS bills and making fun of Amazon at the same time, and in most other places, there are somewhat more prosaic answers to that. But containers are simply an implementation detail, to some extent—to my way of thinking—of getting to that point. An important one [unintelligible 00:18:20], let’s be clear, I was very anti-container for a long time. I wrote a talk, “Heresy in the Church of Docker” that then was accepted at ContainerCon. It’s like, “Oh, boy, I’m not going to leave here alive.”

And the honest answer is many years later, that Kubernetes solves almost all the criticisms that I had with the downside of well, first, you have to learn Kubernetes, and that continues to be mind-bogglingly complex from where I sit. There’s a reason that I’ve registered kubernetestheeasyway.com and repointed it to ECS, Amazon’s container service that is not requiring you to cosplay as a cloud provider yourself. But even ECS has a number of challenges to it, I want to be very clear here. There are no silver bullets in this.

And you’re completely correct in that I have a large, complex environment, and the application is nuanced, and I’m willing to invest a few weeks in setting up the baseline underlying infrastructure on AWS with some of these services, ideally not all of them at once because that’s something a lunatic would do, but getting them up and running. The other side of it, though, is that if I am trying to evaluate a cloud provider’s handling of containers and how this stuff works, the reason that everyone starts with a Hello World-style example is that it delivers, ideally, the mean time to dopamine. There’s a reason that Hello World doesn’t have 18 different dependencies across a bunch of different databases and message queues and all the other complicated parts of running a modern application. Because you just want to see how it works out of the gate. And if getting that baseline empty container that just returns the string ‘Hello World’ is that complicated and requires that much work, my takeaway is not that this user experience is going to get better once I make the application itself more complicated.

So, I find that off-putting. My approach has always been find something that I can get the easy, minimum viable thing up and running on, and then as I expand know that you’ll be there to catch me as my needs intensify and become ever more complex. But if I can’t get the baseline thing up and running, I’m unlikely to be super enthused about continuing to beat my head against the wall like, “Well, I’ll just make it more complex. That’ll solve the problem.” Because it often does not. That’s my position.

Casey: Yeah, I agree that dopamine hit is valuable in getting attached to want to invest into whatever tech stack you’re using. The challenge is your second part of that. Your second part is will it grow with me and scale with me and support the complex edge cases that I have? And the problem I’ve seen is a lot of organizations will start with something that’s very easy to get started with and then quickly outgrow it, and then come up with all sorts of weird Rube Goldberg-type solutions. Because they jumped all in before seeing—I’ve got kind of an example of that.

I’m happy to announce that there’s now 18 ways to run containers on AWS. Because in your use case, in the spirit of AWS customer obsession, I hear your use case, I’ve created an open-source project that I want to share called Quinntainers—

Corey: Oh, no.

Casey: —and it solves—yes. Quinntainers is live and is ready for the world. So, now we’ve got 18 ways to run containers. And if you have Corey’s use case of, “Hey, here’s my container. Run it for me,” now we’ve got one command that you can run to get things going for you. I can share a link for you and you could check it out. This is a [unintelligible 00:21:38]—

Corey: Oh, we’re putting that in the [show notes 00:21:37], for sure. In fact, if you go to snark.cloud/quinntainers, you’ll find it.

Casey: You’ll find it. There you go. The idea here was this: There is a real use case that you had, and I looked, and AWS does not have an out-of-the-box simple solution for you. I agree with that. And Google Cloud Run does.

Well, the answer would have been from AWS, “Well, then here, we need to make that solution.” And so that’s what this was, was a way to demonstrate that it is a solvable problem. AWS has all the right primitives, just that use case hadn’t been covered. So, how does Quinntainers work? Real straightforward: It’s a command-line—it’s an NPM tool.

You just run [npx 00:22:17] Quinntainer, it sets up a GitHub action role in your AWS account, it then creates a GitHub action workflow in your repo, and then uses the Quinntainer GitHub action—reusable action—that creates the image for you; every time you push to the branch, pushes it up to ECR, and then automatically pushes up that new version of the image to App Runner for you. So, now it’s using App Runner under the covers, but it’s providing that nice developer experience that you are getting out of Cloud Run. Look, is Quinntainer really the right way to go with running containers? No, I’m not making that point at all. But the point is it is a—

Corey: It might very well be.

Casey: Well, if you want to show a good Hello World experience, Quinntainer’s the best because within 30 seconds, your app is now set up to continuously deliver containers into AWS for your very specific use case. The problem is, it’s not going to grow for you. I mean, it was something I did over the weekend just for fun; it’s not something that would ever be worthy of hitching up a real production workload to. So, the point there is, you can build frameworks and tools that are very good at getting that initial dopamine hit, but then are not going to be there for you necessarily as you mature and get more complex.

Corey: And yet, I’ve tilted a couple of times at the windmill of integrating GitHub actions in anything remotely resembling a programmatic way with AWS services, as far as instance roles go. Are you using permanent credentials for this as stored secrets or are you doing the [OIDC 00:23:50] handoff?

Casey: OIDC. So, what happens is the tool creates the IAM role for you with the trust policy on GitHub’s OIDC provider, sets all that up for you in your account, locks it down so that just your repo and your main branch is able to push or is able to assume the role, the role is set up just to allow deployments to App Runner and ECR repository. And then that’s it. At that point, it’s out of your way. And you just git push, and a couple minutes later, your updates are now running in App Runner for you.
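To make the trust policy Casey describes more concrete, here is a hedged Python sketch that builds an IAM trust policy document scoped to one repository and its main branch. It follows the general shape documented for GitHub’s OIDC provider on AWS and is not taken from the Quinntainers source; the account ID, owner, and repo names are placeholders, and the function name is hypothetical.

```python
import json

# Host name of GitHub's OIDC token issuer for Actions.
GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def github_trust_policy(account_id, owner, repo, branch="main"):
    """Trust policy letting workflows on one repo and branch assume the role."""
    provider_arn = f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": provider_arn},
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Token must be intended for AWS STS...
                        f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com",
                        # ...and come from this exact repo and branch.
                        f"{GITHUB_OIDC_HOST}:sub":
                            f"repo:{owner}/{repo}:ref:refs/heads/{branch}",
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder account and repository, for illustration only.
    print(json.dumps(github_trust_policy("123456789012", "example-owner", "example-repo"),
                     indent=2))
```

The `sub` condition is what "locks it down" to a single repo and branch; the permissions the role grants for ECR and App Runner would live in a separate policy that is not shown here.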

Corey: This episode is sponsored in part by our friends at Vultr. Optimized cloud compute plans have landed at Vultr to deliver lightning fast processing power, courtesy of third gen AMD EPYC processors without the IO, or hardware limitations, of a traditional multi-tenant cloud server. Starting at just 28 bucks a month, users can deploy general purpose, CPU, memory, or storage optimized cloud instances in more than 20 locations across five continents. Without looking, I know that once again, Antarctica has gotten the short end of the stick. Launch your Vultr optimized compute instance in 60 seconds or less on your choice of included operating systems, or bring your own. It's time to ditch convoluted and unpredictable giant tech company billing practices, and say goodbye to noisy neighbors and egregious egress forever.

Vultr delivers the power of the cloud with none of the bloat. "Screaming in the Cloud" listeners can try Vultr for free today with $150 in credit when they visit getvultr.com/screaming. That's G E T V U L T R.com/screaming. My thanks to them for sponsoring this ridiculous podcast.

Corey: Don’t undersell what you’ve just built. This is something that—is this what I would use for a large-scale production deployment, obviously not, but it has streamlined and made incredibly accessible things that previously have been very complex for folks to get up and running. One of the most disturbing themes behind some of the feedback I got was, at one point I said, “Well, have you tried running a Docker container on Lambda?” Because now it supports containers as a packaging format. And I said no because I spent a few weeks getting Lambda up and running back when it first came out and I’ve basically been copying and pasting what I got working ever since the way most of us do.

And the response is, “Oh, that explains a lot.” With the implication being that I’m just a fool. Maybe, but let’s be clear, I am never the only person in the room who doesn’t know how to do something; I’m just loud about what I don’t know. And the failure mode of a bad user experience is that a customer feels dumb. And that’s not okay because this stuff is complicated, and when a user has a bad time, it’s a bug.

I learned that in 2012 from Jordan Sissel, the creator of Logstash. He has been an inspiration to me for the last ten years. And that’s something I try to live by: that if a user has a bad time, something needs to get fixed. Maybe it’s the tool itself, maybe it’s the documentation, maybe it’s the way that GitHub repo’s readme is structured in a way that just makes it accessible.

Because I am not a trailblazer in most things, nor do I intend to be. I’m not the world’s best engineer by a landslide. Just look at my code and you’d argue the fact that I’m an engineer at all. “But if it’s bad and it works, how bad is it?” is sort of the other side of it.

So, my problem is that there needs to be a couple of things. Ignore for a second the aspect of making it the right answer to get something out of the door. The fact that I want to take this container and just run it, and you and I both reach for App Runner as the default AWS service that does this because I’ve been swimming in the AWS waters a while and you’re a frickin AWS Container Hero, where it is expected that you know what most of these things do. For someone who shows up on the containers webpage—which by the way lists, I believe 15 ways to run containers on mobile and 19 ways to run containers on non-mobile, which is just fascinating in its own right—and it’s overwhelming, it’s confusing, and it’s not something that makes it abundantly clear what the golden path is. First, get it up and working, get it running, then you can add nuance and flavor and the rest, and I think that’s something that’s gotten overlooked in our mad rush to pretend that we’re all Google engineers, circa 2012.

Casey: Mmm. I think people get stressed out when they try to run containers in AWS because they think, “What is that golden path?” You said golden path. And my advice to people is there is no golden path. And the great thing about AWS is they do continue to invest in the solutions they come up with. I’m still bitter about Google Reader.

Corey: As am I.

Casey: Yeah. I spent so much time getting my perfect set of RSS feeds and then I had to find somewhere else to—with AWS, the different offerings that are available for running containers, those are there intentionally, it’s not by accident. They’re there to solve specific problems, so the trick is finding what works best for you, and don’t feel like one is better than the other or is going to get more attention than others. And they each have different use cases.

And I approach it this way. I’ve seen a couple of different people do some great flowcharts—I think Forrest did one, Vlad did one—on ways to make the decision on how to run your containers. And I break it down to three questions. I ask people first of all, where are you going to run these workloads? If someone says, “It has to be in the data center,” okay, cool, then ECS Anywhere or EKS Anywhere and we’ll figure out if Kubernetes is needed.

If they need specific requirements, so if they say, “No, we can run in the cloud, but we need privileged mode for containers,” or, “We need EBS volumes,” or, “We want really small container sizes,” like, less than a quarter-vCPU or less than half a gig of RAM—or if you have custom log requirements, Fargate is not going to work for you, so you’re going to run on EC2. Otherwise, run it on Fargate. But that’s the first question. Figure out where are you going to run your containers. That leads to the second question: What’s your control plane?

But those are different, sort of related but different questions. And I only see six options there. That’s App Runner for your control plane, Lightsail for your control plane, ROSA if you’re invested in OpenShift already, EKS either if you have momentum in Kubernetes or you have a bunch of engineers that have a bunch of experience with Kubernetes—if you don’t have either, don’t choose it—or ECS. The last option is Elastic Beanstalk, but let’s leave that as a—if you’re not currently invested in Elastic Beanstalk don’t start today. But I look at those as okay, so I—first question, where am I going to run my containers? Second question, what do I want to use for my control plane? And there’s different pros and cons of each of those.

And then the third question, how do I want to manage them? What tools do I want to use for managing deployment? All those other tools like Copilot or App2Container or Proton, those aren’t my control plane; those aren’t where I run my containers; that’s how I manage, deploy, and orchestrate all the different containers. So, I look at it as those three questions. But I don’t know, what do you think of that, Corey?
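As a rough illustration of the three questions Casey lays out, here is a minimal Python sketch. It is not an official decision tool: the `Workload` fields and the `choose_compute` function are hypothetical names invented for this example, and the Fargate limits simply mirror what is said in the conversation, so they may not match current AWS behavior.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a container workload (illustrative only)."""
    must_run_on_premises: bool = False
    needs_privileged_mode: bool = False
    needs_ebs_volumes: bool = False
    custom_log_requirements: bool = False
    vcpu: float = 1.0
    memory_gb: float = 2.0


def choose_compute(w: Workload) -> str:
    """Question 1: where are the containers going to run?"""
    if w.must_run_on_premises:
        return "ECS Anywhere or EKS Anywhere"
    # Requirements that, per the conversation, rule out Fargate.
    fargate_blockers = (
        w.needs_privileged_mode
        or w.needs_ebs_volumes
        or w.custom_log_requirements
        or w.vcpu < 0.25
        or w.memory_gb < 0.5
    )
    return "EC2" if fargate_blockers else "Fargate"


# Question 2: the six control plane options, with the notes given in the episode.
CONTROL_PLANES = {
    "App Runner": "scale to zero; good for lots of idle time",
    "Lightsail": "simple monthly pricing; console-first users",
    "ROSA": "only if already invested in OpenShift",
    "EKS": "only with Kubernetes momentum or experienced engineers",
    "ECS": "general-purpose option",
    "Elastic Beanstalk": "only if already invested in it",
}

# Question 3: tools for managing and deploying, separate from the control plane.
MANAGEMENT_TOOLS = ["Copilot", "App2Container", "Proton"]

if __name__ == "__main__":
    print(choose_compute(Workload()))
    print(choose_compute(Workload(vcpu=0.125)))
```

Keeping the three questions as separate pieces of data and logic reflects the point made above: where containers run, what the control plane is, and how deployments are managed are related but distinct decisions.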

Corey: I think you’re onto something. I think that is a terrific way of exploring that question. I would argue that setting up a framework like that—one or very similar—is what the AWS containers page should be, just coming from the perspective of what is the neophyte customer experience. On some level, you almost need a slider to choose your level of experience, ranging from, “What’s a container?” to, “I named my kid Kubernetes because I make terrible life decisions,” and anywhere in between.

Casey: Sure. Yeah, well, and I think that really dictates the control plane level. So, for example, Lightsail, where does Lightsail fit? To me, the value of Lightsail is the simplicity. I’m looking at a monthly pricing: Seven bucks a month for a container.

I don’t know how [unintelligible 00:30:23] works, but I can think in terms of monthly pricing. And it’s tailored towards a console user, someone who just wants to click in, point to an image. That’s a very specific user, there’s thousands of customers that are very happy with that experience, and they use it. App Runner presents that scale to zero. That’s one of the big selling points I see with App Runner. Likewise, with Google Cloud Run. I’ve got that scale to zero. I can’t do that with ECS, or EKS, or any of the other platforms. So, if you’ve got something that has a ton of idle time, I’d really be looking at those. I think I did the math, and I would argue that Google Cloud Run is about 30% more expensive than App Runner.

Corey: Yeah, if you disregard the free tier, I think that to have it running persistently at all times throughout the month, to drop out cold starts, would cost something like 40-some-odd bucks a month or something like that. Don’t quote me on it. Again, and to be clear, I wound up doing this very congratulatory and complimentary tweet about them on I think it was Thursday, and then they immediately apparently took one look at this and said, “Holy shit. Corey’s saying nice things about us. What do we do? What do we do?” Panic.

And the next morning, they raised prices on a bunch of cloud offerings. Whew, that’ll fix it. Like—

Casey: [laugh].

Corey: Di-, did you miss the direction you’re going on here? No, that’s the exact opposite of what you should be doing. But here we are. Interestingly enough, to tie our two conversation threads together, when I look at an AWS bill, unless you’re using Fargate, I can’t tell whether you’re using Kubernetes or not because EKS is a small charge in almost every case for the control plane, or Fargate under it.

Everything else just manifests as EC2 spend. From the perspective of the cloud provider, if you’re running a Kubernetes cluster, it is a single-tenant application that can have some very funky behaviors like cross-AZ chatter back and forth because there’s no internal mechanism to say talk to the free thing, rather than the two cents a gigabyte thing. It winds up spinning up and down in a bunch of different ways, and the behavior patterns, because of how placement works, are not necessarily deterministic, depending upon workload. And that becomes something that people find odd when, “Okay, we look at our bill for a week, what can you say?”

“Well, first question. Are you running Kubernetes at all?” And they’re like, “Who invited these clowns?” Understand, we’re not prying into your workloads for a variety of excellent legal and contractual reasons, here. We are looking at how they behave, and for specific workloads, once we have a conversation with the engineering team, yeah, we’re going to dive in, but it is not at all intuitive from the outside to make any determination whether you’re running containers, or whether you’re running VMs that you just haven’t done anything with in 20 years, or what exactly is going on. And that’s just an artifact of the billing system.
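To put a rough number on the cross-AZ chatter Corey mentions, here is a small back-of-the-envelope sketch. The two-cents-per-gigabyte rate comes from the conversation; the function name and the traffic volume in the usage example are made up for illustration and are not measurements from any real cluster.

```python
# Rate quoted in the conversation: "the two cents a gigabyte thing".
CROSS_AZ_USD_PER_GB = 0.02


def cross_az_cost(gb_per_day: float, days: int = 30,
                  rate: float = CROSS_AZ_USD_PER_GB) -> float:
    """Estimate the cost of traffic that crosses Availability Zones."""
    return gb_per_day * days * rate


if __name__ == "__main__":
    # Hypothetical cluster whose pods exchange 500 GB per day across AZs.
    print(f"{cross_az_cost(500):.2f} USD per 30 days")
```

Because none of this is labeled as Kubernetes traffic on the bill, a number like this only becomes visible once someone goes looking for it.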

Casey: We ran into this challenge at Gaggle. We don’t use EKS, we use ECS, but we have some shared clusters, lots of EC2 spend, and it’s hard to figure out which team is creating the services that are running that up. We actually ended up creating a tool—we open-sourced it—ECS Chargeback, and what it does is it looks at the CPU and memory reservations for each task definition, and then prorates the overall charge of the ECS cluster, and then creates metrics in Datadog to give us a breakdown of cost per ECS service. And it also measures what we like to refer to as waste, right? Because if you’re reserving four gigs of memory, but your utilization never goes over two gigs, we’re paying for that reservation, but you’re underutilizing.

So, we’re able to also show which services have the highest degree of waste, not just utilization, so it helps us go after it. But this is a hard problem. I’d be curious, how do you approach these shared ECS resources and slicing and dicing those bills?
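The proration and waste ideas Casey describes can be sketched in a few lines of Python. This is not the ECS Chargeback implementation: the `ServiceUsage` type, the function names, and the equal weighting of CPU and memory are assumptions chosen for illustration, and the figures in the usage example are invented.

```python
from dataclasses import dataclass


@dataclass
class ServiceUsage:
    """Hypothetical per-service figures (illustrative only)."""
    name: str
    cpu_reserved: float      # CPU units reserved by the task definition
    memory_reserved: float   # MiB reserved by the task definition
    memory_used_peak: float  # MiB, highest observed utilization


def prorate(cluster_cost: float, services: list[ServiceUsage],
            cpu_weight: float = 0.5) -> dict[str, float]:
    """Split a shared cluster's cost in proportion to reservations.

    The 50/50 weighting between CPU and memory is an assumption.
    """
    total_cpu = sum(s.cpu_reserved for s in services)
    total_mem = sum(s.memory_reserved for s in services)
    if total_cpu <= 0 or total_mem <= 0:
        raise ValueError("reservations must sum to a positive value")
    costs = {}
    for s in services:
        share = (cpu_weight * s.cpu_reserved / total_cpu
                 + (1 - cpu_weight) * s.memory_reserved / total_mem)
        costs[s.name] = cluster_cost * share
    return costs


def memory_waste(s: ServiceUsage) -> float:
    """Fraction of the memory reservation that was never used."""
    return 1 - s.memory_used_peak / s.memory_reserved


if __name__ == "__main__":
    services = [
        ServiceUsage("api", cpu_reserved=1024, memory_reserved=4096,
                     memory_used_peak=2048),
        ServiceUsage("worker", cpu_reserved=3072, memory_reserved=4096,
                     memory_used_peak=4096),
    ]
    for name, cost in prorate(1000.0, services).items():
        print(name, cost)
    print("api memory waste:", memory_waste(services[0]))
```

The `api` service in the usage example mirrors the case from the conversation: four gigs reserved, never more than two used, so half the reservation counts as waste.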

Corey: Everyone has a different approach to this. There is no unifiable, correct answer. A previous show guest, Peter Hamilton, over at Remind had done something very similar, open-sourced a bunch of these things. Understanding what your spend is is important on this, and it comes down to getting at the actual business concern because in some cases, effectively dead reckoning is enough. You take a look at the cluster that is really hard to attribute because it’s a shared service. Great. It is 5% of your bill.

First pass, why don’t we just agree that it is a third for Service A, two-thirds for Service B, and we’ll call it mostly good at that point? That can be enough in a lot of cases. With scale [laugh] you’re just sort of hand-waving over many millions of dollars a year there. How about we get into some more depth? And then you start instrumenting and reporting to something, be it CloudWatch, be it Datadog, be it something else, and understanding what the use case is.

In some cases, customers have broken apart shared clusters for that specific reason. I don’t think that’s necessarily the best approach from an engineering perspective, but again, this is not purely an engineering decision. It comes down to serving the business need. And if you’re taking up partial credits on that cluster, for a tax credit for R&D for example, you want that position to be extraordinarily defensible, and spending a few extra dollars to ensure that is the right business decision. I mean, again, we’re pure advisory; we advise customers on what we would do in their position, but people often mistake that to be we’re going to go for the lowest possible price—bad idea, or that we’re going to wind up doing this from a purely engineering-centric point of view.

It’s, be aware that in almost every case, with some very notable weird exceptions, the AWS bill costs significantly less than the payroll expense that you have of people working on the AWS environment in various ways. People are more expensive, so the idea of, well, you can save a whole bunch of engineering effort by spending a bit more on your cloud, yeah, let’s go ahead and do that.

Casey: Yeah, good point.

Corey: The real mark of someone who’s senior enough is their answer to almost any question is, “It depends.” And I feel I’ve fallen into that trap as well. Much as I’d love to sit here and say, “Oh, it’s really simple. You do X, Y, and Z.” Yeah… honestly, my answer, the simple answer, is I think that we orchestrate a cyber-bullying campaign against AWS through the AWS wishlist hashtag, we get people to harass their account managers with repeated requests for, “Hey, could you go ahead and [dip 00:36:19] that thing in—they give that a plus-one for me, whatever internal system you’re using?”

Just because this is a problem we’re seeing more and more. Given that it’s an unbounded growth problem, we’re going to see it more and more for the foreseeable future. So, I wish I had a better answer for you, but yeah, “that stuff’s super hard” is honest, but it’s also not the most useful answer for most of us.

Casey: I’d love feedback from anyone, from you or your team, on that tool that we created. I can share a link after the fact. ECS Chargeback is what we call it.

Corey: Excellent. I will follow up with you separately on that. That is always worth diving into. I’m curious to see new and exciting approaches to this. Just be aware that we have an obnoxious talent sometimes for seeing these things and, “Well, what about”—and asking about some weird corner edge case that either invalidates the entire thing, or you’re like, “Who on earth would ever have a problem like that?” And the answer is always, “The next customer.”

Casey: Yeah.

Corey: For a bounded problem space of the AWS bill. Every time I think I’ve seen it all, I just have to talk to one more customer.

Casey: Mmm. Cool.

Corey: In fact, the way that we approached your teardown in the restaurant is how we launched our first-pass approach. Because the value in something like that is different than the value of a six- to eight-week-long, deep-dive engagement into every nook and cranny. And—

Casey: Yeah, for sure. It was valuable to us.

Corey: Yeah, having someone come in to just spend a day with your team, diving into it up one side and down the other, it seems like a weird thing, like, “How much good could you possibly do in a day?” And the answer in some cases is—we had Honeycomb saying that in a couple of days of something like this, we wound up blowing 10% off their entire operating budget for the company, and it led to an increased valuation. Liz Fong-Jones says—on multiple occasions—that the company would not be what it was without our efforts on their bill, which is just incredibly gratifying to hear. It’s easy to get lost in the idea of well, it’s the AWS bill. It’s just making big companies spend a little bit less to another big company. And that’s not exactly, you know, saving the lives of K through 12 students here.

Casey: It’s opening up opportunities.

Corey: Yeah. It’s about optimizing for the win for everyone. Because now AWS gets a lot more money from Honeycomb than they would if Honeycomb had not continued on their trajectory. It’s, you can charge customers a lot right now, or you can charge them a little bit over time and grow with them in a partnership context. I’ve always opted for the second model rather than the first.

Casey: Right on.

Corey: But here we are. I want to thank you for taking so much time out of, well, several days now to argue with me on Twitter, which is always appreciated, particularly when it’s, you know, constructive—thanks for that—

Casey: Yeah.

Corey: For helping me get my business partner to re:Invent, although then he got me that horrible puzzle of 1000 pieces for the Cloud-Native Computing Foundation landscape and now I don’t ever want to see him again—so you know, that happens—and of course, spending the time to write Quinntainers, which is going to be at snark.cloud/quinntainers as soon as we’re done with this recording. Then I’m going to kick the tires and send some pull requests.

Casey: Right on. Yeah, thanks for having me. I appreciate you starting the conversation. I would just conclude with I think that yes, there are a lot of ways to run containers in AWS; don’t let it stress you out. They’re there by intention, they’re there by design. Understand them.

I would also encourage people to go a little deeper, especially if you got a significantly large workload. You got to get your hands dirty. As a matter of fact, there’s a hands-on lab that a company called Liatrio does. They call it their Night Lab; it’s a one-day free, hands-on, you run legacy monolithic job applications on Kubernetes, gives you first-hand experience on how to—gets all the way up into observability and doing things like Canary deployments. It’s a great, great lab.

But you got to do something like that to really get your hands dirty and understand how these things work. So, don’t sweat it; there’s not one right way. There’s a way that will probably work best for each user, and just take the time and understand the ways to make sure you’re applying the one that’s going to give you the most runway for your workload.

Corey: I will definitely dig into that myself. But I think you’re right, I think you have nailed a point that is, again, a nuanced one and challenging to put in a rage tweet. But the services don’t exist in a vacuum. They’re not there because, despite the joke, someone wants to get promoted. It’s because there are customer needs that are going on, and this is another way of meeting those needs.

I think there could be better guidance, but I also understand that there are a lot of nuanced perspectives here and that… hell is someone else’s workflow—

Casey: [laugh].

Corey: —and there’s always value in broadening your perspective a bit on those things. If people want to learn more about you and how you see the world, where’s the best place to find you?

Casey: Probably on Twitter: twitter.com/nektos, N-E-K-T-O-S.

Corey: That might be the first time Twitter has been described as a best place for anything. But—

Casey: [laugh].

Corey: Thank you once again, for your time. It is always appreciated.

Casey: Thanks, Corey.

Corey: Casey Lee, CTO at Gaggle and AWS Container Hero. And apparently writing code in anger to invalidate my points, which is always appreciated. Please do more of that, folks. I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, or the YouTube comments, which is always a great place to go reading, whereas if you’ve hated this podcast, please leave a five-star review in the usual places and an angry comment telling me that I’m completely wrong, and then launching your own open-source tool to point out exactly what I’ve gotten wrong this time.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.

