The Hidden Costs of Cloud Computing with Jack Ellis

Episode Summary

On this week’s episode of Screaming in the Cloud, Corey Quinn is joined by Jack Ellis. He is the technical co-founder of Fathom Analytics, a privacy-first alternative to Google Analytics. Corey and Jack talk in-depth about a wide variety of AWS services, which ones have a habit of subtly hiking the monthly bill, and why Jack has moved towards working with consultants instead of hiring a costly DevOps team. This episode is truly a deep dive into everything AWS and billing-related led by one of the best in the industry. Tune in.

Episode Show Notes & Transcript

Show Highlights

(00:00) - Introduction and Background
(00:31) - The Birth of Fathom Analytics
(03:35) - The Surprising Cost Drivers: Lambda and CloudWatch
(05:27) - The New Infrastructure Plan: CloudFront and WAF Logs
(08:10) - The Unexpected Costs of CloudWatch and NAT Gateways
(10:37) - The Importance of Efficient Data Movement
(12:54) - The Hidden Costs of S3 Versioning
(14:33) - The Benefits of AWS Compute Optimizer
(17:38) - The Implications of AWS's New IPv4 Address Charges
(18:57) - Considering On-Premise Data Centers
(21:05) - The Economics of Cloud vs On-Premise
(24:05) - The Role of Consultants in Cloud Management
(31:05) - The Future of Cloud Management
(33:20) - Closing Thoughts and Contact Information

About Jack Ellis

Technical co-founder of Fathom Analytics, the simple, privacy-first alternative to Google Analytics.

Links:

Twitter: @JackEllis
Website: https://usefathom.com/
Blog Post: An alterNAT Future: We Now Have a NAT Gateway Replacement
Sponsor: Oso - osohq.com

Transcript

Jack Ellis: Yeah, we had old logs. We absolutely did, but that was not the big cost driver. People assumed it was though. The big cost driver was that ingest thing you talk about.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. I've been paying attention to the world of web traffic analytics for a little while now because it seems that we've basically ceded the entire space to, well, the only way to know what people are doing on your website is to send all the information to Google.

A while back I heard about a company called Fathom that was launching something in the space that actually treated your data with, you know, respect and dignity. It was kind of wild. I recently re-encountered the company when they had a whole Twitter thread series on things that they had done to save money on their AWS bill, which is basically, in my case, like taunting a tiger by waving raw meat in front of it.

Here today to talk about some of those things, and I'm sure much more, is Jack Ellis, the co-founder and CTO of Fathom Analytics. Jack, thank you for agreeing to suffer through my nonsensical questions.

Jack Ellis: Uh, thanks for having me, my friend.

Corey: Oso makes it easy for developers to build authorization into their applications. With Oso you can model, extend and enforce your authorization as your applications scale. Organizations like Intercom, Headway, Productboard and PagerDuty have migrated to Oso to build fine-grained authorization, backed by a highly available and performant service. Check out Oso today at osohq.com. That’s O-S-O-H-Q-dot-com.

So I wanna start at the beginning here, which is when you're building something to do analytics for a small website that does not get hits, Apache logs tend to basically be sufficient and in time the complexity grows.

And then people have different problems that they want to wind up addressing, and one thing leads to another. I did not expect to find a company that was relatively early in its journey already caring about the AWS bill. So I have to ask: was there a tipping point that made you say, "Ah, we should definitely dive into this and fix it"?

Did it cross some threshold? Was it just, "Mm, feels like it's time for a good citizen effort," or something else?

Jack Ellis: So the leading motive, I think a lot of companies have a lot of money to burn through and there's lots of venture capital involved. Our company's fully bootstrapped so that cash, it is cash, it's profits, it's employee raises, and things like that.

So we have to ask ourselves, as our business grows, what's going to hurt our profit margin or available cash to spend elsewhere? And AWS on a per-page-view level was becoming concerning, and the spending was wasteful. And so it didn't really matter that, sure, it was only a hundred thousand this year. Based on our growth, we were going to see it become 200, 300, 400.

And then before you know it, we'd be reaching out to you saying, "Look at my AWS bill, it's gone crazy," which is what most people do, but we try to get to it ahead of time.

Corey: Most of my customers these days tend to be enterprise scale. Not to cast aspersions on them at all, but at enterprise scale, the bills get a lot less interesting in most cases, where you have this giant conglomerate with, okay, they're spending hundreds of millions a year, but the biggest workload's a couple million bucks, and there's just a very long tail of those things.

It becomes more about central planning and you don't see the same fun level of misconfigurations. Because yeah, if you're running a Managed NAT Gateway and that's driving 20 grand a month in spend, like, that's okay. That winds up being a fifth of your bill in some cases. People feel foolish, and they fix it.

You don't let that grow until, "Oh, what is that $30 million a year charge we're getting?" Someone notices when the numbers get big enough, so it starts to normalize toward a certain spend. Accounts at your scale are a lot more fun 'cause you get to see things that, uh, catch folks who are paying attention to it by surprise. What surprised you the most?

Jack Ellis: Lambda surprised me, but more than anything, CloudWatch really surprised me. We were spending a significant amount and we weren't getting any value from it. We'd come up with this approach once upon a time and we were completely fine with it. And I was surprised to see that the Lambda was so high because we were effectively doing double requests, the HTTP into the SQS and then triggering a Lambda.

And it just, I dunno, I suppose my own incompetence surprised me. I just wasn't happy about what I was seeing. Right. And it was just this inefficient use of money, things that we just didn't have to do now that things had changed. Once upon a time, SQS had some relevance, but why are we now spending this money when it's not actually delivering any value in our particular use case? And it's driving up the Lambda bill because we are seeing those documented latencies; I believe it can go into the hundreds of milliseconds.

We are seeing that and everyone says that's crazy, but it is documented and they say that can happen. So that surprised me. The SQS time, actually, now that we talk about being surprised. Yeah.

Corey: Cost and architecture and cloud are the same thing. It's very odd seeing the drivers of your cost. It, it, it definitely leads to a better understanding of your own architecture once you start seeing it in black and white in the bills that show up, that start to resemble telephone numbers.

Jack Ellis: I completely agree. And so, yeah, we just, we had to attack it. We had to do something because, and I'll, I'll talk about this. It hasn't happened yet. We are looking at workloads that are going to double the volume and so my mind goes, okay, that's going to be nearly double the bill if Lambda is already inefficient and SQS is in heavy use.

The bill was going to double and we weren't doing clever things like batching SQS to Lambda, right? So it was one, one page view in Lambda, SQS, Lambda. You can see that's not an efficient infrastructure to have as you're going to scale. So we had to fix it.

Corey: Yeah. At scale, small inefficiencies start to add up.

What are you looking at doing instead? Fewer Lambdas? Not using Lambdas at all?

Jack Ellis: Right. So I have not talked about the finalized infrastructure for this, so this is an exclusive for you. We are doing CloudFront and WAF. WAF logs, and we're still prototyping this, but WAF logs into Kinesis. It's now called Data Firehose.

It's just been renamed. Into Data Firehose, transformed in batches through Lambda, and that's to anonymize the data before it hits S3. And there are some privacy law reasons I don't wanna get into, but we're doing that. It gets into S3, and then we have a SingleStore database running a pipeline, which is massively scalable, like millions per second that it can handle, I think probably more. It pulls the data from S3 using their own special proprietary stuff and loads that data into our database.

This is more equivalent to the big companies and how they handle, what do they call it, click point, or you know, those kinds of analytical workloads. So we're now looking, this is the cool part, we're now looking at the CloudFront 250,000 requests per second limit. We know that WAF batches through to Data Firehose, so we can fit within their default limit.

Might have to increase it a little bit, but we're getting that without any provisioned servers, and we've got no Lambda burst concerns because the Lambda team wouldn't increase our burst, um, concurrency. And I appreciate the scaling work they've done the new every 10 seconds. It's an extra thousand invocations or whatever it is.

That's fantastic. We're keeping our dashboard and our API going to work great for us. When we have so many customers who can go bursty at any minute, you know the initial, is it 5,000? I dunno how many, I think it's a thousand now.
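A rough sketch of the "transform in batches through Lambda" step Jack describes might look like the following. The record envelope (a list of records, each with a recordId and base64 data, returned with a result status) is the shape Firehose uses for data-transformation Lambdas. The anonymization rule itself is purely an assumption for illustration: the conversation does not say what Fathom strips, so the anonymize helper here simply drops the client IP field from a WAF-style log entry.

```python
import base64
import json


def anonymize(log):
    """Hypothetical rule (not from the episode): drop the client IP."""
    request = log.get("httpRequest", {})
    request.pop("clientIp", None)
    return log


def handler(event, context=None):
    """Firehose transformation handler: one output record per input record."""
    output = []
    for record in event["records"]:
        try:
            log = json.loads(base64.b64decode(record["data"]))
            # Newline-delimit the records so the objects landing in S3 are easy to load.
            payload = json.dumps(anonymize(log)) + "\n"
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(payload.encode("utf-8")).decode("ascii"),
            })
        except (ValueError, AttributeError):
            # Unparseable input is flagged rather than silently dropped.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Because Firehose hands the function a whole batch per invocation, the per-request Lambda call that the old SQS design paid for goes away, which is the cost point being made here.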

Corey: Some people are seeing 10 in new accounts and getting declined on increasing them.

Jack Ellis: And that's a problem, right?

So we, we are feeling like we're being forced by AWS into something that was purpose made for this use case. And I'm really proud that we push Lambda this far for our use case and using Laravel and PHP on the ingest. And that's been a story for years, but. We are at the point where we have to go into different directions and it's, uh, I'm happy and sad about that.

Corey: Lambda is one of those things that can do an awful lot, but at some point it feels like you're trying to stretch it in a direction it wasn't intended for, and you start to feel the sharp edges coming apart under you.

Jack Ellis: That's- that's exactly it. And with AWS not giving us limit increases, it's not even an option.

'Cause we were thinking, you know, we can bring in Redis and keep it really efficient with the latency to external services, PrivateLink, VPC peering, get the, uh, execution time down. But even if we do that, we're still not getting that burst that we need. And that just, versus 250,000 requests per second, um, on CloudFront, which is obviously a much more, you know, CloudFront's CloudFront, it's made for scale.

That just was gonna work better for us. And that's the fool. By the way.

Corey: You also mentioned that CloudWatch was a, uh, fun challenge for you. I imagine you went through the same thing most of us do, where it's, okay, CloudWatch, that's a lot. CloudWatch covers a lot of surface area.

What are the expensive parts of it? And sometimes people wind up going down the wrong path of, oh, I'm storing too many old logs. Yeah, ingest is 50 cents a gigabyte, but now they have a twenty-five cent option, and storing it is 3 cents a gigabyte per month. Old logs are not really the cost driver in most environments.
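To make the ingest-versus-storage point concrete, here is a small back-of-the-envelope sketch using only the two rates quoted above (50 cents per gigabyte ingested, 3 cents per gigabyte-month stored). It assumes a steady log volume and ignores compression of stored logs, so treat the figures as illustrative rather than as a bill estimate.

```python
INGEST_USD_PER_GB = 0.50         # rate quoted in the conversation
STORAGE_USD_PER_GB_MONTH = 0.03  # rate quoted in the conversation


def monthly_log_cost(gb_per_month, retention_months):
    """Return (ingest, storage) in USD per month at steady state.

    Simplifying assumption: stored bytes equal ingested bytes.
    """
    ingest = gb_per_month * INGEST_USD_PER_GB
    storage = gb_per_month * retention_months * STORAGE_USD_PER_GB_MONTH
    return ingest, storage


def breakeven_retention_months():
    """Months of retention needed before storage costs as much as ingest."""
    return INGEST_USD_PER_GB / STORAGE_USD_PER_GB_MONTH


if __name__ == "__main__":
    ingest, storage = monthly_log_cost(gb_per_month=1000, retention_months=3)
    print(f"ingest ${ingest:.2f}/month vs storage ${storage:.2f}/month")
    print(f"break-even retention: {breakeven_retention_months():.1f} months")
```

Under these assumptions, logs would have to be kept for well over a year before storage caught up with ingest, which is why trimming retention alone rarely moves the number much.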

Jack Ellis: You're absolutely right. So yeah, we had old logs. We absolutely did, but that was not the big cost driver. People assumed it was though. The big cost driver was that ingest thing you talk about, and it was pointless logs. Laravel Vapor, you know, the runtime they had, this is a PHP Laravel runtime.

They were writing pointless logs once upon a time: starting up, injecting secrets into the runtime. But they fixed that. So I'm thinking, oh, good, I can disable those pointless logs. I'm going to be great. And yet I was still seeing these logs, and it's Lambda writing the execution time and things like that.

And I'm not using this, you know, and it just blew my mind and it infuriated me. And then I said to myself, okay, cool. We will go without everything. All logs from Lambda can just go, including those beautiful logs where you can see the throttles and the concurrency and the rest. All of that, we'll just get rid of it and we will live.

Well, it turns out those graphs are actually included in the price of Lambda. So we still see the concurrency and the requests and everything. We just don't have that pointless logging to CloudWatch, which we never wanted in the first place.

Corey: Even the, uh, "function started, function ended, here's a report" three lines of logs on every invocation. You wound up, uh, documenting the advice we've given people.

And after extensive testing to make sure it doesn't destroy things, the only way to turn this off is to remove the ability to put CloudWatch logs from the execution role the Lambda is running in, which is insane.
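As an illustration of that workaround, the sketch below builds an IAM policy document that blocks a function's execution role from writing to CloudWatch Logs. The conversation describes removing the permission; attaching an explicit Deny, as shown here, is an assumed variation that has the same effect even when a managed policy still grants the Allow. The three action names are the standard CloudWatch Logs write actions; the wildcard resource is a simplification.

```python
import json


def deny_cloudwatch_logs_statement():
    """Statement that blocks log group/stream creation and log writes."""
    return {
        "Effect": "Deny",
        "Action": [
            "logs:CreateLogGroup",
            "logs:CreateLogStream",
            "logs:PutLogEvents",
        ],
        # Simplification: applies to every log group in every region.
        "Resource": "arn:aws:logs:*:*:*",
    }


def build_policy():
    """Full policy document suitable for attaching to an execution role."""
    return {
        "Version": "2012-10-17",
        "Statement": [deny_cloudwatch_logs_statement()],
    }


if __name__ == "__main__":
    print(json.dumps(build_policy(), indent=2))
```

The trade-off is that application errors written to stdout or stderr disappear along with the start, end, and report lines, so this only makes sense when errors are captured some other way.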

Jack Ellis: Yeah, and like I said, I said to you before the call, I dunno if they've improved on that, but that was the way that I found to do things.

I dunno if the JSON logging changes things. People are suggesting it does, but I haven't checked that out. So yeah, the thing we had to do was crazy.

Corey: You also wound up getting bitten by my personal favorite obnoxious bugbear, the Managed NAT Gateway charges, and there's always two ways that hits. One is you have a lot of them and very little traffic's going through, so the hourly cost is through the roof, or you have relatively few, and the traffic through them is enormous.

Which one was you?

Jack Ellis: So we were the enormous traffic. And you know, that is, I call it incompetence. I'm being harsh on myself, but it's just learning the right way to move data around and to move it around efficiently. You know, it's not good practice, but it is possible, and I know Heroku have done this for years, you can have your database traffic go over the internet.

Now if you say that to me now, I say, of course you wouldn't do that, but it's not unusual for people to do that. Not everyone has these VPCs locked down. I know your clients a hundred percent surely do, but smaller companies are not always doing that. They're rarely doing that, in fact, you know, the system I'm talking about.

And so we were hit. I thought to myself, you know, temporarily, all the database traffic can go over here and it's going to be fine. I didn't realize how much traffic was going back and forth, so it was incompetence on my part, but it was so easy to fall into that trap. And so our NAT Gateway spend is literally, we've got PrivateLink and VPC peering set up now.

So our NAT Gateway spend is, is pennies, is cents, it's next to nothing each day.
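The two failure modes Corey describes (many idle gateways versus a few very busy ones) fall out of a simple two-term cost model. The rates below are assumptions, roughly the published us-east-1 list prices at the time of recording; check current pricing before relying on them.

```python
HOURLY_USD_PER_GATEWAY = 0.045  # assumed hourly rate per NAT gateway
PROCESSING_USD_PER_GB = 0.045   # assumed data processing rate per GB
HOURS_PER_MONTH = 730


def nat_monthly_cost(gateways, gb_processed):
    """Return (hourly_charges, processing_charges) in USD for one month."""
    hourly = gateways * HOURS_PER_MONTH * HOURLY_USD_PER_GATEWAY
    processing = gb_processed * PROCESSING_USD_PER_GB
    return hourly, processing


if __name__ == "__main__":
    # Many gateways, almost no traffic: the hourly term dominates.
    print("idle fleet:", nat_monthly_cost(gateways=20, gb_processed=10))
    # Few gateways, heavy traffic: the per-GB term dominates.
    print("busy pair: ", nat_monthly_cost(gateways=2, gb_processed=100_000))
```

Routing database traffic over PrivateLink or VPC peering, as Jack describes, attacks the second term by keeping those gigabytes off the NAT gateway entirely.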

Corey: One of the, uh, projects that came out of Chime Financial was alterNAT, where it runs its own NAT instance and then has a fallback to the Managed NAT Gateway. So you can maintain uptime in the event that the instance has a problem or whatnot, but you are stuffing things through it at significant scale.

Saved them something like 30 grand a month. I did a whole blog post about it two years ago.

Jack Ellis: That's incredible. And I know you can self-host NAT Gateways and things like that. I just, I don't want to be hands-on with anything. They've probably got a team of, they definitely have a team of DevOps if they're spending that much money.

We don't have a team for DevOps, so we have to think about managed, managed, managed.

Corey: You also had some fun things that make sense, like the old-school sysadmin approach used to be for, um, load purposes, but now it seems it's a financial one too. You saved 2,500 bucks a year on Route 53 just by increasing TTLs for some records.

How do you figure out which ones were too short?

Jack Ellis: So, um, we know which ones we are seldom going to change, and if we're gonna change something we'll know weeks in advance. And so I haven't gone ahead and increased it to something ridiculously high. But we had it at the core. It was so low, it was, we're talking maybe 60 on some of them, and they're not changing at all.

I love this one because people that read this article, this isn't a groundbreaking change for people, but they hadn't necessarily thought about the TTLs and the impact they have at scale, and they really do.
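A simple way to see why TTLs matter at scale: a caching resolver that is asked for a record continuously has to re-query the authoritative servers roughly once per TTL. The sketch below models that upper bound. The per-million-queries price is an assumption (the commonly published first-tier Route 53 rate), and real query volume depends on how many distinct resolvers are asking, so this shows the ratio rather than predicting a bill.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60
USD_PER_MILLION_QUERIES = 0.40  # assumed first-tier rate for standard queries


def max_queries_per_resolver(ttl_seconds):
    """Upper bound on monthly queries from one always-busy resolver for one record."""
    return SECONDS_PER_MONTH / ttl_seconds


def monthly_cost(ttl_seconds, resolvers, records):
    """Worst-case monthly query cost in USD under the model above."""
    queries = max_queries_per_resolver(ttl_seconds) * resolvers * records
    return queries / 1_000_000 * USD_PER_MILLION_QUERIES


if __name__ == "__main__":
    for ttl in (60, 3600):
        cost = monthly_cost(ttl, resolvers=100_000, records=3)
        print(f"TTL {ttl:>5}s -> up to ${cost:,.2f}/month")
```

In this model, moving a record from a 60-second TTL to an hour cuts the worst-case query count by a factor of 60, which is the effect being described for records that seldom change.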

Corey: You also knocked almost six grand off your S3 bill just by, uh, fixing versioning being turned on on a particular bucket.

But I like that all of AWS's recommendations and the default config and GuardDuty and the rest demand you turn it on for every bucket. It's like, you're a little self-interested there, buddy, aren't you?

Jack Ellis: AWS, their S3 stuff drives me wild. Even how the new config doesn't want you to put anything public.

It's aggressively just, no, nothing's going public. It feels very hard to use now. But with the versioning, I had to tick this toggle to show that I was versioning. I'd forgotten about it. Right. I hadn't seen this tiny, tiny thing in the UI. And again, incentives: are they incentivized to make that more obvious?

No, they're not. And so I spot this thing, I click it and I go, oh no. And that is what was contributing towards our AWS S3 bill. So I'd forgotten about that one. You're bringing it back to my memory.

Corey: I am, it's, uh, yeah. I don't have this off the top of my head. I pulled up the blog post and I'll throw a link to it in the show notes.

Corey: No one is excited by the prospect of building permissions – except for the people at Oso. With Oso’s authorization as a service you have building blocks for basic permissions patterns like RBAC, ReBAC, ABAC and the ability to extend to more fine-grained authorization as your applications evolve. Build a centralized authorization service that helps your developers build and deploy new features quickly. Check out Oso today at osohq.com. That’s O-S-O-H-Q-dot-com.

But what I'm curious about too is, I mean, the stuff that you wound up putting in here, I did not notice that you got anything wrong, which is something of a rarity in posts like this. People often like to get ahead of their skis and they'll get some trivial thing wrong, and I try not to be like the, uh, "Aha, you missed this thing."

It's like, I don't wanna be the person that shows up to an effort like this and starts chipping away at the validity of what folks have done. But what I'm curious about is the stuff that you didn't put in here. For example, you talk about saving money on S3 by turning off versioning. Again, it's all gonna be based on what the service drivers are, but taking S3 as an example, did you do any analysis of your data access patterns and figure out if there were lifecycle changes or Intelligent-Tiering that would potentially make sense for you to implement?

Jack Ellis: No. This is, this is more. No, and there probably could have been something we could have done there. That's a very valid point.
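For the forgotten-versioning case discussed above, one middle ground between "versioning on forever" and "versioning off" is a lifecycle rule that expires old object versions after a set number of days. The sketch below builds such a rule in the dictionary shape that S3 lifecycle configurations use (for example, as the LifecycleConfiguration argument to boto3's put_bucket_lifecycle_configuration). The rule ID and the 30-day window are hypothetical; nothing in the episode says Fathom uses this.

```python
import json


def noncurrent_version_expiry_rule(days, rule_id="expire-old-versions"):
    """Lifecycle rule that deletes noncurrent object versions after `days` days."""
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "ID": rule_id,                 # hypothetical name, for illustration only
        "Status": "Enabled",
        "Filter": {"Prefix": ""},      # empty prefix: applies to the whole bucket
        "NoncurrentVersionExpiration": {"NoncurrentDays": days},
    }


def lifecycle_configuration(days=30):
    """Complete lifecycle configuration containing the single rule above."""
    return {"Rules": [noncurrent_version_expiry_rule(days)]}


if __name__ == "__main__":
    print(json.dumps(lifecycle_configuration(), indent=2))
```

With a rule like this, accidental deletes and overwrites can still be recovered for the chosen window, while stale versions stop accumulating storage charges indefinitely.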

Corey: And that was an example; there are a bunch of things you could go down the path on. It sounds like you took the same approach that I believe in taking, which is this ancient secret of cloud economics where you start with the biggest numbers rather than alphabetically and understand the items contributing to that, and then work your way down.

"Well, why didn't you optimize your dollar-fifty charge for, I don't know, KMS?" Because no one cares, buddy. Go back to work.

Jack Ellis: Um, and I, I completely agree. I completely agree. And there's, there's, there's things for sure we could, I mean, you even told me something we spoke on, on Twitter DMs, and you, you, um, told me to explore this.

I think there was one thing that came out of that that was something to do with the Compute Optimizer. The ingest itself was already good, but there was something on the dashboard that we actually went off and changed, um, that they were recommending. So thank you for that, by the way.

Corey: The AWS Compute Optimizer should be part of the billing console, but it's not because of internal, I don't know, feudal warlords fighting, whatever it is. When it launched, it was pretty crap. And it has gotten disturbingly good. Uh, it corrected me on the optimization of one of my Lambda functions, and I just wanna know the answer to this just for my own purposes because I need to understand how this all works.

So it was right and it saved me a penny a month. Um, you'll forgive me if I'm not falling all over myself with excitement at the cost savings, but it has gotten good enough that I have deprecated some of the analytical tooling that I used to use for a number of things around right-sizing. It sees so many workloads and it knows what it's looking at. At re:Invent, they launched the ability for you to start customizing how it works, like what headroom should be built in, how conservative do you want it to be, and its defaults are pretty sensible, too.

Jack Ellis: It was easy to actually take action based on what they were giving us. And you're right, the cost savings at the current scale for that, I don't think they're going to be huge, but I mean, I still like optimizing things, not overly optimizing them, but if it's a case of me tweaking a little value here and it will add up over time, I absolutely will do that.

I felt, also felt happy to know that ingest was, you know, we're moving away from it, but to be validated that it was in a good place with the provisioned memory. And you know what? I also think I need to go back to it after having made these changes and see if anything's changed there, because that would be interesting to see.

Corey: The problem too is that if you spin up a resource, it's not just what the resource charges you along an ever-increasing array of dimensions. It's okay, so now it's causing log events and config rule evaluations and snapshots and whatnot. And then those things in turn have downstream effects. And it's, it, it's turtles all the way down.

Jack Ellis: No, for sure. Um, and the data transfer stuff is interesting. We're actually, um, as part of this process, we're spinning up some EU isolation, EU data processing stuff within AWS. Even the Kinesis writing through to S3 in the US from the EU is interesting. And these are things you have to know about and price out to make sure that it's economical for what the business is doing.

I think it's 2 cents per gigabyte and we can absorb that, it's fine. Knowing to know that, I find, is a challenge, and that's AWS a lot of the time: knowing to know this is hard.

Corey: We're recording this conversation in the middle of February, and starting back on February 1st, AWS started charging half a penny per hour per provisioned public IPv4 address.

Most people don't read, so I'm expecting my phone to basically explode right around March 3rd. Have you done the numbers on that yet or are you just waiting for the delightful surprise?

Jack Ellis: Honestly, I don't know. I don't think it affects us that much because we only have two workloads we're concerned about. You know, if we're talking enterprise customers or even slightly bigger businesses, it's gonna be crazy, isn't it?

All the EC2s, they've got all the things tied together. I think you're gonna have a fun time this year.

Corey: Between three and 10% is what I'm seeing in various sample customer environments across the board. Mine is almost 10, but I have a weird architecture. But again, we're talking 50 bucks, so, okay.

People are gonna be unpleasantly surprised by this.
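The half-a-penny-per-hour figure is easy to underestimate, so here is the arithmetic spelled out. The only input taken from the conversation is the $0.005 hourly rate; the 730-hour month and the example address count are illustrative assumptions.

```python
IPV4_USD_PER_HOUR = 0.005  # "half a penny per hour", as quoted above
HOURS_PER_MONTH = 730      # average month
HOURS_PER_YEAR = 8760


def ipv4_cost(addresses, hours):
    """Charge in USD for holding `addresses` public IPv4 addresses for `hours` hours."""
    return addresses * hours * IPV4_USD_PER_HOUR


if __name__ == "__main__":
    print(f"1 address:    ${ipv4_cost(1, HOURS_PER_MONTH):.2f}/month, "
          f"${ipv4_cost(1, HOURS_PER_YEAR):.2f}/year")
    print(f"500 addresses: ${ipv4_cost(500, HOURS_PER_MONTH):,.2f}/month")
```

A single address works out to a few dollars a month, which is why the charge is negligible for a two-workload setup and noticeable for fleets with hundreds of public addresses.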

Jack Ellis: Because they want you to move to IPv6 and they're trying to push you.

Corey: Well, I'd like them to move to IPv6 first. So many of the things I want to run will not work, full stop, in a pure IPv6 environment internally on AWS services. Back when they announced this last summer, it sounded like, oh great.

They're gonna have these things ready to catch customers. They didn't.

Jack Ellis: Yeah. You gotta love them. You really have.

Corey: Oh yeah. You've been an AWS customer for a while and you've been doing a lot of interesting things with them. And as an AWS customer, you have your fair share of frustrations around a lot of the things that they do and how they operate.

Are you planning to, uh, follow the lead of all of the think pieces that are getting written and repatriate all of your workloads to an on-premise data center? How do you think about this?

Jack Ellis: Yeah, so we keep getting asked this and I find it funny, like, it's a funny question. I'm sure some people are trolling and I appreciate the trolling.

Corey: Yeah, there was an element of sarcasm in my question because at your scale it'd be very hard for us to build an economical case for you to do that. But I can be surprised. I am curious. The question is in good faith, even if I'm 90% certain I know where it's going.

Jack Ellis: No, absolutely. Okay.

So it's funny because I try and get in the head of someone who's thinking about doing this. And I think, okay, I've already got the DevOps team, they're managing cloud. I can have them manage on-premise, so I haven't got to worry about the extra salaries required for that and benefits and everything else.

I've got my team, let's say it's five to 10 people. I have no idea. Wait. It's so funny to think of, okay, if you've got the team already, sure, go ahead and do it. But then, then someone comes back to me when I've said that and they challenge me and they say, okay, but people leave the company and then you've got to worry about these staff that are managing this. It's not just popping someone into place and to replace them, like there's training required there.

Corey: They're not just replacing hard drives.

Jack Ellis: No, exactly. I, I can't see it. I mean, for me it's a, this is never going to happen. I would always prefer, unless we're spending, if we're spending, how much would it have to be?

I mean, senior DevOps, the best of the best in DevOps, these are big salaries that we're talking about. So the bill would have to be so substantial that it was causing so much pain that I’d do it. But I think it's probably motivated, if we're talking about a specific situation here, um, I think it's motivated by them being bootstrapped and wanting more cash out of the company for themselves, which I understand, and I think it's good for them and their beliefs, whoever we may be talking about.

It's just not for me. I dunno, Corey, the whole thing just breaks my brain. Even thinking about it, my brain just goes a bit all over the place.

Corey: And remember at points of scale, starting at a million bucks a year in spend, in return for committed spend on all the major cloud providers, you get discounting like these people spending $50 million a year are not paying retail prices.

Jack Ellis: And I've, I also saw some, some of the workloads and some of the databases used for various things. I dunno if it was Elasticsearch and I'm just thinking. I wouldn't have chosen that for that problem. And that's, I appreciate I'm in the poor seats here, but are there really, is there really nothing else you can do to reduce that cloud cost?

Even kind of hard-balling with Amazon Web Services about what they're charging? Once you get to a certain spend, I'm sure you've got... no, you do. Doesn't your company do negotiations on behalf of people?

Corey: We, it's about half of our consulting. Oh yes.

Jack Ellis: Okay. So that's what I mean. So there must be a way.

Well, there is a way. You just told me there's a way. Just going on premise feels like such a big jump and it's almost like a marketing stunt, but I appreciate there's a real business there.

Corey: There are a bunch of analyst reports saying that everyone's doing it on some level. I don't see it. What I see is companies who already have data centers moving some workloads around.

Cool. I don't see people shrinking their cloud footprint. I see steady-state workloads in many cases, things that do not work well in a cloud environment, not moving in, for obvious reasons. But I've never yet found a company of any scale where the AWS bill was larger than payroll. People are expensive and people lose sight of the fact that they're expensive, so they just look at a raw hardware cost and maybe some of the forward-looking ones look at the power cost too.

There's a lot more to it.

Jack Ellis: And I don't want to dunk by saying this, but, you know, everyone knows who we're talking about, but like the move to on-premise and badmouthing the cloud and everything else. And then a DDoS attack happened and the first thing they did was spin up Cloudflare. I'm not dunking on them for being DDoSed, that's horrible.

But the cloud has its place, even if you think you are exiting the cloud. The cloud side, I mean AWS Shield Advanced, um, WAF, these things are amazing. CloudFront scalability. I just can't imagine having to try and, I guess if your business isn't growing, maybe that's okay, but still you've got the management.

It goes round in my head in circles and I just can't imagine doing that ever. We will never do that. Let's just say that.

Corey: I spent the last month or so building myself a Kubernetes for an upcoming conference talk I'm giving at SCaLE, "Terrible Ideas in Kubernetes," and I'm doing it out of Raspberry Pis.

And the problem I keep running into is, oh yeah, I'd forgotten this aspect of it. Waiting on parts to show up. Some of the parts don't seem to work right. Um, inconsistencies in a batch of cables, uh, getting the power hooked up, and I'm not even putting my time into this. It's fun, but oh yeah, right, I should be doing this in the cloud.

Now, for a small home lab environment, I'm one of the best in the world at AWS billing, but I still would not be confident, based on what I've seen so far, that I wouldn't get a giant surprise bill if I did this on EKS. So of course I'm doing it at home. But that doesn't mean I'm moving the production things that make money and hold client data into my spare room too.

That'd be ridiculous.

Jack Ellis: And I honestly think time will tell and I think people have got to watch it and be critical of people doing this and, and people make their choices, realize that there's some marketing going on, but just watch the outcomes and then make your own decisions. Uh, we've made our decision and we're never going on premise.

The stress of, of knowing that our infrastructure is in some data center and that we're as a company responsible, our team could, I mean, our team could walk out, our team could say, you know, we're not doing this anymore. Or they could. They could be sick. I can't even imagine the stress. I'd much rather have Amazon's engineers dealing with it.

They've got plenty of engineers, I'd imagine.

Corey: They can lose racks, facilities in some cases, and you barely notice, if at all. They have some of the best in the world engineering these problems out. You are worse at replacing failed hard drives than they are, guaranteed.

Jack Ellis: I might have to write about this.

'cause when we talk about this, my, when my brain does this, it's trying to get these crazy ideas all together and it's, it's hard. I'd love to see you write about, have you written about it?

Corey: I have a talk coming up on the economics of, on-Prem versus data centers of on-prem versus cloud on economics. Slap Fight.

It's a keynote at SREcon in San Francisco, uh, a month from now. So March. I should probably write the talk. At this point, I'm creeping in and doing the, uh, speaker procrastination thing, but yeah, it's time for me to go in some depth on this one.

Jack Ellis: I love it. I want to hear it. I think you know about cloud and you know about cost.

That's what's interesting to me. You have that insight. Someone like me, I've not seen the negotiations. I have no idea what you guys are pulling off when you have these negotiations. So I just want to know more, because there's another side, and I can't have these debates without knowing what goes on there. You actually know, and I'd love to hear it from you.

Corey: It's time for us to be a lot more public about what we're seeing and how it works. So there's more of that coming out this year too. It's time. It's always custom and you don't wanna tell any particular company stories or that will enrage the beast, but it's, it's the open secrets in the industry that everyone at a certain scale knows exist.

But if you don't know that's there, these companies look unsound, like the economics for that don't make sense. Well, there are service-specific discounts, so if someone's doing an awful lot of S3, for example, and has a certain use case, yeah, you can get very compelling discount options that get your cost for whatever metric you care about, MAU, transaction, et cetera, down to a very reasonable place.

Jack Ellis: I like it. I like that look.

Corey: Something else you mentioned: even at your scale, you have committed to never having a DevOps team, which, I used to be a DevOps, and yeah, those people are miserable. But why? Why do you not want one, as opposed to, you know, why those people are miserable? We can guess that.

Jack Ellis: I think it's not that we would never hire, you know, a couple of people to help with things. It's just the idea of having this big team to manage what seems like quite basic infrastructure doesn't feel right when I can set it up with consultants and then it's effectively hands off. We are paying a premium for this, and we are using Multi-AZ and everything else, services that AWS is managing, even Lambda and things like that.

The idea is just to be hands off. So we'll do the upfront spend with consultants to put things in place so that we don't need a DevOps team to do it. And I appreciate not everyone can do that, but that's just how we are at the moment. I want to see how far we can really push this: managed services, all in on that.

That's really where we're going.

Corey: But you pay more for managed services. Like, there's a 10 to 20%, uh, premium for using RDS over EC2. Yeah. But when it works, you don't have to have any database expertise internally, the way you would if you were running this at scale yourself with open source MySQL or PostgreSQL or whatever it is you choose to use.

Jack Ellis: That's just it. I think when you have good partners that care about your success, and we've got great partners at SingleStore and AWS. No one really talks about this part. AWS, they really want to help you, give you credits and invest in your use cases and help you to grow.

So you spend more with them. Like, it's incentivized, sure.

Corey: I have a hard time viewing here's some store credit as an investment. I, I never liked that, that turn of phrase.

Jack Ellis: All right, fine. Fine. That's how they Yeah, that's how they phrase it.

Corey: The free sample from the drug dealer is not them investing in your future, let's put it that way.

Jack Ellis: Alright, fine. But they'll bring on experts and everything else and they'll help you with things. And I've just been blown away by that. And there isn't the salesy part, I think. Like the Elastic, what was it they just released? ElastiCache Serverless. I had that team reaching out to me and telling me the limitations that bigger companies were facing.

'cause I, I said to them. What are the limitations that your bigger customers are saying? And it's to do with the total size, which I think is like 90 terabytes or something stupid. And the bigger companies are saying, that's not big enough, which, that blows my mind altogether. I like that the teams are very involved in customer relations.

I've had the same thing with AWS Shield, uh, Jeffrey Leon, one of the guys that used to work there, emailing me and talking about things. I really like that.

Corey: He's great.

Jack Ellis: Oh, you know that. Okay.

Corey: Yeah. The dangerous part that sucks about Shield is it costs $3,000 a month, and in what is, to us, a non-deterministic way (internally, I'm sure it's deterministic), it looks to all the world like the charge gets allocated to a random AWS account in your org every month.

So there have been a couple of times where devs had minor heart attacks when their, you know, $20 a month dev environment suddenly got a $3,000 charge slapped onto it.

Jack Ellis: Alright, that's fair. Do you see AWS Shield as an insurance policy? Because that's how we've been thinking about it internally because they'll absorb the actual, the WAF cost per request is, is my understanding.

So we now see it as insurance.

Corey: I think about it slightly differently. I view it as getting you a hotline to their DDoS folks when you need it. It is insurance, but it's talking to some of the best in the world at these problems in that moment without having to sit through a sales pitch and sign over a credit card.

Jeffrey Leon being a terrific example, back when he worked there, before he left to go, uh, sell, what was it? Cryptocurrency? I think Robinhood is where he went. So kind of.

Jack Ellis: They had the bat signal. You just run this Lambda and it's something. It's probably changed now, but I know that was our experience when we had it.

I don't know if you've read my DDoS attack article. This is from years ago now. They, they were great and I definitely do enjoy that service, but yes, it is expensive, especially for smaller businesses.

Corey: Oh yeah. And, and the problem is, is so many of their services are clearly designed for enterprises, but they don't mention that up front.

The only way to really figure it out is the pricing. Kendra's a good example. It's like, this sounds awesome. Oh, and it's 7,500 bucks a month, so it is not for me. Cool. Like, at that price, hire someone whose full-time job is basically to be the archivist of everything I care about, and ask them as a human to go and get me the thing I care about.

Jack Ellis: Ditto. The times are interesting. You know, everyone said, oh, throw up a CAPTCHA. We're analytics and it happens in the background. We can't throw up a CAPTCHA to make sure it's a legitimate person coming in. And, you know, "Cloudflare can do this." No, Cloudflare cannot. Layer seven is really hard and you can't throw up a CAPTCHA.

Corey: No customer in the universe is going to fill out a CAPTCHA for the freaking analytics on your webpage.

Jack Ellis: So you've gotta make people understand that when they try and give you advice. But that was a fun experience, and, um, the expectation is that we'll be returning to using that.

Corey: So, so I have to ask, now that you've successfully knocked a hundred thousand dollars a year off of your bill, are you done?

Are you gonna keep going? What does done look like?

Jack Ellis: Oh, we feel done. We feel good. Our employees got raises while people are being laid off, and we were able to do that, and that feels really good. We're done. We're good? No, we're good. Honestly, we're good. I think that the main thing now moving forward is we are bringing in consultants when we're building things to make sure we're really squeezing this. You know, I'm friends with Alex DeBrie.

I talk to him about things and get his thoughts on...

Corey: We have hired him for a number of projects ourselves when it comes to the deep Dynamo stuff. Hard to find anyone better. Okay.

Jack Ellis: Him and serverless though, like, he knows so much. So talking to him, bringing in other consultants, making sure we're doing it right in the first place versus Jack's going to make a guess at doing something right that's going to hurt us down the road. There's a balance there, you know?

And now we're bringing in consultants, 'cause we couldn't afford to hire consultants at the beginning, you know. We couldn't afford these luxuries. So things have changed. Now we've optimized our spend. We're great, but as we do new things, we're going to bring in consultants to make sure we're not going to have these huge, you know, amounts we have to cut off down the road.

Corey: That's why I'm always interested to talk to people who reach out like, okay, your, your bill is 50 bucks a month. Why are we having this conversation? And very often it's, well, we're about to scale this and want you to check our napkin math before we have to raise a round to pay the bill.

Jack Ellis: When I was doing consulting, I had people reach out and my job at the day with PHP and serverless stuff, my role was to make sure that it would scale for their use case.

They hadn't even reached this scale, but they wanted to make sure they could. So I get this preventative thinking and I'd always understand why people would come to you for that, because people can blow up like that. And bills get, you know, AWS bills for everything. So they've really gotta make sure they've got it set up.

Corey: Yeah. I'm curious to see how this winds up unfolding in the future. The real trick is at some point, once you've reached equilibrium, keep an eye on it. But you don't necessarily need to go into, uh, super deep, uh, weeds every month looking for spikes. Look for trends.

Jack Ellis: Yeah. Set up notifications in the, in the billing and all of that jazz.

Corey: The alerts are great, if you remember to check them. Sometimes they wind up in the founder's Gmail inbox, which they still have from their personal nonsense years ago, and get lost among everything else. I really wanna thank you for taking the time to speak with me. If people wanna learn more, either about you, what you've done, the company, anything, where's the best place for them to go?

Jack Ellis: UseFathom.com is the best place and, and follow me on Twitter. I'm Jack Ellis and that's, that's pretty much it.

Corey: We'll put links to all of that and his blog post in the show notes. Thank you so much for taking the time to talk to me about this. I really appreciate it.

Jack Ellis: Thanks, man.

Corey: Jack Ellis is the CTO and co-founder of Fathom Analytics.

I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you hated this podcast, please leave a five-star review on your podcast platform of choice along with an insulting comment that inadvertently will cost that platform $6, 'cause they have no idea how their architecture works in relation to the AWS bill.
