Mastering Kubernetes for Multi-Cloud Efficiency With Nick Eberts

Episode Summary

In this episode, Corey chats with Google's Nick Eberts about how Kubernetes helps manage applications across different cloud environments. They cover the benefits and challenges of using Kubernetes, especially in Google's cloud (GKE), and discuss its role in making applications more flexible and scalable. The conversation also touches on how Kubernetes supports a multi-cloud approach, simplifies the deployment process, and can potentially save costs while avoiding being tied down to one cloud provider. They wrap up by talking about best practices in cloud infrastructure and the future of cloud-native technologies.

Episode Show Notes & Transcript


Show Highlights: 
(00:00) - Introduction to the episode
(03:28) - Google Cloud's approach to egress charges and its impact on Kubernetes
(04:33) - Data transfer costs and Kubernetes' verbose telemetry
(07:23) - The nature of Kubernetes and its relationship with cloud-native principles
(11:14) - Challenges Nick faced managing a Kubernetes cluster in a home lab setting
(13:25) - Simplifying Kubernetes with Google's Fleets
(17:34) - Introduction to GKE Fleets for managing Kubernetes clusters 
(20:39) - Building Kubernetes-like systems for complex application portfolios 
(24:06) - Internal company platforms and the utility of Kubernetes for CI/CD 
(27:49) - Challenges and strategies of updating old systems for today's cloud environment
(32:43) - The dividing line between Kubernetes and GKE from a product perspective
(35:07) - Where to find Nick 
(36:48) - Closing remarks 

About Nick:
Nick is an absolute geek who would prefer to spend his time building systems, but he has succumbed to capitalism and moved into product management at Google. For the last 20 years, he has worked as a systems engineer, solution architect, and outbound product manager. He is currently the product manager for GKE Fleets & Teams, focusing on multi-cluster capabilities that streamline GCP customers' experience while building platforms on GKE. 


Links referenced: 

Sponsor
  • Panoptica Academy: https://panoptica.app/lastweekinaws

Transcript

Nick: Maybe that's where kubernetes has a strength because you get a lot of it for free. It's complicated, but if you figure it out and then create the right abstractions, you can end up being a lot more efficient than trying to manage, you know, a hundred different implementations.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined today by someone rather exciting. You don't often get to talk to, you know, paid assassins, but Nick Eberts is a product manager over at Google, so I can only assume that you kill things for a living.

Nick: Not if we can help it, right? So if you're listening to this, and you're using anything that I make, which is along the lines of GKE, Fleets, multi-cluster stuff.

Please use it. Otherwise, you're gonna make me into a killer.

Corey: This episode's been sponsored by our friends at Panoptica, part of Cisco. This is one of those real rarities where it's a security product that you can get started with for free, but also scale to enterprise grade. Take a look. In fact, if you sign up for an enterprise account, they'll even throw you one of the limited, heavily discounted AWS Skill Builder licenses they got, because believe it or not, unlike so many companies out there, they do understand AWS.

To learn more, please visit panoptica.app/lastweekinaws. That's panoptica.app/lastweekinaws. Exactly. If our customers don't use this, we're going to have to turn it off. That's an amazing shakedown approach. Although let's be honest, every company implicitly does have that. Like if we don't make enough money, we're going to go out of business is sort of the general trend.

At least for the small scale companies, and then at some point it's, ah, we're going to indulge our own corporate ADHD and just lose interest in this thing that we've built and shipped. We'd rather focus on the new things, not the old things. That's boring. But Kubernetes is not boring.

I will say that. One of the things that led to this is, a few weeks before this recording, I wound up giving a talk at the Southern California Area Linux Expo called Terrible Ideas in Kubernetes. Because five years ago, I ran my mouth on Twitter, imagine that, and predicted that no one would care about Kubernetes five years from now.

It would drift below the surface level of awareness that most people had to think about. Oops. I think I'm directionally correct, but I got the timing wrong. I'll blame COVID for it. Why not? And as penance, I installed a Kubernetes of my very own in my spare room on a series of 10 Raspberries Pi and ran a bunch of local workloads on it for basically fun.

And I learned so many things. Uh, I want to say about myself, but no, not really. Mostly about how the world thinks about these things, and what Kubernetes is once you get past conference stage talking points and actually run it yourself. I get the sense you probably know more about this stuff than I do.

I would seriously hope so, anyway. GKE is one of those things where people have said for a long time, the people I trust, most people call them customers, that they have been running Kubernetes in different places and GKE was the most natural expression of it. It didn't feel like you were effectively fighting upstream trying to work with it.

And I want to preface this by saying that so far, all of my Kubernetes explorations personally have been in my on-prem environment, because given the way that all of the clouds charge for data transfer, I can't necessarily afford to run this thing in a cloud environment, which is sad, but true.

Nick: On that note, specifically, I think maybe you've noted this at other times, um, Google Cloud stopped charging for egress.

Corey: You stopped charging for data egress when customers agree to stop using Google Cloud. All three of the big clouds have done this. And I think it's, it's genius from the perspective of, it's a terrific sales tool. If you don't like it, we won't charge you to get your data back. But what hurts people is not, I want to move the data out permanently.

It's the ongoing cost of doing business. Perfect example. I have a 10 node Kubernetes cluster that really isn't doing all that much, and it's spitting out over a hundred gigabytes of telemetry every month, which gets fairly sizable. It would be the single largest expense of running this in a cloud, other than the actual raw compute.

It's doing nothing, but it's talking an awful lot, and we've all had co workers just like that. It's usually not a great experience. So, it's the ongoing ebb and flow, and why is it sending all that data? What is in that data? It gets very tricky to understand and articulate that.

Nick: So, like, you know, the data transfer is interesting.

I mean, I'd want to ask you, what metrics or what signals are you sending out that cross a point at which you would get billed? Because that's interesting to me. I mean, do you not like the in-cloud logging and operations monitoring stuff? Because when we ship metrics there, we're not billing you for it. Now, we are billing you for the storage.

Corey: Sure. And to be fair, storage of metrics has never been something I found prohibitive on any provider. This is again, this is running in my spare room. It is just spitting things out. Like why don't you use the in cloud provided stuff? It's like, well, it's not really a cloud in the traditional sense. And we will come back to that topic in a minute.

But I want to get things out somewhere. In fact, I'm doing this multiple times, which makes this fun. I use Axiom for logs, because that's how I tend to think about this. And I've also instrumented it with Honeycomb. Axiom is what told me it was about 250 gigabytes and climbing the last time I looked at it.

And it's at least duplicating that, presumably, for what gets sent off to Honeycomb as well. I also run Prometheus and Grafana locally, because I want to have what all the cool kids do. And frankly, having a workload that runs Kubernetes means that I can start actively kicking the tires on other products. It's contrived when you try to shove that kind of thing into something running locally on your laptop, and some of my actual standing applications are pure serverless, built on top of Lambda functions, which gets really weird for some visions of what observability should be.

So I have an actual Kubernetes that now I can throw things at and see what explodes.

Nick: Now, that makes sense. I mean, like, listen, I love Honeycomb, and there's a lot of third party tools and providers out there. And one of the things that we do at Google, and it's probably done across the board, is work with them to provide an endpoint, or a data store, or an existence of their service that's local within the cloud, right?

So, if you're using Honeycomb, and that Honeycomb instance that is your SaaS provider actually is an endpoint that's reachable inside of Google Cloud without going out from the network, then you can reduce the cost. So we try to work with them to do things. One example technology we have is Private Service Connect, which allows third party companies to sort of host their endpoint in your VPC, with an IP that's inside of your VPC, right?

So then your egress charges are from a node running in a cluster to a private IP, not going out through the internet. So we're trying to help, because our customers do prefer not to pay large amounts of money to use what is essentially a service that is itself running on Google Cloud.
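[Editor's note: for the curious, here is a rough sketch of what the producer side of that looks like on GKE, using the ServiceAttachment resource GKE exposes for Private Service Connect. This is a hedged illustration, not a recipe from the episode; all names, the namespace, and the NAT subnet are hypothetical placeholders.]

```yaml
# Sketch: publish an internal LB Service through Private Service Connect
# so consumers can reach it via a private IP inside their own VPC.
# Assumes a GKE cluster with the ServiceAttachment CRD available and a
# PSC NAT subnet ("psc-nat-subnet") already created; names are placeholders.
apiVersion: networking.gke.io/v1
kind: ServiceAttachment
metadata:
  name: telemetry-endpoint
  namespace: default
spec:
  connectionPreference: ACCEPT_AUTOMATIC  # accept consumer connections automatically
  natSubnets:
  - psc-nat-subnet                        # subnet reserved for PSC NAT
  proxyProtocol: false
  resourceRef:
    kind: Service
    name: telemetry-ingest-ilb            # an existing internal LoadBalancer Service
```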

Corey: I do confess to having a much deeper understanding of the AWS billing architectures and the challenges among them. But one of the big challenges I've found, let's get into this, this leads naturally in from the overall point that a few of us here at the Duck Bill Group have made on Twitter, which is how you and I started talking.

Specifically, we have made the assertion that Kubernetes is not cloud native, which sounds an awful lot like clickbait, but it is a sincerely held belief. It's not one of those somebody-needs-to-pay-attention-to-me takes. No, no, no, I have better stunts for that. This is based upon a growing conviction I've developed from the way that large companies are using Kubernetes on top of a cloud provider, and how Kubernetes itself works.

It sounds weird to say that when I have built this on a series of Raspberries Pi in my spare room. That's not really what it's intended for or designed to do, but I would disagree. Because what a lot of folks are doing is treating Kubernetes as a multi-cloud API, which I think is not the worst way to think of it.

If you have a bunch of servers sitting in a rack somewhere, how are you going to run workloads on them? How are you going to divorce workloads from the underlying hardware platform? How do you start migrating them to handle hardware failures, for example? Kubernetes seems to be a decent answer to this.

It's almost a cloud in and of itself. It's similar to a data center operating system. It's, it's realizing the vision that OpenStack sort of defined but could never realize.

Nick: No, that's 100 percent it. And you're not going to get an argument from me there. Running your applications in Kubernetes does not make them cloud native.

One of the problems with this argument in general is that who can agree on what cloud native actually means?

Corey: It means I have something to sell you, in my experience.

Nick: Right. My interpretation sort of adheres to the value prop of what the cloud was when it came out: flexible, just pay for what you want when you need it, scale out on demand, these kinds of things.

So applications definitely are not immediately cloud native when you put them in Kubernetes. You have to do some work to make them autoscale. You have to do some work to make them stateless, maybe 12 factor, if you will, if you want to go back like a decade. Yeah, you can't take a Windows app that's a monolith, run it on Kubernetes clusters that have Windows node support, and then call it cloud native.
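[Editor's note: to make the "work to make them autoscale" half of that concrete, here is a minimal sketch of the usual first step: a HorizontalPodAutoscaler scaling a Deployment on CPU. Names are hypothetical.]

```yaml
# Minimal sketch: scale a Deployment between 2 and 10 replicas,
# targeting 70% average CPU utilization. Assumes metrics-server is
# installed and the Deployment sets CPU requests; names are placeholders.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```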

Also, not all applications need to be cloud native. That is not the metric that we should be measuring ourselves by. So, it's fine. Kubernetes is, or is becoming, the lowest common denominator of compute. That's the point. If you have to build a platform, or you're a business that has several businesses within it and you have to support a portfolio of applications, it's more likely that you'll be able to run a high percentage of them on Kubernetes

than you would on some fancy PaaS. Like, that's been the death of every PaaS. It's, ooh, this is really cool, but I have to rewrite all my applications in order to fit into this paradigm.

Corey: I built this thing and it's awesome for my use case. And it's awesome right until it gets a second user, at which point the whole premise falls completely to custard.

It's awful. It's a common failure pattern where anyone can solve something for their own use cases, but how do you make it extensible? How do you make it more universally applicable? And the way that Kubernetes has done this has been, effectively, that you're building your own cloud when you're using Kubernetes, to no small degree.

One of the cracks I made in my talk, for example, was that Google has a somewhat condescending and difficult engineering interview process, so if you can't pass through it, the consolation prize is you get to cosplay as working at Google by running Kubernetes yourself. And the problem when you start putting these things on top of other cloud provider abstractions is you have a cloud within a cloud, and to the cloud provider, what you've built looks an awful lot like a single tenant app with very weird behavioral characteristics that, for all intents and purposes, remain non-deterministic.

So, as a result, you're staring at this thing where the cloud provider says, well, you have an app and it's doing some things, and the native understanding of what your workload looks like from the position of that cloud provider becomes obfuscated through that level of indirection. It effectively winds up creating a host of problems while solving for others.

As with everything, it's built on trade offs.

Nick: Yeah, I mean, not everybody needs a Kubernetes, right? There's a certain complexity that you have to have in the applications you need to support before it's beneficial, right? It's not just immediately beneficial. A lot of the customers that I work with, somewhat to my, I don't want to say dismay, are doing the hybrid cloud thing of running an application across multiple clouds. And Kubernetes helps them there because, while it's not identical on every single cloud, it does take care of like 80, maybe 85, 90 percent of the configuration, and the application itself can be treated the same across these three different clouds. There's, you know, 10 percent that's different per cloud provider, but it does help to that degree.

We have customers that can hold us accountable. They can say, you know what, this other cloud provider is doing something better or giving it to us cheaper; we have a dependency on open source Kubernetes and we built all our own tooling, so we can move. And it works for them.

Corey: That's one of those things that has some significant value for folks.

I'm not saying that Kubernetes is not adding value. And again, nothing is ever an all-or-nothing approach. But here's an easy example where I tend to find a number of my customers struggling. Most people will build a cluster to span multiple availability zones over in AWS land, because that is what you are always told.

Oh, well, yeah, we constrain blast radiuses, er, radii. So of course, we're going to be able to sustain the loss of an availability zone, so you want to be able to have traffic flow between those. Great. The problem is, it costs two cents per gigabyte to transfer data between availability zones, and Kubernetes itself is not in any way zone aware.

It has no sense of pricing for that. So it'll just as cheerfully toss something over a two gigabit link to another zone as to the thing right next to it for free, and that winds up in many cases bloating those costs. It's one of those areas where, if the system understood its environment and the environment understood its system a little bit better, this would not happen. But it does.
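[Editor's note: upstream Kubernetes does have partial answers here, as Nick notes next. A hedged sketch of two of them: topology-aware routing on a Service, plus a topology spread constraint so each zone has local endpoints to route to. All names are hypothetical.]

```yaml
# Sketch: keep Service traffic inside the caller's zone where possible.
# Topology-aware routing only activates when there are enough ready
# endpoints in each zone; names are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: backend
  annotations:
    service.kubernetes.io/topology-mode: "Auto"  # prefer same-zone endpoints
spec:
  selector:
    app: backend
  ports:
  - port: 80
    targetPort: 8080
---
# Spread replicas across zones so every zone has a local endpoint.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
spec:
  replicas: 6
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: backend
      containers:
      - name: app
        image: registry.example.com/backend:v1  # hypothetical image
        ports:
        - containerPort: 8080
```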

Nick: So I have worked on Amazon. I didn't work for them, but I used EC2 for two or three years; that was my first foray into cloud. I then worked for Microsoft, so I worked on Azure for five years, and now I've been at Google for a while. So I will say this: my information on Amazon is a little bit dated, but I can tell you from a Google perspective, for that specific problem you call out, there are at least upstream Kubernetes configurations that allow you to have affinity with transactions. It's complicated, though, it's not easy. Also, one of the things that I'm responsible for is building this idea of fleets. The idea of fleets is that you have n number of clusters that you sort of manage together.

And not all of those clusters need to be homogenous, but pockets of them are homogenous, right? And so one of the patterns that I'm seeing our bigger customers adopt is to create a cluster per zone. They stitch them together with the fleet, use namespace sameness, treat everything the same across them, slap a load balancer in front, but then silo the transactions in each zone, so they have an easy and efficient way to ensure that, you know, interzonal costs are not popping up.
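[Editor's note: on GKE, the "stitch them together and slap a load balancer on front" step is typically done with multi-cluster Services and Ingress applied to the fleet's config cluster. A rough sketch under that assumption; every name here is a placeholder.]

```yaml
# Sketch: expose the same namespace/Service across the clusters in a
# fleet, then put one HTTP load balancer in front. Applied to the
# fleet's config cluster; assumes multi-cluster Ingress is enabled and
# relies on namespace sameness across members. Names are placeholders.
apiVersion: networking.gke.io/v1
kind: MultiClusterService
metadata:
  name: web-mcs
  namespace: web
spec:
  template:
    spec:
      selector:
        app: web
      ports:
      - name: http
        protocol: TCP
        port: 8080
        targetPort: 8080
---
apiVersion: networking.gke.io/v1
kind: MultiClusterIngress
metadata:
  name: web-ingress
  namespace: web
spec:
  template:
    spec:
      backend:
        serviceName: web-mcs
        servicePort: 8080
```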

Corey: In many cases, that's the right approach. I learned this from some former Google engineers back in the noughts, back when being a Google engineer was the sort of thing where a hush came over the room and everyone leaned in to see what this genius wizard would be able to talk about. It was a different era, on some level.

And one of the things I learned was that in almost every scenario, when you start trying to build something for high availability, and this is before cross-AZ data transfer was even on anyone's radar, uh, but for availability alone, you have a, you know, phantom router that is there to take over in case the primary fails. The number one cause of outages, and it wasn't particularly close, was a failure in the heartbeat protocol or the control handover.

So rather than trying to build, uh, data center pods that were highly resilient, the approach instead was: alright, load balance between a bunch of them and constrain transactions within them, but make sure you can fail over reasonably quickly, effectively, and automatically. Because then you can just write off a data center in the middle of the night when it fails, fix it in the morning, and the site continues to remain up.

That is a fantastic approach. Again, having built this in my spare room, at the moment I just have the one. I feel like after this conversation, it may split into two, just on the sheer sense that this is what smart people tend to do at scale.

Nick: Yeah, it's funny, so, uh, when I first joined Google, I was super interested in going through their, like, SRE program.

And one thing that's great about this company that I work for now is they give you the time and the opportunities. I wanted to go through what SREs go through when they come on board and train. So I went through the training

Corey: process, and I believe that process is called hazing, but continue.

Nick: Yeah. But the funniest thing is, you go through this and you're actually playing with tools and affecting real Borg cells, and using all of the Google terms to do things, um, obviously not in production. And then you have these tests, and most of the time, the answer to the test was: drain the cell.

Just turn it off. And then turn

Corey: another one on. It's the right approach in many cases. That's what I love about the container world: it becomes ephemeral. That's why observability is important, because you had better be able to get the telemetry for something that stopped existing 20 minutes ago to diagnose what happened.

But once you can do that, it really does free up a lot of things, mostly. But even that I ran into significant challenges with. I come from the world of being a grumpy old sysadmin, and I've worked with data center remote hands employees that were, yeah, let's just say that was a heck of a fun few years.

So the first thing I did once I got this up and running, got a workload on it, is I yanked the power cord out of the back of one of the nodes that was running a workload, like I was rip-starting a lawnmower enthusiastically at two in the morning, like someone might have done to a core switch once.

But, yeah, it was: okay, so I'm waiting for the cluster to detect that the pod is no longer there and reschedule it somewhere else. And it didn't, for two and a half days. Now, there are ways to configure this, and you have to make sure the workload is aware of this.

But again, to my naive understanding, part of the reason that people go for Kubernetes the way that they do is that it abstracts the application away from the hardware and you don't have to worry about individual node failures. Well, apparently I have more work to do.
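[Editor's note: the behavior Corey hit is governed by taint-based eviction. When a node goes unreachable, its pods are evicted only after their toleration for the unreachable taint expires; Kubernetes injects a 300-second default, and stateful workloads with attached volumes can hang much longer. A minimal sketch of tightening that window, with hypothetical names:]

```yaml
# Sketch: evict pods 30s after their node becomes unreachable or
# not-ready, instead of the 300s toleration Kubernetes injects by
# default. Goes in a Deployment's pod template; values are examples.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      tolerations:
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 30
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 30
      containers:
      - name: app
        image: registry.example.com/web:v1  # hypothetical image
```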

Nick: These are things that are tunable and configurable.

Uh, one of the things that we strive for on GKE is to bake in a lot of these best practices. Like, this would be a best practice: recovering the node, and reducing the amount of time it takes for the disconnection to actually release whatever lease is holding that pod on that particular node. We do all this stuff in GKE, and we don't even tell you we're doing it, because we just know that this is the way that you do things. And I hope that other providers are doing something similar, just to make it easier.

Corey: They are. This is, again, something I've only done in a bare metal sense. I intend to do it on most of the major cloud providers at some point over the next year or two. Few things are better for your career and your company than achieving more expertise in the cloud. Security improves, compensation goes up, employee retention skyrockets.

Panoptica, a cloud security platform from Cisco, has created an academy of free courses just for you. Head on over to academy.panoptica.app to get started. The most common problem I had was all related to the underlying storage subsystem. Longhorn is what I use.

Nick: I was going to say, can I give you a fun test?

When you're doing this on all the other cloud providers, don't use Rancher or Longhorn. Use their persistent disk option.

Corey: Oh, absolutely. The reason I'm using Longhorn, to be very clear on this, is that I don't trust individual nodes very well. And yeah, EBS or any of the providers' block storage options are far superior to what I'll be able to achieve with local hardware.

Because I don't happen to have a few spare billion dollars in engineering lying around to abstract a really performant, really durable, uh, block store. That's not on my list.

Nick: Well, so I think all the cloud providers have a really performant, durable block store that's presented as disk, right? They all do.

They all do. But the real test is when you, when you rip out that node or effectively unplug that, that network interface, How long does it take for their storage system to release the claim on that disk and allow it to be attached somewhere else? That's the test.

Corey: Exactly. And that is a great question.

And the honest answer is, there are ways, of course, to tune all of these things across every provider. I did no tuning, which means the time was effectively infinite, as best I could tell. And it wasn't just for this; I had a number of challenges with the storage provider over the course of a couple of months.

And it's challenging. I mean, there are other options out there that might have worked better. I switched all the nodes that have a backing store over to using relatively fast SSDs, because having it on SD cards seemed like it might have been a bottleneck. And there were still challenges, in ways I did not inherently expect.

Nick: That makes sense. So can I ask you a question? Please. If Kubernetes is too complicated, let's just say, okay, it is complicated, it's not good for everything, but PaaS, well, most PaaSes are a little bit too constrictive, right? Their opinions are too strong. Most of the time I have to use a very explicit programming model to take advantage of them.

That leaves us with VMs in the cloud, really.

Corey: Yes and no. For one workload right now that I'm running in AWS, I've had great luck with ECS, which is, of course, despite their word about ECS Anywhere, a single cloud option. Let's be clear on this: you are effectively agreeing to lock-in of some form. But it does have some elegance, because of how it was designed in ways that resonate with the underlying infrastructure in which it operates.

Nick: Yeah, no, that makes sense. I guess what I was trying to get at, though, is if ECS wasn't an option and you had to build these things yourself. In my experience working with customers, because before I was a PM I was very much field, um, consultant, customer engineer, solution architect, all those words, customers just ended up rebuilding Kubernetes. They built something that autoscaled, they built something that had service discovery, they built something that repaired itself. They ended up recreating a good bit of the API, is what I found.

Now, ECS is interesting. It's a little bit hairy if you try to implement something that's got smaller services that talk to each other, as opposed to just having one service that you're autoscaling behind a load balancer.

Corey: Yeah, they talked about S3 on stage at one point with something like 300-and-some-odd microservices that all combine to make the thing work.

Which is phenomenal. I'm sure it's the right decision for their workloads and whatnot. I felt like I had to jump on that as soon as it was said, just as a warning: this is what works for a global-hyperscale, centuries-long thing that has to live forever. Your blog does not need to do this. This is not a to-do list.

So, but yeah, back when I was doing this stuff in Anger, uh, which is of course my name for production, as opposed to the staging environment, which is always called Theory, because it works in Theory but not in production. Exactly. Back when I was running things in Anger, it was before containers had gotten big, so it was always, uh, take AMIs to a certain point and then do configuration management and code deploys in order to get them to current.

And yeah, then we bolted on all the things that Kubernetes now offers, the things any system has to offer. Kubernetes didn't come up with these concepts. The idea of durability, of autoscaling, of load balancing, of service discovery: those things inherently become problems that need to be solved for. Kubernetes has solved some of them in very interesting, very elegant ways.

Others it has solved by, oh, you want an answer for that? Here's 50, pick your favorite. And I think we're still seeing best practices continue to emerge.

Nick: No, we are. And I did the same thing. In my first role where I was using cloud, we were rebuilding an actuarial system on EC2. And the value prop, obviously, for our customers was like, hey, you don't need to rent a thousand cores for the whole year from us.

You could just use them for the two weeks that you need them. Awesome, right? That was my first foray into infrastructure as code. I was using Python and the Boto SDK and just automating the crap out of everything. And it worked great. But I imagine that if I had stayed on in that role, repeating that process for n number of applications would start to become a burden.

So you'd have to build some sort of template, some engine, and you'd end up with an API. Once it gets beyond a handful of applications, I think maybe that's where Kubernetes has a strength, because you get a lot of it for free. It's complicated, but if you figure it out and then create the right abstractions for the different user types you have, you can end up being a lot more efficient than trying to manage, you know, a hundred different implementations.

Corey: We see the same thing right now. Whenever someone starts their own new open source project, or even starts something new within a company, great. The problem I've always found is building the CI/CD process. How do I hook it up to GitHub Actions or whatever it is to fire off a thing? Until you build sort of what looks like an internal company platform, you're starting effectively at square one each and every time.

I think that building an internal company platform at anything other than giant scale is probably ridiculous. But it is something that people are increasingly excited about, so it could very well be that I'm the one who's wrong on this. I just know that every time I build something new, there's a significant boundary between me being able to yolo-slam this thing into place and having merges into the main branch wind up getting automatically released through a process that has some responsibility to it.

Nick: Yeah. I mean, there's no perfect answer for everybody, but I do think you'll get to a certain point where the complexity warrants a system like Kubernetes. But also, the CI/CD angle of Kubernetes is not unique to Kubernetes either. I mean, you're just talking about pipelines.

We've been using pipelines forever.

Corey: Oh, absolutely. And even that doesn't give it to you out of the box. You still have to play around with getting Argo, or whatever it is you choose to use, set up.
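[Editor's note: for the curious, "getting Argo set up" is mostly pointing it at a repository. A minimal Argo CD Application sketch; the repo URL, paths, and namespaces are hypothetical.]

```yaml
# Sketch: tell Argo CD to keep the "deploy/" directory of a Git repo
# continuously synced into the "demo" namespace. Assumes Argo CD is
# installed in the argocd namespace; repo URL and paths are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: demo-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/demo-app.git
    targetRevision: main
    path: deploy
  destination:
    server: https://kubernetes.default.svc
    namespace: demo
  syncPolicy:
    automated:
      prune: true     # delete resources removed from Git
      selfHeal: true  # revert manual drift back to the Git state
```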

Nick: Yeah. It's funny, actually, a weird tangent: I take this weird offense when people use the term GitOps like it's new.

So first of all, as an aged man who's been in this industry for a while, we've been doing GitOps for quite some time. Now, if you're specifically talking about a pull model, fine, that may be unique. But GitOps is simply just, hey, I'm making a change in source control, and then that change is getting reflected in an environment.

That's how I consider it. What do you think?

Corey: Well, yeah, we store all of our configuration in Git now. It's called GitOps. What were you doing before? Oh, yeah: go retrieve the previous copy of what the configuration looked like. It's called copyof-copyof-copyof-thing.bak.cjq.usethisone.doc.zip. Yeah, it's great.

Nick: That's, yeah, that's even going further back. Let me please make a change to something that's in a file store somewhere and copy that down to X number of, uh, VMs, or even, you know, hardware machines just running across my data center. And hopefully that configuration change doesn't take something down.

Corey: Yeah. The idea of blast radius starts to become very interesting, and canary deployments and, you know, all the things that you basically rediscover from first principles every time you start building something like this. It feels like Kubernetes gives you a bunch of tools that are effective for building a lot of those things.

But you still need to make a lot of those choices and implementation decisions yourself. And it feels like whatever you choose is not necessarily going to be what anyone else has chosen. It seems like it's easy to wind up in unicorn territory fairly quickly.

Nick: But I just, I don't know. I think as we're thinking about what the alternative to a Kubernetes is, or what the alternative to a PaaS is, I don't really see anyone building a platform to run old, shitty apps.

Who's going to run that platform? Because that's, what, 80 percent of the market of workloads that are out there that need to be improved. So we're either waiting for these companies to rewrite them all, or we're going to make their lives better somehow.

Corey: That's what makes containers so great in so many ways.

It's not the best approach, obviously, but it works. You can just take something that is 20 years old, written in some ancient version of something, and shove it into a container as a monolith. Sure, it's an ugly, big container, but then you can at least start moving it from place to place and unwinding your dependency rat's nest.
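[Editor's note: a hedged sketch of what that lift-and-shift usually amounts to once the old app is in an image. Everything here, the image name, ports, and sizes, is a placeholder.]

```yaml
# Sketch: run a containerized legacy monolith as-is. One replica,
# generous resources, no autoscaling: portable, but not cloud native.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: legacy-monolith
spec:
  replicas: 1            # many monoliths can't safely run two copies
  strategy:
    type: Recreate       # avoid two instances overlapping during updates
  selector:
    matchLabels:
      app: legacy-monolith
  template:
    metadata:
      labels:
        app: legacy-monolith
    spec:
      containers:
      - name: app
        image: registry.example.com/legacy-monolith:v1  # hypothetical image
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "2"
            memory: 4Gi
```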

Nick: That's how I think about it, only because, like I said, I've spent 10, 12 years working with a lot of customers trying to unwind these old applications. And a lot of the time they lose interest in doing it pretty quickly, because they're making money and there's not a whole lot of incentive for them to break the applications up and do anything with them.

In fact, I often theorize that whatever their business is, the real catalyst for change is when another startup or another smaller company comes up, does it more cloud natively, and beats their pants off in the market, which then forces them to adjust. But that kind of stuff doesn't happen in, like, payment transaction companies. There's a heavy price to pay to even be in that business, so what's the incentive for them to change?

Corey: I think that there's also a desire on the part of technologists many times, and I'm as guilty as anyone of this, to walk in and say, this thing's an ancient piece of crap, what's it doing?

And the answer is, oh, about 4 billion in revenue, so maybe mind your manners. And, oh, yeah, okay. Is this engineeringly optimal? No, but it's kind of load bearing, so we need to work with it. People are not still using mainframes because they believe that in 2024 they're going to greenfield something and that's the best they'd be able to come up with. It's because that's what they went with 30, 40 years ago, and there has been so much business process built around its architecture, around its constraints, around its outputs, that unwinding that hairball is impossible.

Nick: It is a bit impossible. And also, is it bad? Those systems are pretty reliable. The only downside is the cost of whatever IBM is going to charge you for support.

Corey: So we're going to re-architect and then migrate it to the cloud. Yeah, because that'll be less expensive. Good call. It's always a trade-off.

Economics are one of those weird things where people like to think of the cash dollars they pay vendors as the end-all, be-all. But they forget the most expensive thing that every company has to deal with is its personnel. The payroll costs dwarf cloud infrastructure costs, unless you're doing something truly absurd at a very small scale of company. I've never heard of a big company that spends more on cloud than it does on people.

Nick: Oh, that's an interesting data point. I figured we'd need at least a handful of them, but interesting.

Corey: I mean, you see it in some very small companies where it's like, all right, we're a two person startup and we're not taking market rate salaries and we're doing a bunch of stuff with AI, and okay, yeah, I can see driving that cost into the stratosphere. But you don't see it at significant scale.

In fact, for most companies that are legacy, which is the condescending engineering term for "it makes money", which means in turn that it was founded more than five years ago, the number two expense is real estate, more so than infrastructure costs. Sometimes, yeah, you can count data centers as part of that, but office buildings are very expensive.

Then there's the question of, okay, cloud is usually number three. But there are exceptions. Because they're public, we can talk about this one: Netflix has said for a long time that their biggest cost driver, even beyond people, has been content. Licensing all of that content and producing all of that content is not small money.

So there are going to be individual weird companies doing strange things. But it's fun. You also get this idea that, oh, no one can ever run on-prem anymore. Well, not for nothing, but technically Google is on-prem. So is Amazon. They're not just these magic companies that are the only ones who remember how to replace hardware and walk around between racks of servers.

It's just: is it economical? When does it make sense to start looking at these things? And even strategically, do you want to tie yourself indelibly to a particular vendor? Because people remember the mainframe mistake with IBM. Even if I don't move this off of Google or off of Amazon today, I don't want it to be impossible to do so in the future.

Kubernetes does present itself as a reasonable hedge.

Nick: Yeah, it neutralizes that vendor lock-in if you run your own data centers or whatever, but then a lot of the time you end up getting locked into specific hardware, which is not that different from cloud. I do work with a handful of customers who are sensitive to even very specific versions of chips, right?

They need version N because it gives them 10 percent more performance, and at the scale they're running, that's something that's very important to them.

Corey: Yeah. One last question before we wind up calling this an episode, and I'm curious to get your take on it given that you work over in product: where do you view the dividing line between Kubernetes and GKE?

Nick: So this is actually a struggle that I have, because I am historically much more open source oriented and about the community itself. I think it's our job to bring the community up, to bring Kubernetes up. But of course it's a business, right? So the dividing line that I like to think about is the cloud provider code: the ways that we can make it work better on Google Cloud without really making the API weird.

Right. We don't want to run some version of the API that you can't run anywhere else.

Corey: Yeah. Otherwise you'd just roll out Borg and call it a day.

Nick: Yeah. But when you use a load balancer, we want it to be fast, smooth, seamless, and easy. When you use, uh, persistent storage, we have persistent storage that automatically replicates the disk across zones, so that when one thing fails, you go to the other one and it's nice and fast, right?
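[Editor's note: the replicated-disk behavior Nick describes maps to GKE's regional persistent disks, which keep a synchronized replica of a volume in a second zone so a pod can reattach it after a zonal failure. A hedged StorageClass sketch; the region and zone names are placeholders.]

```yaml
# Sketch: a StorageClass for regional persistent disks on GKE. PVCs
# using it get a disk replicated between the two listed zones.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: regional-pd
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-balanced
  replication-type: regional-pd
volumeBindingMode: WaitForFirstConsumer  # bind after the pod is scheduled
allowedTopologies:
- matchLabelExpressions:
  - key: topology.gke.io/zone
    values:
    - us-central1-a
    - us-central1-b
```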

So like, these are the little things that we try to do to make it better. Another example, and it's specifically the product that I work on, is GKE Fleets: we're working upstream with cluster inventory to ensure that there is a good way for customers to take a dependency on our fleets without getting locked in, right?

So we adhere to this open source standard. Third party tool providers can build fun implementations that then work with fleets. And if other cloud providers decide to take that same dependency on cluster inventory, then we've just created another good abstraction for the ecosystem to grow around, without forcing customers to lock into specific cloud providers to get valuable services.

Corey: It's always a tricky balancing act because at some level, being able to integrate with the underlying ecosystem it lives within makes for a better product, a better customer experience, but then you get accused of trying to drive lock in.

Nick: If you talk to my skip, Drew Bradstock, who runs cloud container runtimes for all of Google Cloud, I think he would say, and I agree with him here, that we're trying to get you to come over to use GKE because it's a great product, not because we want to lock you in. So we're doing all the things to make it easier for you, because you listed out a whole lot of complexity. We're really trying to remove that complexity, so that when you're building your platform on top of Kubernetes, there's maybe, I don't know, 30 percent less you have to do when you do it on Google Cloud.

We're on track.

Corey: Yeah, it would be nice. But I really want to thank you for taking the time to speak with me. If people want to learn more, where's the best place for them to find you?

Nick: Um, if people want to learn more, I'm active on Twitter, so hopefully you can just add my handle in the show notes.

And also, if you're already talking to Google, feel free to drop my name and bring me into any call you want as a customer. I'm happy to jump on and help work through things. I have this crazy habit where I can't get rid of old habits, so I don't just come on the calls as a PM and help you.

I actually put on my architect consultant hat, and, like, I can't turn that part off.

Corey: I don't understand how people can engage in the world of cloud without that skill set and background. Personally, it's so core and fundamental to how I view everything. I mean, I'm sure there are other paths. I just have a hard time seeing it.

Nick: Yeah. It's a lot less about, let me pitch this thing to you, and much more about, okay, well, how does this fit into the larger ecosystem of things you're doing, the problems you're solving? Because, I mean, we didn't get into it on this call, and I know it's about to end, so we shouldn't.

But Kubernetes is just a runtime. There's like a thousand other things that you have to figure out with an application sometimes, right? Like storage, bucket storage, databases, IAM.

Corey: Yeah, that is a whole separate kettle of nonsense. You won't like what I did locally, but that's beside the point.

Nick: Uh, but are you allowing all anonymous access?

Corey: Exactly. The trick is, if you harden the perimeter well enough, then nothing is ever going to get in, so you don't have to worry about it. Let's also be clear, this is running a bunch of very small scale stuff. It does use a real certificate authority, but still.

Nick: I have the most secure Kubernetes cluster of all time running in my house back there.

Yeah, it's turned off.

Corey: Even then, I'd still feel better if it were, uh, sunk in concrete and dropped into a river somewhere, but you know, we'll get there. Thank you so much for taking the time to speak with me. I appreciate it.

Nick: No, I really appreciate your time. This has been fun. Um, you're a legend, so keep going.

Corey: Oh, I'm something alright. I think I'd have to be dead to be a legend. Uh, Nick Eberts, Product Manager at Google. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you enjoyed this podcast, please leave a 5 star review on your podcast platform of choice, or on the YouTubes. Whereas if you hated this podcast, please continue to leave a 5 star review on your podcast platform of choice, along with an angry, insulting comment saying that that is absolutely not how Kubernetes and hardware should work, but remember to disclose which large hardware vendor you work for in that response.

Nick: Maybe that's where kubernetes has a strength because you get a lot of it for free. It's complicated, but if you figure it out and then create the right abstractions, you can end up being a lot more efficient than trying to manage, you know, a hundred different implementations.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined today by Someone rather exciting. You don't often get to talk to, you know, paid assassins, but Nick Eberts is a product manager over at Google, so I can only assume that you kill things for a living.

Nick: Not if we can help it, right? So if you're listening to this, and you're using anything that I make, which is along the lines of GKE, Um, fleets, multi cluster stuff.

Please use it. Otherwise, you're going to make me into a killer.

Corey: This episode is sponsored in part by my day job, the Duck Bill Group. Do you have a horrifying AWS bill? That can mean a lot of things. Predicting what it's going to be. Determining what it should be. Negotiating your next long term contract with AWS, or just figuring out why it increasingly resembles a phone number, but nobody seems to quite know why that is.

To learn more, visit duckbillgroup. com. Remember, you can't duck the duck bill, Bill. And my CEO informs me that is absolutely not our slogan. Exactly. If our customers don't use this, we're going to have to turn it off. That's an amazing shakedown approach. Although, let's be honest, every company implicitly does have that.

Like, if we don't make enough money, we're going to go out of business, is sort of the general trend. At least for the small scale companies, and then at some point it's, Ah, we're going to indulge our own corporate ADHD and just lose interest in this thing that we've built and shipped. We'd rather focus on the new things, not the old things.

That's boring. But Kubernetes is not boring. I will say that. One of the things that led to this is, A few weeks before this recording, I wound up giving a talk at the Southern California Area Linux Expo called Terrible Ideas in Kubernetes. Because five years ago, I ran my mouth on Twitter, imagine that, and predicted that no one would care about Kubernetes five years from now.

It would drift below the surface level of awareness that most people had to think about. Either I think I'm directionally correct, but I got the timing wrong. I'll blame COVID for it. Why not? And as penance, I installed a Kubernetes of my very own in my spare room on a series of 10 raspberries Pi and ran a bunch of local workloads on it for basically fun.

And I learned so many things. Uh, I want to say about myself, but no, not really. Mostly about how the world thinks about these things and, and how, what Kubernetes is once you get past conference stage talking points and actually run it yourself. I get the sense you probably know more about this stuff than I do.

I would seriously hope anyway. GKE is one of those things where people have said for a long time, the people I trust, most people call them customers, of that they have been running Kubernetes in different places and GKE was the most natural expression of it. It didn't feel like you were effectively fighting upstream trying to work with it.

And I want to preface this by saying so far all of my Kubernetes explorations personally have been in my on prime environment because given the way that all of the clouds charge for data transfer, I can't necessarily afford it. Ford to run this thing in a cloud environment, which is sad, but true.

Nick: On that note, specifically, I think maybe you've noted this at other times, um, Google Cloud stopped charging for egress.

Corey: You stopped charging for data egress when customers agree to stop using Google Cloud. All three of the big clouds have done this. And I think it's, it's genius from the perspective of, it's a terrific sales tool. If you don't like it, we won't charge you to get your data back. But what hurts people is not, I want to move the data out permanently.

It's the ongoing. The cost of doing business. Perfect example. I have a 10 node Kubernetes cluster that really isn't doing all that much. It's spitting out over a hundred gigabytes of telemetry every month, which gets fairly sizable. It would be the single largest expense of running this in a cloud expense other than the actual raw compute.

And, uh, It's doing nothing, but it's talking an awful lot, and we've all had co workers just like that. It's usually not a great experience. So, it's the ongoing ebb and flow, and why is it sending all that data? What is in that data? It gets very tricky to understand and articulate that.

Nick: So like, no, the data transfer is interesting.

I mean, I'd, I'd want to ask you what metrics or, or or what signals are you sending out to, uh, cross a point in which you would get built? 'cause that's interesting to me. I mean, do you not like the in cloud logging and operations monitoring stuff? Because when we ship metrics there, we're not billing for it.

Um, now we are billing you for the, the storage.

Corey: Sure, and to be fair, storage of metrics has never been something I found prohibitive on any provider. This is, again, this is running in my spare room. It is just spitting things out like, why do you use the in cloud provided stuff? It's like, well, it's not really a cloud in the traditional sense.

And we will come back to that topic in a minute. But, um, But I want to get things out somewhere. In fact, I'm doing this multiple times, which makes this fun. I use Axiom for logs, because that's how I tend to think about this. And I've also instrumented it with Honeycomb. Axiom is what told me it were about 250 gigabytes and climbing the last time I looked at it.

And it's at least duplicating that, presumably, for what gets sent off to Honeycomb as well. I also run Prometheus and Grafana locally, because I want to have all the cool kids do. And And frankly, having a workload that runs Kubernetes means that I can start actively kicking the tires on other products that really are, it's contrived, you try and shove it into just this thing locally on your laptop or something that, like, I've had some of my actual standing applications are for pure serverless build on top of Lambda functions, that gets really weird for some visions of what observability should be.

So I have an actual Kubernetes that now I can throw things at and see what explodes.

Nick: Now, that makes sense. I mean, like, listen, I love Honeycomb, and there's a lot of good third party, third party tools out there and providers. And one of the things that we do at Google, probably done across the board, is, is work with them to provide an endpoint location or, or a data store or an existence of their service that's local within the cloud, right?

So, If you're using Honeycomb and that Honeycomb instance that is your SaaS provider actually is an endpoint that's reachable inside of Google Cloud without going out from the network, then you can reduce the cost. So we try to work with them to do things. One example technology we have is Private Service Connect, which allows you, uh, third party companies to sort of host their endpoint in your VPC with an IP that's inside of your VPC, right?

So then, then you're not, you're, you're, you're egress charges. Or from a node running in a cluster to a private IP not going out through the internet. So we're trying to help because our customers do prefer not to pay large amounts of money to use. Essentially, it's a service that most of these services are running on Google Cloud.

Corey: I do confess to having a much deeper understanding of the AWS billing architectures and challenges among it. But one of the big challenges I found Let's get into this. This will lead naturally into this from the overall point that a few of us here at the Duck Bill Group have made on Twitter, which is how you and I started talking.

Specifically, we have made the assertion that Kubernetes is not cloud native, which sounds an awful lot like clickbait, but it is a sincerely held belief. It's, it's not one of those, somebody needs to pay attention to me. No, no, no. There are, I have better stunts for that. This is, this is based upon a growing conviction that I've seen from The way that large companies are using Kubernetes on top of a cloud provider and how Kubernetes itself works.

It sounds weird to have said that, to say that I have built this on a series of raspberries pie in my spare room. That's not really what it's intended for or designed to do, but I would disagree because What a lot of folks are using is treating Kubernetes as a multi cloud API, which I think is not the worst way to think of it.

If you have a bunch of servers sitting in a rack somewhere, what, how are you going to run workloads on it? How are you going to divorce workloads from the underlying hardware platform? How do you start migrating it to handle hardware failures, for example? Kubernetes seems to be a decent answer on this.

It's almost a cloud in and of itself. It's similar to a data center operating system. It's, it's realizing the vision that OpenStack sort of defined but could never realize.

Nick: No, that's 100 percent it. And you're not going to get an argument from me there. Kubernetes, running your applications in Kubernetes do not make them cloud native.

One of the problems with this argument in general is that Who can agree on what cloud native actually means?

Corey: It means I have something to sell you, in my experience.

Nick: Right. In my interpretation, it sort of adheres to like the value prop of what the cloud was when it came out. Flexible, you just pay for what you want when you need it, scale out on demand, these kinds of things.

So applications definitely are not for sale. immediately cloud native when you put them in Kubernetes. You have to do some work to make them autoscale. You have to do some work to make them stateless, maybe 12 factor, if you will, if you want to go back like a decade. Yeah, you can't take a Windows app, run it on Kubernetes clusters that have Windows Node support, that's a monolith, and then call it cloud native.

Also, not all applications need to be cloud native. That is not the metric that we should be measuring ourselves by. So, it's fine. Kubernetes is the lowest common denominator, or it's becoming the lowest common denominator of compute. That's the point. If you have to build a platform, or you're a business that has several businesses within it, and you have to support a portfolio of applications, it's more likely that you'll be able to run a high percentage of them, a high percentage of them on Kubernetes.

than you would on some fancy PaaS. Like that's, that's been the death of all PaaS. It's like, Ooh, this is really cool. I have to rewrite all my applications in order to fit into this paradigm.

Corey: I built this thing and it's awesome for my use case. And it's awesome right until it gets a second user at which point the whole premise falls completely to custard.

It's a custard. It's awful. There's a, it's a common failure pattern where anyone can solve something to solve for their own use cases, but how do you make it extensible? How do you make it more universally applicable? And the way that Kubernetes has done this has been to effectively, you're building your own cloud when you're using Kubernetes to no small degree.

One of the cracks I made in my talk, for example, was that Google has a somewhat condescending and difficult engineering interview process. So if you can't pass through it, the consolation prize is you get to cosplay as working at Google by running Kubernetes yourself. And the problem when you start putting these things on top of other cloud provider abstractions, is you have a cloud within a cloud, and to the, to the cloud provider, What you've built looks an awful lot like a single tenant app with very weird behavioral characteristics that for all intents and purposes remain non deterministic.

So as a result, you're staring at this thing that the cloud provider says, well, you have an app and it's doing some things and the level of native understanding of what your workload looks like from the position of that, uh, of that cloud provider become obfuscated through that level of indirection. It effectively winds up creating a host of problems while solving for others.

As with everything, it's built on trade-offs.

Nick: Yeah, I mean, not everybody needs a Kubernetes, right? There's a certain complexity you have to reach in the applications you need to support before it's beneficial, right? It's not just immediately beneficial. A lot of the customers that I work with are actually, I don't want to say dismayed, but a little bit like, they're doing the hybrid cloud thing.

I'm running this application across multiple clouds, and Kubernetes helps them there because, while it's not identical on every single cloud, it does mean that like 80, maybe 85 to 90 percent of the configuration, and the application itself, can be treated the same across these three different clouds. There's, you know, 10 percent that's different per cloud provider, but it does help to that degree.

Like we have customers that can hold us accountable. They can say, you know what? This other cloud provider is doing something better or giving it to us cheaper. And we have a dependency on open source Kubernetes and we built all our own tooling. We can move it. And it works for them.

Corey: That's one of those things that has some significant value for folks. I'm not saying that Kubernetes is not adding value, and again, nothing is ever an all-or-nothing approach. But here's an easy example where I tend to find a number of my customers struggling: most people will build a cluster to span multiple availability zones over in AWS-land, because that is what you are always told to do.

Oh, well, yeah, we can constrain blast radiuses, sorry, radii, so of course we're going to be able to sustain the loss of an availability zone. So you want to be able to have traffic flow between those. Great. The problem is, it costs two cents per gigabyte to transfer data between availability zones, and Kubernetes itself is not in any way zone aware.

It has no sense of pricing for that. So it'll just as cheerfully toss something over a two-gigabit link to another zone as opposed to the thing right next to it for free, and it winds up in many cases bloating those costs. It's one of those areas where, if the system understood its environment and the environment understood the system a little bit better, this would not happen.

But it does.

Nick: So I have worked on Amazon. I didn't work for them, but I used EC2 for two, three years; that was my first foray into cloud. I then worked for Microsoft, so I worked on Azure for five years, and now I've been at Google for a while. So I will say this: my information on Amazon is a little bit dated, but I can tell you from a Google perspective, for that specific problem you call out, there are at least upstream Kubernetes configurations that can allow you to have affinity with transactions.

It's complicated though. It's not easy. Also, one of the things that I'm responsible for is building this idea of fleets. The idea of fleets is that you have N number of clusters that you sort of manage together. And not all of those clusters need to be homogenous, but pockets of them are homogenous, right?

And so one of the patterns that I'm seeing our bigger customers adopt is to create a cluster per zone. They stitch them together with the fleet, use namespace sameness, treat everything the same across them, slap a load balancer in front, but then silo the transactions in each zone, so they have an easy, efficient, and sure way to ensure that, you know, interzonal costs are not popping up.
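
Nick doesn't name the specific upstream configuration here, but one likely candidate is Topology Aware Routing, which asks Kubernetes to keep Service traffic within the client's zone. A minimal sketch, assuming Kubernetes 1.27+ and a hypothetical service name:

```yaml
# Hedged sketch: keep Service traffic in-zone with Topology Aware Routing.
# Assumes Kubernetes 1.27+; on 1.30+ the spec.trafficDistribution field
# ("PreferClose") is the newer alternative to this annotation.
apiVersion: v1
kind: Service
metadata:
  name: checkout                       # hypothetical service name
  annotations:
    service.kubernetes.io/topology-mode: Auto
spec:
  selector:
    app: checkout
  ports:
    - port: 80
      targetPort: 8080
```

Note that topology hints only kick in when endpoints are spread evenly enough across zones, so this is a best-effort optimization rather than the hard per-zone siloing Nick describes with fleets.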

Corey: That is, in many cases, the right approach. I learned this from some former Google engineers back in the noughts, back when being a Google engineer was the sort of thing where a hush came over the room and everyone leaned in to see what this genius wizard would have to say. It was a different era on some level.

And one of the things I learned was that in almost every scenario, when you start trying to build something for high availability, and this was before cross-AZ data transfer was even on anyone's radar, but for availability alone, you'd have, you know, a phantom router that was there to take over in case the primary failed.

The number one cause of outages, and it wasn't particularly close, was a failure in the heartbeat protocol or the control handover. So rather than trying to build data center pods that were highly resilient, the approach instead was: all right, load balance between a bunch of them and constrain transactions within them, but make sure you can fail over reasonably quickly, effectively, and automatically.

Because then you can just write off a data center in the middle of the night when it fails, fix it in the morning, and the site continues to remain up. That is a fantastic approach. Again, having built this in my spare room, at the moment I just have the one cluster. I feel like after this conversation it may split into two, just on the sheer sense that this is what smart people tend to do at scale.

Nick: Yeah, it's funny. So when I first joined Google, I was super interested in going through their SRE program. One thing that's great about this company that I work for now is they give you the time and the opportunities. So I wanted to go through what SREs go through when they come on board and train.

Corey: So you went through the training process. And I believe that process is called hazing, but continue.

Nick: Yeah. But the funniest thing is, you go through this and you're actually playing with tools and affecting real Borg cells, using all of the Google terms to do things, obviously not in production.

And then you have these tests, and most of the time, the answer to the test was: hey, drain the cell, just turn it off, and then turn another one on.

Corey: It's the right approach in many cases. That's what I love about the container world: it becomes ephemeral. That's why observability is important, because you'd better be able to get the telemetry for something that stopped existing 20 minutes ago to diagnose what happened.

But once you can do that, it really does free up a lot of things, mostly. But even that, I ran into significant challenges with. I come from the world of being a grumpy old sysadmin, and I've worked with data center remote-hands employees that were, yeah, let's just say that was a heck of a fun few years.

So the first thing I did once I got this up and running with a workload on it is I yanked the power cord out of the back of one of the node members that was running a workload, like I was rip-starting a lawnmower enthusiastically at 2 in the morning, like someone might have done to a core switch once.

But, yeah, so I'm waiting for the cluster to detect that the pod is no longer there and reschedule it somewhere else, and it didn't for two and a half days. And again, there are ways to configure this, and you have to make sure the workload is aware of this.

But again, to my naive understanding, part of the reason that people go for Kubernetes the way they do is that it abstracts the application away from the hardware, and you don't have to worry about individual node failures. Well, apparently I have more work to do.

Nick: These are things that are tunable and configurable.

One of the things that we strive for on GKE is to make a lot of these best practices automatic, like recovering the node, or reducing the amount of time it takes for a disconnection to actually release whatever lease is holding that pod on that particular node. We do all this stuff in GKE, and we don't even tell you we're doing it, because we just know that this is the way you do things. And I hope that other providers are doing something similar, just to make it easier.
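
Nick doesn't enumerate the knobs, but in upstream Kubernetes the delay Corey hit is largely governed by the NoExecute tolerations that the DefaultTolerationSeconds admission controller adds to every pod: five minutes for not-ready and unreachable nodes by default. A hedged sketch of shortening that window on a Deployment, with all names hypothetical:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo                    # hypothetical workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
        - name: app
          image: nginx          # stand-in image
      tolerations:
        # Evict pods 30s after the node goes NotReady or unreachable,
        # instead of the 300s default the admission controller injects.
        - key: node.kubernetes.io/not-ready
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 30
        - key: node.kubernetes.io/unreachable
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 30
```

Note that this only governs pod eviction; a pod with an attached volume still can't start elsewhere until the storage system releases its claim, which is the part of Corey's experiment that took so long.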

Corey: They are. This is, again, something I've only done in a bare metal sense. I intend to do it on most of the major cloud providers at some point over the next year or two. Here at The Duckbill Group, one of the things we do with, you know, my day job, is we help negotiate AWS contracts. We just recently crossed 5 billion dollars of contract value negotiated.

It solves for fun problems, such as: how do you know that the contract you have with AWS is the best deal you can get? How do you know you're not leaving money on the table? How do you know that you're not doing what I do on this podcast and on Twitter constantly and sticking your foot in your mouth?

To learn more, come chat at duckbillgroup.com. Optionally, I will also do podcast voice when we talk about it. Again, that's duckbillgroup.com. The most common problem I had was all related to the underlying storage subsystem. Longhorn is what I use.

Nick: I was going to say, can I give you a fun test when you're doing this on all the other cloud providers?

Don't use Rancher or Longhorn. Use their persistent disk option.

Corey: Oh, absolutely. I'm using it. The reason I'm using Longhorn, to be very clear on this, is that I don't trust individual nodes very much. And yeah, EBS or any of the providers' block storage options are far superior to what I'll be able to achieve with local hardware, because I don't happen to have a few spare billion dollars in engineering lying around in order to build a really performant, really durable block store.

Nick: And I don't either; that's not on my list. They all have a performant, durable block store that's presented as disk storage, right? They all do. But the real test is, when you rip out that node, or effectively unplug that network interface, how long does it take for their storage system to release the claim on that disk and allow it to be attached somewhere else?

That's the test.

Corey: Exactly. And that is a great question. The honest answer is that there are ways, of course, to tune all of these things across every provider. I did no tuning, which means the time was effectively infinite, as best I could tell. And it wasn't just for this; I had a number of challenges with the storage provider over the course of a couple months.

And it's challenging. I mean, there are other options out there that might have worked better. I switched all the nodes that have a backing store over to using relatively fast SSDs, because having it on SD cards seemed like it might have been a bottleneck, and there were still challenges, in ways I did not inherently expect.

Nick: That makes sense. So can I ask you a question?

Corey: Please.

Nick: If Kubernetes is too complicated, let's just say, okay, it is complicated. It's not good for everything. But PaaSes, most PaaSes, are a little bit too constrictive, right? Their opinions are too strong. Most of the time I have to use a very explicit programming model to take advantage of them.

That leaves us with VMs in the cloud really, right?

Corey: Yes and no. For one workload right now that I'm running in AWS, I've had great luck with ECS, which is, of course, despite their word about ECS Anywhere, a single-cloud option. Let's be clear on this: you are effectively agreeing to lock-in of some form, but it does have some elegance because of how it was designed, in ways that resonate with the underlying infrastructure in which it operates.

Nick: Yeah, no, that makes sense. I guess what I was trying to get at, though, is what happens if ECS wasn't an option and you had to build these things yourself. In my experience working with customers, because before I was a PM, I was very much a field consultant, customer engineer, solution architect, all those words.

Customers just ended up rebuilding Kubernetes. They built something that autoscaled, they built something that had service discovery, they built something that repaired itself. They ended up recreating a good bit of the API, is what I found. Now, ECS is interesting, but it's a little bit hairy if you're trying to implement something that's got smaller services that talk to each other a lot.

If you just have one service and you're autoscaling it behind a load balancer, it works great.

Corey: Yeah, they talked about S3 on stage at one point as comprising something like 300-some-odd microservices that all combine to make the thing work, which is crazy. Phenomenal. I'm sure it's the right decision for their workloads and whatnot.

I felt like I had to jump on that as soon as it was said, just as a warning: this is what works for a global, hyperscale, centuries-long thing that has to live forever. Your blog does not need to do this. This is not a to-do list. So, yeah, back when I was doing this stuff in Anger, which is of course my name for production, as opposed to my staging environment, which is always called Theory, because it works in theory but not in production.

Exactly. Back when I was running things in Anger, it was before containers had gotten big, so it was always: bake AMIs to a certain point, and then do configuration management and code deploys in order to get them to current. And yeah, then we bolted on all the things that Kubernetes now offers, the things that any system has to offer.

Kubernetes didn't come up with these concepts. The ideas of durability, of autoscaling, of load balancing, of service discovery: those things inherently become problems that need to be solved for. Kubernetes has solved some of them in very interesting, very elegant ways. Others it has solved by saying, oh, you want an answer for that?

Here's 50. Pick your favorite. And I think we're still seeing best practices continue to emerge.

Nick: No, we are. And I did the same thing. In my first role where I was using cloud, we were rebuilding an actuarial system on EC2, and the value prop for our customers was, hey, you don't need to rent a thousand cores for the whole year from us; you can just use them for the two weeks that you need them. Awesome, right?

You could just use them for the two weeks that you need them. Awesome. Right. That was my, my first foray into infrastructure code. I was using Python and the Botto SDK and just, Automating the crap out of everything, and it worked great. But I imagine that if I had stayed on at that role, repeating that process for N number of applications would start to become a burden.

So you'd have to build some sort of template, some engine; you'd end up with an API. Once it gets beyond a handful of applications, I think, maybe that's where Kubernetes has a strength. Because you get a lot of it for free. It's complicated, but if you figure it out and then create the right abstractions for the different user types you have, you can end up being a lot more efficient than trying to manage, you know, a hundred different implementations.

Corey: We see the same thing right now. Whenever someone starts a new open source project, or even starts something new within a company, great. The problem I've always found is building the CI/CD process. How do I hook it up to GitHub Actions or whatever it is to fire off a thing? Until you build sort of what looks like an internal company platform, you're starting effectively at square one each and every time.

I think that building an internal company platform at anything beyond giant scale is probably ridiculous, but it is something that people are increasingly excited about, so it could very well be that I'm the one who's wrong on this. I just know that every time I build something new, there's a significant boundary between me being able to YOLO-slam this thing into place and having merges into the main branch wind up getting automatically released through a process that has some responsibility to it.
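
As a rough illustration of the merge-to-main wiring Corey is describing, here is a minimal sketch of a GitHub Actions workflow; the registry, image name, and deployment are all hypothetical, and it assumes registry and cluster credentials are already configured on the runner:

```yaml
# .github/workflows/release.yml (hypothetical): build and roll out on merge to main
name: release
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push the container image
        run: |
          docker build -t registry.example.com/blog:${GITHUB_SHA} .
          docker push registry.example.com/blog:${GITHUB_SHA}
      - name: Roll the new image out to the cluster
        run: |
          kubectl set image deployment/blog app=registry.example.com/blog:${GITHUB_SHA}
```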

Nick: Yeah. I mean, there's no perfect answer for everybody, but I do think you'll get to a certain point where the complexity warrants a system like Kubernetes. But also, the CI/CD angle isn't unique to Kubernetes either. I mean, you're just talking about pipelines; we've been using pipelines forever.

Corey: Oh, absolutely.

And even that doesn't give it to you out of the box. You still have to play around with getting Argo, or whatever it is you choose to use, set up.

Nick: Yeah, it's funny, actually, weird tangent. I take this weird offense when people use the term GitOps like it's new. As an aged man who's been in this industry for a while: we've been doing GitOps for quite some time.

Now, if you're specifically talking about a pull model, fine, that may be new. But GitOps is simply just: hey, I'm making a change in source control, and then that change is getting reflected in an environment. That's how I consider it. What do you think?
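
The pull model Nick concedes is new maps to tools like the Argo CD that Corey mentioned a moment ago: a controller running in the cluster continuously converges live state to what's in Git, rather than a pipeline pushing changes out. A minimal sketch of an Argo CD Application, with a hypothetical repo and paths:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: blog                     # hypothetical app name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/blog-config.git   # hypothetical repo
    targetRevision: main
    path: k8s
  destination:
    server: https://kubernetes.default.svc    # the cluster Argo CD runs in
    namespace: blog
  syncPolicy:
    automated:          # the pull model: the controller syncs cluster state to Git
      prune: true
      selfHeal: true
```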

Corey: Well, yeah, we store all of our configuration in Git now.

It's called GitOps. What were you doing before? Oh, yeah, go retrieve the previous copy of what the configuration looked like. It's called copyof.copyof.copyof.thing.bak.cjq.usethisone.doc.zip.

Nick: Yeah, it's great. That's even going further back. Let me please make a change to something that's in a file store somewhere and copy that down to X number of VMs, or even, you know, hardware machines just running across my data center, and hopefully that configuration change doesn't take something down.

Corey: Yeah. The idea of blast radius starts to become very interesting, and canary deployments, and you know, all the things that you basically rediscover from first principles every time you start building something like this. It feels like Kubernetes gives a bunch of tools that are effective for building a lot of those things, but you still need to make a lot of those choices and implementation decisions yourself.

And it feels like whatever you choose is not necessarily going to be what anyone else has chosen. It seems like it's easy to wind up in unicorn territory fairly quickly.

Nick: But, I don't know, I think as we're thinking about what the alternative to Kubernetes is, or what the alternative to a PaaS is, I don't really see anyone building a platform to run old, shitty apps.

Who's going to run that platform? Because that's, what, 80 percent of the market of workloads out there that need to be improved. So we're either waiting for these companies to rewrite them all, or we're going to make their lives better somehow.

Corey: That's what makes containers so great in so many ways.

It's not the best approach, obviously, but it works. You can take something that is 20 years old, written in some ancient version of something, and shove it into a container as a monolith. Sure, it's a big, ugly container, but then you can at least start moving it from place to place and unwinding your dependency rat's nest.

Nick: That's how I think about it, only because, like I said, I've spent 10, 12 years working with a lot of customers trying to unwind these old applications. And a lot of the time they lose interest in doing it pretty quickly, because they're making money and there's not a whole lot of incentive for them to break the applications up and do anything with them.

In fact, I often theorize that whatever their business is, the real catalyst for change is when another startup or smaller company comes up, does it more cloud natively, and beats their pants off in the market, which then forces them to adjust. But that kind of stuff doesn't happen in, like, payment transaction companies. There's a heavy price to pay to even be in that business, so what's the incentive for them to change?

Corey: I think that there's also a desire on the part of technologists many times, and I'm as guilty as anyone of this, to walk in and say, this thing's an ancient piece of crap, what's it doing? And the answer is, oh, about 4 billion in revenue, so maybe mind your manners. And, oh, yeah, okay. Is this engineeringly optimal?

No, but it's kind of load-bearing, so we need to work with it. People are not still using mainframes because they believe that in 2024 they're going to greenfield something and that's the best they'd be able to come up with. It's because that's what they went with 30, 40 years ago, and there has been so much business process built around its architecture, around its constraints, around its outputs, that unwinding that hairball is impossible.

Nick: It is a bit impossible. And also, is it bad? Those systems are pretty reliable. The only downside is just the cost of whatever IBM is going to charge you for support.

Corey: So we're going to re-architect it, then migrate it to the cloud. Yeah, because that'll be less expensive.

Good call. It's always a trade-off. And economics are one of those weird things where people like to think in terms of the cash dollars they pay vendors as the end-all-be-all, but they forget the most expensive thing that every company has to deal with is its personnel.

Payroll costs dwarf cloud infrastructure costs unless you're doing something truly absurd at very small company scale. I've never heard of a big company that spends more on cloud than it does on people.

Nick: Oh, that's an interesting data point. I figured there'd be at least a handful of them, but interesting.

Corey: I mean, you'll see it in some very small companies where, all right, we're a two-person startup, we're not taking market-rate salaries, and we're doing a bunch of stuff with AI; okay, yeah, I can see driving that cost into the stratosphere. But you don't see it at significant scale. In fact, that holds for most companies that are legacy, which is the condescending engineering term for "it makes money," which means it was founded more than five years ago.

For a lot of those companies, their number two expense is real estate, more so than infrastructure costs. Sometimes, yeah, you can count data centers as part of that, but office buildings are very expensive. Then cloud is usually number three, though there are exceptions to that.

Because they're public, we can talk about this one. Netflix has said for a long time that their biggest driver, even beyond people, has been content. Licensing all of the content and producing all of that content is not small money. So there are going to be individual weird companies doing strange things.

But it's fun. I mean, you also get this idea that, oh, no one can ever run on-prem anymore. Well, not for nothing, technically Google is on-prem. So is Amazon. These aren't just magic companies that are the only ones who remember how to replace hardware and walk the aisles between racks of servers.

The question is: is it economical? When does it make sense to start looking at these things? And even strategically, people remember the mainframe mistake with IBM: tying yourself indelibly to a particular vendor. Even if I don't move this off of Google or off of Amazon today, I don't want it to be impossible to do so in the future, and Kubernetes does present itself as a reasonable hedge.

Nick: Yeah, it neutralizes that vendor lock-in if you run your own data centers or whatever, but then a lot of the time you end up getting locked into specific hardware, which is not that different from cloud. I do work with a handful of customers who are sensitive to even very specific versions of chips, right?

They need version N because it gives them 10 percent more performance, and at the scale they're running, that's very important to them.

Corey: One last question before we wind up calling this an episode, and I'm curious to get your take on it given that you work over in product: where do you view the dividing line between Kubernetes and GKE?

Nick: So this is actually a struggle that I have, because I am historically much more open source oriented and about the community itself. I think it's our job to bring the community up, to bring Kubernetes up. But of course it's a business, right? So the dividing line that I like to think about for us is the cloud provider code: the ways that we can make it work better on Google Cloud without really making the API weird, right?

We don't want to run some version of the API that you can't run anywhere else.

Corey: Yeah, otherwise you just roll out Borg and call it a day.

Nick: Yeah, but when you use a load balancer, we want it to be fast, smooth, seamless, and easy. When you use persistent storage, we have persistent storage that automatically replicates the disk across zones, so that when one thing fails, you go to the other one.

It's nice and fast, right? So these are the little things that we try to do to make it better. Another example that we're working on is Fleets; that's specifically the product I work on, GKE Fleets, and we're working upstream with cluster inventory to ensure that there is a good way for customers to take a dependency on our Fleets without getting locked in.

Right. So we adhere to this open source standard, third-party tool providers can build implementations that then work with Fleets, and if other cloud providers decide to take that same dependency on cluster inventory, then we've just created another good abstraction for the ecosystem to grow, without forcing customers to lock into specific cloud providers to get valuable services.
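
For the replicated-disk behavior Nick describes, GKE exposes regional persistent disks through its PD CSI driver; worth noting that a regional PD is synchronously replicated across two zones rather than three. A minimal StorageClass sketch, with a hypothetical class name:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: regional-ssd              # hypothetical name
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd
  replication-type: regional-pd   # replicate the disk across two zones
volumeBindingMode: WaitForFirstConsumer
```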

Corey: It's always a tricky balancing act, because at some level, being able to integrate with the underlying ecosystem it lives within makes for a better product and a better customer experience, but then you get accused of trying to drive lock-in.

Nick: I think, if you talk to my skip, Drew Bradstock, who runs container runtimes for all of Google Cloud,

I think he would say, and I agree with him here, that we're trying to get you to come over and use GKE because it's a great product, not because we want to lock you in. So we're doing all the things to make it easier for you, because you listed out a whole lot of complexity.

We're really trying to remove that complexity, so that when you're building your platform on top of Kubernetes, there are maybe, I don't know, 30 percent fewer things you have to do when you do it on Google Cloud than elsewhere or on-prem.

Corey: Yeah, it would be nice. I really want to thank you for taking the time to speak with me.

If people want to learn more, where's the best place for them to find you?

Nick: If people want to learn more, I'm active on Twitter, so hopefully you can just add my handle in the show notes. And also, if you're already talking to Google, feel free to drop my name and bring me into any call you want as a customer.

I'm happy to jump on and help work through things. I have this crazy habit where I can't get rid of old habits: I don't just come on the calls as a PM and help you. I actually put on my architect consultant hat, and I can't turn that part off.

Corey: I don't understand how people can engage in the world of cloud without that skill set and background.

Personally, it's so core and fundamental to how I view everything. I mean, I'm sure there are other paths. I just have a hard time seeing it.

Nick: Yeah, yeah, yeah. It's a lot less about, let me pitch this thing to you, and much more about, okay, well, how does this fit into the larger ecosystem of things you're doing, the problems you're solving?

Because, I mean, we didn't get into it on this call, and I know it's about to end, so we shouldn't, but Kubernetes is just a runtime. There are, like, a thousand other things that you have to figure out with an application sometimes, right? Like storage, bucket storage, databases, IAM.

Corey: Yeah, that is a whole separate kettle of nonsense.

You won't like what I did locally, but that's beside the point.

Nick: But are you allowing anonymous access to all of it?

Corey: Exactly. The trick is, if you harden the perimeter well enough, then nothing is ever going to get in, so you don't have to worry about it. Let's also be clear: this is running a bunch of very small-scale stuff.

It does use a real certificate authority, but still.

Nick: I have the most secure Kubernetes cluster of all time running in my house back there. Yeah, it's true. It's turned off.

Corey: Even then, I'd still feel better if it were sunk in concrete and dropped into a river somewhere, but you know, we'll get there. Thank you so much for taking the time to speak with me.

I appreciate it.

Nick: No, I really appreciate your time. This has been fun. You're a legend, so keep going.

Corey: Oh, I'm something, alright. I think I'd have to be dead to be a legend. Nick Eberts, Product Manager at Google. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you enjoyed this podcast, please leave a five-star review on your podcast platform of choice, or on the YouTubes.

Whereas if you hated this podcast, please continue to leave a five-star review on your podcast platform of choice, along with an angry, insulting comment saying that that is absolutely not how Kubernetes and hardware should work, but remember to disclose which large hardware vendor you work for in that response.
