Observing The Hidden Complexity Behind Simple Cloud Networks with Avi Freedman

Episode Summary

Avi Freedman, CEO at Kentik, joins Corey on Screaming in the Cloud to discuss the fun of solving for observability. Corey and Avi discuss how great simplicity can be deceiving, and Avi points out that with great simplicity comes great complexity. Avi discusses examples of this that he sees in Kentik customer environments, as well as the differences he sees in cloud environments from traditional data center environments. Avi also reveals his predictions for the future and how enterprise M&A will affect the way companies view data centers and VPCs.

Episode Show Notes & Transcript

About Avi

Avi Freedman is the co-founder and CEO of network observability company Kentik. He has decades of experience as a networking technologist and executive. As a network pioneer in 1992, Freedman started Philadelphia’s first ISP, known as netaxs. He went on to run network operations at Akamai for over a decade as VP of network infrastructure and then as chief network scientist. He also ran the network at AboveNet and was the CTO of ServerCentral.

Links Referenced:

Kentik: https://kentik.com
Email: [email protected]
Twitter: https://twitter.com/avifreedman
LinkedIn: https://www.linkedin.com/in/avifreedman

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: Most Companies find out way too late that they’ve been breached. Thinkst Canary changes this. Deploy Canaries and Canarytokens in minutes and then forget about them. Attackers tip their hand by touching ’em giving you the one alert, when it matters. With 0 admin overhead and almost no false-positives, Canaries are deployed (and loved) on all 7 continents. Check out what people are saying at canary.love today!

Corey: Welcome to Screaming in the Cloud, I’m Corey Quinn. This promoted guest episode is brought to us by our friends at Kentik. And into my social grist mill, they have thrown Avi Freedman, their CEO. Avi, thank you for joining me.

Avi: Thank you for having me, Corey. I’ve been a big fan for some time, I have never actually fallen off my seat laughing, but I’ve come close a couple times on some of your threads.

Corey: You must have a great chair.

Avi: I should probably upgrade it [laugh].

Corey: [laugh]. I have been looking forward to this conversation for a while because you are one of those rare creatures who comes from a similar world to what I did where we were grumpy and old before our time because we worked on physical infrastructure in data centers, we basically wrangled servers into doing the things that we wanted them to do when hardware reliability was an aspiration rather than a reality. And we also moved on from that, in many ways. We are not blind to the modern order of how computers work. But you still run a lot of what you do in data centers, but many of your customers are in cloud. You speak both languages very fluently because of the unifying thread between all of this, which is, of course, the network. How did you wind up in, I guess we’ll call it network hell.

Avi: [laugh]. I mean, network hell was truly… in the ’90s, when the internet was—I mean, the internet is sort of like the human body: the more you study it, the more amazing it is that it ever worked in the first place, not that it breaks sometimes—was the bugs, and trying to put together the technology back then, you know, that we had. Life is pretty good nowadays, other than the [laugh] immense complexity that has been unleashed on us by everyone taking the same technology and then writing it in their own software and giving it their own marketing names. And thus, you have multi-cloud networking. So, got into it because it’s a problem that needs to be solved, right? There’s no ESP that connects the applications together; the network still needs to make it work. And now people own some of it, and then more of it, they don’t own, but they’re still responsible for it. So, it’s a fun problem to solve.

Corey: The timing of this episode is apt because I’ve used Kentik myself for a few things over the years. And to be fair, using it for any of my personal networking problems is a bit like noticing, “Oh, I have a loose thread here on my shirt. Pass me the chainsaw.” It’s, my environment is tiny and it’s over-scoped. But I just earlier this week wound up having to analyze a day’s worth of Flow Logs from one of my clients, and to do this, I had to spin up an EC2 instance with 128 gigs of RAM and then load the Flow Logs for that day into RAM, and then—not kidding—I ran into OOM Killer because I ran out of RAM on this thing.

Avi: [laugh].

Corey: It is, like, yeah, that’s right. The network is chatty, the logs are immense, and it’s easy to forget. Because the reason I was doing this was just to figure out what are the things that are talking to each other in this environment to drive up some aspects of data transfer costs. But that is an esoteric use case for this; it’s not why most people tend to think about network observability. So, I’m going to ask you the blunt question up front here because it might be a really short episode. Do we have to care about networking in the least now that cloud is the default in most locations? It is just an API call away, isn’t it?

Avi: With great simplicity comes great complexity. So, to the people running infrastructure, to developers or architects, turning it all on, it looks like just API calls. But did you set the policies right? Can the things talk to each other? Are they talking in patterns that are causing you wild data transfer costs?

All these things ultimately come back to some team that actually has to make it go. And can be pretty hard to figure that out, right, because it’s not just the VPC Flow Logs. It’s, what’s the policy? It’s, what are they talking to that maybe isn’t in that cloud, that’s maybe in another cloud? So, how do you bring it all together? Like, you could have—and maybe you should have—used Athena, right? You can put VPC Flow Logs in S3 buckets and use Athena and run SQL queries if all you want is your top talker.
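Editor's note: the "top talker" query Avi mentions can be illustrated without an AWS account. The following is a minimal sketch, not Athena itself. It assumes Python's built-in sqlite3 as a stand-in SQL engine, the field order of the default (version 2) VPC Flow Log record format, and made-up sample records. The table name flow_logs and the functions load and top_talkers are hypothetical names chosen for illustration.

```python
import sqlite3

# Field order of the default (version 2) VPC Flow Log record format.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]

# Made-up sample records, for illustration only.
SAMPLE_LINES = [
    "2 123456789012 eni-0a1 10.0.1.5 10.1.2.9 44321 443 6 10 8400 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a1 10.0.1.5 10.1.2.9 44322 443 6 20 16000 1700000060 1700000120 ACCEPT OK",
    "2 123456789012 eni-0b2 10.0.3.7 10.0.1.5 51000 5432 6 5 1200 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0b2 10.0.3.7 198.51.100.4 51001 22 6 1 60 1700000000 1700000060 REJECT OK",
]


def load(lines):
    """Load space-separated flow log lines into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    # Column names are quoted because "end" is an SQL keyword.
    columns = ", ".join(f'"{name}"' for name in FIELDS)
    conn.execute(f"CREATE TABLE flow_logs ({columns})")
    placeholders = ", ".join("?" for _ in FIELDS)
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) == len(FIELDS):  # skip malformed lines
            rows.append(parts)
    conn.executemany(f"INSERT INTO flow_logs VALUES ({placeholders})", rows)
    return conn


def top_talkers(conn, limit=10):
    """Return (srcaddr, dstaddr, total_bytes) for accepted traffic, largest first."""
    query = """
        SELECT "srcaddr", "dstaddr", SUM(CAST("bytes" AS INTEGER)) AS total_bytes
        FROM flow_logs
        WHERE "action" = 'ACCEPT'
        GROUP BY "srcaddr", "dstaddr"
        ORDER BY total_bytes DESC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


if __name__ == "__main__":
    for src, dst, total in top_talkers(load(SAMPLE_LINES), limit=5):
        print(f"{src} -> {dst}: {total} bytes")
```

The same GROUP BY and ORDER BY shape would apply in any SQL engine pointed at flow logs; only the loading step here is specific to the sketch.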

Corey: Oh, I did. That’s how I started, but Athena is, uh… it has some challenges. Let’s just put it that way and leave it there. DuckDB is what I was using and I’m much happier with it for a variety of excellent reasons.

Avi: Okay. Well, I’ll tease you another time about, you know—I lost this battle at Kentik. We actually don’t use swap, but I’m a big fan of having swap and monitoring it so the OOM Killer only does what you want or doesn’t fire at all. But that’s a separate religious debate.

Corey: There’s a counterargument of running an in-memory data store. And then oh, we’re going to use it as swap though, so it’s like, hang on, this just feels like running a normal database with extra steps.

Avi: Computers allow you to do amazing things and only occasionally slap you nowadays with it. It’s pretty amazing. But back to the question. APIs make it easy to turn on, but not so easy to run. The observability that you get within a given cloud is typically very limited.

Google actually has the best. They show some topology and other things. I mean, a lot of what we do involves scraping API calls in the cloud to figure out what does this all mean, then convolving it with the VPC Flow Logs and making it look like a network, and what are the gateways, and what are the rules being applied and what can’t talk to itself? If you just look at VPC Flow Logs like it’s Syslog, good luck trying to figure out what VPCs are talking to each other. It’s exactly the problem that you were describing.
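Editor's note: the point about flow logs not telling you which VPCs are talking to each other can be made concrete. The following is a rough sketch under stated assumptions, not Kentik's method: it assumes you already have a VPC-to-CIDR inventory (in practice gathered from cloud API calls) and flow tuples of (source IP, destination IP, bytes). The VPC IDs, CIDR blocks, and function names are all hypothetical.

```python
import ipaddress
from collections import Counter

# Hypothetical inventory: VPC ID -> CIDR block.
VPC_CIDRS = {
    "vpc-app": "10.0.0.0/16",
    "vpc-data": "10.1.0.0/16",
}
NETWORKS = {vpc: ipaddress.ip_network(cidr) for vpc, cidr in VPC_CIDRS.items()}


def vpc_for(ip):
    """Return the VPC whose CIDR contains ip, or "external" if none does."""
    addr = ipaddress.ip_address(ip)
    for vpc, network in NETWORKS.items():
        if addr in network:
            return vpc
    return "external"


def vpc_matrix(flows):
    """Sum bytes per (source VPC, destination VPC) pair."""
    totals = Counter()
    for src, dst, nbytes in flows:
        totals[(vpc_for(src), vpc_for(dst))] += nbytes
    return dict(totals)


# Made-up flows: (source IP, destination IP, bytes).
SAMPLE_FLOWS = [
    ("10.0.1.5", "10.1.2.9", 24400),
    ("10.0.3.7", "10.0.1.5", 1200),
    ("10.1.2.9", "198.51.100.4", 500),
]

if __name__ == "__main__":
    for (src_vpc, dst_vpc), total in vpc_matrix(SAMPLE_FLOWS).items():
        print(f"{src_vpc} -> {dst_vpc}: {total} bytes")
```

A linear scan over CIDRs is fine for a handful of VPCs; a large inventory would want a prefix tree or similar lookup structure.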

So, the ease of turning it on is exactly inversely proportional to the ease of running it. And, you know, as a vendor, we think it’s an awesome [laugh] problem, but we feel for our customers. And you know, occasionally it’s a pain to get the IAM roles set up to scrape things and help them, but that’s you know, that’s just part of the job.

Corey: It’s fascinating to me, just looking from an AWS perspective, just how much work clearly has to be done to translate their Byzantine and very strange networking environment and concepts into things that customers see. Because in many cases, the virtual machines that we run on top of EC2, let alone anything higher level, are being lied to the entire time about what the actual topology of the environment is. It was most notable, for me at least, at re:Invent 2022, the most recent one, where they announced they have a TCP replacement, Scalable Reliable Datagram, or SRD. It’s a new protocol entirely. It’s, “Oh, wow, can we use it?” “No.” “Okay.” Like, I get that it’s a lot of work, I get you’re excited about it. Are you going to talk to us about how it actually works? “Oh, absolutely not.” So… okay, good for you, I guess.

Avi: Doesn’t Amazon have to write a press release before they build anything, and doesn’t the press release have to say, like, why people give a shit, why people care?

Corey: Yep. And their story on this was oh, it enables us to be a lot faster at letting EBS volumes talk to some of our beefier instances.

Avi: [laugh].

Corey: And that’s all well and good, don’t get me wrong, but it’s also, “Yay, it’s more reliable,” is a difficult message to send. I mean, it’s hard enough when—and it’s necessary because you’ve got to tacitly admit that reliability and performance haven’t been all they could be. But when it’s no longer an issue for most folks, now you’re making them wonder, like, wait, how bad was it? It’s just a strange message.

Avi: Yeah. One of my projects for this weekend is, I actually got a gaming PC and I’m going to try compression offload to the CUDA cores because right now, we do compress and decompress with Intel cores. And like, if I’m successful there and we can get 30% faster subqueries—which doesn’t really matter, you know, on the kind of massive queries we run—and 20% more use out of the computers that we actually run, I’m probably not going to do a press release about it. But good to see the pattern.

But you know, what you said is pretty interesting. As people like Kentik, we have to put together, well, on Azure, you can have VPCs that cross regions, right? And in other places, you can’t. And in Google, you have performance metrics that come out and you can get it very frequently, and in Amazon and Azure, you can’t. Like, how do you take these kinds of telemetry that are all the same stuff underneath, but packaged up differently in different quanta and different things and make it all look the same is actually pretty fun and interesting.

And it’s pretty—you know, if you give some cloud engineers who focus on the infrastructure layer enough beers or alcohol or just room to talk, you can hear some funny stories. And it all made sense to somebody in the first place, but unpacking it and actually running it as a common infrastructure can be quite fun.

Corey: One of the things that I have found notable about your perspective, as particularly, you’re running all of the network ingest, to my understanding, in your data center environment. Because we talked about this when you were kind enough to invite me to your company all-hands offsite, presumably I assume when people do that, it’s so they can beat me up in the alley, but that only happened twice. I was very pleasantly surprised.

Avi: [And you 00:09:23] made fun of us only three times, so you know, you beat us—

Corey: Exactly.

Avi: —but it was all enjoyed.

Corey: But always with love. Now, what I found fascinating was you and I sat down for a while and you talked about your data center architecture. And you asked me—since I don’t have anything to sell you—is there an economical way that I could see running your environment on top of AWS? And the answer was sure, if by economical you mean an absolute minimum of six times what you’re currently paying a year, sure you can get there. But it just does not make sense for any realistic approach to doing this.

And the reason I bring this up is that you’re in a data center not because of religious beliefs, “Oh, well, this is good enough for my grandpappy, so it’s good enough for me.” It’s because it solves the problem you have in a way that the cloud providers clearly cannot. But you also are not anti-cloud. So, many folks who are all-in on data centers seem to be doing it out of pure self-interest where, well, if everyone goes all-in on cloud, then we have nothing left to sell them. I’ve used AWS VPC Flow Logs. They have nothing that could even remotely be termed network observability. Your future is assured as long as people understand what it is that you’re providing them and what it is that you add. So yeah, people keep going in a cloud direction, you’re happy as houses.

Avi: We’ll use the best tools for building our infrastructure that we can, right? We use cloud. In fact, we’re just buying some reserved instances, which always, you know, I give it the hairy eyeball, but you know, we’re probably always going to have our CI/CD bursty stuff in the cloud. We have performance testing regions on all the major clouds so that we can tell people what performance is to and from cloud. Like, that we have to use cloud for.

And if there’s an always-on model, which starts making sense in the cloud, then I try not to be the first to use anything, but [laugh] we’ll be one of the first to use it. But every year, we talk to, you know, the major clouds because we’re customers of all them, for as I said, our testing infrastructure if nothing else, and you know, some of them for some other parts, you know, for example, proxying VPC Flow Logs, we run infrastructure on Kubernetes in all—in the three biggest to proxy VPC Flow Logs, you know, and so that’s part of our bill. But if something’s always on, you know, one of our storage servers, it’s a $15,000 machine that, you know, realistically runs five years, but even if you assume it runs three years, we get financing for it, cost a couple $100 a month to host, and that’s inclusive of our ops team that runs, sort of, everything, you just do the math. That same machine would be, you know, even not including data transfer would be maybe 3500 a month on cloud. The economics just don’t quite make sense.
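Editor's note: the arithmetic behind "you just do the math" can be written out. This is a back-of-the-envelope sketch using only the figures Avi quotes, with two labeled assumptions: "a couple $100 a month" is taken as 200, and financing interest is ignored. The function name is hypothetical.

```python
def monthly_owned_cost(purchase_price, lifetime_months, hosting_per_month):
    """Straight-line amortized hardware cost plus hosting, per month."""
    return purchase_price / lifetime_months + hosting_per_month


PURCHASE_PRICE = 15_000   # quoted price of one storage server
HOSTING_PER_MONTH = 200   # assumption: "a couple $100 a month", ops team included
CLOUD_PER_MONTH = 3_500   # quoted cloud estimate, excluding data transfer

three_year = monthly_owned_cost(PURCHASE_PRICE, 36, HOSTING_PER_MONTH)
five_year = monthly_owned_cost(PURCHASE_PRICE, 60, HOSTING_PER_MONTH)

if __name__ == "__main__":
    print(f"3-year life: ${three_year:,.2f}/month, cloud is {CLOUD_PER_MONTH / three_year:.1f}x")
    print(f"5-year life: ${five_year:,.2f}/month, cloud is {CLOUD_PER_MONTH / five_year:.1f}x")
```

Under these assumptions the cloud figure comes out at roughly 5.7 times the owned cost on a three-year life and about 7.8 times on a five-year life, which is in line with the "minimum of six times" estimate Corey gives earlier in the conversation.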

For burst, for things like CI/CD, test, seasonality, I think it’s great. And if we have patterns like that, you know, we’re the first to use it. So, it’s just a question of using what’s best. And a lot of our customers are in that realm, too. I would say some of them are a little over-rotated, you know, they’ve had big mandates to go one way or the other and don’t have the right, you know, sort of nuanced view, but I think over time, that’s going to fix itself. And yeah, as you were saying, like, the more people use cloud, the better we do, so it’s just really a question of what’s the best for us at our infrastructure and at any given time.

Corey: I think that that is something that is not fully appreciated or well understood is that I work with cloud technologies because for what I do, it makes an awful lot of sense. But I’ve been lately doing a significant build-out in my home network on the perspective of yeah, this makes sense for what I do. And I now have increased number of workloads that I’m running here and I got to say, it feels a little strange, on some level, not to be paying AWS on something metered by the second whenever I’m running a job here. That always feels a little on the weird side. But I’m not suggesting I filled my house with servers either.

Avi: [unintelligible 00:13:18] going to report you to the House Un-Cloudian Activities Committee [laugh] for—

Corey: [laugh].

Avi: To straighten you out about your infrastructure use and beliefs. I do have to ask you, and I do have some foreknowledge of this, where is the controller for your network running? Is it running in your house or—

Corey: Oh, the WiFi controller lives in Ohio with all the other unpleasant things. I mean, even data transfer between Ohio and Virginia—if you’re on AWS—is half-price because data wants to get out of Ohio just as much as the people do. And that’s fine, but it can also fail out of band. I can chill that thing for a while and I’m not able to provision new equipment, I can’t spin up new SSIDs, but—

Avi: Right. It’s the same as Tailscale, which is, like, sufficiently indistinguishable from magic, but it’s nice there’s Headscale in case something happened to them. But yeah, you know, you just can’t set up new stuff without SSHing the old way while it’s down. So.

Corey: And worst case, it goes away irretrievably, I can spin a new one up, I can pair everything locally, do it by repointing DNS locally, and life will go on. It’s one of those areas where, like, I would not have this in Ohio if latency was a concern if it was routing every packet out halfway across the country before it hit the general internet. That would be a challenge for me. But that’s not what I’m doing.

Avi: Yeah, yeah. No, that makes sense. And I think also—

Corey: And I certainly pay AWS by the second for that thing. That’s—I have a three-year savings plan for that thing, and if nothing else, it was useful for me just to figure out what the living hell was going on with the savings plan purchase project one year. That was just, it was a challenge to get that straightened out in some ways. Turns out that the high watermark of the console is a hundred-and-some-odd-thirty-million dollars you can add to cart and click the buy button. Have fun.

Avi: My goodness. Okay, well.

Corey: The API goes up to $26.2 billion. Try that in a free tier account, preferably someone else’s.

Avi: I would love to have such problems. Right now, that is not one of them. We don’t spend that much on infrastructure.

Corey: Oh, that is more than Amazon’s—AWS’s at least—quarterly revenue. So, if you wind up doing a $26.2 billion, it’s like—it’s that old saw. You owe Amazon a million dollars, you have a problem. If you owe Amazon $26 billion, Amazon has a problem. Yeah, that’s when Andy Jassy calls you 20 minutes after you make that purchase, and at least to me, he yells at me with a, “Listen here, asshole,” and it sort of devolves from there.

Avi: Well, I do live in Seattle, so you know, they send the posse out, I’m pretty sure.

Corey: [laugh] I will be keynoting DevOpsDays Seattle on August 1st with a talk that might very well resonate with your perspective, “The Modern Devops: A Million Ways to Die in Production.”

Avi: That is very cool. I mean, ultimately, I think that’s what cloud comes back to. When cloud was being formed, it’s just other people’s computers, storage, and network. I don’t know if you’d argue that there’s a politics, control plane, or a—

Corey: Oh, I would say, “Cloud? There’s no cloud; just someone else’s cost center.”

Avi: Exactly. And so, how do you configure it? And back to the question of, should everything be on-prem or does cloud abstract it all, it’s all the same stuff that we’ve been doing for decades and decades, just with other people’s software and names, which you help decode. And then it’s the question we’ve always had: what’s the best thing to do? Do you like Wellfleet or Proteon? Now, do you like Azure [laugh] or Google or Amazon or somebody else or running your own?

Corey: It’s almost this generation's equivalent of Vi versus Emacs.

Avi: Yes. I guess there could be a cloud equivalent. I use VI, but only because I’m a lisp addict and I don’t want to get stuck refining Eliza macros and connecting to the ChatGPT in Emacs. So, you know. Someone just did an Emacs as PID 0. So basically, no init, just, you know, the kernel boots into Emacs, and then someone of course had to do a VI as PID 0. And I have to admit, Emacs would be a lot more useful as a PID 0, even though I use VI.

Corey: I would say that—I mean, you wind up in writing in Emacs and writing lisp in it, then I’ve got to say every third thing you say becomes a parenthetical.

Avi: Exactly. Ha.

Corey: But I want to say that there’s also a definite moving of data going on there that I think is a scale that, for those of us working mostly in home labs and whatnot, can be hard to imagine. And I see that just in terms of the volume of Flow Logs, which to be clear, are smaller than the data transfer they are representing in almost every case.

Avi: Almost every.

Corey: You see so much of the telemetry that comes out of there and what customers are seeing and what their problems are, in different ways. It’s not just Flow Logs, you ingest a whole bunch of different telemetry through a variety of modern and ancient and everything in between variety of protocols to support, you know, the horror that is network equipment interoperability. And just, I can’t—I feel like I can’t do a terrific job of doing justice to describing just how comprehensive Kentik is, once you get it set up as a product. What is on the wire has always been for me the arbiter of truth because computers will lie to you, but it’s very tricky to get them to lie and get the network story to cover for it.

Avi: Right. I mean, ultimately, that’s one of the sources of truth. There’s routing, there’s performance testing, there’s a whole lot of different things, and as you were saying, in any one of these slices of your, let’s just pick the network. There’s many different things that all mean the same, but look different that you need to put together. You could—the nerd term would be, you know, normalizing. You need to take all this stuff and normalize it.

But traffic, we agree, that’s where we started. We call it the what-is: what’s actually happening on the infrastructure. And that’s the ancient stuff like IPFIX and NetFlow and sFlow. Some people would argue that, you know, the IETF would say, “Oh, we’re still innovating and it’s still current,” but you know, it’s certainly on-prem only. The major cloud vendors would say, “Oh, well, you can run the router—cloud routers—or you could run cloud versions of the big routers,” but we don’t really see that as a super common pattern today.

But what’s really the difference between NetFlow and the VPC Flow Log? Well, some VPC Flow Logs have permit/deny because they’re really firewall logs, but ultimately, it’s something went from here to there. There might not be a TCP flag, but there might be something else in cloud. And, you know, maybe there’s RUM data, which is also another kind of traffic. And ultimately, all together, we try to take that and then the business metadata to say, whether it’s NetBox in the old world or Kubernetes in the new world, or some other [unintelligible 00:19:49], what application is this? What user is this?
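Editor's note: the normalizing Avi describes, where a NetFlow record and a VPC Flow Log record both reduce to "something went from here to there," might look roughly like this. It is an illustrative sketch, not Kentik's schema. The Flow class and both converter functions are hypothetical; the VPC side assumes the default (version 2) flow log field order, and the NetFlow side assumes an already-decoded dict keyed loosely on NetFlow v5 field names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Flow:
    """Hypothetical common record: fields missing from a source stay None."""
    src: str
    dst: str
    src_port: int
    dst_port: int
    protocol: int
    packets: int
    byte_count: int
    action: Optional[str] = None     # permit/deny, present in VPC Flow Logs
    tcp_flags: Optional[int] = None  # present in NetFlow, not in default VPC logs
    source: str = "unknown"


def from_vpc_flow_log(line):
    """Parse one default-format (version 2) VPC Flow Log line."""
    f = line.split()
    return Flow(
        src=f[3], dst=f[4],
        src_port=int(f[5]), dst_port=int(f[6]),
        protocol=int(f[7]), packets=int(f[8]), byte_count=int(f[9]),
        action=f[12], source="vpc_flow_log",
    )


def from_netflow(record):
    """Convert a decoded NetFlow-style dict (assumed key names) to a Flow."""
    return Flow(
        src=record["srcaddr"], dst=record["dstaddr"],
        src_port=record["srcport"], dst_port=record["dstport"],
        protocol=record["prot"], packets=record["dPkts"],
        byte_count=record["dOctets"],
        tcp_flags=record.get("tcp_flags"), source="netflow",
    )


if __name__ == "__main__":
    cloud = from_vpc_flow_log(
        "2 123456789012 eni-0a1 10.0.1.5 10.1.2.9 44321 443 6 10 8400 "
        "1700000000 1700000060 ACCEPT OK"
    )
    onprem = from_netflow({
        "srcaddr": "10.0.1.5", "dstaddr": "10.1.2.9", "srcport": 44321,
        "dstport": 443, "prot": 6, "dPkts": 10, "dOctets": 8400,
        "tcp_flags": 0x12,
    })
    print(cloud)
    print(onprem)
```

Keeping source-specific fields optional, rather than dropping them, is one way to preserve the "permit/deny here, TCP flags there" differences while still querying both kinds of record together.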

So, you can ask questions about why am I blowing up between these cloud regions? What applications are doing it, right? VPC Flow Logs by themselves don’t know that, so you need to add that kind of metadata in. And then there’s performance testing, which is sort of the what-if. Something we do, ThousandEyes does, some other people do.

It’s not the actual source of truth, but for example, if you’re having a performance problem getting between, you know, us-east and Azure in the east, well, there’s three other ways you can get there. If your actual traffic isn’t getting there that way, then how do you know which one to use? Well, let’s fire up some tests. There’s all the metrics on what all of the devices are reporting, just like you get metrics from your machines and from your applications, and then there’s stuff even up at the routing layer, which God help you, hopefully you don’t need to actually get in and debug, but sometimes you do. And sometimes, you know, your neighbor tells the mailman that that mail is for me and not for you and they believe them and then you have a big problem when your bills don’t get paid.

The same thing happens in the cloud, the same thing happens on the internet [unintelligible 00:20:52] at the routing. So, the goal is, take all the different sources of it, make it the same within each type, and then pull it all together so you can look at a single place, you can look at a map, you can look at everything, whether it’s the cloud, whether it’s your own data centers, your own WAN, into the internet and in between in a coherent way that understands your application. So, it’s a small task that we’ve bit off, but you know, we have fun solving it.

Corey: Do you find that when you look at customer environments, that they are, and I don’t mean to be disparaging here, truly I don’t, but if you were to ask me to design something today, I would probably not even be using VPCs if I’m doing this completely greenfield. I would be a lot more cloud-first, et cetera, et cetera. Whereas in many cases, that is not the right path, especially if you know, customers have the temerity to not be founded within the last 18 months before AWS existed in some ways. Do you find that the majority of what they’re doing looks like they’re treating the cloud like data centers or do you find that they are leveraging cloud in ways that surprise you and would not be possible in traditional data centers? Because I can’t shake the feeling that the network as a source of truth for figuring out what’s really going on [is very hard to beat 00:22:05].

Avi: Yes, for the most part, to both your assertion at the end and sort of the question. So, in terms of the question, for the most part, people think of VPCs as… you know, they could just equivalently be VLANs and [unintelligible 00:22:21], right? I’ve got policies, and I have these things that are talking to each other, and everything else is not local. And I’ve got—you know, it’s not a perfect mapping to physical interfaces in VLANs but it’s the equivalent of that.

And that is sort of how people think about it. In the data center, you’d call it micro-segmentation, in the cloud, you call it clouding, but you know, just applying all the same policies and saying this stuff can talk to each other and not. Which is always sort of interesting, if you don’t actually know what is talking [laugh] to each other to apply those policies. Which is a lot of what you know, Kentik gets brought in for first. I think where we see the cloud-native thinking, which is overlaid on top of that—you could call it overlay, I guess—which is service mesh.

Now, putting aside the question of what’s going to be a service mesh, what’s going to be a network mesh, where there’s something like [unintelligible 00:23:13] sit, the idea that there’s a way that you look at traffic above the packets at, you know, layers three to more layer seven, that can do things like load balancing, do things like telemetry, do things like policy enforcement, that is a layer that we see very commonly that a lot of the old school folks have—you know, they want their lsu F5s and they want their F5 script. And they’re like, “Why can’t I have this in the cloud?”—which I guess you could buy it from F5 if you really want—but that’s pretty common. Now, not everything’s a sidecar anymore and there’s still debates about what’s going on there, but that’s pretty common, even where the underlying cloud just looks like it could just be a data center.

And that seems to be state of the art, I would say, for our traditional enterprise customers, for sure. Our web company customers, and you know, service providers use cloud more for their OTT and some other things. As we work with them, they’re a little bit more likely to be on-prem, you know, historic. But remember, in the enterprise, there’s still a lot of M&A going on, I think that’s even going to pick up in the next couple of years and a lot of what they’re doing is lift-and-shift of [laugh] actual data centers. And my theory is, it’s got to be easier to just make it look like VPCs than completely redo it.

Corey: I’d say that there’s reasons that things are the way that they are. Like, ignoring that this is the better approach from a technical perspective entirely because that’s often not the only answer, it’s we have assurances we made as part of audit compliance regimes, of our SOC 2, of how we handle certain things and what those controls are. And yeah, it’s not hard for even a junior employee, most of the time, to design a reasonable architecture on a whiteboard. The problem is, how do you take something pre-existing and get it to a state that closely resembles that while not turning it off for a long time?

Avi: Right. And I think we’re starting to see some things that probably shouldn’t exist, like, people trying to do VXLAN as overlays into and between VPCs because that was how their data s—you know, they’re more modern on the data center side and trying to do that. But generally, I think people have an understanding they need to be designing architecture for greenfield things that aren’t too far bleeding edge, unless it’s like a pure developer shop, and also can map to the least common denominator kinds of infrastructure that people have. Now, sometimes that may be serverless, which means, you know, more CDN use and abstracted layers in front, but for, you know, running your own components, we see a lot of differences but also a lot of commonality. It’s differences at the micro but commonality at the macro. And I don’t know what you see in your practice. So.

Corey: I will say that what I see in practice is that there’s a dichotomy where you have born-in-the-cloud companies where 80% of their spend is on a single workload and you can do a whole bunch of deep optimizations. And then you see the conglomerate approach where it’s giant spend, but it’s all very diffuse across 1500 different applications. And different philosophies, different processes, different cultures give rise to a lot of these things. I will say that if I had a magic wand, I would—and again, the fact that you sponsor and promote this episode is deeply appreciated. Thank you—

Avi: You’re welcome.

Corey: —but it does not mean that you get to compromise my authenticity and objectivity. You can’t buy my opinion, just my attention. But I will say this, that I would love it if my customers used Kentik because it is one of the best things I’ve ever seen to describe what is talking to what at scale and in volume without going super deep into the weeds. Now, obviously, I’m not allowed to start rolling out random things into customer environments. That’s how I get sued to death. But, ugh, I wish it was there.

Avi: You probably shouldn’t set up IAM rules without asking them, yes. That wouldn’t be bad.

Corey: There’s a reason that the only writable stuff that I have access to is generating reports in Cost Explorer.

Avi: [laugh]. Okay.

Corey: Everything else is read-only. All we do is to have conversations with folks. It sets context for those conversations. I used to think that we’d be doing this as a software offering. I no longer believe that actually solves the in-depth problems that people have.

Avi: Well, I appreciate the praise. I even take some of the backhanded praise slash critique at the beginning because we think a lot about, you know, we did design for these complex and often hybrid infrastructures and it’s true, we didn’t design it for the two or four router, you know, infrastructure. If we had bootstrapped longer, if we’d done some other things, we might have done it that way. We don’t want to be exclusionary. It’s just sort of how we focus.

But in the kind of customers that you have, these are things that we’re thinking about what can we do to make it easier to onboard because people have these massive challenges seeing the traffic and understanding it and the cost and security and the performance, but to do more with the VPC Flow Logs, we need to get some of those metrics. We think about should we make an open-source thing. I don’t know how much you’ve seen the concern that people have universally across cloud providers that they turn on something like Kentik, and they’re going to hit their API rate limiter. Which is like, really, you can’t build a cache for that at the scale that these guys run at, the large cloud providers. I don’t really understand that. But it is what it is.

We spent a lot of time thinking about that because of security policy, and getting the kind of metrics that we need. You know, if we open-source some of that, would it make it easier, plug it into people’s observability infrastructure, we’d like to get that onboarding time down, even for those more complex infrastructures. But you know, the payoff is there, you know? It only takes a day of elapsed time and one hour or so. It’s just you got to get a lot of approvals to get the kind of telemetry that you need to make sense of this in some environments.
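To illustrate the caching idea Avi raises, here is a minimal Python sketch of a time-to-live cache placed in front of a rate-limited metadata lookup. It is not Kentik's or any cloud provider's implementation. The names `TTLCache` and `describe_vpc` and the 300-second TTL are assumptions made up for this example, and `describe_vpc` is only a stand-in for a real cloud API call.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Remember the result of an expensive lookup for ttl_seconds."""

    def __init__(self, fetch: Callable[[Hashable], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        # Serve from the cache while the entry is still fresh.
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        # Otherwise call the underlying API once and store the answer.
        value = self._fetch(key)
        self._entries[key] = (now, value)
        return value


calls = []


def describe_vpc(vpc_id: str) -> dict:
    """Hypothetical stand-in for a rate-limited cloud API call."""
    calls.append(vpc_id)
    return {"vpc_id": vpc_id, "name": f"name-of-{vpc_id}"}


cache = TTLCache(describe_vpc, ttl_seconds=300)
for _ in range(1000):
    cache.get("vpc-123")

print(len(calls))  # 1: only the first lookup reached the "API"
```

The trade-off is staleness: a longer TTL means fewer API calls but slower pickup of renamed or newly created resources.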

Corey: Oh, yes. And that’s part of the problem, too, is like, you could talk about one of those big environments where you have 1500 apps all talking to each other. You can’t make sense of any of it without talking to people and having contacts and occasionally get a little bit of [unintelligible 00:29:07] just what these things are named. But at that point, you’re just speculating wildly. And, you know, it’s an engineering trap, where I’m just going to guess rather than asking someone who knows the answer because I don’t want to look foolish. It’s… you just spent three weeks chasing your own tail. Who’s the foolish one?

Avi: We’re not in a competitive business to yours—

Corey: [laugh].

Avi: But I do often ask when we’re starting off, “So, can you point us at the source of truth that describes what all your applications are?” And usually, they’re, like, “[laugh]. No.” But you know, at the same time to make sense of this stuff, you also need that metadata and that’s something that we designed to be able to take.

Now, Kubernetes has some of that. You may have some of it in ServiceNow, which a lot of people use. You may have it in your own text file, CSV somewhere. It may be in NetBox, which we’ve seen people actually use for the cloud, more on the web company and service provider side, but even some traditional enterprises are starting to use it. So, a lot of what we have to do as a vendor is put all that together because yeah, when you’re running across multiple environments and thousands of applications, ultimately scrying at IP addresses and VPC IDs is not going to be sufficient.

So, the good news is, almost everybody has those sources and we just tried to drag it out of them and pull it back together. And for now, we refuse to actually try to get into that business because it’s not a—seems sort of like, you know, SAP where you’re going to be sending consultants forever, and not as interesting as the problems we’re trying to solve.
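As a rough illustration of the metadata join Avi describes, the Python sketch below labels a flow record with application names taken from a CSV inventory. It is a toy under stated assumptions, not how Kentik does it: the CSV columns (`cidr`, `application`), the flow record fields, and the helper names `load_inventory`, `app_for`, and `enrich` are all hypothetical.

```python
import csv
import io
import ipaddress

# Hypothetical inventory; in practice this might be exported from a CMDB or IPAM.
INVENTORY_CSV = """cidr,application
10.0.1.0/24,checkout
10.0.2.0/24,search
10.0.0.0/16,shared-services
"""


def load_inventory(text):
    """Parse the CSV into (network, application) pairs, most specific first."""
    rows = csv.DictReader(io.StringIO(text))
    nets = [(ipaddress.ip_network(r["cidr"]), r["application"]) for r in rows]
    nets.sort(key=lambda item: item[0].prefixlen, reverse=True)
    return nets


def app_for(ip, inventory):
    """Return the application owning the longest matching prefix, if any."""
    addr = ipaddress.ip_address(ip)
    for net, app in inventory:
        if addr in net:
            return app
    return "unknown"


def enrich(flow, inventory):
    """Return a copy of the flow record with application names attached."""
    return {
        **flow,
        "src_app": app_for(flow["src"], inventory),
        "dst_app": app_for(flow["dst"], inventory),
    }


inventory = load_inventory(INVENTORY_CSV)
flow = {"src": "10.0.1.15", "dst": "10.0.2.7", "bytes": 4096}
enriched = enrich(flow, inventory)
print(enriched["src_app"], "->", enriched["dst_app"])  # checkout -> search
```

A linear scan is fine for a handful of prefixes. For thousands of applications, a radix tree or similar longest-prefix-match structure would be the more usual choice.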

Corey: I really want to thank you, not just for supporting the show of course, but also for coming here to suffer my slings and arrows. If people want to learn more, where’s the best place for them to find you? And please don’t respond with an IP address.

Avi: 127.0.0.1. You’re welcome at my home at any time.

Corey: There’s no place like localhost.

Avi: There’s no place like localhost. Indeed. So, the company is kentik.com, K-E-N-T-I-K. I am [email protected]. I am @avifreedman on Twitter and LinkedIn and some other things. And happy to chat with nerds, infrastructure nerds, cloud nerds, network nerds, software nerds, debate, maybe not VI versus Emacs, but should you swap space or not, and what should your cloud architecture look like?

Corey: And we will, of course, put links to that in the show notes.

Avi: Thank you.

Corey: Thank you so much for being so generous with your time. I really appreciate it.

Avi: Thank you for having this forum. And I will let you know when I am down in San Francisco with some time.

Corey: I would be offended if you didn’t take the time to at least say hello. Avi Freedman, CEO at Kentik. I’m Cloud Economist Corey Quinn, and this has been a promoted guest episode of Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry comment saying how everything, start to finish, is somehow because of the network.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
