A Conversation on Cloud WAN with Kris Gillespie

Episode Summary

Kris Gillespie, lead platform engineer for Silverflow, joins Corey Quinn on "Screaming in the Cloud" to talk about Cloud WAN's exciting new role in cloud networking. Kris explains Silverflow's journey, from the original problems with network scalability and the resolution of IP conflicts, to fully utilizing Cloud WAN for global connectivity and easier network management. Kris, who enjoys simplifying complex network architectures, discusses how Cloud WAN has enabled Silverflow to seamlessly integrate between regions and cloud providers, meeting their mission-critical needs for low latency and reliable transaction processing. Listen in to see how Cloud WAN has transformed the approach to solving fundamental network problems, demonstrating the importance for companies and engineers of knowing how to navigate the constantly evolving cloud landscape.

Episode Show Notes & Transcript

Show Highlights:

(00:00) Introduction to the show
(01:57) Kris recounts Silverflow's initial challenges and the discovery of Cloud WAN
(04:15) The advantages of Cloud WAN over traditional transit gateways
(08:35) Infrastructure management with OrgFormation
(12:15) Insights into the use of historical and current networking technologies
(21:13) Challenges and implications of transitioning to IPv6
(33:10) Kris highlights the real need for Cloud WAN
(37:50) Closing remarks

About Kris

Kris is a 28-year industry veteran. He started in '95 back in Australia on the help desk for the first ISP in the country. Since then, he has moved to the Netherlands, switching roles between network, systems, and storage engineering. During this time, he has been involved in developing certifications for both IBM and (the now-defunct) EMC, among others, and has worked heavily in the finance/banking sector. For the last 10 years, he has been keenly focused on the cloud space and has combined these skills into what's popularly called a "Platform Engineer."

He currently works for a payments processing startup, Silverflow, as their Principal Platform Engineer, leading their Platform team and ensuring the platform can scale globally.

Links Referenced:

LinkedIn: https://www.linkedin.com/in/krisgillespie/
Blog: https://blog.viking-ops.io/

Transcript

Kris: We kind of had to wrap our head around the cost and how it will justify it, and now we’re at the point where any feature team can just decide to roll out a new VPC, and it automatically goes, and grabs the allocation, and it’s done. And there’s no conflict, everything works.

Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. One of the fundamental things about cloud is something that’s largely been abstracted away, almost to the point where it’s become something of a lie, which is that the network that really exists in cloud is not necessarily what is presented to us as customers. It’s a polite fiction, which is, I guess, a different way of saying virtualization. But what’s happening there is often not understood at almost any level, just because that skill set is not as front-and-center as it once needed to be. I had a conversation with today’s guest about this exact thing at re:Invent, and I figured we’d have this conversation in a longer format where my voice wasn’t basically down to a croak. Kris Gillespie is a principal platform engineer at Silverflow. Kris, thank you for agreeing to do this in a more public format.

Kris: Oh, you’re welcome. I actually really enjoyed our conversation at re:Invent, even though that was quite a public conversation. Let’s see if this can be more useful to your listeners out there as well.

Corey: What I love about re:Invent—and honestly, all conferences—is the hallway track. You get to have conversations and things just sort of pop up out of nowhere because you didn’t really know that they were coming. As I recall, what really got us started talking was that you were the first, and so far only, Cloud WAN customer that I have found in the wild. Cloud WAN being yet another networking service that is poorly described by AWS, and even more surprising on some level, you were brimming with praise about it. So, terrific. Someone will talk to me about it who doesn’t actually work on the product at Amazon. What’s the deal?

Kris: The journey to Cloud WAN came completely by accident. When I started working at Silverflow—they’re a payments processing company, so the idea is basically to have this one API which we can do transactions on, but then scale it globally, right? So, have it in the US, Europe, Asia Pacific, wherever. So, we had everything running in one region because we’re based in Europe or in the Netherlands, but you know, we were talking to some possible customers in North America, and then we started to think, like, okay, how do we actually scale this properly? Because everything has been designed, you know, following best practices, so we have VPCs with /16s everywhere.

Or when I say we, I mean, when I walked in the door, they were there, and I’m like, okay, /16s. Feels a bit wasteful, but okay. And then we got to the point where we’re like, okay, how do we scale this? How do we actually take these 30, 40, 50 accounts with VPCs everywhere, /16s everywhere, and now replicate that two, three, four, five, six times?

Corey: Without having IP conflicts up the wazoo, which is always fun when you start peering things together that were not designed to be peered.

Kris: Tell me about it. We already had IP conflicts, and we were already at the point where it’s like, okay, somebody mentioned somewhere that it’s possible to—well, of course, you can do transit gateway peering, but that dynamic routing will come eventually. So, my mission, not last year but the year before, at re:Invent was to find somebody who works at AWS that could tell me about it. So, I just went hunting for that person. And yeah, eventually they told me about Cloud WAN. And that was, like, the aha moment.
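To make the conflict problem concrete, here is a minimal sketch of how overlapping VPC ranges can be detected, using Python's standard `ipaddress` module. The account names and CIDR blocks are hypothetical, chosen only to mirror the "/16s everywhere" situation Kris describes; this is not Silverflow's tooling.

```python
import ipaddress
from itertools import combinations


def find_conflicts(vpcs):
    """Return sorted pairs of VPC names whose CIDR blocks overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in vpcs.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(sorted(nets.items()), 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical VPCs: the same /16 was reused when a second region was added.
vpcs = {
    "payments-eu": "10.0.0.0/16",
    "payments-us": "10.0.0.0/16",
    "tooling-eu": "10.1.0.0/16",
}

conflicts = find_conflicts(vpcs)
print(conflicts)
```

Any pair this reports cannot be routed to each other directly, which is why replicating a fixed set of /16s across regions stops working as soon as those regions need to talk.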

Corey: I keep wanting to play with it. One of the problems I’ve had with it historically has been that at its minimum scale, I think it’s something like $500 a month to run the smallest possible expression of it, which isn’t particularly useful because you want to have multiple things talking to it. So, at that point, we’re talking thousands of dollars a month for me to build out anything that remotely resembles a reasonable test lab. And sorry, Amazon, I’m not quite at a point where I’m just willing to throw that kind of R&D budget at explaining your own services for you because you’re bad at it. So, I just wait until I encounter people in the wild, and then, like any good consultant, I turn other people’s production into my test accounts. So, I’m glad to hear that it’s solved some of the painful parts for you. What was—what were you—what exactly was it that things like transit gateway and a variety of convoluted peering setups weren’t getting done for you?

Kris: I mean, transit gateways work fine in a regional context. So, yeah… you can actually share them across accounts, and the—but they’re a regional service. Once you start wanting to connect multiple regions together, the connection between them is, well, the routing is static, so you need to have some sort of Lambda, updating routing tables on your transit gateways to maintain any kind of, I don’t know [laugh]… let’s say, accurate view of your network, especially if you have things like, you know, VPN connections or, you know, customer gateways or anything that’s kind of dynamic. So yeah, you’re kind of mixing dynamic with static, and then you need to add some extra, yeah, Lambdas and stuff that you have to write yourself to manage this. And I’m thinking, why? Why should I write some TypeScript code or Python or whatever to manage networking that you should do?
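The kind of glue code Kris is objecting to might look roughly like the sketch below. It only computes which static routes would need to change; the function name, the `{cidr: attachment_id}` shape, and the attachment IDs are all assumptions made for illustration. A real Lambda would still have to read the current route table and apply each change through the EC2 transit gateway route APIs, which is omitted here.

```python
def plan_route_changes(current, desired):
    """Diff two {cidr: attachment_id} maps into routes to add, replace, and delete."""
    to_add = {c: a for c, a in desired.items() if c not in current}
    to_replace = {
        c: a for c, a in desired.items() if c in current and current[c] != a
    }
    to_delete = sorted(c for c in current if c not in desired)
    return to_add, to_replace, to_delete


# Hypothetical state: static routes on a transit gateway in one region...
current = {
    "10.1.0.0/16": "tgw-attach-peer-us",
    "10.2.0.0/16": "tgw-attach-peer-ap",
    "172.16.0.0/24": "tgw-attach-vpn-old",
}
# ...versus what the network looks like after a VPN moved and a VPC was added.
desired = {
    "10.1.0.0/16": "tgw-attach-peer-us",
    "10.2.0.0/16": "tgw-attach-peer-eu",
    "10.3.0.0/16": "tgw-attach-peer-ap",
}

to_add, to_replace, to_delete = plan_route_changes(current, desired)
print(to_add, to_replace, to_delete)
```

Even this toy version has to run on a schedule or on events, handle partial failures, and exist once per region, which is the maintenance burden that dynamic routing is meant to remove.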

Corey: Right. I was under the impression that ‘router’ was a piece of hardware, not a job description for someone. So there’s a qu—you’re a half-step above passing packets by hand at that point.

Kris: Basically, yes. And if you’re talking about VPC peering, then that’s just like everything there, everything there, and you sort things out through NACLs. And I never ever want to deal with NACLs. That’s like, I just want to kill myself, then.

Corey: Even AWS’s policy on network ACLs has been, don’t use them unless you have a hard-bound requirement where you must use them. Use security groups. Because it’s one of those hidden things you don’t think about or see. They’re not very granular, you’re limited in how many you can have applied, and when something isn’t working that should, you will drive yourself up a wall before you remember that they’re there.

Kris: Exactly. You need to have logging everywhere, you know, so you need to turn it on here, here, here, here, and here, and then try to correlate it all the way through. So, as far as visibility on the network layer, it’s also quite a challenge. That’s also something that we’re trying very hard to solve. But again, that then leads you down the path of more AWS services.

Corey: I was a grumpy Unix sysadmin who then learned Linux for a job, and one year, we had the 2008 financial crisis. Suddenly, no one’s hiring, and salary freezes across the board, and I was fairly bored at work. So, I wound up spending that year getting my CCNA because it was, okay, what area am I hand-waving over in my day job that I feel like I really should know more about? So, networking was an easy answer. I’m not saying I’m any great shakes at it, but I definitely understand it a lot better than I did, and it made me a better systems person as a direct result.

And over the, dear Lord, almost 20 years since, I’m realizing just how rare I am just with that surface-level understanding of a lot of networking concepts. Because in cloud, you don’t actually have to think about the network at all, until suddenly, one day you very much do, and by that point, you don’t even know where to begin. It’s an entire area that has sort of slipped below the surface level of awareness, that is still critically important because a computer without a network is basically an expensive space heater.

Kris: It’s actually really, I would say ironic, as well. So, my background is initially on, let’s say, helpdesk systems, then I went into networking, storage engineering, back to systems, back to networking, and now all that, kind of, wraps together, and you call it let’s say… platform engineering. I mean, I don’t even know what to call what I do anymore. But when I was hired for my job currently, they had no idea that I had any networking experience. It was just—because my last, like, say 8, 10 years has been largely focused on let’s say, cloud, and so that, in most people’s minds, is infra, Infrastructure as Code, CI/CD, and that’s about it.

Corey: And people instead wind up getting judgy and annoying because in job interviews, you’re bad at doing things like implementing quicksort on a whiteboard.

Kris: I can’t do any of that. I am not a developer, right?

Corey: See, I used to say the same thing, and then I realized that I was writing an awful lot of Configuration as Code and a lot of scripts that were getting fairly up there, and Python just started to intrinsically make sense. And now, of course, I write the most common programming language, which is YAML. And now I have this amazing alchemical ability to turn YAML files into AWS bills. It’s kind of amazing. It’s a horrible party trick, and you’re never invited back to that party.

Kris: No. Well, I mean, yeah, so we actually use a hybrid, or I would say, Frankenstein’s monster version of CloudFormation, called OrgFormation. Let’s say, it’s organizationally aware CloudFormation. So, you can very easily deploy a stack—so all this YAML—across 20 accounts, 30 accounts. Yeah, if you make one little boo-boo, and all of a sudden, you’ve deployed core network edges in 20 regions, then your bill goes [whistle-up noise]. So yes, we can very much scale our costs, much to the delight of our TAMs.

Corey: What’s fun about a lot of that, too, is that it’s this whole world of, okay, what am I going to do? I’m going to teach an AWS service about multiple accounts and/or multiple regions—on fun days, possibly both—and you’re spackling over things that AWS really should be doing for us as customers, but haven’t gotten around to yet. And let’s be clear, I know that these are hard problems to solve. They’re also a division of a $1.5 trillion company, and I don’t think that asking the rest of us to do volunteer work to spackle over their faults is necessarily fair, either. Like at some point of scale, the burden shifts.

Kris: Yeah. I mean, we’re at the point now where—and this will sound actually crazy—but we’re building a development environment specifically around Cloud WAN because now it’s getting to the point where we can’t [unintelligible 00:09:52] changes anymore. They’re becoming too… well, too impactful, so we need to—we need an environment where we can actually test theories, and, you know, create test cases, and make sure that what we do is correct. But I’m at the point where I’m like, I also want to ask our TAMs to, let’s say, contribute to this environment because it’s really expensive for us just to test things that we don’t have any other way of testing.

Corey: The networking blast radius is always somewhat terrifying to me. I don’t know about you, but when I was working on networks, once upon a time, one of the first things you learn to do is you set up a cronjob or an [atjob 00:10:27] on whatever it is you’re working on, usually a firewall. And after some elapsed period of time, it automatically reverts to the previous config. Or you even run the command, put a sleep in there. Because if the change doesn’t work, suddenly, you’ve locked yourself out, and you now get to either drive across town to the data center or open a remote [unintelligible 00:10:46] ticket, or have all kinds of fun things that happen as a result. And it’s kind of scary. I’ve been doing a lot of home networking lately, and even now, it’s like crap, I have to go downstairs again.

Kris: My first job in the Netherlands, that was almost the first thing that I did was I… spaced out—in that second [I 00:11:03]—this is like 2001, so I’m working on a Cisco router, get the config, and of course, it didn’t go into my mind that as soon as you add the first firewall rule, it has an implicit deny after it. So, I didn’t put the office in there. So, I’m running out the door to the car, as everyone goes, “Hey Kris, it’s”—and I’m already gone. I’m, like, going to the data center to go and fix that. It’s, yeah.

Corey: Like, I am 99% sure I know exactly what’s about to come out of your mouth [laugh]. Yeah.

Kris: And I also think that’s also why people are shy of it as well. Because once you really start to get your hands dirty in this domain, the impacts of any, you know, boo-boo can be immense.

Corey: It also seems to me that networking has been very slow to embrace change. And again, I can say as a systems person, that’s not necessarily intended to be pejorative. Again, the scale is massive, mistakes matter, and it can be very convoluted. But it seems that it’s still in its infancy, when it comes to the idea of programmatic control. Everything seems to be done by hand, and then applied as weird one-offs. There’s no testing facility to speak of.

I still remember, in my very early days, having to patch a monstrous Perl script called rancid so that it could speak to Radware load balancers. And all this monstrosity did was it logged into a variety of network equipment that you’ve told it to, grabbed the copy of the config on the thing, and then committed it to Subversion—which was a Git precursor—and then it would email diffs out, and then you had an entire history of the change on these things. But the fact that that was an industry standard for as long as it was—and I really hope it’s not now, but it probably still is—is wild.

Kris: Oh, I’m sure it’s sitting somewhere in a dark corner of a data center, still humming along. I would not be surprised. But even the tech, right, I mean, BGP as a protocol still runs the internet. It’s, what, 30-plus years old now, and even in AWS, under the hood, you know, the transit gateways—or in Cloud WAN itself, you know, so the core network edges, and everything—is just talking BGP. In fact, one of the things that I find most funny is the Connect attachments. So, on transit gateways and on Cloud WAN, you have what’s called a Connect attachment, which is a way you can connect a third-party device—so like, maybe a software-defined networking device from someone, like, I don’t know, Fortinet, or Aviatrix, or whoever, even Cisco—but it’s actually over a GRE tunnel. And I don’t know how old you are, but a GRE tunnel is, like, ancient tech.

Corey: Eucalyptus, as I recall, if not OpenStack, used to do their entire fake presented network layer by having everything running through GRE tunnels between the physical hosts, and it would just build abstraction layers within them. So yeah, I’m old is the short answer to that. Been there, done that, have the battle scars from the rack nuts that I was working on that week.

Kris: I’ve gone gray from all this networking in the last few years.

Corey: Oh, everyone working in networking is old. It was great. It’s like, “Oh, you have gray hair.” It’s like, “Wow, what was it like back in your day, Grandpa?” It’s like, “I’m 24 years old. What are you asking here?” Yeah, it ages you. It really does.

Kris: Yeah. That’s what we’re knee-deep in at the moment, so we’re a busy… we had to re-provision most of what we already had because we had IP conflicts everywhere. We actually used another service, which the pricing for it drives me crazy. It’s IPAM. That’s amazing. You pay for the IPs that are under management. Think of that.

Corey: I love that. It’s basically the world’s most expensive version of Microsoft Excel. I understand the value of it because I look at this; it’s like okay, here’s a list of IP addresses that have been allocated across your entire AWS estate. Great. And it charges you per IP address per month in that thing, and it sounds like this is a job for a spreadsheet.

Where it starts to add value is, okay, pretend it’s not just you or a team of three people. Imagine now that you’re a giant multinational, and you have an entire number of divisions that are all contributing to this IP scheme and the rest. How do you wind up tracking all of that in one central place? It quickly becomes worth its weight in gold. Before that, when I was in my on-prem days, the gold standard for this—because it, surprise, surprise, turns out that spreadsheets don’t scale super well—was a company called Device42, which was a great way of having even rack-level inventory. And of course, my Route 53 is a database joke started by annotating VMs with TXT records saying what physical hosts they were on.

Kris: [laugh] Yeah. That’ll work.

Corey: Tracking this stuff is hard.

Kris: Yeah, absolutely. But yeah, I mean, for us IPAM, we kind of had to wrap our head around the cost and how it would justify it, and now we’re at the point where any feature team can just decide to roll out a new VPC, and it automatically goes, it grabs the allocation, and it’s done. And there’s no conflict. Everything works. It attaches to Cloud WAN, routes are there, and then from our perspective now, the management is, like, next to zero. It just, everything is automated. So, from that perspective, we’re very, very happy.
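The workflow Kris describes, where a team asks for a VPC and gets a range that cannot conflict, boils down to first-fit allocation from a shared pool. The sketch below is a toy in-memory version built on Python's standard `ipaddress` module; the `TinyIpam` class, the pool, and the team names are hypothetical, and it says nothing about how the AWS service is implemented.

```python
import ipaddress


class TinyIpam:
    """Toy first-fit allocator that hands out non-overlapping CIDRs from one pool."""

    def __init__(self, pool_cidr):
        self.pool = ipaddress.ip_network(pool_cidr)
        self.allocated = {}  # owner -> allocated network

    def allocate(self, owner, prefixlen):
        """Give `owner` the first free block of the requested size."""
        if owner in self.allocated:
            raise ValueError(f"{owner} already has an allocation")
        for candidate in self.pool.subnets(new_prefix=prefixlen):
            if not any(candidate.overlaps(n) for n in self.allocated.values()):
                self.allocated[owner] = candidate
                return candidate
        raise RuntimeError("pool exhausted")


ipam = TinyIpam("10.0.0.0/8")
team_a = ipam.allocate("team-a", 20)
team_b = ipam.allocate("team-b", 22)
team_c = ipam.allocate("team-c", 20)
print(team_a, team_b, team_c)
```

Because every request goes through one allocator, no team can pick a range that collides with another, which is the property that makes the per-address charge easier to justify.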

Corey: When I see an AWS service and the pricing just strikes me as Looney Tunes, my default assumption—especially these days—is okay, that probably means that I’m not thinking about it in the right way and/or I’m not the target market. Even at small-scale, things like the Managed NAT Gateway, at, you know, 30 bucks a month for the instance hours, so it offends independent learners, but at small-scale, okay, it starts to make sense. But it really gets problematic when it’s—and you’re costing $30,000 a day on just the data you’re shoving through the things. What on earth is going on over there? It’s a… the pricing becomes architecture.

And I think that is a big problem right now is that networking in a cloud context is so radically different economically from anything you wind up doing on-prem. Where ports to—bandwidth to the internet, for example, is generally charged at 95th percentile. So, you basically wind up having every five minutes [unintelligible 00:16:50] over the course of the month, sorting them from largest to smallest, chop off the top 5%, and whatever the next one is, that’s how much it costs you for the month. So yeah, get that wrong, and you’ll wind up having significant overage charges. But once you pay for the size of the pipe, it can be bored or it can be saturated, and it doesn’t matter at all.

Now suddenly, every byte passing through is metered and charged for, combined with instance hours, which in a home lab, has never really been a thing. Increasingly, it means there is no home lab story, for an awful lot of AWS’s increasingly impressive networking options. You have to almost find a company that’s using these things, because they can justify the developer environment, but as independent learners, we just can’t.
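The 95th-percentile billing Corey describes can be sketched in a few lines: sort the five-minute samples from largest to smallest, discard the top 5%, and bill at the next value. Exact rounding and sampling conventions vary by provider, so treat this as an illustration of the idea; the function name and the sample figures are invented.

```python
def billable_rate_95th(samples_mbps):
    """Return the 95th-percentile rate: drop the top 5% of samples, bill the next one."""
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mbps, reverse=True)
    discard = len(ordered) * 5 // 100  # how many of the largest samples are ignored
    return ordered[discard]


# 100 hypothetical five-minute samples: a steady 100 Mbps with short bursts to 900 Mbps.
five_bursts = [100] * 95 + [900] * 5
six_bursts = [100] * 94 + [900] * 6

print(billable_rate_95th(five_bursts))  # bursts fit inside the discarded 5%
print(billable_rate_95th(six_bursts))   # one burst too many sets the bill
```

A 30-day month has 8,640 five-minute samples, so the 432 largest (36 hours' worth) are ignored. Bursting for less than that is effectively free, which is the opposite of a model where every byte is metered.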

Kris: No, I mean, so even for myself, you know, personally, I kind of have the ambition to write about this as well. So, I’m halfway through it, but I sometimes toy with the idea of, you know, spinning some of this stuff up to, you know, create some nice, you know, pictures or whatever. But I was like, no. I am not going to pay for this. Not even close to it. It’s way too expensive. It’s—and then yeah, I don’t even know how even smaller companies could even experiment with it because, you know, if you forget to turn it off for a month, you’ve all of a sudden got a 2, 3k bill that you never expected, which yeah, can hurt quite a lot.

Corey: It’s one of those big challenges. Back when I was learning this stuff myself, Cisco had a reasonably decent program—that I’m sure still exists because it’s written in Java; that stuff lives forever—called Packet Tracer. And you could wind up building fake networks because you didn’t have a spare quarter million dollars to buy one of their Catalyst switches that did this stuff at scale, so you could set these things up and learn how to do the configuration and the rest, and it mostly worked. Where’s that equivalent in AWS-land? It’s a problem from a home lab perspective.

I remember getting old Cisco gear off of eBay, or from employers that were decommissioning stuff, just to build out a somewhat reasonable home lab, but that is such a different scale, even now, compared to what the actual networking concerns of big companies tend to be. Just because small networks are mostly—and I’m going to get yelled at for this—but mostly a solved problem.

Kris: Oh yeah, absolutely. I mean, I consider myself a bit of a networking guy. But at home, I just use, for example—and I’m not pimping any—or I’m not pushing anything at all, but I love the UniFi gear because 99% of the work is done for me. I just put it in, set up something that works, and it just works. So, I don’t want to think about it.

Corey: I’m still running that for Wi-Fi here. I didn’t love their AWS security breach that they didn’t fully disclose when the indictment came out. They had been hiding it. They sued Krebs for reporting on it in ways they didn’t like, but it’s still—it is—everything else is either way more expensive or back in the days of flashing Linksys all-in-one Wi-Fi nonsense thing with OpenWrt or DD-WRT or something just so you could actually get more capabilities.

Kris: Exactly that. So, home networking is solved. And I mean, even, you know, in my experience with most engineers—and I’m not talking down to anybody at all because I know that, you know, there’s a thousand topics out there—but I would say, as a good systems engineer—which is what I really think I am; well, good is debatable—but I think the more exposure you have to all the aspects, right—because the system doesn’t sit there in isolation, right? It’s connected to a network, it has storage, it has all these things—the more you can dive into any of these topics, the better you will be at your job, the better you’ll be able to help the developers or the feature teams or, you know, explain things to your management so that they can make better decisions with the budget or whatever, right? So, I don’t understand why people avoid these topics.

Corey: And so, much of it comes from getting it wrong the first time. I got yelled at for this years ago by a boss who didn’t understand this concept. And even now at home, my network here is 192.168.1.0/24. In other words, there are 254 usable IP addresses that I can have on the network. When it came time to build a separate IoT network, the common thinking is, oh okay, so put it in the next block up, so 192.168.2. And down that path lies madness. I picked the 192.168.128, and tha—because if I need to expand either network, there’s massive amounts of headroom. Like, do you really see a scenario when you’re ever going to have more than 254 IP addresses on a home network? It’s, have you met Kubernetes? It eats IP addresses like it’s nobody’s freaking business.
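Corey's headroom argument can be checked with Python's standard `ipaddress` module. The sketch below uses his two real ranges plus the "next block up" alternative, and a hypothetical helper, `growth_room`, that widens a network one bit at a time until it would collide with its neighbour.

```python
import ipaddress

main = ipaddress.ip_network("192.168.1.0/24")
adjacent = ipaddress.ip_network("192.168.2.0/24")   # the "next block up" choice
iot = ipaddress.ip_network("192.168.128.0/24")      # the range Corey picked

# A /24 has 256 addresses; the network and broadcast addresses are not usable.
usable = main.num_addresses - 2


def growth_room(net, other):
    """Return the largest supernet of `net` that still does not overlap `other`."""
    grown = net
    while grown.prefixlen > 0:
        candidate = grown.supernet()
        if candidate.overlaps(other):
            break
        grown = candidate
    return grown


print(usable)
print(growth_room(main, adjacent))  # can only double once before colliding
print(growth_room(main, iot))       # can grow to half of 192.168.0.0/16
```

With the neighbouring /24, the main network can widen to a /23 at most; with the IoT network parked at 192.168.128.0, each side can grow all the way to a /17 without renumbering the other.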

Kris: My oven has an IP address. I mean, everything now is connected, right? So, it will only get worse.

Corey: And you saw the AWS change—as we record this on January 31st, it takes effect tomorrow—which is specifically that every IPv4 address will cost roughly 3 to $4 per month, regardless of whether it’s attached or not. That means that that’s going to cost about $43 every year for every IPv4 address. And when this was first launched, I was pretty enthusiastic about it because that’s great. It’s driving IPv6 adoption. The problem is, is that people are slow to adopt IPv6 when they’re at AWS working on service teams. There are a laundry list of AWS services that flat-out don’t work that require these things. So, the price of a whole bunch of things has just gone up, and people are about to be surprised and then some, when they see what happens to their bill at the end of February. I’m expecting my phone to basically explode.
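The figures Corey quotes follow from simple arithmetic. This sketch assumes the published rate of $0.005 per public IPv4 address per hour, which is consistent with his "3 to $4 per month" and "about $43 every year"; the 730-hour average month and the ten-address fleet are illustrative assumptions.

```python
from decimal import Decimal

HOURLY_RATE_USD = Decimal("0.005")  # assumed per-address hourly rate
HOURS_PER_MONTH = 730               # average month, an approximation
HOURS_PER_YEAR = 8760               # 365 days


def monthly_cost(addresses):
    """Approximate monthly charge in USD for a number of public IPv4 addresses."""
    return HOURLY_RATE_USD * HOURS_PER_MONTH * addresses


def yearly_cost(addresses):
    """Yearly charge in USD for a number of public IPv4 addresses."""
    return HOURLY_RATE_USD * HOURS_PER_YEAR * addresses


print(monthly_cost(1))    # one address for a month
print(yearly_cost(1))     # one address for a year
print(monthly_cost(10))   # a hypothetical small account with ten addresses
```

The per-address number is small, but it multiplies directly by fleet size, so an estate with hundreds of thousands of public addresses reaches the "many millions of dollars a year" range Corey mentions next.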

Kris: Yes. I mean, it’s—I guess, they need to raise their invoices a little bit.

Corey: I have a customer who will be charged many millions of dollars a year for this change. One. I say it like there’s only one of them. And the official response is, “Great, well, what about bring-your-own-IP? We don’t charge for that. You can bring your own IP allocation and use that.”

Oh, great. So, I’m just going to re-IP every device I have that’s public-facing and talks to customers. Who can do all that work for communication and networking and avoiding outages, jackhole? You? I didn’t think so, so I guess I’m going to take it on the chin and pay the millions of dollars. And it’s a… it’s just a disaster.

Kris: Well, I mean, even if you want to bring your own IPs, then you need to go to, well, ARIN or APNIC or RIPE or one of those organizations to even beg.

Corey: Oh, these days, you’re going to the secondary markets. Like, they’re not passing out—

Kris: No.

Corey: —the stuff anymore.

Kris: It’s [unintelligible 00:23:00].

Corey: And the stuff that they occasionally do is, like, coming out of bogon space and whatnot. It’s like, hey, do you want some IP addresses that, like, a good third of the internet’s [laugh] [unintelligible 00:23:07] devices refused to acknowledge is valid and will drop your packets on the floor?

Kris: Wow, the bogon network.

Corey: It’s like there are actually some people whose sites I think should live in that IP space, but that’s just me being small and petty.

Kris: It’s fine. I mean, everybody has their way [laugh].

Corey: Exactly. It’s been a really interesting ride just watching the evolution of AWS networking as it’s gone from, like, originally with EC2 Classic—which was just called EC2 back in those days—everything was a big flat network and you’d better be good with security policies. And then they build abstractions on top of it, and abstractions on top of that. Easy example: a public versus private subnet is simply a human convenience. There is no declaration public or private in their APIs. It’s simply a question of does this get IP addresses assigned to it, yea or nay? And, oh, is there a NAT Gateway to let it speak to other things?

Kris: Yeah. I mean, these are, as you say, just for human convenience, but they don’t actually do anything. And you could argue that, you know, IPv6, when it actually eventually comes through, will be kind of an interesting thing as well, right? Because there’s no concept of [NATting 00:24:13] with IPv6. So, a lot of, say, a lot of the ways people consider security will be, well, changed because you can’t hide behind a NAT Gateway.

So, now every device that you have, theoretically, if you don’t do your policies correctly—so your security groups or whatever—will be accessible, including in your home, right? So, that’s a very interesting thought that probably lots of people haven’t really considered.

Corey: Yeah, my ISP doesn’t support IPv6 natively, unfortunately, so I set up a tunnel for a bit. And that was great, and the firewall rules are super important, but what I found that was disheartening was how many things still broke. Logging into some services would have some sort of application firewall that just would hang my connection, and it would never load. And it took me a bit to figure out what it was until I forcibly disabled IPv6 on that node, and suddenly, I could log into web pages. I found that I have a few IoT things that are now leaking Matter IPv6 addresses onto, you know, my actual home network. And it’s okay, so why does my computer have seven different IPv6 addresses here with one interface? That seems a little off. And it’s oh, dear Lord, we’re in no way ready for this.

Kris: No. And it’s funny because, like, for as long as I’ve even been in the Netherlands, you know, going to my first RIPE meeting, they were talking about, you know, the IPv6 uptake. And it’s like, there was this sad little graph which, like, barely moved up, right? And I think even now, it’s probably slightly up, but I am curious whether, before I retire, IPv6 gets anywhere further.

Corey: I think that we’re going to see a lot more interest in it just as soon as people start realizing just how much this is costing them.

Kris: Yeah. I mean, cost is usually a driver to almost any kind of change like this.

Corey: Other cloud providers have been charging the same effective fee for many years. The difference is that they adopted this either from the outside or when they were a lot smaller. They didn’t wait until 2024 when a decent percentage of the internet was going through them as the world’s largest cloud provider. I think that it is going to be wild. It’s going to make AWS billions of dollars a year, which, okay, good for them.

Watch them attribute it all to how good they are at generative AI, but… okay. It just feels like on the one hand, it’s rent-seeking, but on the other, I do understand it. These are a scarce and diminishing resource, you need to manage it well, you’re not allowed to get any more of them, and it costs them a giant pile of money to acquire these things. How do the economics balance?

Kris: From that perspective, I do understand it as well. And at least from my organization, we are extremely lean on the, let’s say, externally facing IPs that we even have. I think maybe we have a couple of hands’ worth of external IPs. So well, at least for us, we’re not particularly worried.

Corey: Yeah, we’re in the same boat, but our company account is relatively small, like 500 bucks a month. And it’s going to go up by about 10% based on this charge. And I’m not going to sit there and try and hunt down the, what is that, four—or what is that, ten or eleven IP addresses across the entire estate, I’m not going to hunt that down with prejudice; it’s not worth my time. But it is a noticeable bump on what I’m paying AWS. And that’s not because I’m being irresponsible.

Kris: Well, of course, it’s what others have been doing for the longest period, and it’s a form of rent-seeking as well. So yeah, maybe to shift the topic a bit back to Cloud WAN, if you don’t mind—

Corey: By all means. Please.

Kris: —one other little thing which is also kind of interesting, is that the industry that we’re in—or at least my company is in, payments processing—you can imagine that we’re going to work with—hopefully, if we are successful, which of course, my dear overlords will ensure—we will work with a lot of larger companies that probably don’t like Amazon, right? They don’t want their things, their credit card transactions, processed on Amazon.

Corey: I have worked with a number of those companies myself.

Kris: So, you have data in transit and you have data at rest. Almost everybody cares about data at rest. Data in transit people care, but I mean, there’s more of a gray area there, I would say. You can play with it. One of the aspects that we’re also looking into—and the rest of my engineering team will beat me up when they hear this, but I mean, everybody knows that this is going to happen eventually—but how we built our platform is to separate, let’s say, the back-end connectivity, so how we connect all the different card schemes—so, you know, all the credit card schemes—we separated that.

Actually, you can imagine it, like, a kind of, you know, stack, where we do the actual workload, so all the workload processing, all of that is up above, we have Cloud WAN in the middle as, like, this nice glue to kind of connect everything together and to do the separation, and at the bottom side, we do all the connectivity. So, the really expensive stuff at the bottom, and all the stuff that we can push out everywhere, at the top. Um… sorry, I don’t want to do too much of a big monologue here.

Corey: No, no, please. This is fascinating. Tell me more.

Kris: The reason why we did it this way is because the bottom part is very expensive, right? So, we’re talking data centers, we’re talking physical connectivity, we’re talking all these kinds of things, in Europe, North America, Asia, everywhere. These are expensive. But the ones at the top are purely AWS, right, so we do everything as much as possible in AWS, leaning on their services as well. This means that we can also deploy in any region very quickly and hook it up via Cloud WAN down into the various, you know, card schemes very quickly.

So, if a customer calls up and says, “Hey, I’m in Tokyo,” but we don’t have anything there, but we can hook it up via—well… even a Local Zone—which I learned a lot about at re:Invent—back through Cloud WAN, and then back to the, you know, processing in Europe or North America, depends which one makes sense, then we can be live within, like, days, weeks. Of course, you know, customer integration time takes a long time, but like, we can be ready for them to start, you know, integrating and testing within, like, a day, in logical terms. It’s extremely quick. But what if we need to go to other cloud vendors, right? So say, you know, someone says, “I’m not going to touch Amazon at all. You’re—I—no. I don’t want that.”

So, this is where those connect attachments come in because then we can do an SDN device, or we can even be in a different cloud, right, because it’s just a GRE tunnel. So, then we’re talking, like, okay, we have a connect attachment to an SDN device and we connect that across to Azure. Now, we’ve bridged that gap, so all our very expensive stuff can stay in one place, but now we can expand very easily into other cloud vendors.

Corey: Without having to go to all the trouble of trying to get incompatible interpretations of IPsec working between providers, and getting the security groups working, the routing, and the rest. I talked to a company that spent four months on that before giving up completely and deciding to take a different approach.

Corey: Yeah. I’m looking forward to seeing how that winds up branching out. I think we need to see more customers using it that way and building tooling around it. I mean, historically, things like Terraform arose because everyone is trying to solve the exact same problems. This feels like it’s a lot more rarefied as far as who’s going to experience these particular requirements. That number only grows with time, but I think it’s just going to take a while for us to start seeing that awareness trickling into the mainstream.

Kris: For us, the real need comes from low latency, right? So, if you’re at a restaurant, and you know, you have your, you know, you have your card on your phone, you want to tap it down, you don’t want to sit there for 30 seconds to a minute waiting for it to—“Come on, come on, come on, come on.” You want to tap it and go, right? You just want it to work now. So, the number one thing that we off—actually, we have two things: latency and never drop an authorization, right? You don’t want to be double-charged, you don’t want to have it go beep, and then nothing, right? That’s impossible, right?

So, can’t lose anything, and it must be fast. So, those two requirements give us quite some budget in the networking space, right? So, I can understand that it’s also not something that a lot of companies would use either, right, because it is quite a niche problem to have. But I mean, even if you have, you know—I don’t know, if you’re in a multi-regional setup, and you have any kind of external connectivity, then this is where it starts to really make sense.

Corey: Yeah, it’s definitely something that is clearly solving problems for folks. I have to confess, when they first told me about Cloud WAN, I was skeptical because I was trying to map it to the problem of the week that I was tackling at the time, and I’m like, “This is useless. I can barely use this as a database. What’s going on?”—not for lack of trying—but it was a, okay, now that I—all I have to do whenever I think I’ve gotten a lock on something or written something off is talk to a customer who’s using it and I learn an awful lot.

Not always for the better in some cases, and there’s occasionally times where I cannot find a single customer for the life of me, and that does inform some educated guesses as far as just how many people are using this thing. But I’m glad to see that you folks are out there. Did you get to catch up with any other Cloud WAN customers, or is it possible you’re the only one these days?

Kris: So, on that. More customers are onboarding. They don’t like to call us the biggest user anymore. Maybe it feels wrong for them.

Corey: Oh, they love doing that as part of a sales process, too. Like, do you have any idea how many biggest S3 customers I’ve encountered, just this past year alone?

Kris: Exactly.

Corey: “We’re the biggest X?” Sure you are.

Kris: Well, the only difference, at least—how I feel is that at least we are talking directly to the service team to actually also give them ideas and, you know, feature requests. Because one of the biggest problems that we have—and I can mention this—is the fact that when you build out your network, you have these core network edges. These are those 500-Euro-a-month devices, they’re basically transit gateways, but they’re called CNEs. These things cost you 500 a month. When you add more than one—so you have two—they connect to each other. If you have four, then each one connects to each other, so you have a full mesh.
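The full-mesh point above can be put into numbers with a small sketch. It assumes each pair of core network edges is linked exactly once and uses the 500-a-month figure quoted in the conversation; the function names are made up for illustration and this is not an AWS pricing tool.

```python
# Back-of-the-envelope numbers for a full mesh of core network edges (CNEs).
CNE_MONTHLY_COST = 500  # per-CNE figure quoted in the conversation


def full_mesh_links(cne_count: int) -> int:
    """Return how many pairwise links a full mesh of cne_count CNEs contains."""
    if cne_count < 0:
        raise ValueError("cne_count must be non-negative")
    return cne_count * (cne_count - 1) // 2


def monthly_cne_cost(cne_count: int) -> int:
    """Return the combined monthly base cost of cne_count CNEs."""
    return cne_count * CNE_MONTHLY_COST


if __name__ == "__main__":
    for count in (1, 2, 4, 8):
        print(count, "CNEs:", full_mesh_links(count), "links,",
              monthly_cne_cost(count), "per month")
```

The base cost grows linearly with the number of CNEs, while the number of links grows quadratically.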

The problem then starts to go with routing because what we tried to do, we tried to be smart. We’re like, okay, let’s do a centralized egress, right, because now we have all these accounts and normally every account had NAT Gateways, internet gateways, and that was costing us money, right? So, we’re like, “Okay, now we have Cloud WAN. We can centralize this.” So, we have, like, 50 accounts, we’re going to have a central egress, and you’re going to go through that. Perfect. So, now we only have one NAT Gateway, one internet gateway, [unintelligible 00:36:18] network firewall in there, all the nice stuff. Perfect.

Corey: Better centralization, better story around it, better cost economics, better cost efficiency.

Kris: All of that is true. It’s great. Except. So say, for example, we have an egress in eu-west-1, but we have a workload in eu-central-1. We also have one in eu-west-1 and eu-east-1—sorry, us-east-1, us-west-1. And we have an egress in us-east-1. The problem is that in the other regions that do not have an egress, you have no way to direct the traffic to a preferred egress, right? So, we had the funny situation that in eu-central, it was going out the US egress. And US might go to EU, or it might go to the US, and it differed segment to segment. So, we had, like, devs going this way in this segment, production going that way, and we’re like, “What the hell?”
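A minimal sketch of the situation described above, using the regions named in the conversation. The "same geographic prefix" rule is a hypothetical preference invented here to show what a preferred egress could look like; it does not reflect how Cloud WAN selects routes, and none of these names are AWS APIs.

```python
# Which workload regions lack a local egress, and which egress they would ideally use.
WORKLOAD_REGIONS = ["eu-west-1", "eu-central-1", "us-east-1", "us-west-1"]
EGRESS_REGIONS = {"eu-west-1", "us-east-1"}


def regions_without_local_egress(workloads, egresses):
    """Return the workload regions that must send internet traffic to another region."""
    return [region for region in workloads if region not in egresses]


def desired_egress(region, egresses):
    """Hypothetical rule: use the local egress, else one in the same geography (eu, us)."""
    if region in egresses:
        return region
    geography = region.split("-")[0]
    candidates = sorted(e for e in egresses if e.split("-")[0] == geography)
    return candidates[0] if candidates else None


if __name__ == "__main__":
    for region in regions_without_local_egress(WORKLOAD_REGIONS, EGRESS_REGIONS):
        print(region, "would ideally egress via", desired_egress(region, EGRESS_REGIONS))
```

In this model, eu-central-1 and us-west-1 are the regions whose egress is left to chance, which matches the surprise of seeing European traffic leave through the US.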

Corey: And that can have consequences for a number of things.

Kris: Sure. I mean, we have customers that like to whitelist IP addresses. We tell them not to, but they’ll do it anyway, and then suddenly, things break.

Corey: If you can get big companies to stop doing that game, oh my god. That’s the biggest thing that’s going to make it impossible for some companies to get off of their allocated IPv4 stuff because it takes an act of God to get companies to update firewalls.

Kris: Sure. “We now have /64 of IPv6. Please enter that into your firewall, IP by IP. Thank you. Off you go.”

Corey: Yeah. By hand. It’s the worst internship ever.

Kris: I swear, we have customers that I believe truly do that. But okay. I didn’t say that [laugh].
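To put the "IP by IP" joke in numbers, here is a short sketch using Python's standard ipaddress module. The prefix is the IPv6 documentation range rather than a real allocation, and the rate of one firewall entry per second is an assumption chosen only for illustration.

```python
import ipaddress

# Documentation prefix, used here as a stand-in for a real /64 allocation.
network = ipaddress.ip_network("2001:db8::/64")

# A /64 leaves 64 host bits, so it holds 2**64 addresses.
address_count = network.num_addresses

# Assumption: someone types one firewall entry per second, without breaks.
SECONDS_PER_YEAR = 365 * 24 * 3600
years_by_hand = address_count / SECONDS_PER_YEAR

# A single prefix rule covers every address in the range.
sample = ipaddress.ip_address("2001:db8::1234")
covered_by_prefix = sample in network

if __name__ == "__main__":
    print("addresses in the /64:", address_count)
    print("years to enter them by hand:", f"{years_by_hand:.3e}")
    print("sample address covered by one prefix rule:", covered_by_prefix)
```

Entering the addresses one at a time would take hundreds of billions of years, whereas a single prefix entry covers all of them.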

Corey: Of course not. I really want to thank you for taking the time to talk to me about this. If people want to learn more about how you see these things, where’s the best place for them to find you?

Kris: At the moment, find me on LinkedIn—I believe I sent it across to you—and I am working on a blog, so I will send that out. Well, I will put it on my LinkedIn profile very soon. I have four posts so far, and I’m going to keep working on it because, yeah. I think that this is an interesting story. And, yeah, that’s the easiest way to find me. And otherwise, I’ll be at the AWS Summit in Amsterdam in a couple of months, and probably re:Invent again this year. So, I’ll be around.

Corey: I look forward to seeing you at at least one of those things. Thanks so much for taking the time to speak with me. I appreciate it.

Kris: Thank you very much for having me. I really enjoyed it.

Corey: Kris Gillespie, principal platform engineer at Silverflow. I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry comment that will fail to save properly because that podcast platform just implemented IPv6 this morning, badly.
