Building Computers for the Cloud with Steve Tuck

Episode Summary

Episode Show Notes & Transcript

Steve Tuck, Co-Founder & CEO of Oxide Computer Company, joins Corey on Screaming in the Cloud to discuss his work to make modern computers cloud-friendly. Steve describes what it was like going through early investment rounds, and the difficult but important decision he and his co-founder made to build their own switch. Corey and Steve discuss the demand for on-prem computers that are built for cloud capability, and Steve reveals how Oxide approaches their product builds to ensure the masses can adopt their technology wherever they are.

About Steve

Steve is the Co-founder & CEO of Oxide Computer Company. He previously was President & COO of Joyent, a cloud computing company acquired by Samsung. Before that, he spent 10 years at Dell in a number of different roles.

Links Referenced:

Oxide Computer Company: https://oxide.computer/
On The Metal Podcast: https://oxide.computer/podcasts/on-the-metal

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is brought to us in part by our friends at Red Hat. As your organization grows, so does the complexity of your IT resources. You need a flexible solution that lets you deploy, manage, and scale workloads throughout your entire ecosystem. The Red Hat Ansible Automation Platform simplifies the management of applications and services across your hybrid infrastructure with one platform. Look for it on the AWS Marketplace.

Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. You know, I often say it—but not usually on the show—that Screaming in the Cloud is a podcast about the business of cloud, which is intentionally overbroad so that I can talk about basically whatever the hell I want to with whoever the hell I’d like. Today’s guest is, in some ways of thinking, about as far in the opposite direction from Cloud as it’s possible to go and still be involved in the digital world. Steve Tuck is the CEO at Oxide Computer Company. You know, computers, the things we all pretend aren’t underpinning those clouds out there that we all use and pay by the hour, gigabyte, second-month-pound or whatever it works out to. Steve, thank you for agreeing to come back on the show after a couple years, and once again suffer my slings and arrows.

Steve: Much appreciated. Great to be here. It has been a while. I was looking back, I think three years. This was like, pre-pandemic, pre-interest rates, pre… Twitter going totally sideways.

Corey: And I have to ask to start with that, it feels, on some level, like toward the start of the pandemic, when everything was flying high and we’d had low interest rates for a decade, that there was a lot of… well, lunacy lurking around in the industry; my own business saw it, too. It turns out that not giving a shit about the AWS bill is in fact a zero interest rate phenomenon. And with all that money or concentrated capital sloshing around, people decided to do ridiculous things with it. I would have thought, on some level, that, “We’re going to start a computer company in the Bay Area making computers,” would have been one of those, but given that we are a year into the correction, and things seem to be heading up and to the right for you folks, that take was wrong. How’d I get it wrong?

Steve: Well, I mean, first of all, you got part of it right, which is there were just a litany of ridiculous companies and projects and money being thrown in all directions at that time.

Corey: An NFT of a computer. We’re going to have one of those. That’s what you’re selling, right? Then you had to actually hard pivot to making the real thing.

Steve: That’s it. So, we might as well cut right to it, you know. This is—we went through the crypto phase. But you know, our—when we started the company, it was yes, a computer company. It’s on the tin. It’s definitely kind of the foundation of what we’re building. But you know, we think about what a modern computer looks like through the lens of cloud.

I was at a cloud computing company for ten years prior to us founding Oxide, so was Bryan Cantrill, CTO, co-founder. And, you know, we are huge, huge fans of cloud computing, which was an interesting kind of dichotomy in some of the conversations when we were raising for Oxide—because of course, Sand Hill is terrified of hardware. And when we think about what modern computers need to look like, they need to be in support of the characteristics of cloud, and cloud computing being not that you’re renting someone else’s computers, but that you have fully programmable infrastructure that allows you to slice and dice, you know, compute and storage and networking however software needs. And so, what we set out to go build was a way to give the companies that are running on-premises infrastructure—which, by the way, is almost everyone and will continue to be so for a very long time—access to the benefits of cloud computing. And to do that, you need to build a different kind of computing infrastructure and architecture, and you need to plumb the whole thing with software.

Corey: There are a number of different ways to view cloud computing. And I think that a lot of the, shall we say, incumbent vendors over in the computer manufacturing world tend to sound kind of like dinosaurs, on some level, where they’re always talking in terms of, you’re a giant company and you already have a whole bunch of data centers out there. But one of the magical pieces of cloud is you can have a ridiculous idea at nine o’clock tonight and by morning, you’ll have a prototype, if you’re of that bent. And if it turns out it doesn’t work, you’re out, you know, 27 cents. And if it does work, you can keep going and not have to stop and rebuild on something enterprise-grade.

So, for the small-scale stuff and rapid iteration, cloud providers are terrific. Conversely, when you wind up in the giant fleets of millions of computers, in some cases, there begin to be economic factors that weigh in, and for some workloads—yes, I know it’s true—going to a data center is the economical choice. But my question is, is starting a new company in the direction of building these things, is it purely about economics or is there a capability story tied in there somewhere, too?

Steve: Yeah, it’s actually economics ends up being a distant third, fourth, in the list of needs and priorities from the companies that we’re working with. When we talk about—and just to be clear we’re—our demographic, that kind of the part of the market that we are focused on are large enterprises, like, folks that are spending, you know, half a billion, billion dollars a year in IT infrastructure, they, over the last five years, have moved a lot of the use cases that are great for public cloud out to the public cloud, and who still have this very, very large need, be it for latency reasons or cost reasons, security reasons, regulatory reasons, where they need on-premises infrastructure in their own data centers and colo facilities, et cetera. And it is for those workloads in that part of their infrastructure that they are forced to live with enterprise technologies that are 10, 20, 30 years old, you know, that haven’t evolved much since I left Dell in 2009. And, you know, when you think about, like, what are the capabilities that are so compelling about cloud computing, one of them is yes, what you mentioned, which is you have an idea at nine o’clock at night and swipe a credit card, and you’re off and running. And that is not the case for an idea that someone has who is going to use the on-premises infrastructure of their company. And this is where you get shadow IT and 16 digits to freedom and all the like.

Corey: Yeah, everyone with a corporate credit card winds up being a shadow IT source in many cases. If your processes as a company don’t make it easier to proceed rather than doing it the wrong way, people are going to be fighting against you every step of the way. Sometimes the only stick you’ve got is that of regulation, which in some industries, great, but in other cases, no, you get to play Whack-a-Mole. I’ve talked to too many companies that have specific scanners built into their mail system every month looking for things that look like AWS invoices.

Steve: [laugh]. Right, exactly. And so, you know, but if you flip it around, and you say, well, what if the experience for all of my infrastructure that I am running, or that I want to provide to my software development teams, be it rented through AWS, GCP, Azure, or owned for economic reasons or latency reasons, I had a similar set of characteristics where my development team could hit an API endpoint and provision instances in a matter of seconds when they had an idea and only pay for what they use, back to kind of corporate IT. And what if they were able to use the same kind of developer tools they’ve become accustomed to using, be it Terraform scripts and the kinds of access that they are accustomed to using? How do you make those developers just as productive across the business, instead of just through public cloud infrastructure?

At that point, then you are in a much stronger position where you can say, you know, for a portion of things that are, as you pointed out, you know, more unpredictable, and where I want to leverage a bunch of additional services that a particular cloud provider has, I can rent that. And where I’ve got more persistent workloads or where I want a different economic profile or I need to have something in a very low latency manner to another set of services, I can own it. And that’s where I think the real chasm is because today, you just don’t—we take for granted the basic plumbing of cloud computing, you know? Elastic Compute, Elastic Storage, you know, networking and security services. And us in the cloud industry end up wanting to talk a lot more about exotic services and, sort of, higher-up stack capabilities. None of that basic plumbing is accessible on-prem.

Corey: I also am curious as to where exactly Oxide lives in the stack because I used to build computers for myself in 2000, and it seems like having gone down that path a bit recently, yeah, that process hasn’t really improved all that much. The same off-the-shelf components still exist and that’s great. We always used to disparagingly call spinning hard drives spinning rust in racks. You named the company Oxide; you’re talking an awful lot about the Rust programming language in public a fair bit of the time, and I’m starting to wonder if maybe words don’t mean what I thought they meant anymore. Where do you folks start and stop, exactly?

Steve: Yeah, that’s a good question. And when we started, we sort of thought the scope of what we were going to do and then what we were going to leverage was smaller than it has turned out to be. And by that I mean, man, over the last three years, we have hit a bunch of forks in the road where we had questions about do we take something off the shelf or do we build it ourselves. And we did not try to build everything ourselves. So, to give you a sense of kind of where the dotted line is, around the Oxide product, what we’re delivering to customers is a rack-level computer. So, the minimum size comes in rack form. And I think your listeners are probably pretty familiar with this. But, you know, a rack is—

Corey: You would be surprised. It’s basically, what are they about seven feet tall?

Steve: Yeah, about eight feet tall.

Corey: Yeah, yeah. Seven, eight feet, weighs a couple thousand pounds, you know, make an insulting joke about—

Steve: Two feet wide.

Corey: —NBA players here. Yeah, all kinds of these things.

Steve: Yeah. And big hunk of metal. And in the cases of on-premises infrastructure, it’s kind of a big hunk of metal hull, and then a bunch of 1U and 2U boxes crammed into it. What the hyperscalers have done is something very different. They started looking at, you know, at the rack level, how can you get much more dense, power-efficient designs, doing things like using a DC bus bar down the back, instead of having 64 power supplies with cables hanging all over the place in a rack, which I’m sure is what you’re more familiar with.

Corey: Tremendous amount of weight as well because you have the metal chassis for all of those 1U things, which in some cases, you wind up with, what, 46U in a rack, assuming you can even handle the cooling needs of all that.

Steve: That’s right.

Corey: You have so much duplication, and so much of the weight is just metal separating one thing from the next thing down below it. And there are opportunities for massive improvement, but you need to be at a certain point of scale to get there.

Steve: You do. You do. And you also have to be taking on the entire problem. You can’t pick at parts of these things. And that’s really what we found. So, we started at this sort of—the rack level as sort of the design principle for the product itself and found that that gave us the ability to get to the right geometry, to get as much CPU horsepower and storage and throughput and networking into that kind of chassis for the least amount of wattage required, kind of the most power-efficient design possible.

So, it ships at the rack level and it ships complete with both our server sled systems and a pair of Oxide switches. This is—when I talk about, like, design decisions, you know, do we build our own switch, it was a big, big, big question early on. We were fortunate even though we were leaning towards thinking we needed to go do that, we had this prospective early investor who was early at AWS and he had asked a very tough question that none of our other investors had asked to this point, which is, “What are you going to do about the switch?”

And we knew that the right answer to an investor is like, “No. We’re already taking on too much.” We’re redesigning a server from scratch in, kind of, the mold of what some of the hyperscalers have learned, doing our own Root of Trust, we’re doing our own operating system, hypervisor, control plane, et cetera. Taking on the switch could be seen as too much, but we told them, you know, we think that to be able to pull through all of the value of the security benefits and the performance and observability benefits, we can’t have then this [laugh], like, obscure third-party switch rammed into this rack.

Corey: It’s one of those things that people don’t think about, but it’s the magic of cloud with AWS’s network, for example, it’s magic. You can get line rate—or damn near it—between any two points, sustained.

Steve: That’s right.

Corey: Try that in the data center, you run into massive congestion with top-of-rack switches, where, okay, we’re going to parallelize this stuff out over, you know, two dozen racks and we’re all going to have them seamlessly transfer information between each other at line rate. It’s like, “[laugh] no, you’re not because those top-of-rack switches will melt and become side-of-rack switches, and then bottom-puddle-of-rack switches. It doesn’t work that way.”

Steve: That’s right.

Corey: And you have to put a lot of thought and planning into it. That is something that I’ve not heard a traditional networking vendor addressing because everyone loves to hand-wave over it.

Steve: Well so, and this particular prospective investor, we told him, “We think we have to go build our own switch.” And he said, “Great.” And we said, “You know, we think we’re going to lose you as an investor as a result, but this is what we’re doing.” And he said, “If you’re building your own switch, I want to invest.” And his comment really stuck with us, which is AWS did not stand on their own two feet until they threw out their proprietary switch vendor and built their own.

And that really unlocked, like you’ve just mentioned, like, their ability, both in hardware and software to tune and optimize to deliver that kind of line rate capability. And that is one of the big findings for us as we got into it. Yes, it was really, really hard, but based on a couple of design decisions, P4 being the programming language that we are using as the surround for our silicon, tons of opportunities opened up for us to be able to do similar kinds of optimization and observability. And that has been a big, big win.

But to your question of, like, where does it stop? So, we are delivering this complete with a baked-in operating system, hypervisor, control plane. And so, the endpoint of the system, where the customer meets it, is either hitting an API or a CLI or a console that delivers and kind of gives you the ability to spin up projects. And, you know, if one is familiar with EC2 and EBS and VPC, that VM level of abstraction is where we stop.

Corey: That, I think, is a fair way of thinking about it. And a lot of cloud folks are going to pooh-pooh it as far as saying, “Oh well, just virtual machines. That’s old cloud. That just treats the cloud like a data center.” And in many cases, yes, it does because there are ways to build modern architectures that are event-driven on top of things like Lambda, and API Gateway, and the rest, but you take a look at what my customers are doing and what drives the spend, it is invariably virtual machines that are largely persistent.

Sometimes they scale up, sometimes they scale down, but there’s always a baseline level of load that people like to hand-wave away the fact that what they’re fundamentally doing in a lot of these cases, is paying the cloud provider to handle the care and feeding of those systems, which can be expensive, yes, but also delivers significant innovation beyond what almost any company is going to be able to deliver in-house. There is no way around it. AWS is better than you are—whoever you happen to be—at replacing failed hard drives. That is a simple fact. They have teams of people who are the best in the world at replacing failed hard drives. You generally do not. They are going to be better at that than you. But that’s not the only axis. There’s not one calculus that leads to, is cloud a scam or is cloud a great value proposition for us? The answer is always a deeply nuanced, “It depends.”

Steve: Yeah, I mean, I think cloud is a great value proposition for most and a growing amount of software that’s being developed and deployed and operated. And I think, you know, one of the myths that is out there is, hey, turn over your IT to AWS, or you know, a cloud provider—because we have such higher caliber personnel that are really good at swapping hard drives and dealing with networks and operationally keeping this thing running in a highly available manner that delivers good performance. That is certainly true, but a lot of the operational value in an AWS has been delivered via software, the automation, the observability, and not actual people putting hands on things. And it’s an important point because that’s been a big part of what we’re building into the product. You know, just because you’re running infrastructure in your own data center, it does not mean that you should have to spend, you know, 1,000 hours a month across a big team to maintain and operate it. And so, part of that, kind of, cloud, hyperscaler innovation that we’re baking into this product is so that it is easier to operate with much, much, much lower overhead in a highly available, resilient manner.

Corey: So, I’ve worked in a number of data center facilities, but the companies I was working with, were always at a scale where these were co-locations, where they would, in some cases, rent out a rack or two, in other cases, they’d rent out a cage and fill it with their own racks. They didn’t own the facilities themselves. Those were always handled by other companies. So, my question for you is, if I want to get a pile of Oxide racks into my environment in a data center, what has to change? What are the expectations?

I mean, yes, there’s obviously going to be power and requirements that the data center colocation is very conversant with, but Open Compute, for example, had very specific requirements—to my understanding—around things like the airflow construction of the environment that they’re placed within. How prescriptive is what you’ve built, in terms of doing a building retrofit to start using you folks?

Steve: Yeah, definitely not. And this was one of the tensions that we had to balance as we were designing the product. For all of the benefits of hyperscaler computing, some of the design center for you know, the kinds of racks that run in Google and Amazon and elsewhere are hyperscaler-focused, which is unlimited power, in some cases, data centers designed around the equipment itself. And where we were headed, which was basically making hyperscaler infrastructure available to, kind of, the masses, the rest of the market, these folks don’t have unlimited power and they aren’t going to be able to go redesign data centers. And so no, the experience should be—with exceptions for folks maybe that have very, very limited access to power—that you roll this rack into your existing data center. It’s on standard floor tile, that you give it power, and give it networking and go.

And we’ve spent a lot of time thinking about how we can operate in the wide-ranging environmental characteristics that are commonplace in data centers that folks own themselves, colo facilities, and the like. So, that’s really on us so that the customer is not having to go to much work at all to kind of prepare and be ready for it.

Corey: One of the challenges I have is how to think about what you’ve done because you are rack-sized. But what that means is that my own experimentation at home recently with on-prem stuff for smart home stuff involves a bunch of Raspberries Pi and a [unintelligible 00:19:42], but I tend to more or less categorize you the same way that I do AWS Outposts, as well as mythical creatures, like unicorns or giraffes, where I don’t believe that all these things actually exist because I haven’t seen them. And in fact, to get them in my house, all four of those things would theoretically require a loading dock if they existed, and that’s a hard thing to fake on a demo signup form, as it turns out. How vaporware is what you’ve built? Is this all on paper and you’re telling amazing stories or do they exist in the wild?

Steve: So, last time we were on, it was all vaporware. It was a couple of napkin drawings and a seed round of funding.

Corey: I do recall you not using that description at the time, for what it’s worth. Good job.

Steve: [laugh]. Yeah, well, at least we were transparent where we were going through the raise. We had some napkin drawings and we had some good ideas—we thought—and—

Corey: You formalize those and that’s called Microsoft PowerPoint.

Steve: That’s it. A hundred percent.

Corey: The next generative AI play is take the scrunched-up, stained napkin drawing, take a picture of it, and convert it to a slide.

Steve: Google Docs, you know, one of those. But no, it’s got a lot of scars from the build and it is real. In fact, next week, we are going to be shipping our first commercial systems. So, we have got a line of racks out in our manufacturing facility in lovely Rochester, Minnesota. Fun fact: Rochester, Minnesota, is where the IBM AS/400s were built.

Corey: I used to work in that market, of all things.

Steve: Really?

Corey: Selling tape drives in the AS/400. I mean, I still maintain there’s no real mainframe migration to the cloud play because there’s no AWS/400. A joke that tends to sail over an awful lot of people’s heads because, you know, most people aren’t as miserable in their career choices as I am.

Steve: Okay, that reminds me. So, when we were originally pitching Oxide and we were fundraising, we [laugh]—in a particular investor meeting, they asked, you know, “What would be a good comp? Like how should we think about what you are doing?” And fortunately, we had about 20 investor meetings to go through, so burning one on this was probably okay, but we may have used the AS/400 as a comp, talking about how [laugh] mainframe systems did such a good job of building hardware and software together. And as you can imagine, there were some blank stares in that room.

But you know, there are some good analogs historically in the computing industry, when you know, the industry, the major players in the industry, were thinking about how to deliver holistic systems to support end customers. And, you know, we see this in what Apple has done with the iPhone, and you’re seeing this as a lot of stuff in the automotive industry is being pulled in-house. I was listening to a good podcast. Jim Farley from Ford was talking about how the automotive industry historically outsourced all of the software that controls cars, right? So, like, Bosch would write the software for the controls for your seats.

And they had all these suppliers that were writing the software, and what it meant was that innovation was not possible because you’d have to go out to suppliers to get software changes for any little change you wanted to make. And in the computing industry, in the 80s, you saw this blow apart where, like, firmware got outsourced. In the IBM and the clones, kind of, race, everyone started outsourcing firmware and outsourcing software. Microsoft started taking over operating systems. And then VMware emerged and was doing a virtualization layer.

And this, kind of, fragmented ecosystem is the landscape today that every single on-premises infrastructure operator has to struggle with. It’s a kit car. And so, pulling it back together, designing things in a vertically integrated manner is what the hyperscalers have done. And so, you mentioned Outposts. And, like, it’s a good example of—I mean, the most public cloud of public cloud companies created a way for folks to get their system on-prem.

I mean, if you need anything to underscore the draw and the demand for cloud computing-like infrastructure on-prem, just the fact that that emerged at all tells you that there is this big need. Because you’ve got, you know, I don’t know, a trillion dollars worth of IT infrastructure out there and you have maybe 10% of it in the public cloud. And that’s up from 5% when Jassy was on stage in ’21, talking about 95% of stuff living outside of AWS, but there’s going to be a giant market of customers that need to own and operate infrastructure. And again, things have not improved much in the last 10 or 20 years for them.

Corey: They have taken a tone onstage about how, “Oh, those workloads that aren’t in the cloud, yet, yeah, those people are legacy idiots.” And I don’t buy that for a second because believe it or not—I know that this cuts against what people commonly believe in public—but company execs are generally not morons, and they make decisions with context and constraints that we don’t see. Things are the way they are for a reason. And I promise that 90% of corporate IT workloads that still live on-prem are not being managed or run by people who’ve never heard of the cloud. There was a decision made when some other things were migrating of, do we move this thing to the cloud or don’t we? And the answer at the time was no, we’re going to keep this thing on-prem where it is now for a variety of reasons of varying validity. But I don’t view that as a bug. I also, frankly, don’t want to live in a world where all the computers are basically run by three different companies.

Steve: You’re spot on, which is, like, it does a total disservice to these smart and forward-thinking teams in every one of the Fortune 1000-plus companies who are taking the constraints that they have—and some of those constraints are not monetary or entirely workload-based. If you want to flip it around, we were talking to a large cloud SaaS company and their reason for wanting to extend it beyond the public cloud is because they want to improve latency for their e-commerce platform. And navigating their way through the complex layers of the networking stack at GCP to get to where the customer assets are that are in colo facilities, adds lag time on the platform that can cost them hundreds of millions of dollars. And so, we need to think beyond this notion of, like, “Oh, well, the dark ages are for software that can’t run in the cloud, and that’s on-prem. And it’s just a matter of time until everything moves to the cloud.”

In the forward-thinking models of public cloud, it should be both. I mean, you should have a consistent experience, from a certain level of the stack down, everywhere. And then it’s like, do I want to rent or do I want to own for this particular use case? In my vast set of infrastructure needs, do I want this to run in a data center that Amazon runs or do I want this to run in a facility that is close to this other provider of mine? And I think that’s best for all. And then it’s not this kind of false dichotomy of quality infrastructure or ownership.

Corey: I find that there are also workloads where people will come to me and say, “Well, we don’t think this is going to be economical in the cloud”—because again, I focus on AWS bills. That is the lens I view things through, and—“The AWS sales rep says it will be. What do you think?” And I look at what they’re doing and especially if it involves high volumes of data transfer, I laugh a good hearty laugh and say, “Yeah, keep that thing in the data center where it is right now. You will thank me for it later.”

It’s, “Well, can we run this in an economical way in AWS?” As long as you’re okay with economical meaning six times what you’re paying a year right now for the same thing, yeah, you can. I wouldn’t recommend it. And the numbers sort of speak for themselves. But it’s not just an economic play.

There’s also the story of, does this increase their capability? Does it let them move faster toward their business goals? And in a lot of cases, the answer is no, it doesn’t. It’s one of those business process things that has to exist for a variety of reasons. You don’t get to reimagine it for funsies and even if you did, it doesn’t advance the company in what they’re trying to do any, so focus on something that differentiates as opposed to this thing that you’re stuck on.

Steve: That’s right. And what we see today is, it is easy to be in that mindset of running things on-premises is kind of backwards-facing because the experience of it is today still very, very difficult. I mean, talking to folks and they’re sharing with us that it takes a hundred days from the time all the different boxes land in their warehouse to actually having usable infrastructure that developers can use. And our goal and what we intend to go hit with Oxide is you can roll in this complete rack-level system, plug it in, and within an hour, you have developers that are accessing cloud-like services out of the infrastructure. And that—God, countless stories of firmware bugs that would send all the fans in the data center nonlinear and soak up 100 kW of power.

Corey: Oh, God. And the problems that you had with the out-of-band management systems. For a long time, I thought DRAC stood for, “Dell, RMA Another Computer.” It was awful having to deal with those things. There was so much room for innovation in that space, which no one really grabbed onto.

Steve: There was a really, really interesting talk at DEFCON that we just stumbled upon yesterday. The NVIDIA folks are giving a talk on BMC exploits… and like, a very, very serious BMC exploit. And again, it’s what most people don’t know is, like, first of all, the BMC, the Baseboard Management Controller, is like the brainstem of the computer. It has access to—it’s a backdoor into all of your infrastructure. It’s a computer inside a computer and it’s got software and hardware that your server OEM didn’t build and doesn’t understand very well.

And firmware is even worse because you know, firmware written by you know, an American Megatrends or other is a big blob of software that gets loaded into these systems that is very hard to audit and very hard to ascertain what’s happening. And it’s no surprise when, you know, back when we were running all the data centers at a cloud computing company, that you’d run into these issues, and you’d go to the server OEM and they’d kind of throw their hands up. Well, first they’d gaslight you and say, “We’ve never seen this problem before,” but when you thought you’ve root-caused something down to firmware, it was anyone’s guess. And this is kind of the current condition today. And back to, like, the journey to get here, we kind of realized that you had to blow away that old extant firmware layer, and we rewrote our own firmware in Rust. Yes [laugh], I’ve done a lot in Rust.

Corey: No, it was in Rust, but, on some level, that’s what Nitro is, as best I can tell, on the AWS side. But it turns out that you don’t tend to have the same resources as a one-and-a-quarter—at the moment—trillion-dollar company. That keeps [valuing 00:30:53]. At one point, they lost a comma and that was sad and broke all my logic for that and I haven’t fixed it since. Unfortunate stuff.

Steve: Totally. I think that was another, kind of, question early on from certainly a lot of investors was like, “Hey, how are you going to pull this off with a smaller team and there’s a lot of surface area here?” Certainly a reasonable question. Definitely was hard. The one advantage—among others—is, when you are designing something kind of in a vertical holistic manner, those design integration points are narrowed down to just your equipment.

And when someone’s writing firmware, when AMI is writing firmware, they’re trying to do it to cover hundreds and hundreds of components across dozens and dozens of vendors. And we have the advantage of having this, like, purpose-built system, kind of, end-to-end from the lowest level from first boot instruction, all the way up through the control plane and from rack to switch to server. That definitely helped narrow the scope.

Corey: This episode has been fake sponsored by our friends at AWS with the following message: Graviton Graviton, Graviton, Graviton, Graviton, Graviton, Graviton, Graviton, Graviton. Thank you for your l-, lack of support for this show. Now, AWS has been talking about Graviton an awful lot, which is their custom in-house ARM processor. Apple moved over to ARM and instead of talking about benchmarks they won’t publish and marketing campaigns with words that don’t mean anything, they’ve let the results speak for themselves. In time, I found that almost all of my workloads have moved over to ARM architecture for a variety of reasons, and my laptop now gets 15 hours of battery life when all is said and done. You’re building these things on top of x86. What is the deal there? I do not accept that you hadn’t heard of ARM until just now because, as mentioned, Graviton, Graviton, Graviton.

Steve: That’s right. Well, so why x86, to start? And I say to start because we have just launched our first-generation products. And our first-generation or second-generation products that we are now underway working on are going to be x86 as well. We’ve built this system on AMD Milan silicon; we are going to be launching a Genoa sled.

But when you’re thinking about what silicon to use, obviously, there’s a bunch of parts that go into the decision. You’re looking at the kind of applicability to workload, performance, power management, for sure, and if you carve up what you are trying to achieve, x86 is still a terrific fit for the broadest set of workloads that our customers are trying to solve for. And choosing which x86 architecture was certainly an easier choice, come 2019. At this point, AMD had made a bunch of improvements in performance and energy efficiency in the chip itself. We’ve looked at other architectures and I think as we are incorporating those in the future roadmap, it’s just going to be a question of what are you trying to solve for.

You mentioned power management, and that has kind of commonly been a, you know, low-power systems is where folks have gone beyond x86. As we’re looking forward to hardware acceleration products and future products, we’ll certainly look beyond x86, but x86 has a long, long road to go. It still is kind of the foundation for what, again, is a general-purpose cloud infrastructure for being able to slice and dice for a variety of workloads.

Corey: True. I have to look around my environment and realize that Intel is not going anywhere. And that’s not just an insult to their lack of progress on committed roadmaps that they consistently miss. But—

Steve: [sigh].

Corey: Enough on that particular topic because we want to keep this, you know, polite.

Steve: Intel has definitely had some struggles for sure. They’re very public ones, I think. We were really excited and continue to be very excited about their Tofino silicon line. And this came by way of the Barefoot Networks acquisition. I don’t know how much you had paid attention to Tofino, but what was really, really compelling about Tofino is the focus on both hardware and software and programmability.

So, great chip. And P4 is the programming language that surrounds that. And we have gotten very, very deep on P4, and that is some of the best tech to come out of Intel lately. But from a core silicon perspective for the rack, we went with AMD. And again, that was a pretty straightforward decision at the time. And we’re planning on having this anchored around AMD silicon for a while now.

Corey: One last question I have before we wind up calling it an episode, it seems—at least as of this recording, it’s still embargoed, but we’re not releasing this until that winds up changing—you folks have just raised another round, which means that your napkin doodles have apparently drawn more folks in, and now that you’re shipping, you’re also not just bringing in customers, but also additional investor money. Tell me about that.

Steve: Yes, we just completed our Series A. So, when we last spoke three years ago, we had just raised our seed and had raised $20 million at the time, and we had expected that it was going to take about that to be able to build the team and build the product and be able to get to market, and [unintelligible 00:36:14] tons of technical risk along the way. I mean, there was technical risk up and down the stack around this [De Novo 00:36:21] server design, this switch design. And software is still the kind of disproportionate majority of what this product is, from hypervisor up through kind of control plane, the cloud services, et cetera. So—

Corey: We just view it as software with a really, really confusing hardware dongle.

Steve: [laugh]. Yeah. Yes.

Corey: Super heavy. We’re talking enterprise and government-grade here.

Steve: That’s right. There’s a lot of software to write. And so, we had a bunch of milestones that as we got through them, one of the big ones was getting Milan silicon booting on our firmware. It was funny it was—this was the thing that clearly, like, the industry was most suspicious of, us doing our own firmware, and you could see it when we demonstrated booting this, like, a year-and-a-half ago, and AMD all of a sudden just lit up, from kind of arm’s length to, like, “How can we help? This is amazing.” You know? And they could start to see the benefits of when you can tie low-level silicon intelligence up through a hypervisor there’s just—

Corey: No, I love the existing firmware I have. Looks like it was written in 1984 and winds up having terrible user ergonomics that hasn’t been updated at all, and every time something comes through, it’s a 50/50 shot as to whether it fries the box or not. Yeah. No, I want that.

Steve: That’s right. And you look at these hyperscale data centers, and it’s like, no. I mean, you’ve got intelligence from that first boot instruction through a Root of Trust, up through the software of the hyperscaler, and up to the user level. And so, as we were going through and kind of knocking down each one of these layers of the stack, doing our own firmware, doing our own hardware Root of Trust, getting that all the way plumbed up into the hypervisor and the control plane, number one on the customer side, folks moved from, “This is really interesting. We need to figure out how we can bring cloud capabilities to our data centers. Talk to us when you have something,” to, “Okay. We actually”—back to the earlier question on vaporware, you know, it was great having customers out here to Emeryville where they can put their hands on the rack and they can, you know, put your hands on software, but being able to, like, look at real running software and that end cloud experience.

And that led to getting our first couple of commercial contracts. So, we’ve got some great first customers, including a large department of the government, of the federal government, and a leading firm on Wall Street that we’re going to be shipping systems to in a matter of weeks. And as you can imagine, along with that, that drew a bunch of renewed interest from the investor community. Certainly, a different climate today than it was back in 2019, but what was great to see is, you still have great investors that understand the importance of making bets in the hard tech space and in companies that are looking to reinvent certain industries. And so, we added—our existing investors all participated. We added a bunch of terrific new investors, both strategic and institutional.

And you know, this capital is going to be super important now that we are headed into market and we are beginning to scale up the business and make sure that we have a long road to go. And of course, maybe as importantly, this was a real confidence boost for our customers. They’re excited to see that Oxide is going to be around for a long time and that they can invest in this technology as an important part of their infrastructure strategy.

Corey: I really want to thank you for taking the time to speak with me about, well, how far you’ve come in a few years. If people want to learn more and have the requisite loading dock, where should they go to find you?

Steve: So, we try to put everything up on the site. So, oxidecomputer.com or oxide.computer. We also, if you remember, we did [On the Metal 00:40:07]. So, we had a Tales from the Hardware-Software Interface podcast that we did when we started. We have shifted that to Oxide and Friends, which the shift there is we’re spending a little bit more time talking about the guts of what we built and why. So, if folks are interested in, like, why the heck did you build a switch and what does it look like to build a switch, we actually go to depth on that. And you know, what does bring-up on a new server motherboard look like? And it’s got some episodes out there that might be worth checking out.

Corey: We will definitely include a link to that in the [show notes 00:40:36]. Thank you so much for your time. I really appreciate it.

Steve: Yeah, Corey. Thanks for having me on.

Corey: Steve Tuck, CEO at Oxide Computer Company. I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this episode, please leave a five-star review on your podcast platform of choice, along with an angry ranting comment because you are in fact a zoology major, and you’re telling me that some animals do in fact exist. But I’m pretty sure of the two of them, it’s the unicorn.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
