Episode 73 – Building a Cloud Supercomputer on AWS with Mike Warren

Episode Summary

Supercomputers used to be gigantic monstrosities that would take up enormous rooms. Now, you can run them in the cloud. Just ask Mike Warren, CTO and co-founder of Descartes Labs, a company that provides Earth imagery to help folks understand planetary changes—like deforestation, water cycles, agriculture, and more. Join Corey and Mike as they discuss what it’s like to build supercomputers on top of AWS and how “easy” it is, the power of Amazon’s Spot blocks, building Beowulf clusters in the ‘90s, what Descartes Labs’ platform-agnostic infrastructure looks like (spoiler alert: nothing is on-prem), how AWS accelerates the development process, petaflop machines, the evolution of high-performance computing over the last few decades, and more.

Episode Show Notes & Transcript

About Mike Warren

Mike Warren is cofounder and CTO of Descartes Labs. Mike’s past work spans a wide range of disciplines, with the recurring theme of developing and applying advanced software and computing technology to understand the physical and virtual world. He was a scientist at Los Alamos National Laboratory for 25 years, and also worked as a Senior Software Engineer at Sandpiper Networks/Digital Island. His work has been recognized on multiple occasions, including the Gordon Bell prize for outstanding achievement in high-performance computing. He has degrees in Physics and Engineering & Applied Science from Caltech, and he received a PhD in Physics from the University of California, Santa Barbara.

Links Referenced

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, cloud economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.


Corey Quinn: This week's episode of Screaming in the Cloud is sponsored by LightStep. What is LightStep? Picture monitoring like it was back in 2005, and then run away screaming. We're not using Nagios at scale anymore, because monitoring looks like something very different in a modern architecture where you have ephemeral containers spinning up and down, for example. How do you know how up your application is in an environment like that? At scale, it's never a question of whether your site is up, but rather a question of how down is it. LightStep lets you answer that question effectively. Discover what other companies, including Lyft, Twilio, Box, and GitHub, have already learned. Visit lightstep.com to learn more.


Corey Quinn: My thanks to them for sponsoring this episode of Screaming in the Cloud. Welcome to Screaming in the Cloud, I'm Corey Quinn. I'm joined this week by Mike Warren, co-founder and CTO of Descartes Labs. Welcome to the show, Mike.


Mike Warren: Thanks Corey, happy to be here.


Corey Quinn: So, you have a fascinating story, in that you had a 25-year career at the Los Alamos National Lab and then joined... not joined, started a company, Descartes Labs, about four years ago now.


Mike Warren: That's right. I kind of consider it, you know, a 30-year-long education that provided me the right skills to go out and start a company that used computing and science to help our customers understand the world.


Corey Quinn: And it certainly seems that you've done some interesting things with it lately. At the time of this recording, about a month or so ago, you wound up making, I guess, a bit of a splash in the cloud computing world by building an effective supercomputer on top of AWS, and it qualified for, I think, what was it, place 136 on the TOP500 list of the most powerful supercomputers in the world. And that's impressive in its own right, but what's fascinating about this is you didn't set out to do this with an enormous project behind you; you didn't decide to do this with a grant from someone. You did it with a corporate credit card.


Mike Warren: That's right. I guess it wasn't such a surprise to us. We knew the capability was coming, and it just happened that the time was right and I had some time to compile HPL, the LINPACK benchmark, and run it. But that's the way the computing industry is going, and anything you can do in a data center or in a supercomputing center is eventually going to be possible in the cloud.


Corey Quinn: So, one of the parts of the story that resonated the most with me was the fact that you did this using Spot instances, but you didn't tell anyone at AWS that you were doing this until after it was already done. And I confirmed that myself by reaching out to a friend who's reasonably well placed in the compute org. I said, "Wow, congratulations. You just hosted something that would wind up counting as one of the top 500 supercomputers in the world," and his response was, "Wait, we what now?" Which was fascinating to see, where it used to be that this was the sort of thing that would require such a tremendous amount of coordinated effort between different people, different stakeholders across a wide organization.


And it turns out that "because it's neat" wouldn't have been a sufficient justification to do approximately any of it. And now apparently you can do this for 5,000 bucks, which is less than a typical server tends to cost, and about what a typical engineer embezzles in office supplies in a given year. And the economics of that are staggering. How long did you plan on doing this before deciding to just spark it up?


Mike Warren: Not long. And it's really why the cloud is so attractive for businesses like ours. You don't have to wait for resources, you don't have to coordinate, you don't have to spend time on the phone, you don't have to worry about support contracts and all of this. You know, even a TOP500 run on a supercomputer requires an enormous amount of coordination. You've got to kick all the other users off. You know, it's a big deal, and it may only be done once on one of these nation-scale supercomputers. But on AWS, any given day when there's a thousand nodes available, you can pull those up and run at a petaflop.


Corey Quinn: Which is astonishing. I mean, you did some back-of-the-envelope calculations on what it would cost to do this in hardware. And the interesting part to me wasn't the dollar figure, which, what was that again?


Mike Warren: We figured, you know, $20 to $30 million in hardware to get to this near-petaflop level.


Corey Quinn: Yeah. And the amazing part for me was that, looking at this, you also mentioned in that article, and I'll throw a link to that in the show notes, the fact that it would also take, by your estimation, six to 12 months just to procure the hardware and get it set up and get everything ready just to do this.


Mike Warren: Yeah, I think people don't realize the sort of unmentioned overheads in all of these HPC applications. You know, you've got to procure the hardware and make sure you have the budget authority and get the right signatures, and that's after assuming you have a data center or a building and enough cooling and power and all those sorts of infrastructure.


Corey Quinn: So, one thing that rang a little bit, I guess we'll call it false, in the narrative, and not to poke holes in your legend, has been that "we didn't tell AWS that we were going to be doing any of this, we just decided to go ahead and run it." But anyone who's spent more than 20 minutes swearing at the AWS console knows that, okay, you start up an account, and the first thing that you need to wind up doing is request a whole bunch of service limit increases: "You're already running one of those instances in that region. Why do you need a second one? Justify it." And they're pretty good about granting those service limit increases. But I'm curious to know, did you already have sufficient limits to run this from other workloads that had been done in that account, or did you wind up opening the most bizarre limit increase service ticket that they've probably seen in a month?


Mike Warren: As I recall, we had about half the resources in US East, where we ran this. So, we had kind of this scale of resources spread across different zones, but there are also different quotas between Spot and not Spot. So, there was a specific request to up the quota for Spot in US East to this level. And there's another interesting API limitation, which is that a single call to allocate nodes for Spot can only allocate a thousand at a time. So, it's a little extra workflow: you've got to break it into two parts, and there's throttling on the API calls, so you can't immediately allocate, say, 1,200 nodes. You've got to allocate a thousand, wait a minute, and then allocate the remainder.
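[Editor's note: for readers who want to see the shape of that two-part workflow, here is a minimal sketch using boto3. The AMI ID, instance type, placement group name, and pause timing are illustrative assumptions, not the actual script Descartes Labs ran.]

```python
import time
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def request_spot_in_batches(total, batch_limit=1000, pause_seconds=60):
    """Request Spot capacity in chunks, since a single request is capped
    at 1,000 instances and the API itself is throttled."""
    request_ids = []
    remaining = total
    while remaining > 0:
        count = min(remaining, batch_limit)
        resp = ec2.request_spot_instances(
            InstanceCount=count,
            LaunchSpecification={
                "ImageId": "ami-0123456789abcdef0",        # hypothetical HPC image
                "InstanceType": "c5.18xlarge",
                "Placement": {"GroupName": "hpl-cluster"},  # cluster placement group
            },
        )
        request_ids += [r["SpotInstanceRequestId"]
                        for r in resp["SpotInstanceRequests"]]
        remaining -= count
        if remaining > 0:
            time.sleep(pause_seconds)  # "allocate a thousand, wait a minute"
    return request_ids

# e.g. request_spot_in_batches(1200) -> one request for 1,000, then one for 200
```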


Corey Quinn: Take us through this from the beginning. Effectively, what does the architecture of this look like? I'm not entirely sure what the supercomputer in question was chomping on, so in my mental model, I'm going to figure that you were just mining Bitcoin, which it turns out is super financially viable in the cloud, you just need to do it in someone else's account. But on a more serious note, you mentioned LINPACK. What does that do?


Mike Warren: Well, an important distinction to make is how much communication needs to happen among the processors. When you're mining Bitcoin, the processor doesn't need to know about anything else going on in the world; it's completely independent. So you can start up a thousand different nodes mining Bitcoin, and all of the clouds do that very well. The TOP500 benchmark is based on the inversion of a very large matrix, and it uses a piece of software called LINPACK. And in the solution of that problem, all of these processors have to talk to each other and exchange data, and they have to do that with very low latency. So, you know, if you start up a thousand nodes and any one of those fails, in terms of computation or network communication during that process, the whole thing falls over. So these tightly coupled types of HPC applications, which are typically written in something called MPI, the message passing interface, are a lot more challenging to do in the cloud than the task-parallel sorts of applications like a typical web server.
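[Editor's note: as a toy illustration of that tight coupling, here is a minimal MPI sketch in Python using mpi4py. This is not HPL itself, just the collective-communication pattern Mike describes: every rank must participate in the operation, which is why a single dead node stalls the whole job.]

```python
# Toy example of the tightly coupled, collective communication pattern
# that HPL relies on (not HPL itself, just the shape of it).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank holds a piece of the problem...
local = np.random.rand(1_000_000)

# ...and a collective reduction requires every rank to respond.
# If even one node dies here, the call never completes for anyone.
total = comm.allreduce(local.sum(), op=MPI.SUM)

if rank == 0:
    print(f"global sum across all ranks: {total}")

# Run with, e.g.:  mpirun -n 4 python toy_allreduce.py
```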


Corey Quinn: One of the things that fascinates me is that whenever you're using any sufficiently large number of computers, some of them are intrinsically going to break, fail out of the cluster, et cetera. That's the nature of things. That sort of goes double when you're using something like Google's preemptible instances or AWS Spot Fleets, where it turns out that, subject to the available supply, things are generally not nearly as available as you would expect if, for example, someone else in that region decides to spin up a supercomputer to take a spot on the TOP500 list. So how does each one of those things checkpoint what it's working on in some sort of central location, so that it can die and be replaced by something else without losing all the work that node has done? Or doesn't it?


Mike Warren: No, it can't. There's no checkpointing these sorts of problems in any easy manner. So everything's got to work perfectly for the three or six hours that this benchmark is running, so the reliability is very important. And I've heard anecdotal reports of, you know, some of these very fast top-ten supercomputers needing to try to run the benchmark several times before they can get it to run reliably.


Corey Quinn: So when you wound up doing this, did you just over-provision by a bit and assume you would lose some number of nodes along the way?


Mike Warren: That also doesn't work. Once you've sort of labeled each of these processors with a number, you can't have one disappear. It's got part of the state of the problem in it. So if you start with 1,200 nodes, you need them all to work the whole time. So that's why these tightly coupled applications are a lot more challenging to scale up.


Corey Quinn: So when you wound up doing this, you effectively wound up with how many instances that were a part of this? I mean, I saw the statistics on petaflops, but that's hard for me to put into something that I can think of.


Mike Warren: Yeah, it was a bit under 1,200 nodes, and these were 36-core processors, or rather 18-core processors with two dies per node. So that gives you 41,000-odd processors, and these are the hardware cores, not the abomination of virtual core counting.
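[Editor's note: for anyone checking the arithmetic, a back-of-the-envelope sketch. The node count and clock speed here are approximations drawn from the conversation, not the official TOP500 submission.]

```python
# Rough arithmetic behind the numbers quoted above (all approximate).
nodes = 1_150                   # "a bit under 1,200 nodes"
cores_per_node = 2 * 18         # two 18-core dies per node
cores = nodes * cores_per_node  # ~41,400 physical cores, the "41,000-odd"

# An AVX-512 core doing fused multiply-adds can retire
# 2 FMA units * 8 doubles * 2 flops per FMA = 32 flops per cycle.
flops_per_cycle = 32
clock_ghz = 2.5                 # assumed sustained AVX-512 clock

peak_pflops = cores * flops_per_cycle * clock_ghz * 1e9 / 1e15
print(f"theoretical peak: ~{peak_pflops:.1f} petaflops")
# HPL sustains only a fraction of theoretical peak, which is how
# ~41,000 cores land "near-petaflop" rather than several petaflops.
```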


Corey Quinn: Gotcha. Yeah, it looks like you did this entirely on top of C5s. The performance of those things is impressive, and the economics of it are fantastic. For me, I think the hardest part to wrap my head around is the fact that you had 1,200 of these things running without a hiccup for three straight hours on Spot, which historically was always extremely interrupt-prone. And it's surprising to me that none of those wound up getting reclaimed during that window. Just at that scale, I would expect even with traditional computers that one of the 1,200 is going to fall over and crash into the sea, because I'm lucky like that.


Mike Warren: Well, there are, I think, two things you're talking about there. One of those is solved by Amazon's Spot blocks. So, you know, you say you want a certain number of processors for some number of hours between one and six, and then Amazon guarantees it's not going to optionally take one of those away from you. Those are a bit more expensive than normal Spot, but a lot less expensive than the non-Spot instances.


So what becomes important is just the inherent failure rate of the hardware and the network. And in our experience, the cloud resources we've used have been incredibly reliable, to the point where, you know, we've certainly seen in Google their predictive task migration, where they can sort of understand that a node is about to fail and then migrate that whole kernel and its processes to another piece of hardware so that you never know about it. And they can pull that defective hardware out and fix it.
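[Editor's note: a Spot block is requested through the same API as ordinary Spot capacity, with one extra parameter. A minimal sketch, reusing the illustrative launch parameters from the earlier example:]

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# BlockDurationMinutes turns this into a defined-duration "Spot block":
# capacity AWS will not reclaim for the requested one to six hours.
resp = ec2.request_spot_instances(
    InstanceCount=1000,
    BlockDurationMinutes=360,  # 1-6 hours, in 60-minute increments
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",        # hypothetical HPC image
        "InstanceType": "c5.18xlarge",
        "Placement": {"GroupName": "hpl-cluster"},
    },
)
for req in resp["SpotInstanceRequests"]:
    print(req["SpotInstanceRequestId"], req["State"])
```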


Corey Quinn: The idea of being able to do something like this almost on a lark on some given afternoon, for what generally falls well within any arbitrary employee's corporate spending approval limit, is just mind-boggling to me. I know I can keep belaboring this, but it's one of those things that just tends to resonate an awful lot. Can you talk to me at all about how this winds up manifesting compared to, I guess, the stuff you do day to day? I'm going to guess that Descartes Labs doesn't have a lot of interesting stuff. I mean, you folks work on satellite and aerial imagery, to my understanding, but how does that tie back to, effectively, step one: take a giant supercomputer; we'll figure out step two later?


Mike Warren: Well, it's really democratizing supercomputing, and it's an extension of the power that software gives an individual. You know, we're able to do things now, with good software and a smart person, that used to take an entire group of people with lots of infrastructure to do. So, it's always been something I've been very interested in, and it goes back to our building Beowulf clusters back in the '90s. That was really democratizing parallel computing for people who couldn't afford the state-of-the-art supercomputers. And there were untold times that, you know, people collared me and told me stories about their group in college that had a cluster they built in a closet, which allowed them to do the research that they were doing.


Corey Quinn: So, on a day-to-day basis, what does, I guess, your computing environment look like? I mean, you spun out of a national lab, which, for starters, we know is going to be something that is relatively, shall we say, computationally impressive. Mike Julian, my business partner, used to help run the Titan supercomputer at Oak Ridge National Lab about 10 years ago, so I've gotten absorbed into it by osmosis, almost. And I always view that as, there go smart people; I just sit here and make sarcastic comments on the side. But what does that look like on a day-to-day basis of what Descartes Labs actually does?


Mike Warren: Well, the last big computation I did before we founded Descartes Labs was on the Titan machine. We had 80 million CPU hours to calculate the evolution of the universe, and we did that with a trillion particles. So a lot of that kind of thinking has carried over to our environment at Descartes Labs, and I spend a lot of my time in the command line writing Python scripts, but our interaction with customers is much more focused around APIs and making it very easy to interact with these petascale data sets. You tell us where on the Earth and what time over the last 20 years you want to see an image, and we can deliver that to you in less than a second.
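[Editor's note: to put 80 million CPU hours in perspective, a rough aside using Titan's public specs of roughly 18,688 nodes with 16 CPU cores each; the arithmetic is only illustrative.]

```python
# How long 80 million CPU hours takes on a machine of Titan's size
# (core count from Titan's public specs; arithmetic is illustrative).
cpu_hours = 80_000_000
titan_cpu_cores = 18_688 * 16   # ~299,000 CPU cores

wall_hours = cpu_hours / titan_cpu_cores
print(f"~{wall_hours:.0f} hours (~{wall_hours / 24:.0f} days) "
      "with the entire machine running nothing else")
```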


Corey Quinn: And are you building this entirely on top of AWS? Are you effectively up for all comers as far as large clouds go? Is there a significant on-prem component?


Mike Warren: There's no on-prem at all. Descartes Labs' IT infrastructure consists of a laptop for everyone, essentially. The bulk of our platform is implemented in Google Cloud. We do have data input processes that are running in AWS, but we've tried to keep most of the platform cloud-agnostic, so that, you know, we could move to another cloud if that made sense in terms of the economics.


Corey Quinn: Often a great approach, especially with what you're talking about. It sounds like the way that you're designing things requires software that's highly portable, to almost wherever the data itself happens to live. You're not necessarily needing to leverage a lot of the higher-level services in order to start effectively chewing on mass quantities of data with effectively undifferentiated compute services. It feels like the more you look at something like this, the more the economic story starts to be about the cost of moving data around. It turns out that moving compute to the data is often way cheaper than moving data to the compute.
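[Editor's note: a ballpark sketch of the data-gravity economics Corey is describing. The egress rate below is an assumption in the neighborhood of published internet egress pricing at the time; actual rates vary by provider and volume tier.]

```python
# Back-of-envelope: cost of moving a petabyte out of a cloud region
# versus pointing compute at it where it lives. Prices are ballpark.
egress_per_gb = 0.09      # assumed internet egress rate, USD per GB
dataset_gb = 1_000_000    # 1 PB of imagery

egress_cost = dataset_gb * egress_per_gb
print(f"moving the data out: ~${egress_cost:,.0f}")  # ~$90,000

# A few thousand dollars of Spot compute next to the data, as in the
# HPL run discussed above, is often the cheaper move.
```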


Mike Warren: Right. And our philosophy has been to work at a fairly low level. You know, one of our big successes has been essentially implementing a virtual file system on top of the cloud object store. So we can take any number of open-source packages, and they see a POSIX file system to interact with, and we don't have to spend a lot of time rewriting the I/O routines there.
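[Editor's note: Descartes Labs' own layer isn't shown here, but as an analogy for the general idea, off-the-shelf libraries like gcsfs present object storage through a familiar file interface. The project ID and object path below are hypothetical.]

```python
# One off-the-shelf illustration of the general idea: gcsfs exposes
# Google Cloud Storage objects through a file-like interface, so code
# written against POSIX-style file handles needs no rewritten I/O.
# (This is an analogy, not the Descartes Labs implementation itself.)
import gcsfs

fs = gcsfs.GCSFileSystem(project="my-project")  # hypothetical project ID

with fs.open("my-bucket/scene_001.tif", "rb") as f:  # hypothetical path
    header = f.read(1024)  # reads issue range requests under the hood
    print(len(header))
```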


Corey Quinn: A lot of those open-source packages, are they relatively recent? Are they effectively coming from, I guess I want to say the heyday of a lot of the HPC work, but at least what felt like the heyday of it, back in, I want to say, 2005 to 2010? That may not actually be the heyday, just when I was looking into it, but I'm assuming that my experience mirrors everyone else's?


Mike Warren: No, it spans the whole range. We go all the way from, you know, 20-year-old, million-line Fortran packages to, you know, the latest convolutional neural network, which was written a month ago.


Corey Quinn: So, looking back at getting this project up and running, what could AWS have done to make it easier for you, if anything?


Mike Warren: Well, I think the environment that they offer is what I worked in 20 years ago. It's Linux plus Intel plus a fast network. And, you know, the real key in my mind is that the real expense is the software engineers that you need to write this code and deploy it. And now Amazon has eliminated all the friction around the hardware capacity to actually execute that and bring it to reality. So that's kind of the magic.


Mike Warren: Now, any improvements we can make in terms of making our programmers more efficient are not overshadowed by the fact that they can't get access to a petaflop machine to test it or help develop it. You know, it's remarkable; now you can probably get more capacity in AWS with five minutes' notice than you can on most of these supercomputers, which have to keep their queues very full to keep their utilization high, to justify the cost of that dedicated hardware.


Corey Quinn: When you take a look at doing this again... I mean, effectively, at first, is this something you would ever consider doing again, just as a neat proof of concept? I mean, arguably, it got more attention when I linked to it in Last Week in AWS than most articles I link to. So it's clearly resonated with people.


Mike Warren: I mean, sure. Maybe we'll help our local university do the next one, or, you know, do it in a couple of other clouds at the same time. It's a big community, and it's not... You know, the TOP500 is kind of a... it doesn't provide any utility in itself; it's kind of just demonstrating what could be possible with a set of hardware. So I'm kind of more interested in going to the next level of, let's run more codes than we already are in this HPC environment.


Corey Quinn: So I guess the big question here is what inspired you to do this on AWS? You mentioned that the bulk of what you're building today is on GCP. What triggered that decision from your perspective?


Mike Warren: I think AWS has been the first to have the network infrastructure that makes this possible. You know, in the same way that you need all of the CPUs to be fully available during this computation, you can't have any network bottlenecks. So, it's the new architecture and scalability of AWS's network: not having other users interfering with the network bandwidth available, and having low enough latency in these messages. AWS was just the first to get there, but Azure has also demonstrated this sort of performance in their hardware that has dedicated low-latency networks. And, you know, I would imagine Google is not far behind.


Corey Quinn: It's interesting, and I think that a lot of people, myself included, don't tend to equate incredibly reliable networking as a prerequisite for something like this. But in hindsight, that is effectively what defines supercomputers. Things like InfiniBand: how quickly can you get data, not just across the bus inside of a given node, but between nodes, in many cases. It's one of those things that sounds super easy until you look into it and then realize, "Huh, the way this is currently architected from a network point of view, we are never going to be able to move the kind of data that we think we're going to be working on to all of the nodes in question." It's one of those, I think, very poorly understood aspects of systems design, especially in this modern world where everyone is going to more or less wind up in a place of "it's just an API call away." The number of people who have to think about that is getting smaller, not bigger.


Mike Warren: Definitely. And there's, I think, some research that needs to be done on the range of latencies. You know, InfiniBand's down at the microsecond level. AWS can now do things at the 15-to-20-microsecond level, and that's a lot shorter than the previous generation of 10-gigabit Ethernet, which a lot of applications showed didn't have low enough latency for their needs. So for a lot of these important sorts of molecular dynamics, seismic data processing, all these big HPC applications, we really don't know if they're limited by the current implementations, or if you really need to go down to the microsecond-latency network.


Corey Quinn: What's next? I mean, you've been doing this an awfully long time. You've been focusing on HPC, working on compute at a scale that, frankly, most of us have a difficult time imagining, let alone working with. What do you see as the next evolution down this path? I mean, HPC historically has been more or less the purview of researchers and academics. Now we're starting to see these types of things move into the commercial space in a way that I don't think we did before.


Mike Warren: Well, that's, I think... I saw a factor of a million improvement in performance from when I started writing our gravitational N-body evolution code in graduate school. And you think about a factor of a million in anything else in your experience; it's just never happened. So parallel computing has been a unique experience over the last 20 years, and where it's at now is just being available to tens or hundreds or thousands of times more programmers. So I think finally we'll get to an era where the investment in hardware has not disadvantaged all of these programmers who could really make some breakthroughs, but it's hard to do that when your code goes 10 times faster every five years without you having to do anything.


Corey Quinn: There's something to be said for the idea of a cloud computing provider where you just throw an application into their environment and you don't touch it again, and over time the network gets better, the disks get more reliable, and the instances it runs on get faster. If you try that in your data center, raccoons generally carry it off by the end of year three, and that says terrible things about my pest control, but also about, I guess, the way that people tend to, on some level, reduce these hyperscale cloud providers down to, "Oh, it's just someone else's computer. It's a different place for me to run my VMs." And I think that you're demonstrating that it has the potential and the opportunity to be far more than that.


Mike Warren: Absolutely. I mean, there are now clear economies of scale for HPC, whereas before, these were all very specialized systems in a not-very-big market. So the real democratization puts this power in the hands of anyone who can write the right type of software to take advantage of it, and it becomes a true commodity that is really just distinguished by its cost.


Corey Quinn: Wonderful. If people want to learn more about how you've done this and more about what you folks are up to over at Descartes Labs, where can they find you?


Mike Warren: We're at descarteslabs.com. We've got a good series of blog posts around what we're up to and what we're doing, and we've got big plans to grow our data platform beyond geospatial imagery into a lot of other very big data sets that are relevant to what's happening in the world.


Corey Quinn: So I'm going to go out on a limb and assume that you're hiring?


Mike Warren: We definitely are, and it's a great place to work for a scientist. I took my experience at a national lab, and having worked at universities, and I think we've put together the best of that history of research and engineering and made it into a really great place to work, and think about software, and solve the biggest problems that the world is facing.


Corey Quinn: Thank you so much for taking the time to speak with me today, Mike. I appreciate it.


Mike Warren: Thanks Corey. It's been fun.


Corey Quinn: Mike Warren, co-founder and CTO of Descartes Labs. I'm Corey Quinn. This is Screaming in the Cloud.


Announcer: This has been this week's episode of Screaming in the Cloud. You can also find more Corey at screaminginthecloud.com or wherever fine snark is sold.


Announcer: This has been a HumblePod Production. Stay humble.