Alex is a software developer at Wellcome Collection, a museum in London that explores the history of human health and medicine. Their role primarily focuses on preservation and building systems to store the museum’s digital archive. Alex also helps run the annual PyCon UK conference, with a particular interest in the event’s diversity and inclusion initiatives.
Join Corey and Alex as they discuss how Alex built a calculator using DynamoDB, the role Corey played in inspiring Alex to do that, what Corey means when he calls someone a “code terrorist,” the features that are packed into Alex’s calculator, why Corey thinks the Wellcome Collection would be a great acquisition for AWS, why the museum always likes to keep two copies of things, how Glacier Deep Archive is great for long-term storage, the challenges museums face in the 21st century vs. the challenges they faced in the 18th century, what it’s like to digitize Betamax, VHS, and CD-ROMs, how to find items in a vast digital archive, and more.
About Alex Chan
Alex is a software developer at Wellcome Collection, a museum in London that explores the history of human health and medicine. Their role primarily focuses on preservation, and building systems to store the Collection’s digital archive. They also help to run the annual PyCon UK conference, with a particular interest in the event’s diversity and inclusion initiatives.
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Cloud Economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: This episode is sponsored in part by Catchpoint. Look, 80 percent of performance and availability issues don’t occur within your application code in your data center itself. It occurs well outside those boundaries, so it’s difficult to understand what’s actually happening. What Catchpoint does is makes it easier for enterprises to detect, identify, and of course, validate how reachable their application is, and of course, how happy their users are. It helps you get visibility into reachability, availability, performance, reliability, and of course, absorbency, because we’ll throw that one in, too. And it’s used by a bunch of interesting companies you may have heard of, like, you know, Google, Verizon, Oracle—but don’t hold that against them—and many more. To learn more, visit www.catchpoint.com, and tell them Corey sent you; wait for the wince.
nOps will help you reduce AWS costs 15 to 50 percent if you do what it tells you. But some people do. For example, watch their webcast, how Uber reduced AWS costs 15 percent in 30 days; that is six figures in 30 days. Rather than a thing you might do, this is something that they actually did. Take a look at it. It's designed for DevOps teams. nOps helps quickly discover the root causes of cost and correlate that with infrastructure changes. Try it free for 30 days, go to nops.io/snark. That's N-O-P-S dot I-O, slash snark.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined this week by Alex Chan, who among many other things that we will get to is, most notably, a code terrorist. Alex, welcome to the show.
Alex: Hi, Corey. Thanks for having me.
Corey: So, you've built something wonderful and glorious. Well, that's my take on it. Most other folks are going to go in the opposite direction of that, and start shrieking. Namely, you've discovered that AWS ships a calculator, something the iPad does not, but AWS does, and the name of that calculator is, of course, DynamoDB. Tell me a little bit more about how you made this wondrous discovery.
Alex: So, I was watching one of your videos where you were talking about some of the work you were doing in AWS land, and you were talking about how you were starting to explore DynamoDB. And DynamoDB is the primary database that we use in my workplace, on AWS. And I knew you could do a little bit of mathematics in AWS; you could do sort of simple addition using things like the update expression API, and you can also get conditional logic using conditional expressions, and then I decided to string all that together with a series of Python and see if I could assemble those calls to get a basic working calculator.
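For readers who have not met the two DynamoDB features Alex mentions, here is a minimal sketch of what such requests could look like. It is not Alex's actual code: the table name `calculator`, the key attribute `id`, and the numeric attribute `value` are assumptions made for illustration. The functions only build the keyword arguments that could be passed to boto3's `update_item`, so nothing here talks to AWS.

```python
def build_add_request(table_name, item_id, amount):
    """Build update_item arguments that add `amount` to a stored number.

    The addition is carried out by DynamoDB's update expression,
    not by Python.
    """
    return {
        "TableName": table_name,
        # Hypothetical key schema: a single string partition key named "id".
        "Key": {"id": {"S": item_id}},
        # "#v" stands in for the attribute name, ":n" for the operand.
        "UpdateExpression": "SET #v = #v + :n",
        "ExpressionAttributeNames": {"#v": "value"},
        "ExpressionAttributeValues": {":n": {"N": str(amount)}},
        "ReturnValues": "UPDATED_NEW",
    }


def build_conditional_request(table_name, item_id, threshold, new_value):
    """Build update_item arguments that only apply if value > threshold.

    If the condition is false, DynamoDB rejects the write, which is how
    a condition expression can stand in for an "if" statement.
    """
    return {
        "TableName": table_name,
        "Key": {"id": {"S": item_id}},
        "UpdateExpression": "SET #v = :new",
        "ConditionExpression": "#v > :t",
        "ExpressionAttributeNames": {"#v": "value"},
        "ExpressionAttributeValues": {
            ":t": {"N": str(threshold)},
            ":new": {"N": str(new_value)},
        },
    }
```

With credentials and a matching table in place, a call along the lines of `boto3.client("dynamodb").update_item(**build_add_request("calculator", "acc", 2))` would ask DynamoDB to do the addition.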
Corey: Some folks would say that you could just do the calculator bits without the DynamoDB at all. Python has a terrific series of math functions and other libraries you can import. What do you say to those people?
Alex: The thing about running your code in Python is that you have to have somewhere to run it, and that's probably going to be something like a server. Whereas if you run it in DynamoDB, that's serverless. And as we know, that's much better.
Corey: Oh, absolutely because that's the whole point of modern architectures, where we wind up taking things that exist today, and then calling them crap, and reimagining them from first principles, just like we would on Hacker News, and turning them into far future ways that don't add much value but do let us re-architect everything we've done yet again and win points on the internet which, as we all know, is what it's all about. And I'm extremely impressed by this, but my question now is, so as you've figured this out, what got you to a point where you looked at a database and said, “You know what? I bet I can make that a calculator.” How do you get there from here?
Alex: So, in this specific case, I’d done just a little bit of work with DynamoDB already, and I sort of had a vague notion that you could do something like this, but I'd never really understood. It was this API that I knew existed, and so I just started working through the documentation, pulling apart examples, and discovering, “Oh, yes, actually, this API can do this.” And in the process, I actually came up with a much better understanding of the API than I had before.
Corey: One of the interesting pieces to me is that there's a certain class of people—and I refer to y'all as code terrorists, and I aspire to be one on some level—I look at something and my immediate question is, how can I misuse this? Kevin Kuchta, for example, was able to build a serverless URL shortener with Lambda—just Lambda; no data store. It was a self-modifying Lambda function, which is just awesome, and terrible, and wonderful all at the same time. And you've done something similar, and I have, in the sense of getting Route 53 to work as a relational database, sort of. And whenever I describe these things to people, they look at me in the strangest and arguably saddest ways that are just terrible, absolutely terrible.
I love the idea, I love the approach, I love the ethos, and I want to see more of it, but there's a long time that goes between me coming up with something like Route 53 as a database, and then there's a drought, and then eventually someone like you comes up with now we're going to use Dynamo as a calculator. How do we get more of this? I want to have these conversations more than once every six months.
Alex: How do we get more of this? I think, first of all, just coming up with the ideas and actually putting them into people's hands. We don't necessarily need to be the people who do them. Obviously, this DynamoDB as a calculator came about because of a throwaway comment you made on a video that I watched. So, if we can think of ways to, I don’t know, use SQS as persistent storage, or use SNS for two-way communication, and then you put those ideas out there and it will fit in someone's head—and hopefully someone who knows a bit more about these things—and they will start to think about it, and they will know what the edge cases are, and what the little bits of API they can exploit. And that's how these things come about, I think. But we've got to go out there and plant the idea in somebody else's head, and then let them sit on it three months, and then finally go, “Aha. I know how I'm going to do that.”
Corey: So, I didn't realize I was one of the proximate causes of this, and I take a little bit of joy and, I guess, pride in the fact that I can inspire that. So, there are a few different directions to take it in now. One, do you think that Texas Instruments has woken up yet to the fact that they have a new competitor in the form of AWS?
Alex: I mean, I've not heard from any of their lawyers yet. So, I have to assume, no.
Corey: I mean, the TI-83 hasn't changed since I was in high school 20 years ago, and it feels like maybe Texas Instruments is not the best-suited company, as far as ‘prepared to innovate rapidly’ goes. I kind of hope we’ll finally break their ridiculous calculator cartel.
Alex: I mean, certainly, I think that's one of the greater injustices in the tech industry today is that, you know, Texas Instruments remains the undisputed king of graphing calculators, and obviously we look forward to DynamoDB breaking that hegemony.
Corey: Well, that's the real question. How far does this go as a calculator? Basic arithmetic is sort of a gimme; what does it do beyond that?
Alex: I implemented the basic arithmetic operations: addition, multiplication, subtraction, and division. And then along the way, as I was trying to build those, I ended up coming up with a series of logical operators. So, first, NOT, then OR, and then an AND, and then finally, a NAND gate. Now, I don't have a computer science background, but I can read Wikipedia, which is almost as good, and Wikipedia tells me that if you have a NAND gate, you can essentially build a modern processor. And since we can do a NAND gate in DynamoDB, there's no reason we can't build virtual processors on top of DynamoDB as well. And once you can simulate a virtual processor, then really it's a few short steps from there to having the next EC2.
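The Wikipedia claim Alex cites is that NAND is functionally complete: every other logic gate can be composed from it. A minimal sketch in plain Python, where the function names are chosen for illustration and nothing runs in DynamoDB:

```python
def nand(a: bool, b: bool) -> bool:
    """The only primitive: false only when both inputs are true."""
    return not (a and b)


def not_(a: bool) -> bool:
    # NAND of a value with itself inverts it.
    return nand(a, a)


def and_(a: bool, b: bool) -> bool:
    # AND is an inverted NAND.
    return not_(nand(a, b))


def or_(a: bool, b: bool) -> bool:
    # De Morgan: a OR b == NOT(NOT a AND NOT b).
    return nand(not_(a), not_(b))


def xor_(a: bool, b: bool) -> bool:
    # The standard four-NAND construction of XOR.
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))


def half_adder(a: bool, b: bool) -> tuple[bool, bool]:
    """Add two bits, returning (sum, carry)."""
    return xor_(a, b), and_(a, b)
```

A half adder is the first step from logic gates back towards arithmetic, which is why a single NAND primitive is enough, in principle, to rebuild the calculator.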
Corey: Well, what I was wondering is, now we have the scientific calculator stuff potentially taken care of, the next step becomes, clearly, graphing calculators. Is that something that DynamoDB is going to be able to do, or are you going to have to switch over to Neptune, AWS’s graph DB, which presumably will be needed for a graphing calculator?
Alex: I confess I haven't used Neptune. I have thought a bit about, though, how you might—
Corey: Well, you and everyone else, but that's beside the point, really.
Alex: But I have thought about how you might use DynamoDB for a graphing calculator, and really the answer here, obviously, is that we're going to have to turn to the console. The console will show you the rows in your DynamoDB database, and so we just use very narrow column names, and then we fill them with ones or zeros, and then that will allow you to draw shapes in the console.
Corey: I'm thinking through the ramifications of that. If you're able to suddenly start drawing shapes in the console, that would put the database system—now a graphing calculator—significantly further ahead than other AWS services like, for example, Amazon QuickSight, which ideally is a visualization tool, and in practice serves as a sad punchline that we're hoping they improve faster than Salesforce can ruin Tableau. But right now, it's like watching a turtle race.
Alex: Exactly, and I think this is the flexibility of having a general-purpose compute platform. You can do arithmetic operations, you can do your scientific calculator, but then you can build on that to do more sophisticated things, like visualizations. And I think it's a shame that more people haven't tapped the power of DynamoDB already.
Corey: I keep hoping to see further adoption. So, changing gears slightly on this, you have a, apparently, full-time day job that is not building things like this, which first, may I just say, is a tragedy for all of us. But secondly, what you do is interesting in its own right, tell me a little bit about it.
Alex: So, I work for an organization called Wellcome Collection, which is a museum and library in London—
Corey: And that is Wellcome with two L’s.
Alex: Wellcome with two L's. It ruins your ability to spell the greeting.
Corey: I'm waiting for AWS to acquire them. Anything that has a terrible name with extra letters, vowels, et cetera, seems like it is exactly on-brand for them.
Alex: I couldn't possibly comment.
Corey: Of course not. So, what does the Wellcome Collection do?
Alex: So, we are a museum and library, and we primarily think about human health and medicine. So, obviously, we've got a lot to think about right now. And one of the things we do is we have a large digital archive, so that's a significant quantity of born-digital material, where somebody actually gives us their files, their documents, their presentations, their podcast recordings, and also a significant amount of digitized material, where we've got some book in the archive, we take a photograph, we can put that photograph on the internet, and then people can read the book without actually having to physically come to the museum. And what I work on currently is the system that's going to hold all of those files because it turns out that if you have 60 terabytes of stuff and you just put it on a hard drive and leave it in the closet, that's apparently not so great.
Corey: It's great, right up until magically it isn't, or so I'm told. But yeah, you're right. Every time I talk about long-term archival storage, it seems like there's a difference in terms. “Oh, some of our old legacy archives are almost five years old,” is a very different story when we're talking in the context of a library. People who believe the internet is forever, my counter-argument to them is, “Great. Do me a favor, what is the oldest file on your computer?” And that tends to sometimes be an instructive response.
Alex: Absolutely. I mean, our digitization program goes back about a decade at this point. So, we've been keeping files for that long. But then we're looking very far into the future with the archive in general.
And in fact—so rules vary around the world, but in the UK, certainly, the standard rule is that if you've got an archive about a living person, you typically close that archive for 70 years after their death. So, that means they're gone, anyone who remembered them was gone, and also probably their children—and maybe their grandchildren as well—are gone because, particularly when you're dealing with medical records, say people may not want it known that their grandfather was in this particular hospital. So, we are planning very much decades or even, in some cases, centuries into the future.
Corey: So, when you're looking at solutions that need to transcend decades, does that mean that cloud services are on the table, off the table, part of the solution? How do you approach this?
Alex: So, for the work we're doing, we very much do use cloud services. We’re mostly running in AWS, and then we're going to start doing backups into Azure soon because you don't really want to rely on a single cloud provider for this sort of thing, in case AWS hear what I'm doing with DynamoDB and close our account, and in part, because organizations like AWS, like Microsoft Azure, they have much more expertise than we can have in-house on building very robust, reliable systems. So, if you imagine a 60 terabyte archive, and you're holding that locally, that's probably at the limit of your personal expertise on how to store that amount of data safely. If you give 60 terabytes to the Amazon S3 team and say, “This is a lot of data,” they will laugh at you because their entire job is around storing large amounts of data, ensuring it lasts a very long time, ensuring that when disks fail they get replaced and the data is replicated back onto the new system. So, for us, we've really embraced using the Cloud as a place to put all this stuff because a whole lot of problems around, “Is this disk still going to work in two years?” are solved for us.
Corey: There's another challenge, too, in some ways. As you look at larger and larger datasets and looking at cloud providers—one of my favorite tools in Amazon's archival toolbox has been this idea of Glacier Deep Archive where the economics are incredible. It's $1,000 per month per petabyte, which is just lunacy-scale pricing. But retrievals can take 12 to 24 hours depending upon economics and how you want to prioritize that. And that works super-well in scenarios where you're a company and you need to keep things around for audit or compliance purposes, and your responsiveness to those requests is going to be measured with a calendar rather than a stopwatch, but for a library where you need to have a lot of these things available online when people request them, 12 to 24 hours seems like an awfully long time to sit there and watch a spinner go around on a website.
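To put Corey's figure next to the 60 terabyte archive mentioned earlier, here is a rough back-of-the-envelope sketch. It uses only the quoted $1,000 per petabyte per month, assumes decimal units (1 PB = 1,000 TB), and ignores retrieval, request, and transfer charges, so it is an illustration rather than a pricing calculator.

```python
PRICE_PER_PB_MONTH_USD = 1_000  # the figure quoted in the conversation
TB_PER_PB = 1_000               # assumption: decimal units


def monthly_storage_cost_usd(terabytes: float) -> float:
    """Storage-only cost per month at the quoted rate."""
    return terabytes * PRICE_PER_PB_MONTH_USD / TB_PER_PB


def yearly_storage_cost_usd(terabytes: float) -> float:
    """Twelve months of storage at the quoted rate."""
    return 12 * monthly_storage_cost_usd(terabytes)


print(monthly_storage_cost_usd(60))  # 60.0
print(yearly_storage_cost_usd(60))   # 720.0
```

At that rate, the cold copy of a 60 terabyte archive would cost on the order of $60 a month to keep.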
Alex: It is and it isn’t. Twelve to twenty-four hours is actually, in the context of some library things, is quite fast. If you're requesting physical objects certainly, there are a lot of things in the library you can't just go and pick up off the shelf. London real estate is expensive, so about half of our physical collection actually lives in a salt mine in Cheshire, and if you want to see it, you make a special request to us, and a van drives to the salt mine and picks it up for you.
Corey: In a digital context, we generally just refer to the ‘salt mine’ as Twitter.
Alex: Exactly. The way we actually handle this for most of our work is we have two copies of everything because you never want to have just one copy of it because then you're one fat-fingered delete away from losing your entire archive. So, we have one copy that lives in Standard-IA. That's the warm copy; that's the copy that we can call up very quickly if someone wants to look at something on our website, and then we have a second copy that lives in Glacier Deep Archive, and then that’s separate. Nothing should be reading that; nothing should be really touching that, but we know there's a second copy there if something terrible happens to the first copy.
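One way the warm-copy and cold-copy arrangement could be expressed is sketched below. This is an illustration, not Wellcome Collection's implementation: the use of two separate buckets and both bucket names are assumptions. The function only builds the arguments for two S3 `put_object` calls, one per storage class, so it runs without AWS access.

```python
WARM_STORAGE_CLASS = "STANDARD_IA"   # quick to read, used for day-to-day access
COLD_STORAGE_CLASS = "DEEP_ARCHIVE"  # slow to retrieve, kept as the safety copy


def build_copy_requests(key, body, warm_bucket, cold_bucket):
    """Return put_object arguments for the warm copy and the cold copy.

    `warm_bucket` and `cold_bucket` are hypothetical names supplied
    by the caller.
    """
    return [
        {
            "Bucket": warm_bucket,
            "Key": key,
            "Body": body,
            "StorageClass": WARM_STORAGE_CLASS,
        },
        {
            "Bucket": cold_bucket,
            "Key": key,
            "Body": body,
            "StorageClass": COLD_STORAGE_CLASS,
        },
    ]
```

Each returned dictionary could then be passed to `boto3.client("s3").put_object(**request)`. Keeping the two writes as separate requests reflects the point that nothing in normal operation should need to touch the second copy.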
Corey: It comes down to the idea of what the expected tolerances and design principles going into a given system are. You're talking about planning things that can span decades into centuries, and I'm curious as to how much the evolution of what it is you're storing is going to change, grow, and impact the decisions you make. For example, if I'm starting a library in the 1800s, the data that I care about is effectively almost entirely the printed word, and ideally some paintings but that's sort of hard to pull off. As we continue to move into the 20th century, now you have video to worry about, and audio recordings that are becoming a technology. And nowadays we're seeing much higher fidelity video, and larger and larger objects while the cost of storage continues to get cheaper over time, as well. So, I'm curious as to how you're looking at this. Today's 60 terabyte archive could easily be 60 exabytes in 20 or 30 years. We don't necessarily know. How are you planning around that looking forward?
Alex: Well, it's very hard to predict the future. And if I could, I would have made some very different life decisions. So, what we've done instead is just try to build it in quite a generic way, in a way that doesn't tie too strongly to the content that we're storing. The storage service we've built mostly just treats the files entirely as opaque blobs: it puts them in S3, it puts them in that Glacier Deep Archive, it checks they're correct, but it doesn't care if you hand it a JPEG, or movie file, or Word document. It's just going to make sure that file is tracked, and is sorted vaguely sensibly. And we're hoping that that will give us the flexibility to continue to change the software that supports it as our requirements change.
One of the things we were very conscious of is any software that we write in 2018 and 2019 to do this sort of thing is going to be obsolete and thrown away, probably within a decade, certainly by 2040. And so we want to design something and store the data in a way that was not tied to a particular piece of software, and that somebody could come along in the future, and pull it back out again and understand how it was organized, or start adding their own files to it if our software has long gone.
Corey: If you’re looking to wind up standing up infrastructure but don’t want to spend six months going to cloud school first, consider Linode. They’ve been doing this far longer than I have, they’re straightforward to get started with, their support is world-class—you’ll get an actual human, empowered to handle your problem rather than passing you off to someone else like some ridiculous game of ticket tennis—and they are cost-competitive with any other provider out there, with better performance in almost every case. Visit linode.com/morningbrief to learn more. That’s linode.com/morningbrief.
Corey: Historically, when I was doing this stuff with longer-term, “Archival media,” quote-unquote—you know, those special CDRs that are guaranteed to last over a decade. Now the biggest problem is finding something to read them because technology moves on. Bit rot became a concern; the idea that the hard drive that you stored this on doesn't wind up working, or there's media damage, or it turns out that there was a magnet incident in the tape vault. Whatever it is, the idea that eventually the media that holds that data winds up eroding underneath it, rendering whatever it stores as completely unrecoverable. How do you think about that?
Alex: We think about that a lot because we still have a lot of that magnetic media. We still have Betamax cassettes, and VHS tapes, and CD-ROMs, and one of the big things we're currently doing is a massive AV digitization project to digitize as much of that as possible before it becomes unreadable. I think if I'm remembering correctly, Sony stopped making Betamax players a number of years ago, so the number of players left in the universe—and the number of spare parts—is now finite, and is only going to get smaller. And even though those tapes might be good for another 10 years in our temperature-controlled vaults, we and a lot of similar organizations are really having to prioritize digitizing that and converting it to a format that can be stored in something like S3 because otherwise, it's just going to be lost forever.
Corey: Do you find that having to retrieve the data every so often, and validate that it's still good, and rewrite it is something that is viable for this? Does that not solve the problem the way it needs to be solved? I've dabbled looking into a couple of options at this stuff years ago, and never really took it much further than that. So, I'm coming at this with a very naive perspective.
Alex: We've never looked at this in detail, but we have done exercises where we pull out large chunks of the archive, and completely re-checksum them, and validate that the SHA-256 of the thing we wrote six months ago is indeed still the SHA-256 of the thing that's now sitting in S3. We did this for a significant chunk of the archive recently; it was pretty cost-effective. We were able to run it on Amazon Fargate, scaled it out massively, ran in parallel, it was very nice. The biggest cost was in fact, the cost of all the GetObject calls we had to make against S3, but it was a couple of hundred dollars at most. So, it's the sort of money where if we felt it was important to do, we’d just do it again.
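The re-checksum exercise Alex describes boils down to recomputing a SHA-256 digest and comparing it with the one recorded at write time. A minimal sketch of that check using Python's standard library follows. The function names and the idea of reading in 1 MiB chunks are assumptions for illustration; in practice the stream would be something like the body of an S3 GetObject response rather than an in-memory buffer.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size=1024 * 1024):
    """Hash a file-like object in chunks so large files never sit in memory."""
    digest = hashlib.sha256()
    # read() returns b"" at end of stream, which stops the iteration.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def is_intact(stream, recorded_sha256):
    """True if the bytes read now match the checksum recorded earlier."""
    return sha256_of_stream(stream) == recorded_sha256


# Usage: record a checksum at write time, then verify it later.
original = b"a page scanned from a book"
recorded = sha256_of_stream(io.BytesIO(original))
print(is_intact(io.BytesIO(original), recorded))         # True
print(is_intact(io.BytesIO(original + b"!"), recorded))  # False
```

Because each file is checked independently, a job like this parallelizes naturally, which fits Alex's description of scaling the run out across many containers.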
Corey: It feels like on some level, that's what things like S3 have to be doing under the hood where they have multiple—like this idea of erasure coding or information dispersal—the idea of you can have certain aspects of it rot and it doesn't tarnish the entire thing, [00:20:55 unintelligible] some arbitrary percentage. And we've played with this on Usenet in years past with parity files: download enough objects and you have enough to reconstruct the whole.
Alex: Exactly. And there are people at Amazon whose entire job revolves around making sure S3 doesn't lose files, which is part of why we use it because they're going to think about that problem much more than we can. And we basically trust that if we put something in S3, it's probably going to be fine there. The biggest thing we're worried about is making sure we put the right files into S3 in the first place.
Corey: And that's always the other problem, too, which is a—if you'll pardon the term—library problem. You have all this data living in various storage systems. That's great. But how do you find it? It feels like it's that scene from Indiana Jones and—one of those movies, I don't know what it was—Indiana Jones and the Impossible Cloud Service, where they have a warehouse scene at the end where everything for miles is just this enormous warehouse and everything's in crates. How do you find it again? Which system did that live in? That always seems to become the big challenge. And we see it with everything, be it Lambda functions, DynamoDB tables—still a great calculator—and other things. Which account was that in? Which region was it in? Expanding that beyond that to data storage feels like, unless you're very intentional from the beginning, you're never going to find that again.
Alex: Yeah, so one of the things we did that we made quite a conscious decision to do early on, was we tie everything back to an identifier in another system. So, all of our files will be associated with at least one other library record. So, they might be associated with a record in our book catalog, they might be something in the painting records, it might be something in the archive catalog. And then that's the identifier we use just to hold the thing in the storage service.
So, you can look at a thing in the storage service and say, “Ah. This has the identifier B1234. I know that's a book number, I can go and look in the library catalog and find the book that it's associated with,” and vice versa. So, essentially, we're pushing out the problem of organizing back to the librarians and the archivists because they have very strong opinions and rules about that sort of thing, and it's much easier just to let them handle it in one place than to try and replicate all that logic again in a second place.
Corey: Do you think that there is a common business tie-in as far as—a library looking to store things on that kind of timeline seems like it's a very specific use case and problem space that any random for-profit company is going to take one look at and say, “Oh, that's not really our area. We don't know what next quarter is going to hold, let alone the far distant future.” Do you think that's a mistake? Do you think that there are lessons that can be learned here that map to everyone, and where do those live?
Corey: Part of me wonders on some level though, that when I'm building a company that doesn't necessarily know whether it's going to exist in a week, it feels like that is such an early optimization. Like, the things that I worry about, even in the course of my business—which is fixing AWS bills—is, “What if Amazon dries up and blows away?” Is fundamentally very core to my business as far as disruptive things that might happen, but if that happens, everyone is going to be having challenges. It's going to be a brave new world, and building out what I've done in a multi-cloud-style environment or able to be pivoted easily to other providers just hasn't been on the roadmap as a strategic priority. Maybe that's naive, but I honestly don't know at this point.
Alex: No, and we’re still—you know, in the grand scheme of human history, we're still very early in these things. And yeah, maybe AWS will go away next week. I certainly hope not, but when we were doing this, we didn’t—obviously, we thought about this a lot more than a lot of people would because we really are expecting to optimize for that very long use case, so what I'm suggesting is not that you prepare yourself to pivot multi-cloud, that you prepare to be able to run workloads anywhere, you be able to shift your workloads around dynamically, but it's just taking a little look at your decision saying, “Is this decision going to lock me in, in a really aggravating way? And is there just a slightly simpler way that I can do this that is going to be much easier to unpick from later?”
One of the big ones for us was, for a long time, we were looking at using UUIDs to store everything because UUIDs are brilliant. You never have to worry about uniqueness or versioning. It's just handled for you, but then we thought about what it would take to unpick those UUIDs later and work out what they meant, and we realized, “Well, alternatively, there's a great identifier over here sitting in the library catalog. Why don't we just use that instead, and throw away all these UUIDs?” And that wasn't a huge amount of work, right? It was just a case of deciding which of these two strings do we put into the database. But I think long term, that's going to make a massive difference to how portable the system ends up being.
Corey: That's one more topic I wanted to get into before we call it a show, that ties together the two things that you've been doing, maybe. Namely, how did you get into looking at systems like DynamoDB and seeing a calculator, and possibly the archival stuff, too, in such a weird and unusual way? It's not common, and it is far too rare of a skill. How did you get like this is the question I guess I'm trying to ask, but without the potentially insulting overtones?
Alex: No insult taken. So, I think like a lot of people, my first community on the internet was primarily fan-ish. I grew up on the internet, reading fan fiction, and for people of a certain age they will remember sites like fanfiction.net, Wattpad, and the big one at the time was LiveJournal. And huge fan history discussions were conducted on LiveJournal.
And I got to know a few people there, and a friend of mine was friends with the head of LiveJournal’s trust and safety. And if you've never come across it, trust and safety is this fascinating role where you have to look at every aspect of a system and think about how terrible people will misuse it to hurt people. And we're talking about things like stalkers, like abusive exes, like that coworker who doesn't know what boundaries are, and you've got to work out how, for example, a social media site is going to be completely ripped to shreds by these people and used to hurt users. Because if you're not doing that in the design phase, those people will do that work for you when you deploy to production, and then people get hurt, and then you're extremely sad.
And so that was the thing I was thinking about very early on on the internet, was I was talking to these people, I was hearing their stories, I was hearing how they design their services to prevent this sort of abuse, and I got into this mindset of looking at the system and trying to think, “Well, okay, if I wanted to do something evil with this system, how would I do it?” And in turn, when I'm building systems, I'm now thinking, “What would somebody evil do with this, and how can I stop them doing it?”
Corey: It almost feels like an InfoSec-style skill set.
Alex: Yeah, there's definitely a lot of overlap there, and a lot of the people who end up doing that sort of trust and safety work are also in the InfoSec space.
Corey: I think there's a lot of wisdom buried in there, and I think that, frankly, we've all learned a lot today. If nothing else, how to think longer-term about calculator design. Thank you so much for taking the time to speak with me. If people want to hear more about what you have to say, where can they find you?
Alex: I'm on Twitter as @alexwlchan, that’s W-L-C-H-A-N, and I blog about brilliant ideas in calculator design at alexwlchan.net.
Corey: Excellent. And we will throw links to that in the show notes. Alex Chan, senior software developer, and code terrorist. I am Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on Apple Podcasts, whereas if you've hated this podcast, please leave a five-star review on Apple Podcasts and a comment telling me exactly why I'm wrong about Texas Instruments.
Announcer: This has been this week’s episode of Screaming in the Cloud. You can also find more Corey at ScreamingintheCloud.com, or wherever fine snark is sold.
This has been a HumblePod production. Stay humble.