Join Corey and Jaana as they talk about Spanner and all things database, why Jaana believes that five nines is extreme for most businesses, the CAP theorem and what it actually means, the difference between Google’s internal Spanner product and the Cloud Spanner product you can buy with someone else’s credit card, how Google designs all of its major releases with scalability in mind, the role Jaana played in the Go community, what Jaana loves about working at Google, Jaana’s career advice, and more.
Jaana Dogan is working on Spanner at Google to make state not your problem. She has 15+ years of experience in building infrastructure, developer platforms, and tools. Jaana's current work is focused on storage systems, observability and performance tools, and helping customers with architectural design tradeoffs.
- Recommended book: https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
- Twitter: https://twitter.com/rakyll
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Cloud Economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: It's, at least as of this recording, morning on the West coast, which means there's no better time than to inflict a homework assignment upon you in the form of a 42 page ebook from StackRox. Learn about the dancing flames of EKS cluster security, evade the toxic dumpster of the standard controls, and tame the wild beast of best practices for minimizing the risk around cluster workloads. Become renowned for your feats of daring, as you learn the specific requirements for securing an EKS cluster and its associated infrastructure. To learn more, visit snark.cloud/stackrox. That's snark.cloud/stackrox.
Corey: This episode is brought to you by Trend Micro Cloud One, a security services platform for organizations building in the cloud. I know you're thinking that that's a mouthful because it is, but what's easier to say? "I'm glad we have Trend Micro Cloud One, a security services platform for organizations building in the cloud" or "Hey, bad news. It's gonna be a few more weeks. I kind of forgot about that security thing." I thought so.
Trend Micro Cloud One is an automated, flexible, all-in-one solution that protects your workflows and containers with cloud native security. Identify and resolve security issues earlier in the pipeline and access your cloud environment sooner, with full visibility, so you can get back to what you do best, which is generally building great applications.
Discover Trend Micro Cloud One, a security services platform for organizations building in the cloud. Whew. At trendmicro.com/screaming.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined this week by Jaana Dogan, staff engineer at a small company called Google. Jaana, welcome to the show.
Jaana: Hi, how are you?
Corey: I am very well, and I'm better, now that I get to talk to you. One of the—I guess, not one of—the best database in the world is DNS, and that is a hill I will die on. Almost as impressive is a product that you work on, namely Spanner. What is Spanner? And why would someone care about it?
Jaana: Spanner is our relational, transactional, and globally scalable database. So, historically—or even today—it's just actually really hard to make transactional relational databases scale. Google actually has a humble background in databases as well. Lots of people think about Google as this large company that only cares about large scale problems, but in the beginning, it started very small. There's this very typical story around our MySQL usage, specifically, AdWords—our ads business—had been heavily dependent on MySQL, and they got to this point that there were, like, 90 shards of MySQL instances. And they’d been dealing with [library] sharding things, it was causing outages, and so on.
And around this time, people decided to maybe take another look at the storage in general, and they figured out, we definitely need something transactional because, you know, we were doing a lot of money-transaction-related things. Consistency is really important for us because—you know, you want to be consistent, especially if it's about money. And they needed relational capabilities because there are a lot of relational problems they had. So, Spanner came as a result of these problems, but it didn't appear in a day or two. It took them, like, six years of experiments to figure out the right thing. And it's been largely in use in a lot of systems at Google. And one of the things that I really like about it is, it does a lot of work on your behalf. It gives some, sort of, promises, and you as a user don't have to think about those problems that much. We can talk a little bit about maybe some of the higher level promises it makes.
Corey: One of the interesting things that came out of the original Spanner paper was that in the world of databases, there is the idea of the CAP theorem, where you have either consistency, availability, or partitioning. It's one of those good, fast, cheap; you can only ever have two. What made Spanner so groundbreaking was that, yeah, we've decided that we can actually cheat and hit all three of those things, which normally one laughs and makes fun of and then goes back to doing serious work, but this wasn't Twitter for Pets announcing this. This was Google; you folks generally tend to hire smart people who are right about these sorts of things. So, that was definitely eye-opening. I guess first off, how is that possible? And how does it do it?
Jaana: Firstly, maybe I should explain to you my mental model about the CAP theorem. Because according to Eric Brewer, this is a way that you think about these problems. You think about, like, compromises in distributed systems, and there are three things that you care about at the very extreme cases: consistency, availability, or network partitioning. You just pick two of these, right? But according to him, when you're getting closer to 100, things are changing so much. You can’t do all of those compromises.
When Spanner was launched, they came up with this idea that we are almost 100 for all of these, but not 100. What Eric Brewer was saying was that the CAP theorem is really great if you're very close to 100, but it depends on how close they are. For example, Spanner says, we have five nines of availability. Is that close enough to 100%? Or is it, like, you know… eight nines that we should consider to get to what Eric Brewer is talking about? So, I think the controversy or the discussion around whether Spanner is actually breaking the CAP theorem or not, is what does the CAP theorem actually mean, or what was Eric Brewer trying to achieve by talking about those extreme cases?
Corey: It also feels like this is much more of a, you must be at least this far along the maturity curve before you begin to worry about these sorts of things. I know, for example, when I build databases, what compromises do I make on a CAP theorem? I hit none of those three points because everything I build is fundamentally awful. At some point, you start caring about this sort of thing only after you've gotten to a point of your site doesn't fall over on its own every 20 minutes.
Jaana: Absolutely. And lots of people don't actually need five nines, right? Lots of large businesses, just, are on three nines. And as a random business, you probably don't need that much availability as well. Like, three nines is still at a level that cloud providers are trying to achieve. So, five nines or beyond is just really extreme.
So, I think what Eric Brewer was trying to do with the CAP theorem mental model was to give people a way to think about these extreme problems and the extreme sacrifices you have to make. So, maybe in practice, it doesn't necessarily fit well because we never can achieve—or there's no real reason for us to achieve that level of extreme availability, or consistency, or network partitioning problems.
Corey: Feel free to opt out of answering this question, but Spanner was always one of those, “We're doing something at Google that we can't generally talk about,” things, similar to Borg. But then one day, there was an announcement out of Google Cloud—a division I tend to spend a fair bit of time tracking—announcing that you were releasing something called Google Cloud Spanner. Which, ooh, Spanner, that word sounds vaguely familiar. What is the relationship between Spanner as we've been talking about it, and Cloud Spanner that is something that I can go out and buy with somebody else's credit card?
Jaana: So, I think one of the differences—a lot of people keep asking me this question; is it like the case where we open-source Kubernetes, which looks like an equivalent of Borg but it's actually, like, completely different systems. What is the relationship with the Cloud product and the internal product? What Cloud Spanner does is it packs the internal solution we have and deploys it to the user's nodes. In order for us to have isolation, we have to make sure that we are not sharing the same deployment environment. So, there are a lot of cases that Cloud Spanner is completely behaving similarly, but it's running on our cloud stack. So, networking-wise, maybe, it might be going through some different hops. But it also is trying to achieve something similar, which is dedicated networking, similar operational model, and so on. So, they are way more similar than what people think.
Corey: Of course, there is always the difference between the fact that I can buy Cloud Spanner, whereas if I want to buy actual Spanner, I probably have to acquire Google, which is currently not on the roadmap for at least the next 18 months.
Jaana: Yeah, there are cheaper options, probably. [laughs].
Corey: One or two. It's interesting because whenever I've worked with various environments where I was running ops teams or, heaven forbid, being an operations-engineer-type, the database was always the root of all problems in the sense of, okay, we're doing disaster contingency planning; we want to have multiple availability zones so we could wind up taking up rack loss, or building loss, and then expanding beyond that into going multi-region. Oh, now you have a problem because, sure, you can read from databases from anywhere, but when you start having something that has a lot of writes, and you want those writes to be consistent, now you're having to make a whole bunch of determinations that all come down to something you talk about in your bio, which I will quote from now: you spent a lot of your time, “Helping customers with architectural design trade-offs,” and everything that I've ever seen around databases—and most other things as well—is built from trade-offs. So, how does that inform how you see the world?
Jaana: A while ago, a coworker of mine said this very useful quote that I can completely relate to. He said, “Any useful system has some state.” And in any architecture, when I was working with any customer at large, I realized that there is no way that you can ignore data problems. Even in systems where data is not the intensive work, a lot of things are designed around data problems.
I realized that over the last couple of years, I see myself recommending Martin Kleppmann’s book on data-intensive applications for people who are asking for an architectural catalog of problems. I just realized that there's such a big overlap in terms of hard system problems, as well as data problems. And one of the biggest problems with databases is databases don't keep their promises. There are edge cases—the way they implement some of the features has a lot of edge cases, and they aren't necessarily transparent about what's out there. And there are so many choices; if you think about the whole spectrum of databases, we have relational databases, and then we have NoSQL, we have key-value stuff, we have different storage engines, we have different persistence options. And then you have niche databases where you have document DBs or graph DBs, whatever. Basically you have to know a lot about your problem, and a lot about databases in order to get things right the first time.
So, I've seen that if I can go and explain to people the overall trade-offs, and give them some guidance about data, it really reflects on their progress on their overall system design. Because data keeps being always the bottleneck. I'm really surprised that we're talking a lot about, like, this infinite scalability when it comes to Kubernetes, or Lambda, or whatever. But at the end of the day, your biggest bottleneck, you're going to hit that bottleneck with your data system. And the way you handle data, from your modeling to the way that you operate your database, is just really impacting the whole design of the system.
Corey: For me, one of the reasons I always stayed away from databases, to be perfectly honest, is that if I screw up a web server, well, that's funny: everyone gets to point and laugh, we’ll redeploy it, and things are back to normal. If you screw up a database, in some cases you don't have a company anymore. And I am whatever the digital version of accident-prone is. So, first, this taught me to do very good backups and, second, it taught me to hire professionals to wind up handling anything with persistence, by and large, which has led to some very interesting beliefs and structures in my world around, for example, DNS as a database. What do you find that—from a customer perspective—the biggest misconceptions are that require architectural trade-offs?
Jaana: I think the biggest problem is—especially with Cloud—they believe that, like, resources are infinite. And it's easy to auto-scale. Some of the customers are coming from this really dynamic workload type of environments, and they believe that over on Cloud, we have no capacity issues, plus we can just auto-scale and we can dynamically resize our pool. And I think most of our compute products are sort of making this a bigger issue because we made auto-scaling too easy, without necessarily considering what it means to the overall limitations of the design. So, I see a lot of people coming from that and realizing that that's not the case. And then they start to see everything more holistically, maybe they realize that they need to start understanding the limitations at the database layer.
And that's also a very complicated problem because the things that they are looking at, like latency, and throughput—and these are very superficial numbers to take a look at—they still have to realize they have to still identify large specific operations and how they're going to work against their database, particular loads, and so on. They're kind of getting lost because the spectrum is really wide in terms of what to measure, and the existing studies around standard benchmarking or standardized stuff just don't really help their particular use cases. So, they have to do a lot of prototyping, they have to evaluate a lot of things before they are somewhat happy about their overall initial design. And this is if you're building things from scratch. If you're migrating over, it's just getting much harder.
Corey: In some ways, it feels like working at Google puts you in a position where there's something that the rest of us have to struggle with, but Google doesn't. Specifically, whenever I build something, it probably doesn't need to scale until suddenly, it's absolutely going to have to scale because it turns out, I built something that resonates with people. That doesn't seem like it's a Google problem because if you slap the word Google on a product that gets launched, on day one you'll have millions of people using it, so anything you build has to scale. Therefore, it removes the entire debate of do we build this right or do we just slap something up there and go back and refactor it later, I would think. Am I right or am I wrong?
Jaana: It's true that we design for scale because we expect, let's say, this number of millions of users on the first day. This is mainly true for large products that we are going to release. So, Google is a very large company, we have, like, all these different systems that don't do any consumer market things, so there's actually a variety of different scales. But for consumer market problems, yes, it's true. We have this large XX million expectation on the first day, that's why we specifically pick this type of trade-offs, pick this type of solutions, and everything is more in an [unintelligible] way, but there's a large spectrum of other problems inside Google that don't necessarily need that type of scale. And internally, for example, we have a lot of database solutions, a lot of general storage solutions, and there's a huge—also a decision chart internally for which one you have to pick. And it really depends on the type of problem you have. So, even at Google, it's true that—product teams especially—are more biased towards very large scale. There are a lot of small scale problems, too.
Corey: In what you might be forgiven for mistaking for a blast from the past, today, I want to talk about New Relic. They seem to be a relatively legacy monitoring company, and I would have agreed with that assessment up until relatively recently, but they did something a little out there: they reworked everything. They went open source, they made it so you can monitor your whole stack in one place. And most notably from my perspective, they simplified their pricing into something that is much more affordable for almost everyone. There's even a free tier with one user and a hundred gigs per month, totally free.
Check it out at newrelic.com.
Corey: In many cases, what's right for Google is absolutely not going to be right for other people, but at a certain point of scale, certain things change. And if you take a look at all of the big tech companies out there, they've all built their own programming languages. For example, Microsoft has a whole bunch of them: .NET, ASP.NET, C# et cetera; Facebook came out with Hack, their custom PHP thing; Apple came out with Swift; Amazon came out with CloudFormation; and Google came out with Go, which is something you were deeply involved in before a relatively recent shift over to work on Spanner. What did you do for Go, and what made you decide it was time to go stare at databases instead?
Jaana: I was working on Go after the 1.0 release. So, I started, I think, around 2012. The funny story is, when Go was released, I was not working at Google, and I was working in telecoms. We were working on message parsing systems. These are highly concurrent systems, so we were just basically looking after what else is coming—especially in the languages and runtime space—to make our jobs much easier.
And I was looking at Go around that time. And no, I didn't truly understand the type system or anything, and I felt like this could be something that I can consider in the long term, maybe, but I don't really feel like this language is really the best choice for my personal things that I like in a language and so on. So, I just really didn't do much work. But after I joined, I was—by chance—sitting right next to the compiler team in the Zurich office in Switzerland. And I was just, kind of like, you know, in the conversations because they were all language enthusiasts around me.
And I started just taking a look at things, and at that time, I was working on Google Drive, so we had a bunch of migration projects with lots of networking and I/O, and I started just kind of writing small things, and trying out things, and as a result of that, I started to publish some of my open-source tools, and so on, and realized that the community is just really amazing in Go. And I realized in a couple of years that maybe I should do something as a part of my full-time job. And I joined the Go team to work on, generally, our external API, client libraries, some tooling around them, gRPC—gRPC was just coming around at that time, so we did a lot of work on gRPC as well. We had some sort of project to unify our Stubby internal API stuff with gRPC. So, I did some Go specific things. I did all these reviews for all these cloud products who wanted to support Go. I actually initiated one of the earlier projects for our cloud to support Go as a first-class language.
So, back then nobody was interested in Go. This was back in 2012. Go was still kind of like a smaller language and a community. So, I initiated, like you know, a bunch of small things, and they got funded, luckily. Now there are teams actually working on those things that I initiated as a 20 percent, finally. That's how I started my journey with the Go team. I was just handling more of the cloud-support-related things, and then my interest, sort of, switched. There was a project that was trying to make the Go runtime work on Android and iOS. I briefly worked on it, also contributed some of the tooling.
And recently, before my switch, I was really interested in instrumentation, and performance, and debugging tools, and that sort of thing—and there was a small subset in the Go team that was handling a lot of performance-related stuff. I'm not sure if you're familiar, but Go has pprof support, and we have an execution tracer. We were thinking about maybe establishing some sort of primitives for distributed tracing, we were thinking about some metrics APIs. There were a bunch of small things, so we were trying to see what is the overall larger picture; what else we can do.
And so I worked on that team for a while and then left that team to work on instrumentation at Google at large because I realized that a lot of things that I was trying to do in Go were actually larger than just Go problems. So, maybe I should just go and work on the instrumentation team to get some more exposure to those problems, and then I can go back to Go and apply them. But then I ended up being on the Spanner team because, you know, the instrumentation team at Google was sponsored by the storage system, so I was really involved in a lot of storage problems as well as networking problems. That was a really gradual switch from Go to other things, but it's funny, sometimes you have to do what you can do, and what's important, and the highest-priority thing.
And I like to be able to switch back and forth at Google. We have this very loose way of collaboration, and so, I mean, we don't have to necessarily go through interviews or anything, you can just switch projects and contribute, and sometimes overlapping different skills are very useful for the project that you are going in. Like, you're bringing a completely different background. On the Spanner team, for example, I have some experience from before coming to Google. I actually left my previous company because of the database problems that we had. So, I had a lot of experience migrating us to different systems, designing, and evaluating databases. That was my previous job, and at some point—I don't want to name a database—but we were losing data, and it turned out to be a very fundamental issue at a database I don't want to name, but we spent weeks, and I spent—
Corey: Yeah, I will fight you if the answer to that database that you don't want to name is DNS.
Jaana: [laughs]. It is not DNS, thankfully. DNS is a better database than that database, I'm pretty sure.
Corey: [laughs]. The funny thing is I honestly don't know the answer, but I can think of at least five that fit that profile.
Jaana: Yeah, yeah. Probably you can tell, if you have maybe a shortlist of two, you will be able to tell which one it is. I don't want to tell it; I don't think that I want to be in that fight. [laughs]. The thing is, you know, I just gradually ended up leaving my previous company just because I was so tired of the storage problems. And I joined Google because they gave me an offer, first, and the second thing is I can actually learn about storage, maybe, at this company because they seem to know what they were doing.
Corey: Yeah, I mean, again, there's a lot of criticisms that you can lob at a whole bunch of different companies. Google is right in that list too, and I do have a counted list that I don't have the time on this episode to read and blame you personally for all of them, but one thing that Google has always gotten very right has been fantastic technical solutions to incredibly hard problems at scale. It's easy to bag on companies, but there's a lot of hard work that goes into making these global world-spanning systems, and I think that that's something that often gets forgotten. I mean, there was a time where Google was lightyears ahead of absolutely everyone else. And now it seems that, oh, well, what do they do? They built this world-spanning thing that's super fast no matter where you are on the planet. Yeah, here's my five lines of YAML and 20 cents a month, and I could do something like that, too. We all stand on the shoulders of giants, and it's easy to forget that Google's 25 years old; they've built a lot of these things that have been transformative to the entire industry. But now it's, well what have they done for me lately?
Jaana: I agree with this, but I feel like we still need to do a better job in terms of understanding what it means to scale to small, right? We have these aspirational experiments, maybe, we're running on behalf of other people because we had some unique problems in the past and we built these systems that work for us. Some of those experiments could be aspirational for other things, rather than completely translating—the interesting thing, before joining the Spanner team, I had this concern: is this like—we don't want to be this shiny new thing that nobody cares about that only works at Google scale. A lot of people on the team have been telling me they had the same concern. They didn't want to actually release Cloud Spanner.
The more they were talking to customers, largely about database systems, the more customers were asking them to release Cloud Spanner because they wanted a solution. They didn't want to deal with consistency issues the way they used to do. Like, in traditional databases, especially relational ones, it's so hard to scale writes, for example. This is such a fundamental problem, right? You can’t scale your writes, but you want to launch this large game and you want to focus on your game, you don't want to deal with your database layer.
So, customers had been pushing so hard on the team: if you have a solution to this, you should share it as a product. And that's how Cloud Spanner actually came around. I was very impressed by the fact that the Spanner team is thinking so much about the customers. This is something that I have to tell. At Google—maybe this is the only team that I've been working at—at every meeting, I think, we keep talking about customers and customer issues all the time. And like, that is such an incredible thing, how much we actually care about customers in prioritizing what they want.
Corey: So, before you were on Spanner, you worked on Go, and you mentioned that you did telco work before that. What is the story? Generally speaking, Google's staff engineers do not spring fully formed from the forehead of some god. What was the journey that got you to where you are in your career?
Jaana: I started actually—before telecoms, I started at a small company based in London—it’s headquartered in London, called Multimap. They were an online mapping company that, sort of like, was Google Maps before Google Maps was Google Maps. And they’d been acquired by Microsoft. So, my first experience in life was actually working at a company that was acquired by Microsoft. And it was very interesting for me because I've seen two phases of the same thing, right?
Like, you have this small company that cares a lot about their business problems in a smaller scale, as well as there's this giant coming in and trying to see what kind of differentiated value that acquisition can bring. And they have a completely different scale of systems, and different trade-offs, and so on. So, that was like a really eye-opening thing. And just going through a lot of discussions about how things work, or why things don't work in the new scale, and so on, really helped me a lot. After that, I started to work mainly in telecoms.
And, as I said, I was working on this highly concurrent message parsing rule engine type of systems. They're actually very boring, but they really helped me a lot to just prototype, and evaluate, and understand the overall system problems. What are some of the limitations that I should take a look at? What are some of the failure modes that I should care about? And it was in a very, actually, fast-paced environment that I was able to prototype, build, change significant things, push things to production, see some outage, iterate.
So, I've seen a lot of interesting things, being able to touch a lot of different things. We were also trying to modernize parts of our stack, so there was a lot of work in terms of going and discovering new things, and, like, stress testing a lot of new tools, or a lot of new libraries, or language runtimes, and so on. As I said, I was looking at Go as a part of that work. So, that really helped me to see everything in a broader sense. And I really like the fact that I've worked so much more outside of Google than years that I've spent at Google because I've seen problems of different scale, and I've seen different levels of flexibility when it comes to introducing new technology, and I've also seen a lot of different types of organization with different types of problems in terms of scale and the organizational issues.
That really is helping me right now to help our customers because, I mean, I can relate to a lot of things the large majority of the customers are going through. After telecoms, by the way, before coming to Google, I worked for two small companies. They were trying to bootstrap; they were just at the initial design phase for lots of things, but they also had some sort of established business going on. So, I had the chance to see both sides of things in more of a playground area where we can go and try out new stuff, as well as a lot of established problems, and legacy issues, and scale problems, and large organizational problems that we have to tackle.
Corey: It's always interesting to me to see how people come to where they are from various different places. So, my last question before we wind up calling it a show is what advice would you give to people who are looking to go from where they are in their careers to a job that looks a little bit more like yours?
Jaana: My biggest concern when I was earlier in my career was, I was like thinking, I am wasting too much time. I was feeling like I am wasting too much time all the time; I'm working on these problems that actually don't make any sense in the larger scheme of things, and so on, and I was very frustrated. I was feeling very demotivated. And if I was able to go back to that person and tell her something, I would say, “Just don't worry,” because at some point, all of those little experiences really just get you to a point that you can overlap different experiences, and bring some different perspective.
What I realized over time is that—especially on the Go team; there were a lot of very senior engineers—but in some cases I realized that my particular background in all of this weird stuff gave me a huge, very niche thing, but at the same time, a very general perspective that I can apply anywhere on some of the topics. And in some cases, I was the only one in the room that actually had any experience across all of it, and was able to say something when we were thinking about design, or some of the goals, or some of the trade-offs. So, I would say people shouldn't worry too much, and especially if they think that they are learning, there is nothing tedious about learning, trying out.
Going after the new shiny thing—most people think going after the new shiny solutions is the only way to kind of have any job security. I also disagree with that. Just work on the tedious things. Most of the problems in the world are very tedious, so become the person who can recognize and identify the tedious parts, and the common patterns of problems. Tinker through them, and that's going to contribute to your career or your growth more than just going after every shiny new thing.
Corey: Thank you so much for taking the time to speak with me today about basically all of this. If people want to hear more about what you have to say, where can they find you?
Jaana: I have a Twitter account. I usually am trying to be very public about what I am working on, which helps me to hear other voices. So, you can find me on Twitter. And I try to write a lot. Nowadays, I'm not writing that much, but I have two blogs. So, I'll give you the links. You can probably link them.
Corey: Excellent, and I will put links to those in the show notes. Jaana Dogan, staff engineer at Google. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on Apple Podcasts. Whereas if you hated this podcast, please leave a five-star review on Apple Podcasts, and then leave a comment telling me what I got wrong, written in Go.
Announcer: This has been this week’s episode of Screaming in the Cloud. You can also find more Corey at ScreamingintheCloud.com, or wherever fine snark is sold.
This has been a HumblePod production. Stay humble.