Episode Show Notes & Transcript
- Company Website: https://www.stanza.systems/
- Twitter: https://twitter.com/lauralifts
- LinkedIn: https://www.linkedin.com/in/laura-nolan-bb7429/
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. My guest today is someone that I have been low-key annoying to come onto this show for years, and finally, I have managed to wear her down. Laura Nolan is a Principal Software Engineer over at Stanza. At least that’s what you’re up to today, last I’ve heard. Is that right?
Laura: That is correct. I’m working at Stanza, and I don’t want to go on and on about my startup, but I’m working with Niall Murphy and Joseph Bironas and Matthew Girard and a bunch of other people who more recently joined us. We are trying to build a load management SaaS service. So, we’re interested in service observability out of the box, knowing if your critical user journeys are good or bad out of the box, being able to prioritize your incoming requests by what’s most critical in terms of visibility to your customers. So, an emerging space. Not in the Gartner Group Magic Circle yet, but I’m sure at some point [laugh].
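The idea of prioritizing incoming requests by how critical they are to customers can be sketched in a few lines. This is a minimal illustration only, not Stanza's product or API: the `Request` class, the `admit` function, the numeric priority scheme, and the example request names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str
    priority: int  # hypothetical scheme: 0 = most critical, larger = less critical


def admit(requests, capacity):
    """Admit up to `capacity` requests, most critical first, and shed the rest."""
    # sorted() is stable, so requests with equal priority keep arrival order.
    ranked = sorted(requests, key=lambda r: r.priority)
    return ranked[:capacity], ranked[capacity:]


# Illustrative traffic: four requests arrive while only two can be served.
incoming = [
    Request("recommendations", 2),
    Request("checkout", 0),
    Request("search", 1),
    Request("analytics-beacon", 3),
]
admitted, shed = admit(incoming, capacity=2)
print("admitted:", [r.name for r in admitted])
print("shed:", [r.name for r in shed])
```

Under overload, the customer-visible work (checkout, search) is served and the background work is dropped first, which is the general shape of the prioritization described above.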
Corey: It is surreal to me to hear you talk about your day job because for, it feels like, the better part of a decade now, “Laura, Laura… oh, you mean USENIX Laura?” Because you are on the USENIX board of directors, and in my mind, that is what is always short-handed to what you do. It’s, “Oh, right. I guess that isn’t your actual full-time job.” It’s weird. It’s almost like seeing your teacher outside of the elementary school. You just figure that they fold themselves up in the closet there when you’re not paying attention. I don’t know what you do when SREcon is not in process. I assume you just sit there and wait for the next one, right?
Laura: Well, no. We’ve run four of them in the last year, so there hasn’t been very much waiting, I’m afraid. Everything got a little bit smooshed up together during the pandemic, so we’ve had a lot of events coming quite close together. But no, I do have a full-time day job. But the work I do with USENIX is just as a volunteer. So, I’m on the board of directors, as you say, and I’m on the steering committee for all of the global SREcon events, and typically I often serve on the program committee as well. And I’m sort of there, annoying the chairs to, “Hey, do your thing on time,” very much like an elementary school teacher, as you say.
Corey: I’ve been a big fan of USENIX for a while. One of the best interview processes I ever saw was closely aligned with evaluating candidates along with USENIX SAGE levels to figure out what level of seniority are they in different areas. And it was always viewed through the lens of in what types of consulting engagements will the candidate shine within, not the idea of, “Oh, are you good or are you crap? And spoiler, if I’m asking the question, I’m of course defaulting to grading you as crap.” Like the terrible bespoke artisanal job interview process that so many companies do. I love how this company had built this out, and I asked them about it, and, “Oh, yeah, it comes—that dates back to the USENIX SAGE things.” That was one of my first encounters with what USENIX actually did. And the more I learned, the more I liked. How long have you been involved with the group?
Laura: A relatively short period of time. I think I first got involved with USENIX in around 2015, going to LISA and then going on to SREcon. And it was all by accident, of course. I fell onto the SREcon program committee somehow because I was around. And then because I was still around and doing stuff, I got eventually—you know, got co-opted into chairing and onto the steering committee and so forth.
And you know, it’s like everything volunteer. I mean, people who stick around and do stuff tend to be kept around. But USENIX is quite important to me. We have an open access policy, which is something that I would like to see a whole lot more of, you know, we put everything right out there for free as soon as it is ready. And we are constantly plagued by people saying, “Hey, where’s my SREcon video? The conference was like two weeks ago.” And we’re like, “No, no, we’re still processing the videos. We’ll be there; they’ll be there.”
We’ve had people, like, literally offer to pay extra money to get the videos sooner, but [laugh] we’re, like, we are open access. We are not keeping the videos away from you. We just aren’t ready yet. So, I love the open access policy and I think what I like about it more than anything else is the fact that it’s… we are staunchly non-vendor. We’re non-technology specific and non-vendor.
So, it’s not, like, say, AWS re:Invent for example or any of the big cloud vendor conferences. You know, we are picking vendor-neutral content by quality. And as well, as anyone who’s ever sponsored SREcon or any of the other events will also tell you that that does not get you a talk in the conference program. So, the content selection is completely independent, and in fact, we have a complete Chinese wall between the sponsorship organization and the content organization. So, I mean, I really like how we’ve done that.
I think, as well, it’s for a long time been one of the families of conferences, among the organizations that run conferences, that has had the best diversity. Not perfect, but certainly better than it was, although very, very unfortunately, I see conference diversity everywhere going down after the pandemic—particularly gender diversity—which is a real shame.
Corey: I’ve been a fan of the SREcon conferences for a while before someone—presumably you; I’m not sure—screwed up before the pandemic and apparently thought they were talking about someone else, and I was invited to give a keynote at SREcon in EMEA that I co-presented with John Looney. Which was fun because he and I met in person for the first time three hours beforehand, beat together our talk, then showed up an hour beforehand, found there would be no confidence monitor, went away for the next 45 minutes and basically loaded it all into short-term cache and gave a talk that we could not repeat if we had to for a million dollars, just because it was so… you’re throwing the ball to your partner on stage and really hoping they’re going to be able to catch it. And it worked out. It was an anger subtext translator skit for a bit, which was fun. All the things that your manager says but actually means, you know, the fun sort of approach. It was zany, ideally had some useful takeaways to it.
But I loved the conference. That was one of the only SREcons that I found myself not surprised to discover was coming to town the next week because for whatever reason, there’s presumably a mailing list that I’m not on somewhere where I get blindsided by, “Oh, yeah, hey, didn’t you know SREcon is coming up?” There’s probably a notice somewhere that I really should be paying attention to, but on the plus side, I get to be delightfully surprised every time.
Laura: Indeed. And hopefully, you’ll be delightfully surprised in March 2024. I believe it’s the 18th to the 20th, when SREcon will be coming to town in San Francisco, where you live.
Corey: So historically, in addition to, you know, the work with USENIX, which is, again, not your primary occupation most days, you spent over five years at Google, which of course means that you have strong opinions on SRE. I know that that is a bit dated, where the gag was always, it’s only called SRE if it comes from the Mountain View region of California, otherwise it’s just sparkling DevOps. But for the initial take of a lot of the SRE stuff was, “Here’s how to work at Google.” It has progressed significantly beyond that to the point where companies who have SRE groups are no longer perceived incorrectly as, “Oh, we just want to be like Google,” or, “We hired a bunch of former Google people.”
But you clearly have opinions to this. You’ve contributed to multiple books on SRE, you have spoken on it at length. You have enabled others to speak on it at length, which in many ways, is by far the better contribution. You can only go so far scaling yourself, but scaling other people, that has a much better multiplier on it, which feels almost like something an SRE might observe.
Laura: It is indeed something an SRE might observe. And also, you know, good catch because I really felt you were implying there that you didn’t like my book contributions. Oh, the shock.
Corey: No. And to be clear, I meant [unintelligible 00:08:13], strictly to speaking.
Corey: Books are also a great one-to-many multiplier because it turns out, you can only shove so many people into a conference hall, but books have this ability to just carry your words beyond the room that you’re in a way that video just doesn’t seem to.
Laura: Ah, but open access video that was published on YouTube, like, six weeks ahead [laugh]. That scales.
Corey: I wish. People say they want to write a book and I think they’re all lying. I think they want to have written the book. That’s my philosophy on it. I do not understand people who’ve written a book. Like, “So, what are you going to do now?” “I’m going to write another book.” “Okay.” I’m going to smile, not take my eyes off you for a second and back away slowly because I do not understand your philosophy on that. But you’ve worked on multiple books with people.
Laura: I actually enjoy writing. I enjoy the process of it because I always learn something when I write. In fact, I learn a lot of things when I write, and I enjoy that crafting. I will say I do not enjoy having written things because for me, any achievement once I have achieved it is completely dead. I will never think of it again, and I will think only of my excessively lengthy to-do list, so I clearly have problems here. But nevertheless. It’s exactly the same with programming projects, by the way. But back to SRE; we were talking about SRE. SRE is 20 now. SRE can almost drink alcohol in the US, and that is crazy.
Corey: So, 2003 was the founding of it, then.
Corey: Yay, I can do simple arithmetic in my head, still. I wondered how far my math skills had atrophied.
Laura: Yes. Good job. Yes, apparently invented in roughly 2003. So, the—I mean, from what I understand of Google’s publishing of the, “20 years of SRE at Google,” they have, in the absence of an actual definite start date, simply picked Ben Treynor’s start date at Google as the start date of SRE.
But nevertheless, [unintelligible 00:09:58] about 20 years old. So, is it all grown up? I mean, I think it’s become heavily commodified. My feeling about SRE is that it’s always been this—I mean, you said it earlier, like, it’s about, you know, how do I scale things? How do I optimize my systems? How do I intervene in systems to solve problems to make them better, to see where we’re going to be in pain in six months, and work to prevent that?
That’s kind of SRE work to me is, figure out where the problems are, figure out good ways to intervene and to improve. But there’s a lot of SRE as bureaucracy around at the moment where people are like, “Well, we’re an SRE team, so you know, you will have your SLO Golden Signals, and you will have your Production Readiness Checklists, which will be the things that we say, no matter how different your system is from what we designed this checklist for, and that’s it. We’re doing SRE now. It’s great.” So, I think we miss a lot there.
My personal way of doing SRE is very much more about thinking, not so much about the day-to-day SLO excursion-type things because—not that they’re not important; they are important, but they will always be there. I always tend to spend more time thinking about how do we avoid the risk of, you know, a giant production fire that will take you down for days, or God forbid, more than days, you know? The sort of, big Roblox fire or the time that Meta nearly took down the internet in late-2021, that kind of thing. So, I think that modern SRE misses quite a lot of that. It’s a little bit like… so when BP had the Deepwater Horizon disaster, on that very same day they received an award for minimizing occupational safety risks in their environment. So, you know, [unintelligible 00:11:41] things like people tripping and—
Corey: Must have been fun the next day. “Yeah, we’re going to need that back.”
Laura: [laugh] people tripping and falling, and you know, hitting themselves with a hammer, they got an award because it was so safe, they had very little of that. And then this thing goes boom.
Corey: And now they’ve tried to pivot into an optimization award for efficiency, like, we just decided to flash fry half the sea life in the Gulf at once.
Laura: Yes. Extremely efficient. So, you know, I worry that we’re doing SRE a little bit like BP. We’re doing it back before Deepwater Horizon.
Corey: I should disclose that I started my technical career as a grumpy old Unix sysadmin—because it’s not like you ever see one of those who’s happy or young; didn’t matter that I was 23 years old, I was grumpy and old—and I have viewed the evolution since then, going from calling myself a sysadmin to a DevOps engineer to an SRE to a platform engineer to whatever we’re calling it this week, as fundamentally the same job, in the sense that the responsibility has not changed, and that is keep the site or environment up. But the tools, the processes and the techniques we apply to it have evolved. Is that accurate? Does it sound like I’m spouting nonsense? You’re far closer to the SRE world than I ever was, but I’m curious to get your take on that perspective. And please feel free to tell me I’m wrong.
Laura: No, no. I think you’re completely right. And I think one of the ways that I think it’s shifted, and it’s really interesting, is that when you and I were young, we could see everything that was happening. We were deploying on some sort of Linux box or other sort of Unix box somewhere, most likely, and if we wanted, we could go and see the entire source code of everything that our software was running on. And kids these days, they’re coming up, and they are deploying their stuff on RDS and ECS and, you know, how many layers of abstraction are sitting between them and—
Corey: “I run Kubernetes. That means I don’t know where it runs, and neither does anyone else.” It’s great.
Laura: Yeah. So, there’s no transparency anymore in what’s happening. So, it’s very easy, you get to a point where sometimes you hit a problem, and you just can’t figure it out because you do not have a way to get into that system and see what’s happening. You know, even at work, we ran into a problem with Amazon-hosted Prometheus. We were like, “This will be great. We’ll just do that.” And we could not get some particular type of remote write operation to work. We just could not. Okay, so we’ll have to do something else.
So, one of the many, many things I do when I’m not, you know, trying to run the SREcon conference or do actual work or definitely not write a book, I’m studying at Lund University at the moment. I’m doing this master’s degree in human factors and system safety. And one of the things I’ve realized since doing that program is, in tech, we missed this whole 1980s and 1990s discipline of cognitive systems theory, cognitive systems engineering. This is what people were doing. They were like, how can people in the control room in nuclear plants and in the cockpit in the airplane, how can they get along with their systems and build a good mental model of the automation and understand what’s going on?
We missed all that. We came of age when safety science was asking questions like how can we stop organizational failures like Challenger and Columbia, where people are just not making the correct decisions? And that was a whole different sort of focus. So, we’ve missed all of this 1980s and 1990s cognitive system stuff. And there’s this really interesting idea there where you can build two types of systems: you can build a prosthesis which does all your interaction with a system for you, and you can see nothing, feel nothing, do nothing, it’s just this black box, or you can have an amplifier, which lets you do more stuff than you could do just by yourself, but lets you still get into the details.
And we build mostly prostheses. We do not build amplifiers. We’re hiding all the details; we’re building these very, very opaque abstractions. And I think it’s to the detriment of—I mean, it makes our life harder in a bunch of ways, but I think it also makes life really hard for systems engineers coming up because they just can’t get into the systems as easily anymore unless they’re running them themselves.
Corey: I have to confess that I have a certain aversion to aspects of SRE, and I’m feeling echoes of it around a lot of the human factor stuff that’s coming out of that Lund program. And I think I know what it is, and it’s not a problem with either of those things, but rather a problem with me. I have never been a good academic. I have an eighth grade education because school is not really for me. And what I loved about being a systems administrator for years was the fact that it was like solving puzzles every day.
I got to do interesting things, I got to chase down problems, and firefight all the time. And what SRE has represented is a step away from that to being more methodical, to taking on keeping the site up as a discipline rather than an occupation or a task that you’re working on. And I think that a lot of the human factors stuff plays directly into it. It feels like the field is becoming a lot more academic, which is a luxury we never had, when holy crap, the site is down, we’re going to go out of business if it isn’t back up immediately: panic mode.
Laura: I got to confess here, I have three master’s degrees. Three. I have problems, like I said before. I get what you mean. You don’t like when people are speaking in generalizations and sort of being all theoretical rather than looking at the actual messy details that we need to deal with to get things done, right? I know. I know what you mean, I feel it too.
And I’ve talked about the human factors stuff and theoretical stuff a fair bit at conferences, and what I always try to do is I always try and illustrate with the details. Because I think it’s very easy to get away from the actual problems and, you know, spend too much time in the models and in the theory. And I like to do both. I will confess, I like to do both. And that means that the luxury I miss out on is mostly sleep. But here we are.
Corey: I am curious as far as what you’ve seen as far as the human factors adoption in this space because every company for a while claimed to be focused on blameless postmortems. But then there would be issues that quickly turned into a blame Steve postmortem instead. And it really feels, at least from a certain point of view, that there was a time where it seemed to be gaining traction, but that may have been a zero interest rate phenomenon, as weird as that sounds. Do you think that the idea of human factors being tied to keeping systems running in a computer sense has demonstrated staying power or are you seeing a recession? It could be I’m just looking at headlines too much.
Laura: It’s a good question. There’s still a lot of people interested in it. There was a conference in Denver last February that was decently well attended for, you know, a first initial conference that was focusing on this issue, and there’s this very vibrant Slack community, the LFI, the Learning from Incidents in Software community. I will say, everything is a little bit stretched at the moment in industry, as you know, with all the layoffs, and a lot of people are just… there’s definitely a feeling that people want to hunker down and do the basics to make sure that they’re not seen as doing useless stuff and on the line for layoffs.
But the question is, is this stuff actually useful or not? I mean, I contend that it is. I contend that we can learn from failures, we can learn from what we’re doing day-to-day, and we can do things better. Sometimes you don’t need a lot of learning because what’s the biggest problem is obvious, right [laugh]? You know, in that case, yeah, your focus should just be on solving your big obvious problem, for sure.
Corey: If there was a hierarchy of needs here, on some level, okay, step one, is the building—
Corey: Currently on fire? Maybe solve that before thinking about the longer-term context of what this does to corporate culture.
Laura: Yes, absolutely. And I’ve gone into teams before where people are like, “Oh, well, you’re an SRE, so obviously, you wish to immediately introduce SLOs.” And I can look around and go, “Nope. Not the biggest problem right now. Actually, I can see a bunch of things are on fire. We should fix those specific things.”
I actually personally think that if you want to go in and start improving reliability in a system, the best thing to do is to start a weekly production meeting if the team doesn’t have that, actually create a dedicated space and time for everyone to be able to get together, discuss what’s been happening, discuss concerns and risks, and get all that stuff out in the open. I think that’s very useful, and you don’t need to spend however long it takes to formally sit down and start creating a bunch of SLOs. Because if you’re not dealing with a perfectly spherical web service where you can just use the Golden Signals and if you start getting into any sorts of thinking about data integrity, or backups, or any sorts of asynchronous processing, these sorts of things, they need SLOs that are a lot more interesting than your standard error rate and latency. Error rate and latency gets you so far, but it’s really just very cookie-cutter stuff. But people know what’s wrong with their systems, by and large. They may not know everything that’s wrong with their systems, but they’ll know the big things, for sure. Give them space to talk about it.
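The contrast between a cookie-cutter error-rate SLO and a more system-specific one can be made concrete with a small sketch. This is an illustration under stated assumptions, not a recommended standard: the function names, the 99.9% target, the 60-second freshness threshold, and all sample numbers are hypothetical.

```python
def error_budget_remaining(total, failed, slo_target):
    """Fraction of the error budget left for a request-based availability SLO.

    With a target of 0.999, one request in a thousand is allowed to fail.
    A negative result means the budget has been overspent.
    """
    allowed_failures = total * (1 - slo_target)
    if allowed_failures == 0:
        return 0.0 if failed else 1.0
    return 1 - failed / allowed_failures


def freshness_sli(ages_seconds, threshold_seconds):
    """Share of asynchronously processed items that finished within the threshold.

    This is the kind of indicator an async pipeline needs, where error rate
    and latency of a front-end request say little about the work behind it.
    """
    if not ages_seconds:
        return 1.0
    fresh = sum(1 for age in ages_seconds if age <= threshold_seconds)
    return fresh / len(ages_seconds)


# Hypothetical numbers: 10,000 requests with 5 failures against a 99.9% target.
print(error_budget_remaining(total=10_000, failed=5, slo_target=0.999))

# Hypothetical queue: four jobs, one of which took ten minutes to complete.
print(freshness_sli([30, 45, 600, 20], threshold_seconds=60))
```

The first function is the standard error-rate view. The second shows why asynchronous work needs its own indicator: a job that succeeds ten minutes late never appears as an error, yet it still misses what the user needed.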
Corey: Speaking of bigger things and turning into the idea of these things escaping beyond pure tech, you have been doing some rather interesting work in an area that I don’t see a whole lot of people that I talk to communicating about. Specifically, you’re volunteering for the Campaign to Stop Killer Robots, which ten years ago would have made you sound ridiculous, and now it makes you sound like someone who is very rationally and reasonably calling an alarm on something that is on our doorstep. What are you doing over there?
Laura: Well, I mean, let’s be real, it sounds ridiculous because it is ridiculous. I mean, who would let a computer fly around in the sky and choose what to shoot at? But it turns out that there are, in fact, a bunch of people who are building systems like that. So yeah, I’ve been volunteering with the campaign for about the last five years, since roughly around the time that I left Google, in fact, because I got interested in that around about the time that Google was doing the Project Maven work, which was when Google said, “Hey, wouldn’t it be super cool if we took all of this DoD drone video footage, and, you know, did a whole bunch of machine-learning analysis on it and figured out where people are going all the time? Maybe we could click on this house and see, like, a whole timeline of people’s comings and goings and which other people they are sort of in a social network with.”
So, I kind of said, “Ahh… maybe I don’t want to be involved in that.” And I left Google. And I found out that there was this campaign. And this campaign was largely lawyers and disarmament experts, people of that nature—philosophers—but also a few technologists. And for me, having run computer systems for a large number of years at this point, the idea that you would want to rely on a big distributed system running over some janky network with a bunch of 18-year-old kids running it to actually make good decisions about who should be targeted in a conflict seems outrageous.
And I think almost every [laugh] software operations person, or in fact, software engineer that I’ve spoken to, tends to feel the same way. And yet there is this big practical debate about this in international relations circles. But luckily, there has just been a resolution in the UN just in the last day or two as we record this: the First Committee has, by a very large majority, voted to try and do something about this. So hopefully, we’ll get some international law. The specific interventions that most of us in this field think would be good would be to limit the amount of force that an autonomous weapon, or in fact, an entire set of autonomous weapons in a region would be able to wield because there’s a concern that should there be some bug or problem or a sort of weird factor that triggers these systems to—
Corey: It’s an inevitability that there will be. Like, that is not up for debate. Of course, it’s going to break. In 2020, the template slide deck that AWS sent out for re:Invent speakers had a bunch of clip art, and one of them was a line art drawing of a ham with a bone in it. So, I wound up taking that image, slapping it on a t-shirt, captioning it “AWS Hambone,” and selling that as a fundraiser for 826 National.
Corey: Now, what happened next is that for a while, anyone who tweeted the phrase “AWS Hambone” would find themselves banned from Twitter for the next 12 hours due to some weird algorithmic thing where it thought that was doxxing or harassment or something. And people on the other side of the issue that you’re talking about are straight-facedly suggesting that we give that algorithm [unintelligible 00:24:32] tool a gun.
Laura: Or many guns. Many guns.
Corey: I’m sorry, what?
Corey: Yes, or missiles or, heck, let’s build a whole bunch of them and turn them loose with no supervision, just like we do with junior developers.
Laura: Exactly. Yes, so many people think this is a great idea, or at least they purport to think this is a great idea, which is not always the same thing. I mean, there’s lots of different vested interests here. Some people who are proponents of this will say, well, actually, we think that this will make targeting more accurate, less civilians will actually die as a result of this. And the question there that you have to ask is—there’s a really good book called Drone by Chamayou, Grégoire Chamayou, and he says that there’s actually three meanings to accuracy.
So, are you hitting what you’re aiming at is one of it—one thing. And that’s a solved problem in military circles for quite some time. You got, you know, laser targeting, very accurate. Then the other question is, how big is the blast radius? So, that’s just a matter of, you know, how big an explosion are you going to get? That’s not something that autonomy can help with.
The only thing that autonomy could even conceivably help with in terms of accuracy is better target selection. So, instead of selecting targets that are not valid targets, selecting more valid targets. But I don’t think there’s any good reason to think that computers can solve that problem. I mean, in fact, if you read stuff that military experts write on this, and I’ve got, you know, lots of academic handbooks on military targeting processes, they will tell you, it’s very hard and there’s a lot of gray areas, a lot of judgment. And that’s exactly what computers are pretty bad at. Although mind you, I’m amused by your Hambone story and I want to ask if AWS Hambone is a database?
Corey: Anything is a database, if you hold it wrong.
Corey: It’s fun. I went through a period of time where, just for fun, I would ask people to name an AWS service and I would talk about how you could use it incorrectly as a database. And then someone mentioned, “What about AWS Neptune,” which is their graph database, which absolutely no one understands, and the answer there is, “I give up. It’s impossible to use that thing as a database.” But everything else can be. Like, you know, the tagging system. Great, that has keys and values; it’s a database now. Welcome aboard. And I didn’t say it was a great database, but it is a free one, and it scales to a point. Have fun with it.
Laura: All I’ll say is this: you can put labels on anything.
Laura: We missed you at the most recent SREcon EMEA. There was a talk about Google’s internal Chubby system and how people started using it as a database. And I did summon you in Slack, but you didn’t show up.
Corey: No. Sadly, I’ve gotten a bit out of the SRE space. And also, frankly, I’ve gotten out of the community space for a little while, when it comes to conferences. And I have a focused effort at the start of 2024 to start changing that. I am submitting CFPs left and right.
My biggest fear is that a conference will accept one of these because a couple of them are aspirational. “Here’s how I built the thing with generative AI,” which spoiler, I have done no such thing yet, but by God, I will by the time I get there. I have something similar around Kubernetes, which I’ve never used in anger, but soon will if someone accepts the right conference talk. This is how I learned Git: I shot my mouth off in a CFP, and I had four months to learn the thing. It was effective, but I wouldn’t say it was the best approach.
Laura: [laugh]. You shouldn’t feel bad about lying about having built things in Kubernetes, and with LLMs because everyone has, right?
Corey: Exactly. It’ll be true enough by the time I get there. Why not? I’m not submitting for a conference next week. We’re good. Yeah, Future Corey is going to hate me.
Laura: Have it build you a database system.
Corey: I like that. I really want to thank you for taking the time to speak with me today. If people want to learn more, where’s the best place for them to find you these days?
Laura: Ohh, I’m sort of homeless on social media since the whole Twitter implosion, but you can still find me there. I’m @lauralifts on Twitter and I have the same tag on BlueSky, but haven’t started to use it yet. Yeah, socials are hard at the moment. I’m on LinkedIn. Please feel free to follow me there if you wish to message me as well.
Corey: And we will, of course, put links to that in the show notes. Thank you so much for taking the time to speak with me. I appreciate it.
Laura: Thank you for having me.
Corey: Laura Nolan, Principal Software Engineer at Stanza. I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry, insulting comment that soon—due to me screwing up a database system—will be transmogrified into a CFP submission for an upcoming SREcon.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business, and we get to the point. Visit duckbillgroup.com to get started.