Reliability Starts in Cultural Change with Amy Tobey

Episode Summary

Corey has been talking to Amy Tobey, Senior Principal Engineer at Equinix, for quite some time! And now they’ve finally sat down for a round of “Screaming.” Amy does an awful lot, but we want to get some structure behind the many, many obstacles that Amy tackles as a senior engineer.

Amy breaks down what she does at Equinix, which has multiple data centers all over the world, as well as other products. Amy works on Equinix Metal, and does pretty much everything when it comes to keeping it functioning. But Amy’s contribution doesn’t stop there. For Amy, there is a lot of room for improvement in reliability at the cultural level. She offers up some excellent insight into ways to make that happen, keeping the grumpiness out of sysadmin, and more!

Episode Show Notes & Transcript

About Amy

Amy Tobey has worked in tech for more than 20 years at companies of every size, working with everything from kernel code to user interfaces. These days she spends her time building an innovative Site Reliability Engineering program at Equinix, where she is a principal engineer. When she's not working, she can be found with her nose in a book, watching anime with her son, making noise with electronics, or doing yoga poses in the sun.

Links Referenced:

Equinix Metal: https://metal.equinix.com
Personal Twitter: https://twitter.com/MissAmyTobey
Personal Blog: https://tobert.github.io/

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is sponsored in part by our friends at Vultr. Optimized cloud compute plans have landed at Vultr to deliver lightning-fast processing power, courtesy of third-gen AMD EPYC processors without the IO or hardware limitations of a traditional multi-tenant cloud server. Starting at just 28 bucks a month, users can deploy general-purpose, CPU, memory, or storage optimized cloud instances in more than 20 locations across five continents. Without looking, I know that once again, Antarctica has gotten the short end of the stick. Launch your Vultr optimized compute instance in 60 seconds or less on your choice of included operating systems, or bring your own. It’s time to ditch convoluted and unpredictable giant tech company billing practices and say goodbye to noisy neighbors and egregious egress forever. Vultr delivers the power of the cloud with none of the bloat. “Screaming in the Cloud” listeners can try Vultr for free today with $150 in credit when they visit getvultr.com/screaming. That’s G-E-T-V-U-L-T-R dot com slash screaming. My thanks to them for sponsoring this ridiculous podcast.

Corey: Finding skilled DevOps engineers is a pain in the neck! And if you need to deploy a secure and compliant application to AWS, forgettaboutit! But that’s where DuploCloud can help. Their comprehensive no-code/low-code software platform guarantees a secure and compliant infrastructure in as little as two weeks, while automating the full DevSecOps lifestyle. Get started with DevOps-as-a-Service from DuploCloud so that your cloud configurations are done right the first time. Tell them I sent you and your first two months are free. To learn more visit: snark.cloud/duplo. That’s snark.cloud/D-U-P-L-O-C-L-O-U-D.

Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. Every once in a while I catch up with someone that it feels like I’ve known for ages, and I realize somehow I have never been able to line up getting them on this show as a guest. Today is just one of those days. And my guest is Amy Tobey who has been someone I’ve been talking to for ages, even in the before-times, if you can remember such a thing. Today, she’s a Senior Principal Engineer at Equinix. Amy, thank you for finally giving in to my endless wheedling.

Amy: Thanks for having me. You mentioned the before-times. Like, I remember it was, like, right before the pandemic we had beers in San Francisco wasn’t it? There was Ian there—

Corey: Yeah, I—

Amy: —and a couple other people. It was a really great time. And then—

Corey: I vaguely remember beer. Yeah. And then—

Amy: And then the world ended.

Corey: Oh, my God. Yes. It’s still March of 2020, right?

Amy: As far as I know. Like, I haven’t checked in a couple years.

Corey: So, you do an awful lot. And it’s always a difficult question to ask someone, so can you encapsulate your entire existence in a paragraph? It’s—

Amy: [sigh].

Corey: —awful, so I’d like to give a bit more structure to it. Let’s start with the introduction: You are a Senior Principal Engineer. We know it’s high level because of all the adjectives that get put in there, and none of those adjectives are ‘associate’ or ‘beginner’ or ‘junior,’ or all the other diminutives that companies like to play games with to justify paying people less. And you’re at Equinix, which is a company that is a bit unlike most of the, shall we say, traditional cloud providers. What do you do over there, both as a company and as a person?

Amy: So, as a company Equinix, what most people know about is that we have a whole bunch of data centers all over the world. I think we have the most of any company. And what we do is we lease out space in those data centers, and then we have a number of other products that people don’t know as well, one of which is Equinix Metal, which is what I specifically work on, where we rent you bare-metal servers. None of that fancy stuff that you get in other clouds on top of it; there’s things you can get that are… partner things that you can add on, like, you know, storage and other things like that, but we just deliver you bare-metal servers with really great networking. So, what I work on is the reliability of that whole system. All of the things that go into provisioning the servers, making them come up, making sure that they get delivered to the server, make sure the API works right, all of that stuff.

Corey: So, you’re on the Equinix cloud side of the world more so than you are on the building data centers by the sweat of your brow, as they say?

Amy: Correct. Yeah, yeah. Software side.

Corey: Excellent. I spent some time in data centers in the early part of my career before cloud ate that. That was sort of contemporaneous with the discovery that I’m the hardware destruction bunny, and I should go to great pains to keep my aura from anything expensive and important, like, you know, the SAN. So—

Amy: Right, yeah.

Corey: Companies moving out of data centers, and me getting out was a great thing.

Amy: But the thing about SANs though, is, like, it might not be you. They’re just kind of cursed from the start, right? They just always were kind of fussy and easy to break.

Corey: Oh, yeah. I used to think—and I kid you not—that I had a limited upside to my career in tech because I sometimes got sloppy and I was fairly slow at crimping ethernet cables.

Amy: [laugh].

Corey: That is very similar to growing up in third grade when it became apparent that I was going to have problems in my career because my handwriting was sloppy. Yeah, it turns out the future doesn’t look like we predicted it would.

Amy: Oh, gosh. Are we going to talk about, like, neurological development now or… [laugh] okay, that’s a thing I struggle with, too right, is I started typing as soon as they would let—in fact, before they would let me. I remember in high school, I had teachers who would grade me down for typing a paper out. They want me to handwrite it and I would go, “Cool. Go ahead and take a grade off because if I handwrite it, you’re going to take two grades off my handwriting, so I’m cool with this deal.”

Corey: Yeah, it was pretty easy early on. I don’t know when the actual shift was, but it became more and more apparent that more and more things are moving towards a world where you could type. And I was almost five when I started working on that stuff, and that really wound up changing a lot of aspects of how I started seeing things. One thing I think you’re probably fairly well known for is incidents. I want to be clear when I say that you are not the root cause as—“So, why are things broken?” “It’s Amy again. What’s she gotten into this time?” Great.

Amy: [laugh]. But it does happen, but not all the time.

Corey: Exa—it’s a learning experience.

Amy: Right.

Corey: You’ve also been deeply involved with SREcon and a number of—a lot of aspects of what I will term—and please don’t yell at me for this—SRE culture—

Amy: Yeah.

Corey: Which is sometimes a challenging thing to wind up describing or putting a definition around. The one that I’ve always been somewhat partial to is, “SRE is DevOps, except you worked at Google for a while.” I don’t know how necessarily accurate that is, but it does rile people up.

Amy: Yeah, it does. Dave Stanke actually did a really great talk at SREcon San Francisco just a couple weeks ago, about the DORA report. And the new DORA report, they split SRE out into its own function and kind of is pushing against that old model, which actually comes from Liz Fong-Jones—I think it’s from her, or older—about, like, class SRE implements DevOps, which is kind of this idea that, like, SREs make DevOps happen. Things have evolved, right, since then. Things have evolved since Google released those books, and we’ve all just figured out what works and what doesn’t a little bit.

And so, it’s not that we’re implementing DevOps so much. In fact, it’s that ops stuff that kind of holds us back from the really high impact work that SREs, I think, should be doing, that aren’t just, like, fixing the problems, the symptoms down at the bottom layer, right? Like what we did as sysadmins 20 years ago. You know, we’d go and a lot of people are SREs that came out of the sysadmin world and still think in that mode, where it’s like, “Well, I set up the systems, and when things break, I go and I fix them.” And, “Why did the developers keep writing crappy code? Why do I always have to get up in the middle of the night because this thing crashed?”

And it turns out that the work we need to do to make things more reliable, there’s a ceiling to how far the platform can take us, right? Like, we can have the best platform in the world with redundancy, and, you know, nine-way replicated data storage and all this crazy stuff, and still if we put crappy software on top, it’s going to be unreliable. So, how do we make less crappy software? And for most of my career, people would be, like, “Well, you should test it.” And so, we started doing that, and we still have crappy software, so what’s going on here? We still have incidents.

So, we write more tests, and we still have incidents. We had a QA group, we still have incidents. We send the developers to training, and we still have incidents. So like, what is the thing we need to do to make things more reliable? And it turns out, most of it is culture work.

Corey: My perspective on this stems from being a grumpy old sysadmin. And at some point, I started calling myself a systems engineer or DevOps or production engineer, or SRE. It was all from my point of view, the same job, but you know, if you call yourself a sysadmin, you’re just asking for a 40% pay cut off the top.

Amy: [laugh].

Corey: But I still tended to view the world through that lens. I tended to be very good at Linux systems internals, for example, understanding system calls and the rest, but increasingly, as the DevOps wave or SRE wave, or Google-isation of the internet wound up being more and more of a thing, I found myself increasingly in job interviews, where, “Great, now, can you go wind up implementing a sorting algorithm on the whiteboard?” “What on earth? No.” Like, my lingua franca is shitty Bash, and no one tends to write that without a bunch of tab completions and quick checking with manpages—die.net or whatnot—on the fly as you go down that path.

And it was awful, and I felt… like my skill set was increasingly eroding. And it wasn’t honestly until I started this place where I really got into writing a fair bit of code to do different things because it felt like an orthogonal skill set, but in the fullness of time, it seems like it’s not. And it’s a reskilling. And it made me wonder, does this mean that the areas of technology that I focused on early in my career, was that all a waste? And the answer is not really. Sometimes, sure, in that I don’t spend nearly as much time worrying about inodes—for example—as I once did. But every once in a while, I’ll run into something and I look like a wizard from the future, but instead, I’m a wizard from the past.

Amy: Yeah, I find that a lot in my work, now. Sometimes things I did 20 years ago, come back, and it’s like, oh, yeah, I remember I did all that threading work in 2002 in Perl, and I learned everything the very, very, very hard way. And then, you know, this January, did some threading work to fix some stability issues, and all of it came flooding back, right? Just that the experiences really, more than the code or the learning or the text and stuff; more just the, like, this feels like threads [BLEEP]-ery. Is a diagnostic thing that sometimes we have to say.

And then people are like, “Can you prove it?” And I’m like, “Not really,” because it’s literally thread [BLEEP]-ery. Like, the definition of it is that there’s weird stuff happening that we can’t figure out why it’s happening. There’s something acting in the system that isn’t synchronized, that isn’t connected to other things, that’s happening out of order from what we expect, and if we had a clear signal, we would just fix it, but we don’t. We just have, like, weird stuff happening over here and then over there and over there and over there.

And, like, that tells me there’s just something happening at that layer and then have to go and dig into that right, and like, just basically charge through. My colleagues are like, “Well, maybe you should look at this, and go look at the database,” the things that they’re used to looking at and that their experiences inform, whereas then I bring that ancient toiling through the threading mines experiences back and go, “Oh, yeah. So, let’s go find where this is happening, where people are doing dangerous things with threads, and see if we can spot something.” But that came from that experience.

Corey: And there’s so much that just repeats itself. And history rhymes. The challenge is that, do you have 20 years of experience, or do you have one year of experience repeated 20 times? And as the tide rises, doing the same task by hand, it really is just a matter of time before your full-time job winds up being something a piece of software does. An easy example is, “Oh, what’s your job?” “I manually place containers onto specific hosts.” “Well, I’ve got news for you, and you’re not going to like it at all.”

Amy: Yeah, yeah. I think that we share a little bit. I’m allergic to repeated work. I don’t know if allergic is the right word, but you know, if I sit and I do something once, fine. Like, I’ll just crank it out, you know, it’s this form, or it's a datafile I got to write and I’ll—fine I’ll type it in and do the manual labor.

The second time, the difficulty goes up by ten, right? Like, just mentally, just to do it, be like, I’ve already done this once. Doing it again is anathema to everything that I am. And then sometimes I’ll get through it, but after that, like, writing a program is so much easier because it’s like exponential, almost, growth in difficulty. You know, the third time I have to do the same thing that’s like just typing the same stuff—like, look over here, read this thing and type it over here—I’m out; I can’t do it. You know, I got to find a way to automate. And I don’t know, maybe normal people aren’t driven to live this way, but it’s kept me from getting stuck in those spots, too.

Corey: It was weird because I spent a lot of time as a consultant going from place to place and it led to some weird changes. For example, “Oh, thank God, I don’t have to think about that whole messaging queue thing.” Sure enough, next engagement, it’s message queue time. Fantastic. I found that repeating myself drove me nuts, but you also have to be very sensitive not to wind up, you know, stealing IP from the people that you’re working with.

Amy: Right.

Corey: But what I loved about the sysadmin side of the world is that the vast majority of stuff that I’ve taken with me, lives in my shell config. And what I mean by that is I’m not—there’s nothing in there is proprietary, but when you have a weird problem with trying to figure out the best way to figure out which Ruby process is stealing all the CPU, great, turns out that you can chain seven or eight different shell commands together through a bunch of pipes. I don’t want to remember that forever. So, that’s the sort of thing I would wind up committing as I learned it. I don’t remember what company I picked that up at, but it was one of those things that was super helpful.

I have a sarcastic—it’s a one-liner, except no sane editor setting is going to show it in any less than three—of a whole bunch of Perl, piped into du, piped into the rest, that tells you one of the largest consumers of files in a given part of the system. And it rates them with stars and it winds up doing some neat stuff. I would never sit down and reinvent something like that today, but the fact that it’s there means that I can do all kinds of neat tricks when I need to. It’s making sure that as you move through your career, on some level, you’re picking up skills that are repeatable and applicable beyond one company.
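A minimal sketch of the kind of disk-usage helper Corey describes here, written in Python rather than his Perl-and-du pipeline, which the transcript does not show. The function names `dir_sizes` and `star_report` are hypothetical, and the sizes are apparent file sizes rather than the allocated blocks that du reports by default.

```python
import os
from pathlib import Path


def dir_sizes(root):
    """Return {name: total bytes} for each immediate child of root."""
    sizes = {}
    for child in Path(root).iterdir():
        if child.is_symlink():
            continue  # do not follow or count symlinks
        if child.is_file():
            sizes[child.name] = child.stat().st_size
        elif child.is_dir():
            total = 0
            for dirpath, _dirnames, filenames in os.walk(child):
                for name in filenames:
                    path = os.path.join(dirpath, name)
                    if os.path.islink(path):
                        continue
                    try:
                        total += os.path.getsize(path)
                    except OSError:
                        pass  # file vanished or is unreadable; skip it
            sizes[child.name] = total
    return sizes


def star_report(sizes, width=20):
    """Rank entries by size; the largest gets `width` stars, the rest scale down."""
    if not sizes:
        return []
    biggest = max(sizes.values()) or 1  # avoid dividing by zero
    ranked = sorted(sizes.items(), key=lambda item: item[1], reverse=True)
    lines = []
    for name, size in ranked:
        stars = "*" * max(1, round(width * size / biggest))
        lines.append(f"{stars:<{width}} {size:>12} {name}")
    return lines


if __name__ == "__main__":
    for line in star_report(dir_sizes(".")):
        print(line)
```

Kept in a personal dotfiles repository, a helper like this is the sort of small, portable tool the conversation is about: nothing in it depends on any one employer's systems.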

Amy: Skills and tooling—

Corey: Yeah.

Amy: —right? Like, you just described the tool. Another SREcon talk was John Allspaw and Dr. Richard Cook talking about above the line; below the line. And they started with these metaphors about tools, right, showing all the different kinds of hammers.

And if you’re a blacksmith, a lot of times you craft specialized hammers for very specific jobs. And that’s one of the properties of a tool that they were trying to get people to think about, right, is that tools get crafted to the job. And what you just described is a bespoke tool that you had created on the fly, that kind of floated under the radar of intellectual property. [laugh].

So, let’s not tell the security or IP people right? Like, because there’s probably billions and billions of dollars of technically, like, made-up IP value—I’m doing air quotes with my fingers—you know, that’s just basically people’s shell profiles. And my God, the Emacs automation that people have done. If you’ve ever really seen somebody who’s amazing at Emacs and has 10, 20, 30, maybe 40 years of experience encoded in their Emacs settings, it’s a wonder to behold. Like, I look at it and I go, “Man, I wish I could do that.”

It’s like listening to a really great guitar player and be like, “Wow, I wish I could play like them.” You see them just flying through stuff. But all that IP in there is both that person’s collection of wisdom and experience and working with that code, but also encodes that stuff like you described, right? It’s just all these little systems tricks and little fiddly commands and things we don’t want to remember and so we encode them into our toolset.

Corey: Oh, yeah. Anything I wound up taking, I always would share it with people internally, too. I’d mention, “Yeah, I’m keeping this in my shell files.” Because I disclosed it, which solves a lot of the problem. And also, none of it was even close to proprietary or anything like that. I’m sorry, but the way that you wind up figuring out how much of a disk is being eaten up and where in a more pleasing way, is not a competitive advantage. It just isn’t.

Amy: It isn’t to you or me, but, you know, back in the beginning of our careers, people thought it was worth money and should be proprietary. You know, like, oh, that disk-checking script is a competitive advantage for our company because there are only a few of us doing this work. Like, it was actually being able to, like, manage your—[laugh] actually manage your servers was a competitive advantage. Now, it’s kind of commodity.

Corey: Let’s also be clear that the world has moved on. I wound up buying DaisyDisk a while back for Mac, which I love. It is a fantastic, pretty effective, “Where’s all the stuff on your disk going?” And it does a scan and you can drag and collect things and delete them when trying to clean things out. I was using it the other day, so it’s top of mind at the moment.

But it’s way more polished than that crappy Perl three-liner. And I see both sides, truly I do. The trick also, for those wondering [unintelligible 00:15:45], like, “Where is the line?” It’s super easy. Disclose what you’re doing. In those scenarios, in the event someone says no because they believe that finding the right man page section for something is somehow proprietary.

Great. When you go home that evening in a completely separate environment, build it yourself from scratch to solve the problem, reimplement it and save that. And you’re done. There are lots of ways to do this. Don’t steal from your employer, but your employer employs you; they don’t own you and the way that you think about these problems.

Every person I’ve met who has had a career that’s longer than 20 minutes has a giant doc somewhere on some system of all of the scripts that they wound up putting together, all of the one-liners, the notes on, “Next time you see this, this is the thing to check.”

Amy: Yeah, the cheat sheet or the notebook with all the little commands, or again the Emacs config, sometimes for some people, or shell profiles. Yeah.

Corey: Here’s the awk one-liner that I put that automatically spits out from an Apache log file what—the httpd log file that just tells me what are the most frequent talkers, and what are the—

Amy: You should probably let go of that one. You know, like, I think that one’s lifetime is kind of past, Corey. Maybe you—

Corey: I just have to get it working with Nginx, and we’re good to go.

Amy: Oh, yeah, there you go. [laugh].

Corey: Or S3 access logs. Perish the thought. But yeah, like, what are the five most high-volume talkers, and what are those relative to each other? Huh, that one thing seems super crappy and it’s coming from Russia. But that’s—hmm, one starts to wonder; maybe it’s time to dig back in.
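The "most frequent talkers" idea can be sketched in a few lines. This is not Corey's awk one-liner, which is not shown in the transcript, but a Python stand-in that assumes each log line begins with the client address, as Apache's common and combined log formats and Nginx's default combined format do. The name `top_talkers` is hypothetical, and logs that put the remote address in a different field (S3 access logs, for example) would need a different field index.

```python
from collections import Counter


def top_talkers(lines, n=5):
    """Count requests per client address in access-log lines.

    Returns (address, hits, share_of_all_requests) for the n busiest
    clients, busiest first, so they can be compared to each other.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1  # first field is the client address
    total = sum(counts.values())
    return [(addr, hits, hits / total) for addr, hits in counts.most_common(n)]


if __name__ == "__main__":
    import sys

    # Usage: python top_talkers.py < access.log
    for addr, hits, share in top_talkers(sys.stdin):
        print(f"{hits:>8} {share:6.1%} {addr}")
```

Reporting each client's share of all requests alongside the raw count makes the "relative to each other" comparison visible at a glance.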

So, one of the things that I have found is that a lot of the people talking about SRE seem to have descended from an ivory tower somewhere. And they’re talking about how some of the best-in-class companies out there, renowned for their technical cultures—at least externally—are doing these things. But there’s a lot more folks who are not there. And honestly, I consider myself one of those people who is not there. I was a competent engineer, but never a terrific one.

And looking at the way this was described, I often came away thinking, “Okay, was the purpose of this conference talk just to reinforce how smart people are, and how I’m not,” and/or, “There are the 18 cultural changes you need to make to your company, and then you can do something kind of like we were just talking about on stage.” It feels like there’s a combination of problems here. One is making this stuff more accessible to folks who are not themselves in those environments, and two, how to drive cultural change as an individual contributor if that’s even possible. And I’m going to go out on a limb and guess you have thoughts on both aspects of that, and probably some more. Hit me, please.

Amy: So, the ivory tower, right. Let’s just be straight up, like, the ivory tower is Google. I mean, that’s where it started. And we get it from the other large companies that, you know, want to do conference talks about what this stuff means and what it does. What I’ve kind of come around to in the last couple of years is that those talks don’t really reach the vast majority of engineers, they don’t really apply to a large swath of the enterprise especially, which is, like, where a lot of the—the bulk of our industry sits, right? We spend a lot of time talking about the darlings out here on the West Coast in high tech culture and startups and so on.

But, like, we were talking about before we started the show, right, like, the interior of even just America, is filled with all these, like, insurance and banks and all of these companies that are cranking out tons of code and servers and stuff, and they’re trying to figure out the same problems. But they’re structured in companies where their tech arm is still, in most cases, considered a cost center, often is bundled under finance, for—that’s a whole show of itself about that historical blunder. And so, the tech cultures tend to be very, very different from what we experience in—what do we call it anymore? Like, I don’t even want to say West Coast anymore because we’ve gone remote, but, like, high tech culture we’ll say. And so, like, thinking about how to make SRE and all this stuff more accessible comes down to, like, thinking about who those engineers are that are sitting at the computers, writing all the code that runs our banks, all the code that makes sure that—I’m trying to think of examples that are more enterprise-y right?

Or, shoot, buying clothes online. You go to Macy’s for example. They have a whole bunch of servers that run their online store and stuff. They have internal IT-ish people who keep all this stuff running and write that code and probably integrating open-source stuff much like we all do. But when you go to try to put in a reliability program that’s based on the current SRE models, like SLOs; you put in SLOs and you start doing, like, this incident management program that’s, like, you know, you have a form you fill out after every incident, and then you [unintelligible 00:20:25] retros.

And it turns out that those things are very high-level skills, skills and capabilities in an organization. And so, when you have this kind of IT mindset or the enterprise mindset, bringing the culture together to make those things work often doesn’t happen. Because, you know, they’ll go with the prescriptive model and say, like, okay, we’re going to implement SLOs, we’re going to start measuring SLIs on all of the services, and we’re going to hold you accountable for meeting those targets. If you just do that, right, you’re just doing more gatekeeping and policing of your tech environment. My bet is, reliability almost never improves in those cases.

And that’s been my experience, too, and why I get charged up about this is, if you just go slam in these practices, people end up miserable, the practices then become tarnished because people experienced the worst version of them. And then—

Corey: And with the remote explosion as well, it turns out that changing jobs basically means their company sends you a different Mac, and the next Monday, you wind up signing into a different Slack team.

Amy: Yeah, so the culture really matters, right? You can’t cover it over with foosball tables and great lunch. You actually have to deliver tools that developers want to use and you have to deliver a software engineering culture that brings out the best in developers instead of demanding the best from developers. I think that’s a fundamental business shift that’s kind of happening. If I’m putting on my wizard hat and looking into the future and dreaming about what might change in the world, right, is that there’s kind of a change in how we do leadership and how we do business that’s shifting more towards that model where we look at what people are capable of and we trust in our people, and we get more out of them, the knowledge work model.

If we want more knowledge work, we need people to be happy and to feel engaged in their community. And suddenly we start to see these kind of generational, bigger-pie kind of things start to happen. But how do we get there? It’s not SLOs. Maybe it’s a little bit starting with incidents. That’s where I’ve had the most success, and you asked me about that. So, getting practical, incident management is probably—

Corey: Right. Well, as I see it, the problem with SLOs across the board is it feels like it’s a very insular community so far, and communicating it to engineers seems to be the focus of where the community has been, but from my understanding of it, you absolutely need buy-in at significantly high executive levels, to at the very least buy you air cover while you’re doing these things and making these changes, but also to help drive that cultural shift. None of this is something I have the slightest clue how to do, let’s be very clear. If I knew how to change a company’s culture, I’d have a different job.

Amy: Yeah. [laugh]. The biggest omission in the Google SRE books was [Ers 00:22:58]. There was a guy at Google named Ers who owns availability for Google, and when anything is, like, in dispute and bubbles up the management team, it goes to Ers, and he says, “Thou shalt…” right? Makes the call. And that’s why it works, right?

Like, it’s not just that one person, but that system of management where the whole leadership team—there’s a large, very well-funded team with a lot of power in the organization that can drive availability, and they can say, this is how you’re going to do metrics for your service, and this is the system that you’re in. And it’s kind of, yeah, sure it works for them because they have all the organizational support in place. What I was saying to my team just the other day—because we’re in the middle of our SLO rollout—is that really, I think an SLO program isn’t [clear throat] about the engineers at all until late in the game. At the beginning of the game, it’s really about getting the leadership team on board to say, “Hey, we want to put in SLIs and SLOs to start to understand the functioning of our software system.” But if they don’t have that curiosity in the first place, that desire to understand how well their teams are doing, how healthy their teams are, don’t do it. It’s not going to work. It’s just going to make everyone miserable.
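For readers who have not met the terms, the arithmetic behind an SLI and an SLO target is small, which is part of why the discussion here is about leadership and culture, not math. The sketch below assumes a simple availability SLI (good events divided by total events) and made-up service names and counts; `availability_sli` and `slo_report` are hypothetical names, not anything from Equinix or Google.

```python
def availability_sli(good_events, total_events):
    """SLI: the fraction of events that were good (1.0 if there was no traffic)."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def slo_report(services, target):
    """Compare each service's SLI against one SLO target.

    services maps a service name to (good_events, total_events).
    Returns (name, sli, meets_target) tuples sorted by service name.
    """
    report = []
    for name, (good, total) in sorted(services.items()):
        sli = availability_sli(good, total)
        report.append((name, sli, sli >= target))
    return report


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    window = {"api": (9995, 10000), "provisioning": (980, 1000)}
    for name, sli, ok in slo_report(window, target=0.999):
        print(f"{name:<14} {sli:.4f} {'meets' if ok else 'misses'} the target")
```

Producing this table is the easy part; what a leadership team does with a "misses" row is the question the conversation keeps returning to.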

Corey: It feels like it’s one of those difficult to sell problems as well, in that it requires some tooling changes, absolutely. It requires cultural change and buy-in and whatnot, but in order for that to happen, there has to be a painful problem that a company recognizes and is willing to pay to make go away. The problem with stuff like this is that once you pay, there’s a lot of extra work that goes on top of it as well, that does not have a perception—rightly or wrongly—of contributing to feature velocity, of hitting the next milestone. It’s, “Really? So, we’re going to be spending how much money to make engineers happier? They should get paid an awful lot and they’re still complaining and never seem happy. Why do I care if they’re happy other than the pure mercenary perspective of otherwise they’ll quit?” I’m not saying that it’s not worth pursuing, or that it’s not a worthy goal. I am saying that it becomes a very difficult thing to wind up selling as a product.

Amy: Well, as a product for sure, right? Because—[sigh] gosh, I have friends in the space who work on these tools. And I want to be careful.

Corey: Of course. Nothing but love for all of those people, let’s be very clear.

Amy: But a lot of them, you know, they’re pulling metrics from existing monitoring systems, they are doing some interesting math on them, but what you get at the end is a nice service catalog and dashboard, which are things we’ve been trying to land as products in this industry for as long as I can remember, and—

Corey: “We’ve got it this time, though. This time we’ll crack the nut.” Yeah. Get off the island, Gilligan.

Amy: And then the other, like, risky thing, right, is the other part that makes me uncomfortable about SLOs, and why I will often tell folks that I talk to out in the industry that are asking me about this, like, one-on-one, “Should I do it here?” And it’s like, you can bring the tool in, and if you have a management team that’s just looking to have metrics to drive productivity, instead of, you know, trying to drive better knowledge work, what you get is just a fancier version of more Taylorism, right, which is basically scientific management, this idea that we can, like, drive workers to maximum efficiency by measuring random things about them and driving those numbers. It turns out that doesn’t really work very well, even at industrial scale, it just happened to work because, you know, we have a bloody enough society that we pushed people into it. But the reality is, if you implement SLOs badly, you get more really bad Taylorism that’s bad for your developers. And my suspicion is that you will get worse availability out of it than you would if you just didn’t do it at all.

Corey: This episode is sponsored by our friends at Revelo. Revelo is the Spanish word of the day, and it’s spelled R-E-V-E-L-O. It means “I reveal.” Now, have you tried to hire an engineer lately? I assure you it is significantly harder than it sounds. One of the things that Revelo has recognized is something I’ve been talking about for a while, specifically that while talent is evenly distributed, opportunity is absolutely not. They’re exposing a new talent pool to, basically, those of us without a presence in Latin America via their platform. It’s the largest tech talent marketplace in Latin America with over a million engineers in their network, which includes—but isn’t limited to—talent in Mexico, Costa Rica, Brazil, and Argentina. Now, not only do they wind up spreading all of their talent on English ability, as well as, you know, their engineering skills, but they go significantly beyond that. Some of the folks on their platform are hands down the most talented engineers that I’ve ever spoken to. Let’s also not forget that Latin America has high time zone overlap with what we have here in the United States, so you can hire full-time remote engineers who share most of the workday as your team. It’s an end-to-end talent service, so you can find and hire engineers in Central and South America without having to worry about, frankly, the colossal pain of cross-border payroll and benefits and compliance because Revelo handles all of it. If you’re hiring engineers, check out revelo.io/screaming to get 20% off your first three months. That’s R-E-V-E-L-O dot I-O slash screaming.

Corey: That is part of the problem: in some cases, to drive some of these improvements, you have to go backwards to move forwards. And it’s one of those, “Great, so we spent all this effort and money and the rest, and now things are worse?” No, not necessarily, but suddenly you’re aware of things that were slipping through the cracks previously.

Amy: Yeah. Yeah.

Corey: Like, the most realistic thing about first The Phoenix Project and then The Unicorn Project, both by Gene Kim, has been the fact that companies have these problems and actively cared enough to change it. In my experience, that feels a little on the rare side.

Amy: Yeah, and I think that’s actually the key, right? It’s for the culture change, and for, like, if you’re really looking to be, like, do I want to work at this company? Am I investing myself in here? Is look at the leadership team and be, like, do these people actually give a crap? Are they looking just to punt another number down the road?

That’s the real question, right? Like, the technology and stuff, at the point where I’m at in my career, I just don’t care that much anymore. [laugh]. Just… fine, use Kubernetes, use Postgres, [unintelligible 00:27:30], I don’t care. I just don’t. Like, Oracle, I might have to ask, you know, go to finance and be like, “Hey, can we spend 20 million for a database?” But like, nobody really asks for that anymore, so. [laugh].

Corey: As one does. I will say that I mostly agree with you, but a technology that I found myself getting excited about, given the time of the recording on this is… fun, I spent a bit of time yesterday—from when we’re recording this—teaching myself just enough Go to wind up putting together a binary that I needed to do something actively ridiculous for my camera here. And I found myself coming away deeply impressed by a lot of things about it, how prescriptive it was for one, how self-contained for another. And after spending far too many years of my life writing shitty Perl, and shitty Bash, and worse Python, et cetera, et cetera, the prescriptiveness was great. The fact that it wound up giving me something I could just run, I could cross-compile for anything I need to run it on, and it just worked. It’s been a while since I found a technology that got me this interested in exploring further.

Amy: Go is great for that. You mentioned one of my two favorite features of Go. One is usually when a program compiles—at least the way I code in Go—it usually works. I’ve been working with Go since about 0.9, like, just a little bit before it was released as 1.0, and that’s what I’ve noticed over the years of working with it is that most of the time, if you have a pretty good data structure design and you get the code to compile, usually it’s going to work, unless you’re doing weird stuff.

The other thing I really love about Go and that maybe you’ll discover over time is the malleability of it. And the reason why I think about that more than probably most folks is that I work on other people’s code most of the time. And maybe this is something that you probably run into with your business, too, right, where you’re working on other people’s infrastructure. And the way that we encode business rules and things in the languages, in our programming language or our config syntax and stuff has a huge impact on folks like us and how quickly we can come into a situation, assess, figure out what’s going on, figure out where things are laid out, and start making changes with confidence.

Corey: Forget other people for a minute; looking at what I built out three or four years ago here, myself, like, I look at past me, it’s like, “What was that rat bastard thinking? This is awful.” And it’s—forget other people’s code; hell is your own code, on some level, too, once it’s slipped out of the mental stack and you have to re-explore it and, “Oh, well thank God I defensively wound up not including any comments whatsoever explaining what the living hell this thing was.” It’s terrible. But you’re right, the other people’s shell scripts are finicky and odd.

I started poking around for help when I got stuck on something, by looking at GitHub, and a bit of searching here and there. Even these large, complex, well-used projects started making sense to me in a way that I very rarely find. It’s, “What the hell is that thing?” is my most common refrain when I’m looking at other people’s code, and Go for whatever reason avoids that, I think because it is so prescriptive about formatting, about how things should be done, about the vision that it has. Maybe I’m romanticizing it and I’ll hate it a week from now, and I’ll want to go back and remove this recording, but.

Amy: The size of the language helps a lot.

Corey: Yeah.

Amy: But probably my favorite. It’s more of a convention, which is actually funny the way I’m going to talk about this because the two languages I work on the most right now are Ruby and Go. And I don’t feel like two languages could really be more different.

Syntax-wise, they share some things, but really, like, the mental models are so very, very different. Ruby is all the way in on object-oriented programming, and, like, the actual real kind of object-oriented with messaging and stuff, and, like, the whole language kind of springs from that. And it kind of requires you to understand all of these concepts very deeply to be effective in large programs. So, what I find is, when I approach a Ruby codebase, I have to load all this crap into my head and remember, “Okay, so yeah, there’s this convention, when you do this kind of thing in Ruby”—or especially Ruby on Rails is even worse because they go deep into convention over configuration. But what that’s code for is, this code is accessible to people who have a lot of free cognitive capacity to load all this convention into their heads and keep it in their heads so that the code looks pretty, right?

And so, that’s the trade-off as you said, okay, my developers have to be these people with all these spare brain cycles to understand, like, why I would put the code here in this place versus this place? And all these, like, things that are in the code, like, very compact, dense concepts. And then you go to something like Go, which is, like, “Nah, we’re not going to do Lambdas. Nah”—[laugh]—“We’re not doing all this fancy stuff.” So, everything is there on the page.

This drives some people crazy, right, is that there’s all this boilerplate, boilerplate, boilerplate. But the reality is, I can read most Go files from top to the bottom and understand what the hell it’s doing, whereas I can go sometimes look at, like, a Ruby thing, or sometimes Python and e—Perl is just [unintelligible 00:32:19] all the time, right, it’s there’s so much indirection. And it just be, like, “What the [BLEEP] is going on? This is so dense. I’m going to have to sit down and write it out in longhand so I can understand what the developer was even doing here.” And—

Corey: Well, that’s why I got the Mac Studio; for when I’m not doing A/V stuff with it, that means that I’ll have one core that I can use for, you know, front-end processing and the rest, and the other 19 cores can be put to work failing to build Nokogiri in Ruby yet again.

Amy: [laugh].

Corey: I remember the travails of working with Ruby, and the problem—I have similar problems with Python, specifically in that—I don’t know if I’m special like this—it feels like it’s an SRE DevOps style of working, but I am grabbing random crap off of GitHub constantly and running it, like, small scripts other people have built. And let’s be clear, I run them on my test AWS account that has nothing important because I’m not a fool, and I read most of it before I run it, but I also—it wants a different version of Python every single time. It wants a whole bunch of other things, too. And okay, so I use asdf as my version manager for these things, which for whatever reason, does not work for the way that I think about this ergonomically. Okay, great.

And I wind up with detritus scattered throughout my system. It’s, “Hey, can you make this reproducible on my machine?” “Almost certainly not, but thank you for asking.” It’s like ‘Step 17: Master the Wolf’ level of instructions.

Amy: And I think Docker generally… papers over the worst of it, right, is when we built all this stuff in the aughts, you know, [CPAN 00:33:45]—

Corey: Dev containers and VS Code are very nice.

Amy: Yeah, yeah. You know, like, we had CPAN back in the day, I was doing chroots, I think in, like, ’04 or ’05, you know, to solve this problem, right, which is basically I just—screw it; I will compile an entire distro into a directory with a Perl and all of its dependencies so that I can isolate it from the other things I want to run on this machine and not screw up and not have these interactions. And I think that’s kind of what you’re talking about is, like, the old model, when we deployed servers, there was one of us sitting there and then we’d log into the server and be like, I’m going to install the Perl. You know, I’ll compile it into, like, [/app/perl 558 00:34:21] whatever, and then I’ll CPAN all this stuff in, and I’ll give it over to the developer, tell them to set their shebang to that and everything just works. And now we’re in a mode where it’s like, okay, you got to set up a thousand of those. “Okay, well, I’ll make a tarball.” [laugh]. But it’s still like we had to just—

Corey: DevOps, but [unintelligible 00:34:37] dev closer to ops. You’re interrelating all the time. Yeah, then Docker comes along, and dev is, like, “Well, here’s the container. Good luck, asshole.” And it feels like it’s been cast into your yard to worry about.

Amy: Yeah, well, I mean, that’s just kind of business, or just—

Corey: Yeah. Yeah.

Amy: I’m not sure if it’s business or capitalism or something like that, but just the idea that, you know, if I can hand off the shitty work to some other poor schlub, why wouldn’t I? I mean, that’s most folks, right? Like, just be like, “Well”—

Corey: Which is fair.

Amy: —“I got it working. Like, my part is done, I did what I was supposed to do.” And now there’s a lot of folks out there, that’s how they work, right? “I hit done. I’m done. I shipped it. Sure. It’s an old [unintelligible 00:35:16] Ubuntu. Sure, there’s a bunch of shell scripts that rip through things. Sure”—you know, like, I’ve worked on repos where there’s hundreds of things that need to be addressed.

Corey: And passing to someone else is fine. I’m thrilled to do it. Where I run into problems with it is where people assume that well, my part was the hard part and anything you schlubs do is easy. I don’t—

Amy: Well, that’s the underclass. Yeah. That’s—

Corey: Forget engineering for a second; I throw things to the people over in the finance group here at The Duckbill Group because those people are wizards at solving for this thing. And it’s—

Amy: Well, that’s how we want to do things.

Corey: Yeah, specialization works.

Amy: But we have this—it’s probably more cultural. I don’t want to pick, like, capitalism to beat on because this is really, like, a human cultural thing, and it’s not even really particularly Western. Is the idea that, like, “If I have an underclass, why would I give a shit what their experience is?” And this is why I say, like, ops teams, like, get out of here because most ops teams, the extant ops teams are still called ops, and a lot of them have been renamed SRE—but they still do the same job—are an underclass. And I don’t mean that those people are below us. People are treated as an underclass, and they shouldn’t be. Absolutely not.

Corey: Yes.

Amy: Because the idea is that, like, well, I’m a fancy person who writes code at my ivory tower, and then it all flows down, and those people, just faceless people, do the deployment stuff that’s beneath me. That attitude is the most toxic thing, I think, in tech orgs to address. Like, if you’re trying to be like, “Well, our reliability is bad, we have security problems, people won’t fix their code.” And go look around and you will find people that are treated as an underclass that are given code thrown over the wall at them and then they just have to toil through and make it work. I’ve worked on that a number of times in my career.

And I think just like saying, underclass, right, or caste system, is what I found is the most effective way to get people actually thinking about what the hell is going on here. Because most people are just, like, “Well, that’s just the way things are. It’s just how we’ve always done it. The developers write the code, then give it to the sysadmins. The sysadmins deploy the code. Isn’t that how it always works?”

Corey: You’d really like to hope, wouldn’t you?

Amy: [laugh]. Not me. [laugh].

Corey: Again, the way I see it is, in theory—in theory—sysadmins, ops, or that should not exist. People should theoretically be able to write code as developers that just works, the end. And write it correct the first time and never have to change it again. Yeah. There’s a reason that I always like to call staging environments in places I work ‘theory’ because it works in theory, but not in production, and that is fundamentally the—like, that entire job role is the difference between theory and practice.

Amy: Yeah, yeah. Well, I think that’s the problem with it. We’re already so disconnected from the physical world, right? Like, you and I right now are talking over multiple strands of glass and digital transcodings and things right now, right? Like, we are detached from the physical reality.

You mentioned earlier working in data centers, right? The thing I miss about it is, like, the physicality of it. Like, actually, like, I held a server in my arms and put it in the rack and slid it into the rails. I plugged into power myself; I pushed the power button myself. There’s a server there. I physically touched it.

Developers who don’t work in production, we talked about empathy and stuff, but really, I think the big problem is when they work out in their idea space and just writing code, they write the unit tests, if we’re very lucky, they’ll write a functional test, and then they hand that wad off to some poor ops group. They’re detached from the reality of operations. It’s not even about accountability; it’s about experience. The ability to see all of the weird crap we deal with, right? You know, like, “Well, we pushed the code to that server, but there were three bit flips, so we had to do it again. And then the other server, the disk failed. And on the other server…” You know? [laugh].

It’s just, there’s all this weird crap that happens, these systems are so complex that they’re always doing something weird. And if you’re a developer that just spends all day in your IDE, you don’t get to see that. And I can’t really be mad at those folks, as individuals, for not understanding our world. I figure out how to help them, and the best thing we’ve come up with so far is, like, well, we start giving them some responsibility in a production environment so that they can learn that. People do that, and, again, it’s another one that can be done wrong, where it turns into kind of a forced empathy.

I actually really hate that mode, where it’s like, “We’re forcing all the developers online whether they like it or not. On-call whether they like it or not because they have to learn this.” And it’s like, you know, maybe slow your roll a little buddy because the stuff is actually hard to learn. Again, minimizing how hard ops work is. “Oh, we’ll just put the developers on it. They’ll figure it out, right? They’re software engineers. They’re probably smarter than you sysadmins.” Is the unstated thing when we do that, right? When we throw them in the pit and be like, “Yeah, they’ll get it.” [laugh].

Corey: And that was my problem [unintelligible 00:39:49] the interview stuff. It was in the write code on a whiteboard. It’s, “Look, I understood how the system fundamentally worked under the hood.” Being able to power my way through to get to an outcome, even in a language I don’t know, was sort of part and parcel of the job. But this idea of doing it in an artificially constrained environment, in a language I’m not super familiar with, off the top of my head, it took me years to get to a point of being able to do it with a Bash script because who ever starts with an empty editor and starts getting to work in a lot of these scenarios? Especially in an ops role where we’re not building something from scratch.

Amy: That’s the interesting thing, right? In the majority of tech work today—maybe 20 years ago, we did it more because we were literally building the internet we have today. But today, most of the engineers out there working—most of us working stiffs—are working on stuff that already exists. We’re making small incremental changes, which is great; that’s what we’re doing. And we’re dealing with old code.

Corey: We’re gluing APIs together, and that’s fine. Ugh. I really want to thank you for taking so much time to talk to me about how you see all these things. If people want to learn more about what you’re up to, where’s the best place to find you?

Amy: I’m on Twitter every once in a while as @MissAmyTobey, M-I-S-S-A-M-Y-T-O-B-E-Y. I have a blog I don’t write on enough. And there’s a couple things on the Equinix Metal blog that I’ve written, so if you’re looking for that. Otherwise, mainly Twitter.

Corey: And those links will of course be in the [show notes 00:41:08]. Thank you so much for your time. I appreciate it.

Amy: I had fun. Thank you.

Corey: As did I. Amy Tobey, Senior Principal Engineer at Equinix. I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, or on the YouTubes, smash the like and subscribe buttons, as the kids say. Whereas if you’ve hated this episode, same thing, five-star review all the platforms, smash the buttons, but also include an angry comment telling me that you’re about to wind up subpoenaing a copy of my shell script because you’re convinced that your intellectual property and secrets are buried within.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.
