Best Practices Don’t Exist with Paul Osman

Episode Summary

Paul Osman is a lead instrumentation engineer at Honeycomb.io, an observability platform that helps engineers get a deeper understanding of their production environments. He brings more than 20 years of tech experience to the role, having worked as a senior engineering manager at Under Armour, a platform engineer manager at PagerDuty, director of platform engineering at 500px, a developer evangelist at SoundCloud, and a web development lead at Mozilla, among other positions. Join Corey and Paul as they discuss what exactly it is that a lead instrumentation engineer does, how Paul didn’t like serverless at first and why he does now, why Paul believes in using the least amount of technology when possible, why Corey thinks that setting your database to your local timezone is a terrible idea, how there is no such thing as best practices that work for everyone, Paul’s favorite programming languages, what Paul thinks the right tech stack is, how Paul approaches computing languages he’s not well-versed in, and more.

Episode Show Notes & Transcript

About Paul Osman
Paul Osman is a Software Engineer with 20 years of experience in the industry. He's the Lead Instrumentation Engineer at Honeycomb.io and is passionate about making production a less scary word. Having spent most of his career in the ill-defined space between software development and operations, Paul spends a lot of time thinking about making on-call experiences better, responding to and learning from incidents, and improving ways for software engineers to share knowledge. Before joining Honeycomb.io, Paul worked in Platform and SRE teams at Under Armour, PagerDuty, and SoundCloud.

Links Referenced:

Honeycomb.io
Follow Paul on Twitter
Paul’s Blog

Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Cloud Economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of Cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

This episode is sponsored by our friends at New Relic. If you’re like most environments, you probably have an incredibly complicated architecture, which means that monitoring it is going to take a dozen different tools. And then we get into the advanced stuff. We all have been there and know that pain, or will learn it shortly, and New Relic wants to change that. They’ve designed everything you need in one platform with pricing that’s simple and straightforward, and that means no more counting hosts. You also can get one user and a hundred gigabytes a month, totally free. To learn more, visit newrelic.com. Observability made simple.

Corey: This episode has been sponsored in part by our friends at Veeam. Are you tired of juggling the cost of AWS backups and recovery with your SLAs? Quit the circus act and check out Veeam. Their AWS backup and recovery solution is made to save you money—not that that’s the primary goal, mind you—while also protecting your data properly. They’re letting you protect 10 instances for free with no time limits, so test it out now. You can even find them on the AWS Marketplace at snark.cloud/backitup. Wait? Did I just endorse something on the AWS Marketplace? Wonder of wonders, I did. Look, you don’t care about backups, you care about restores, and despite the fact that multi-cloud is a dumb strategy, it’s also a realistic reality, so make sure that you’re backing up data from everywhere with a single unified point of view. Check them out at snark.cloud/backitup.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined this week by Paul Osman, who is either a lead engineer or the lead engineer at Honeycomb, overseeing instrumentation. Paul, welcome to the show, and which is it?

Paul: Thanks so much for having me, Corey. Happy to be here. Lead instrumentation engineer? I don't want to say I'm the lead instrumentation engineer; that seems to put too much weight of responsibility on my shoulders.

Corey: Well, that's the whole point. It's all about weight of responsibility. That's why I'm mispronouncing it; it's actually ‘lead’ engineer, and it's all based upon density and the fact that you are never ever going to float.

Paul: Absolutely. [laugh]. Especially when you're putting software libraries in your systems. That's what you want to think about.

Corey: Absolutely. Everyone talks about these lightweight instrumentation frameworks. No, no. You go the opposite. You are the heavyweight instrumentation framework.

Paul: [laugh]. We will become the center of gravity in your system.

Corey: Exactly. Not quite the direction Honeycomb has chosen to go for a variety of reasons, not least among them being that it's a terrible idea. So, what do you do? What does instrumentation engineering look like at a company that is fundamentally—well, I'll get in trouble if I call them anything other than an observability company, but instrumentation is kind of what they do.

Paul: Exactly. Yeah. The most succinct way I can think about it is, my team works on the tools that help you get data into Honeycomb. So, if you think about a system like Honeycomb, you've got the platform, you've got the web UI, and then you've got everything that runs on a user's or customer’s system. And that's my team.

Corey: So, fundamentally, you're in charge of the agents, the embedded SDKs, the libraries that people shove into their systems, the—depending on how you're orchestrating it these days—the 800 Lambda functions, CloudWatch integrations, and whatnot, and run this magic CloudFormation template that instruments all of my AWS accounts to hurl information into your system. That sort of thing?

Paul: Absolutely. And the list is long, you're right. [laugh].

Corey: It turns out that the more that you put into your system, architecturally, the more things there are to monitor.

Paul: Exactly, and to pull data out of. You mentioned Lambda, and there's a whole bunch of interesting ways that you can get data out of Lambda functions. Who knew? It's not just—

Corey: Especially their new extensions API—

Paul: Yeah.

Corey: —which is super interesting. I haven't gone diving into it in any depth yet, but I like the idea.

Paul: I really like the idea. I'm a big fan of serverless in general. You could call me a convert because I was honestly skeptical at first, but the idea of creating a platform where you just ship freaking code and you don't worry about anything else. Now, having the ability to run processes in parallel, run sidecars in a serverless environment, I think is really, really cool.

Corey: There's so much capability that's, I guess, fantastic to see. It's amazing to, I guess, look at the complexity of even toy applications. And, on the one hand, it's, “Wow, what an amazing system I've built with all of these different services, and everything tied together, and the way that it interacts with another, and even if it's well-instrumented, that's great.” And then the other side of it is, “So, what does this application do?” “It shows people pictures of cats.” And that's really it.

And at some point, it feels like this is painfully overwrought. Now, this is not a new problem; it feels like that's a bit of a cyclical thing. Things get so complex it no longer fits in anyone's head anymore, and then there's a collapsing function of an abstraction layer that winds up becoming broadly adopted, and then the cycle repeats anew. At least, that's my impression on this, having spent the better part of the last two decades in the ops engineering space, but you have spent two decades in the ops and engineering space. What's your take on it?

Paul: Yeah. This is something I wrestle with a lot. The idea of complexity, right? You can look at a lot of these sort of architectural guides and just go, “Holy crap, there's a lot there.” And sometimes that's what you need.

So, I think struggling, or balancing, or figuring out the balance between needed complexity and kind of accidental or complexity debt is key there. For a simple thing, you want the simplest thing that could possibly work.

Corey: Yeah, and there's never any real tacit acknowledgment of that. It always seems that these frameworks and tools and the rest have, “Example one is ‘Hello, world.’ Example two is ‘Hello to the entire world.’” and it—great, not all of that stuff is needed for every environment, but you also probably don't want to build something hyperscale on the first example. There has to be some point of complexity where, okay, at this scale, the complexity trade-off is well worth doing and in fact, it's dangerous to not have it. That's not everything. Not every system needs to scale globally at all times. Now, that enrages some people when I point it out, but it's true.

Paul: Oh, yeah. And the type of scaling that you need is also highly dependent on the workloads that you're managing. You mentioned I come from an ops background. I was working as an SRE before I joined Honeycomb, and one of the things I've always tried to stick to, not always successfully, is the least amount of technology possible. If you're dealing with something that just has to horizontally scale out and you've got a pretty consistent workload, maybe you don't need it to be running a container orchestrator. Maybe you just need an ALB that can do horizontal scaling on CPU usage or something.

Corey: Yeah, it winds up being a problem when I talk about my philosophy on things as a best practice because I say other things that tend to fly directly against that. Somewhat recently, I got in trouble—again—on Twitter—again—for bringing up the idea that setting your database to your local timezone is a terrible idea. Put it in UTC, and then let the presentation layer figure it out from there, and the answer—legitimately—was, “Look, it's a local payroll app that's only for a one-branch company in a single timezone. Why would you ever need to worry about that?” Well, for that kind of story, my position is if you're building this small thing, great, leave the door open for it to potentially become a big thing. 95 percent of apps will never hit a point of success where they need to go hyperscale, but for those 5 percent that do, don't bury landmines they are going to trip over down the road when that time comes.
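[Editor's illustration, not part of the conversation: the "store UTC, convert at the presentation layer" advice can be sketched in a few lines of Python. The function names `to_storage` and `to_display`, the payroll scenario's fixed UTC-05:00 offset, and the sample date are all assumptions made for the example.]

```python
from datetime import datetime, timedelta, timezone, tzinfo


def to_storage(moment: datetime) -> str:
    """Normalize an aware datetime to UTC and serialize it for storage."""
    if moment.tzinfo is None:
        # A naive datetime carries no offset, so it cannot be converted safely.
        raise ValueError("refusing to store a naive datetime; attach a timezone first")
    return moment.astimezone(timezone.utc).isoformat()


def to_display(stored: str, viewer_tz: tzinfo) -> str:
    """Parse a stored UTC timestamp and render it in the viewer's timezone."""
    moment = datetime.fromisoformat(stored)
    return moment.astimezone(viewer_tz).strftime("%Y-%m-%d %H:%M %z")


# Hypothetical single-branch office at a fixed UTC-05:00 offset. Real code
# would more likely pass a zoneinfo.ZoneInfo region so that daylight saving
# rules are applied; a fixed offset keeps this sketch dependency-free.
branch_tz = timezone(timedelta(hours=-5))
clock_in = datetime(2020, 11, 2, 9, 0, tzinfo=branch_tz)

stored = to_storage(clock_in)          # what goes into the database
shown = to_display(stored, branch_tz)  # what the local user sees
```

[With this split, adding a second branch in another timezone only means passing a different `viewer_tz` to `to_display`; nothing already stored has to be rewritten.]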

Paul: Exactly. One of the things that can be challenging with examples like that is there are defaults. And we're not always aware of the consequences of accepting some of the defaults. And it can be really hard as engineers to think through, “What is the reversibility of this setting that I'm accepting, or this state that I'm accepting?” And if the answer is that it's going to be really hard to reverse, then maybe you want to think twice before doing that.

Corey: One of the problems that I keep seeing is that there's a lack of awareness of how to build hyperscale applications, and it occurred to me that part of the reason is that no one knows how to build a web property with hundreds of millions of users. I think that's true. Every company that has done that has had to figure it out as they go, for their particular workload, for their particular constraints. And this is proven out by the fact that if you talk to any hyperscale company about their application architecture, how things are built, ignore what they say at conferences on stage, pay attention to what they say at conferences in the bar after you pour six beers into them, and they all admit that it's crap. “Everything we've done is garbage. We're doing as best we can, but there's a lot of rough edges. It feels like we're always a hair's breadth from disaster.” I can't shake the feeling that we're all just making it up as we go along.

Paul: I totally think we are. You mentioned earlier, best practices, and what the hell are best practices when they're so highly dependent on the specific architectural decisions made, on the traffic patterns, on the social aspects of how an organization works? I've had the good fortune of being part of a few teams that have had to scale up to hundreds of millions of users, and no story has been correct. This is one of the things that always used to annoy me about—I'm glad to say it doesn't seem to happen as much anymore, but when people would point at specific technologies, like, “Ruby doesn't scale,” or something like that.

That's a meaningless statement. What does that mean? It certainly has scaled for some people in some environments; it just depends on what you're actually doing. And like you said, there's no blanket advice that seems to work for everybody. There are principles, I think. And if we worked really hard, we could probably dig out some of those principles. But the idea that there's a one size fits all pattern, that seems to come from people who are trying to sell you something.

Corey: Oh, yeah. At the time that we're doing this recording, there was recently a great tweet by GitHub—or GifHub depending upon pronunciations—’s CTO.

Paul: Well, I'm Canadian. So, you know.

Corey: Oh yeah. The best part of this show is mispronouncing things. It's not Postgres, it's Postgres-squeal. I digress, the question that he was asking was, “If you're going to start a new company today, what technical stack do you pick? What cloud provider? What language?” Et cetera, et cetera. And my response to it is, “Oh, that's easy. It's the one that the engineers I'm hiring are conversant with and want to work in.”

Paul: Yeah.

Corey: Because I could look around the landscape and see an awful lot of business failures for a variety of reasons. I'm really hard-pressed to identify any of them as, “Ah, they picked the wrong technical stack.”

Paul: Yeah. How many companies have actually been sunk by a decision like that? It literally never happens. And for what it's worth, I completely agree. The right tech stack is the tech stack that you have experience with, the tech stack that you're comfortable with. Way more important.

And it's funny because people—I don't know, sometimes I feel like we talk about this less, but it’s, how comfortable are you with everything else? Who cares what programming language your code is written in if you're not confident in the way that you actually deploy changes? Or if you're not confident in the way that you configure how traffic is routed to it? That stuff, all—I would say—arguably matters a lot more than the actual expression of business logic that gets converted into machine code.

Corey: It really is. And that's what I want to ask you about, too, is that you have exposure to a bunch of different stacks, presumably because you are the instrumentation engineer who's made of lead, and you wind up building these integrations into every godforsaken stack that all of your customers are going to be using, or any of your customers are going to be using, which means that you get to touch a lot of different languages, you get to touch a lot of different platforms, presumably. Is that correct? Or am I—

Paul: Oh, yeah.

Corey: —dramatically overestimating Honeycomb’s compatibility with different systems?

Paul: Oh, no, no. You are absolutely on the nose there. When I was being interviewed by Honeycomb, we have a coding exercise that we send to a lot of candidates, and the only difference with me from an average product or platform engineer at the company was they had me do it in a number of languages just to see how comfortable I was moving from one platform to another because being on the instrumentation team, that is definitely part of the job.

Corey: So, at this point, it's one of those questions that I always used to ask my parents, “Am I the favorite, or is my brother?” And the answer that they gave was, “You're my children. I can't stand either one of you.” So, to that end, what is your favorite stack to integrate with, and your least favorite stack? Because, you know, it's not really a podcast unless you enrage people.

Paul: I'm pausing intentionally because we've been interrupted by an adorable three-year-old.

Corey: Aw. Yeah, I have one of those, too, lurking around here somewhere.

Paul: [laugh].

Corey: And an infant, but that's a separate problem.

Paul: Oh, yeah. [aside] Hey, can you go play with mama? [pause] She literally just came in, stole my phone, and now ran away.

Corey: Oh, yep. Sounds like a very similar story here. Also, thank you for not apologizing. It drives me nuts when people apologize for having the temerity to have a family.

Paul: Oh, right. Especially now, right? When we're all in our home.

Corey: Like, when the kid wanders in when you’re on a video call, “Excuse me.” She lives here; you don’t.

Paul: Yeah. Especially nowadays, when we're all literally working in our homes, right?

Corey: Oh my God, yes.

Paul: It's like, you're in her home, not the other way around.

Corey: I've also never asked an employee or colleague to turn on their camera.

Paul: Mmm. Oh, very good point. Yeah. Especially right now. That's a great [00:13:54 crosstalk].

Corey: Excuse me, invite me into your home like I'm some sort of godforsaken corporate vampire? No, thank you. We hit a perfect stopping point. What is your favorite stack to integrate with and your least favorite stack? Go.

Paul: Right, yeah, so 100 percent based on what we were saying earlier, the ones that I prefer, I'm going to surprise you: they're the ones that I have the most experience working in. [laugh]. And so I've trained my brain to think in a number of different ways, I think fairly well. I'm a really big fan of functional programming—a little. So, I like languages that tend to support a little bit of functional programming.

I come from a background—accidentally, I ended up doing a lot of Scala at a lot of different companies. And so I'm very happy working there. But conversely, I also really like working in Go, one of the languages that is often kind of made fun of—lovingly—for being a very basic language, and it's not too fancy in terms of features.

Corey: I want to be very clear here that my position is that language bigotry is awful.

Paul: Oh, yeah.

Corey: It's one of those ways of gatekeeping and it drives me nuts. It doesn't matter what language you pick, I can write shitty code in all of them.

Paul: Absolutely. And I have and I will.

Corey: It didn’t even compile, it's so bad. Personally, I don't get JavaScript to save my life. It does not match my understanding of the world. Python, conversely, is something that aligns much better with how I see things, and Ruby was also a great [00:15:12 unintelligible] for me for a while. I was also heavily into Perl for a long time. But again, as an old ops person, my favorite language is and always will be, bash scripting.

Paul: Oh, beautiful. Yes. It's funny, I have a very similar experience. Maybe it's something about us ops people, but JavaScript, I have not trained my brain to work that way. I completely agree with you about language bigotry being awful and a form of gatekeeping, and so my approach is when I see somebody who's proficient in JavaScript and can write wonderful applications in Node, or browser applications in React, I'm in [BLEEP] awe. It's just a way that I haven't managed to make my brain as compatible.

Corey: The challenge, of course, is that it's your responsibility to fundamentally support all stacks. So, how do you approach doing an integration in a language or stack with which you're not familiar?

Paul: Yeah. That's a great question. So, part of it is, you just kind of dive in and kind of work through it, which I think if you've worked in enough companies that have different languages and different stacks, you might have some experience doing. I've worked in companies where [laugh]—I worked in one company once where we started the whole microservices journey, and we regretted this decision—spoiler—but we said everybody can choose whatever language they want to use because it doesn't matter at the end of the day; we're all talking to each other over HTTP and JSON APIs. So, that resulted in this Cambrian explosion and, surprise, if you wanted to go and work on something on a different team, or that a different team had created, it's going to be in a language you may have never even seen before.

And so part of it is, you just got to kind of dive in and be willing to learn. Where there are real gaps or weaknesses, that's where hiring becomes important. It's funny, I've been a hiring manager in the past in previous lives, and I've been involved in hiring processes at a bunch of different companies, and I'm very opposed to just hiring based on specific technology or language experience. But sometimes you'd have to say, “Oh, it's a real bonus if this person fills a gap that we don't have [laugh] on the team at the moment.”

Corey: Oh, absolutely. I think that hiring is one of those hard parts where it's easy to fall into the very common trap of never ever wanting to hire someone who's weak in something, as opposed to, “Okay, great. Maybe your Python is crappy, but we have three engineers already who are great with it. But if you know Ruby and we don’t, cool.” That's a strength, not a weakness.

Hire for strengths. Forget the, “I can come up with some puzzler problem to put on a whiteboard that'll stump you.” Hell with that. Show me what you're best at. I want to see you shine. I don't want to see what it looks like when you're sitting there flailing because you haven't brushed up on your CompSci curriculum in 20 years.

Paul: Oh, God. Absolutely. I was very pleasantly surprised—as an aside when I was interviewing this last round, and I joined Honeycomb about a year ago—I did a pretty extensive job hunt, and I ended up doing a fair number of on-sites, I think it was like six in total, which seems exhausting now just thinking about it. But I was so relieved that no one had asked [00:18:07 crosstalk]—

Corey: Six conversations or six different trips to San Francisco to visit them on-site?

Paul: Three of them were trips, two of them were remote, and one of them was local.

Corey: Okay, those are actual separate interviews.

Paul: That's right.

Corey: With different folks at different times. Okay. Yeah, that's a lot of back and forth.

Paul: It's a decent amount. But I was so pleasantly surprised to see that nobody asked me one of those whiteboard questions. Not a single thing that would show up in [BLEEP] Cracking the Coding Interview, or LeetCode, or whatever other tool you want.

Corey: Yeah, part of it is also just this—it's almost corporate hazing sense. It sounds weird, especially given that, let's be honest here, most of the audience of this show has an engineering background, but I personally find hiring folks who are either engineers or engineering adjacent to be way easier than a lot of other hires. For example, if I'm hiring another cloud economist who needs to be able to delve into AWS and have some SRE experience, and be able to look at this from a financial analysis perspective, great. I've done a lot of that myself. I know exactly what to look for, what to ask, what to uncover.

Whereas if I'm hiring for, I don't know, a product marketer, or an accountant, or a graphic designer, I have no earthly idea how to even frame the question. Part of the challenge, then, is that in many cases, if you're not reaching out to experts who are great at this stuff to help with the winnowing and interviewing process, you're probably going to wind up hiring the person who sounds the most confident, which is kind of awful.

Paul: Right. Exactly. I think the only thing I've ever found that can even begin to crack that for me, is ask people what they've done and then delve into really, really specific follow-ups. If somebody comes in and says, “I'm great at X,” great. Tell me about a time when you used X to a good result.

And obviously, you're going to run into people who are just really good at self-selling, but I think if you ask enough follow-ups and if you look for things like communication skills, their ability to connect their effort with outcomes and things like that, you can still get pretty good results.

Corey: This episode is sponsored in part by ChaosSearch. Now their name isn’t in all caps, so they’re definitely worth talking to. What is ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes, not days or weeks. Click. Boom. Done. ChaosSearch is for you if you’re trying to get a handle on processing multiple terabytes, or more, of log and event data per day, at a disruptive price. One more thing, for those of you that have been down this path of disappointment before, ChaosSearch is a fully managed solution that isn’t playing marketing games when they say “fully managed.” The data lives within your S3 buckets, and that’s really all you have to care about. No managing of servers, but also no data movement. Check them out at chaossearch.io and tell them Corey sent you. Watch for the wince when you say my name. That’s chaossearch.io.

Corey: I think you're probably right. I think that there's a lot to be said for digging into things. What I love is asking open-ended questions in interviews. And at some point, one of us is going to get to a point of, “I don't know.” I'm either learning something, or I'm seeing how people think and what they do when they hit a wall which, especially for senior roles, is incredibly important. You don't want folks who are going to sit and not go anywhere, it's, “Great. I'm blocked. How do I resolve this? What do I do?”

Paul: Yeah.

Corey: And in my case, it's reach out to people, look on the internet, do some searching, but don't sit there and stand at the whiteboard and tear up. It's one of those, yeah, we don't know these things off the top of our heads. No one does. So, ask. That's the point. I want to see people saying that they don't know how to do something.

Paul: Yeah. And this is one of the hardest things to do, but when you do manage it—and I don't have the perfect answer, but I've seen it—when you get some kind of collaboration happening in the actual interview, and you get a sense of, “Oh, my gosh, this is what it would be like working with this person because we're actively collaborating on a problem that none of us know the actual answer to.” In other words, what we're paid to do, day-to-day. [laugh].

Corey: So, to that end, I have to ask you, given that you see a lot of this, what makes writing slash shipping slash producing software harder than it needs to be?

Paul: You know, I think there's a few different things there. Writing and communicating, I mean, that's hard because you're dealing with human beings. And to our previous discussion about software stacks, and tools, and tech, and processes, there's no perfect answer. And so the hard thing is figuring out, what do you actually need to communicate? What do you actually need to do?

In terms of shipping software? I think that that comes from making it harder than it needs to be by creating situations where you're scared to touch anything. My background as an SRE, the thing that always terrifies me the most is the service or the software that people don't touch very often. It's the stuff that, maybe it's harder to find out how it works, or how it breaks or whatnot because, frankly, you just never have a need to interact with it. That's the stuff that really scares the crap out of me.

Corey: From your perspective, I guess, what's the interesting part of software versus what's the part of it that no engineer should ever have to touch? Or do again? What is the valuable part? What should engineers of the future be building, focusing on, working on, and what should folks never think about again? I like the fact that you're coming from an engineering perspective because normally if I ask questions like that, it turns into a sales pitch answer.

Paul: Yeah, [laugh] exactly. It's funny because I find myself kind of conflicted here, between what I like to do and what I believe to be actually correct. And what I mean there is, I like thinking about all the plumbing that makes software go; I like thinking about infrastructure, and I like thinking about writing tools and helping create things that make it easier for other software developers to push code to production, and help users, and delight users, and all that sort of thing. And that's exactly what most businesses shouldn't have to worry about. They shouldn't have to employ people like you or me—from ops backgrounds—who just know how to make the stuff go because that should just be a given.

I was talking earlier about serverless and some stuff that I think is hopeful there. The average software developer, I think, who wants to delight users, who wants to create things that create value for a business and for customers, they don't want to care if it's running on Kubernetes or if it's running on Spot instances or things like that. They just want to push it, and they want to go. The tricky part comes in when it breaks. And when it breaks, we want something where we have that sort of ability to introspect and debug, even if it's hidden behind some kind of abstraction. And that's a balance that I don't think we've seen yet in the industry. But I think we're getting closer.

Corey: Well see, when I have conversations with folks like you, and we discuss these types of things, and the answers always seem so eminently reasonable, and then I leave the ivory tower of my podcasting studio and go back into the world, and then I see the nonsense everyone's building instead. It feels like on some level, there's two worlds: the aspirational way that we all want to be doing things, and then the messy way that we really are doing things. And I’m starting to despair of ever being able to fully bridge that gap.

Paul: Oh, interesting. By the ivory tower, what would be an example of an ivory tower perspective or point of view?

Corey: Oh, sure. Any conference talk you've ever seen on any technology under the sun, where they talk about how they wind up seamlessly deploying software into production. CI/CD stories, for example, are notorious for this. It's the—you watch these amazing presentations like, “Wow, I'd love to work in a place that did things like that.” And the person next to you says, “Yeah, me too.” And you look at their badge, and they work at the company the presenter works at.

Paul: [laugh].

Corey: It’s the myths we tell ourselves. Sometimes individual groups wind up solving these problems within larger companies. Sometimes it's a new thing that they're running in test but haven't rolled out everywhere and, let's not kid ourselves, if it touches the payment system, everyone's doing waterfall development whether they admit it or not. But there's a broader world out there of folks who want to be doing things the right way, they want to be getting rid of the boilerplate and stop reinventing the wheel and re-implementing the wheel and get on to doing the truly interesting and innovative stuff. And those people right now are also listening to this while going back to code a login page. You never get past it on some level. That's what bugs me.

Paul: And you know what's super interesting about that? In my experience, which may not be representative, but the places that I've seen that have accomplished the closest to that kind of story, have done it in really simple, almost kludgy ways. And what I mean by that is, like, I've never personally worked somewhere where we had this great system that tracks state of all of these different services and made sure that there is, like, traffic going from here to there in a way that was canary testing and everything. You know, that all sounds like a lot of moving parts; the best places I've worked have a freaking cron script that just pushes out changes or has a webhook that kicks off something that pulls down a tarball from an S3 bucket and then ships it to a machine. Oftentimes this stuff, I think it doesn't make for sexy conference talks, but it's just roll up your sleeves kind of work to get it happening, and then move on to something else. I think sometimes we maybe trip ourselves up wanting it to be more interesting than it actually is.
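[Editor's illustration, not part of the conversation: a minimal sketch of the kind of deliberately simple deploy step Paul describes, something a cron job or webhook handler could call. It assumes the tarball has already been downloaded from wherever it is stored and comes from a trusted build. The `deploy` function, the `releases` directory layout, and the `current` symlink convention are assumptions for the example, not anything described on the show.]

```python
import os
import tarfile
from pathlib import Path


def deploy(tarball: Path, releases_dir: Path, current_link: Path) -> Path:
    """Unpack a release tarball and point `current_link` at the new release."""
    # Name the release directory after the tarball: app-42.tar.gz -> app-42
    release = releases_dir / tarball.name.removesuffix(".tar.gz")
    release.mkdir(parents=True, exist_ok=True)

    # Assumes the archive comes from a trusted build pipeline.
    with tarfile.open(tarball) as archive:
        archive.extractall(release)

    # Create the new symlink next to the old one, then rename it into place.
    # The rename swaps the link in one step, so `current` never goes missing.
    staging = current_link.with_name(current_link.name + ".new")
    if staging.is_symlink():
        staging.unlink()
    staging.symlink_to(release.resolve())
    os.replace(staging, current_link)
    return release
```

[Because every release stays on disk in its own directory, rolling back in this sketch is just pointing `current` at the previous directory again.]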

Corey: That's part of it. If we were completely honest with people at what we were actually building or working on at any given point in time, the answer would be incredibly depressing and we would just be sadder after explaining our jobs to people. I try not to give talks to classrooms full of schoolchildren anymore on what I do for a living for that specific reason.

Paul: And yet—I agree, but isn't it great sometimes that this shit works? If the point is to deliver value to customers quickly and efficiently, maybe investing just enough to make that work repeatedly and in a way that people trust, and frankly, is simple enough that you can also debug when it doesn't do the thing that it's supposed to do, maybe that's actually enough. Maybe we're sometimes overinvesting in complicated solutions that might fall into that accidental complexity scenario we were talking about earlier.

Corey: Well, all right. Let's take that to its logical extent here. Here's something I know for a fact you have an opinion on. Now, I have opinions on things, too, which would surprise no one who listens to this show, but what do you think stops engineers from wanting to be on-call for the service that they work on? Now, there are a couple of answers I have to that: one is the polite public answer, and one's the real answer. But I'd like to hear your answer.

Paul: Yeah, sure. I want to come back to the difference between the public and the private answer, too. My answer is, there's a whole bunch of stuff. One of them is social—and I think this is more common—is that engineers are on call for things, and they don't feel like they have necessarily the autonomy to actually react to things the way that they need to.

And what I mean by that is, like, I've certainly seen this, and I've done my part to try to fix it, or to encourage others to, or empower people to fix it, or whatever the hell you need to do, but people get paged and they're like, “Oh, that alert means nothing.” “Okay, so get rid of that alert.” “I can't do that.” “Why not? [laugh]. Just do it.”

If you are getting paged for something, you have the right to change the system that is alerting you to something. And what you're seeing on the ground level, as the on-call engineer, should be gospel; it should be the thing that dictates how the future person experiences that role. And if you don't have that, it's a really shitty experience.

I think the other thing—that's technical—is this notion of accidental complexity, is when you have a system that you're responsible for, that you're on call for, and it's just, whether it's because of over-engineering, or it's just out of necessity complex, you don't know how to insert yourself when it [BLEEP] up, right? Like you can start to look at it and say, “Okay, we've got a drop in traffic,” or, “We've got a spike in error rate,” or something like that, but if you're nervous about the actual mechanics that get your changes from your laptop to the production environment, then it can be a really terrifying experience to make changes. And I've been in environments where people just freeze, and it sucks. So, that's why I always think of, like, if you can make it easier to get the changes from your laptop to production, that is the best investment that you can possibly make, technically.

Corey: I would agree with that sentiment. It feels like when you talk to software developers who are building these systems and then complaining about a problem in production. “Here, log into the prod server and see.” Well, this looks nothing like their IDE; it looks nothing whatsoever like their development environment. And people feel awkward and out-of-sorts there.

I mean, in years past when I was working in ops roles, I made production uncomfortable to work in, intentionally so, because that's not your default place to operate in. But if people are used to using Visual Studio Code, for example, then, “Okay, now the only editor we have installed here is vi, so you're going to have to spend some time learning, even to look at what's going on here.” That's an awful experience, not to mention that people are never doing these things during the workday, invariably. It's always two in the morning when you're bleary-eyed and have no idea what you're doing. And, congratulations, you're being confronted by the Puzzle Master. It doesn't go well.

Paul: No. And that's actually a great point that I think is within our control as engineering teams to change. Yeah, it'll happen at two in the morning, that's for sure. Any 24/7 service that you're on call for, it's going to break at an uncomfortable time, and you're going to have to debug it, but that doesn't have to be the only time you do this stuff. And in fact, when it is the only time you do this stuff, that's terrible.

And that's why I'm a big proponent of have fire drills, have game days, break the shit that breaks often so that it breaks when everybody's around. Resolve those kinds of uncertainties as much as you can—because, obviously, some things are just unknowable—but practice those muscles as often as possible. There's a funny thing that we talk about sometimes at Honeycomb, and this sounds like a humblebrag, but it's not—it's just that there are periods of time where we don't have as many incidents, and that makes it actually really hard to make sure that people are primed to be on-call. And so we're thinking through what can we do to just make it more comfortable? Like if someone comes on board, and their first on-call rotation is quiet, that doesn't really help them. So, what can we do to, kind of, force interaction with production as often as possible to make it almost routine and muscle memory?

Corey: I talked with companies back when I was looking at various roles where it was, “Oh, everyone is on call.” And you hear that during an interview, and having been through many on-call rotations myself, it's, “Yeah, that's not a strong point, to be perfectly honest with you.” That sounds like, if you're not very careful how you position this, that everyone is woken up for every incident, and I won't get a whole lot of sleep working here, and not to be unkind, you're not paying significantly more than other folks who don't subject me to that.

Paul: Yeah, that's terrible. Everybody is on call. That reminds me of the companies that I worked for… I don't know, before a certain time when, I don't know, maybe it was when PagerDuty became a ubiquitous tool that was used in companies of a certain size. But it was that old time when the first person to respond is really the person who's on call. And that's a terrible environment, and that's a recipe for burnout.

You should have a clear escalation path, and you should have clear responsibilities, and every engineer should have a huge chunk of time when they're not on call, and they know that they're not on call so they can delete Slack from their phone; they can turn off all of their alerts. And when they're done at the end of the day, they're just done. So, yeah, I would also run screaming from a company who said that, these days. Anecdotally, there's two questions that I always like to ask companies when I'm looking for jobs, and talking to companies.

One is, “How do you get your code to production? Walk me through as many steps as you're comfortable disclosing in an interview,” which hopefully is a lot. And two is, “How are people put on call? And what's the last major incident you had, and what does it look like? Who was involved? What happened? How did the people get the support they needed?” All those questions are really, really interesting ones to dive into. I wish you had more opportunities than interviewing to ask other companies this stuff. [laugh].

Corey: Oh, yeah. My personal favorite way of responding to that, which is why I generally don't get offered jobs a whole lot, is, “So, you have an on-call rotation here?” “Oh, yes. It's absolutely critical that our site is up all the time.” “Cool. So, why don't you staff multiple shifts of people who are responsible for keeping the site up during those times so that you're not making people wake up in the middle of the night to break things?” And suddenly we're in one of those “what I say versus what I do are different” territories. And that becomes a problem.

Paul: Oh, are you talking about, like, follow the sun rotations?

Corey: Either follow the sun or having folks who are either night owls who enjoy night shifts or something for—and I'm not talking small startups here. I'm talking companies that have 1500 engineers working there. It's at some point you have multiple offices in various places. Why are you still waking people up in the primary timezone every week?

Sorry. When I say primary timezone, I should be very explicit on this: there's always a timezone hierarchy in every company. It's the headquarters time, and that is how it's going to be regardless of what companies claim otherwise.

Paul: Of course. It's the center of their universe.

Corey: And invariably, it seems to be Pacific West Coast.

Paul: Yes, exactly. Yeah, I agree completely. At a certain size, and there are plenty of companies that I think are doing this, you have the opportunity to let folks in North American time zones just stop working. And then folks in European time zones will take over. And then folks in certain Asian time zones will take over. And yeah, that is a great way to do things, I think, if you can manage it.

Corey: So, I guess my last question for you, since I've been peppering you with these—is if people want to learn more, where can they find you?

Paul: I am on Twitter. I'm not sure how much value you'll get from my tweets, but every once in a while, maybe I'll tweet something that at least provokes some discussion: @paulosman. And I very occasionally blog at paulosman.me. And I think that's it.

Corey: And we'll put links to those in the show notes.

Paul: Excellent.

Corey: Paul, thank you so much for taking the time to speak with me today. I really appreciate it.

Paul: No problem, Corey. I really enjoyed the discussion. Thanks a lot.

Corey: As did I. Paul Osman, lead (or lead) engineer of instrumentation at Honeycomb. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice and a comment telling me why your on-call rotation is different and unique.

Announcer: This has been this week’s episode of Screaming in the Cloud. You can also find more Corey at screaminginthecloud.com, or wherever fine snark is sold.

This has been a HumblePod production. Stay humble.
