Episode Show Notes & Transcript
- Honeycomb: https://www.honeycomb.io/
- Twitter: https://twitter.com/Mike_Goldsmith
- Honeycomb blog: https://www.honeycomb.io/blog
- LinkedIn: https://www.linkedin.com/in/mikegoldsmith/
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. This promoted guest episode is brought to us by our friends at Honeycomb who I just love talking to. And we’ve gotten to talk to these folks a bunch of different times in a bunch of different ways. They’ve been a recurring sponsor of this show and my other media nonsense, they’ve been a reference customer for our consulting work at The Duckbill Group a couple of times now, and we just love working with them just because every time we do we learn something from it. I imagine today is going to be no exception. My guest is Mike Goldsmith, who’s a staff software engineer over at Honeycomb. Mike, welcome to the show.
Mike: Hello. Thank you for having me on the show today.
Corey: So, I have been familiar with Honeycomb for a long time. And I’m still trying to break myself out of the misapprehension that, oh, they’re a small, scrappy, 12-person company. You are very much not that anymore. So, we’ve gotten to a point now where I definitely have to ask the question: what part of the observability universe that Honeycomb encompasses do you focus on?
Mike: For myself, I’m very focused on the telemetry side, so I work on the tools that customers deploy in their own infrastructure to collect all of that useful data, which we can then send on to Honeycomb to make use of and help identify where the problems are, where things are changing, and how we can best serve that data.
Corey: You’ve been, I guess on some level, there’s—I’m trying to make this not sound like an accusation, but I don’t know if we can necessarily avoid that—you have been heavily involved in OpenTelemetry for a while, both professionally, as well as an open-source contributor in your free time because apparently you also don’t know how to walk away from work when the workday is done. So, let’s talk about that a little bit because I have a number of questions. Starting at the very beginning, for those who have not gone trekking through that particular part of the wilderness-slash-swamp, what is OpenTelemetry?
Mike: So, OpenTelemetry is a vendor-agnostic set of tools that allow anybody to collect data about their system and then send it to a target back-end to make use of that data. The data, the visualization tools, and the tools that make use of that data are a variety of different things, so whether it’s tracing data or metrics or logs, it’s about trying to take value from that. The big thing that OpenTelemetry is aimed at doing is making the collection of the data and the transit of the data to wherever you want to send it a community-owned resource, so you don’t get vendor lock-in where you start with one vendor, then want to go and try a different tool, and you’ve got to re-instrument or change your application heavily to make use of that. OpenTelemetry abstracts all that away, so all you need to know about is what you’re instrumented with, what [unintelligible 00:03:22] can make of that data, and then you can send it to one or multiple different tools to make use of that data. So, you can even compare some tools side-by-side if you wanted to.
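The "instrument once, send to one or multiple tools" idea Mike describes can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the OpenTelemetry API: the `Span`, `Exporter`, `InMemoryBackend`, and `Tracer` names are all hypothetical stand-ins chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Span:
    """One unit of work; field names are illustrative, not the OTel data model."""
    name: str
    duration_ms: float
    attributes: dict[str, str] = field(default_factory=dict)


class Exporter(Protocol):
    """Anything that can receive a finished span."""
    def export(self, span: Span) -> None: ...


class InMemoryBackend:
    """Hypothetical stand-in for a vendor back-end; it just stores what arrives."""
    def __init__(self, label: str) -> None:
        self.label = label
        self.received: list[Span] = []

    def export(self, span: Span) -> None:
        self.received.append(span)


class Tracer:
    """The application talks only to this class, never to a vendor directly."""
    def __init__(self, exporters: list[Exporter]) -> None:
        self._exporters = list(exporters)

    def record(self, name: str, duration_ms: float, **attributes: str) -> Span:
        span = Span(name, duration_ms, dict(attributes))
        for exporter in self._exporters:  # fan out to every configured back-end
            exporter.export(span)
        return span


# Two back-ends receive identical data, so they can be compared side by side.
vendor_a = InMemoryBackend("vendor-a")
vendor_b = InMemoryBackend("vendor-b")
tracer = Tracer([vendor_a, vendor_b])
tracer.record("checkout", 182.0, user_id="u-42")
```

Swapping or adding a back-end here means changing the list passed to `Tracer`, while the `record` call sites in the application stay the same, which is the lock-in point being made.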
Corey: So, given that it’s an open format, from the customer side of the world, this sounds awesome. Is it envisioned that this is something that gets instrumented at the application itself? Or, once I send this data to Honeycomb, can I then instrument what Honeycomb sees about that and send that onward somewhere else, maybe my ancient rsyslog server, maybe a different observability vendor that has a different emphasis? Like, how is it envisioned unfolding within the ecosystem? In other words, can I build a giant ring of these things that just keep building an infinitely expensive loop?
Mike: Yeah. So ideally, you would try to pick one or a few tools that will provide the most value that you can send to, and then they could answer all of the questions for you. So, at Honeycomb, we are primarily focused on tracing because we want to capture application-level information to say, this user had this interaction, this is the context of what happened, these are the things that they clicked on, this is the information that flowed through your back-end system, this is the line-item order that was generated, the email content, all of those things all linked together so we know that person did this thing, it took this amount of time, and then over a longer period of time, from the analytics point of view, you can then say, “These are the most popular things that people are doing. This is typically how long it takes.” And then we can highlight outliers to say, “Okay, this person is having an issue.” This individual person, we can identify them and say, “This is an issue. This is what’s different about what they’re doing.”
So, that’s quite a unique tracing tool or opportunity there. So, that lets you really drive what’s happening rather than what has happened. So, logs and metrics are very backward-looking to say, “This is the thing that happened,” and try to give you the context about it. Tracing tries to give you that extra layer of context to say that this thing happened and it had all of these things related to it, and why is it interesting?
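To make the "link everything together, then highlight outliers" description concrete, here is a rough sketch in plain Python. It assumes a simplified span record and a simplified rule (a trace is slow if its total time exceeds three times the median); neither is Honeycomb’s actual method. Summing span durations also ignores the nesting that real traces have.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Span:
    trace_id: str      # shared by every span in one user interaction
    name: str
    duration_ms: float
    user_id: str       # context carried on the span itself


def trace_durations(spans: list[Span]) -> dict[str, float]:
    """Total time per trace, summing the spans that share a trace_id."""
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        totals[span.trace_id] += span.duration_ms
    return dict(totals)


def slow_traces(spans: list[Span], factor: float = 3.0) -> list[str]:
    """Trace ids whose total duration exceeds `factor` times the median total."""
    totals = trace_durations(spans)
    typical = median(totals.values())
    return sorted(t for t, total in totals.items() if total > factor * typical)


def users_affected(spans: list[Span], trace_ids: list[str]) -> set[str]:
    """The individual users behind the given traces."""
    wanted = set(trace_ids)
    return {span.user_id for span in spans if span.trace_id in wanted}


spans = [
    Span("t1", "http.request", 40.0, "alice"),
    Span("t1", "db.query", 20.0, "alice"),
    Span("t2", "http.request", 45.0, "bob"),
    Span("t2", "db.query", 25.0, "bob"),
    Span("t3", "http.request", 50.0, "carol"),
    Span("t3", "db.query", 900.0, "carol"),
]
outliers = slow_traces(spans)
```

Because each span carries the user identity, going from "this trace is unusual" to "this person is having an issue" is a lookup rather than a separate log search.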
Corey: It’s odd to me that vendors would be putting as much energy into OpenTelemetry—or OTel, as it seems to always be abbreviated as when I encounter it, so I’m using the term just so people, “Oh, wait, that’s that thing I keep seeing. What is that?” Great—but it seems odd to me that vendors would be as embracing of that technology as they have been, just because historically, I remember whenever I had an application when I was using production in anger—which honestly, ‘anger’ is a great name for the production environment—whenever I was trying to instrument things, it was okay, you’d have to grab this APM tools library and instrument there, and then something else as well, and you wound up with an order of operations where which one wrapped the other. And sometimes that caused problems. And of course, changing vendors meant you had to go and redeploy your entire application with different instrumentation and hope nothing broke. There was a lock-in story that was great for the incumbents back when that was state of the art. But even some of those incumbents are now embracing OTel. Why?
Mike: I think it’s because it’s showing that there’s such a diverse group of tools there, and [unintelligible 00:06:32] being the one that you’ve selected a number of years ago and then they could hold on to that. The momentum slowed because they were able to move at a slower pace because they were the de facto tooling. And then once new companies and competitors came around and were open to trying to get a part of that market share, it’s given customers the opportunity to really pick the tool that is right for the job, rather than just what is perceived to be the best tool because it’s the largest one or the one that most people are using. OpenTelemetry makes an organization that’s providing those tools focus on being the best at it, rather than just the biggest one.
Corey: That is, I think, a more enlightened perspective than, frankly, I expect a number of companies out there to have taken, just because lock-in seems to be the order of the day for an awful lot of companies. Like, “Okay, why are customers going to stay with us?” “Because we make it hard to leave,” is… I can understand the incentive, but that only works for so long if you’re not actively solving a problem that customers have. One of the challenges that I ran into, even with OTel, back when I was last trying to instrument a distributed application, was the fact that the application was built entirely on Lambda. And it felt like the right answer was to, oh, just use an OTel layer, a Lambda layer that wound up providing the functionality you cared about.
But every vendor seemed to have their own. Honeycomb had one, Lightstep had one, AWS had one, and now it’s oh, dear, this is just the next evolution of that specific agent problem. How did that play out? Is that still the way it works? Are there other good reasons for this? Or is this just people trying to slap a logo on things?
Mike: So, the vendor specifics, what you’ve suggested there around, like, Honeycomb or other organizations providing the layers: they’re trying to simplify the usage of the SDK by making some of those assumptions for you, that you are going to be sending telemetry to Honeycomb, and that you are going to be passing an API key that is in a particular format. That makes it easier to pass that information into the SDK so it knows how to communicate, as well as where it’s going to communicate that data to.
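One way to picture what such a vendor layer does is as a thin wrapper that fills in defaults for a generic configuration. The sketch below is illustrative only: `ExporterConfig`, `generic_config`, and `vendor_config` are hypothetical names, and the endpoint and header are deliberately fake placeholders rather than any real vendor’s values.

```python
from dataclasses import dataclass, field


@dataclass
class ExporterConfig:
    """What a generic SDK needs to know before it can send anything."""
    endpoint: str
    headers: dict[str, str] = field(default_factory=dict)


def generic_config(endpoint: str, headers: dict[str, str]) -> ExporterConfig:
    """Vendor-neutral path: the caller supplies every detail."""
    return ExporterConfig(endpoint=endpoint, headers=dict(headers))


# Hypothetical vendor defaults; not a real endpoint or header name.
VENDOR_ENDPOINT = "https://telemetry.example-vendor.test:443"
VENDOR_KEY_HEADER = "x-example-vendor-key"


def vendor_config(api_key: str) -> ExporterConfig:
    """Vendor wrapper: one argument in, the same generic config out."""
    if not api_key:
        raise ValueError("api_key must not be empty")
    return generic_config(VENDOR_ENDPOINT, {VENDOR_KEY_HEADER: api_key})


config = vendor_config("abc123")
```

The wrapper adds no new capability. It encodes where the data goes and how the key is passed, so the application only has to supply the key.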
Corey: There’s a common story that I tend to find myself smacking into almost against my will, where I have found myself at the perfect intersection of a variety of different challenges, and for some reason, I have stumbled blindly and through no ill intent into ‘this is terrible’ territory. I wound up finally getting blocked and getting distracted by something else shiny on this project about two years ago because the problem I was getting into was, okay, I got to start sending traces to various places and that was awesome, but now I wanted to annotate each span with a user identity that could be derived from code, and the way that it interfaced with the various Lambda layers at that point in time was, ooh, that’s not going to be great. And I think there were a couple of GitHub issues opened on it as feature enhancements for a couple of layers. And then I, again, was still distracted by shiny things and never went back around to it. But I was left with the distinct impression that building something purely out of Lambda functions (and also probably popsicle sticks) is something of an edge case. Is there a particular software architecture or infrastructure architecture that OTel favors?
Mike: I don’t think it favors any in particular, but it definitely suffers because, as I said earlier, the single SDK is trying to be available to many different use cases, which has its own challenges because then it has to deal with so many different options. But I don’t think OpenTelemetry has a specific use case in mind. Tracing is definitely focused on application telemetry. So, it’s focused on the code that you build yourself and then deploy. There are other tools that can collect operational data; things like the OpenTelemetry Collector are then available to sit outside of that process and say, what’s going on in my system?
But yeah, I wouldn’t say that there’s a specific infrastructure that it’s aimed at doing. A lot of the cloud operators and tools are trying to make sure that that information is available and OpenTelemetry SDKs are available. But yeah, at the moment, it does require some knowledge around what’s best for your application if you’re not in complete control of all of the infrastructure that it’s running in.
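Corey’s earlier goal of annotating each span with a user identity derived from code can be sketched as a hook that runs whenever a span starts. This is a standard-library illustration of the general pattern, assuming a hypothetical `Tracer` with start hooks; it does not show how any particular Lambda layer or SDK exposes this.

```python
from contextvars import ContextVar
from dataclasses import dataclass, field
from typing import Callable

# Holds the identity of the user for the request currently being handled.
current_user: ContextVar[str] = ContextVar("current_user", default="anonymous")


@dataclass
class Span:
    name: str
    attributes: dict[str, str] = field(default_factory=dict)


# A hook receives each new span and may add attributes to it.
StartHook = Callable[[Span], None]


def add_user_identity(span: Span) -> None:
    """Copy the user identity from request context onto the span."""
    span.attributes["user.id"] = current_user.get()


class Tracer:
    """Hypothetical tracer that runs every hook on every new span."""
    def __init__(self, hooks: list[StartHook]) -> None:
        self._hooks = list(hooks)

    def start_span(self, name: str) -> Span:
        span = Span(name)
        for hook in self._hooks:  # runs for every span, so no call site is missed
            hook(span)
        return span


def handle_request(tracer: Tracer, user_id: str) -> list[Span]:
    """Set the user for this request, create spans, then restore the context."""
    token = current_user.set(user_id)
    try:
        return [tracer.start_span("handler"), tracer.start_span("db.query")]
    finally:
        current_user.reset(token)


tracer = Tracer([add_user_identity])
spans = handle_request(tracer, "u-42")
```

Putting the annotation in one hook rather than at each span creation is the design choice here: it only works if the surrounding layer lets application code register such a hook, which is the kind of gap Corey describes running into.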
Corey: It feels that with most things that are sort of pulled into the orbit of the CNCF (and OTel is no exception to this) that there’s an idea that oh, well, everything is going to therefore be running in containers, on top of Kubernetes. And that might be unfair, but it also, frankly, winds up following pretty accurately what a lot of applications I’m seeing in client environments have been doing. Don’t take it as a criticism. But it does seem like it is designed with an eye toward everything being microservices running on containers, scheduled with, from an infrastructure perspective, what appears to be willy-nilly abandon, and how do you wind up gathering useful information out of that without drowning in data? That seems to be, from at least my brief experience with OTel, the direction it heads in. Is that directionally correct?
Mike: Yeah, I think so. I think OpenTelemetry has quite a strong relationship with CNCF and therefore Kubernetes. That is a use case that we see as very common with customers that we engage with, both at the prospect level and in initial conversations; people using something like Kubernetes to do the application orchestration is very, very common. It’s something that OpenTelemetry and Honeycomb are wanting to improve on as well. Because it is so common, we want to provide a very good experience and have a very good, strong opinion around, well, if you’re running in Kubernetes, these are the tools and these are the right ways to use OpenTelemetry to get the best out of it.
Corey: I want to change gears a little bit. Something that’s interested me about Honeycomb for a while has been its culture. Your founders have been very public about their views on a variety of different things that are not just engineering-centric, but tangential to it, like, engineering management: how not to be terrible at it. And based on a huge number of conversations I’ve had with folks over there, I’m inclined to agree that the stories they tell in public do align with how things go internally. Or at least if they’re not, I would not expect you to admit it on the record, so either way, we’ll just take that as a given.
What I’m curious about is that you are many timezones away from their very nice office here in San Francisco. What’s it like working remote in a company that is not fully distributed? Which is funny, we talk about distributed applications as if they’re a given but distributed teams are still something we’re wrangling with.
Mike: Yeah, it’s something that I’ve dealt with for quite a while; for maybe seven or eight years I’ve worked with a few different organizations that are not based in my timezone. There’ve been a couple primarily based in the San Francisco area, so Pacific Time. An eight-hour time difference for the UK has its own challenges, but it also has a lot of benefits, too. So typically, I get to really have a lot of focus time on a morning. That means that I can start my day, look through whatever I think is appropriate for that morning, and not get interrupted very easily.
I get a lot of time to think and plan and I think that’s helped me at, like, the tech lead level because I can really focus on something and think it through without that level of interruption that I think some people get if they’re working in the same timezone or even in the same office as someone. That approachability is just not naturally there. But the other side of that is that I have a very limited amount of natural overlap with people I work with on a day-to-day basis, so it’s typically meetings from 2 till 5 p.m. most days to try and make sure that I build those social relationships, I’m talking to the right people, giving status updates, planning and that sort of thing. But it works for me. I really enjoy that balance of having a lot of focus time and then having dedicated time to spend with people.
And I think that’s really important as well: a distributed team naturally means that you don’t get to spend a lot of time with people, a lot of, like, one-on-one time with people, so that’s something that I definitely focus on, doing a lot of social interaction as well. So, it’s not just I have a meeting, we’ve got a stand-up, we’ve got 15 minutes, and then everyone goes and does their own thing. I like to make sure that we have time so we can talk, we can connect to each other, we know each other, things that would [unintelligible 00:16:35] that allow a space for conversations to happen that would naturally happen if you were sat next to somebody at a desk, or like, the more traditional, like, water cooler conversations. You hear somebody having a conversation, you go talk to them, that naturally evolves.
Corey: That was where I ran into a lot of trouble with it myself. In my first outing as a manager, most of the people on my team were in the same room as I was, and then we had someone who was in Europe. And as much as we tried to include this person in all of our meetings, there was an intrinsic, “Let’s go get a cup of coffee,” or, “Let’s have a discussion and figure things out.” And sometimes it’s four in the afternoon, we’re going to figure something out, and they have long since gone to bed or have a life, hopefully. And it was one of those areas where despite a conscious effort to avoid this problem, it was very clear that they did not have an equal voice in the team dynamic, in the team functioning, in the team culture, and in many cases, some of the decisions we ultimately reached as an outgrowth of those sidebar conversations. This led to something of an almost religious belief for me, for at least a while, that either everyone’s distributed or no one is because otherwise you wind up with the unequal access problem. But it’s clearly worked for you folks. How have you gotten around that?
Mike: For Honeycomb, it was a conscious decision not long before the Covid pandemic that the team would be distributed first; the whole organization would be distributed first. So, a number of months before that happened, the intention was that anybody across the organization, which at the time was only North America-based staff, would be able to do their job outside of the office. Because I think around the end of 2019 to the beginning of 2020, a lot of the staff were based in the San Francisco area and that was starting to grow, and we wanted more staff to come into the business. And there were more opportunities for people outside of that area to join the business, so the business decided that if we’re going to do this, if we’re going to hire people outside of the local area, then we do want to make sure that, as you said, everybody has equal access and everyone has an equal opportunity to participate. And that has definitely fed through the pandemic, and then even when the office reopened and people could go back into the office. I think there’s only… maybe 25% of the company now is even in the Pacific Time Zone. And then the office space itself is not very large considering the size of the company, so we couldn’t fit everybody into our office space if we wanted to.
Corey: Yeah, that’s one of the constant growing challenges, too, that I understand that a lot of companies do see value in the idea of getting everyone together in a room. I know that I, for example, I’m a lot more effective and productive when I’m around other people. But I’m really expensive to their productivity because I am Captain Interrupter, which, you know, we have to recognize our limitations as we encounter them. But that also means that the office expense exceeds the AWS bill past a certain point of scale, and that is not a small thing. Like, I try not to take too much of a public opinion on should we be migrating everyone back to return-to-office as a mandate, yes, no, et cetera.
I can see a bunch of different perspectives on this that are nuanced and I don’t think it lends itself to my usual reactionary take on the Twitters, as it were, but it’s a hard problem with no easy answer to it. Frankly, I also think it’s a big mistake to do full-remote only for junior employees, just because so much of learning how the workforce works is through observation. You don’t learn a lot about those unspoken dynamics in any other way than observing it directly.
Mike: Yes, I fully agree. I think at the stage that Honeycomb was at when I joined and has continued to be, a very junior person joining an organization that is fully distributed is more challenging. It has different challenges, but it has more challenges because you can’t just see something happening and know that that’s the norm or that that’s the expectation. You’ve got to push yourself into those different arenas, those different conversations, and it can be quite daunting when you’re new to an organization, especially if you are not experienced in that organization or experienced in the role that you’re currently occupying. Yeah, I think fully distributed has its challenges, and something that we do at Honeycomb is that we intentionally, twice a year, maybe three times a year, bring together the people that work very closely, so they have that opportunity to work together, build those social interactions like I mentioned earlier, and then do some work together as well.
And it builds a stronger trust relationship because of that, as well because you’re reinforcing the social side with the work side in a face-to-face context. And there’s just, there’s no direct replacement for face-to-face. If you worked for somebody and never met them for over a year, it’d be very difficult to then just be in a room together and have a normal conversation.
Corey: It takes a lot of effort because there’s so much to a company culture that is not meetings or agenda-driven or talking about the work. I mean, companies get this wrong with community all the time where they think that a community is either a terrible option of people we can sell things to or more correctly, a place where users of our product or service or offering or platform can gather together to solve common challenges and share knowledge with each other. But where they fall flat often is it also has to have a social element. Like ohh, having a conversation about your lives is not on topic for this community Slack team is, great, that strangles community before it can even form, in many cases. And work is no different.
Mike: Yeah, I fully agree. We see that with the Honeycomb Pollinators Slack channel. So, we use that as a primary way of community members to participate, talk to each other, share their experiences, and we can definitely see that there is a high level of social interaction alongside of that. They connect because they’ve got a shared interest or a shared tool or a shared problem that they’re trying to solve, but we do see, like, people, the same people, reconnecting or re-communicating with each other because they have built that social connection there as well.
And I think that’s something that OpenTelemetry as a community is more welcoming to. And then you can participate in something that transcends the different organizations that you may work for as well, because you’re already part of this community. So, if that community then reaches to another organization, there’s an opportunity to move between organizations and then maintain a level of connection.
Corey: That seems like one of the better approaches that people can have to this stuff. The hard part, of course, is how do you change culture? I think the easy way to do it (the only easy way to do it) is you have to build the culture from the beginning. Every time I see companies bringing in outsiders to change the corporate culture, I can’t help but feel that they’re setting giant piles of money on fire. Culture is one of those things that’s organic and just changing it by fiat doesn’t work. If I knew how to actually change culture, I would have a much more lucrative target for my consultancy than I do today. You think AWS bills are a big problem? Everyone has a problem with company cultures.
Mike: Yeah, I fully agree. I think you’re right that culture is something that is very organic; it naturally happens. I think there’s value when organizations go through, like, a retrospective: what is our culture? How would we define it? What are the core values of that and how do we articulate that to people that might be coming into the organization? That’s very valuable, too, because those core values are very useful to communicate to people.
So, one of the bigger core values that we’ve got at Honeycomb is what we refer to as, “We hire adults,” meaning that when somebody needs to do something, they just can go and do it. You don’t have to report to somebody, you don’t have to go and tell somebody, “I need a doctor appointment,” or, “I’ve got to go and pick up the kids from school,” or something like that. You’re trusted to do your job to the highest level, and if you need additional help, you can ask for it. If somebody requires something of you, they ask for it. They do it in a humane way and they expect to be treated like a human and an adult all of the time.
Corey: On some level, I’ve always found, for better or worse, that people will largely respond to how you treat them and live up or down to the expectation placed upon them. You want a bunch of cogs who are going to have to raise their hand to go to the bathroom? Okay, you can staff that way if you want, but don’t be surprised when those teams don’t volunteer to come up with creative solutions to things either. You can micromanage people to death.
Mike: Yeah. Yeah, definitely. I’ve been in organizations, like, fresh out of college and had to go to work at a particular place and it was very time-managed. And I had inbound sales calls and things like that and it was very, like, you’ve spent more than three minutes on a wrap-up call from having a previous call, and if you don’t finish that call within three minutes, your manager will call your phone to say, “You need to go on to the next call.” And it’s… you could have had a really important call or you could have had a very long call. They didn’t care. They just wanted—you’ve had your time now move on to the next one and they didn’t care.
Corey: One last question I want to ask you about before we wind up calling this an episode, and it distills down to I guess, effectively, your history, for lack of a better term. You have done an awful lot of Go maintenance work—Go meaning the language, not the imperative command, to be clear—but you also historically were the .NET SDK maintainer for something or other. Do you find those languages to be similar or… how did that come to be? I mean, to be clear, my programming languages of choice are twofold: both brute force and enthusiasm. Most people take a slightly different path.
Mike: Yeah, I worked with .NET for a very long time, so the first place that I joined as a real organization after finishing college was .NET and it just sort of stuck. I enjoyed the language. At the time, sort of 12, 15 years ago, the language itself was moving pretty well, there were things being added to it, it was enjoyable to use.
I think Go takes away some of that because if you don’t know those ecosystems or if you don’t know those tools, you can still solve the problem fairly quickly and fairly simply. Tools will help but they’re not required. .NET is probably on the boundary for me. It’s still very easy to use, I enjoy using it, but I would say it’s not that long ago that I switched from thinking like a .NET developer; whenever I was forming code in my head, like, how I would solve a problem, for a very long time it was in .NET and C#.
I’d probably say in the last 12 months or so, it’s definitely moved more to Go just because of the simplicity. And it’s also the tool that is most used within Honeycomb, especially, so if you’re talking about Go code, you’ve got a wider audience to bounce ideas off, to talk to, communicate, get ideas from. .NET is not a very well used language within Honeycomb and probably even, like… even maybe West Coast-based organizations, it seems to be very high-level organizations that are willing to pay their money up for, like, Microsoft support. Like, Go is something that a lot of developers use because it’s very simple, very quick, can move quick.
Corey: I found that it was very easy for me to pick up Go to build out something ridiculous a few years back when I needed to control my video camera through its ‘API’ to use the term charitably. And it just works in a way that made an awful lot of sense. But I still find myself reaching for Python or for, God help me, TypeScript if I’m doing some CDK work these days. And honestly, they all tend to achieve more or less the same outcome. It’s just different approaches to, well, to be unkind, dependency management in some cases, and also the ecosystem around it and what is done for you.
I don’t think there’s a bad language to learn. I don’t want this to be interpreted as language snobbery, but I haven’t touched anything in the Microsoft ecosystem for a long time in production, so .NET was just never on my radar. But it’s clear they have an absolutely massive community ecosystem built around it and that is no small thing. I’d say it rivals Java.
Mike: Yeah, definitely. I think over the last ten years or so, the popularity of .NET as a language to build with in the enterprise has grown, especially as larger-scale organizations have taken it on, and then, like, six, seven years ago, they introduced the .NET Core Framework, which allowed it to run on non-Windows platforms, and that accelerated the language dramatically, so they have a consistent API that can be used on Windows, on Linux, Mac, and that makes a huge difference for creating a larger audience for people to interact with it. And then also, with Azure becoming much more popular, people who are typically used to using Linux as the operating system that runs their infrastructure can use this language without being forced to use Windows, which is probably quite a big thing for Azure as well.
Corey: I really want to thank you for taking the time to talk about what you’re up to over there. If people want to learn more, where’s the best place for them to go find you?
Mike: Typically, I use Twitter, so it’s Mike_Goldsmith. I create blogs on the Honeycomb blog website, which I’ve done a few different things; I’ve got a new one coming up soon to talk about different ways of collecting data. So yeah, those are the two main places. LinkedIn is usual as ever, but that’s a little bit more work-focused.
Corey: It does seem to be. And we’ll put links to all of that in the show notes. Thank you so much for being so generous with your time, and of course, thank you Honeycomb for sponsoring this episode of my ridiculous podcast.
Mike: Yeah, thank you very much for having me on.
Corey: Mike Goldsmith, staff software engineer at Honeycomb. I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice along with an insulting comment that we will then have instrumented across the board with a unified observability platform to keep our days eventful.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.