Whiteboard Confessional: How Cluster SSH Almost Got Me Fired

Episode Summary

Join me as I continue a new series called Whiteboard Confessional with a deep dive into Cluster SSH, how I landed my first role in a production-style environment at a university, how engineering work is much different in academia than in the for-profit world, the journey that led me to find Cluster SSH and how the tool works, how Unix admins generally get interested in backups right after they really need to have backups that are working, why restores are harder than backups, why systems that are doing configuration management need to understand the concept of idempotence, tools to use instead of Cluster SSH, and more.

Episode Show Notes & Transcript

About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.



Corey: On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying. CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than replicating massive clusters that you have to care and feed for yourself, instead, you now get to focus on just storing data, treating it like you normally would other S3 data and not replicating it, storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.

So, once upon a time, way back in the mists of antiquity, was a year called 2006. This is before many folks listening to this podcast were involved in technology. And I admit as well that it is also several decades after other folks listening to this podcast got involved in technology. But that’s not the point of this story. It was my first real job working in anything resembling a production-style environment. I’d dabbled before this, running various environments on Windows desktop style support. I’d played around with small business servers for running Windows-style environments. And then I decided there wasn’t much of a future in technology and spent some time as a technical recruiter, spent a little bit more time working in a sales role, which I was disturbingly good at, but I was selling tape drives to people. But that’s not the interesting part of the story. What is, is that I somehow managed to luck my way into a job interview for a university, helping to run their Linux and Unix systems. 

Cool. Turns out that interviewing is a skill like any other. The technical reviewer was out sick that day, and they really liked both the confidence of my answers, as well as my personality. That’s two mistakes right there. One; my personality is exactly what you would expect it to be. And two; hiring the person who sounds the most confident is exactly what you don’t want to do. It also tends to lend credence to people who look exactly like me. So I had converted some systems over in the first few months for that role to FreeBSD, which is like Linux, except it’s not Linux. It’s a Unix and it’s far older, derived from the Berkeley software distribution. and managing a bunch of those systems at scale was a challenge. Now understand, in this era scale meant something radically different than it does today. I had somewhere between 12 and 15 nodes that I had to care about. Some more mail servers. Some were NTP servers, of all things. Utility boxes here and there, the omnipresent web servers that we all dealt with, the Cacti box whose primary job was to get compromised and serve as an attack vector for the rest of the environment, etcetera. 

This was a university. Mistakes didn’t necessarily mean the same thing there as they would in revenue-generating engineering activities. I was also young, foolish, and the statute of limitations is almost certainly expired by now. So, running the same command on every box was annoying. This was in the days before configuration management was really a thing. BCFG2 was out there and incredibly complex. And CFEngine was also out there, which required an awful lot of in-depth arcane knowledge that I frankly didn’t have. Remember, I bluffed my way into this job and was learning on the fly. So I did a little digging and, lo and behold, I found a tool that solved my problems. called ClusterSSH. And oh, was it a cluster. The way that this works was that it would spin up different xterm windows on your screen that you could then provide a list of hosts for, and it would open one for every host you gave it. 

Great. So now I’m logged into all of those boxes at once. If this is making you cringe already, it probably should, because this is not a great architectural pattern. But here we are, we’re telling this story, so you probably know how that worked out. One of the intricacies of FreeBSD is that instead of running systems that turn things on or turn things off, as far as services to start on boot. For example, with Red Hat derived systems, before the dark times of systemd, you could write things like chkconfig, that’s C-H-K, the word config, and then you could give a service and tell it to turn it on or off at certain run levels. This is how you would tell it to, for example, start the webserver when you boot, otherwise, you reboot the system, the webserver does not start, and you wonder why TCP now terminates on the ground. This was all controlled via a single file—/etc/rc.conf. That controlled which services were allowed to start, as well as which services were going to be started automatically on boot. It would generally be a boolean value provided to the particular service name. 

Well, I was trying to do something, probably, I want to say, NTP related, but don’t quote me on that, where I wanted to enable a certain service to start on all of the systems at once. So I typed a command, specifically echoing the exact string that I wanted in quotes, so it would be quoted appropriately, and then with the right angle bracket, to that file—/etc/rc.conf, and then I pressed enter. Now, for those who are unaware of Unix-isms and how things work shell, a single right angle bracket means overwrite this file, two angle brackets say append to the end of this file. I was trying to get the second one, and instead, I wound up getting the first. So suddenly, I had just rewritten all of those files across every server. Great plan, huh? Well, I realized what I’d done as soon as I checked my work to validate that the system had taken the update appropriately, it had not, it had taken something horrifying up instead. What happened next? Great question.

But first, in the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast. 

So, I’m suddenly staring at a whole bunch of systems that now have a corrupted configuration. Fortunately, this hadn’t taken anything down, at the moment. And it wouldn’t until one of these systems was restarted. Now, these are Unix boxes, so they don’t tend to get restarted all that often. But it’s got to be fixed and immediately because one, power outages always happen when you least expect them to, and two, leaving a landmine like that for someone else is what we call a career-limiting move in almost every shop, even a university, which is not typically known as a place that’s easy to get fired from. But I could’ve managed if I’d left that lying around. So the trick that I found to fixing all of this was logging into every one of those boxes by hand and taking a look to see what services were currently running on those boxes and then reconstructing what that file should have looked like, which was just an absolute treasure and a joy. 

Now well, hang on a second, why didn’t I restore from the backups that were being taken of these systems? What part of “first Unix admin job” are you not hearing? Backups were a thing that were on my list to get to eventually. You get really interested in backups right after you really needed to have backups that were working. Also, it turns out backups are super easy. It’s restores that are difficult and if you can’t restore, you don’t really have a backup. So at the end of going through all of those nodes one by one, over the course of about four hours, I’d managed to successfully reconstruct each of their files. Then what I wound up doing was very carefully restarting each one in sequence during a maintenance window later that afternoon, and validating, once I got in, that they continued to do the things that they were doing. I would compare what was currently running as a process versus what had been running before I restarted them. Suddenly, I’m very diligent about taking backups and keeping an eye on what exactly was running on a particular box. And by the time I got through that rotation, I was a) lot more careful, and b) everything had been restored, and there was no customer-facing impact. 

Now, all of that’s a very long story. But what does it have to do with the Whiteboard Confessional? What was the architectural problem here? The problem fundamentally, was that I was managing a fleet, even a small one, of systems effectively by hand. And this sort of mistake is incredibly common when you run the wrong command on the wrong box. There was no programmatic element to it, there was no rollback strategy at all. And there’s a lot of different directions that this could have gone through. For instance, I could have echoed that command first, just from a safety perspective, and validated what it did. I could have backed up the files before making a change to it. I could have tested this on a single machine instead of the entire production fleet. But most relevantly to the architectural discussion here, I could have not used freakin’ ClusterSSH. The problem, of course, is that instead of having a declarative state that you’re defining what your system should look like, you’re saying run this arbitrary command through what’s known as an imperative style of configuration management. This pattern continues to exist today across a wide variety of different systems and different environments. If you take a look at what Ansible does under the hood, this is functionally what it does—any config management system does—it runs a series of commands and drops files in place to make sure a system looks a certain way. 

If you’re just telling it to go ahead and run a particular command, like “create a user,” every time that command runs, it’s going to create a new user and you wind up with a whole bunch of users that don’t belong there, that don’t need to exist. Thousands upon thousands of users on a system from one dating back to every time the configuration management system runs. That’s how you open bank accounts at Wells Fargo, not how you intelligently managed systems at significant scale. So, making sure that your systems that are doing configuration management understand a concept of idempotence is absolutely critical. The idea being that I should be able to run the same thing multiple times and it not wind up destroying or duplicating or going around in circles in any meaningful way. That is the big lesson of configuration management. And today, systems that AWS offers, like AWS Systems Manager Session Manager, can have this same problem. The same with their EC2 Instance Connect. You can run a whole bunch of scripts and one-liners on a variety of nodes, but you’ve got to make sure that you test those things. You’ve got to make sure that there’s a rollback. You have to test on a subset of things, or you’re finding yourself recording embarrassing podcasts like this one, years later, once the statute of limitations has expired. No one is born knowing this, and none of these things are intuitively obvious, until the second time. Remember, when you don’t get what you want, you get experience instead, and experience builds character. 

I am Cloud Economist Corey Quinn, and I am a character. Thank you for listening to this episode of the AWS Morning Brief Whiteboard Confessional. Please leave a five-star review on iTunes if you’ve enjoyed it. If you didn’t, please leave a five-star review on iTunes via a script that continues to write a five-star review on iTunes every time you run it.

Announcer: This has been a HumblePod production.

Stay humble.

Newsletter Footer

Get the Newsletter

Reach over 30,000 discerning engineers, managers, enthusiasts who actually care about the state of Amazon’s cloud ecosystems.

"*" indicates required fields

This field is for validation purposes and should be left unchanged.
Sponsor Icon Footer

Sponsor an Episode

Get your message in front of people who care enough to keep current about the cloud phenomenon and its business impacts.