All Along the Shoreline of Automation with Anurag Gupta

Episode Summary

This week Corey is joined by Anurag Gupta, founder and CEO of Shoreline. Anurag guides us through the large variety of services he helped launch, including RDS, Aurora, EMR, Redshift, and others. The result? Running things almost like a start-up—but with some distinct differences. Eventually Anurag ended up back in the testy waters of start-ups. He and Corey discuss the nature of that transition to get back to solving holistic problems, tapping into conveying those stories, and what Anurag was able to bring to his team at Shoreline, where automation is king. Anurag goes into the details of what Shoreline is and what they do. Stay tuned for more.

Episode Show Notes & Transcript



Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: Your company might be stuck in the middle of a DevOps revolution without even realizing it. Lucky you! Does your company culture discourage risk? Are you willing to admit it? Does your team have clear responsibilities? Depends on who you ask. Are you struggling to get buy-in on DevOps practices? Well, download the 2021 State of DevOps report, brought to you annually by Puppet since 2011, to explore the trends and blockers keeping evolution firms stuck in the middle of their DevOps evolution. Because they fail to evolve or die like dinosaurs. The significance of organizational buy-in, and oh it is significant indeed, and why team identities and interaction models matter. Not to mention whether the use of automation and the cloud translate to DevOps success. All that and more awaits you. Visit: to download your copy of the report now!

Corey: If you’re familiar with Cloud Custodian, you’ll love Stacklet, which is made by the same people who made Cloud Custodian, but put something useful on top of it so you don’t need to be a YAML expert to work with it. They’re hosting a webinar called “Governance as Code: The Guardrails for Cloud at Scale” because it’s a new paradigm that enables organizations to use code to manage and automate various aspects of governance. If you’re interested in exploring this you should absolutely make it a point to sign up, because they’re going to have people who know what they’re talking about—just kidding, they’re going to have me talking about this. It’s going to be on Thursday, July 22nd at 1pm Eastern. To sign up visit and I’ll talk to you on Thursday, July 22nd.

Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. This promoted episode is brought to you by Shoreline, and I’m certain that we’re going to get there, but first, I’m notorious for telling the story about how Route 53 is in fact a database, and anyone who disagrees with me is wrong. Now, AWS today is extraordinarily tight-lipped about whether that’s accurate or not, so the next best thing, of course, is to talk to the person who used to run all of AWS’s database offerings and start off there and get it from the source. Today, of course, he is not at Amazon, which means he’s allowed to speak with me. My guest is Anurag Gupta, the founder and CEO of Shoreline. Anurag, thank you for joining me.

Anurag: Thanks for having me on the show, Corey. It’s great to be on, and I’ve followed you for a long time. I think of you as AWS marketing, frankly.

Corey: The running gag has been that I am the de facto head of AWS marketing as a part-time gag because I wandered past and saw an empty seat and sat down and then got stuck with the role. I mostly kid, but there does seem to be, at times, a bit of a challenge as far as expressing stories and telling those stories in useful ways. And some mistakes just sort of persist stubbornly forever. One of them is in the list of services, Route 53 shows up as ‘networking and content delivery,’ which I think regardless of the answer, it doesn’t really fit there. I maintain it’s a database, but did you have oversight into that along with Glue, Athena, all the RDS options, managed blockchain—for some reason—as well. Was it considered a database internally, or was that not really how they viewed it?

Anurag: It’s not really how they view it. I mean, certainly there’s a long IP table, right, and routing tables, but I think we characterized it in a whole different org. So, I had responsibility for Analytics, Redshift, Glue, EMR, et cetera, and transactional databases: Aurora, RDS, stuff like that.

Corey: Very often when you have someone who was working at a very large company—and yes, Amazon has a bunch of small teams internally, but let’s face it, they’re creeping up on $2 trillion in valuation at the time of this recording—it’s fairly common to see that startups are, “Oh, this person was at Amazon for ages.” As if it’s some sort of amazing selling point because a company with, what is it, 1.2 million people give or take is absolutely like a relatively small just-founded startup culturally, in terms of resources, all the rest. Conversely, when you’re working at scales like that, where the edge case becomes the common case, and the corner case becomes something that happens 18 times an hour, it informs the way you think about things radically differently. And your reputation does precede you, so I’m going to opt for assuming that this is, rather than being the story about, “Oh, we’re just going to try and turn this company into the second coming of Amazon,” that there’s something that you saw while you were at AWS that you thought it was an unmet need in the ecosystem, and that’s what Shoreline is setting out to build. Is that slightly accurate? Or no you’re just basic—there’s a figurehead because the Amazon name is great for getting investors.

Anurag: No, that’s very astute. So, when I joined AWS, they gave me eight people and they asked me to go disrupt data warehousing and transaction processing. So, those turned into Redshift and Aurora, respectively, and gradually I added on more services. But in that sense, Amazon does operate like a startup. They really believe in restricting the number of resources you get so that you have time and you’re forced to think and be creative.

That said, you don’t really wake up at night sweating about whether you’re going to hit payroll. This is, sort of, my fourth startup at this point and there are sleepless nights at a startup and it’s different. I’d go launch a service at AWS and there’ll be 1000 people who are signed up to the beta the next day, and that’s not the way startups work. But there are advantages as well.

Corey: I can definitely empathize with that. My last job before I started this place was at a small scrappy startup which was great for three months and then BlackRock bought us, and then, oh, large regulated finance company combined with my personality ended about the way you think it would. And where, so instead of having the fears and the challenges that I dealt with then, I’m going to go start my own company and have different challenges. And yeah, they are definitely different. I never laid awake at night worrying about how I was going to make payroll, for example.

There’s also the freedom, in some ways, at large companies where whatever function needs to get done, whatever problem you have, there is some department somewhere that handles that almost exclusively, whereas in scrappy startup land, it’s, well, whatever problem needs to get done today, that is your job right now. And your job description can easily fill six pages by the end of month two. It’s a question of trade-offs and the rest. What did you see that gave you the idea to go for startup number four?

Anurag: So, when I joined AWS thinking I was going to build a bunch of database engines—and I’ve done that before—what I learned is that building services is different than building products. And in particular, nobody cares about your performance or features if your service isn’t up. Inside AWS, we used to talk about utility computing, you know, metering and providing compute storage database the way, you know, my local utility provider, PG&E, provides power and gas. And if I call up PG&E and say that the power is out at my house, I don’t really want to hear, “Oh, did you know that we have six nines power availability in the state of California?” I mean, the power is still out; go come over here and fix it. And I don’t really care about fancy new features they’re doing back at the plant. Really, all I care about is cost and availability.

Corey: The idea of utility computing got into that direction, too, in a lot of ways, in some strange nuances, too. The idea that when I flip the light switch, I don’t stop and wonder, is the light going to turn on? You know, until I installed IoT switches and then everything’s a gamble in the wild times again. And if the light doesn’t come on, I assume that the fuse is out, or the light bulb is blown. “Did PG&E wind up dropping service to my neighborhood?” is sort of the last question that I have down that list. It took a while for cloud to get there, but at this point, if I can’t access something in AWS, my default assumption is that it is my local internet, not the cloud provider. That was hard-won.

Anurag: That’s right. And so I think a lot of other SaaS companies—or anybody operating in the cloud—are now working and struggling to get that same degree of availability and confidence to supply to their customers. And so that’s really the reason for Shoreline.

Corey: There’s been a lot of discussion around the idea of availability and what that means for a business outcome where, I still tell the story from time to time that back in 2012 or so, I was going to buy a pair of underpants on Amazon, where I buy everything, and instead of completing the purchase, it threw one of the great pictures of staff dogs up. Now, if you listen to a lot of reports on availability, then for one day out of the week, I would just not wear underwear. In practice, I waited an hour, tried it again, the purchase went through and it was fine. However, if that happened every third time I tried to make a purchase, I would spend a lot more money at Target.

There has to be a baseline level of availability. That doesn’t mean that your site is never down, period, because that is, in many cases, an unrealistic aspiration and it turns every outage that winds up coming up down the road into an all-hands-on-deck five-alarm fire, which may not be warranted. But you do need to have a certain level of availability that meets or exceeds your customer’s expectations of same. At least that’s the way that I’ve always viewed it.

Anurag: I think that’s exactly right. I also think it’s important to look at it from a customer perspective, not a fleet perspective. So, a lot of people do inward-facing SRE measurements of fleet-wide availability. Now, your customer really cares about the region they’re in, or perhaps even the particular host they’re on. And that’s even more true if they’ve got data. So, for example, an individual database failing, it’ll take a long time for it to come back up elsewhere. That’s different than something more ephemeral, like an instance, which you can move more easily.
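
The gap between a fleet-wide number and what one customer experiences can be made concrete with a little arithmetic. The sketch below is illustrative only: the host names and health-check results are made up, and availability is simplified to the fraction of successful checks.

```python
from typing import Dict, List


def availability(samples: List[int]) -> float:
    """Fraction of health checks that succeeded (1 = up, 0 = down)."""
    return sum(samples) / len(samples)


def fleet_availability(checks: Dict[str, List[int]]) -> float:
    """Availability over every check from every host, pooled together."""
    pooled = [sample for samples in checks.values() for sample in samples]
    return availability(pooled)


# Hypothetical fleet: nine healthy hosts and one host down for half the period.
checks = {f"host-{i}": [1] * 100 for i in range(9)}
checks["host-9"] = [1] * 50 + [0] * 50

fleet = fleet_availability(checks)                # pooled, inward-facing view
worst_customer = availability(checks["host-9"])   # view of a customer pinned to host-9

print(f"fleet-wide: {fleet:.0%}, customer on host-9: {worst_customer:.0%}")
```

The pooled figure looks respectable while the customer pinned to the failing host sees half their requests fail, which is the distinction Anurag draws between fleet and customer perspectives.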

Corey: Part of the challenge that I’ve noticed as well when dealing with large cloud providers, a recurring joke has been the AWS status page: it is the purest possible expression of a static site because it never changes. And people get upset when things go down and the status page isn’t updated, but the challenge is when you’re talking about something that is effectively global scale, it stops being a question of is it up or is it down and transitions long before then into how up or how down is it? And things that impact one customer may very well completely miss another. If you’re being an absolutist, it will always be a sea of red, which doesn’t tell people anything useful. Whereas if a customer is down and their site is off, they don’t really care that most other customers aren’t affected.

I mean, on some level, you kind of want everyone to be down because that defers headline risk, as well as if my site is having a problem, it could be days before someone gets around to fixing a small bug, whereas if everything is down, oh, this will be getting attention very rapidly.

Anurag: That’s exactly right. Sounds like you’ve done ops before.

Corey: Oh, yes. You can tell that because I’m cynical and bitter about everything.

Anurag: [laugh].

Corey: It doesn’t take long working in operationally-focused roles to get there. I appreciate your saying that though. Usually, people say, “Let me guess. You used to be an ops person.” “How can you tell?” “Because your code is garbage,” is the other way that people go down that path.

And yeah, credit where due; they’re not wrong. You mentioned that back when you were in Amazon, you were given a team of eight people and told to disrupt the data warehouse. Yeah, I’ve disrupted the data warehouse as a single person before so it doesn’t seem that hard. But I’m guessing you mean something beyond causing an outage. It’s more about disrupting the space, presumably.

Anurag: [crosstalk 00:10:57].

Corey: And I think, looking back from 2021, it’s hard to argue that Amazon hasn’t disrupted the data warehouse space and fifteen other spaces besides.

Anurag: Yeah, so that’s what we were all about, sort of trying to find areas of non-consumption. So clearly, data was growing; data warehousing was not growing at the same rate. We figured that had to do with either a cost problem, or it had to do with a simplicity problem, or something else. Why aren’t people analyzing the data that they’re collecting? So, that led to Redshift. A similar problem in transaction processing led to Aurora and various other things.

Corey: You also said a couple of minutes ago that Amazon tends to talk more about features than they do about products, and building a product at a startup is a foundationally different experience. I think you’re absolutely on to something there. Historically, Amazon has folks get on stage at re:Invent and talk about this new thing that got released, and it feels an awful lot like a company saying, “Yeah, here’s some great bricks you can use to build a house.” “Well, okay. What kind of house can I build with those bricks?” “Here to talk about the house that they built as our guest customer speaker from Netflix.”

And it seems like they sort of abdicated, in many respects, the storytelling portion to a number of their customers. It is a very rare startup that has the luxury of being able to just punt on building a product and its product story that goes along with it. Have you found that your time at Amazon made storytelling something that you wound up missing a bit more, or retelling stories internally that we just don’t get to see from the outside, or is, “Oh, wow. I never learned to tell a story before because at Amazon, no one does that, and I have to learn how to do that now that I’m at a startup again?”

Anurag: No, I think it really is a storytelling experience. I mean, it’s a narrative-based culture there, which is, in many ways, a storytelling experience. So, we were trying to provide a set of capabilities so that people could build their own things, you know, much as Kindle allows people to self-publish books; we’re not really writing books of our own. And so I think that was the experience there. Outside, you are trying to solve more holistic problems, but you’re still only a puzzle piece in the experience that any given customer has, right? You don’t satisfy all of their needs, you know, soup to nuts.

Corey: And part of the challenge too, is that if I’m a small, scrappy startup, trying to get something out the door for the first time, the problems that I’m experiencing and the challenges that I have are radically different than something that has attained hyperscale and now has whole optimization stories or series of stories going on. It’s, will this thing even work at all is my initial focus. And in some ways, it feels like conference-ware cuts against a lot of that because it’s hard not to look at the aspirational version of events that people tell on stage at every event I’ve ever seen, and not come away with a takeaway of, “Oh. What I’ve built is actually terrible, and depressing, and sad.” One of the things that I find that resonates about what you’re building over at Shoreline is, it’s not just about the build things from scratch and get them provisioned for the first time. It’s about the ongoing operationalization, I think—if that’s a word—about that experience, and how to wind up handling the care and feeding of something that exists and is running, but is also subject to change because all things are continually being iterated on.

Anurag: That’s right. I feel like operation is sort of an increasingly important but underappreciated part of the service delivery experience much as, maybe, QA was a couple of decades ago. And over time we’ve gone and we built pipelines to automate our test infrastructure, we have deployment tools to deploy it, to configure it, but what’s weird is that there are two parts of the puzzle that are still highly manual: developing software and operating that software in production. And the other thing that’s interesting about that is that you can decide when you are working on developing a piece of code, or testing it, or deploying it, or configuring it. You don’t get to decide when the disk goes down or something breaks. That’s why you have 24/7 on-call.

And so the whole point of Shoreline is to break that into two problems: the things that are automatable, and make it easy, as trivial to automate those things away so you don’t wake up to do something for the tenth time; and then for the remaining things that are novel, to make diagnosing and repairing your fleet, as simple and straightforward as diagnosing and repairing a single box. And we do a lot of distributed systems [techs 00:16:01] underneath the covers to make that the case. But those are the two things that we do, and so hopefully that reduces people’s downtime and it also brings back a lot of time for the operators so they can focus on higher-value things, like working with you to reduce their AWS bill.

Corey: Yeah, for better or worse, working on the AWS bill is always sort of a backseat function, or a backburner function, it’s never the burning priority unless things have gone seriously awry. It’s a good governance thing; it’s the idea of where, let’s optimize this fixed unit economics. It is rarely the number one most pressing area of business for a company. Nor should it be; I think people are sometimes surprised to hear me say that. You want to be reasonable stewards of the money entrusted to you and you obviously want to continue to remain in business by not losing money on everything you sell, but trying to make it up in volume. But at some point, it’s time to stop cutting and focus instead on revenue growth. That is usually the path to success for almost every company I’ve ever spoken to, unless they are either very out of kilter, or in a very strange spot in the industry.

Anurag: That’s true, but it does belong, I think, in the ops function to do optimization of your experience, whether—and, you know, improving your resources, improving your security posture, all of those sorts of things fall into production ops landscape, from my perspective. But people just don’t have time for it because their fleets are growing far, far faster than their headcount is. So, the only solution to that is automation.

Corey: And I want to talk to you about that. Historically, the idea has been that you have monitoring—or observability these days, which I consider to be hipster monitoring—figuring out what’s going on in your environment. Then you wind up with incidents being declared when certain things wind up triggering, which presumably are things that actually matter and not, you’re waking someone up for vague reasons like ‘load average is high on these nodes,’ which tells you nothing in isolation whatsoever. So, you have the incident management portion of that [next 00:18:03], and that handles a lot of the waking folks up and getting everyone onto the call. You’re focusing on, I guess, a third tranche here, which is the idea of incident automation. Tell me about that.

Anurag: That’s exactly right. So, having been in the trenches, I never got excited about one more dashboard to look at, or someone routing a ticket to the right person, per se, because it’ll get there, right?

Corey: Oh, yeah. Like, one of the most depressing things you’ll ever see in a company is the utilization numbers from the analytics on the dashboards you build for people. They look at them the day you build them and hand it off, and then the next person visiting it is you while running this report to make sure the dashboard is still there.

Anurag: Yeah. I mean, they are important things. I mean, you get this huge sinking feeling something is wrong and your observability tool is also down like CloudWatch was in some large-scale events. Or if your ticketing system is down and you don’t even notify somebody and you don’t even know to wake up. But what did excite me—so you need those things; they’re necessary, but they’re not sufficient.

What I think is also needed is something that actually reduces the number of tickets, not just lets you observe them or find the right person to act upon it. So, automation is the path to reducing tickets, which is when I got excited because that was one less thing to wake up on that gave me more time back to wo—do things, and most importantly, it improved my customer availability because any individual issue handled manually is going to take an hour or two or three to deal with. The issue being done by a computer is going to take a few seconds or a few minutes. It’s a whole different thing. It’s the difference between a glitch and having to go out on an apology tour to your customers.

Corey: I really love installing, upgrading, and fixing security agents in my cloud estate! Why do I say that? Because I sell things for a company that deploys an agent; there’s no other reason. Because let’s face it: agents can be a real headache. Well, now Orca Security gives you a single tool that detects basically every risk in your cloud environment, and that’s as easy to install and maintain as a smartphone app. It is agentless, or my intro would’ve gotten me into trouble here, but it can still see deep into your AWS workloads, while guaranteeing 100% coverage. With Orca Security, there are no overlooked assets, no DevOps headaches (and believe me, you will hear from those people if you cause them headaches), and no performance hits on live environments. Connect your first cloud account in minutes and see for yourself at orca.security. That’s “Orca” as in whale, “dot” security as in that thing your company claims to care about but doesn’t until right after it really should have.

Corey: Oh, yes. I feel like those of us who have been in the ops world for long enough, we always have a horror story or two of automation around incidents run amok. A classic thing that we learned by doing this, for example, is if you have a primary and a secondary, failover should be automated. Failing back should not be, or you wind up in these wonderful states of things thrashing back and forth. And in many cases in data center land, if you have a phantom router ready to step in if the primary router goes offline, more outages are caused by a heartbeat failure between those two devices, and they both start vying for power.

And that becomes a problem. Same story with a lot of automation approaches. For example, if oh, every time a disk winds up getting full, all right, we’re going to fire off something to automatically expand the volume. Well, without something to stop that feedback loop, you’re going to potentially wind up with an unbounded growth problem and then you wind up with having no more disks to expand the volume to, being the way that winds up smacking into things. This is clearly something you’ve thought about, given that you have built a company out of this, and this is not your first rodeo by a long stretch. How do you think about those things?

Anurag: So, I think you’re exactly right there, again. So, the key here is to have the operator, or the SRE, define what needs to happen on an individual box, but then provide guardrails around them so that you can decide, oh, a lot of these things have happened at the same time; I’m going to put a rate limiter or a circuit breaker on it and then send it off to somebody else to look at manually. As you said, like failover, but don’t flap back and forth, or limit the number of times that something is allowed to fail before you send it [unintelligible 00:21:44]. Finally, everything grounds out at a human being looking at something, but that’s not a reason not to do the simple stuff automatically because wasting human intelligence and time on doing just manual stuff again, and again, and again, is pointless, and also increases the likelihood that they’re going to cause errors because they’re doing something mundane rather than something that requires their intelligence. And so that also is worse than handing it off to be automated.

But there are a lot of guardrails that can be put around this—that we put around it—that is the distributed systems part of it that we provide. In some sense, we’re an orchestration system for automation, production ops, the same way that other people provide an orchestration system for deployments, and automated rollback, and so forth.
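
The rate-limiter and circuit-breaker guardrail described here can be sketched in a few lines. This is a minimal illustration of the general pattern, not Shoreline's implementation: the class name `RemediationBreaker`, its parameters, and the escalation list are all hypothetical.

```python
import time
from collections import deque
from typing import Callable, Deque, List


class RemediationBreaker:
    """Allow at most `max_actions` automated fixes per sliding time window.

    Once the limit is hit, further requests are not executed; they are
    recorded for a human to look at instead (hypothetical escalation path).
    """

    def __init__(self, max_actions: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.clock = clock
        self._recent: Deque[float] = deque()  # timestamps of recent automated runs
        self.escalated: List[str] = []        # remediations handed to a human

    def run(self, name: str, action: Callable[[], None]) -> bool:
        """Run `action` if the breaker is closed; return True if it ran."""
        now = self.clock()
        # Forget runs that have aged out of the sliding window.
        while self._recent and now - self._recent[0] >= self.window_seconds:
            self._recent.popleft()
        if len(self._recent) >= self.max_actions:
            # Too many fixes in a short time: stop the feedback loop.
            self.escalated.append(name)
            return False
        self._recent.append(now)
        action()
        return True


if __name__ == "__main__":
    breaker = RemediationBreaker(max_actions=2, window_seconds=3600)
    for attempt in range(3):
        ran = breaker.run(f"expand-disk-{attempt}", lambda: print("expanding volume"))
        print("automated" if ran else "escalated to on-call")
```

With a limit of two per hour, the third disk expansion is handed to a person rather than executed, which is the unbounded-growth case Corey raises.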

Corey: What technical stacks do you wind up supporting for stuff like this? Is it anything you can effectively SSH into? Does it integrate better with certain cloud providers than others? Is it only for cloud and not for folks with data center environments? Where do you start? Where do you stop?

Anurag: So, we have started with AWS, and with VMs and Kubernetes on AWS. We’re going to expand to the other major cloud providers later this year and likely go to VMware on-prem next year. But finally, customers tell us what to do.

Corey: Oh, yeah. Looking for things that have no customer usage is—that’s great and all, but talking to folks who are like, “Yeah, it’d be nice if it had this.” “Will you buy it if it does?” “No.” “Yeah, let’s maybe put that one on the backlog.”

Anurag: And you’ve done startups, too, I see that.

Corey: Oh, once or twice. Talk to customers; I find that’s one of those things that absolutely is the most effective use of your time you can do. Looking at your site—for those who want to follow along at home—it lists a few different remediations that you give as examples. And one of them is expanding disk volumes as they tend to run out of space. I’m assuming from that perspective alone, that you are almost certainly running some form of Agent.

Anurag: We are running an Agent. So, part of that is because that way, we don’t need credentials so that you can just run inside the customer environment directly and without your having to pass credentials to some third party. Part of it is also so you can do things quickly. So, every second, we’ll scrape thousands of metrics from the Prometheus exporter ecosystem, calculate thousands more, compare them against hundreds of alarms, and then take action when necessary. And so if you run on-box, that can be done far faster than if you go off-box.

And also, a lot of the problems that happen in the production environment are related to networking, and it’s not like the box isn’t accessible, but it may be that the monitoring path is not accessible. So, you really want to make sure that the box can protect itself even if there’s some issues somewhere in the fleet. And that really becomes an important thing because that’s the only time that you need incident automation: when something’s gone wrong.
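
The scrape, compare, and act cycle Anurag describes can be illustrated with a toy evaluation step. This is a sketch under stated assumptions: the `Alarm` type, the metric name, and the threshold are invented for illustration, metrics arrive as a plain dictionary rather than being scraped from Prometheus exporters, and nothing here reflects how the Shoreline Agent is actually built.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Alarm:
    """A threshold on one metric plus the action to take when it is crossed."""
    name: str
    metric: str
    threshold: float
    action: Callable[[], None]


def evaluate(metrics: Dict[str, float], alarms: List[Alarm]) -> List[str]:
    """Compare the latest metric values to each alarm and run matching actions.

    Returns the names of the alarms that fired, in the order they were checked.
    """
    fired = []
    for alarm in alarms:
        value = metrics.get(alarm.metric)
        if value is not None and value > alarm.threshold:
            alarm.action()
            fired.append(alarm.name)
    return fired


if __name__ == "__main__":
    alarms = [Alarm("disk-nearly-full", "disk_used_ratio", 0.9,
                    lambda: print("would expand the volume here"))]
    # One pass of the loop; an on-box agent would repeat this on every scrape.
    print(evaluate({"disk_used_ratio": 0.95}, alarms))
```

Running the check on the box itself means the decision does not depend on a monitoring path that may be the very thing that is broken.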

Corey: I assume that Agent then has specific commands or tasks it’s able to do, or does it accept arbitrary command execution?

Anurag: Arbitrary command execution. Whatever you can type in at the Linux command prompt, whether it’s a call to the AWS CLI, kubectl, Linux commands like top, or even shell scripts, you can automate using Shoreline.

Corey: Yeah. That was one of the ways that Nagios got it wrong, once upon a time, with their NRPE, their Nagios Remote Plugin Executor, where you would only be allowed to run explicit things that had been pre-approved and pushed out to things in advance. And it’s one of the reasons, I suspect, why remediation in those days never took off. Now, we’ve learned a lot about observability and monitoring, and keeping an eye on things that have grown well beyond host-based stuff, so it’s nice to see that there is growth in that. I’m much more optimistic about it this time around, based upon what you’re saying.

Anurag: I hope you’re right because I think the key thing also is that I think a lot of these tools vendors think of themselves as the center of the universe, whereas I think Shoreline works the best if it’s entirely invisible. That’s what you want from a feedback control system, from an automation system: that it just gives you time back and issues are just getting fixed behind the scenes. That’s actually what a lot of AWS is doing behind the scenes. You’re not seeing something whenever some rack goes down.

Corey: The thing that has always taken me aback—and I don’t know how many times I’m going to have to learn this lesson before it sticks—I fall into the common trap of take any one of the big internationally renowned tech companies, and it’s easy to believe that oh, everything inside is far future wizardry of, everything works super well, the automation is flawless, everything is pristine, and your environment compared to that is relative garbage. It turns out that every company I’ve ever spoken with and taken SREs from those companies out to have way too many drinks until they hit honesty levels, they always talk about it being a sad dumpster fire in a bunch of different ways. And we’re talking some of the companies that people laud as the aspirational, your infrastructure should be like these companies. And I find it really important to continue to socialize that point, just because the failure mode otherwise is people think that their company just employs terrible engineers and if people were any good, it would be seamless, just like they say on conference stages. It’s like comparing your dating life to a romantic comedy; it’s not an accurate depiction of how the world works.

Anurag: Yeah, that’s true. That said, I’d say that, like, the average DBA working on-prem may be managing a hundred databases; the average DBA in RDS—or somebody on call—might be managing a hundred thousand.

Corey: At that point, automation is no longer optional.

Anurag: Yeah. And the way you get there is, every week you squash and extinguish one thing forever, and then you start seeing less and less frequent things because one in a million is actually occurring to you. But if it was one in a hundred, that would just crush you. And so you just need to, you know, very diligently every week, every day, remove something. Yeah, Shoreline is in many ways the product I wish I had had at AWS because it makes automating that stuff easy, a matter of minutes, rather than months. And so that gives you the capability to do automation. Everyone wants automation, but the question is, why don’t they do it? And it’s just because it takes so much time and we’re so busy, as operators.

Corey: Absolutely. I don’t mean to say that these large companies working at hyperscale have not solved for these problems and done truly impressive things, but there’s always sharp edges, there’s always things that are challenging and tricky. On this show, we had Dr. Christina Maslach recently as an expert on burnout, given that she spent her entire career studying occupational burnout as an academic. And it turns out that it’s not—to equate this to the operations world—it’s not waking up at two in the morning to have to fix a problem—generally—that burns people out. It’s being woken up to fix a problem at 2 a.m. consistently, and it’s always the same problem and nothing ever seems to change. It’s the worst ops jobs I’ve ever seen are the ones where you have to wake up to fix a thing, but you’re not empowered to actually fix the cause, just the symptom.

Anurag: I couldn’t agree more and that’s the other aspect of Shoreline is to allow the operators or SREs to build the remediations rather than just put a ticket into some queue for some developer to get prioritized alongside everything else. Because you’re on the sharp edge when you’re doing ops, right, to deal with all the consequences of the issues that are raised. And so it’s fine that you say, “Okay, there’s this memory leak. I’ll create a ticket back to dev to go and fix it.” But I need something that helps me actually fix it here and now. Or if there’s a log that’s filling up my disk, it’s fine to tell somebody about it, but you have to grow your disk or move that log off the disk. And you don’t want to have to wake up for those things.

Corey: No. And the idea that everything like this gets fixed is a bit of a misnomer. One of my hobbies is whenever a site goes down and it is uncovered—sometimes very publicly, sometimes in RCAs—that the actual reason everything broke was due to an expired certificate.

Anurag: Yep.

Corey: I like to go and schedule out a couple of calendar reminders on that one for myself, of check it in 90 days, in case they’re using a refresh from Let’s Encrypt, and let’s check it as well in one year and see if there’s another outage just like that. It has a non-zero success rate because as much as we want to convince ourselves that, oh, that bit me once, and I’ll never get bitten like that again, that doesn’t always hold true.

Anurag: Certificates are a very common source of very widespread outages. And it’s actually one of the remediations we provide out of the box. So, alongside making it possible for people to create these things quickly, we also provide what we call Op Packs, which are basically getting started things which have the metrics, alarms, actions, bots, so they can just fix it forever without actually having to do very much other than review what we have done.
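
A minimal sketch of the kind of expiry check discussed above, assuming the certificate's `notAfter` string has already been read from the certificate. The function names and the 30-day threshold are hypothetical and are not taken from Shoreline's Op Packs; only `ssl.cert_time_to_seconds` is a real standard-library call.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format found in certificates,
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def needs_renewal(not_after: str, now: datetime, threshold_days: int = 30) -> bool:
    """True once the certificate is within `threshold_days` of expiring.

    The 30-day default is an illustrative assumption.
    """
    return days_until_expiry(not_after, now) <= threshold_days
```

Run on a schedule, a check like this replaces the calendar reminder: it flags the certificate before it expires instead of after the outage.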

Corey: And that’s, on some level, I think, part of the magic is abstracting away the toil so that people are left to solve interesting problems and think about these things, and guiding them down a path where, okay, what should I do on an automatic basis if the disk fills up? Well, I should extend the volume. Yeah. But maybe you should alert after the fifth time in an hour that you have to extend the same volume because—just spitballing here—maybe there’s a different problem here that putting a bandaid on isn’t going to necessarily solve. It forces people to think about what are those triggers that should absolutely result in human intervention because you don’t necessarily want to solve things like memory leaks, for example, oh our application leaks memory so we have to restart it once a day.

Now, in practice, the right way to solve that is to fix the application. In practice, there are so many cron jobs out there that are set to restart things specifically for that reason because cron jobs are quick and easy and application developer time is absolutely not easy to come by in many of these shops. It just comes down to something that helps enforce more of a process, more of a rigor. I like the idea quite a bit; it aligns both with where people are and how a better tomorrow starts to look. I really do think you’re onto something here.
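
The "extend the volume, but alert after the fifth time in an hour" rule can be sketched as a sliding-window counter. This is an illustration only: the class name, method name, and return values are hypothetical, and the actual volume extension and paging are left out.

```python
from collections import deque


class DiskFullPolicy:
    """Decide whether a disk-full event is auto-fixed or escalated to a human."""

    def __init__(self, escalate_at: int = 5, window_seconds: float = 3600.0):
        self.escalate_at = escalate_at        # the Nth event in the window escalates
        self.window_seconds = window_seconds  # size of the sliding window
        self._events = deque()                # timestamps of recent events

    def on_disk_full(self, now: float) -> str:
        """Record an event at time `now` (seconds) and return the action to take."""
        self._events.append(now)
        # Forget events that have fallen out of the window.
        while now - self._events[0] > self.window_seconds:
            self._events.popleft()
        if len(self._events) >= self.escalate_at:
            return "escalate"   # repeated fixes suggest a deeper problem
        return "remediate"      # e.g. extend the volume automatically


policy = DiskFullPolicy()
actions = [policy.on_disk_full(t) for t in (0, 600, 1200, 1800, 2400)]
print(actions)
```

The first four events inside the hour are remediated automatically and the fifth is escalated, so the band-aid keeps working while a person gets asked to look for the underlying cause.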

Anurag: I mean, I think it’s one of these things where you just have to understand it’s not either-or, that it’s not a question of operator pain or developer pain. It’s, let’s go and address it in the here and now and also provide the information, also through an automated ticket generation, to where someone can look to fix it forever, at source.

Corey: Oh, yeah. It’s always great for the user experience, too. Having those tickets created automatically is also sometimes handy because the worst way to tell someone you don’t care about their problem when they come to you in a panic is, “Have you opened a ticket?” And yes, of course, you need a ticket to track these things, but maybe when someone is ghost pale and scared to death about what they think just broke the data, maybe have a little more empathy there. And yeah, the process is important, but there should be automatic ways to do that. These things all have APIs. I really like your vision of operational maturity and managing remediation, in many cases, on an automatic basis.

Anurag: I think it’s going to be so much more important in a world where deployments are more frequent. You have microservices, you have multiple clouds, you have containers that give a 10x increase in the number of things you have to manage. There’s a lot for operators to have to keep in their heads. And things are just changing constantly with containers. Every minute, one comes and one goes. So, you just really need to—even if you’re just doing it for diagnosis, collecting it and putting it aside is really critical.

Corey: If people want to learn more about what you’re building and how you think about these things, where can they find you?

Anurag: They can reach out to me on LinkedIn at awgupta, or of course, they can go to and reach out there, where I’m also [email protected] if they want to reach out directly. And we’d love to get people demos; we know there’s a lot of pain out there. Our mission is to reduce it.

Corey: Thank you so much for taking the time to speak with me today. I really appreciate it.

Anurag: Yeah. This was a great privilege to talk to you.

Corey: Anurag Gupta, CEO and founder of I’m Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice along with a comment telling me that I’m wrong and that Amazonians are the best at being on call because they carry six pagers.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit to get started.

Announcer: This has been a HumblePod production. Stay humble.