I’ve been using a series of Lambdas Function, APIs Gateway, Dynamos DB, and other sundry “serverless” services that I pluralize very strangely to build the microservices that combine to form the newsletter production pipeline powering the thing you all know as “Last Week in AWS.” Recently it was time to make a few fixes to it, wherein I stumbled across a particular variety of challenge that, while not new, the serverless value proposition definitely exacerbates.
“Serverless” has become a catch-all term that’s been watered down enthusiastically, particularly by AWS offerings that look an awful lot like “serverfull” products to most of its customers, so let me clarify what I mean here. There are challenges around the margins, but basically I’m talking about services that are fully managed by the provider, charge only for what you use while they’re running, and scale to zero. Part of the benefit here is that once you have a service built on top of these technologies working, AWS handles the runtime patching, the care and feeding of the environment, and in practice I find myself not touching these things again for years at a time.
Until suddenly I do.
When I started out as an independent consultant in 2016, I spun up an AWS account. When I took on a business partner and reformed as The Duckbill Group, that account became the core of our AWS Organization, and today we largely view this as legacy / a pile of technical debt. These days everything we build gets its own dedicated AWS account or series of accounts, but this original account has a bunch of things in it that are for a variety of reasons challenging to move.
That means that it’s time once again to go delving into the archeological dig that is my legacy AWS environment and holy hell is this a bikeshed full of yaks in desperate need of shaving.
The more recently built services use the CDK to construct the infrastructure, but the older stuff mostly uses the Serverless Framework, and there’s also an experiment or two that uses sam-cli. That of course leaves out the couple of places where the tried-and-true ClickOps approach of “using the console, then lying about it” served me well. The problem is that while my infrastructure was frozen in time like a fly trapped in amber, these deployment tools absolutely did not hold still in the least.
Every software offering handles breaking changes differently. Some auto-upgrade configurations to support the new way they do business, others throw errors complaining about the old version, and still others die mysteriously and refuse to work again. As a result, there’s a confusing array of deployment errors that leads to a joyful series of questions: “Is this the new account being misconfigured? Is there a hard dependency on something account-specific like a named S3 bucket or a Route 53 zone? Is there something manually configured that’s implicitly assumed to be there? Is this a breaking change in whatever framework I’m using? Wait, how did this ever work at all?”
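One small mitigation for that last category, at least with the Serverless Framework, is pinning the framework version in the config so that a newer CLI fails loudly with a version-mismatch error instead of failing mysteriously mid-deploy. A minimal sketch (the service name and runtime here are hypothetical):

```yaml
# serverless.yml
service: newsletter-pipeline   # hypothetical service name

# The Serverless Framework compares this against the CLI version at
# deploy time and refuses to run with an incompatible one, turning a
# mystery failure into an explicit "upgrade or downgrade" message.
frameworkVersion: '^3.38.0'

provider:
  name: aws
  runtime: nodejs18.x
```

It doesn’t save you from the eventual major-version migration, but it does answer the “which version did this ever work with?” question before it gets asked.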
When attempting a deploy to a new account, I’m first beset by the usual permissions issues; originally I set the deployment role to have Administrator permissions, swearing to go back and fix it later. I confess, dear reader, that “later” never came; this is the peril of complex permissions structures that get in the customer’s way: they never get used, everything winds up overscoped, and then one day there’s a problem and AWS wags its finger at the customer, making noises about the Shared Responsibility Model in ways that aren’t even slightly amusing.
I think that most of this is my fault for not treating these services as “production quality” from the get-go, but in my defense, this newsletter started as an experiment! I had no confidence that it would still be going in six months, let alone six years after I started. Barring failure, I believe that every service grows until it eventually violates the constraints with which it was initially designed; it’s probably time for a full rewrite, except that saying yes to that on something that’s largely working means saying no to building something new and exciting.
Of course I shouldn’t be writing Lambda functions like they’re bash scripts triggered by a cron job, then ignoring them for the rest of time or until something breaks–but that’s how I use them. That’s how lots of people use them. I bet that you do too, whether you realize it or not.
What I’d Do Differently
I think that today I’d start any new project with a staging environment as well as a production one, and build CI/CD workflows around them that deploy automatically on a schedule. When an upstream change breaks the deployment, it should fire off an alert. The problem is that I have a lot of services that would need this treatment, so building out the blueprint for all of it is decidedly non-trivial, as well as being very workload-specific, since I use a lot of different architectural patterns. To round that out, the deployment flow is going to be radically different for different companies with different requirements imposed upon them. This is ostensibly the problem AWS Proton is designed to solve; unfortunately, for small companies trying to throw a lot of stuff at the wall to see what sticks, investing time in tuning it for approaches that haven’t themselves solidified is a fairly big ask. As soon as doing the right thing becomes more work than taking shortcuts, people make the decision you wish they wouldn’t; this is why guardrails need to remove friction, not add to it.
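As a sketch of what that scheduled deploy might look like, assuming GitHub Actions and a Serverless Framework project (the cron expression, secret names, and notification script are all illustrative, not prescriptive):

```yaml
# .github/workflows/scheduled-deploy.yml (illustrative)
name: scheduled-staging-deploy

on:
  push:
    branches: [main]
  schedule:
    # Redeploy weekly even when nothing changed, during business hours,
    # so upstream breakage surfaces while someone is awake to see it.
    - cron: '0 16 * * 2'   # Tuesdays at 16:00 UTC

jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 18
      - run: npm ci
      - run: npx serverless deploy --stage staging
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      - if: failure()
        # Report the failure somewhere a human will actually see it.
        run: ./scripts/notify-failure.sh   # hypothetical notifier
```

The important part isn’t the specific CI system; it’s that the deploy happens on a calendar rather than only on a commit.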
Why Is This a Serverless Problem?
If I had built this on a traditional set of web, job, and database servers, I would have been engaging with the infrastructure a lot more frequently. Leaving systems unpatched may be common, but it’s also a terrible plan. When infrastructure updates require reboots and validation that you didn’t just destroy something, doing a redeploy as part of that validation pass is de rigueur; you keep current by dint of the other stuff you have to do in order to responsibly run an application in a serverfull way. Lambda removes a lot of the undifferentiated heavy lifting, but in turn, that heavy lifting helps keep us honest!
I am absolutely not suggesting you avoid Lambda; far from it! I love that service and you can’t take it away from me. If nothing else, I’m going to set down a new Best Practice that you can all nibble me to death over like a waddling of ducks: redeploy your serverless workloads on a schedule, at least to a staging environment. Don’t only do it when the code changes, because a lot of backend systems won’t see their code touched again for years. Do it on a recurring schedule, during business hours, and have failures reported somewhere you’ll see them. It’s a lot easier to update between minor versions than to leap six major versions all at once while hunting down exactly which change broke your specific implementation.
These are my thoughts; I welcome yours.