Picture this: You’re in your swivel chair, feet propped up on your standing desk because you are a glorious acrobat, and you’re looking over your company’s Amazon EC2 fleet utilization report. You’re captivated by the custom colorful dashboard, carefully tuned to a 1st-grade reading level. You see the overall number in its soft, non-threatening font, and you say to yourself, “We’re operating at 70% capacity — we’re golden!”

That metric is usually nothing more than a feel-good security blanket that doesn’t give you better insight into the efficiency of your spend. Why? Because the number at the top of your report is simply your CPU utilization, which is being passed off as a standalone metric for fleet utilization. It’s the cloud’s equivalent of ordering a pizza based on the box’s size, without giving a second thought to what’s actually inside.

What’s more disturbing is the number of cost optimization vendors who similarly adhere to that metric without context; it’s in danger of becoming a de facto “best practice” that will lead you down the primrose path.

The limits of CPU utilization metrics

CPU utilization tells you, perhaps obviously, how much of your CPU resources are being used at any given time. Any cloud provider can easily answer the question, “How busy is the compute on this instance?”

In a theoretical world where disk, RAM, burst capacity, network throughput, and latency are all irrelevant, then (yes! OK! terrific!) the question of utilization would strictly come down to how many CPU cores you can throw at the problem, and using CPU as a proxy for utilization would be great. If that’s you, stop reading now, go buy some Last Week in AWS mugs or something, and carry on with your charmed existence. If it isn’t you, keep reading.

Huh, we just lost a couple HPC folks and several of the more naive analyst firms, but the rest of you are all still here. Imagine that …

For the rest of us, the problem is that CPU utilization is a single data point that doesn’t tell much of a story. It’s impossible to say at first glance whether your CPU numbers are worrying or an indicator that all is well, regardless of what the actual numbers are.

High CPU usage could mean:

  • your applications are working efficiently, or
  • they’re straining under the load, desperately crying out for relief.

Low CPU usage could mean:

  • your instances are idling, wasting precious cloud dollars,
  • your applications are well-optimized and aren’t CPU bound, or
  • you need idle capacity to burst into when a bunch of your users all show up at once.

The CPU utilization metric blissfully ignores other critical aspects of your instances’ operation, such as network activity, disk I/O, and memory usage. High CPU usage with low network activity could signal a performance bottleneck that leaves the instance starved for data, or it could be an application that barely needs to talk to other things on the internet. Low CPU usage with high memory utilization could mean your application is inefficiently coded, or that it’s a database that lives in RAM for latency purposes.
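
To make that concrete, here is a minimal Python sketch of two instances that report the same CPU number but tell very different stories. Everything in it is an assumption for illustration: the `InstanceSample` type, the `binding_resource` helper, the instance names, and the percentages are all made up, and it presumes you have already normalized each metric to a percentage of what the instance can do.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class InstanceSample:
    """Hypothetical per-instance snapshot; every field is a percentage (0-100)."""
    name: str
    cpu_pct: float
    mem_pct: float
    net_pct: float   # share of the instance's network capacity in use
    disk_pct: float  # share of the instance's disk throughput in use


def binding_resource(sample: InstanceSample) -> Tuple[str, float]:
    """Return the most heavily used dimension and its utilization."""
    usage = {
        "cpu": sample.cpu_pct,
        "memory": sample.mem_pct,
        "network": sample.net_pct,
        "disk": sample.disk_pct,
    }
    busiest = max(usage, key=usage.get)
    return busiest, usage[busiest]


# Illustrative numbers only: same CPU reading, very different instances.
samples = [
    InstanceSample("cache-node", cpu_pct=12, mem_pct=88, net_pct=35, disk_pct=5),
    InstanceSample("forgotten-box", cpu_pct=12, mem_pct=9, net_pct=1, disk_pct=2),
]

for s in samples:
    resource, pct = binding_resource(s)
    print(f"{s.name}: CPU at {s.cpu_pct}%, busiest on {resource} at {pct}%")
```

A CPU-only report would file both of these under “idle.” Only one of them is.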

The risks of relying on CPU metrics

This reduction of cloud instance health to CPU utilization stems from its ease of access. It’s readily available, easy to measure, and undeniably simple to interpret. Cloud providers can grab it via API, slap it onto a pretty graph, and voilà, they’ve got themselves a utilization report. And the resulting CPU metric seems to level the playing field across remarkably diverse workloads, making it easier to reason about them and to benchmark yourself against other companies (which you should not do). But easy access doesn’t equal quality insight.

Consider that a c7g.large instance in EC2 is about 6% more expensive than a c6g.large instance. Amazon points out that the newer generation delivers improved price/performance, but that assumes an awful lot of things about your workload. If you need a cluster of 10 nodes to chew on a problem because that’s how your application works, then your cluster just got 6% more expensive when you upgraded to the latest generation, without a clear upside that accrues to you.
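
The arithmetic is short enough to sketch in Python. It takes the roughly 6% figure above at face value and uses relative prices rather than real ones. The helper names `cluster_cost_ratio` and `max_nodes_to_break_even` are hypothetical, and the sketch assumes the only way to recoup the premium is to run fewer nodes.

```python
import math

# Assumption: the newer generation costs about 6% more per node (figure from the text).
PRICE_RATIO = 1.06


def cluster_cost_ratio(old_nodes: int, new_nodes: int, price_ratio: float) -> float:
    """New cluster cost divided by old cluster cost, using relative per-node prices."""
    return (new_nodes * price_ratio) / old_nodes


def max_nodes_to_break_even(old_nodes: int, price_ratio: float) -> int:
    """Largest new-generation node count that costs no more than the old cluster."""
    return math.floor(old_nodes / price_ratio)


same_size = cluster_cost_ratio(10, 10, PRICE_RATIO)
break_even_nodes = max_nodes_to_break_even(10, PRICE_RATIO)
smaller = cluster_cost_ratio(10, break_even_nodes, PRICE_RATIO)

print(f"Same 10 nodes on the new generation: {same_size:.2f}x the old bill")
print(f"Nodes you can afford before paying more: {break_even_nodes}")
print(f"Bill at that node count: {smaller:.3f}x the old bill")
```

If your application insists on exactly 10 nodes, the faster cores don't help your bill. You only come out ahead if the extra performance lets you actually remove a node.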

How to actually determine your fleet utilization

A nuanced approach, taking into account a bouquet of metrics including network I/O, disk read/write speeds, and memory usage alongside CPU utilization, provides a holistic picture of your cloud instance fleet. Those metrics require a lot more insight into the environment and, in the case of memory, an agent running on the actual instances themselves. Cloud providers could deliver these kinds of nuanced reports, but the effort required from them is likely too high.
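
Here is one hedged way to roll several metrics into a single fleet number, sketched in Python. This is just one plausible approach, not an established formula. It takes each instance's busiest metric and weights it by what the instance costs. The `fleet_utilization` function, the instance names, the relative `hourly_cost` weights, and all the percentages are invented for illustration. In practice the memory figures would have to come from an agent on the instances, as noted above.

```python
from typing import Dict, List, Sequence, Union

Instance = Dict[str, Union[str, float]]


def fleet_utilization(fleet: List[Instance], metric_keys: Sequence[str]) -> float:
    """Cost-weighted average of each instance's busiest metric among metric_keys."""
    total_cost = sum(inst["hourly_cost"] for inst in fleet)
    weighted = sum(
        inst["hourly_cost"] * max(inst[key] for key in metric_keys)
        for inst in fleet
    )
    return weighted / total_cost


# Illustrative fleet: hourly_cost is a relative weight, metrics are percentages.
fleet = [
    {"name": "web", "hourly_cost": 1.0, "cpu": 70, "memory": 40, "network": 30, "disk": 10},
    {"name": "in-memory-db", "hourly_cost": 2.0, "cpu": 10, "memory": 85, "network": 40, "disk": 5},
    {"name": "idle-batch", "hourly_cost": 1.0, "cpu": 4, "memory": 8, "network": 1, "disk": 2},
]

cpu_only = fleet_utilization(fleet, ["cpu"])
all_metrics = fleet_utilization(fleet, ["cpu", "memory", "network", "disk"])

print(f"CPU-only view of the fleet: {cpu_only:.1f}%")
print(f"Busiest-metric view of the fleet: {all_metrics:.1f}%")
```

The two views disagree sharply on the same fleet. The gap comes from the memory-bound database, which the CPU-only number counts as nearly idle. Neither number replaces looking at each workload, but the second at least stops penalizing instances for being busy somewhere other than the CPU.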

So, next time you’re in your swivel chair, resist the temptation to rely solely on the CPU utilization column. Dive deeper, venture beyond, and ask probing questions. In so doing, you’ll uncover the true health of your server fleets. Because in the world of cloud economics, ignorance isn’t bliss; it’s just expensive.