Keeping the Cloudwatch with Ewere Diagboya

Episode Summary

This week Corey is joined by Ewere Diagboya, Head of Cloud at Mycloudseries, and multifaceted blogger and author, and the first AWS Hero from Africa. Ewere’s book on CloudWatch is the first of its kind, and certainly a valuable asset to the community. Ewere’s passion for passing on lessons learned is very prevalent in his writing, and a core ethos in his drive to write. Ewere goes into the background of how his book, Infrastructure Monitoring with Amazon Cloudwatch, came to be. It starts with setting up memory, then building from there. Ewere talks about integrating Cloudwatch with EKS, Kubernetes clusters, containers and more AWS services. Ewere also talks about his future writing plans and passion to pass on the lessons he has learned. Whats more, there are other books he wants to write that will follow up Infrastructure. Tune in for the rest!

Episode Show Notes & Transcript

About Ewere

Cloud, DevOps Engineer, Blogger and Author

Links:

Infrastructure Monitoring with Amazon CloudWatch: https://www.amazon.com/Infrastructure-Monitoring-Amazon-CloudWatch-infrastructure-ebook/dp/B08YS2PYKJ
LinkedIn: https://www.linkedin.com/in/ewere/
Twitter: https://twitter.com/nimboya
Medium: https://medium.com/@nimboya
My Cloud Series: https://mycloudseries.com

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is sponsored in part by Honeycomb. When production is running slow, it's hard to know where problems originate: is it your application code, users, or the underlying systems? I’ve got five bucks on DNS, personally. Why scroll through endless dashboards, while dealing with alert floods, going from tool to tool to tool that you employ, guessing at which puzzle pieces matter? Context switching and tool sprawl are slowly killing both your team and your business. You should care more about one of those than the other, which one is up to you. Drop the separate pillars and enter a world of getting one unified understanding of the one thing driving your business: production. With Honeycomb, you guess less and know more. Try it for free at Honeycomb.io/screaminginthecloud. Observability, it’s more than just hipster monitoring.

Corey: This episode is sponsored in part by Liquibase. If you’re anything like me, you’ve screwed up the database part of a deployment so severely that you’ve been banned from touching every anything that remotely sounds like SQL, at at least three different companies. We’ve mostly got code deployments solved for, but when it comes to databases we basically rely on desperate hope, with a roll back plan of keeping our resumes up to date. It doesn’t have to be that way. Meet Liquibase. It is both an open source project and a commercial offering. Liquibase lets you track, modify, and automate database schema changes across almost any database, with guardrails to ensure you’ll still have a company left after you deploy the change. No matter where your database lives, Liquibase can help you solve your database deployment issues. Check them out today at liquibase.com. Offer does not apply to Route 53.

Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. I periodically make observations that monitoring cloud resources has changed somewhat since I first got started in the world of monitoring. My experience goes back to the original Call of Duty. That’s right: Nagios.

When you set instances up, it would theoretically tell you when they were unreachable or certain thresholds didn’t work. It was janky but it kind of worked, and that was sort of the best we have. The world has progressed as cloud has become more complicated, as technologies have become more sophisticated, and here today to talk about this is the first AWS Hero from Africa and author of a brand new book, Ewere Diagboya. Thank you for joining me.

Ewere: Thanks for the opportunity.

Corey: So, you recently published a book on CloudWatch. To my understanding, it is the first such book that goes in-depth with not just how to wind up using it, but how to contextualize it as well. How did it come to be, I guess is my first question?

Ewere: Yes, thanks a lot, Corey. The name of the book is Infrastructure Monitoring with Amazon CloudWatch, and the book came to be from the concept of looking at the ecosystem of AWS cloud computing and we saw that a lot of the things around cloud—I mostly talked about—most of this is [unintelligible 00:01:49] compute part of AWS, which is EC2, the containers, and all that, you find books on all those topics. They are all proliferated all over the internet, you know, and videos and all that.

But there is a core behind each of these services that no one actually talks about and amplifies, which is the monitoring part, which helps you to understand what is going on with the system. I mean, knowing what is going on with the system helps you to understand failures, helps you to predict issues, helps you to also envisage when a failure is going to happen so that you can remedy it and also [unintelligible 00:02:19], and in some cases, even give you a historical view of the system to help you understand how a system has behaved over a period of time.

Corey: One of the articles that I put out that first really put me on AWS’s radar, for better or worse, was something that I was commissioned to write for Linux Journal, back when that was a print publication. And I accidentally wound up getting the cover of it with my article, “CloudWatch is of the devil, but I must use it.” And it was a painful problem that people generally found resonated with them because no one felt they really understood CloudWatch; it was incredibly expensive; it didn’t really seem like it was at all intuitive, or that there was any good way to opt out of it, it was just simply there, and if you were going to be monitoring your system in a cloud environment—which of course you should be—it was just sort of the cost of doing business that you then have to pay for a third-party tool to wind up using the CloudWatch metrics that it was gathering, and it was just expensive and unpleasant all around. Now, a lot of the criticisms I put about CloudWatch’s limitations in those days, about four years ago, have largely been resolved or at least mitigated in different ways. But is CloudWatch still crappy, I guess, is my question?

Ewere: Um, yeah. So, at the moment, I think, like you said, CloudWatch has really evolved over time. I personally also had that issue with CloudWatch when I started using CloudWatch; I had the challenge of usability, I had the challenge of proper integration, and I will talk about my first experience with CloudWatch here. So, when I started my infrastructure work, one of the things I was doing a lot was EC2, basically. I mean, everyone always starts with EC2 at the first time.

And then we had a downtime. And then my CTO says, “Okay, [Ewere 00:04:00], check what’s going on.” And I’m like, “How do I check?” [laugh]. I mean, I had no idea of what to do.

And he says, “Okay, there’s a tool called CloudWatch. You should be able to monitor.” And I’m like, “Okay.” I dive into CloudWatch, and boom, I’m confused again. And you look at the console, you see, it shows you certain metrics, and yet [people 00:04:18] don’t understand what CPU metric talks about, what does network bandwidth talks about?

And here I am trying to dig, and dig, and dig deeper, and I still don’t get [laugh] a sense of what is actually going on. But what I needed to find out was, I mean, what was wrong with the memory of the system, so I delved into trying to install the CloudWatch agent, get metrics and all that. But the truth of the matter was that I couldn’t really solve my problem very well, but I had [unintelligible 00:04:43] of knowing that I don’t have memory out of the box; it’s something that has to set up differently. And trust me, after then I didn’t touch CloudWatch [laugh] again. Because, like you said, it was a problem, it was a bit difficult to work with.

But fast forward a couple of years later, I could actually see someone use CloudWatch for a lot of beautiful stuff, you know? It creates beautiful dashboards, creates some very well-aggregated metrics. And also with the aggregated alarms that CloudWatch comes with, [unintelligible 00:05:12] easy for you to avoid what to call incident fatigue. And then also, the dashboards. I mean, there are so many dashboards that simplified to work with, and it makes it easy and straightforward to configure.

So, the bootstrapping and the changes and the improvements on CloudWatch over time has made CloudWatch a go-to tool, and most especially the integration with containers and Kubernetes. I mean, CloudWatch is one of the easiest tools to integrate with EKS, Kubernetes, or other container services that run in AWS; it’s just, more or less, one or two lines of setup, and here you go with a lot of beautiful, interesting, and insightful metrics that you will not get out of the box, and if you look at other monitoring tools, it takes a lot of time for you to set up, for you to configure, for you to consistently maintain and to give you those consistent metrics you need to know what’s going on with your system from time to time.

Corey: The problem I always ran into was that the traditional tools that I was used to using in data centers worked pretty well because you didn’t have a whole lot of variability on an hour-to-hour basis. Sure, when you installed new servers or brought up new virtual machines, you had to update the monitoring system. But then you started getting into this world of ephemerality with auto-scaling originally, and later containers, and—God help us all—Lambda now, where it becomes this very strange back-and-forth story of, you need to be able to build something that, I guess, is responsive to that. And there’s no good way to get access to some of the things that CloudWatch provides, just because we didn’t have access into AWS’s systems the way that they do. The inverse, though, is that they don’t have access into things running inside of the hypervisor; a classic example has always been memory: memory usage is an example of something that hasn’t been able to be displayed traditionally without installing some sort of agent inside of it. Is that still the case? Are there better ways of addressing those things now?

Ewere: So, that’s still the case, I mean, for EC2 instances. So before, now, we had an agent called a CloudWatch agent. Now, there’s a new agent called Unified Cloudwatch Agent which is, I mean, a top-notch from CloudWatch agent. So, at the moment, basically, that’s what happens on the EC2 layer. But the good thing is when you’re working with containers, or more or less Kubernetes kind of applications or systems, everything comes out of the box.

So, with containers, we’re talking about a [laugh] lot of moving parts. The container themselves with their own CPU, memory, disk, all the metrics, and then the nodes—or the EC2 instance of the virtual machines running behind them—also having their own unique metrics. So, within the container world, these things are just a click of a button. Everything happens at the same time as a single entity, but within the EC2 instance and ecosystem, you still find this there, although the setup process has been a bit easier and much faster. But in the container world, that problem has totally been eliminated.

Corey: When you take a look at someone who’s just starting to get a glimmer of awareness around what CloudWatch is and how to contextualize it, what are the most common mistakes people make early on?

Ewere: I also talked about this in my book, and one of the mistakes people make in terms of CloudWatch, and monitoring in generalities: “What am I trying to figure out?” [laugh]. If you don’t have that answer clearly stated, you’re going to run into a lot of problems. You need to answer that question of, “What am I trying to figure out?” I mean, monitoring is so broad, monitoring is so large that if you do not have the answer to that question, you’re going to get yourself into a lot of trouble, you’re going to get yourself into a lot of confusion, and like I said, if you don’t understand what you’re trying to figure out in the first place, then you’re going to get a lot of data, you’re going to get a lot of information, and that can get you confused.

And I also talked about what I call alarm fatigues or incident fatigues. This happens when you configure so many alarms, so many metrics, and you’re getting a lot of alarms hitting and notification services—whether it’s Slack, whether it’s an email—and it causes fatigue. What happens here is the person who should know what is going on with the system gets a ton of messages and in that scenario can miss something very important because there’s so many messages coming in, so many integrations coming in. So, you should be able to optimize appropriately, to be able to, like you said, conceptualize what you’re trying to figure out, what problems are you trying to solve? Most times you really don’t figure this out for a start, but there are certain bare minimums you need to know about, and that’s part of what I talked about in the book.

One of the things that I highlighted in the book when I talked about monitoring of different layers is, when you’re talking about monitoring of infrastructure, say compute services, such as virtual machines, or EC2 instances, the certain baseline and metrics you need to take note of that are core to the reliability, the scalability, and the efficiency of your system. And if you focus on these things, you can have a baseline starting point before you start going deeper into things like observability and knowing what’s going on entirely with your system. So, baseline understanding of—baseline metrics, and baseline of what you need to check in terms of different kinds of services you’re trying to monitor is your starting point. And the mistake people make is that they don’t have a baseline. So, we do not have a baseline; they just install a monitoring tool, configure a CloudWatch, and they don’t know the problem they’re trying to solve [laugh] and that can lead to a lot of confusion.

Corey: So, what inspired you from, I guess, kicking the tires on CloudWatch—the way that we all do—and being frustrated and confused by it, all the way to the other side of writing a book on it? What was it that got you to that point? Were you an expert on CloudWatch before you started writing the book, or was it, “Well, by the time this book is done, I will certainly know [laugh] more about the service than I did when I started.”

Ewere: Yeah, I think it’s a double-edged sword. [laugh]. So, it’s a combination of the things you just said. So, first of all, I have experienced with other monitoring tools; I have love for reliability and scalability of a system. I started Kubernetes at some of the early times Kubernetes came out, when it was very difficult to deploy, when it was very difficult to set up.

Because I’m looking at how I can make systems a little bit more efficient, a little bit more reliable than having to handle a lot of things like auto-scaling, having to go through the process of understanding how to scale. I mean, that’s a school of its own that you need to prepare yourself for. So, first of all, I have a love for making sure systems are reliable and efficient, and second of all, I also want to make sure that I know what is going on with my system per time, as much as possible. The level of visibility of a system gives you the level of control and understanding
of what your system is doing per time. So, those two things are very core to me.

And then thirdly, I had a plan of a streak of books I want to write based on AWS, and just like monitoring is something that is just new. I mean, if you go to the package website, this is the first book on infrastructure monitoring AWS with CloudWatch; it’s not a very common topic to talk about. And I have other topics in my head, and I really want to talk about things like networking, and other topics that you really need to go deep inside to be able to appreciate the value of what you see in there with all those scenarios because in this book, every chapter, I created a scenario of what a real-life monitoring system or what you need to do looks like. So, being that I have those premonitions, I know that whenever it came to, you know, to share with the world what I know in monitoring, what I’ve learned in monitoring, I took a [unintelligible 00:12:26]. And then secondly, as this opportunity for me to start telling the world about the things I learned, and then I also learned while writing the book because there are certain topics in the book that I’m not so much of an expert in things, like big data and all that.

I had to also learn; I had to take some time to do more research, to do more understanding. So, I use CloudWatch, okay? I’m kind of good in CloudWatch, and also, I also had to do more learning to be able to disseminate this information. And also, hopefully, X-Ray some parts of monitoring and different services that people do not really pay so much attention into.

Corey: What do you find that is still the most, I guess, confusing to you as you take a look across the ecosystem of the entire CloudWatch space? I mean, every time I play with it, I take a look, and I get lost in, “Oh, they have contributor analyses, and logs, and metrics.” And it’s confusing, and every time I wind up, I guess, spiraling out of control. What do you find that, after all of this, is a lot easier for you, and what do you find that’s a lot more understandable?

Ewere: I’m still going to go back to the containers part. I’m sorry, I’m in love containers. [laugh].

Corey: No, no, it’s fair. Containers are very popular. Everyone loves them. I’m just basically anti-container based upon no better reason than I’m just stubborn and bloody-minded most of the time.

Ewere: [laugh]. So, pretty much like I said, I kind of had experience with other monitoring tools. Trust me, if you want to configure proper container monitoring for other tools, trust me, it’s going to take you at least a week or two to get it properly, from the dashboards, to the login configurations, to the piping of the data to the proper storage engine. These are things I talked about in the book because I took monitoring from the ground up. I mean, if you’ve never done monitoring before, when you take my book, you will understand the basic principles of monitoring.

And [funny 00:14:15], you know, monitoring has some big data process, like an ETL process: extraction, transformation, and writing of data into an analytic system. So, first of all, you have to battle that. You have to talk about the availability of your storage engine. What are you using? An Elasticsearch? Are you using an InfluxDB? Where do you want to store your data? And then you have to answer the question of how do I visualize the data? What method do I realize this data? What kind of dashboards do I want to use? What methods of representation do I need to represent this data so that it makes sense to whoever I’m sharing this data with. Because in monitoring, you definitely have to share data with either yourself or with someone else, so the way you present the data needs to make sense. I’ve seen graphs that do not make sense. So, it requires some level of skill. Like I said, I’ve [unintelligible 00:15:01] where I spent a week or two having to set up dashboards. And then after setting up the dashboard, someone was like, “I don’t understand, and we just need, like, two.” And I’m like, “Really?” [laugh]. You know? Because you spend so much time. And secondly, you discover that repeatability of that process is a problem. Because some of these tools are click and drag; some of them don’t have JSON configuration. Some do, some don’t. So, you discover that scalability of this kind of system becomes a problem. You can’t repeat the dashboards: if you make a change to the system, you need to go back to your dashboard, you need to make some changes, you need to update your login, too, you need to make some changes across the layer. So, all these things is a lot of overhead [laugh] that you can cut off when you use things like Container Insights in CloudWatch—which is a feature of CloudWatch. So, for me, that’s a part that you can really, really suck out so much juice from in a very short time, quickly and very efficiently. On the flip side, when you talk about monitoring for big data services, and monitoring for a little bit of serverless, there might be a little steepness in the flow of the learning curve there because if you do not have a good foundation in serverless, when you get into [laugh] Lambda Insights in CloudWatch, trust me, you’re going to be put off by that; you’re going to get a little bit confused. And then there’s also multifunction insights at the moment. So, you need to have some very good, solid foundation in some of those topics before you can get in there and understand some of the data and the metrics that CloudWatch is presenting to you. And then lastly, things like big data, too, there are things that monitoring is still being properly fleshed out. Which I think that in the coming months and years to come, they will become more proper and they will become more presentable than they are at the moment.

Corey: This episode is sponsored by our friends at Oracle HeatWave is a new high-performance accelerator for the Oracle MySQL Database Service. Although I insist on calling it “my squirrel.” While MySQL has long been the worlds most popular open source database, shifting from transacting to analytics required way too much overhead and, ya know, work. With HeatWave you can run your OLTP and OLAP, don’t ask me to ever say those acronyms again, workloads directly from your MySQL database and eliminate the time consuming data movement and integration work, while also performing 1100X faster than Amazon Aurora, and 2.5X faster than Amazon Redshift, at a third of the cost. My thanks again to Oracle Cloud for sponsoring this ridiculous nonsense.

Corey: The problem I’ve always had with dashboards is it seems like managers always want them—“More dashboards, more dashboards”—then you check the usage statistics of who’s actually been viewing the dashboards and the answer is, no one since you demoed it to the execs eight months ago. But they always claim to want more. How do you square that?I guess, slicing between what people asked for and what they actually use.

Ewere: [laugh]. So yeah, one of the interesting things about dashboards in terms of most especially infrastructure monitoring, is the dashboards people really want is a revenue dashboards. Trust me, that’s what they want to see; they want to see the money going up, up, up, [laugh] you know? So, when it comes to—

Corey: Oh, yes. Up and to the right, then everyone’s happy. But CloudWatch tends to give you just very, very granular, low-level metrics of thing—it’s hard to turn that into something executives care about.

Ewere: Yeah, what people really care about. But my own take on that is, the dashboards are actually for you and your team to watch, to know what’s going on from time to time. But what is key is setting up events across very specific and sensitive data. For example, when any kind of sensitive data is flowing across your system and you need to check that out, then you tie a metric to that, and in turn alarm to it. That is actually the most important thing for anybody.

I mean, for the dashboards, it’s just for you and your team, like I said, for your personal consumption. “Oh, I can see all the RDS connections are getting too high, we need to upgrade.” Oh, we can see that all, the memory, there was a memory spike in the last two hours. I know that’s for you and your team to consume; not for the executive team. But what is really good is being able to do things like aggregate data that you can share.

I think that is what the executive team would love to see. When you go back to the core principles of DevOps in terms of the DevOps Handbook, you see things like a mean time to recover, and change failure rate, and all that. The most interesting thing is that all these metrics can be measured only by monitoring. You cannot change failure rates if you don’t have a monitoring system that tells you when there was a failure. You cannot know your release frequency when you don’t have a metric that measures number of deployments you have and is audited in a particular metric or a particular aggregator system.

So, we discovered that the four major things you measure in DevOps are all tied back to monitoring and metrics, at minimum, to understand your system from time to time. So, what the executive team actually needs is to get a summary of what’s going on. And one of the things I usually do for almost any company I work for is to share some kind of uptime system with them. And that’s where CloudWatch Synthetics Canary come in. So, Synthetic Canary is a service that helps you calculate that helps you check for uptime of the system.

So, it’s a very simple service. It does a ping, but it is so efficient, and it is so powerful. How is it powerful? It does a ping to a system and it gets a feedback. Now, if the status code of your service, it’s not 200 or not 300, it considers it downtime.

Now, when you aggregate this data within a period of time, say a month or two, you can actually use that data to calculate the uptime of your system. And that uptime [unintelligible 00:19:50] is something you can actually share to your customers and say, “Okay, we have an SLA of 99.9%. We have an SLA of 99.8%.” That data should not be doctored data; it should not be a data you just cook out of your head; it should be based on your system that you have used, worked with, monitored over a period of time so that the information you share with your customers are genuine, they are truthful, and they are something that they can also see for themselves.

Hence companies are using [unintelligible 00:20:19] like status page to know what’s going on from time to time whenever there is an incident and report back to their customers. So, these are things that executives will be more interested in than just dashboards, [laugh] dashboards, and more dashboards. So, it’s more or less not about what they really ask for, but what you know and what you believe you are going to draw value from. I mean, an executive in a meeting with a client and says, “Hey, we got a system that has 99.9% uptime.”

He opens the dashboard or he opens the uptime system and say, “You see our uptime? For the past three months, this has been our metric.” Boom. [snaps fingers]. That’s it. That’s value, instantly. I’m not showing [laugh] the clients and point of graphs, you know? “Can you explain the memory metric?” That’s not going to pass the message, send the message forward.

Corey: Since your book came out, I believe, if not, certainly by the time it was finished being written and it was in review phase, they came out with Managed Prometheus and Managed Grafana. It looks almost like they’re almost trying to do a completely separate standalone monitoring stack of AWS tooling. Is that a misunderstanding of what the tools look like, or is there something to that?

Ewere: Yeah. So, I mean by the time those announced at re:Invent, I’m like, “Oh, snap.” I almost told my publisher, “You know what? We need to add three more chapters.” [laugh]. But unfortunately, we’re still in review, in preview.

I mean, as a Hero, I kind of have some privilege to be able to—a request for that, but I’m like, okay, I think it’s going to change the narrative of what the book is talking about. I think I’m going to pause on that and make sure this finishes with the [unintelligible 00:21:52], and then maybe a second edition, I can always attach that. But hey, I think there’s trying to be a galvanization between Prometheus, Grafana, and what CloudWatch stands for. Because at the moment, I think it’s currently on pre-release, it’s not fully GA at the moment, so you can actually use it. So, if you go to Container Insights, you can see that you can still get how Prometheus and Grafana is presenting the data.

So, it’s more or less a different view of what you’re trying to see. It’s trying to give you another perspective of how your data is presented. So, you’re going to have CloudWatch: it’s going to have CloudWatch dashboards, it’s going to have CloudWatch metrics, but hey, this different tools, Prometheus, Grafana, and all that, they all have their unique ways of presenting the data. And part of the reason I believe AWS has Prometheus and Grafana there is, I mean, Prometheus is a huge cloud-native open-source monitoring, presentation, analytics tool; it packs a lot of heat, and a lot of people are so used to it. Everybody like, “Why can’t I have Prometheus in CloudWatch?”

I mean—so instead of CloudWatch just being a simple monitoring tool, [unintelligible 00:22:54] CloudWatch has become an ecosystem of monitoring tool. So, we got—we’re not going to see cloud [unintelligible 00:23:00], or just [unintelligible 00:23:00] log, analytics, metrics, dashboards, no. We’re going to see it as an ecosystem where we can plug in other services, and then integrate and work together to give us better performance options, and also different perspectives to the data that is being collected.

Corey: What do you think is next, as you take a look across the ecosystem, as far as how people are thinking about monitoring and observability in a cloud context? What are they missing? Where’s the next evolution lead?

Ewere: Yeah, I think the biggest problem with monitoring, which is part of the introduction part of the book, where I talked about the basic types of monitoring—which is proactive and reactive monitoring—is how do we make sure we know before things happen? [laugh]. And one of the things that can help with that is machine learning. There is a small ecosystem that is not so popular at the moment, which talks about how we can do a lot of machine learning in DevOps monitoring observability. And that means looking at historic data and being able to predict on the basic level.

Looking at history, [then are 00:24:06] being able to predict. At the moment, there are very few tools that have models running at the back of the data being collected for monitoring and metrics, which could actually revolutionize monitoring and observability as we see it right now. I mean, even the topic of observability is still new at the moment. It’s still very integrated. Observability just came into Cloud, I think, like, two years ago, so it’s still being matured.

But one thing that has been missing is seeing the value AI can bring into monitoring. I mean, this much [unintelligible 00:24:40] practically tell us, “Hey, by 9 p.m. I’m going to go down. I think your CPU or memory is going down. I think I’m line 14 of your code [laugh] is a problem causing the bug. Please, you need to fix it by 2 p.m. so that by 6 p.m., things can run perfectly.” That is going to revolutionize monitoring. That’s going to revolutionize observability and bring a whole new level to how we understand and monitor the systems.

Corey: I hope you’re right. If you take a look right now, I guess, the schism between monitoring and observability—which I consider to be hipster monitoring, but they get mad when I say that—is there a difference? Is it just new phrasing to describe the same concepts, or is there something really new here?

Ewere: In my book, I said, monitoring is looking at it from the outside in, observability is looking at it from the inside out. So, what monitoring does not see under, basically, observability sees. So, they are children of the same mom. That’s how I put it. One actually needs the other and both of them cannot be separated from each other.

What we’ve been working with is just understanding the system from the surface. When there’s an issue, we go to the aggregated results that come out of the issue. Very basic example: you’re in a Java application, and we all know Java is very memory intensive, on the very basic layer. And there’s a memory issue. Most times, infrastructure is the first hit with the resultant of that.

But the problem is not the infrastructure, it’s maybe the code. Maybe garbage collection was not well managed; maybe they have a lot of variables in the code that is not used, and they’re just filling up unnecessary memory locations; maybe there’s a loop that’s not properly managed and properly optimized; maybe there’s a resource on objects that has been initialized that has not been closed, which will cause a heap in the memory. So, those are the things observability can help you track. Those are the things that we can help you see. Because observability runs from within the system and send metrics out, while basic monitoring is about understanding what is going on on the surface of the system: memory, CPU, pushing out logs to know what’s going on and all that.

So, on the basic level, observability helps gives you, kind of, a deeper insight into what monitoring is actually telling you. It’s just like the result of what happened. I mean, we are told that the symptoms of COVID is coughing, sneezing, and all that. That’s monitoring. [laugh].

But before we know that you actually have COVID, we need to go for a test, and that’s observability. Telling us what is causing the sneezing, what is causing the coughing, what is causing the nausea, all the symptoms that come out of what monitoring is saying. Monitoring is saying, “You have a cough, you have a runny nose, you’re sneezing.” That is monitoring. Observability says, “There is a COVID virus in the bloodstream. We need to fix it.” So, that’s how both of them act.

Corey: I think that is probably the most concise and clear definition I’ve ever gotten on the topic. If people want to learn more about what you’re up to, how you view about these things—and of course, if they want to buy your book, we will include a link to that in the [show notes 00:27:40]—where can they find you?

Ewere: I’m on LinkedIn; I’m very active on LinkedIn, and I also shared the LinkedIn link. I’m very active on Twitter, too. I tweet once in a while, but definitely, when you send me a message on Twitter, I’m also going to be very active.

I also write blogs on Medium, I write a couple of blogs on Medium, and that was part of why AWS recognized me as a Hero because I talk a lot about different services, I help with comparing services for you so you can choose better. I also talk about setting basic concepts, too; if you just want to get your foot wet into some stuff and you need something very summarized, not AWS documentation per se, something that you can just look at and know what you need to do with the service, I talk about them also in my blogs. So yeah, those are the two basic places I’m in: LinkedIn and Twitter.

Corey: And we will, of course, put links to that in the [show notes 00:28:27]. Thank you so much for taking the time to speak with me. I appreciate it.

Ewere: Thanks a lot.

Corey: Ewere Diagboya, head of cloud at My Cloud Series. I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you hated this podcast, please leave a five-star review on your podcast platform of choice along with a comment telling me how many more dashboards you would like me to build that you will never look at.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.

Keeping the Cloudwatch with Ewere Diagboya

Episode Summary

Episode Show Notes & Transcript

You might also like

See Why GenAI Workloads Are Breaking Observability with Wayne Segar

Presenting at re:Invent with Matt Berk and Bowen Wang

The Latest State of IaC with Ido Neeman

Get the Newsletter

Sponsor an Episode