

Beyond Monitoring: Top Observability Best Practices to Reduce Costs

Jake Beck
July 11, 2024 | 7 min read


“Every single company all around the world says one thing. We are making data-driven decisions. And this is such a common theme that we have forgotten what it actually means.”
Vasil Kaftandzhiev, Product Manager at Grafana.

On the latest Livin’ on the Edge podcast (and my last episode for Season 3), I interviewed Vasil Kaftandzhiev, a fellow Product Manager at Grafana. We explored the importance of observability in IT systems, particularly in the context of cloud infrastructure and Kubernetes management.

Observability is the constant monitoring of the system and of business KPIs with the goal of understanding why something is happening. It goes beyond the here-and-now of just monitoring (which itself is key to observability) and extends to the analysis and understanding of broader problems or issues, the underlying system and root causes.

Vasil’s team at Grafana has been focused on building opinionated observability solutions grounded in best practices for both the underlying technology and observability itself. He shared a few of those best practices with us below. We also dove into the actual difference between plain monitoring and true observability, the role of AI and ML within observability, and, of course, the importance of resource utilization and cost management in Kubernetes.

Monitoring Doesn’t Equal Observability

“If you understand what you're monitoring, that’s great. But having a solid observability solution is still going to buy you that time to start understanding, to start reading, and to start making the right decisions quicker. And it’s worth noting that observability cannot make decisions for you, it can just suggest them,” shares Vasil.

There's a difference between plain monitoring and true observability. While monitoring gives an initial hunch of where things stand and signals an issue, true observability involves understanding and predicting trends and patterns from that monitored data.

One thing I hate is when everyone assumes that observability is synonymous with monitoring, because it’s really about adding that data-driven decision-making on top of the monitoring you’re conducting. Vasil emphasized this idea as well, differentiating between mere monitoring of systems, which often leads to gut feelings, and true observability, which is based on thresholds, alerts, machine learning predictions, and other data-driven factors.

“Rarely data-driven decisions are actually data-driven. Because sometimes ‘observability’ is conducted from the back of your head, rather than done on a sheet of paper or done without a sufficient amount of understanding…that’s just plain monitoring,” explained Vasil. He further added, “True data-driven decisions are observability. Those decisions are your SLOs, thresholds, breaches of thresholds, alerting, machine learning, predictions, etc.”

To succeed, you need to truly understand the technology you are observing and follow observability best practices. Vasil mentioned that blending these two aspects results in an effective observability solution rather than just gazing at colorful graphs on a screen.

“True observability is usually based on best practices, both on the observability side and on the monitoring side, which represents the signal you’re basing your observations on,” shares Vasil.
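To make the idea of SLOs and threshold breaches concrete, here is a minimal Python sketch of an error-budget check. It is an illustration under assumptions, not something from the podcast: the `SloWindow` shape, the 99.9% target, and the 25% alerting floor are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one evaluation window (hypothetical data shape)."""
    total_requests: int
    failed_requests: int


def error_budget_remaining(window: SloWindow, slo_target: float) -> float:
    """Return the unspent fraction of the error budget (negative once the SLO is breached)."""
    if window.total_requests == 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * window.total_requests
    if allowed_failures == 0:
        # A 100% target leaves no budget at all.
        return 1.0 if window.failed_requests == 0 else float("-inf")
    return 1.0 - window.failed_requests / allowed_failures


def should_alert(window: SloWindow, slo_target: float, min_budget: float = 0.25) -> bool:
    """Alert when less than `min_budget` of the error budget is left (assumed policy)."""
    return error_budget_remaining(window, slo_target) < min_budget


if __name__ == "__main__":
    window = SloWindow(total_requests=100_000, failed_requests=90)
    print(f"budget left: {error_budget_remaining(window, 0.999):.2f}")
    print(f"alert: {should_alert(window, 0.999)}")
```

The point of writing the threshold down is the one Vasil makes: the decision to alert comes from an agreed number rather than from a gut feeling about a graph.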

Implementing Proper Observability Helps Cut Costs

“Starting from there, if we think about best practices to managing Kubernetes costs, we will usually fall into the idea that plain dashboards will make it, and that this is the only thing that we need, and this is not the case,” says Vasil.

Vasil discussed how implementing proper observability can help cut and contain Kubernetes costs. Effective cost management requires continuous monitoring of resource usage. Tools that provide visibility into CPU and memory usage, as well as application-specific metrics, are essential. Automated reporting and alerts can help your team react quickly to cost anomalies and see a greater ROI over time. Here’s how else observability can lead to cost savings:

  1. Right-Sizing Resources: Overprovisioning is common in Kubernetes environments. Resources, such as memory and storage, are often shared between teams using the same cluster. But without visibility into everyone’s environments, you’re at risk of over-allocating your resources and overspending on your cloud bills. Regularly profiling applications to understand their actual resource needs helps in right-sizing resources and reducing unnecessary costs. Observe frequently and you’ll right-size just as frequently.

“Up to 30% of all resources related to Kubernetes go to waste. And this is year after year. These resources not only mean funds in your pocket, they mean CO2, they mean coal, they mean Linux servers standing there and buzzing their fans off.”
Vasil notes

  2. Autoscaling: Resource quotas, limits, and requests are great first steps in containing Kubernetes costs, but they fall short of addressing the dynamic nature of resource demands. Autoscaling mechanisms such as the Horizontal Pod Autoscaler (HPA), the Vertical Pod Autoscaler (VPA), and the Cluster Autoscaler adjust resources dynamically based on workload demands. These tools help maintain efficiency by scaling up during peak times and scaling down when demand is low. For more on mastering horizontal scaling, check out our blog.
  3. Cluster and Workload Optimization: Downsizing clusters by eliminating underutilized resources and optimizing node types based on application requirements can significantly cut costs. Using tools for visualizing resource utilization and integrating monitoring solutions can aid in identifying idle resources that can be scaled down or right-sized.
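The right-sizing and autoscaling ideas in the list above can be sketched in a few lines of Python. This is a rough illustration under assumptions: the `WorkloadUsage` shape, the use of 95th-percentile CPU usage, and the 20% headroom are hypothetical choices, not recommendations from Vasil. The replica calculation follows the basic formula in the Kubernetes HPA documentation, desired = ceil(current replicas × current metric / target metric), and deliberately leaves out the tolerance and stabilization behavior of the real controller.

```python
import math
from dataclasses import dataclass


@dataclass
class WorkloadUsage:
    """Observed CPU figures for one workload (hypothetical data shape)."""
    name: str
    cpu_request_millicores: int    # what the workload asks the scheduler for
    cpu_p95_usage_millicores: int  # what it actually used at the 95th percentile


def waste_fraction(w: WorkloadUsage) -> float:
    """Share of the CPU request that sits unused even at p95 load."""
    if w.cpu_request_millicores <= 0:
        return 0.0
    unused = max(w.cpu_request_millicores - w.cpu_p95_usage_millicores, 0)
    return unused / w.cpu_request_millicores


def suggested_request(w: WorkloadUsage, headroom_percent: int = 20) -> int:
    """Propose a request of p95 usage plus some headroom (assumed policy)."""
    return math.ceil(w.cpu_p95_usage_millicores * (100 + headroom_percent) / 100)


def hpa_desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Basic HPA scaling rule, without tolerance or stabilization windows."""
    return math.ceil(current_replicas * current_metric / target_metric)


if __name__ == "__main__":
    checkout = WorkloadUsage("checkout", cpu_request_millicores=1000, cpu_p95_usage_millicores=300)
    print(f"{checkout.name}: {waste_fraction(checkout):.0%} of the CPU request is unused")
    print(f"suggested request: {suggested_request(checkout)}m")
    print(f"replicas at 90% utilization against a 60% target: {hpa_desired_replicas(4, 90, 60)}")
```

Running a report like this per workload on a schedule is one way to turn "observe frequently" into a routine rather than a one-off audit.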

And what happens if you see a cost spike? Vasil suggested treating cost spikes as incidents and using machine learning and AI for automation and forecasting.
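As a simple illustration of treating a cost spike as an incident, the sketch below flags a day whose spend is far above the recent baseline. It uses a plain mean-and-standard-deviation threshold rather than the machine learning Vasil mentions, and the function name, the sample figures, and the three-sigma cutoff are all assumptions for the example.

```python
from statistics import mean, stdev
from typing import Sequence


def is_cost_spike(history: Sequence[float], today: float, sigmas: float = 3.0) -> bool:
    """Return True when today's cost exceeds the historical mean by `sigmas` standard deviations."""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any increase counts as a spike.
        return today > baseline
    return today > baseline + sigmas * spread


if __name__ == "__main__":
    # Hypothetical daily cluster costs in dollars for the previous week.
    last_week = [100, 102, 98, 101, 99, 100, 103]
    for cost in (104, 140):
        status = "open an incident" if is_cost_spike(last_week, cost) else "within normal range"
        print(f"${cost}: {status}")
```

In practice the `True` branch would page someone or open a ticket, the same way a latency or error-rate breach would.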

AI’s Role in Observability Moving Forward

Speaking of AI, it’s worth noting that AI cannot replace the need to understand how your technology is being monitored. You should still learn AI best practices and how your data is fed to an AI system like ChatGPT to get better observability results. Use it as a tool to enhance the power of your observability solution, but don’t let it replace your own efforts entirely.

If you’re looking for more tips on how to master observability, check this out. I appreciate Vasil coming on the show.
