LIVIN' ON THE EDGE PODCAST
Ep. 16

S3 Ep16: Cutting Costs with Observability - Beyond Monitoring Best Practices

About

Discover top observability best practices to cut costs and enhance performance. Go beyond monitoring to optimize your cloud-native applications.

Episode Guests

Vasil Kaftandzhiev
Product Manager at Grafana
Over 11 years of combined experience in product management and IT operations. Expert in customer-centric strategies, product development & agile methodologies. Laser-focused on developing products that "customers will love - and that will work for the business". Passionate about innovation, entrepreneurship, K8s & AI. Influenced by leaders like Marty Cagan, Guy Kawasaki, Seth Godin & Geoffrey Moore.

On the latest Livin’ on the Edge podcast we interviewed Vasil Kaftandzhiev, a Product Manager at Grafana. We explored the importance of observability in IT systems, particularly in the context of cloud infrastructure and Kubernetes management.

Observability is the continuous monitoring of a system and of business KPIs with the goal of understanding why something is happening. It goes beyond the here-and-now of plain monitoring (which is itself a key part of observability) and extends to analyzing and understanding broader problems, the underlying system, and root causes.

Vasil’s team at Grafana has been focused on building opinionated observability solutions that are based on technology best practices and observability best practices. He shared a few of those best practices with us below. We also dove into the actual difference between plain monitoring and true observability, the role of AI and ML within observability, and, of course, the importance of resource utilization and cost management in Kubernetes.

Transcript

00:02.04
Jake Beck
Welcome back to another episode of the Livin' on the Edge podcast. I am here with Vasil from Grafana. If you want to go ahead and tell us a little bit about yourself, that would be great.

00:12.78
Vasil Kaftandzhiev
Jake, thank you very much for having me. I'm Vasil Kaftandzhiev, and I'm a staff product manager at Grafana Labs. I'm based out of Sofia, Bulgaria, and I've spent the last 10 or so years developing solutions for IT, for endpoint and cloud management, and finally landed here doing cloud infrastructure observability. I like to joke that my background goes from the barest metal straight to the cloud.

00:48.34
Jake Beck
That's awesome. So what brought you to observability land? If you start from bare metal, observability is one of the most important things in all of our systems, and it often gets neglected, I feel. So I'm curious what led you to wanting to work with Grafana and work on one of the most commonly used observability platforms.

01:15.22
Vasil Kaftandzhiev
I can tell you that. So every single company all around the world says one thing: we are making data-driven decisions. And this is such a common theme that we have forgotten what it actually means. Rarely are data-driven decisions actually data-driven. That's because observability done from the back of your head, done on a sheet of paper, or done without a sufficient amount of understanding or experience is plain monitoring. And plain monitoring does not drive you to data-driven decisions. It gives you a nice initial hunch of where you want to be, but the hunch just transforms into a gut feeling and you go with your gut feeling. Whereas data-driven decisions are observability. Data-driven decisions are SLOs,

02:14.57
Vasil Kaftandzhiev
Data-driven decisions are thresholds, breaches of thresholds, alerting, machine learning, predictions, and all of those kinds of things. I actually started with developing solutions for endpoints, where managing endpoints can be rather tricky if you don't have those thresholds and you don't have an actual baseline you can refer to. Having to make quick data-driven decisions to avoid or react to an incident is what brought me to observability, because it made me think:

03:03.48
Vasil Kaftandzhiev
The better you do observability, the fewer late evenings you will spend on call. There you go.
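
To make the SLO and threshold idea concrete, here is a minimal back-of-the-envelope sketch of an error-budget check, assuming a 99.9% availability objective over a rolling 30-day window; the request counts are illustrative, not figures from the episode.

```python
# A back-of-the-envelope SLO check: availability against a target, plus how much
# of the error budget remains for the rolling window. All numbers are illustrative.
SLO_TARGET = 0.999          # 99.9% availability objective
total_requests = 1_200_000  # requests served in the rolling 30-day window
bad_requests = 950          # requests that violated the SLI (errors, too slow, ...)

availability = 1 - bad_requests / total_requests
error_budget = (1 - SLO_TARGET) * total_requests   # bad requests we can afford
budget_left = 1 - bad_requests / error_budget      # fraction of budget remaining

print(f"availability:           {availability:.4%}")
print(f"error budget remaining: {budget_left:.1%}")

if availability < SLO_TARGET:
    print("ALERT: SLO breached, page the on-call")
elif budget_left < 0.25:
    print("WARN: error budget nearly spent, slow down risky changes")
```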

03:11.25
Jake Beck
Which I think we all love, right? No one wants that 1 a.m. or 2 a.m. page saying the system's down or not scaling properly or whatever. What I find interesting, and I guess I hadn't thought too much about it until you brought it up, is differentiating between monitoring and observability. You described two different things: monitoring is just having dashboards that you go in and manually look at, using that data to chase hunches, but without the SLOs or the actually data-driven part. What helps you get down the path of moving from "I'm just monitoring my applications" to having true observability?

03:54.91
Vasil Kaftandzhiev
True observability is usually based on best practices, both on the observability side and on the monitoring side, bubbling the signal up from the noise, but also on understanding the technology that you're trying to observe, regardless of what that technology is.

04:13.34
Jake Beck
Hey.

04:17.56
Vasil Kaftandzhiev
Blending best practices from managing, I'll take Kubernetes as a reference, making observability products for Kubernetes means that you need to understand both the technology that lies behind Kubernetes and observability or monitoring best practices. Blending them together gives you a natural observability solution, rather than just gazing at your screen and seeing nice, shiny, colorful graphs going up and to the right or doing something else.

04:53.45
Vasil Kaftandzhiev
This is the easiest way that I can explain it. Yeah.

04:59.54
Jake Beck
Yeah, I think that makes sense. You brought up Kubernetes, and I was actually reading through a couple of your articles about observing and monitoring Kubernetes. Kubernetes is growing so rapidly and is being used by pretty much all of the top players in the tech world at this point. Your article about cutting costs within Kubernetes by implementing proper observability I found really interesting. Do you want to speak to that a little more, how you can actually use observability to help cut those costs that everyone in this current market is so worried about?

05:38.89
Vasil Kaftandzhiev
Absolutely. And here is where I'm going to tie it back to understanding Kubernetes technology best practices and managing your fleet, your infrastructure, well enough so you can cut costs.

05:52.12
Vasil Kaftandzhiev
So let's start from the beginning. Now, if you want to manage costs in Kubernetes, you need to have good resource utilization. What that means is aiming towards some good key performance indicators, key metrics that you want to monitor and understand well. Now, starting with efficiency, decreasing cost means that you need to be utilizing all of the resources that you're feeding your nodes and your fleet in general, so you can balance against over-provisioning resources that you later pay for daily, with all of the cloud provider bills going

06:38.46
Vasil Kaftandzhiev
up and to the right, if we're talking in graphical visualizations. So you need to have a nice utilization somewhere between 70 and 90 percent across all of the resources that you have. This includes CPU, RAM, GPU, and even network if you must. So how can you do that? Let's make it simple and talk mainly about CPU and RAM. The first and most important thing that you need to do is have these blended into your observability solution one way or another, with requests and limits. This touches the idea of knowing

07:25.26
Vasil Kaftandzhiev
the best practices for managing your Kubernetes fleet. Now, starting from there, if we think about best practices for managing Kubernetes costs, we will usually fall into the idea that plain dashboards will make it, that this is the only thing we need, and this is not the case. You do first need to have something, and dashboards may be fundamental, but next, you need to make sure that you have requests and limits set on your containers. Requests and limits generally represent the guardrails for the resources and how they're utilized across your Kubernetes fleet. The lower the requests and limits are, the fewer resources you use,

08:18.29
Vasil Kaftandzhiev
and the lower your cloud provider bill is, because you essentially pay for requested resources, not for utilized resources.
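
As a concrete illustration of setting those guardrails, here is a minimal sketch that patches resource requests and limits onto a deployment using the official Kubernetes Python client; the deployment name, namespace, and the specific CPU and memory values are hypothetical placeholders, not recommendations from the episode.

```python
# A minimal sketch, assuming the official `kubernetes` Python client, a working
# kubeconfig, and a (hypothetical) deployment called "web" in the "default"
# namespace. The CPU/memory values are placeholders, not recommendations.
from kubernetes import client, config

config.load_kube_config()          # use load_incluster_config() inside a cluster
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [{
                    "name": "web",  # must match the container name in the deployment
                    "resources": {
                        "requests": {"cpu": "250m", "memory": "256Mi"},
                        "limits":   {"cpu": "500m", "memory": "512Mi"},
                    },
                }]
            }
        }
    }
}

apps.patch_namespaced_deployment(name="web", namespace="default", body=patch)
print("requests/limits applied to deployment 'web'")
```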

08:24.99
Jake Beck
Mm hmm.

08:26.63
Vasil Kaftandzhiev
You give the cloud provider exactly as much as you ask from them; it really doesn't matter how much of it you utilize. It stays there and you pay for it. That's it. So, starting from there and thinking of the best practices that you can use to lower your cost: A, have your requests and limits set; B, aim for good resource utilization; and C, treat costs and cost spikes as incidents. This is a nice thing to think about. Now, imagine a situation where a person gets that ring at 2 a.m. saying the cluster is down or this application in particular doesn't work.

09:24.82
Vasil Kaftandzhiev
We have all been in this situation. But starting now, or starting maybe two years ago, cost has become as vital as incidents to the feasibility of modern application development. Let's imagine that you're working at an early-stage startup, or that you are an SRE in a newly formed cloud infrastructure administration unit at a big enterprise. What essentially happens is that if you do not utilize your resources well enough, which is what the majority of actual enterprises do, 30% of all Kubernetes-related resources go to waste. And if we imagine that you're doing worse than that, 30 or 40 or 50 or even 60%, and I've seen that,

10:20.48
Vasil Kaftandzhiev
you put in jeopardy the feasibility of the software that you're developing, because the cloud cost can get out of hand so quickly that the cost of maintaining stable and reliable software can exceed what you can bear as cost in, for example, an early startup, or it can decrease the ROI of your SRE business unit so drastically that it starts raising questions about whether we should even have such a division in our enterprise. So, yeah.

10:59.91
Vasil Kaftandzhiev
These are some of the things that you really need to consider first. And then, keeping in mind that this is so vital for the current ecosystem and the current economic climate, we need to treat cost management like incidents. And what do we do with incidents, so we can treat costs as such? We have alerting, for instance. And we have SLOs and error budgets. So we surely need to have alerts for cost and resource utilization, and SLOs for those two that I mentioned. What does that mean in practice? Imagine that you're sitting at your desk, it's Tuesday, it's 2 p.m., and you get a ring saying half of your workloads do not comply with the policies that we have in this startup,

11:52.52
Vasil Kaftandzhiev
and those deployments are exceeding the normal usage that they had for the last 30 days by 100%. So that's a problem. You need to go there, see what's going on, who designed and deployed these deployments, and why they're costing your company so much. And on the other side of the coin, what happens if all of your nodes are running at 30% and you are essentially paying 70% more than you should to your cloud service provider for all of those unused millicores and RAM in your fleet?
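
Here is a minimal sketch of what "treat cost spikes as incidents" plus the 70-90% utilization band could look like in code; the two data functions are hypothetical stand-ins for a billing export and a metrics query, and all numbers are made up for illustration.

```python
# A sketch of treating cost like an incident: compare today's spend with the trailing
# 30-day baseline and check nodes against the 70-90% utilization band. The two data
# functions are hypothetical stand-ins for your billing export and metrics backend.
from statistics import mean

def daily_costs_last_30d() -> list[float]:
    # Hypothetical stand-in for a cloud billing export (USD per day, oldest first).
    return [118.0] * 29 + [305.0]

def node_cpu_utilization() -> dict[str, float]:
    # Hypothetical stand-in for a metrics query: node name -> CPU utilization (0..1).
    return {"node-a": 0.31, "node-b": 0.82, "node-c": 0.95}

costs = daily_costs_last_30d()
baseline, today = mean(costs[:-1]), costs[-1]

# Fire an incident-style alert if today exceeds the 30-day normal by 100%.
if today > 2 * baseline:
    print(f"COST INCIDENT: ${today:.0f} today vs ${baseline:.0f} 30-day baseline")

# Flag nodes outside the 70-90% band: too low wastes money, too high risks throttling.
for node, util in node_cpu_utilization().items():
    if util < 0.70:
        print(f"{node}: {util:.0%} utilized, over-provisioned")
    elif util > 0.90:
        print(f"{node}: {util:.0%} utilized, running hot")
```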

12:35.59
Jake Beck
Yeah, you made a lot of great points there. The thing about requests and limits I find so important. At a couple of different places I've worked now, you'd look at the Kubernetes config and none of the deployments or anything had requests or limits set on them. And it's just a free-for-all at that point, right? If you're in a multi-tenant environment, you're a bad tenant, and your applications can just run away with how many resources they use, which in turn costs you a lot of money. So it really lowers that value proposition, like you said. What I found the most interesting was treating cost spikes as incidents. It's not something I'd necessarily thought of, but it makes a lot of sense, right?

13:26.89
Jake Beck
For the most part, if an application runs in a stable state, your cost is pretty static. So if things start to, like you said, go up and to the right in a graphical sense, something's wrong and you should treat it as such: this is an incident, we need to figure out what's happening here. And whether that's going back and re-architecting, or even adding those resource limits and requests, that's often one of the solutions you should have had in the first place that can prevent those incidents in the future.

13:50.97
Vasil Kaftandzhiev
Thank you.

14:02.65
Jake Beck
So I think that's really interesting, even having those post-mortems, which are so important for incidents, right, to make sure they don't happen again and don't cost us more money. We at Ambassador are a startup, so all of those dollars count at the end of the day, and if we can prevent those spikes from happening, that's huge for us.

14:27.18
Vasil Kaftandzhiev
Absolutely. And if you think about it, automation is key, with all of the autoscaling that has happened, both vertical and horizontal, et cetera. Things get hairy quite quickly, and it can sometimes be, not inevitable, but really hard to turn around.

14:37.67
Jake Beck
Yeah.

14:46.33
Vasil Kaftandzhiev
Diving even deeper, sometimes one month of data is absolutely insufficient to see those trends, because, as you said, costs are usually static and we can treat spikes as incidents, but we also need to understand what happened in the past, what happens in the present, and what will happen in the future. The hardest thing about Kubernetes is that it is really hard to manage what you cannot see.

14:46.53
Jake Beck
Yeah.

15:13.51
Jake Beck
Mhm.

15:13.43
Vasil Kaftandzhiev
All of the pods and all of the containers get constantly killed and recreated. And if you want to manage them, you definitely need to be able to see into the past and measure and manage all of those no-longer-existent pods and containers, so you can forecast for the future.
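
As a toy illustration of forecasting from historical data (Vasil notes just below that even a linear forecast would do), here is a sketch that fits a straight line to synthetic CPU-usage history and projects it 30 days ahead; all the numbers are made up.

```python
# A toy linear forecast over synthetic history: fit a line to 90 days of cluster CPU
# usage and project 30 days ahead, so planning is not limited to what currently exists.
import numpy as np

rng = np.random.default_rng(0)
days = np.arange(90)
cores_used = 40 + 0.25 * days + rng.normal(0, 1.5, size=days.size)  # fake history

slope, intercept = np.polyfit(days, cores_used, deg=1)

horizon = np.arange(90, 120)          # the next 30 days
forecast = slope * horizon + intercept

print(f"trend: {slope:+.2f} cores/day")
print(f"expected usage in 30 days: {forecast[-1]:.0f} cores")
```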

15:26.91
Jake Beck
Mhm.

15:34.23
Vasil Kaftandzhiev
So automation is key. You need ML, you need forecasting, you need AI; even a linear forecast would do, just so you can do all of these things. And you mentioned requests and limits. The latest CNCF report says that 37% of organizations have 50% or more workloads that need container rightsizing. This means that they either

16:12.45
Vasil Kaftandzhiev
throttle CPUs and kill performance because they want to save money, or they do the opposite and pay a lot of money to the cloud service providers. Additionally, this year 67% have more than 11% of workloads impacted by missing CPU requests and limits, which is actually better. In 2023 we had 78%, which is the majority. It's always the majority of organizations not having requests and limits, so you're spot on. This is something that we need to think about, and even if we're getting better, we need these observability solutions just to show us. It would be so easy if, when you sit down behind your keyboard, you just see something that says, hey,

17:03.09
Vasil Kaftandzhiev
set these requests and limits; the approximate amount of CPU and RAM that you're using is this, and you might want to tailor it that way. Several clicks, voila, it's done. And even better, you don't have to click, it does it for you. That can be the future of Kubernetes observability. But we can talk about this later, of course.
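
One way such a rightsizing suggestion could be derived from observed usage is sketched below: set the request near the 90th percentile of usage plus headroom, and the limit at twice the request. The samples and ratios are illustrative assumptions, not Grafana's actual recommendation logic.

```python
# A rightsizing sketch: derive a CPU request from what a container actually used,
# instead of guessing. The samples and ratios are illustrative assumptions only.
import numpy as np

cpu_usage_millicores = np.array([110, 130, 95, 160, 140, 120, 180, 150, 135, 125])

p90 = np.percentile(cpu_usage_millicores, 90)
request_m = int(round(p90 * 1.2))   # 20% headroom over typical peak usage
limit_m = request_m * 2             # generous ceiling before throttling kicks in

print(f"suggested: requests.cpu={request_m}m, limits.cpu={limit_m}m")
```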

17:25.55
Jake Beck
Yeah, for sure. Well, I think it cascades too, right? As a software engineer, I know my application, and we only get allocated whatever that allocation is. A lot of the time, especially with the way the cloud has gone, the lost art of resource management has kind of just been forgotten. And this brings it back in, right? I actually need to manage my application so it doesn't just run away, or the database connections: I don't just have unlimited database connections, I actually need to manage those resources. And I think Kubernetes helps bring a little of that back when you start setting those requests and limits. I only get eight CPUs, right? I don't just get to write code that uses tons and tons of CPU.

18:13.02
Jake Beck
And it kind of brings that art back to the engineers: hey, you need to write code that is efficient. Because for a while in the web, right?

18:20.73
Vasil Kaftandzhiev
That's absolutely.

18:22.15
Jake Beck
It kind of just was a forgotten thing, right? It was like, it's cheap.

18:27.13
Vasil Kaftandzhiev
That's definitely true, but at the end of the day, it should not be the developers' responsibility, at least in my head, to manage infrastructure. And this is what they say.

18:36.94
Jake Beck
Correct.

18:37.27
Vasil Kaftandzhiev
At least 82% of the engineers in another report that I read recently, what was it called, the Spectro Cloud report of 2024, say: in my head, I should not be managing infrastructure. Yet every third developer is actually doing it. So this is, again, where we come to the point that we need more things out of the box.

19:08.12
Vasil Kaftandzhiev
We need things to help us cover the basics, which are not so basic anymore, because everything is going fast.

19:14.99
Vasil Kaftandzhiev
Everything is, as you said, expensive. And I'm sure that if we cover the basics, it will be the 80-20 rule.

19:22.47
Vasil Kaftandzhiev
Yes, we'll have 20% of the cost that we won't be able to account for, but 80% of the cost can go well managed, which is fantastic in my head.

19:32.05
Jake Beck
Yeah, that makes perfect sense, right? Exactly, the 80-20. I don't think I've heard that applied to cost before, but it makes sense. There are always unexpected costs; there's always going to be something that happens, but that shouldn't be the majority of what you're paying for. If that's the majority, something's wrong: you're either not observing properly, or you don't have a proper grasp on how your applications are running. Maybe you aren't running the proper perf tests in lower environments to understand the resource allocation they need.

20:07.89
Jake Beck
But yeah, I think the 80-20 holds, right? Like you said, you can't eliminate everything, but if you're sitting at that 20%, you're doing a good job. I think that's really...

20:16.05
Vasil Kaftandzhiev
Yeah. And now imagine explaining this to a CFO. That's when things start getting out of hand. We were talking to a really seasoned SRE, and he was like, I can do it, I can build almost everything related to observability myself, I can leverage some of your knowledge, and we have a perfect solution. But how can I explain this to our CFO? So here we're talking about another perspective on cost management and cost monitoring.

20:57.70
Vasil Kaftandzhiev
On one hand, and this has three hands, I guess we'll see how many hands we end up with, but on one hand, you need to solve for technical excellence. You need to know all of the best practices related to the infrastructure well. Second, you need to solve for operational excellence, where you treat the cost issues like incidents. And finally, you need to solve for simplicity, so you'll be able to explain in an easier fashion what limits, requests, pods, containers, deployments, thresholds,

21:41.93
Vasil Kaftandzhiev
everything this means, to your CFO. You need to be able to demonstrate the ROI, either of your business unit or of you as the manager of this infrastructure, to those finance people in a way that they can easily consume, in a way where the numbers speak for themselves. And here, when we're talking about observability solutions, a good observability solution in my head needs to do all three of these things. It needs to be a Shiva-like creature with at least three hands that serves all of these different people in their day. So when I'm a CFO and I sit down and look at this report,

22:26.20
Vasil Kaftandzhiev
I know where all of this department's spend is going and how I can accommodate one, two, three, maybe five new developers, so we can develop actual software that is meaningful, reliable, and makes economic sense for the venture that I'm working in.

22:49.68
Jake Beck
Yeah, I'm going to quote you on that: anyone who's complaining that they're understaffed, all they have to do is properly implement observability and they'll get those extra developers they need.

23:02.28
Vasil Kaftandzhiev
That might or might not be the case, but if I were the CEO of this company, maybe yes, why not?

23:10.55
Jake Beck
But it's something to take into account, right? If you can spend extra time doing these things because it's going to save you money, maybe it gets you that extra developer, or it lets you market your product: it gives marketing more budget so they can market, your revenue goes up, and you still get that extra developer. Those cost savings go a long way. So implementing proper channels matters, and not just for viewing things in live reporting. You've mentioned it a couple of times: the long-term reporting aspect is super important, especially for people like CFOs.

23:46.75
Jake Beck
A one-month bucket doesn't mean that much. It's the six-month, the yearly utilization where you truly see the trends, and where you can start to use forecasts, which I want to get back to as well, since you've mentioned ML and AI and forecasting. You can't get away from AI in 2024, right? It's the topic of conversation for everybody. I am curious what you recommend, or what's out there, for implementing AI or ML into your observability, and how you can leverage that to make better decisions and even help it scale properly.

24:28.83
Vasil Kaftandzhiev
I can give an unpopular answer to this.

24:32.51
Jake Beck
I love unpopular answers.

24:33.95
Vasil Kaftandzhiev
Fantastic. Okay, so the majority of us, or every single one of us, is trying to implement AI within our solutions, the observability solutions that we work on day to day in any company, regardless of whether it is Grafana or any other company out there. So, as you said, you cannot escape from AI. What I would actually recommend is the other thing. This applies if you're designing an observability solution, or if you are evaluating an observability solution to buy. So, buy or build:

25:09.97
Vasil Kaftandzhiev
create a GPT for yourself. Feed it with several books related to the technology, and ask the AI: what are the best things that I should be looking at? Because if you understand what you're monitoring, you'll be monitoring right. Even if you have the best observability solution already there, this is going to buy you that time, time to start understanding, to start reading, to start making the right decisions, because observability cannot make decisions for you; it can only suggest them. So if you can understand the technology that you're monitoring well enough, do that rather than the other way around. You're going to get AI out of the box in one, two, or three months, or even now. So yeah.
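
As a toy sketch of that "create a GPT for yourself" idea, here is what it could look like with your own domain notes stuffed into a system prompt; it assumes the `openai` Python package and an OPENAI_API_KEY in the environment, the notes file and model name are hypothetical, and a real setup would use retrieval or fine-tuning over whole books rather than a single prompt.

```python
# A toy sketch: put your own curated material (runbooks, book excerpts, postmortems)
# into the system prompt and ask what to watch. File name and model are hypothetical.
from pathlib import Path
from openai import OpenAI

notes = Path("k8s_monitoring_notes.md").read_text()   # your own curated material

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": "You are an SRE assistant. Answer only from these notes:\n" + notes},
        {"role": "user",
         "content": "What are the most important signals to watch for a Kubernetes fleet?"},
    ],
)
print(response.choices[0].message.content)
```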

25:54.68
Jake Beck
Interesting. Yeah, the whole training-it-yourself thing, there are obviously caveats to doing that, but I think it makes sense, especially in observability, because no one platform is the same. If you use even GPT-4 or something, that's ingesting everyone's information, so you're going to get skewed data, versus if you feed one, like you said, with just your data, it's going to make much better decisions, because it's trained on your platform, not everybody's platform.

26:12.83
Vasil Kaftandzhiev
Yep.

26:25.10
Jake Beck
So I think that's interesting. It makes a lot of sense, though.

26:29.66
Vasil Kaftandzhiev
Great. So, while we're talking about AI and ML, can we step back?

26:39.31
Jake Beck
Mm hmm.

26:39.38
Vasil Kaftandzhiev
Can we just step back, since we have a bit of time, and talk to the new person who is just starting? The person who has been waiting for practical guidance because they're new to monitoring Kubernetes and wants to know the common, let's say, pitfalls or things they should be wary of when they start.

27:01.63
Jake Beck
Mmhmm.

27:05.56
Vasil Kaftandzhiev
So here are my takes for you. Any time the chatter in your head starts telling you these things, think of this podcast and solve for greatness rather than taking the procrastination path. Here they are. First: I really don't need to monitor Kubernetes. If you hear this in your head, you're most probably wrong. You will most probably pay a lot, or most probably something will crash and you'll end up on a call at 2 a.m. Second: I only need dashboards to monitor Kubernetes. Kubernetes is faster than you. It's smarter than me, for sure. But even if it is not smarter than you, it's faster. So you need good timing and to be alerted on what's happening in your own infrastructure.

27:58.43
Vasil Kaftandzhiev
Third: I only need to see into the future, the past is not important. Please remember, you cannot manage what you cannot see. In Kubernetes, "what you see is all there is" absolutely does not apply. You need to know what happened with all of those deleted resources. Fourth and last: I don't need requests and limits set. You need to set your requests and limits, unlike 67% of the organizations around the world.

28:36.09
Vasil Kaftandzhiev
I'm going to say it again: 30% of all resources related to Kubernetes go to waste, and this is year after year. These resources not only mean funds out of your pocket, they mean CO2, they mean coal, they mean Linux servers standing there buzzing their fans off. Apart from money, this harms the planet. And if you do a good job, at the end of the day we'll have feasible software that is profitable for your organization and does no harm to the environment, which is everything that we all want, isn't it?

29:15.38
Jake Beck
Yeah, absolutely. And I think that's a great place to end: those four pillars and then the environment part. Like you said, data centers aren't great for the environment, so the more you can do to help, the better, and that's resource utilization. Don't just have those servers running fans for no reason. So on that note, if you want to give a shout-out about where Grafana is going to be and what you're going to be doing, let us know.

29:43.83
Vasil Kaftandzhiev
Fantastic. So Grafana is on a journey to building opinionated observability solutions that are based on technology best practices and observability best practices that we don't only know, we build. So be with us, check us out. You can always visit play.grafana.org and play with everything new that is coming out. And if by any chance you see something that you like or dislike in the Kubernetes monitoring or cloud service provider monitoring, give me a ping. You can ping me on LinkedIn, for example. Vasil Kaftandzhiev is the name. Thank you very much. It was a huge pleasure, Jake.

30:29.58
Jake Beck
Awesome, man, it's always a pleasure. Thank you.

