Observability in K8s: Metrics, Logs, and Traces

Observability in K8s

Without clear observability and visibility, it is impossible to understand what is going wrong during an incident or to get to its root causes. In traditional development models, monitoring focused on infrastructure and relied on logs. In the cloud-native space, an entire ecosystem must be monitored and understood. It becomes essential to combine metrics, traces, and logs, and to focus not just on infrastructure but also on applications, performance, and user experience.


What are observability and visibility in this context?

Observability and visibility are the two primary means of identifying and understanding what is going wrong during an incident and of digging into it afterward. When incidents occur in a cloud-native environment, the complexity of the infrastructure makes it difficult to get clear visibility into what happened and how the incident might be fixed. Observability and visibility go hand in hand. They provide ways not only to inspect, understand, and fix incidents as they happen but also to inform ongoing incident-prevention work, dive into systemic or root causes, and, most of all, build greater resilience.


What is visibility?

Visibility mirrors what would traditionally be thought of as monitoring. Its primary function is to indicate that something is wrong and to provide the basic metrics needed for troubleshooting. Monitoring has traditionally been the domain of ops engineers, but it has shifted to become a developer concern as well. With the complexity of containerized applications, developers are best positioned to understand what might be going wrong. In parallel, visibility has expanded to include new tools and techniques for investigating and diagnosing issues, in many cases built specifically for developers.


For example, a service catalog provides a centralized "source of truth" that lists services along with their ownership, dependencies, resources, and other metadata. It essentially delivers on the "single pane of glass" concept, where a developer can gain instant visibility into the full picture. Another example is distributed tracing. Because a distributed system spans multiple services, locating an issue requires a single logical trace that follows a request across all of them.


What is observability?

Observability is the constant monitoring of system and business KPIs with the goal of understanding why something is happening. It goes beyond the here-and-now of visibility (which is itself key to observability) and extends to the analysis and understanding of broader problems, the underlying system, and root causes. There is considerable overlap between visibility and observability. Observability simply encompasses more, including insight and potential actionability.


Observability and debugging for developers: Using traces to locate issues

Distributed tracing can be a very useful tool to enable a developer to locate issues within a complicated graph of microservices. For “deep systems”, where a single user’s request is often handled by multiple layers of services before returning a result, it is essential to be able to observe the path the request took through the system.
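To make the idea of "observing the path the request took" concrete, here is a minimal sketch of how a trace ties spans from several services together. It is not a real tracing library: the service names (api-gateway, checkout, inventory, payment), the Span class, and the in-memory COLLECTED list are all hypothetical, and the services are simulated as plain function calls. The point is only that every span carries the same trace ID plus the ID of the span that called it.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """One unit of work done by one service on behalf of a request."""
    trace_id: str             # shared by every span in the same request
    span_id: str              # unique to this span
    parent_id: Optional[str]  # span that called this one; None for the root
    service: str


# Stand-in for a tracing backend (hypothetical, in-memory only).
COLLECTED: List[Span] = []


def start_span(service: str, parent: Optional[Span] = None) -> Span:
    """Open a span, inheriting the trace ID from the caller if there is one."""
    trace_id = parent.trace_id if parent else uuid.uuid4().hex
    parent_id = parent.span_id if parent else None
    span = Span(trace_id, uuid.uuid4().hex[:16], parent_id, service)
    COLLECTED.append(span)
    return span


# Hypothetical services, simulated as functions that pass the span along.
def inventory_service(parent: Span) -> None:
    start_span("inventory", parent)


def payment_service(parent: Span) -> None:
    start_span("payment", parent)


def checkout_service(parent: Span) -> None:
    span = start_span("checkout", parent)
    inventory_service(span)
    payment_service(span)


def handle_request() -> str:
    """Entry point: creates the root span and returns the trace ID."""
    root = start_span("api-gateway")
    checkout_service(root)
    return root.trace_id


def render_trace(trace_id: str, spans: List[Span]) -> List[str]:
    """Rebuild the request path as an indented tree, one line per span."""
    in_trace = [s for s in spans if s.trace_id == trace_id]
    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for s in in_trace:
            if s.parent_id == parent_id:
                lines.append("  " * depth + s.service)
                walk(s.span_id, depth + 1)

    walk(None, 0)
    return lines


if __name__ == "__main__":
    print("\n".join(render_trace(handle_request(), COLLECTED)))
```

In a real system the trace and parent IDs would travel between services in request headers rather than function arguments, and a tracing backend would do the reassembly that render_trace does here.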


The why of observability for developers

The why of observability can inform the post-incident rundown and postmortem. Of potentially much greater and more lasting value, it can also drive the way applications are created from the outset. That is, cloud-native developers can practice "observability-driven development (ODD)": "defining instrumentation to determine what is happening in relation to a requirement before any code is written". “Just as you wouldn’t accept a pull-request without tests, you should never accept a pull-request unless you can answer the question, ‘how will I know when this isn’t working?’”


Observability can become a part of the development process itself, because it's possible to flip the script on development and use production to drive better code. How can this deliver benefits for developers? The insight gathered from shipping and running applications will strengthen future development by:

  • Enabling more data-driven development and product decisions
  • Helping avoid future incidents and issues by identifying root causes or systemic problems
  • Gathering more granular performance analysis
  • Contributing to evidence-based decision-making throughout the development process


Hands-On! Observability with Prometheus and Grafana


Prerequisites for hands-on tutorial with Prometheus and Grafana

  • A DigitalOcean Account, which you can create with $100 in credits automatically applied
  • A credit card (no purchase necessary, the card is needed to create an account) and your DigitalOcean credit code
  • A working version of these three command line tools:
      • kubectl (version 1.21 or higher)
      • (v3.6.0 or higher)
      • (1.64.0 or higher)


Follow the step-by-step instructions to configure Prometheus and Grafana on DigitalOcean managed Kubernetes.
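Once Prometheus is running, a quick way to confirm it is scraping targets is to ask its HTTP API for the built-in up metric. The sketch below is not part of the tutorial steps. It assumes Prometheus has been made reachable at http://localhost:9090 (for example through a port-forward); that address and the helper function names are assumptions to adapt to your own setup.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List, Tuple

# Assumption: Prometheus is reachable locally; change this for your cluster.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_instant_vector(payload: dict) -> List[Tuple[Dict[str, str], float]]:
    """Turn an instant-vector response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample is {"metric": {labels}, "value": [timestamp, "value as string"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def query(promql: str) -> List[Tuple[Dict[str, str], float]]:
    """Run an instant query against PROMETHEUS_URL (needs network access)."""
    url = build_query_url(PROMETHEUS_URL, promql)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_instant_vector(json.load(resp))


if __name__ == "__main__":
    for labels, value in query("up"):
        state = "up" if value == 1.0 else "down"
        print(labels.get("job"), labels.get("instance"), state)
```

A value of 1 for a target means the last scrape succeeded; 0 means Prometheus could not reach it.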
