Debugging with a Service Mesh
Service Mesh Status Checks
Service Proxy Status Checks
Service Route Metrics
Service Route Configuration for Issue Mitigation
Request Logging
Service Proxy Logging
Injecting a Debug Container
Using the Telepresence Tool
Here we discuss how you can use a service mesh to debug and mitigate some types of app failures. We’ll look at several of the capabilities that service meshes may provide; each service mesh technology supports its own set of such capabilities.
We use examples from Linkerd to illustrate these capabilities, but the fundamental concepts discussed here apply to any service mesh.
Service Mesh Status Checks
In many situations, it’s helpful to first check the status of the service mesh components. If the mesh itself is failing--for example, if its control plane is not working--then the app failures you are seeing may actually be caused by that larger problem rather than by an issue with the app itself.
Below is an example of possible output from running the linkerd check command.
1. Can the service mesh communicate with Kubernetes? (kubernetes-api checks)
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
2. Is the Kubernetes version compatible with the service mesh version? (kubernetes-version checks)
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
3. Is the service mesh installed and running? (linkerd-existence checks)
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API
4. Is the service mesh’s control plane properly configured? (linkerd-config checks)
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
5. Are the service mesh’s credentials valid and up to date? (linkerd-identity checks)
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor
6. Is the API for the control plane running and ready? (linkerd-api checks)
linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running
7. Is the service mesh installation up to date? (linkerd-version checks)
linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date
8. Is the service mesh control plane up to date? (control-plane-version checks)
control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match
If any of the status checks fail, you’ll see output similar to the example below. It indicates which check failed, and it often also provides additional information about the nature of the failure to help you troubleshoot it.
linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
× [prometheus] control plane can talk to Prometheus
Error calling Prometheus from the control plane:
server_error: server error: 503
see https://linkerd.io/checks/#l5d-api-control-api for hints
If all of the status checks pass and no issues are detected, the last line of the output will look like this:
Status check results are √
Service Proxy Status Checks
Sometimes you may want to check the status of other aspects of the service mesh, in addition to or instead of the ones you’ve just reviewed. For instance, you may just want to check the status of the service proxies that your app is supposed to be using. In Linkerd, you can do that by adding the --proxy flag to the linkerd check command.
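For example, you might scope the proxy checks to the namespace where your app runs. Here is a minimal sketch, assuming the app is deployed in a namespace named booksapp (the namespace name is an assumption):

linkerd check --proxy --namespace booksapp

Scoping the check to a namespace limits the data plane checks to the proxies running alongside that app’s pods.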
Note that Linkerd refers to “service proxies” as “data plane proxies.”
Below is an excerpt of possible output from running linkerd check --proxy. Its output includes the same checks as linkerd check, so the excerpt shows only the additional checks performed when the --proxy flag is added.
9. Are the credentials for each of the data plane proxies valid and up to date? (linkerd-identity-data-plane checks)
linkerd-identity-data-plane
---------------------------
√ data plane proxies certificate match CA
10. Are the data plane proxies running and working properly? (linkerd-data-plane checks)
linkerd-data-plane
------------------
√ data plane namespace exists
√ data plane proxies are ready
√ data plane proxy metrics are present in Prometheus
√ data plane is up-to-date
√ data plane and cli versions match
Service Route Metrics
If your service mesh status checks don’t report any problems, a common next step in troubleshooting is to look at the metrics for the app’s service routes in the mesh. In other words, you want to see measurements of how each of the routes the app uses within the mesh is performing. These measurements are often useful for purposes other than troubleshooting, such as determining how the performance of an app could be improved.
Let’s suppose that you’re troubleshooting an app that is experiencing intermittent slowdowns and failures. You could look at the per-route metrics for the app to see if there’s a particular route that is the cause or is involved somehow.
The linkerd routes command provides these per-route metrics. Here is an example of its output:
ROUTE SERVICE SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99
GET / webapp 100.00% 0.6rps 15ms 20ms 20ms
By default, the metrics are for inbound requests only. This example shows the performance of requests made over the “GET /” route of the webapp service.
The three latency metrics indicate how much time it took to handle the requests based on percentiles. P50 refers to the 50th percentile--the median time, in this case 15 milliseconds. P95 refers to the 95th percentile, which indicates that the app is handling 95 percent of the requests as fast as or faster than 20 milliseconds. P99 provides the same type of measurement for the 99th percentile.
Viewing the metrics for each route can indicate where slowdowns or failures are occurring within the mesh, and also where things are functioning normally. It can narrow down what the problem might be, or in some cases point you right to the culprit.
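The linkerd routes command can also show the outbound side. Here is a rough sketch, assuming a webapp deployment in a booksapp namespace that calls a books service (these names are assumptions):

linkerd routes deploy/webapp --namespace booksapp
linkerd routes deploy/webapp --to svc/books --namespace booksapp

The first form shows the inbound routes for the webapp deployment; the second reports the routes for requests that webapp sends to the books service. Per-route reporting relies on service profiles, which are described next.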
Service Route Configuration for Issue Mitigation
You may want to change the configuration of particular service routes to mitigate problems that occur. For example, you could change the timeouts and automatic retries for a particular route so that requests to a problematic pod fail over to another pod more quickly. That could reduce delays for users while you continue to troubleshoot the problem or while a developer changes code to address the underlying issue.
In Linkerd, the mechanism for configuring a route is called a service profile. As mentioned earlier, service profiles are also used to specify which routes the linkerd routes command provides metrics for.
For more information on creating and using service profiles, see https://linkerd.io/2/features/service-profiles/.
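Here is a rough sketch of what such a mitigation could look like as a service profile (the webapp service name, the booksapp namespace, and the specific timeout and retry settings are assumptions, not recommendations):

cat <<EOF | kubectl apply -f -
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  # Service profiles are named after the service's fully qualified DNS name
  name: webapp.booksapp.svc.cluster.local
  namespace: booksapp
spec:
  routes:
  - name: GET /
    condition:
      method: GET
      pathRegex: /
    timeout: 300ms      # give up on a slow pod sooner
    isRetryable: true   # allow Linkerd to retry this route automatically
  retryBudget:
    retryRatio: 0.2            # retries may add at most 20 percent extra load
    minRetriesPerSecond: 10
    ttl: 10s
EOF

Because retries are safe only for requests that can be repeated, mark a route isRetryable only if it is idempotent, and keep the retry budget conservative so that retries don’t amplify an outage.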
Request Logging
If you need more detail about requests and responses than you can get from the service route metrics, you may want to do logging of the individual requests and responses.
Caution: logging requests can generate a rapidly growing amount of log data. In many cases you will only need to see a few logged requests, and not massive volumes of them.
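One way to keep the volume down is to tap only the traffic you’re interested in. A rough sketch, assuming a webapp deployment and an authors deployment in a booksapp namespace (these names are assumptions):

linkerd tap deploy/webapp --namespace booksapp
linkerd tap deploy/webapp --namespace booksapp --to deploy/authors --path /authors

The second form restricts the output to requests that webapp sends to the authors deployment with paths beginning with /authors.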
Here is an example of a few logged requests. This log was generated by running the linkerd tap command.
The first entry shows the request that was made, and the second shows the status code that was returned (in this case a 503, Service Unavailable). The second and third entries both contain metrics for how the request was handled. This additional information, beyond what you could see in route-level metrics, may help you narrow your search for the problem.
req id=9:49 proxy=out src=10.244.0.53:37820 dst=10.244.0.50:7001 tls=true :method=HEAD :authority=authors:7001 :path=/authors/3252.json
rsp id=9:49 proxy=out src=10.244.0.53:37820 dst=10.244.0.50:7001 tls=true :status=503 latency=2197µs
end id=9:49 proxy=out src=10.244.0.53:37820 dst=10.244.0.50:7001 tls=true duration=16µs response-length=0B
For more information on the linkerd tap command, see the Linkerd documentation.
Service Proxy Logging
Sometimes you want to better understand what is happening within a particular service proxy. You may be able to do that by increasing the extent of the logging that the service proxy is performing, such as recording more events or recording more details about each event.
Be very careful before altering service proxy logging because it can negatively impact the proxy’s performance, and the volume of the logs themselves can also be overwhelming.
Linkerd allows its service proxy log level to be changed in various ways. For more information, see https://linkerd.io/2/tasks/modifying-proxy-log-level/ and https://linkerd.io/2/reference/proxy-log-level/ .
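For example, one approach is to set the config.linkerd.io/proxy-log-level annotation on the workload’s pod template, which restarts its proxies with the new log level. A rough sketch, assuming a webapp deployment in a booksapp namespace (the names and the chosen level are assumptions):

kubectl --namespace booksapp patch deploy webapp --type merge \
  --patch '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-log-level":"warn,linkerd2_proxy=debug"}}}}}'

Remember to set the level back to its default once you’ve finished troubleshooting.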
Here is an example of what Linkerd’s service proxy logs may look like.
[ 326.996211471s] WARN inbound:accept{peer.addr=10.244.0.111:55288}:source{target.addr=10.244.0.131:7002}:http1{name=books.booksapp.svc.cluster.local:7002 port=7002 keep_alive=true wants_h1_upgrade=false was_absolute_form=false}:profile{addr=books.booksapp.svc.cluster.local:7002}:daemon:poll_profile: linkerd2_service_profiles::client: Could not fetch profile error=grpc-status: Unavailable, grpc-message: "proxy max-concurrency exhausted"
Injecting a Debug Container
If you need to take an even closer look at what’s happening inside a pod, you may be able to have your service mesh inject a debug container into that pod. A debug container is designed to monitor the activity within the pod and to collect information on that activity, such as capturing network packets.
In Linkerd, you can inject a debug container by adding the --enable-debug-sidecar flag to the linkerd inject command.
For more information on using Linkerd’s debug container (called the “debug sidecar”), see https://linkerd.io/2/tasks/using-the-debug-container/.
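A rough sketch of how that might look, assuming a webapp deployment in a booksapp namespace (the resource names are assumptions; the debug container name and its packet-capture tooling are described in the Linkerd documentation linked above):

# Re-deploy the workload with the debug sidecar enabled
kubectl --namespace booksapp get deploy webapp -o yaml \
  | linkerd inject --enable-debug-sidecar - \
  | kubectl apply -f -

# Open the debug container in one of the workload's pods to capture packets
kubectl --namespace booksapp exec -it <webapp-pod-name> -c linkerd-debug -- tshark -i any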
Using the Telepresence Tool
Telepresence is a debugging tool that is hosted by the Cloud Native Computing Foundation (CNCF). It can be used instead of or in addition to injecting a debug container so you can examine what’s happening inside a pod.
Using Telepresence, you can run a single process (a service or debug tool) locally, and a two-way network proxy enables that local service to effectively operate as part of the remote Kubernetes cluster. This architecture means that Telepresence is usually not run in production clusters; it’s intended for use in testing or staging.
Because you are running a service locally, you can use whatever debugging or testing tools you’d like to monitor and probe the executing service. You can also edit your service using whatever tool you choose.
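A rough sketch of the workflow with Telepresence 1.x, assuming the service you want to debug is the webapp deployment (the deployment name is an assumption, and newer Telepresence releases use different commands):

# Temporarily swap the in-cluster webapp deployment for a local shell
telepresence --swap-deployment webapp --run-shell

# In that shell, start your service or debugging tools locally; traffic destined
# for webapp in the cluster is proxied to the local process, and the local
# process can reach the cluster's other services as if it were running in it.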