First Steps in Troubleshooting
Check Status of Cluster:
- Use
kubectl get nodes
to make sure that all nodes are in the ‘Ready’ state.
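On larger clusters it can help to print only the nodes that need attention. The following is a minimal sketch, assuming the default column layout of kubectl get nodes (name first, status second); not_ready_nodes is a hypothetical helper name, not part of kubectl.

```shell
# Print the names of nodes whose STATUS is anything other than plain "Ready"
# (for example NotReady or Ready,SchedulingDisabled).
# Expects `kubectl get nodes --no-headers` output on stdin.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Usage against a live cluster:
#   kubectl get nodes --no-headers | not_ready_nodes
```

An empty result means every node reports Ready.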
Review what happened:
kubectl get events -n <namespace>
shows what has been going on in that namespace recently.
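Event lists can be long, so filtering for warnings and sorting by time often makes the relevant entries easier to spot. A small sketch, assuming a hypothetical wrapper named warning_events; the --field-selector and --sort-by flags are standard kubectl options.

```shell
# Show only Warning events in one namespace, sorted so the newest come last.
# $1: namespace to inspect.
warning_events() {
  ns="${1:?usage: warning_events <namespace>}"
  kubectl get events -n "$ns" \
    --field-selector type=Warning \
    --sort-by=.lastTimestamp
}

# Usage:
#   warning_events default
```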
Access Logs:
- Use
kubectl logs <pod-name>
to look at recent logs and learn about possible problems.
Troubleshooting at the node level
Node Conditions:
- Use
kubectl describe node <node-name>
to inspect the node's conditions. Check whether DiskPressure, MemoryPressure, or PIDPressure is reported as True.
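The describe output is verbose, so the conditions can also be pulled out directly with a JSONPath query. This is a sketch under the assumption that the node's status lists its conditions in the usual way; node_conditions and active_pressure are hypothetical helper names.

```shell
# Print each condition of a node as TYPE=STATUS, one per line.
# $1: node name.
node_conditions() {
  node="${1:?usage: node_conditions <node-name>}"
  kubectl get node "$node" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}

# Print only the pressure conditions that are currently True.
# $1: node name.
active_pressure() {
  node_conditions "$1" | awk -F= '$1 ~ /Pressure$/ && $2 == "True" { print $1 }'
}

# Usage:
#   active_pressure <node-name>
```

No output from active_pressure means none of the pressure conditions is currently True.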
Resource Usage:
- Use
top
or
free
right on the node to keep track of how resources are being used.
Kubelet Status:
- Make sure that the node’s kubelet service is working. Use
journalctl -u kubelet
to look at the logs.
Troubleshooting at the pod level
Pod State:
- Type
kubectl describe pod <pod-name>
to find out more about the state and events of a pod.
Container Inspections:
- Use
kubectl logs <pod-name> -c <container-name>
to look at the logs for a particular container.
Pod Restart Problems:
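- Check the logs for crash loops (a pod status of CrashLoopBackOff) and use
kubectl describe pod <pod-name>
to see the restart count and the reason for the last termination.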
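When a container keeps restarting, the current logs may be empty because the new instance has only just started. The sketch below reads the logs of the previous instance instead; previous_logs is a hypothetical helper name, while --previous and --tail are standard kubectl logs flags.

```shell
# Print the tail of the logs from the previous (terminated) container instance.
# $1: pod name; $2: number of lines to show (defaults to 50).
previous_logs() {
  pod="${1:?usage: previous_logs <pod-name> [lines]}"
  kubectl logs "$pod" --previous --tail="${2:-50}"
}

# Usage:
#   previous_logs <pod-name> 100
```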
Network Troubleshooting
Service Discovery:
- Use
kubectl get endpoints
to make sure services are properly pointing to pods.
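A service whose selector matches no ready pod shows <none> in the ENDPOINTS column. The following sketch lists such services, assuming the default column layout of kubectl get endpoints; services_without_endpoints is a hypothetical helper name.

```shell
# Print the names of services that currently have no endpoints.
# Expects `kubectl get endpoints --no-headers` output on stdin.
services_without_endpoints() {
  awk '$2 == "<none>" { print $1 }'
}

# Usage:
#   kubectl get endpoints --no-headers | services_without_endpoints
```

A service listed here usually has a selector that does not match the pod labels, or its pods are not passing their readiness checks.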
Pod-to-Pod Communication:
- Use tools like
ping
or
curl
from inside a pod to test connectivity to other pods.
Network Policies:
- Use
kubectl get networkpolicy
to look over the policies and make sure that the desired network traffic paths are allowed.
DNS Problems:
- Make sure the CoreDNS or kube-dns pods are running and that service names resolve correctly.
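Both checks can be scripted. This is a sketch under stated assumptions: that the DNS pods carry the label k8s-app=kube-dns (common for CoreDNS and kube-dns deployments, but worth verifying on your cluster), that the busybox image can be pulled, and that dns-test is a free pod name. check_dns itself is a hypothetical helper.

```shell
# Check that the cluster DNS pods are running, then resolve a service name
# from a throwaway pod that is deleted once the lookup finishes.
# $1: name to resolve (defaults to the API server's service).
check_dns() {
  name="${1:-kubernetes.default}"
  # ASSUMPTION: the DNS pods are labelled k8s-app=kube-dns.
  kubectl get pods -n kube-system -l k8s-app=kube-dns
  # ASSUMPTION: dns-test is an unused pod name and busybox is pullable.
  kubectl run dns-test --rm -i --restart=Never --image=busybox -- nslookup "$name"
}

# Usage:
#   check_dns my-service.my-namespace
```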
Storage Troubleshooting
PV & PVC Status:
- Use
kubectl get pv,pvc
to check the binding status of persistent volumes and claims.
Access Modes:
- Make sure the access mode requested by the PVC is one that the provisioned PV supports.
Storage Class Problems:
- Make sure the right storage class is specified and its provisioner is working.
Mount Problems:
- Use
kubectl describe pod
to see any mount errors reported in the pod's events.
Advanced Troubleshooting in Kubernetes
Kubernetes has many interacting components, and a fault in any of them can surface in different ways. Diagnosing these problems efficiently is an essential skill. The scenarios below deliberately create common failures and then walk through how to troubleshoot them.
Preliminary Troubleshooting Steps
1. Examining Cluster Health
- Description: In this scenario, we'll intentionally taint a node so that pods without a matching toleration cannot be scheduled on it, and then inspect its state.
- Scenario Creation:
kubectl taint nodes test-node key=value:NoSchedule
- Troubleshooting:
kubectl get nodes
kubectl describe node test-node
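The taint appears in the Taints field of the describe output. As a sketch, it can also be read directly and removed again once the exercise is finished; node_taints and remove_scenario_taint are hypothetical helper names, and the key=value:NoSchedule taint is the one created in this scenario.

```shell
# Print the taints currently set on a node as key=value:effect, one per line.
# $1: node name.
node_taints() {
  node="${1:?usage: node_taints <node-name>}"
  kubectl get node "$node" \
    -o jsonpath='{range .spec.taints[*]}{.key}={.value}:{.effect}{"\n"}{end}'
}

# Remove the scenario's taint; the trailing "-" tells kubectl to delete it.
# $1: node name.
remove_scenario_taint() {
  node="${1:?usage: remove_scenario_taint <node-name>}"
  kubectl taint nodes "$node" key=value:NoSchedule-
}

# Usage:
#   node_taints test-node
#   remove_scenario_taint test-node
```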
2. Log Analysis
- Description: Deploy a simulated faulty application that logs an error and exits immediately.
- Scenario Creation:
kubectl run faulty-app --image=busybox --command -- /bin/sh -c 'echo "Error: Something went wrong!" && exit 1'
- Troubleshooting:
kubectl logs faulty-app
Node-level Troubleshooting
1. Node Resource Exhaustion
- Description: Schedule a pod that requests a large amount of memory, which can exhaust the memory the node has available for scheduling.
- Scenario Creation:
kubectl run resource-hog --image=busybox --overrides='{"spec":{"containers":[{"name":"resource-hog","image":"busybox","command":["/bin/sh","-c","while true; do sleep 1; done"],"resources":{"requests":{"memory":"800Mi"}}}]}}'
- Troubleshooting:
kubectl describe node test-node
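The part of the describe output that matters here is the Allocated resources section, which totals the requests and limits of the pods on the node. A minimal sketch that prints only that section, assuming it is followed by the Events section as in typical kubectl output; allocated_resources is a hypothetical helper name.

```shell
# Print the "Allocated resources" section of `kubectl describe node`,
# stopping before the "Events:" header.
# $1: node name.
allocated_resources() {
  node="${1:?usage: allocated_resources <node-name>}"
  kubectl describe node "$node" |
    awk '/^Events:/ { exit } /^Allocated resources:/ { show = 1 } show'
}

# Usage:
#   allocated_resources test-node
```

If the memory requests are close to 100% of what the node can allocate, new pods with memory requests will stay Pending until capacity is freed.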
Pod-level Troubleshooting
1. Crashing Pods
- Description: Investigate the reasons behind a crashing pod (this scenario was set up in the log analysis example).
- Troubleshooting:
kubectl describe pod faulty-app
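The two values usually wanted from the describe output are the restart count and the exit code of the last run. The sketch below extracts them with a JSONPath query, assuming the pod has a single container that has terminated at least once; pod_crash_summary is a hypothetical helper name.

```shell
# Print the restart count and the exit code of the last terminated run
# for the first container in a pod.
# $1: pod name.
pod_crash_summary() {
  pod="${1:?usage: pod_crash_summary <pod-name>}"
  kubectl get pod "$pod" \
    -o jsonpath='restarts={.status.containerStatuses[0].restartCount} lastExitCode={.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
}

# Usage:
#   pod_crash_summary faulty-app
```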
2. Pod Access
- Description: Deploy a simple pod and access its shell, ensuring that there are no access-related issues.
- Scenario Creation:
kubectl run simple-pod --image=busybox --command -- /bin/sh -c "sleep 3600"
- Troubleshooting:
kubectl exec -it simple-pod -- /bin/sh
Network Troubleshooting
1. Networking Issues
- Description: Create two pods in different namespaces, expose one of them through a Service so that it has a DNS name, and attempt to communicate between them.
- Scenario Creation:
kubectl create namespace ns1
kubectl create namespace ns2
kubectl run nginx1 --image=nginx --namespace=ns1
kubectl run nginx2 --image=nginx --namespace=ns2
kubectl expose pod nginx2 --port=80 --namespace=ns2
- Troubleshooting:
kubectl exec -it -n ns1 nginx1 -- curl nginx2.ns2.svc.cluster.local
2. Network Policies
- Description: Establish a network policy that blocks all incoming traffic to pods in the current namespace and diagnose its impact.
- Scenario Creation:
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
EOF
- Troubleshooting:
kubectl get networkpolicies
kubectl describe networkpolicy block-all
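Listing a policy shows what it declares, not whether traffic is actually dropped. A NetworkPolicy only affects pods in the namespace where it is created, and only when the cluster's network plugin enforces policies. The sketch below tests reachability from inside a pod, assuming the pod's image ships curl; check_http is a hypothetical helper name.

```shell
# Try an HTTP request from inside a pod and report whether it succeeded.
# $1: namespace, $2: pod, $3: URL. curl gives up after 5 seconds, so traffic
# dropped by a NetworkPolicy is reported instead of hanging.
check_http() {
  ns="${1:?usage: check_http <namespace> <pod> <url>}"
  pod="${2:?usage: check_http <namespace> <pod> <url>}"
  url="${3:?usage: check_http <namespace> <pod> <url>}"
  if kubectl exec -n "$ns" "$pod" -- curl -s -o /dev/null -m 5 "$url"; then
    echo "reachable: $url"
  else
    echo "blocked or unreachable: $url"
  fi
}

# Usage:
#   check_http ns1 nginx1 http://nginx2.ns2.svc.cluster.local
```

Running the check before and after applying the policy in the target pod's namespace shows whether the policy is the cause of a failure.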
Storage Troubleshooting
1. PV and PVC Binding
- Description: Simulate a mismatch between the configurations of a PersistentVolume and a PersistentVolumeClaim.
- Scenario Creation:
- Write the PersistentVolume manifest to example-pv.yaml:
cat > example-pv.yaml <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-pv
spec:
  storageClassName: manual
  capacity:
    storage: 1Gi
  accessModes: ["ReadWriteOnce"]
  hostPath:
    path: "/tmp"
EOF
Apply the configuration:
kubectl apply -f example-pv.yaml
- Write the PersistentVolumeClaim manifest to example-pvc.yaml. It requests ReadWriteMany, which the PersistentVolume does not offer:
cat > example-pvc.yaml <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc
spec:
  storageClassName: manual
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 1Gi
EOF
Apply the configuration:
kubectl apply -f example-pvc.yaml
- Troubleshooting:
kubectl describe pvc example-pvc
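The events at the bottom of the describe output explain why a claim stays Pending. As a sketch, the same events can be listed on their own with a field selector; pvc_events is a hypothetical helper name, and the query runs against the current namespace.

```shell
# List the events recorded for one PersistentVolumeClaim, newest last.
# $1: claim name.
pvc_events() {
  pvc="${1:?usage: pvc_events <pvc-name>}"
  kubectl get events \
    --field-selector "involvedObject.kind=PersistentVolumeClaim,involvedObject.name=$pvc" \
    --sort-by=.lastTimestamp
}

# Usage:
#   pvc_events example-pvc
```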