Objective:
Simulate three common pod failure scenarios and write a Bash script to detect and restart those pods automatically.
Estimated Time: 15–20 minutes
Prerequisites:
- A running Kubernetes cluster (e.g. Minikube, kind, GKE, etc.)
kubectlaccess- Basic knowledge of YAML and Bash
Save the following to a file called wile-e-pods.yaml:
apiVersion: v1
kind: Pod
metadata:
name: wile-e-coyote
spec:
containers:
- name: fail-fast
image: busybox
command: ["sh", "-c", "echo 'Failing...'; sleep 1; exit 1"]
restartPolicy: Always
---
apiVersion: v1
kind: Pod
metadata:
name: wile-e-coyote2
spec:
containers:
- name: memory-hog
image: python:3.9
command: ["python3", "-c", "a = ' ' * (10**9)"]
resources:
limits:
memory: "50Mi"
restartPolicy: Always
---
apiVersion: v1
kind: Pod
metadata:
name: wile-e-coyote3
spec:
containers:
- name: probe-fail
image: busybox
command: ["sh", "-c", "sleep 3600"]
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 3
periodSeconds: 5
ports:
- containerPort: 8080
restartPolicy: Alwayskubectl apply -f wile-e-pods.yamlkubectl get podsYou should see all three enter CrashLoopBackOff.
#!/bin/bash
# Namespace to monitor
NAMESPACE="default"
NOW=$(date +"%Y-%m-%d %H:%M:%S")
echo "[$NOW] Starting CrashLoopBackOff scan in namespace: $NAMESPACE"
echo "=============================================================="
# Find all pods in CrashLoopBackOff within the specified namespace
crashing_pods=$(kubectl get pods -n $NAMESPACE --field-selector=status.phase=Running \
| awk '$3=="CrashLoopBackOff" {print $1}')
# Exit early if no crashing pods found
if [[ -z "$crashing_pods" ]]; then
echo "[$(date +"%Y-%m-%d %H:%M:%S")] No crashing pods found."
exit 0
fi
# Loop through each crashing pod
for pod in $crashing_pods; do
echo
echo "--------------------------------------------------------------"
echo "[$(date +"%Y-%m-%d %H:%M:%S")] 📛 Pod: $pod"
echo "--------------------------------------------------------------"
echo "▶ Log Snippet:"
kubectl logs -n $NAMESPACE $pod --tail=5 2>/dev/null || echo "(No logs found)"
echo
echo "▶ Restart Count:"
kubectl get pod -n $NAMESPACE $pod -o jsonpath='{.status.containerStatuses[0].restartCount}' || echo "(N/A)"
echo -e "\n"
echo "▶ Last Crash Reason:"
kubectl get pod -n $NAMESPACE $pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}' || echo "(N/A)"
echo -e "\n"
echo "▶ Liveness Probe Configuration:"
PROBE=$(kubectl get pod -n $NAMESPACE $pod -o jsonpath='{.spec.containers[0].livenessProbe}' 2>/dev/null)
if [[ -z "$PROBE" ]]; then
echo "(No liveness probe configured)"
else
echo "$PROBE"
fi
echo
echo "▶ Recent Events:"
kubectl describe pod -n $NAMESPACE $pod | awk '/Events:/,/^$/'
echo
echo "▶ Action: Deleting pod to trigger restart..."
kubectl delete pod -n $NAMESPACE $pod
echo "--------------------------------------------------------------"
done
echo "[$(date +"%Y-%m-%d %H:%M:%S")] ✅ Scan complete."chmod +x acme.sh./acme.shYou’ll see output for each pod including:
- Recent logs
- Restart count
- Crash reason
- Liveness probe (if applicable)
- Events
- A deletion trigger
To run this every 5 minutes automatically:
crontab -eAdd:
*/5 * * * * /full/path/to/acme.sh >> /tmp/acme.log 2>&1By the end of this lab, you’ve:
- Simulated 3 common pod failure scenarios
- Written a script to detect and remediate them
- Practiced using logs, JSONPath,
kubectl describe, and event parsing - Built a script that can be reused across namespaces or production environments
- Modify the script to only restart pods that have restarted more than 3 times
- Add Slack or Discord alerts using a webhook
- Turn this into a Kubernetes Job or CronJob inside the cluster itself