Troubleshooting Guide

Common issues and solutions for Charon infrastructure.

Quick Diagnostics

# Check all pods
kubectl get pods -n dev

# Check services
kubectl get svc -n dev

# Check ingresses
kubectl get ingress -n dev

# Check certificates
kubectl get certificate -n dev

# Check PVCs
kubectl get pvc -n dev

Pod Issues

Pods Stuck in Pending

Symptoms: Pods show Pending status indefinitely

Diagnosis:

kubectl describe pod <pod-name> -n dev
# Look for events at bottom

Common Causes:

  1. No storage class

    kubectl get storageclass
    # Should show at least one storage class

    Fix: Configure storage provisioner for your cluster

  2. Insufficient resources

    kubectl describe nodes
    # Check Allocated resources section

    Fix: Add nodes or reduce resource requests

  3. PVC binding failed

    kubectl get pvc -n dev
    # Look for Pending PVCs

    Fix: Check storage provisioner logs
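As a quick triage step, the checks above can be folded into a one-line filter that surfaces only Pending pods. The sketch below runs against a captured sample of `kubectl get pods -n dev` output (the pod names are illustrative), so it can be tried without cluster access; swap the `sample_pods` function for the live command.

```shell
#!/bin/sh
# Illustrative output from: kubectl get pods -n dev
# (captured as sample data so the filter can be run without a cluster)
sample_pods() {
cat <<'EOF'
NAME          READY   STATUS    RESTARTS   AGE
headscale-0   1/1     Running   0          2d
freeipa-0     0/1     Pending   0          10m
web-5d4f8     0/1     Pending   0          8m
loki-0        1/1     Running   3          2d
EOF
}

# Print the name of every Pending pod (skip the header row)
sample_pods | awk 'NR > 1 && $3 == "Pending" { print $1 }'
```

Feed each printed name into `kubectl describe pod` to find the blocking event.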

Pods in CrashLoopBackOff

Symptoms: Pod restarts repeatedly

Diagnosis:

kubectl logs <pod-name> -n dev -c <container-name>
kubectl logs <pod-name> -n dev -c <container-name> --previous

Common Causes:

  1. Application configuration error

    • Check environment variables
    • Verify secrets exist
    • Check config maps
  2. Dependency not ready

    • Database not accessible
    • Required service not running
  3. Health check failing

    kubectl describe pod <pod-name> -n dev
    # Look for failed liveness/readiness probes
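A pod that restarts repeatedly shows a climbing RESTARTS count even before it settles into CrashLoopBackOff. The sketch below flags high-restart pods from a captured sample of `kubectl get pods -n dev` output (names and counts are illustrative, and the threshold of 5 is an arbitrary example); replace `sample_pods` with the live command.

```shell
#!/bin/sh
# Illustrative output from: kubectl get pods -n dev (sample data for offline testing)
sample_pods() {
cat <<'EOF'
NAME          READY   STATUS             RESTARTS   AGE
headscale-0   1/1     Running            0          2d
api-7c9b4     0/1     CrashLoopBackOff   12         30m
worker-x2k    1/1     Running            1          2d
EOF
}

# Flag pods restarting repeatedly (threshold of 5 is an arbitrary cutoff)
sample_pods | awk 'NR > 1 && $4+0 >= 5 { printf "%s restarted %d times (%s)\n", $1, $4, $3 }'
```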

ImagePullBackOff

Symptoms: Cannot pull container image

Diagnosis:

kubectl describe pod <pod-name> -n dev
# Check Events for image pull errors

Fixes:

  • Verify image name and tag are correct
  • Check registry credentials if using private registry
  • Verify network connectivity to registry

DNS Issues

Services Not Resolving

Symptoms: Cannot resolve service.example.com

Diagnosis:

# From VPN-connected device
dig service.example.com

# Check Cloudflare records
curl -X GET \
  "https://api.cloudflare.com/client/v4/zones/$CLOUDFLARE_ZONE_ID/dns_records" \
  -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" | jq

Common Causes:

  1. DNS not updated yet

    • Wait 1-2 minutes for the DNS update script to run
    • Check terraform output
  2. Wrong zone ID

    • Verify cloudflare_zone_id in terraform.tfvars
  3. API token permissions

    • Token needs Zone:DNS:Edit permission

Manual Fix:

# Re-run DNS update
cd scripts
python dns/update_service_dns.py \
  --zone-id $CLOUDFLARE_ZONE_ID \
  --namespace dev \
  --services '[{"name":"service","hostname":"service.example.com","record_name":"service","enabled":true}]'
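After re-running the update, you can confirm the record exists by checking the Cloudflare API response for the expected IP. The sketch below uses an abridged, invented response body so it can be tried without an API token (203.0.113.10 is a placeholder for your LoadBalancer IP); in practice, pipe the `curl` output from the Quick Diagnostics section into the same check.

```shell
#!/bin/sh
# Abridged, illustrative response from the Cloudflare list-DNS-records API,
# captured as sample data so the check can run without an API token.
sample_response() {
cat <<'EOF'
{"result":[{"name":"service.example.com","type":"A","content":"203.0.113.10"},{"name":"vpn.example.com","type":"A","content":"203.0.113.10"}]}
EOF
}

# Placeholder for your LoadBalancer IP
expected_ip="203.0.113.10"
if sample_response | grep -q "\"content\":\"$expected_ip\""; then
  echo "record found for $expected_ip"
else
  echo "no record with content $expected_ip"
fi
```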

MagicDNS Not Working

Symptoms: VPN connected but can't resolve service hostnames

Diagnosis:

tailscale status
# Should show "MagicDNS: enabled"

Fix:

# Restart Tailscale
sudo tailscale down
sudo tailscale up --login-server https://vpn.example.com

Certificate Issues

Certificates Not Provisioning

Symptoms: Certificate shows Ready: False

Diagnosis:

kubectl describe certificate <cert-name> -n dev
kubectl get challenges -A
kubectl describe clusterissuer letsencrypt-prod

Common Causes:

  1. DNS-01 challenge failing

    • Cloudflare API token lacks permissions
    • Zone ID incorrect
  2. Rate limiting

    • Let's Encrypt rate limits hit
    • Use staging issuer for testing
  3. cert-manager not ready

    kubectl get pods -n cert-manager
    kubectl logs -n cert-manager -l app=cert-manager

Fix:

# Delete and recreate certificate
kubectl delete certificate <cert-name> -n dev
# Terraform will recreate it
terraform apply
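When several certificates exist, the READY column tells you which ones to describe. The sketch below filters an illustrative capture of `kubectl get certificate -n dev` output (the names are made up), so it can be tried without a cluster; replace `sample_certs` with the live command.

```shell
#!/bin/sh
# Illustrative output from: kubectl get certificate -n dev (sample data)
sample_certs() {
cat <<'EOF'
NAME          READY   SECRET        AGE
service-tls   True    service-tls   10d
vpn-tls       False   vpn-tls       5m
EOF
}

# Print certificates that are not Ready, so kubectl describe can target them
sample_certs | awk 'NR > 1 && $2 == "False" { print $1 }'
```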

VPN/Headscale Issues

Cannot Connect to VPN

Symptoms: tailscale up fails or hangs

Diagnosis:

# Check Headscale is running
kubectl get pods -n dev headscale-0

# Check Headscale logs
kubectl logs -n dev headscale-0

# Test Headscale endpoint
curl -I https://vpn.example.com/health

Common Causes:

  1. External ingress not accessible

    kubectl get svc -n ingress-nginx-external
    # Should have EXTERNAL-IP
  2. DNS not pointing to LoadBalancer

    dig vpn.example.com
    # Should return LoadBalancer IP
  3. Firewall blocking

    • Allow HTTPS (443) to LoadBalancer IP
    • Allow UDP 41641 for WireGuard

Node Shows Offline

Symptoms: Device enrolled but shows offline

Diagnosis:

kubectl exec -n dev headscale-0 -- headscale nodes list

Fixes:

  1. Check device Tailscale daemon is running
  2. Check firewall allows UDP 41641
  3. Restart Tailscale on device

LDAP/FreeIPA Issues

LDAP Authentication Failing

Symptoms: Cannot login with LDAP credentials

Diagnosis:

# Check FreeIPA is running
kubectl get pods -n dev freeipa-0

# Test LDAP bind
kubectl exec -n dev freeipa-0 -c freeipa -- \
  ldapsearch -x -H ldaps://localhost:636 \
  -D "uid=admin,cn=users,cn=accounts,dc=dev,dc=svc,dc=cluster,dc=local" \
  -w "$FREEIPA_ADMIN_PASSWORD" \
  -b "cn=users,cn=accounts,dc=dev,dc=svc,dc=cluster,dc=local" \
  "(uid=*)"

Common Causes:

  1. Password expired

    • FreeIPA passwords expire by default
    • Reset password:
    kubectl exec -n dev freeipa-0 -c freeipa -- \
      ipa user-mod <username> --password-expiration='20261127142802Z'
  2. Wrong base DN

    • Check LDAP configuration uses correct domain
    • Should be dc=dev,dc=svc,dc=cluster,dc=local
  3. CA certificate not trusted

    • Check FreeIPA CA is mounted
    • Verify cert in /etc/ssl/certs/freeipa-ca.pem
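The base-DN mapping in cause 2 is mechanical: each DNS label becomes a `dc=` component. A small helper makes the conversion explicit, which is handy when wiring LDAP clients to a different domain (the function name here is just for illustration).

```shell
#!/bin/sh
# Convert a DNS domain into the LDAP base-DN form FreeIPA expects,
# e.g. dev.svc.cluster.local -> dc=dev,dc=svc,dc=cluster,dc=local
domain_to_basedn() {
  printf 'dc=%s\n' "$1" | sed 's/\./,dc=/g'
}

domain_to_basedn dev.svc.cluster.local
```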

Terraform Issues

Dependency Errors

Symptoms: Terraform fails with "resource doesn't exist"

Diagnosis: Look for [0] indexing in the error message

Fix: See Dependency Patterns Guide

State Lock

Symptoms: "Error acquiring the state lock"

Fix:

# If you're sure no other terraform is running:
terraform force-unlock <lock-id>

Performance Issues

Slow Pod Startup

Diagnosis:

kubectl describe pod <pod-name> -n dev
# Check Events for slow image pulls or volume mounts

Fixes:

  • Use faster storage class
  • Pre-pull images to nodes
  • Increase resource limits

High Resource Usage

Diagnosis:

kubectl top pods -n dev
kubectl top nodes

Fixes:

  • Increase node resources
  • Adjust pod resource limits
  • Scale horizontally if supported
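To decide which pods to tune first, it helps to sort `kubectl top` output by memory. The sketch below flags pods above a cutoff using an illustrative capture of `kubectl top pods -n dev` output (names, values, and the 1000Mi threshold are all example choices); replace `sample_top` with the live command.

```shell
#!/bin/sh
# Illustrative output from: kubectl top pods -n dev (sample data)
sample_top() {
cat <<'EOF'
NAME          CPU(cores)   MEMORY(bytes)
headscale-0   15m          120Mi
freeipa-0     250m         1800Mi
loki-0        80m          950Mi
EOF
}

# Flag pods using more memory than a given threshold in Mi (1000 is arbitrary)
sample_top | awk 'NR > 1 { mem = $3; sub(/Mi$/, "", mem); if (mem+0 > 1000) print $1, $3 }'
```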

Monitoring Issues

Grafana Dashboards Show "No Data"

Symptoms: All Grafana dashboards show "No data" after Thanos deployment

Diagnosis:

# Check if Thanos is enabled
kubectl get pods -n monitoring | grep thanos

# Check Grafana data source configuration
kubectl exec -n monitoring grafana-0 -c grafana -- \
  cat /etc/grafana/provisioning/datasources/datasources.yaml

# Test Prometheus endpoint from Grafana
kubectl exec -n monitoring grafana-0 -c grafana -- \
  curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=up"

Root Cause:

The dashboards are configured to query Thanos while thanos_enabled = false (or the reverse), so Grafana is pointed at a data source that is not serving the metrics.

Solution:

If Thanos is disabled (check terraform.tfvars):

  1. Update Grafana data source to point to Prometheus
  2. Update dashboard queries to remove thanos_ metric prefixes

If Thanos is enabled:

  1. Ensure Grafana data source points to http://thanos-query.monitoring.svc.cluster.local:9090
  2. Wait for Thanos to compact initial data (can take 2-4 hours)
  3. Check Thanos Query logs:
    kubectl logs -n monitoring deployment/thanos-query

See Monitoring Guide for detailed troubleshooting.
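The two configurations map to two data source URLs, both taken from this guide. A minimal helper makes the branching explicit; `thanos_enabled` here mirrors the terraform.tfvars flag, and the function name is just for illustration.

```shell
#!/bin/sh
# Pick the Grafana data source URL based on the thanos_enabled flag,
# mirroring the two cases described above.
datasource_url() {
  if [ "$1" = "true" ]; then
    echo "http://thanos-query.monitoring.svc.cluster.local:9090"
  else
    echo "http://prometheus-server.monitoring.svc.cluster.local"
  fi
}

datasource_url true
datasource_url false
```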

Prometheus High Memory Usage

Symptoms: Prometheus pod OOMKilled or running near memory limit

Diagnosis:

# Check current usage
kubectl top pods -n monitoring | grep prometheus-server

# Check retention and scrape config
kubectl get configmap -n monitoring prometheus-server -o yaml

Fixes:

  1. Reduce retention:

    # terraform/prometheus.tf
    server.retention = "7d"  # Default: 15d
  2. Increase memory:

    # terraform.tfvars
    prometheus_memory_limit = "8Gi"  # Default: 4Gi
  3. Reduce scrape frequency:

    # terraform/prometheus.tf
    server.global.scrape_interval = "60s"  # Default: 30s

Loki Not Showing Logs

Symptoms: Grafana Explore shows no logs from Loki

Diagnosis:

# Check Loki pod
kubectl get pods -n monitoring | grep loki

# Check Loki logs
kubectl logs -n monitoring deployment/loki

# Check Promtail daemonset
kubectl get daemonset -n monitoring promtail

# Check Promtail logs
kubectl logs -n monitoring daemonset/promtail --tail=50

Common Causes:

  1. Loki pod restarted (emptyDir storage is ephemeral)

    • Expected behavior with current config
    • Logs only persist until pod restart
  2. Promtail not shipping logs:

    # Confirm Loki is ready to receive pushes
    kubectl exec -n monitoring deployment/loki -- \
      curl -s localhost:3100/ready
  3. Grafana data source misconfigured:

    • URL should be: http://loki.monitoring.svc.cluster.local:3100

Thanos Compactor Not Running (if enabled)

Symptoms: Old metrics not being compacted, storage usage growing

Diagnosis:

# Check compactor pod
kubectl get pods -n monitoring | grep thanos-compactor

# Check compactor logs
kubectl logs -n monitoring statefulset/thanos-compactor

# Check PVC
kubectl get pvc -n monitoring

Fixes:

  1. Check object storage access:

    kubectl exec -n monitoring thanos-compactor-0 -- ls -la /data/thanos
  2. Verify retention settings:

    kubectl logs -n monitoring thanos-compactor-0 | grep retention
  3. Restart compactor if stuck:

    kubectl delete pod -n monitoring thanos-compactor-0

AlertManager Not Firing Alerts

Symptoms: Prometheus rules firing but no notifications

Diagnosis:

# Check AlertManager pod
kubectl get pods -n monitoring | grep alertmanager

# Check active alerts
kubectl port-forward -n monitoring svc/prometheus-alertmanager 9093:9093
# Open http://localhost:9093

# Check Prometheus alert rules
kubectl port-forward -n monitoring svc/prometheus-server 9090:80
# Open http://localhost:9090/alerts

Common Causes:

  1. No notification receivers configured

    • Update terraform/prometheus.tf with receiver config
    • Add Slack webhook, email SMTP, etc.
  2. Alert inhibition rules:

    • Check if higher-severity alert is inhibiting
  3. AlertManager config syntax error:

    kubectl logs -n monitoring deployment/prometheus-alertmanager | grep -i error

Getting Help

If you're still stuck:

  1. Check logs: Every issue leaves a trail

    kubectl logs -n <namespace> <pod> -c <container> --tail=100
  2. Describe resources: Events often explain issues

    kubectl describe <resource-type> <name> -n <namespace>
  3. Check GitHub Issues: Project Issues

  4. Review Documentation: Documentation Index

Navigation: πŸ“š Documentation Index | 🏠 Home