Troubleshooting Guide

Common issues and solutions for Charon infrastructure.

Quick Diagnostics

# Check all pods
kubectl get pods -n dev

# Check services
kubectl get svc -n dev

# Check ingresses
kubectl get ingress -n dev

# Check certificates
kubectl get certificate -n dev

# Check PVCs
kubectl get pvc -n dev

Pod Issues

Pods Stuck in Pending

Symptoms: Pods show Pending status indefinitely

Diagnosis:

kubectl describe pod <pod-name> -n dev
# Look for events at bottom

Common Causes:

  1. No storage class

    kubectl get storageclass
    # Should show at least one storage class

    Fix: Configure storage provisioner for your cluster

  2. Insufficient resources

    kubectl describe nodes
    # Check Allocated resources section

    Fix: Add nodes or reduce resource requests

  3. PVC binding failed

    kubectl get pvc -n dev
    # Look for Pending PVCs

    Fix: Check storage provisioner logs
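As a quick triage step, the checks above can be folded into a one-line filter that surfaces only Pending pods. The sketch below runs against a captured sample of `kubectl get pods -n dev` output (the pod names are illustrative), so it can be tried without cluster access; swap the `sample_pods` function for the live command.

```shell
#!/bin/sh
# Illustrative output from: kubectl get pods -n dev
# (captured as sample data so the filter can be run without a cluster)
sample_pods() {
cat <<'EOF'
NAME          READY   STATUS    RESTARTS   AGE
headscale-0   1/1     Running   0          2d
freeipa-0     0/1     Pending   0          10m
web-5d4f8     0/1     Pending   0          8m
loki-0        1/1     Running   3          2d
EOF
}

# Print the name of every Pending pod (skip the header row)
sample_pods | awk 'NR > 1 && $3 == "Pending" { print $1 }'
```

Feed each printed name into `kubectl describe pod` to find the blocking event.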

Pods in CrashLoopBackOff

Symptoms: Pod restarts repeatedly

Diagnosis:

kubectl logs <pod-name> -n dev -c <container-name>
kubectl logs <pod-name> -n dev -c <container-name> --previous

Common Causes:

  1. Application configuration error

    • Check environment variables
    • Verify secrets exist
    • Check config maps
  2. Dependency not ready

    • Database not accessible
    • Required service not running
  3. Health check failing

    kubectl describe pod <pod-name> -n dev
    # Look for failed liveness/readiness probes
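A pod that restarts repeatedly shows a climbing RESTARTS count even before it settles into CrashLoopBackOff. The sketch below flags high-restart pods from a captured sample of `kubectl get pods -n dev` output (names and counts are illustrative, and the threshold of 5 is an arbitrary example); replace `sample_pods` with the live command.

```shell
#!/bin/sh
# Illustrative output from: kubectl get pods -n dev (sample data for offline testing)
sample_pods() {
cat <<'EOF'
NAME          READY   STATUS             RESTARTS   AGE
headscale-0   1/1     Running            0          2d
api-7c9b4     0/1     CrashLoopBackOff   12         30m
worker-x2k    1/1     Running            1          2d
EOF
}

# Flag pods restarting repeatedly (threshold of 5 is an arbitrary cutoff)
sample_pods | awk 'NR > 1 && $4+0 >= 5 { printf "%s restarted %d times (%s)\n", $1, $4, $3 }'
```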

ImagePullBackOff

Symptoms: Cannot pull container image

Diagnosis:

kubectl describe pod <pod-name> -n dev
# Check Events for image pull errors

Fixes:

  • Verify image name and tag are correct
  • Check registry credentials if using private registry
  • Verify network connectivity to registry

DNS Issues

Services Not Resolving

Symptoms: Cannot resolve service.example.com

Diagnosis:

# From VPN-connected device
dig service.example.com

# Check Cloudflare records
curl -X GET \
  "https://api.cloudflare.com/client/v4/zones/$CLOUDFLARE_ZONE_ID/dns_records" \
  -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" | jq

Common Causes:

  1. DNS not updated yet

    • Wait 1-2 minutes for the DNS update script to run
    • Check terraform output
  2. Wrong zone ID

    • Verify cloudflare_zone_id in terraform.tfvars
  3. API token permissions

    • Token needs Zone:DNS:Edit permission

Manual Fix:

# Re-run DNS update
cd scripts
python dns/update_service_dns.py \
  --zone-id $CLOUDFLARE_ZONE_ID \
  --namespace dev \
  --services '[{"name":"service","hostname":"service.example.com","record_name":"service","enabled":true}]'
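After re-running the update, you can confirm the record exists by checking the Cloudflare API response for the expected IP. The sketch below uses an abridged, invented response body so it can be tried without an API token (203.0.113.10 is a placeholder for your LoadBalancer IP); in practice, pipe the `curl` output from the Quick Diagnostics section into the same check.

```shell
#!/bin/sh
# Abridged, illustrative response from the Cloudflare list-DNS-records API,
# captured as sample data so the check can run without an API token.
sample_response() {
cat <<'EOF'
{"result":[{"name":"service.example.com","type":"A","content":"203.0.113.10"},{"name":"vpn.example.com","type":"A","content":"203.0.113.10"}]}
EOF
}

# Placeholder for your LoadBalancer IP
expected_ip="203.0.113.10"
if sample_response | grep -q "\"content\":\"$expected_ip\""; then
  echo "record found for $expected_ip"
else
  echo "no record with content $expected_ip"
fi
```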

MagicDNS Not Working

Symptoms: VPN connected but can't resolve service hostnames

Diagnosis:

tailscale status
# Should show "MagicDNS: enabled"

Fix:

# Restart Tailscale
sudo tailscale down
sudo tailscale up --login-server https://vpn.example.com

Certificate Issues

Certificates Not Provisioning

Symptoms: Certificate shows Ready: False

Diagnosis:

kubectl describe certificate <cert-name> -n dev
kubectl get challenges -A
kubectl describe clusterissuer letsencrypt-prod

Common Causes:

  1. DNS-01 challenge failing

    • Cloudflare API token lacks permissions
    • Zone ID incorrect
  2. Rate limiting

    • Let's Encrypt rate limits hit
    • Use staging issuer for testing
  3. cert-manager not ready

    kubectl get pods -n cert-manager
    kubectl logs -n cert-manager -l app=cert-manager

Fix:

# Delete and recreate certificate
kubectl delete certificate <cert-name> -n dev
# Terraform will recreate it
terraform apply
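When several certificates exist, the READY column tells you which ones to describe. The sketch below filters an illustrative capture of `kubectl get certificate -n dev` output (the names are made up), so it can be tried without a cluster; replace `sample_certs` with the live command.

```shell
#!/bin/sh
# Illustrative output from: kubectl get certificate -n dev (sample data)
sample_certs() {
cat <<'EOF'
NAME          READY   SECRET        AGE
service-tls   True    service-tls   10d
vpn-tls       False   vpn-tls       5m
EOF
}

# Print certificates that are not Ready, so kubectl describe can target them
sample_certs | awk 'NR > 1 && $2 == "False" { print $1 }'
```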

VPN/Headscale Issues

Cannot Connect to VPN

Symptoms: tailscale up fails or hangs

Diagnosis:

# Check Headscale is running
kubectl get pods -n dev headscale-0

# Check Headscale logs
kubectl logs -n dev headscale-0

# Test Headscale endpoint
curl -I https://vpn.example.com/health

Common Causes:

  1. External ingress not accessible

    kubectl get svc -n ingress-nginx-external
    # Should have EXTERNAL-IP
  2. DNS not pointing to LoadBalancer

    dig vpn.example.com
    # Should return LoadBalancer IP
  3. Firewall blocking

    • Allow HTTPS (443) to LoadBalancer IP
    • Allow UDP 41641 for WireGuard

Node Shows Offline

Symptoms: Device enrolled but shows offline

Diagnosis:

kubectl exec -n dev headscale-0 -- headscale nodes list

Fixes:

  1. Check device Tailscale daemon is running
  2. Check firewall allows UDP 41641
  3. Restart Tailscale on device

LDAP/FreeIPA Issues

LDAP Authentication Failing

Symptoms: Cannot login with LDAP credentials

Diagnosis:

# Check FreeIPA is running
kubectl get pods -n dev freeipa-0

# Test LDAP bind
kubectl exec -n dev freeipa-0 -c freeipa -- \
  ldapsearch -x -H ldaps://localhost:636 \
  -D "uid=admin,cn=users,cn=accounts,dc=dev,dc=svc,dc=cluster,dc=local" \
  -w "$FREEIPA_ADMIN_PASSWORD" \
  -b "cn=users,cn=accounts,dc=dev,dc=svc,dc=cluster,dc=local" \
  "(uid=*)"

Common Causes:

  1. Password expired

    • FreeIPA passwords expire by default
    • Reset password:
    kubectl exec -n dev freeipa-0 -c freeipa -- \
      ipa user-mod <username> --password-expiration='20261127142802Z'
  2. Wrong base DN

    • Check LDAP configuration uses correct domain
    • Should be dc=dev,dc=svc,dc=cluster,dc=local
  3. CA certificate not trusted

    • Check FreeIPA CA is mounted
    • Verify cert in /etc/ssl/certs/freeipa-ca.pem
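The base-DN mapping in cause 2 is mechanical: each DNS label becomes a `dc=` component. A small helper makes the conversion explicit, which is handy when wiring LDAP clients to a different domain (the function name here is just for illustration).

```shell
#!/bin/sh
# Convert a DNS domain into the LDAP base-DN form FreeIPA expects,
# e.g. dev.svc.cluster.local -> dc=dev,dc=svc,dc=cluster,dc=local
domain_to_basedn() {
  printf 'dc=%s\n' "$1" | sed 's/\./,dc=/g'
}

domain_to_basedn dev.svc.cluster.local
```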

Terraform Issues

Dependency Errors

Symptoms: Terraform fails with "resource doesn't exist"

Diagnosis: Look for [0] indexing in the error message

Fix: See Dependency Patterns Guide

State Lock

Symptoms: "Error acquiring the state lock"

Fix:

# If you're sure no other terraform is running:
terraform force-unlock <lock-id>

Performance Issues

Slow Pod Startup

Diagnosis:

kubectl describe pod <pod-name> -n dev
# Check Events for slow image pulls or volume mounts

Fixes:

  • Use faster storage class
  • Pre-pull images to nodes
  • Increase resource limits

High Resource Usage

Diagnosis:

kubectl top pods -n dev
kubectl top nodes

Fixes:

  • Increase node resources
  • Adjust pod resource limits
  • Scale horizontally if supported
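To decide which pods to tune first, it helps to sort `kubectl top` output by memory. The sketch below flags pods above a cutoff using an illustrative capture of `kubectl top pods -n dev` output (names, values, and the 1000Mi threshold are all example choices); replace `sample_top` with the live command.

```shell
#!/bin/sh
# Illustrative output from: kubectl top pods -n dev (sample data)
sample_top() {
cat <<'EOF'
NAME          CPU(cores)   MEMORY(bytes)
headscale-0   15m          120Mi
freeipa-0     250m         1800Mi
loki-0        80m          950Mi
EOF
}

# Flag pods using more memory than a given threshold in Mi (1000 is arbitrary)
sample_top | awk 'NR > 1 { mem = $3; sub(/Mi$/, "", mem); if (mem+0 > 1000) print $1, $3 }'
```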

Monitoring Issues

Grafana Dashboards Show "No Data"

Symptoms: All Grafana dashboards show "No data" after Thanos deployment

Diagnosis:

# Check if Thanos is enabled
kubectl get pods -n monitoring | grep thanos

# Check Grafana data source configuration
kubectl exec -n monitoring grafana-0 -c grafana -- \
  cat /etc/grafana/provisioning/datasources/datasources.yaml

# Test Prometheus endpoint from Grafana
kubectl exec -n monitoring grafana-0 -c grafana -- \
  curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=up"

Root Cause:

The dashboards are configured to query Thanos while thanos_enabled = false (or the reverse), so Grafana is pointed at a data source that is not serving the metrics.

Solution:

If Thanos is disabled (check terraform.tfvars):

  1. Update Grafana data source to point to Prometheus
  2. Update dashboard queries to remove thanos_ metric prefixes

If Thanos is enabled:

  1. Ensure Grafana data source points to http://thanos-query.monitoring.svc.cluster.local:9090
  2. Wait for Thanos to compact initial data (can take 2-4 hours)
  3. Check Thanos Query logs:
    kubectl logs -n monitoring deployment/thanos-query

See Monitoring Guide for detailed troubleshooting.
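The two configurations map to two data source URLs, both taken from this guide. A minimal helper makes the branching explicit; `thanos_enabled` here mirrors the terraform.tfvars flag, and the function name is just for illustration.

```shell
#!/bin/sh
# Pick the Grafana data source URL based on the thanos_enabled flag,
# mirroring the two cases described above.
datasource_url() {
  if [ "$1" = "true" ]; then
    echo "http://thanos-query.monitoring.svc.cluster.local:9090"
  else
    echo "http://prometheus-server.monitoring.svc.cluster.local"
  fi
}

datasource_url true
datasource_url false
```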

Prometheus High Memory Usage

Symptoms: Prometheus pod OOMKilled or running near memory limit

Diagnosis:

# Check current usage
kubectl top pods -n monitoring | grep prometheus-server

# Check retention and scrape config
kubectl get configmap -n monitoring prometheus-server -o yaml

Fixes:

  1. Reduce retention:

    # terraform/prometheus.tf
    server.retention = "7d"  # Default: 15d
  2. Increase memory:

    # terraform.tfvars
    prometheus_memory_limit = "8Gi"  # Default: 4Gi
  3. Reduce scrape frequency:

    # terraform/prometheus.tf
    server.global.scrape_interval = "60s"  # Default: 30s

Loki Not Showing Logs

Symptoms: Grafana Explore shows no logs from Loki

Diagnosis:

# Check Loki pod
kubectl get pods -n monitoring | grep loki

# Check Loki logs
kubectl logs -n monitoring deployment/loki

# Check Promtail daemonset
kubectl get daemonset -n monitoring promtail

# Check Promtail logs
kubectl logs -n monitoring daemonset/promtail --tail=50

Common Causes:

  1. Loki pod restarted (emptyDir storage is ephemeral)

    • Expected behavior with current config
    • Logs only persist until pod restart
  2. Promtail not shipping logs:

    # Confirm Loki is ready to receive pushes
    kubectl exec -n monitoring deployment/loki -- \
      curl -s localhost:3100/ready
  3. Grafana data source misconfigured:

    • URL should be: http://loki.monitoring.svc.cluster.local:3100

Thanos Compactor Not Running (if enabled)

Symptoms: Old metrics not being compacted, storage usage growing

Diagnosis:

# Check compactor pod
kubectl get pods -n monitoring | grep thanos-compactor

# Check compactor logs
kubectl logs -n monitoring statefulset/thanos-compactor

# Check PVC
kubectl get pvc -n monitoring

Fixes:

  1. Check object storage access:

    kubectl exec -n monitoring thanos-compactor-0 -- ls -la /data/thanos
  2. Verify retention settings:

    kubectl logs -n monitoring thanos-compactor-0 | grep retention
  3. Restart compactor if stuck:

    kubectl delete pod -n monitoring thanos-compactor-0

AlertManager Not Firing Alerts

Symptoms: Prometheus rules firing but no notifications

Diagnosis:

# Check AlertManager pod
kubectl get pods -n monitoring | grep alertmanager

# Check active alerts
kubectl port-forward -n monitoring svc/prometheus-alertmanager 9093:9093
# Open http://localhost:9093

# Check Prometheus alert rules
kubectl port-forward -n monitoring svc/prometheus-server 9090:80
# Open http://localhost:9090/alerts

Common Causes:

  1. No notification receivers configured

    • Update terraform/prometheus.tf with receiver config
    • Add Slack webhook, email SMTP, etc.
  2. Alert inhibition rules:

    • Check if higher-severity alert is inhibiting
  3. AlertManager config syntax error:

    kubectl logs -n monitoring deployment/prometheus-alertmanager | grep -i error

Getting Help

If you're still stuck:

  1. Check logs: Every issue leaves a trail

    kubectl logs -n <namespace> <pod> -c <container> --tail=100
  2. Describe resources: Events often explain issues

    kubectl describe <resource-type> <name> -n <namespace>
  3. Check GitHub Issues: Project Issues

  4. Review Documentation: Documentation Index

Navigation: πŸ“š Documentation Index | 🏠 Home