Common issues and solutions for Charon infrastructure.
```bash
# Check all pods
kubectl get pods -n dev

# Check services
kubectl get svc -n dev

# Check ingresses
kubectl get ingress -n dev

# Check certificates
kubectl get certificate -n dev

# Check PVCs
kubectl get pvc -n dev
```

**Symptoms:** Pods show `Pending` status indefinitely

**Diagnosis:**

```bash
kubectl describe pod <pod-name> -n dev
# Look for events at the bottom
```

**Common Causes:**
- **No storage class**

  ```bash
  kubectl get storageclass  # Should show at least one storage class
  ```

  **Fix:** Configure a storage provisioner for your cluster

- **Insufficient resources**

  ```bash
  kubectl describe nodes  # Check the Allocated resources section
  ```

  **Fix:** Add nodes or reduce resource requests

- **PVC binding failed**

  ```bash
  kubectl get pvc -n dev  # Look for Pending PVCs
  ```

  **Fix:** Check the storage provisioner logs
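When triaging many PVCs at once, the check above can be scripted. A minimal sketch, assuming the default `kubectl get pvc` table layout (`pending_pvcs` is an illustrative helper, not part of the project):

```python
def pending_pvcs(kubectl_output: str) -> list[str]:
    """Return names of PVCs whose STATUS column reads Pending."""
    names = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "Pending":
            names.append(fields[0])
    return names

sample = """\
NAME         STATUS    VOLUME   CAPACITY   STORAGECLASS   AGE
data-app-0   Bound     pv-001   10Gi       standard       4d
data-db-0    Pending                       standard       4d
"""
print(pending_pvcs(sample))  # ['data-db-0']
```

Any name this prints should then be fed to `kubectl describe pvc <name> -n dev` for the binding events.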
**Symptoms:** Pod restarts repeatedly

**Diagnosis:**

```bash
kubectl logs <pod-name> -n dev -c <container-name>
kubectl logs <pod-name> -n dev -c <container-name> --previous
```

**Common Causes:**
- **Application configuration error**
  - Check environment variables
  - Verify secrets exist
  - Check config maps

- **Dependency not ready**
  - Database not accessible
  - Required service not running

- **Health check failing**

  ```bash
  kubectl describe pod <pod-name> -n dev  # Look for failed liveness/readiness probes
  ```
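To spot every restart loop in the namespace at a glance, the `kubectl get pods` output can be filtered. A sketch assuming the default column layout (`crashlooping` is illustrative, not project code):

```python
def crashlooping(kubectl_output: str) -> list[tuple[str, int]]:
    """Return (pod, restart count) for pods whose STATUS reads CrashLoopBackOff."""
    found = []
    for line in kubectl_output.strip().splitlines()[1:]:
        fields = line.split()
        # Default columns: NAME READY STATUS RESTARTS AGE
        if len(fields) >= 4 and fields[2] == "CrashLoopBackOff":
            found.append((fields[0], int(fields[3])))
    return found

sample = """\
NAME       READY   STATUS             RESTARTS   AGE
web-0      1/1     Running            0          2d
worker-0   0/1     CrashLoopBackOff   7          2d
"""
print(crashlooping(sample))  # [('worker-0', 7)]
```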
**Symptoms:** Cannot pull the container image

**Diagnosis:**

```bash
kubectl describe pod <pod-name> -n dev
# Check Events for image pull errors
```

**Fixes:**
- Verify the image name and tag are correct
- Check registry credentials if using a private registry
- Verify network connectivity to the registry
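A frequent source of wrong-tag errors is a `:` in the registry host (a port) being mistaken for a tag separator. A sketch of the split rule, for eyeballing an image reference before pushing a fix (`split_image_ref` is illustrative):

```python
def split_image_ref(ref: str) -> tuple[str, str]:
    """Split 'repo[:tag]' into (repository, tag); tag defaults to 'latest'.
    A ':' inside the registry host (a port) is not treated as a tag separator."""
    name, sep, tag = ref.rpartition(":")
    if sep and "/" not in tag:
        return name, tag
    return ref, "latest"

print(split_image_ref("nginx"))                    # ('nginx', 'latest')
print(split_image_ref("ghcr.io/org/app:v1.2"))     # ('ghcr.io/org/app', 'v1.2')
print(split_image_ref("registry.local:5000/app"))  # ('registry.local:5000/app', 'latest')
```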
**Symptoms:** Cannot resolve `service.example.com`

**Diagnosis:**

```bash
# From a VPN-connected device
dig service.example.com

# Check Cloudflare records
curl -X GET \
  "https://api.cloudflare.com/client/v4/zones/$CLOUDFLARE_ZONE_ID/dns_records" \
  -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" | jq
```

**Common Causes:**
- **DNS not updated yet**
  - Wait 1-2 minutes for the DNS update script
  - Check `terraform output`

- **Wrong zone ID**
  - Verify `cloudflare_zone_id` in `terraform.tfvars`

- **API token permissions**
  - Token needs the `Zone:DNS:Edit` permission
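For reference, a record created through the Cloudflare API carries a payload like the one below. This is an assumed shape, not the project's actual `update_service_dns.py` code; `dns_record_payload` is a hypothetical helper:

```python
def dns_record_payload(record_name: str, ip: str, proxied: bool = False) -> dict:
    """Body for POST /zones/<zone_id>/dns_records creating an A record."""
    return {
        "type": "A",
        "name": record_name,
        "content": ip,
        "ttl": 1,          # 1 means "automatic" in the Cloudflare API
        "proxied": proxied,
    }

payload = dns_record_payload("service", "203.0.113.10")
print(payload["type"], payload["ttl"])  # A 1
```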
**Manual Fix:**

```bash
# Re-run the DNS update
cd scripts
python dns/update_service_dns.py \
  --zone-id $CLOUDFLARE_ZONE_ID \
  --namespace dev \
  --services '[{"name":"service","hostname":"service.example.com","record_name":"service","enabled":true}]'
```

**Symptoms:** VPN connected but can't resolve service hostnames
**Diagnosis:**

```bash
tailscale status
# Should show "MagicDNS: enabled"
```

**Fix:**

```bash
# Restart Tailscale
sudo tailscale down
sudo tailscale up --login-server https://vpn.example.com
```

**Symptoms:** Certificate shows `Ready: False`
Diagnosis:
kubectl describe certificate <cert-name> -n dev
kubectl get challenges -A
kubectl describe clusterissuer letsencrypt-prodCommon Causes:
- **DNS-01 challenge failing**
  - Cloudflare API token lacks permissions
  - Zone ID incorrect

- **Rate limiting**
  - Let's Encrypt rate limits hit
  - Use the staging issuer for testing

- **cert-manager not ready**

  ```bash
  kubectl get pods -n cert-manager
  kubectl logs -n cert-manager -l app=cert-manager
  ```
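On the rate-limiting point: Let's Encrypt's production cap is 50 certificates per registered domain per rolling week (verify the current figure against their published limits). A sketch of a guard that warns before repeated delete/recreate cycles hit the cap (`near_rate_limit` is illustrative):

```python
from datetime import datetime, timedelta

CERTS_PER_DOMAIN_PER_WEEK = 50  # Let's Encrypt production limit; check current docs

def near_rate_limit(issuance_times: list[datetime], now: datetime, margin: int = 5) -> bool:
    """True when issuances in the trailing week approach the weekly cap."""
    cutoff = now - timedelta(days=7)
    recent = sum(1 for t in issuance_times if t >= cutoff)
    return recent >= CERTS_PER_DOMAIN_PER_WEEK - margin

now = datetime(2025, 1, 8)
print(near_rate_limit([datetime(2025, 1, 7)] * 46, now))  # True
print(near_rate_limit([datetime(2025, 1, 7)] * 10, now))  # False
```

The staging issuer has far looser limits, which is why it is the right target while iterating.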
**Fix:**

```bash
# Delete and recreate the certificate
kubectl delete certificate <cert-name> -n dev
# Terraform will recreate it
terraform apply
```

**Symptoms:** `tailscale up` fails or hangs
**Diagnosis:**

```bash
# Check Headscale is running
kubectl get pods -n dev headscale-0

# Check Headscale logs
kubectl logs -n dev headscale-0

# Test the Headscale endpoint
curl -I https://vpn.example.com/health
```

**Common Causes:**
- **External ingress not accessible**

  ```bash
  kubectl get svc -n ingress-nginx-external  # Should have an EXTERNAL-IP
  ```

- **DNS not pointing to the LoadBalancer**

  ```bash
  dig vpn.example.com  # Should return the LoadBalancer IP
  ```

- **Firewall blocking**
  - Allow HTTPS (443) to the LoadBalancer IP
  - Allow UDP 41641 for WireGuard
**Symptoms:** Device enrolled but shows offline

**Diagnosis:**

```bash
kubectl exec -n dev headscale-0 -- headscale nodes list
```

**Fixes:**
- Check the device's Tailscale daemon is running
- Check the firewall allows UDP 41641
- Restart Tailscale on the device
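With many devices, the `headscale nodes list` table can be filtered for offline rows. A sketch that assumes a whitespace-separated table with the hostname in the second column and an online/offline status word; adjust to the actual layout your Headscale version prints:

```python
def offline_nodes(table: str) -> list[str]:
    """Hostnames of rows containing 'offline' (hostname assumed in column two)."""
    names = []
    for line in table.strip().splitlines()[1:]:
        fields = line.split()
        if len(fields) >= 2 and "offline" in (f.lower() for f in fields):
            names.append(fields[1])
    return names

sample = """\
ID  Hostname  User   Status
1   laptop    alice  online
2   nas       bob    offline
"""
print(offline_nodes(sample))  # ['nas']
```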
**Symptoms:** Cannot log in with LDAP credentials

**Diagnosis:**

```bash
# Check FreeIPA is running
kubectl get pods -n dev freeipa-0

# Test an LDAP bind
kubectl exec -n dev freeipa-0 -c freeipa -- \
  ldapsearch -x -H ldaps://localhost:636 \
  -D "uid=admin,cn=users,cn=accounts,dc=dev,dc=svc,dc=cluster,dc=local" \
  -w "$FREEIPA_ADMIN_PASSWORD" \
  -b "cn=users,cn=accounts,dc=dev,dc=svc,dc=cluster,dc=local" \
  "(uid=*)"
```

**Common Causes:**
- **Password expired**
  - FreeIPA passwords expire by default
  - Reset the expiration:

    ```bash
    kubectl exec -n dev freeipa-0 -c freeipa -- \
      ipa user-mod username --password-expiration='20261127142802Z'
    ```

- **Wrong base DN**
  - Check the LDAP configuration uses the correct domain
  - Should be `dc=dev,dc=svc,dc=cluster,dc=local`

- **CA certificate not trusted**
  - Check the FreeIPA CA is mounted
  - Verify the cert in `/etc/ssl/certs/freeipa-ca.pem`
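The base DN follows mechanically from the domain: each dotted label becomes one `dc=` component, prefixed by the FreeIPA users container. A sketch of that mapping, useful for sanity-checking a client's LDAP config (`base_dn` is illustrative):

```python
def base_dn(domain: str, container: str = "cn=users,cn=accounts") -> str:
    """FreeIPA-style search base: container RDNs plus one dc= per domain label."""
    dcs = ",".join(f"dc={label}" for label in domain.split("."))
    return f"{container},{dcs}"

print(base_dn("dev.svc.cluster.local"))
# cn=users,cn=accounts,dc=dev,dc=svc,dc=cluster,dc=local
```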
**Symptoms:** Terraform fails with "resource doesn't exist"

**Diagnosis:** Look for `[0]` indexing in the error message

**Fix:** See the Dependency Patterns Guide

**Symptoms:** "Error acquiring the state lock"

**Fix:**

```bash
# If you're sure no other terraform run is in progress:
terraform force-unlock <lock-id>
```

**Diagnosis:**
```bash
kubectl describe pod <pod-name> -n dev
# Check Events for slow image pulls or volume mounts
```

**Fixes:**
- Use a faster storage class
- Pre-pull images to nodes
- Increase resource limits
**Diagnosis:**

```bash
kubectl top pods -n dev
kubectl top nodes
```

**Fixes:**
- Increase node resources
- Adjust pod resource limits
- Scale horizontally if supported
**Symptoms:** All Grafana dashboards show "No data" after the Thanos deployment

**Diagnosis:**

```bash
# Check if Thanos is enabled
kubectl get pods -n monitoring | grep thanos

# Check the Grafana data source configuration
kubectl exec -n monitoring grafana-0 -c grafana -- \
  cat /etc/grafana/provisioning/datasources/datasources.yaml

# Test the Prometheus endpoint from Grafana
kubectl exec -n monitoring grafana-0 -c grafana -- \
  curl http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=up
```

**Root Cause:** Dashboards are configured to query Thanos but `thanos_enabled = false`, or vice versa.
**Solution:**

If Thanos is disabled (check `terraform.tfvars`):
- Update the Grafana data source to point to Prometheus
- Update dashboard queries to remove `thanos_` metric prefixes

If Thanos is enabled:
- Ensure the Grafana data source points to `http://thanos-query.monitoring.svc.cluster.local:9090`
- Wait for Thanos to compact the initial data (can take 2-4 hours)
- Check the Thanos Query logs:

  ```bash
  kubectl logs -n monitoring deployment/thanos-query
  ```

See the Monitoring Guide for detailed troubleshooting.
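The enabled/disabled split above reduces to one decision: which URL the Grafana data source should carry. A sketch of that selection, using the two service URLs from this guide (`metrics_datasource_url` is an illustrative helper, not project code):

```python
def metrics_datasource_url(thanos_enabled: bool) -> str:
    """Grafana metrics data source URL for each mode (URLs as used in this guide)."""
    if thanos_enabled:
        return "http://thanos-query.monitoring.svc.cluster.local:9090"
    return "http://prometheus-server.monitoring.svc.cluster.local"

print(metrics_datasource_url(False))  # http://prometheus-server.monitoring.svc.cluster.local
```

Whatever provisions the data source should branch on the same `thanos_enabled` flag Terraform uses, so the two can never drift apart.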
**Symptoms:** Prometheus pod OOMKilled or running near its memory limit

**Diagnosis:**

```bash
# Check current usage
kubectl top pod -n monitoring prometheus-server-*

# Check retention and scrape config
kubectl get configmap -n monitoring prometheus-server -o yaml
```

**Fixes:**
- **Reduce retention:**

  ```hcl
  # terraform/prometheus.tf
  server.retention = "7d"  # Default: 15d
  ```

- **Increase memory:**

  ```hcl
  # terraform.tfvars
  prometheus_memory_limit = "8Gi"  # Default: 4Gi
  ```

- **Reduce scrape frequency:**

  ```hcl
  # terraform/prometheus.tf
  server.global.scrape_interval = "60s"  # Default: 30s
  ```
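The scrape-interval fix works because ingestion scales inversely with the interval: each active series produces one sample per scrape. A back-of-envelope sketch (a rough heuristic only; real memory use also depends on series churn, label cardinality, and query load):

```python
def samples_per_second(active_series: int, scrape_interval_s: float) -> float:
    """Steady-state ingestion rate: one sample per series per scrape."""
    return active_series / scrape_interval_s

# Doubling the interval from 30s to 60s halves the ingestion rate:
print(samples_per_second(120_000, 30))  # 4000.0
print(samples_per_second(120_000, 60))  # 2000.0
```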
**Symptoms:** Grafana Explore shows no logs from Loki

**Diagnosis:**

```bash
# Check the Loki pod
kubectl get pods -n monitoring | grep loki

# Check Loki logs
kubectl logs -n monitoring deployment/loki

# Check the Promtail daemonset
kubectl get daemonset -n monitoring promtail

# Check Promtail logs
kubectl logs -n monitoring daemonset/promtail --tail=50
```

**Common Causes:**
- **Loki pod restarted** (emptyDir storage is ephemeral)
  - Expected behavior with the current config
  - Logs only persist until the pod restarts

- **Promtail not collecting:**

  ```bash
  # Check Loki is ready to receive logs
  kubectl exec -n monitoring deployment/loki -- \
    curl localhost:3100/ready
  ```

- **Grafana data source misconfigured:**
  - URL should be `http://loki.monitoring.svc.cluster.local:3100`
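For the misconfigured-data-source case, these are the fields a correct Loki entry carries; the shape follows Grafana's datasource provisioning schema, expressed here as a Python dict for checking (a sketch, not the project's provisioning file):

```python
def loki_datasource() -> dict:
    """Grafana provisioning entry for Loki (URL as used in this guide)."""
    return {
        "name": "Loki",
        "type": "loki",
        "access": "proxy",
        "url": "http://loki.monitoring.svc.cluster.local:3100",
    }

print(loki_datasource()["url"])  # http://loki.monitoring.svc.cluster.local:3100
```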
**Symptoms:** Old metrics not being compacted, storage usage growing

**Diagnosis:**

```bash
# Check the compactor pod
kubectl get pods -n monitoring | grep thanos-compactor

# Check compactor logs
kubectl logs -n monitoring statefulset/thanos-compactor

# Check the PVC
kubectl get pvc -n monitoring
```

**Fixes:**
- **Check object storage access:**

  ```bash
  kubectl exec -n monitoring thanos-compactor-0 -- ls -la /data/thanos
  ```

- **Verify retention settings:**

  ```bash
  kubectl logs -n monitoring thanos-compactor-0 | grep retention
  ```

- **Restart the compactor if it is stuck:**

  ```bash
  kubectl delete pod -n monitoring thanos-compactor-0
  ```
**Symptoms:** Prometheus rules firing but no notifications arriving

**Diagnosis:**

```bash
# Check the AlertManager pod
kubectl get pods -n monitoring | grep alertmanager

# Check active alerts
kubectl port-forward -n monitoring svc/prometheus-alertmanager 9093:9093
# Open http://localhost:9093

# Check Prometheus alert rules
kubectl port-forward -n monitoring svc/prometheus-server 9090:80
# Open http://localhost:9090/alerts
```

**Common Causes:**
- **No notification receivers configured**
  - Update `terraform/prometheus.tf` with a receiver config
  - Add a Slack webhook, email SMTP settings, etc.

- **Alert inhibition rules:**
  - Check whether a higher-severity alert is inhibiting the notification

- **AlertManager config syntax error:**

  ```bash
  kubectl logs -n monitoring deployment/prometheus-alertmanager | grep -i error
  ```
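For the no-receivers case, this is the shape of a Slack receiver as it would appear in the Alertmanager config's `receivers:` list, expressed as a Python dict (a sketch following the Alertmanager `slack_configs` schema; the webhook URL and channel are placeholders):

```python
def slack_receiver(webhook_url: str, channel: str) -> dict:
    """Receiver entry for the Alertmanager config's `receivers:` list."""
    return {
        "name": "slack-notifications",
        "slack_configs": [
            {"api_url": webhook_url, "channel": channel, "send_resolved": True}
        ],
    }

r = slack_receiver("https://hooks.slack.com/services/T000/B000/XXXX", "#alerts")
print(r["slack_configs"][0]["channel"])  # #alerts
```

A route must also point at the receiver by `name`, or the config validates but still notifies no one.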
If you're still stuck:

- **Check logs.** Every issue leaves a trail:

  ```bash
  kubectl logs -n monitoring <pod> -c <container> --tail=100
  ```

- **Describe resources.** Events often explain issues:

  ```bash
  kubectl describe <resource-type> <name> -n monitoring
  ```

- **Check GitHub Issues:** Project Issues

- **Review the documentation:**
  - Monitoring Guide - complete observability stack documentation
  - Architecture
  - Dependency Patterns
  - LDAP Troubleshooting