Comprehensive guide to Charon's observability stack: Prometheus, Grafana, Thanos, Loki, and AlertManager.
Charon's monitoring infrastructure provides:
- Prometheus - Real-time metrics collection (15-day retention)
- Thanos - Long-term metrics storage (30/90/180-day retention with downsampling)
- Grafana - Visualization and dashboarding with git-sync provisioning
- Loki - Log aggregation (ephemeral emptyDir storage)
- Promtail - Log collection from all pods
- AlertManager - Alert routing and notification
All components are deployed in the `monitoring` namespace.
```
┌───────────────────────────────────────────────────────────┐
│                   Monitoring Namespace                    │
├───────────────────────────────────────────────────────────┤
│                                                           │
│   ┌──────────────┐         ┌─────────────────┐            │
│   │  Prometheus  │────────▶│  Thanos Query   │            │
│   │  (15d local) │         │  (global view)  │            │
│   └──────────────┘         └─────────────────┘            │
│          │                          │                     │
│          │                          ▼                     │
│          │                 ┌─────────────────┐            │
│          │                 │ Thanos Store GW │            │
│          │                 │ (long-term TSDB)│            │
│          │                 └─────────────────┘            │
│          │                          │                     │
│          ▼                          ▼                     │
│   ┌──────────────────────────────────┐                    │
│   │             Grafana              │                    │
│   │    (dashboards via git-sync)     │                    │
│   └──────────────────────────────────┘                    │
│                                                           │
│   ┌──────────────┐         ┌─────────────────┐            │
│   │   Promtail   │────────▶│      Loki       │            │
│   │  (DaemonSet) │         │  (emptyDir tmp) │            │
│   └──────────────┘         └─────────────────┘            │
│                                     │                     │
│                                     ▼                     │
│                              Grafana (logs)               │
│                                                           │
│   ┌──────────────┐                                        │
│   │ AlertManager │                                        │
│   │  (alerting)  │                                        │
│   └──────────────┘                                        │
└───────────────────────────────────────────────────────────┘
```
Purpose: Current metrics collection and short-term storage
Configuration:

```hcl
# terraform.tfvars
prometheus_enabled = true
prometheus_version = "25.8.0"  # Helm chart version
prometheus_storage = "50Gi"    # Local retention
```

Features:
- 15-day local retention
- 30-second scrape interval
- AlertManager integration
- ServiceMonitor auto-discovery
- Node exporter (system metrics)
- Kube-state-metrics (K8s objects)
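ServiceMonitor auto-discovery means Prometheus scrapes any Service whose labels match a ServiceMonitor selector. A minimal sketch of such an object (the names, labels, and port are illustrative, not taken from this cluster):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: headscale            # hypothetical example name
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: headscale         # must match the target Service's labels
  namespaceSelector:
    matchNames:
      - core
  endpoints:
    - port: metrics          # named port on the Service
      interval: 30s          # matches the global 30-second scrape interval
```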
Access:

```shell
# Via kubectl port-forward
kubectl port-forward -n monitoring svc/prometheus-server 9090:80

# Open in browser
open http://localhost:9090
```

Purpose: Long-term metrics storage with downsampling
Configuration:

```hcl
# terraform.tfvars
thanos_enabled = true
thanos_version = "15.7.27"  # Helm chart version
thanos_compactor_storage = "50Gi"
thanos_storegateway_storage = "50Gi"
```

Components:
- Query - Global query interface (Grafana uses this)
- Query Frontend - Query caching and splitting
- Store Gateway - Serves historical data from object storage
- Compactor - Compacts and downsamples metrics
Retention Policy:
- Raw metrics: 30 days
- 5-minute downsampled: 90 days
- 1-hour downsampled: 180 days
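The tiers above map onto the Thanos Compactor's per-resolution retention flags; assuming a standard Compactor deployment, this policy would be encoded roughly as:

```
--retention.resolution-raw=30d
--retention.resolution-5m=90d
--retention.resolution-1h=180d
```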
Storage:
- Type: Filesystem (local PVCs)
- Compactor: 50Gi retain PVC
- Store Gateway: 50Gi retain PVC
IMPORTANT: Thanos is currently disabled by default (thanos_enabled = false). Dashboards use Prometheus directly.
Purpose: Visualization and dashboarding
Configuration:

```hcl
# terraform.tfvars
grafana_enabled = true
grafana_version = "8.5.0"
grafana_hostname = "grafana.example.com"
grafana_admin_password = "secure-password"

# Git dashboard provisioning
grafana_dashboards_git_enabled = true
grafana_dashboards_git_repo = "https://github.com/org/dashboards"
grafana_dashboards_git_branch = "main"
grafana_dashboards_git_token = "github-token"
```

Access:
```shell
# Via VPN
open https://grafana.example.com

# Default login
# Username: admin
# Password: <grafana_admin_password>
```

Data Sources:
When Thanos is enabled:

```yaml
Thanos:
  URL: http://thanos-query.monitoring.svc.cluster.local:9090
  Type: Prometheus
  Access: Server (default)
```

When Thanos is disabled (current):

```yaml
Prometheus:
  URL: http://prometheus-server.monitoring.svc.cluster.local
  Type: Prometheus
  Access: Server (default)
```

Loki (always):

```yaml
Loki:
  URL: http://loki.monitoring.svc.cluster.local:3100
  Type: Loki
  Access: Server (default)
```

Purpose: Log aggregation and querying
Configuration:

```hcl
# terraform.tfvars
loki_enabled = true
loki_version = "6.16.0"  # Helm chart version
```

Storage:
- Type: emptyDir (ephemeral)
- Retention: Until pod restart
- Purpose: Short-term log analysis
Features:
- Single binary deployment
- Filesystem storage (no S3)
- TSDB schema (v13)
- 24-hour index period
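The features above translate into Loki's schema configuration; a sketch of the relevant Helm-values fragment (the `from` date is illustrative, not taken from this deployment):

```yaml
loki:
  schemaConfig:
    configs:
      - from: "2024-01-01"        # illustrative schema start date
        store: tsdb               # TSDB index (schema v13)
        object_store: filesystem  # local filesystem, no S3
        schema: v13
        index:
          prefix: index_
          period: 24h             # 24-hour index period
```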
Purpose: Log collection from all pods
Deployment: DaemonSet on all nodes
Configuration:
- Automatically discovers pods
- Forwards to Loki at http://loki:3100
- Adds Kubernetes metadata labels
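A sketch of the Promtail configuration this behavior implies (the values are illustrative; the Helm chart's defaults already provide equivalents):

```yaml
clients:
  - url: http://loki:3100/loki/api/v1/push   # Loki push API endpoint
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod                            # discover every pod via the K8s API
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace              # attach Kubernetes metadata as labels
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```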
Purpose: Alert routing and notifications
Configuration:
- Deployed with Prometheus Helm chart
- 10Gi retain PVC for alert state
- Web UI on port 9093
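Routing is controlled by AlertManager's configuration; a minimal sketch of its shape (grouping and timing values are illustrative, and the real receivers depend on the deployment):

```yaml
route:
  receiver: default
  group_by: [alertname, namespace]
  group_wait: 30s        # wait before sending the first notification for a group
  repeat_interval: 4h    # re-notify while an alert stays firing
receivers:
  - name: default
    # notification integrations (webhook, email, etc.) would go here
```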
Alerting Rules:
Infrastructure alerts:
- PrometheusDown (critical)
- GrafanaDown (warning)
- LokiDown (warning)
- HeadscaleDown (critical)
Resource alerts:
- HighMemoryUsage (>85% for 10m)
- HighDiskUsage (>85% for 10m)
- PersistentVolumeSpaceRunningOut (<15%)
Certificate alerts:
- CertificateExpiryWarning (<30 days)
- CertificateExpiryCritical (<7 days)
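As an example of how one of these rules might look in Prometheus rule format (the exact expression is an assumption, not copied from the cluster's rule files):

```yaml
groups:
  - name: resource-alerts
    rules:
      - alert: HighMemoryUsage
        # assumed expression: node memory usage above 85%
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
        for: 10m                  # must hold for 10 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 85% on {{ $labels.instance }}"
```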
```shell
# All monitoring pods
kubectl get pods -n monitoring

# Check Prometheus
kubectl logs -n monitoring deployment/prometheus-server

# Check Thanos query (if enabled)
kubectl logs -n monitoring deployment/thanos-query

# Check Loki
kubectl logs -n monitoring deployment/loki

# Check Promtail
kubectl logs -n monitoring daemonset/promtail -f
```

Prometheus:
```shell
kubectl port-forward -n monitoring svc/prometheus-server 9090:80
open http://localhost:9090
```

Thanos Query (if enabled):

```shell
kubectl port-forward -n monitoring svc/thanos-query 9090:9090
open http://localhost:9090
```

AlertManager:

```shell
kubectl port-forward -n monitoring svc/prometheus-alertmanager 9093:9093
open http://localhost:9093
```

Grafana:

```shell
# Via VPN only
open https://grafana.example.com
```

PromQL Examples:
```promql
# CPU usage by pod
rate(container_cpu_usage_seconds_total[5m])

# Memory usage by namespace
sum(container_memory_working_set_bytes) by (namespace)

# Network IO by pod
rate(container_network_transmit_bytes_total[5m])

# Disk usage
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100

# Service uptime
up{job="headscale"}
```
Via Grafana Explore:
- Open Grafana
- Click Explore (compass icon)
- Select Loki data source
- Query examples:

```logql
# All logs from namespace
{namespace="core"}

# Specific pod logs
{pod="headscale-0"}

# Error logs across cluster
{namespace=~".+"} |= "error"

# Grafana access logs
{app="grafana"} | json | line_format "{{.status}} {{.method}} {{.path}}"
```
Edit Prometheus rules:

```shell
# View current rules
kubectl get configmap -n monitoring prometheus-server -o yaml

# Edit via Terraform
# Update terraform/prometheus.tf serverFiles section

# Apply changes
cd terraform && terraform apply
```

Via Git-Sync (Recommended): Dashboards are automatically synced from the Git repository every 60 seconds.
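Under the hood, git-sync writes dashboard JSON into a directory that a Grafana file provider watches; a sketch of that dashboard-provisioning fragment (the provider name is illustrative):

```yaml
apiVersion: 1
providers:
  - name: git-dashboards                    # illustrative provider name
    type: file
    options:
      path: /var/lib/grafana/dashboards     # directory git-sync writes into
      foldersFromFilesStructure: true       # mirror repo folders as Grafana folders
```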
Manual Backup via API:

```shell
# Export all dashboards
curl -u admin:password https://grafana.example.com/api/search | \
  jq -r '.[] | select(.type=="dash-db") | .uid' | \
  while read uid; do
    curl -u admin:password "https://grafana.example.com/api/dashboards/uid/$uid" | \
      jq '.dashboard' > "dashboard-$uid.json"
  done
```

Known issue: after the Thanos deployment, Grafana dashboards can end up pointing at the wrong data source.
Diagnosis:

```shell
# Check Grafana data source configuration
kubectl exec -n monitoring grafana-0 -c grafana -- \
  cat /etc/grafana/provisioning/datasources/datasources.yaml

# Check if Thanos is enabled
kubectl get pods -n monitoring | grep thanos

# Test Prometheus endpoint
kubectl exec -n monitoring grafana-0 -c grafana -- \
  curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=up"
```

Solution:
If Thanos is disabled (current state), the Grafana data source should point to Prometheus:

```yaml
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus-server.monitoring.svc.cluster.local
```

If Thanos is enabled, the Grafana data source should point to Thanos Query:

```yaml
datasources:
  - name: Thanos
    type: prometheus
    url: http://thanos-query.monitoring.svc.cluster.local:9090
```

Dashboard queries:
- If using Prometheus: remove any `thanos_` prefixes from queries
- If using Thanos: queries work the same as Prometheus (PromQL-compatible)
Targets not being scraped:

```shell
# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-server 9090:80
# Open http://localhost:9090/targets

# Check ServiceMonitor CRDs
kubectl get servicemonitor -A

# Verify pod labels match ServiceMonitor selectors
kubectl get pods -n core headscale-0 --show-labels
```

Logs not appearing in Loki:

```shell
# Check Promtail pods
kubectl get pods -n monitoring -l app.kubernetes.io/name=promtail

# Check Promtail logs
kubectl logs -n monitoring daemonset/promtail -f

# Test Loki endpoint
kubectl exec -n monitoring deployment/loki -- \
  curl http://localhost:3100/ready

# Check Loki logs
kubectl logs -n monitoring deployment/loki -f
```

Alerts not firing:

```shell
# Check AlertManager status
kubectl logs -n monitoring deployment/prometheus-alertmanager

# View active alerts
kubectl port-forward -n monitoring svc/prometheus-alertmanager 9093:9093
# Open http://localhost:9093

# Check Prometheus alert rules
kubectl port-forward -n monitoring svc/prometheus-server 9090:80
# Open http://localhost:9090/alerts
```

Thanos compactor issues:

```shell
# Check compactor logs
kubectl logs -n monitoring statefulset/thanos-compactor

# Check compactor PVC
kubectl get pvc -n monitoring thanos-compactor-data-thanos-compactor-0

# Verify object storage access
kubectl exec -n monitoring thanos-compactor-0 -- \
  ls -la /data/thanos
```

High memory usage:

Prometheus:
```shell
# Check current usage
kubectl top pod -n monitoring prometheus-server-*

# Increase resources in terraform.tfvars
# prometheus_memory_limit = "8Gi"

# Apply changes
cd terraform && terraform apply
```

Thanos:

```shell
# Check Thanos Query memory
kubectl top pod -n monitoring thanos-query-*

# Increase in terraform.tfvars
# thanos_query_memory_limit = "2Gi"
```

| Exporter | Metrics | Port | Notes |
|---|---|---|---|
| node-exporter | System metrics (CPU, memory, disk, network) | 9100 | DaemonSet on all nodes |
| kube-state-metrics | Kubernetes object metrics | 8080 | Deployment counts, pod status, etc. |
| Headscale | VPN metrics | 9090 | Connected nodes, traffic stats |
| Grafana | Dashboard metrics | 3000 | User sessions, query performance |
| Loki | Log ingestion metrics | 3100 | Log rates, storage usage |
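For reference, the Headscale row above corresponds to a static scrape config like the following (in practice discovery happens via ServiceMonitor, and the Service DNS name here is hypothetical):

```yaml
scrape_configs:
  - job_name: headscale
    static_configs:
      - targets: ["headscale.core.svc.cluster.local:9090"]  # hypothetical Service DNS name
```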
Cluster Health:

```promql
# Node status
up{job="kubernetes-nodes"}

# Pod restart count
kube_pod_container_status_restarts_total

# Persistent volume usage
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes
```

Application Metrics:

```promql
# Headscale connected nodes
headscale_registered_nodes

# Grafana active users
grafana_stat_active_users

# Loki ingestion rate
rate(loki_distributor_bytes_received_total[5m])
```

Resource Utilization:

```promql
# CPU usage by namespace
sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)

# Memory usage by pod
container_memory_working_set_bytes{pod!=""}

# Network bandwidth
rate(container_network_transmit_bytes_total[5m])
```
| Component | Data Type | Retention | Storage |
|---|---|---|---|
| Prometheus | Raw metrics | 15 days | 50Gi block storage |
| Thanos (raw) | Raw metrics | 30 days | Compactor PVC |
| Thanos (5m) | Downsampled | 90 days | Compactor PVC |
| Thanos (1h) | Downsampled | 180 days | Compactor PVC |
| Loki | Logs | Ephemeral | emptyDir (until restart) |
| AlertManager | Alert state | Persistent | 10Gi retain PVC |
Prometheus:

```
# Decrease retention
# Edit terraform/prometheus.tf
server.retention = "7d"

# Reduce scrape frequency
server.global.scrape_interval = "60s"
```

Loki:

```hcl
# Add persistence (if needed)
loki_enabled = true

# Increase resources
loki_memory_limit = "2Gi"
loki_cpu_limit = "1000m"
```

Thanos:

```
# Increase replicas via Helm values
# Edit terraform/thanos.tf
query.replicas = 2

# Apply
terraform apply
```

Grafana automatically syncs dashboards from the Git repository.
Repository Structure:
```
grafana-dashboards/
├── infrastructure/
│   ├── kubernetes-cluster.json
│   └── node-metrics.json
├── services/
│   ├── headscale.json
│   └── grafana.json
└── README.md
```
Update Dashboards:
- Commit JSON to Git repository
- Wait up to 60 seconds for sync
- Refresh Grafana UI
Check Sync Status:
```shell
# View git-sync logs
kubectl logs -n monitoring grafana-0 -c git-sync -f

# Verify dashboards directory
kubectl exec -n monitoring grafana-0 -c grafana -- \
  ls -la /var/lib/grafana/dashboards
```

- VPN-only access - All UIs accessible only via Tailscale VPN
- TLS everywhere - Grafana uses cert-manager certificates
- LDAP authentication - Grafana integrates with FreeIPA
- RBAC - Prometheus has cluster-wide read permissions
- Secrets management - Passwords stored as Kubernetes secrets
- Grafana Service
- Prometheus Service (TODO)
- Thanos Service (TODO)
- Loki Service (TODO)
- Troubleshooting Guide
- Terraform Variables