Comprehensive observability platform with metrics, logs, traces, and long-term storage.
The monitoring stack in Charon provides complete observability:
- Prometheus - Metrics collection and alerting
- Grafana - Visualization and dashboards
- Loki - Log aggregation
- Tempo - Distributed tracing
- Thanos - Long-term metrics storage
- Promtail - Log collection agent
```
┌──────────────────────────────────────────────────────────┐
│                     Monitoring Stack                     │
├──────────────────────────────────────────────────────────┤
│                                                          │
│   ┌──────────────┐        ┌──────────────┐               │
│   │  Prometheus  │───────▶│    Thanos    │               │
│   │  (metrics)   │        │ (long-term)  │               │
│   └──────┬───────┘        └──────┬───────┘               │
│          │                       │                       │
│          ▼                       ▼                       │
│   ┌───────────────────────────────────────┐              │
│   │                Grafana                │              │
│   │  ┌──────────┐  ┌──────┐  ┌──────┐     │              │
│   │  │Prometheus│  │ Loki │  │Tempo │     │              │
│   │  │DataSource│  │  DS  │  │  DS  │     │              │
│   │  └──────────┘  └──────┘  └──────┘     │              │
│   └───────────────────────────────────────┘              │
│        ▲              ▲             ▲                    │
│        │              │             │                    │
│   ┌──────────┐  ┌──────────┐  ┌──────────┐               │
│   │   Apps   │  │ Promtail │  │   Apps   │               │
│   │ (metrics)│  │  (logs)  │  │ (traces) │               │
│   └──────────┘  └──────────┘  └──────────┘               │
└──────────────────────────────────────────────────────────┘
```
### Prometheus

Purpose: Metrics collection and alerting
Namespace: monitoring
Port: 9090
Storage: 50Gi (retain)
Features:
- Service discovery via Kubernetes API
- 35+ scrape targets across 10 jobs
- AlertManager integration
- Federation support for Thanos
Configuration targets:
- Kubernetes components (API server, kubelet, etcd)
- Service endpoints (Headscale, FreeIPA, Grafana, etc.)
- Node metrics via node-exporter
- GPU metrics via DCGM exporter
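Target discovery is driven by Kubernetes service discovery. As an illustrative sketch (the job name, annotations, and relabeling rules are assumptions, not the actual Charon scrape configuration), a pod-level scrape job might look like:

```yaml
# Hypothetical scrape job; Charon's real job names and labels may differ
- job_name: kubernetes-pods
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # Only scrape pods that opt in via the prometheus.io/scrape annotation
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: "true"
    # Allow pods to override the metrics path via annotation
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
```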
### Grafana

Purpose: Visualization and dashboards
Namespace: monitoring
Port: 3000 (app), 443 (nginx-tls)
Storage: 10Gi for dashboards
Access: VPN-only via HTTPS
Features:
- Pre-configured dashboards from Git repository
- Tempo correlation for traces
- LDAP authentication via FreeIPA
- Multi-datasource support
Dashboards included:
- Kubernetes cluster overview
- Node exporter metrics
- Headscale VPN statistics
- Open-WebUI performance
- Loki log exploration
- Tempo trace analysis
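Multi-datasource support is typically wired up through Grafana's declarative provisioning. A minimal sketch (the in-cluster service DNS names are assumed from the `monitoring` namespace, not taken from the actual deployment):

```yaml
# Illustrative Grafana datasource provisioning file
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus.monitoring.svc.cluster.local:9090
    isDefault: true
  - name: Loki
    type: loki
    url: http://loki.monitoring.svc.cluster.local:3100
  - name: Tempo
    type: tempo
    url: http://tempo.monitoring.svc.cluster.local:3200
```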
### Loki

Purpose: Log aggregation and querying
Namespace: monitoring
Port: 3100
Storage: 50Gi (emptyDir for short-term)
Features:
- LogQL query language
- Label-based log organization
- Promtail automatic collection
- Grafana integration
Log sources:
- All pod logs via Promtail
- System logs from nodes
- Application-specific logs
### Tempo

Purpose: Distributed tracing
Namespace: monitoring
Port: 3200 (HTTP), 4317 (gRPC OTLP)
Storage: 10Gi
Features:
- OpenTelemetry support
- Trace to logs correlation
- Grafana native integration
- Service dependency mapping
Integrated services:
- Open-WebUI (OTLP traces)
- Custom applications with OpenTelemetry SDK
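Applications instrumented with the OpenTelemetry SDK usually pick up the OTLP endpoint from the standard environment variables. A sketch of a container spec fragment (the Tempo service DNS name is an assumption based on the `monitoring` namespace):

```yaml
# Illustrative env block for an instrumented container
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://tempo.monitoring.svc.cluster.local:4317"
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "grpc"
  - name: OTEL_SERVICE_NAME
    value: "open-webui"
```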
### Thanos

Purpose: Long-term metrics storage and global view
Namespace: monitoring
Storage: 2x 50Gi retain PVCs
Components:
- Query: Global query interface
- Store Gateway: Access to object storage
- Compactor: Downsampling and retention
- Ruler: Recording and alerting rules
Features:
- Unlimited retention via object storage
- Query deduplication
- Downsampling for efficiency
- Multi-cluster federation
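The Store Gateway and Compactor locate blocks through a shared object storage configuration. A minimal S3-style sketch (bucket name, endpoint, and secret references are placeholders, not Charon's actual values):

```yaml
# objstore.yml (illustrative; bucket and endpoint are placeholders)
type: S3
config:
  bucket: thanos-metrics
  endpoint: s3.example.com
  access_key: <from-secret>
  secret_key: <from-secret>
```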
### Promtail

Purpose: Log collection agent
Namespace: monitoring
Type: DaemonSet (runs on all nodes)
Features:
- Automatic Kubernetes pod discovery
- Label extraction from metadata
- Log pipeline processing
- Direct shipping to Loki
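The pod discovery and label extraction above can be sketched as a Promtail scrape config (job name and label choices are illustrative, not the deployed configuration):

```yaml
# Illustrative Promtail scrape config
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Carry pod metadata over as Loki labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
```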
### Configuration

In `terraform.tfvars`:

```hcl
# Prometheus
prometheus_enabled = true

# Grafana
grafana_enabled                = true
grafana_hostname               = "grafana.example.com"
grafana_admin_password         = "your-secure-password"
grafana_dashboards_git_enabled = true
grafana_dashboards_git_repo    = "https://github.com/your-org/grafana-dashboards.git"
grafana_dashboards_git_token   = "ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Loki
loki_enabled = true
loki_storage = "50Gi"

# Thanos
thanos_enabled              = true
thanos_compactor_storage    = "50Gi"
thanos_storegateway_storage = "50Gi"
```

Apply the configuration:

```bash
cd terraform
terraform apply
```

### Accessing Grafana

```bash
# Connect to VPN
tailscale up --login-server https://vpn.example.com

# Access Grafana
open https://grafana.example.com
```

Default credentials:

Username: admin
Password: <from terraform.tfvars>

### Querying Prometheus

```bash
# Port-forward to Prometheus
kubectl port-forward -n monitoring prometheus-0 9090:9090

# Access Prometheus UI
open http://localhost:9090
```

Example queries:

```promql
up{job="kubernetes-nodes"}
rate(container_cpu_usage_seconds_total[5m])
```

### Exploring Logs in Loki

- Open Grafana
- Navigate to Explore
- Select Loki datasource
- Use LogQL queries:

```logql
{namespace="core", pod="headscale-0"}
{job="systemd-journal"} |= "error"
rate({app="nginx"}[5m])
```
### Exploring Traces in Tempo

- Open Grafana
- Navigate to Explore
- Select Tempo datasource
- Search by:
  - Trace ID
  - Service name
  - Time range

### Querying Thanos

```bash
# Port-forward to Thanos Query
kubectl port-forward -n monitoring service/thanos-query 9090:9090

# Access Thanos UI
open http://localhost:9090
# Query across all time ranges
```

### Alerting

Create alert rules:
```yaml
# prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-alerts
  namespace: monitoring
data:
  alerts.yaml: |
    groups:
      - name: kubernetes
        rules:
          - alert: NodeDown
            expr: up{job="kubernetes-nodes"} == 0
            for: 5m
            annotations:
              summary: "Node {{ $labels.node }} is down"
```

For Grafana-managed alerts:

- Create the alert in the Grafana UI
- Set notification channels (email, Slack, etc.)
- Configure alert rules based on queries
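Fired Prometheus alerts are routed by Alertmanager; a minimal routing sketch (the receiver name and Slack webhook URL are placeholders, not the actual Charon setup):

```yaml
# alertmanager.yml (illustrative)
route:
  receiver: default
  group_by: [alertname]
receivers:
  - name: default
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX
        channel: "#alerts"
```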
### Backing Up Grafana Dashboards

```bash
# Export all dashboards
kubectl exec -n monitoring grafana-0 -c grafana -- \
  grafana-cli admin export-dashboard-json /var/lib/grafana/dashboards/

# Copy to local
kubectl cp monitoring/grafana-0:/var/lib/grafana/dashboards ./grafana-backup
```

### Backing Up Prometheus Data

```bash
# Create snapshot (requires the admin API to be enabled)
kubectl exec -n monitoring prometheus-0 -- \
  curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot

# Copy snapshot
kubectl cp monitoring/prometheus-0:/prometheus/snapshots ./prometheus-backup
```

### Troubleshooting

Prometheus not scraping targets:

```bash
# Check targets
kubectl port-forward -n monitoring prometheus-0 9090:9090
# Visit http://localhost:9090/targets

# Check service discovery
kubectl logs -n monitoring prometheus-0 | grep discovery
```

Grafana datasource errors:

```bash
# Test datasource connectivity
kubectl exec -n monitoring grafana-0 -c grafana -- \
  curl "http://prometheus.monitoring.svc.cluster.local:9090/api/v1/query?query=up"

# Check logs
kubectl logs -n monitoring grafana-0 -c grafana
```

Loki not receiving logs:

```bash
# Check Promtail
kubectl logs -n monitoring -l app=promtail

# Verify Loki is running
kubectl logs -n monitoring loki-0

# Test Loki API
kubectl port-forward -n monitoring loki-0 3100:3100
curl http://localhost:3100/ready
```

Tempo not receiving traces:

```bash
# Check OTLP receiver
kubectl logs -n monitoring tempo-0 | grep otlp

# Verify the service is receiving traces
kubectl port-forward -n monitoring tempo-0 3200:3200
curl http://localhost:3200/status
```

### Performance Tuning

Prometheus:

```yaml
# Adjust retention and storage
prometheus_retention: "30d"
prometheus_storage: "100Gi"

# Tune scrape intervals
scrape_interval: "30s"  # Default 15s
```

Grafana:

```yaml
# Increase query timeout
dataproxy_timeout: "300"

# Cache configuration
caching_enabled: true
cache_ttl: "300"
```

Loki:

```yaml
# Adjust chunk storage
chunk_idle_period: "30m"
max_chunk_age: "1h"

# Tune ingestion
ingestion_rate_limit: "10MB"
ingestion_burst_size: "20MB"
```

### Related Documentation

- Monitoring Guide - Operations monitoring guide
- Grafana Service - Grafana-specific configuration
- Architecture Overview - System architecture
- Prometheus Documentation
- Grafana Documentation
- Loki Documentation
- Tempo Documentation
- Thanos Documentation
Navigation: Documentation Index | Services | Home