
Monitoring Guide

Comprehensive guide to Charon's observability stack: Prometheus, Grafana, Thanos, Loki, and AlertManager.

Overview

Charon's monitoring infrastructure provides:

  • Prometheus - Real-time metrics collection (15-day retention)
  • Thanos - Long-term metrics storage (30/90/180-day retention with downsampling)
  • Grafana - Visualization and dashboarding with git-sync provisioning
  • Loki - Log aggregation (ephemeral emptyDir storage)
  • Promtail - Log collection from all pods
  • AlertManager - Alert routing and notification

All components are deployed in the monitoring namespace.

Architecture

┌─────────────────────────────────────────────────────────┐
│                 Monitoring Namespace                    │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌──────────────┐        ┌─────────────────┐            │
│  │  Prometheus  │───────▶│  Thanos Query   │            │
│  │  (15d local) │        │  (global view)  │            │
│  └──────────────┘        └─────────────────┘            │
│         │                         │                     │
│         │                         ▼                     │
│         │                ┌─────────────────┐            │
│         │                │ Thanos Store GW │            │
│         │                │ (long-term TSDB)│            │
│         │                └─────────────────┘            │
│         │                         │                     │
│         ▼                         ▼                     │
│  ┌───────────────────────────────────┐                  │
│  │            Grafana                │                  │
│  │   (dashboards via git-sync)       │                  │
│  └───────────────────────────────────┘                  │
│                                                         │
│  ┌──────────────┐        ┌─────────────────┐            │
│  │  Promtail    │───────▶│      Loki       │            │
│  │  (DaemonSet) │        │  (emptyDir tmp) │            │
│  └──────────────┘        └─────────────────┘            │
│                                 │                       │
│                                 ▼                       │
│                          Grafana (logs)                 │
│                                                         │
│  ┌──────────────┐                                       │
│  │ AlertManager │                                       │
│  │ (alerting)   │                                       │
│  └──────────────┘                                       │
│                                                         │
└─────────────────────────────────────────────────────────┘

Components

Prometheus

Purpose: Real-time metrics collection and short-term storage

Configuration:

# terraform.tfvars
prometheus_enabled = true
prometheus_version = "25.8.0"  # Helm chart version
prometheus_storage = "50Gi"    # Local retention

Features:

  • 15-day local retention
  • 30-second scrape interval
  • AlertManager integration
  • ServiceMonitor auto-discovery
  • Node exporter (system metrics)
  • Kube-state-metrics (K8s objects)
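
ServiceMonitor auto-discovery lets any service that exposes a metrics endpoint be scraped declaratively. A minimal sketch, assuming a hypothetical Headscale service labeled app: headscale with a port named metrics:

```yaml
# Illustrative ServiceMonitor; metadata, labels, and port name are assumptions
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: headscale
  namespace: core
spec:
  selector:
    matchLabels:
      app: headscale
  endpoints:
    - port: metrics
      interval: 30s   # matches the 30-second scrape interval above
```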

Access:

# Via kubectl port-forward
kubectl port-forward -n monitoring svc/prometheus-server 9090:80

# Open in browser
open http://localhost:9090

Thanos

Purpose: Long-term metrics storage with downsampling

Configuration:

# terraform.tfvars
thanos_enabled = true
thanos_version = "15.7.27"  # Helm chart version
thanos_compactor_storage   = "50Gi"
thanos_storegateway_storage = "50Gi"

Components:

  1. Query - Global query interface (Grafana uses this)
  2. Query Frontend - Query caching and splitting
  3. Store Gateway - Serves historical data from object storage
  4. Compactor - Compacts and downsamples metrics

Retention Policy:

  • Raw metrics: 30 days
  • 5-minute downsampled: 90 days
  • 1-hour downsampled: 180 days
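
These tiers map to compactor retention settings in the Helm values. A sketch, assuming the Bitnami Thanos chart's value names; verify against terraform/thanos.tf before relying on them:

```yaml
# Sketch of Helm values matching the retention tiers above (Bitnami-style names)
compactor:
  enabled: true
  retentionResolutionRaw: 30d   # raw metrics
  retentionResolution5m: 90d    # 5-minute downsampled
  retentionResolution1h: 180d   # 1-hour downsampled
```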

Storage:

  • Type: Filesystem (local PVCs)
  • Compactor: 50Gi retain PVC
  • Store Gateway: 50Gi retain PVC

IMPORTANT: Thanos is currently disabled by default (thanos_enabled = false); the configuration above shows how to enable it. While it is disabled, dashboards query Prometheus directly.

Grafana

Purpose: Visualization and dashboarding

Configuration:

# terraform.tfvars
grafana_enabled           = true
grafana_version           = "8.5.0"
grafana_hostname          = "grafana.example.com"
grafana_admin_password    = "secure-password"

# Git dashboard provisioning
grafana_dashboards_git_enabled = true
grafana_dashboards_git_repo    = "https://github.com/org/dashboards"
grafana_dashboards_git_branch  = "main"
grafana_dashboards_git_token   = "github-token"

Access:

# Via VPN
open https://grafana.example.com

# Default login
# Username: admin
# Password: <grafana_admin_password>

Data Sources:

When Thanos is enabled:

Thanos:
  URL: http://thanos-query.monitoring.svc.cluster.local:9090
  Type: Prometheus
  Access: Server (default)

When Thanos is disabled (current):

Prometheus:
  URL: http://prometheus-server.monitoring.svc.cluster.local
  Type: Prometheus
  Access: Server (default)

Loki (always):

Loki:
  URL: http://loki.monitoring.svc.cluster.local:3100
  Type: Loki
  Access: Server (default)

Loki

Purpose: Log aggregation and querying

Configuration:

# terraform.tfvars
loki_enabled = true
loki_version = "6.16.0"  # Helm chart version

Storage:

  • Type: emptyDir (ephemeral)
  • Retention: Until pod restart
  • Purpose: Short-term log analysis

Features:

  • Single binary deployment
  • Filesystem storage (no S3)
  • TSDB schema (v13)
  • 24-hour index period
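
The schema settings above correspond to a Loki configuration along these lines. A sketch only; the from date is illustrative and the real values live in the Helm chart:

```yaml
# Illustrative schema_config for single-binary Loki on filesystem storage
schema_config:
  configs:
    - from: "2024-01-01"       # illustrative schema start date
      store: tsdb              # TSDB index store
      object_store: filesystem # no S3
      schema: v13
      index:
        prefix: index_
        period: 24h            # 24-hour index period
```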

Promtail

Purpose: Log collection from all pods

Deployment: DaemonSet on all nodes

Configuration:

  • Automatically discovers pods
  • Forwards to Loki at http://loki:3100
  • Adds Kubernetes metadata labels
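
That behavior corresponds roughly to a Promtail config like the following. A sketch; the relabeling rules shown are illustrative, not copied from the deployed chart:

```yaml
# Minimal Promtail sketch: push to Loki, discover pods, attach K8s labels
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
```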

AlertManager

Purpose: Alert routing and notifications

Configuration:

  • Deployed with Prometheus Helm chart
  • 10Gi retain PVC for alert state
  • Web UI on port 9093
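
Routing is defined in the AlertManager config shipped with the Prometheus chart. A hypothetical routing tree; the receiver names and timings here are assumptions, not the deployed values:

```yaml
# Hypothetical AlertManager route; receiver and intervals are illustrative
route:
  receiver: default
  group_by: [alertname, namespace]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
receivers:
  - name: default
    # webhook/email/Slack settings would go here
```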

Alerting Rules:

Infrastructure alerts:

  • PrometheusDown (critical)
  • GrafanaDown (warning)
  • LokiDown (warning)
  • HeadscaleDown (critical)

Resource alerts:

  • HighMemoryUsage (>85% for 10m)
  • HighDiskUsage (>85% for 10m)
  • PersistentVolumeSpaceRunningOut (<15%)

Certificate alerts:

  • CertificateExpiryWarning (<30 days)
  • CertificateExpiryCritical (<7 days)
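
As one example, HighMemoryUsage could be expressed as a Prometheus rule like this. The exact expression in terraform/prometheus.tf may differ; this is a sketch using standard node-exporter metrics:

```yaml
groups:
  - name: resource-alerts
    rules:
      - alert: HighMemoryUsage
        # fires once memory usage stays above 85% for 10 minutes
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 85% on {{ $labels.instance }}"
```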

Common Operations

Check Component Status

# All monitoring pods
kubectl get pods -n monitoring

# Check Prometheus
kubectl logs -n monitoring deployment/prometheus-server

# Check Thanos query (if enabled)
kubectl logs -n monitoring deployment/thanos-query

# Check Loki
kubectl logs -n monitoring deployment/loki

# Check Promtail
kubectl logs -n monitoring daemonset/promtail -f

Access UIs

Prometheus:

kubectl port-forward -n monitoring svc/prometheus-server 9090:80
open http://localhost:9090

Thanos Query (if enabled):

kubectl port-forward -n monitoring svc/thanos-query 9090:9090
open http://localhost:9090

AlertManager:

kubectl port-forward -n monitoring svc/prometheus-alertmanager 9093:9093
open http://localhost:9093

Grafana:

# Via VPN only
open https://grafana.example.com

Query Metrics

PromQL Examples:

# CPU usage by pod
rate(container_cpu_usage_seconds_total[5m])

# Memory usage by namespace
sum(container_memory_working_set_bytes) by (namespace)

# Network IO by pod
rate(container_network_transmit_bytes_total[5m])

# Disk usage
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100

# Service uptime
up{job="headscale"}

View Logs

Via Grafana Explore:

  1. Open Grafana
  2. Click Explore (compass icon)
  3. Select Loki data source
  4. Query examples:
# All logs from namespace
{namespace="core"}

# Specific pod logs
{pod="headscale-0"}

# Error logs across cluster
{namespace=~".+"} |= "error"

# Grafana access logs
{app="grafana"} | json | line_format "{{.status}} {{.method}} {{.path}}"

Configure Alerts

Edit Prometheus rules:

# View current rules
kubectl get configmap -n monitoring prometheus-server -o yaml

# Edit via Terraform
# Update terraform/prometheus.tf serverFiles section
# Apply changes
cd terraform && terraform apply

Backup Grafana Dashboards

Via Git-Sync (Recommended): Dashboards are automatically synced from the Git repository every 60 seconds.

Manual Backup via API:

# Export all dashboards
curl -u admin:password https://grafana.example.com/api/search | \
  jq -r '.[] | select(.type=="dash-db") | .uid' | \
  while read uid; do
    curl -u admin:password "https://grafana.example.com/api/dashboards/uid/$uid" | \
      jq '.dashboard' > "dashboard-$uid.json"
  done

Troubleshooting

Grafana Dashboards Show "No Data"

This is a known issue observed after the Thanos deployment.

Diagnosis:

# Check Grafana data source configuration
kubectl exec -n monitoring grafana-0 -c grafana -- \
  cat /etc/grafana/provisioning/datasources/datasources.yaml

# Check if Thanos is enabled
kubectl get pods -n monitoring | grep thanos

# Test Prometheus endpoint
kubectl exec -n monitoring grafana-0 -c grafana -- \
  curl http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=up

Solution:

If Thanos is disabled (current state): Grafana data source should point to Prometheus:

datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus-server.monitoring.svc.cluster.local

If Thanos is enabled: Grafana data source should point to Thanos Query:

datasources:
  - name: Thanos
    type: prometheus
    url: http://thanos-query.monitoring.svc.cluster.local:9090

Dashboard queries:

  • If using Prometheus: Remove any thanos_ prefixes from queries
  • If using Thanos: Queries work the same as Prometheus (PromQL compatible)

Prometheus Not Scraping Targets

# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-server 9090:80
# Open http://localhost:9090/targets

# Check ServiceMonitor CRDs
kubectl get servicemonitor -A

# Verify pod labels match ServiceMonitor selectors
kubectl get pods -n core headscale-0 --show-labels

Loki Not Receiving Logs

# Check Promtail pods
kubectl get pods -n monitoring -l app.kubernetes.io/name=promtail

# Check Promtail logs
kubectl logs -n monitoring daemonset/promtail -f

# Test Loki endpoint
kubectl exec -n monitoring deployment/loki -- \
  curl http://localhost:3100/ready

# Check Loki logs
kubectl logs -n monitoring deployment/loki -f

AlertManager Not Sending Alerts

# Check AlertManager status
kubectl logs -n monitoring deployment/prometheus-alertmanager

# View active alerts
kubectl port-forward -n monitoring svc/prometheus-alertmanager 9093:9093
# Open http://localhost:9093

# Check Prometheus alert rules
kubectl port-forward -n monitoring svc/prometheus-server 9090:80
# Open http://localhost:9090/alerts

Thanos Compactor Issues (if enabled)

# Check compactor logs
kubectl logs -n monitoring statefulset/thanos-compactor

# Check compactor PVC
kubectl get pvc -n monitoring thanos-compactor-data-thanos-compactor-0

# Verify object storage access
kubectl exec -n monitoring thanos-compactor-0 -- \
  ls -la /data/thanos

High Memory Usage

Prometheus:

# Check current usage
kubectl top pods -n monitoring | grep prometheus-server

# Increase resources in terraform.tfvars
# prometheus_memory_limit = "8Gi"

# Apply changes
cd terraform && terraform apply

Thanos:

# Check Thanos Query memory
kubectl top pods -n monitoring | grep thanos-query

# Increase in terraform.tfvars
# thanos_query_memory_limit = "2Gi"

Metrics Reference

Available Exporters

| Exporter           | Metrics                                     | Port | Notes                               |
|--------------------|---------------------------------------------|------|-------------------------------------|
| node-exporter      | System metrics (CPU, memory, disk, network) | 9100 | DaemonSet on all nodes              |
| kube-state-metrics | Kubernetes object metrics                   | 8080 | Deployment counts, pod status, etc. |
| Headscale          | VPN metrics                                 | 9090 | Connected nodes, traffic stats      |
| Grafana            | Dashboard metrics                           | 3000 | User sessions, query performance    |
| Loki               | Log ingestion metrics                       | 3100 | Log rates, storage usage            |

Common Metric Queries

Cluster Health:

# Node status
up{job="kubernetes-nodes"}

# Pod restart count
kube_pod_container_status_restarts_total

# Persistent volume usage
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes

Application Metrics:

# Headscale connected nodes
headscale_registered_nodes

# Grafana active users
grafana_stat_active_users

# Loki ingestion rate
rate(loki_distributor_bytes_received_total[5m])

Resource Utilization:

# CPU usage by namespace
sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)

# Memory usage by pod
container_memory_working_set_bytes{pod!=""}

# Network bandwidth
rate(container_network_transmit_bytes_total[5m])

Retention Policies

| Component    | Data Type   | Retention  | Storage                  |
|--------------|-------------|------------|--------------------------|
| Prometheus   | Raw metrics | 15 days    | 50Gi block storage       |
| Thanos (raw) | Raw metrics | 30 days    | Compactor PVC            |
| Thanos (5m)  | Downsampled | 90 days    | Compactor PVC            |
| Thanos (1h)  | Downsampled | 180 days   | Compactor PVC            |
| Loki         | Logs        | Ephemeral  | emptyDir (until restart) |
| AlertManager | Alert state | Persistent | 10Gi retain PVC          |

Performance Tuning

Reduce Prometheus Storage

# Decrease retention
# Edit terraform/prometheus.tf
server.retention = "7d"

# Reduce scrape frequency
server.global.scrape_interval = "60s"

Optimize Loki Performance

# Add persistence (if needed)
loki_enabled = true

# Increase resources
loki_memory_limit = "2Gi"
loki_cpu_limit    = "1000m"

Scale Thanos Query (if enabled)

# Increase replicas via Helm values
# Edit terraform/thanos.tf
query.replicas = 2

# Apply
terraform apply

Dashboard Management

Git-Sync Provisioning

Grafana automatically syncs dashboards from Git repository.

Repository Structure:

grafana-dashboards/
├── infrastructure/
│   ├── kubernetes-cluster.json
│   └── node-metrics.json
├── services/
│   ├── headscale.json
│   └── grafana.json
└── README.md

Update Dashboards:

  1. Commit JSON to Git repository
  2. Wait up to 60 seconds for sync
  3. Refresh Grafana UI
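
On the Grafana side, git-synced dashboards are typically loaded through a file-based dashboard provider. A sketch of that provisioning config, assuming git-sync writes into /var/lib/grafana/dashboards; the provider name is illustrative:

```yaml
# Sketch of a Grafana dashboard provider for the git-synced directory
apiVersion: 1
providers:
  - name: git-dashboards          # illustrative provider name
    type: file
    updateIntervalSeconds: 60     # matches the 60-second sync cadence
    allowUiUpdates: false         # Git is the source of truth
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
```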

Check Sync Status:

# View git-sync logs
kubectl logs -n monitoring grafana-0 -c git-sync -f

# Verify dashboards directory
kubectl exec -n monitoring grafana-0 -c grafana -- \
  ls -la /var/lib/grafana/dashboards

Security

  • VPN-only access - All UIs accessible only via Tailscale VPN
  • TLS everywhere - Grafana uses cert-manager certificates
  • LDAP authentication - Grafana integrates with FreeIPA
  • RBAC - Prometheus has cluster-wide read permissions
  • Secrets management - Passwords stored as Kubernetes secrets
