
Monitoring Guide

Comprehensive guide to Charon's observability stack: Prometheus, Grafana, Thanos, Loki, and AlertManager.

Overview

Charon's monitoring infrastructure provides:

  • Prometheus - Real-time metrics collection (15-day retention)
  • Thanos - Long-term metrics storage (30/90/180-day retention with downsampling)
  • Grafana - Visualization and dashboarding with git-sync provisioning
  • Loki - Log aggregation (ephemeral emptyDir storage)
  • Promtail - Log collection from all pods
  • AlertManager - Alert routing and notification

All components are deployed in the monitoring namespace.

Architecture

┌─────────────────────────────────────────────────────────┐
│                 Monitoring Namespace                    │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌──────────────┐        ┌─────────────────┐            │
│  │  Prometheus  │───────▶│  Thanos Query   │            │
│  │  (15d local) │        │  (global view)  │            │
│  └──────────────┘        └─────────────────┘            │
│         │                         │                     │
│         │                         ▼                     │
│         │                ┌─────────────────┐            │
│         │                │ Thanos Store GW │            │
│         │                │ (long-term TSDB)│            │
│         │                └─────────────────┘            │
│         │                         │                     │
│         ▼                         ▼                     │
│  ┌───────────────────────────────────┐                  │
│  │            Grafana                │                  │
│  │   (dashboards via git-sync)       │                  │
│  └───────────────────────────────────┘                  │
│                                                         │
│  ┌──────────────┐        ┌─────────────────┐            │
│  │  Promtail    │───────▶│      Loki       │            │
│  │  (DaemonSet) │        │  (emptyDir tmp) │            │
│  └──────────────┘        └─────────────────┘            │
│                                 │                       │
│                                 ▼                       │
│                          Grafana (logs)                 │
│                                                         │
│  ┌──────────────┐                                       │
│  │ AlertManager │                                       │
│  │ (alerting)   │                                       │
│  └──────────────┘                                       │
│                                                         │
└─────────────────────────────────────────────────────────┘

Components

Prometheus

Purpose: Real-time metrics collection and short-term storage

Configuration:

# terraform.tfvars
prometheus_enabled = true
prometheus_version = "25.8.0"  # Helm chart version
prometheus_storage = "50Gi"    # Local retention

Features:

  • 15-day local retention
  • 30-second scrape interval
  • AlertManager integration
  • ServiceMonitor auto-discovery
  • Node exporter (system metrics)
  • Kube-state-metrics (K8s objects)
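
ServiceMonitor auto-discovery lets any service that exposes a metrics endpoint be scraped declaratively. A minimal sketch, assuming a hypothetical Headscale service labeled app: headscale with a port named metrics:

```yaml
# Illustrative ServiceMonitor; metadata, labels, and port name are assumptions
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: headscale
  namespace: core
spec:
  selector:
    matchLabels:
      app: headscale
  endpoints:
    - port: metrics
      interval: 30s   # matches the 30-second scrape interval above
```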

Access:

# Via kubectl port-forward
kubectl port-forward -n monitoring svc/prometheus-server 9090:80

# Open in browser
open http://localhost:9090

Thanos

Purpose: Long-term metrics storage with downsampling

Configuration:

# terraform.tfvars
thanos_enabled = true
thanos_version = "15.7.27"  # Helm chart version
thanos_compactor_storage   = "50Gi"
thanos_storegateway_storage = "50Gi"

Components:

  1. Query - Global query interface (Grafana uses this)
  2. Query Frontend - Query caching and splitting
  3. Store Gateway - Serves historical data from object storage
  4. Compactor - Compacts and downsamples metrics

Retention Policy:

  • Raw metrics: 30 days
  • 5-minute downsampled: 90 days
  • 1-hour downsampled: 180 days
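
These tiers map to compactor retention settings in the Helm values. A sketch, assuming the Bitnami Thanos chart's value names; verify against terraform/thanos.tf before relying on them:

```yaml
# Sketch of Helm values matching the retention tiers above (Bitnami-style names)
compactor:
  enabled: true
  retentionResolutionRaw: 30d   # raw metrics
  retentionResolution5m: 90d    # 5-minute downsampled
  retentionResolution1h: 180d   # 1-hour downsampled
```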

Storage:

  • Type: Filesystem (local PVCs)
  • Compactor: 50Gi retain PVC
  • Store Gateway: 50Gi retain PVC

IMPORTANT: Thanos is currently disabled by default (thanos_enabled = false); the configuration above shows how to enable it. While it is disabled, dashboards query Prometheus directly.

Grafana

Purpose: Visualization and dashboarding

Configuration:

# terraform.tfvars
grafana_enabled           = true
grafana_version           = "8.5.0"
grafana_hostname          = "grafana.example.com"
grafana_admin_password    = "secure-password"

# Git dashboard provisioning
grafana_dashboards_git_enabled = true
grafana_dashboards_git_repo    = "https://github.com/org/dashboards"
grafana_dashboards_git_branch  = "main"
grafana_dashboards_git_token   = "github-token"

Access:

# Via VPN
open https://grafana.example.com

# Default login
# Username: admin
# Password: <grafana_admin_password>

Data Sources:

When Thanos is enabled:

Thanos:
  URL: http://thanos-query.monitoring.svc.cluster.local:9090
  Type: Prometheus
  Access: Server (default)

When Thanos is disabled (current):

Prometheus:
  URL: http://prometheus-server.monitoring.svc.cluster.local
  Type: Prometheus
  Access: Server (default)

Loki (always):

Loki:
  URL: http://loki.monitoring.svc.cluster.local:3100
  Type: Loki
  Access: Server (default)

Loki

Purpose: Log aggregation and querying

Configuration:

# terraform.tfvars
loki_enabled = true
loki_version = "6.16.0"  # Helm chart version

Storage:

  • Type: emptyDir (ephemeral)
  • Retention: Until pod restart
  • Purpose: Short-term log analysis

Features:

  • Single binary deployment
  • Filesystem storage (no S3)
  • TSDB schema (v13)
  • 24-hour index period
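
The schema settings above correspond to a Loki configuration along these lines. A sketch only; the from date is illustrative and the real values live in the Helm chart:

```yaml
# Illustrative schema_config for single-binary Loki on filesystem storage
schema_config:
  configs:
    - from: "2024-01-01"       # illustrative schema start date
      store: tsdb              # TSDB index store
      object_store: filesystem # no S3
      schema: v13
      index:
        prefix: index_
        period: 24h            # 24-hour index period
```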

Promtail

Purpose: Log collection from all pods

Deployment: DaemonSet on all nodes

Configuration:

  • Automatically discovers pods
  • Forwards to Loki at http://loki:3100
  • Adds Kubernetes metadata labels
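
That behavior corresponds roughly to a Promtail config like the following. A sketch; the relabeling rules shown are illustrative, not copied from the deployed chart:

```yaml
# Minimal Promtail sketch: push to Loki, discover pods, attach K8s labels
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
```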

AlertManager

Purpose: Alert routing and notifications

Configuration:

  • Deployed with Prometheus Helm chart
  • 10Gi retain PVC for alert state
  • Web UI on port 9093
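
Routing is defined in the AlertManager config shipped with the Prometheus chart. A hypothetical routing tree; the receiver names and timings here are assumptions, not the deployed values:

```yaml
# Hypothetical AlertManager route; receiver and intervals are illustrative
route:
  receiver: default
  group_by: [alertname, namespace]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
receivers:
  - name: default
    # webhook/email/Slack settings would go here
```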

Alerting Rules:

Infrastructure alerts:

  • PrometheusDown (critical)
  • GrafanaDown (warning)
  • LokiDown (warning)
  • HeadscaleDown (critical)

Resource alerts:

  • HighMemoryUsage (>85% for 10m)
  • HighDiskUsage (>85% for 10m)
  • PersistentVolumeSpaceRunningOut (<15%)

Certificate alerts:

  • CertificateExpiryWarning (<30 days)
  • CertificateExpiryCritical (<7 days)
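
As one example, HighMemoryUsage could be expressed as a Prometheus rule like this. The exact expression in terraform/prometheus.tf may differ; this is a sketch using standard node-exporter metrics:

```yaml
groups:
  - name: resource-alerts
    rules:
      - alert: HighMemoryUsage
        # fires once memory usage stays above 85% for 10 minutes
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 85% on {{ $labels.instance }}"
```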

Common Operations

Check Component Status

# All monitoring pods
kubectl get pods -n monitoring

# Check Prometheus
kubectl logs -n monitoring deployment/prometheus-server

# Check Thanos query (if enabled)
kubectl logs -n monitoring deployment/thanos-query

# Check Loki
kubectl logs -n monitoring deployment/loki

# Check Promtail
kubectl logs -n monitoring daemonset/promtail -f

Access UIs

Prometheus:

kubectl port-forward -n monitoring svc/prometheus-server 9090:80
open http://localhost:9090

Thanos Query (if enabled):

kubectl port-forward -n monitoring svc/thanos-query 9090:9090
open http://localhost:9090

AlertManager:

kubectl port-forward -n monitoring svc/prometheus-alertmanager 9093:9093
open http://localhost:9093

Grafana:

# Via VPN only
open https://grafana.example.com

Query Metrics

PromQL Examples:

# CPU usage by pod
rate(container_cpu_usage_seconds_total[5m])

# Memory usage by namespace
sum(container_memory_working_set_bytes) by (namespace)

# Network IO by pod
rate(container_network_transmit_bytes_total[5m])

# Disk usage
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100

# Service uptime
up{job="headscale"}

View Logs

Via Grafana Explore:

  1. Open Grafana
  2. Click Explore (compass icon)
  3. Select Loki data source
  4. Query examples:
# All logs from namespace
{namespace="core"}

# Specific pod logs
{pod="headscale-0"}

# Error logs across cluster
{namespace=~".+"} |= "error"

# Grafana access logs
{app="grafana"} | json | line_format "{{.status}} {{.method}} {{.path}}"

Configure Alerts

Edit Prometheus rules:

# View current rules
kubectl get configmap -n monitoring prometheus-server -o yaml

# Edit via Terraform
# Update terraform/prometheus.tf serverFiles section
# Apply changes
cd terraform && terraform apply

Backup Grafana Dashboards

Via Git-Sync (Recommended): Dashboards are automatically synced from the Git repository every 60 seconds.

Manual Backup via API:

# Export all dashboards
curl -u admin:password https://grafana.example.com/api/search | \
  jq -r '.[] | select(.type=="dash-db") | .uid' | \
  while read uid; do
    curl -u admin:password "https://grafana.example.com/api/dashboards/uid/$uid" | \
      jq '.dashboard' > "dashboard-$uid.json"
  done

Troubleshooting

Grafana Dashboards Show "No Data"

This is a known issue observed after the Thanos deployment.

Diagnosis:

# Check Grafana data source configuration
kubectl exec -n monitoring grafana-0 -c grafana -- \
  cat /etc/grafana/provisioning/datasources/datasources.yaml

# Check if Thanos is enabled
kubectl get pods -n monitoring | grep thanos

# Test Prometheus endpoint
kubectl exec -n monitoring grafana-0 -c grafana -- \
  curl http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=up

Solution:

If Thanos is disabled (current state): Grafana data source should point to Prometheus:

datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus-server.monitoring.svc.cluster.local

If Thanos is enabled: Grafana data source should point to Thanos Query:

datasources:
  - name: Thanos
    type: prometheus
    url: http://thanos-query.monitoring.svc.cluster.local:9090

Dashboard queries:

  • If using Prometheus: Remove any thanos_ prefixes from queries
  • If using Thanos: Queries work the same as Prometheus (PromQL compatible)

Prometheus Not Scraping Targets

# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-server 9090:80
# Open http://localhost:9090/targets

# Check ServiceMonitor CRDs
kubectl get servicemonitor -A

# Verify pod labels match ServiceMonitor selectors
kubectl get pods -n core headscale-0 --show-labels

Loki Not Receiving Logs

# Check Promtail pods
kubectl get pods -n monitoring -l app.kubernetes.io/name=promtail

# Check Promtail logs
kubectl logs -n monitoring daemonset/promtail -f

# Test Loki endpoint
kubectl exec -n monitoring deployment/loki -- \
  curl http://localhost:3100/ready

# Check Loki logs
kubectl logs -n monitoring deployment/loki -f

AlertManager Not Sending Alerts

# Check AlertManager status
kubectl logs -n monitoring deployment/prometheus-alertmanager

# View active alerts
kubectl port-forward -n monitoring svc/prometheus-alertmanager 9093:9093
# Open http://localhost:9093

# Check Prometheus alert rules
kubectl port-forward -n monitoring svc/prometheus-server 9090:80
# Open http://localhost:9090/alerts

Thanos Compactor Issues (if enabled)

# Check compactor logs
kubectl logs -n monitoring statefulset/thanos-compactor

# Check compactor PVC
kubectl get pvc -n monitoring thanos-compactor-data-thanos-compactor-0

# Verify object storage access
kubectl exec -n monitoring thanos-compactor-0 -- \
  ls -la /data/thanos

High Memory Usage

Prometheus:

# Check current usage
kubectl top pods -n monitoring | grep prometheus-server

# Increase resources in terraform.tfvars
# prometheus_memory_limit = "8Gi"

# Apply changes
cd terraform && terraform apply

Thanos:

# Check Thanos Query memory
kubectl top pods -n monitoring | grep thanos-query

# Increase in terraform.tfvars
# thanos_query_memory_limit = "2Gi"

Metrics Reference

Available Exporters

| Exporter           | Metrics                                     | Port | Notes                               |
|--------------------|---------------------------------------------|------|-------------------------------------|
| node-exporter      | System metrics (CPU, memory, disk, network) | 9100 | DaemonSet on all nodes              |
| kube-state-metrics | Kubernetes object metrics                   | 8080 | Deployment counts, pod status, etc. |
| Headscale          | VPN metrics                                 | 9090 | Connected nodes, traffic stats      |
| Grafana            | Dashboard metrics                           | 3000 | User sessions, query performance    |
| Loki               | Log ingestion metrics                       | 3100 | Log rates, storage usage            |

Common Metric Queries

Cluster Health:

# Node status
up{job="kubernetes-nodes"}

# Pod restart count
kube_pod_container_status_restarts_total

# Persistent volume usage
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes

Application Metrics:

# Headscale connected nodes
headscale_registered_nodes

# Grafana active users
grafana_stat_active_users

# Loki ingestion rate
rate(loki_distributor_bytes_received_total[5m])

Resource Utilization:

# CPU usage by namespace
sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)

# Memory usage by pod
container_memory_working_set_bytes{pod!=""}

# Network bandwidth
rate(container_network_transmit_bytes_total[5m])

Retention Policies

| Component    | Data Type   | Retention  | Storage                  |
|--------------|-------------|------------|--------------------------|
| Prometheus   | Raw metrics | 15 days    | 50Gi block storage       |
| Thanos (raw) | Raw metrics | 30 days    | Compactor PVC            |
| Thanos (5m)  | Downsampled | 90 days    | Compactor PVC            |
| Thanos (1h)  | Downsampled | 180 days   | Compactor PVC            |
| Loki         | Logs        | Ephemeral  | emptyDir (until restart) |
| AlertManager | Alert state | Persistent | 10Gi retain PVC          |

Performance Tuning

Reduce Prometheus Storage

# Decrease retention
# Edit terraform/prometheus.tf
server.retention = "7d"

# Reduce scrape frequency
server.global.scrape_interval = "60s"

Optimize Loki Performance

# Add persistence (if needed)
loki_enabled = true

# Increase resources
loki_memory_limit = "2Gi"
loki_cpu_limit    = "1000m"

Scale Thanos Query (if enabled)

# Increase replicas via Helm values
# Edit terraform/thanos.tf
query.replicas = 2

# Apply
terraform apply

Dashboard Management

Git-Sync Provisioning

Grafana automatically syncs dashboards from Git repository.

Repository Structure:

grafana-dashboards/
├── infrastructure/
│   ├── kubernetes-cluster.json
│   └── node-metrics.json
├── services/
│   ├── headscale.json
│   └── grafana.json
└── README.md

Update Dashboards:

  1. Commit JSON to Git repository
  2. Wait up to 60 seconds for sync
  3. Refresh Grafana UI
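
On the Grafana side, git-synced dashboards are typically loaded through a file-based dashboard provider. A sketch of that provisioning config, assuming git-sync writes into /var/lib/grafana/dashboards; the provider name is illustrative:

```yaml
# Sketch of a Grafana dashboard provider for the git-synced directory
apiVersion: 1
providers:
  - name: git-dashboards          # illustrative provider name
    type: file
    updateIntervalSeconds: 60     # matches the 60-second sync cadence
    allowUiUpdates: false         # Git is the source of truth
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
```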

Check Sync Status:

# View git-sync logs
kubectl logs -n monitoring grafana-0 -c git-sync -f

# Verify dashboards directory
kubectl exec -n monitoring grafana-0 -c grafana -- \
  ls -la /var/lib/grafana/dashboards

Security

  • VPN-only access - All UIs accessible only via Tailscale VPN
  • TLS everywhere - Grafana uses cert-manager certificates
  • LDAP authentication - Grafana integrates with FreeIPA
  • RBAC - Prometheus has cluster-wide read permissions
  • Secrets management - Passwords stored as Kubernetes secrets
