
Monitoring Stack Services

Comprehensive observability platform with metrics, logs, traces, and long-term storage.

Overview

The monitoring stack in Charon provides complete observability:

  • Prometheus - Metrics collection and alerting
  • Grafana - Visualization and dashboards
  • Loki - Log aggregation
  • Tempo - Distributed tracing
  • Thanos - Long-term metrics storage
  • Promtail - Log collection agent

Architecture

┌──────────────────────────────────────────────────────┐
│                   Monitoring Stack                   │
├──────────────────────────────────────────────────────┤
│                                                      │
│  ┌──────────────┐         ┌──────────────┐           │
│  │  Prometheus  │────────▶│    Thanos    │           │
│  │  (metrics)   │         │ (long-term)  │           │
│  └──────────────┘         └──────────────┘           │
│          │                       │                   │
│          ▼                       ▼                   │
│  ┌──────────────────────────────────────┐            │
│  │               Grafana                │            │
│  │  ┌──────────┐  ┌──────┐  ┌──────┐    │            │
│  │  │Prometheus│  │ Loki │  │Tempo │    │            │
│  │  │DataSource│  │  DS  │  │  DS  │    │            │
│  │  └──────────┘  └──────┘  └──────┘    │            │
│  └──────────────────────────────────────┘            │
│       ▲             ▲             ▲                  │
│       │             │             │                  │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐            │
│  │   Apps   │  │ Promtail │  │   Apps   │            │
│  │ (metrics)│  │  (logs)  │  │ (traces) │            │
│  └──────────┘  └──────────┘  └──────────┘            │
└──────────────────────────────────────────────────────┘

Components

Prometheus

Purpose: Metrics collection and alerting
Namespace: monitoring
Port: 9090
Storage: 50Gi (retain)

Features:

  • Service discovery via Kubernetes API
  • 35+ scrape targets across 10 jobs
  • AlertManager integration
  • Federation support for Thanos

Configuration targets:

  • Kubernetes components (API server, kubelet, etcd)
  • Service endpoints (Headscale, FreeIPA, Grafana, etc.)
  • Node metrics via node-exporter
  • GPU metrics via DCGM exporter
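
The Kubernetes-based service discovery above is typically expressed as a scrape job like the following. This is an illustrative sketch, not this cluster's actual configuration; the job name, annotation convention, and labels are assumptions.

```yaml
# Illustrative Prometheus scrape job using Kubernetes service discovery.
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Carry the namespace through as a query label
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
```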

Grafana

Purpose: Visualization and dashboards
Namespace: monitoring
Port: 3000 (app), 443 (nginx-tls)
Storage: 10Gi for dashboards
Access: VPN-only via HTTPS

Features:

  • Pre-configured dashboards from Git repository
  • Tempo correlation for traces
  • LDAP authentication via FreeIPA
  • Multi-datasource support

Dashboards included:

  • Kubernetes cluster overview
  • Node exporter metrics
  • Headscale VPN statistics
  • Open-WebUI performance
  • Loki log exploration
  • Tempo trace analysis

Loki

Purpose: Log aggregation and querying
Namespace: monitoring
Port: 3100
Storage: 50Gi (emptyDir for short-term)

Features:

  • LogQL query language
  • Label-based log organization
  • Promtail automatic collection
  • Grafana integration

Log sources:

  • All pod logs via Promtail
  • System logs from nodes
  • Application-specific logs

Tempo

Purpose: Distributed tracing
Namespace: monitoring
Port: 3200 (HTTP), 4317 (gRPC OTLP)
Storage: 10Gi

Features:

  • OpenTelemetry support
  • Trace to logs correlation
  • Grafana native integration
  • Service dependency mapping

Integrated services:

  • Open-WebUI (OTLP traces)
  • Custom applications with OpenTelemetry SDK
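
An application instrumented with the OpenTelemetry SDK can be pointed at Tempo's OTLP gRPC port through the standard OTEL environment variables. The Tempo service DNS name below is an assumption based on the monitoring namespace used in this stack; adjust it to the actual service name.

```shell
# Standard OpenTelemetry SDK environment variables; the Tempo endpoint
# is an assumed in-cluster DNS name, not a verified service address.
export OTEL_EXPORTER_OTLP_ENDPOINT="http://tempo.monitoring.svc.cluster.local:4317"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
export OTEL_SERVICE_NAME="my-app"
```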

Thanos

Purpose: Long-term metrics storage and global view
Namespace: monitoring
Storage: 2x 50Gi retain PVCs

Components:

  • Query: Global query interface
  • Store Gateway: Access to object storage
  • Compactor: Downsampling and retention
  • Ruler: Recording and alerting rules

Features:

  • Unlimited retention via object storage
  • Query deduplication
  • Downsampling for efficiency
  • Multi-cluster federation
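
For orientation, a Thanos Query instance achieves the global view and deduplication above by fanning out to multiple StoreAPI endpoints. The invocation below is a sketch; the endpoint DNS names and the replica label are assumptions, not this cluster's actual deployment flags.

```shell
# Illustrative Thanos Query flags (endpoints and replica label are assumed).
thanos query \
  --http-address=0.0.0.0:9090 \
  --store=prometheus-0.prometheus.monitoring.svc.cluster.local:10901 \
  --store=thanos-storegateway.monitoring.svc.cluster.local:10901 \
  --query.replica-label=replica
```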

Promtail

Purpose: Log collection agent
Namespace: monitoring
Type: DaemonSet (runs on all nodes)

Features:

  • Automatic Kubernetes pod discovery
  • Label extraction from metadata
  • Log pipeline processing
  • Direct shipping to Loki
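
The discovery, label extraction, and pipeline features above come together in a Promtail scrape config along these lines. This is a minimal sketch with assumed job and label names, not the deployed configuration.

```yaml
# Sketch of a Promtail scrape config (job name and labels are illustrative).
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      - cri: {}   # parse the container runtime log format
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
```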

Configuration

Enable Monitoring Stack

In terraform.tfvars:

# Prometheus
prometheus_enabled = true

# Grafana
grafana_enabled                = true
grafana_hostname               = "grafana.example.com"
grafana_admin_password         = "your-secure-password"
grafana_dashboards_git_enabled = true
grafana_dashboards_git_repo    = "https://github.com/your-org/grafana-dashboards.git"
grafana_dashboards_git_token   = "ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Loki
loki_enabled        = true
loki_storage        = "50Gi"

# Thanos
thanos_enabled              = true
thanos_compactor_storage    = "50Gi"
thanos_storegateway_storage = "50Gi"

Apply Configuration

cd terraform
terraform apply

Access and Usage

Access Grafana

# Connect to VPN
tailscale up --login-server https://vpn.example.com

# Access Grafana
open https://grafana.example.com

# Default credentials
Username: admin
Password: <from terraform.tfvars>

Query Metrics (Prometheus)

# Port-forward to Prometheus
kubectl port-forward -n monitoring prometheus-0 9090:9090

# Access Prometheus UI
open http://localhost:9090

# Example queries:
up{job="kubernetes-nodes"}
rate(container_cpu_usage_seconds_total[5m])

Query Logs (Loki via Grafana)

  1. Open Grafana
  2. Navigate to Explore
  3. Select Loki datasource
  4. Use LogQL queries:
{namespace="core", pod="headscale-0"}
{job="systemd-journal"} |= "error"
rate({app="nginx"}[5m])
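
The same LogQL queries can also be run against the Loki HTTP API directly, which is useful for scripting or when Grafana is unavailable. This assumes an active port-forward to the loki-0 pod.

```shell
# Query Loki's HTTP API directly over a port-forward.
kubectl port-forward -n monitoring loki-0 3100:3100 &
curl -G 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={namespace="core", pod="headscale-0"}' \
  --data-urlencode 'limit=10'
```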

View Traces (Tempo via Grafana)

  1. Open Grafana
  2. Navigate to Explore
  3. Select Tempo datasource
  4. Search by:
    • Trace ID
    • Service name
    • Time range

Long-term Metrics (Thanos)

# Port-forward to Thanos Query
kubectl port-forward -n monitoring service/thanos-query 9090:9090

# Access Thanos UI
open http://localhost:9090

# Query across all time ranges

Alerting Configuration

Prometheus Alerts

Create alert rules:

# prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-alerts
  namespace: monitoring
data:
  alerts.yaml: |
    groups:
    - name: kubernetes
      rules:
      - alert: NodeDown
        expr: up{job="kubernetes-nodes"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"

Grafana Alerts

  1. Create alert in Grafana UI
  2. Set notification channels (email, Slack, etc.)
  3. Configure alert rules based on queries
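
Notification channels can also be provisioned as code instead of clicking through the UI, using Grafana's alerting provisioning files. The sketch below uses a placeholder Slack webhook and assumed names.

```yaml
# Sketch of Grafana contact-point provisioning (webhook URL is a placeholder).
apiVersion: 1
contactPoints:
  - orgId: 1
    name: ops-slack
    receivers:
      - uid: ops-slack-1
        type: slack
        settings:
          url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
```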

Backup and Restore

Backup Grafana Dashboards

# Export dashboards via the Grafana HTTP API
# (grafana-cli has no dashboard export command)
kubectl port-forward -n monitoring grafana-0 3000:3000 &
curl -s -u admin:<password> 'http://localhost:3000/api/search?type=dash-db'
curl -s -u admin:<password> 'http://localhost:3000/api/dashboards/uid/<uid>' > dashboard.json

# Copy to local
kubectl cp monitoring/grafana-0:/var/lib/grafana/dashboards ./grafana-backup

Backup Prometheus Data

# Create snapshot (requires Prometheus started with --web.enable-admin-api)
kubectl exec -n monitoring prometheus-0 -- \
  curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot

# Copy snapshot
kubectl cp monitoring/prometheus-0:/prometheus/snapshots ./prometheus-backup

Troubleshooting

Prometheus Not Scraping

# Check targets
kubectl port-forward -n monitoring prometheus-0 9090:9090
# Visit http://localhost:9090/targets

# Check service discovery
kubectl logs -n monitoring prometheus-0 | grep discovery

Grafana Datasource Issues

# Test datasource connectivity
kubectl exec -n monitoring grafana-0 -c grafana -- \
  curl 'http://prometheus.monitoring.svc.cluster.local:9090/api/v1/query?query=up'

# Check logs
kubectl logs -n monitoring grafana-0 -c grafana

Loki Not Receiving Logs

# Check Promtail
kubectl logs -n monitoring -l app=promtail

# Verify Loki is running
kubectl logs -n monitoring loki-0

# Test Loki API
kubectl port-forward -n monitoring loki-0 3100:3100
curl http://localhost:3100/ready

Tempo Missing Traces

# Check OTLP receiver
kubectl logs -n monitoring tempo-0 | grep otlp

# Verify service is receiving traces
kubectl port-forward -n monitoring tempo-0 3200:3200
curl http://localhost:3200/status

Performance Tuning

Prometheus

# Adjust retention and storage
prometheus_retention: "30d"
prometheus_storage: "100Gi"

# Tune scrape intervals
scrape_interval: "30s"  # Default 15s

Grafana

# Increase query timeout
dataproxy_timeout: "300"

# Cache configuration
caching_enabled: true
cache_ttl: "300"

Loki

# Adjust chunk storage
chunk_idle_period: "30m"
max_chunk_age: "1h"

# Tune ingestion
ingestion_rate_limit: "10MB"
ingestion_burst_size: "20MB"
