
Monitoring Stack Services

Comprehensive observability platform with metrics, logs, traces, and long-term storage.

Overview

The monitoring stack in Charon provides complete observability:

  • Prometheus - Metrics collection and alerting
  • Grafana - Visualization and dashboards
  • Loki - Log aggregation
  • Tempo - Distributed tracing
  • Thanos - Long-term metrics storage
  • Promtail - Log collection agent

Architecture

┌──────────────────────────────────────────────────────┐
│                   Monitoring Stack                   │
├──────────────────────────────────────────────────────┤
│                                                      │
│  ┌──────────────┐         ┌──────────────┐           │
│  │  Prometheus  │────────▶│    Thanos    │           │
│  │  (metrics)   │         │ (long-term)  │           │
│  └──────────────┘         └──────────────┘           │
│          │                       │                   │
│          ▼                       ▼                   │
│  ┌──────────────────────────────────────┐            │
│  │               Grafana                │            │
│  │  ┌──────────┐  ┌──────┐  ┌──────┐    │            │
│  │  │Prometheus│  │ Loki │  │Tempo │    │            │
│  │  │DataSource│  │  DS  │  │  DS  │    │            │
│  │  └──────────┘  └──────┘  └──────┘    │            │
│  └──────────────────────────────────────┘            │
│       ▲             ▲             ▲                  │
│       │             │             │                  │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐            │
│  │   Apps   │  │ Promtail │  │   Apps   │            │
│  │ (metrics)│  │  (logs)  │  │ (traces) │            │
│  └──────────┘  └──────────┘  └──────────┘            │
└──────────────────────────────────────────────────────┘

Components

Prometheus

Purpose: Metrics collection and alerting
Namespace: monitoring
Port: 9090
Storage: 50Gi (retain)

Features:

  • Service discovery via Kubernetes API
  • 35+ scrape targets across 10 jobs
  • AlertManager integration
  • Federation support for Thanos

Configuration targets:

  • Kubernetes components (API server, kubelet, etcd)
  • Service endpoints (Headscale, FreeIPA, Grafana, etc.)
  • Node metrics via node-exporter
  • GPU metrics via DCGM exporter
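
The Kubernetes-based service discovery above is typically expressed as a scrape job like the following. This is an illustrative sketch, not this cluster's actual configuration; the job name, annotation convention, and labels are assumptions.

```yaml
# Illustrative Prometheus scrape job using Kubernetes service discovery.
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Carry the namespace through as a query label
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
```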

Grafana

Purpose: Visualization and dashboards
Namespace: monitoring
Port: 3000 (app), 443 (nginx-tls)
Storage: 10Gi for dashboards
Access: VPN-only via HTTPS

Features:

  • Pre-configured dashboards from Git repository
  • Tempo correlation for traces
  • LDAP authentication via FreeIPA
  • Multi-datasource support

Dashboards included:

  • Kubernetes cluster overview
  • Node exporter metrics
  • Headscale VPN statistics
  • Open-WebUI performance
  • Loki log exploration
  • Tempo trace analysis

Loki

Purpose: Log aggregation and querying
Namespace: monitoring
Port: 3100
Storage: 50Gi (emptyDir for short-term)

Features:

  • LogQL query language
  • Label-based log organization
  • Promtail automatic collection
  • Grafana integration

Log sources:

  • All pod logs via Promtail
  • System logs from nodes
  • Application-specific logs

Tempo

Purpose: Distributed tracing
Namespace: monitoring
Port: 3200 (HTTP), 4317 (gRPC OTLP)
Storage: 10Gi

Features:

  • OpenTelemetry support
  • Trace to logs correlation
  • Grafana native integration
  • Service dependency mapping

Integrated services:

  • Open-WebUI (OTLP traces)
  • Custom applications with OpenTelemetry SDK
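
An application instrumented with the OpenTelemetry SDK can be pointed at Tempo's OTLP gRPC port through the standard OTEL environment variables. The Tempo service DNS name below is an assumption based on the monitoring namespace used in this stack; adjust it to the actual service name.

```shell
# Standard OpenTelemetry SDK environment variables; the Tempo endpoint
# is an assumed in-cluster DNS name, not a verified service address.
export OTEL_EXPORTER_OTLP_ENDPOINT="http://tempo.monitoring.svc.cluster.local:4317"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
export OTEL_SERVICE_NAME="my-app"
```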

Thanos

Purpose: Long-term metrics storage and global view
Namespace: monitoring
Storage: 2x 50Gi retain PVCs

Components:

  • Query: Global query interface
  • Store Gateway: Access to object storage
  • Compactor: Downsampling and retention
  • Ruler: Recording and alerting rules

Features:

  • Unlimited retention via object storage
  • Query deduplication
  • Downsampling for efficiency
  • Multi-cluster federation
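
For orientation, a Thanos Query instance achieves the global view and deduplication above by fanning out to multiple StoreAPI endpoints. The invocation below is a sketch; the endpoint DNS names and the replica label are assumptions, not this cluster's actual deployment flags.

```shell
# Illustrative Thanos Query flags (endpoints and replica label are assumed).
thanos query \
  --http-address=0.0.0.0:9090 \
  --store=prometheus-0.prometheus.monitoring.svc.cluster.local:10901 \
  --store=thanos-storegateway.monitoring.svc.cluster.local:10901 \
  --query.replica-label=replica
```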

Promtail

Purpose: Log collection agent
Namespace: monitoring
Type: DaemonSet (runs on all nodes)

Features:

  • Automatic Kubernetes pod discovery
  • Label extraction from metadata
  • Log pipeline processing
  • Direct shipping to Loki
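
The discovery, label extraction, and pipeline features above come together in a Promtail scrape config along these lines. This is a minimal sketch with assumed job and label names, not the deployed configuration.

```yaml
# Sketch of a Promtail scrape config (job name and labels are illustrative).
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      - cri: {}   # parse the container runtime log format
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
```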

Configuration

Enable Monitoring Stack

In terraform.tfvars:

# Prometheus
prometheus_enabled = true

# Grafana
grafana_enabled                = true
grafana_hostname               = "grafana.example.com"
grafana_admin_password         = "your-secure-password"
grafana_dashboards_git_enabled = true
grafana_dashboards_git_repo    = "https://github.com/your-org/grafana-dashboards.git"
grafana_dashboards_git_token   = "ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Loki
loki_enabled        = true
loki_storage        = "50Gi"

# Thanos
thanos_enabled              = true
thanos_compactor_storage    = "50Gi"
thanos_storegateway_storage = "50Gi"

Apply Configuration

cd terraform
terraform apply

Access and Usage

Access Grafana

# Connect to VPN
tailscale up --login-server https://vpn.example.com

# Access Grafana
open https://grafana.example.com

# Default credentials
Username: admin
Password: <from terraform.tfvars>

Query Metrics (Prometheus)

# Port-forward to Prometheus
kubectl port-forward -n monitoring prometheus-0 9090:9090

# Access Prometheus UI
open http://localhost:9090

# Example queries:
up{job="kubernetes-nodes"}
rate(container_cpu_usage_seconds_total[5m])

Query Logs (Loki via Grafana)

  1. Open Grafana
  2. Navigate to Explore
  3. Select Loki datasource
  4. Use LogQL queries:
{namespace="core", pod="headscale-0"}
{job="systemd-journal"} |= "error"
rate({app="nginx"}[5m])
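
The same LogQL queries can also be run against the Loki HTTP API directly, which is useful for scripting or when Grafana is unavailable. This assumes an active port-forward to the loki-0 pod.

```shell
# Query Loki's HTTP API directly over a port-forward.
kubectl port-forward -n monitoring loki-0 3100:3100 &
curl -G 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={namespace="core", pod="headscale-0"}' \
  --data-urlencode 'limit=10'
```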

View Traces (Tempo via Grafana)

  1. Open Grafana
  2. Navigate to Explore
  3. Select Tempo datasource
  4. Search by:
    • Trace ID
    • Service name
    • Time range

Long-term Metrics (Thanos)

# Port-forward to Thanos Query
kubectl port-forward -n monitoring service/thanos-query 9090:9090

# Access Thanos UI
open http://localhost:9090

# Query across all time ranges

Alerting Configuration

Prometheus Alerts

Create alert rules:

# prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-alerts
  namespace: monitoring
data:
  alerts.yaml: |
    groups:
    - name: kubernetes
      rules:
      - alert: NodeDown
        expr: up{job="kubernetes-nodes"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"

Grafana Alerts

  1. Create alert in Grafana UI
  2. Set notification channels (email, Slack, etc.)
  3. Configure alert rules based on queries
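
Notification channels can also be provisioned as code instead of clicking through the UI, using Grafana's alerting provisioning files. The sketch below uses a placeholder Slack webhook and assumed names.

```yaml
# Sketch of Grafana contact-point provisioning (webhook URL is a placeholder).
apiVersion: 1
contactPoints:
  - orgId: 1
    name: ops-slack
    receivers:
      - uid: ops-slack-1
        type: slack
        settings:
          url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
```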

Backup and Restore

Backup Grafana Dashboards

# Export dashboards via the Grafana HTTP API
# (grafana-cli has no dashboard export command)
kubectl port-forward -n monitoring grafana-0 3000:3000 &
curl -s -u admin:<password> 'http://localhost:3000/api/search?type=dash-db'
curl -s -u admin:<password> 'http://localhost:3000/api/dashboards/uid/<uid>' > dashboard.json

# Copy to local
kubectl cp monitoring/grafana-0:/var/lib/grafana/dashboards ./grafana-backup

Backup Prometheus Data

# Create snapshot (requires Prometheus started with --web.enable-admin-api)
kubectl exec -n monitoring prometheus-0 -- \
  curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot

# Copy snapshot
kubectl cp monitoring/prometheus-0:/prometheus/snapshots ./prometheus-backup

Troubleshooting

Prometheus Not Scraping

# Check targets
kubectl port-forward -n monitoring prometheus-0 9090:9090
# Visit http://localhost:9090/targets

# Check service discovery
kubectl logs -n monitoring prometheus-0 | grep discovery

Grafana Datasource Issues

# Test datasource connectivity
kubectl exec -n monitoring grafana-0 -c grafana -- \
  curl 'http://prometheus.monitoring.svc.cluster.local:9090/api/v1/query?query=up'

# Check logs
kubectl logs -n monitoring grafana-0 -c grafana

Loki Not Receiving Logs

# Check Promtail
kubectl logs -n monitoring -l app=promtail

# Verify Loki is running
kubectl logs -n monitoring loki-0

# Test Loki API
kubectl port-forward -n monitoring loki-0 3100:3100
curl http://localhost:3100/ready

Tempo Missing Traces

# Check OTLP receiver
kubectl logs -n monitoring tempo-0 | grep otlp

# Verify service is receiving traces
kubectl port-forward -n monitoring tempo-0 3200:3200
curl http://localhost:3200/status

Performance Tuning

Prometheus

# Adjust retention and storage
prometheus_retention: "30d"
prometheus_storage: "100Gi"

# Tune scrape intervals
scrape_interval: "30s"  # Default 15s

Grafana

# Increase query timeout
dataproxy_timeout: "300"

# Cache configuration
caching_enabled: true
cache_ttl: "300"

Loki

# Adjust chunk storage
chunk_idle_period: "30m"
max_chunk_age: "1h"

# Tune ingestion
ingestion_rate_limit: "10MB"
ingestion_burst_size: "20MB"
