Grafana Service

Grafana provides monitoring dashboards and visualization for Charon infrastructure and services.

Overview

  • Purpose: Monitoring and visualization dashboards
  • Version: 8.5.0 (Helm chart)
  • Port: 3000 (application), 443 (nginx-tls)
  • Storage: Persistent volume for dashboards and datasources
  • Access: VPN-only via HTTPS
  • Authentication: LDAP via FreeIPA (optional)

Features

  • Dashboards - Custom dashboards for infrastructure monitoring (Kubernetes, Headscale, Open-WebUI, GPU)
  • Data Sources - Prometheus, Loki, Tempo with traces-to-metrics/logs correlations
  • Distributed Tracing - Tempo integration with OpenTelemetry traces from Open-WebUI
  • Alerting - Alert rules and notifications
  • LDAP Integration - User authentication via FreeIPA
  • Git Sync - Automatic dashboard sync from Git repository

Architecture

┌─────────────────────────────────────┐
│          Grafana Pod                │
├─────────────────────────────────────┤
│  ┌──────────┐  ┌──────────────────┐ │
│  │ nginx-   │─▶│   Grafana        │ │
│  │ tls      │  │   (port 3000)    │ │
│  │ (443)    │  └──────────────────┘ │
│  └──────────┘           │           │
│                         ▼           │
│                  ┌────────────┐     │
│                  │ Prometheus │     │
│                  │ Loki       │     │
│                  └────────────┘     │
│                                     │
│  ┌──────────┐                       │
│  │Tailscale │ VPN connectivity      │
│  └──────────┘                       │
└─────────────────────────────────────┘

Configuration

Terraform Variables

# terraform.tfvars
grafana_enabled           = true
grafana_version           = "8.5.0"
grafana_hostname          = "grafana.example.com"
grafana_admin_password    = "your-secure-password"
grafana_tailscale_enabled = true

# Git sync for dashboards (optional)
grafana_dashboards_git_enabled      = true
grafana_dashboards_git_repo         = "https://github.com/org/repo"
grafana_dashboards_git_branch       = "main"
grafana_dashboards_git_token        = "github-token"
grafana_dashboards_git_sync_interval = 60

Resource Limits

grafana_cpu_request    = "500m"
grafana_memory_request = "512Mi"
grafana_cpu_limit      = "1"
grafana_memory_limit   = "1Gi"

Access

Web UI

# Via VPN
open https://grafana.example.com

# Default credentials
# Username: admin
# Password: <grafana_admin_password>

First Login

  1. Connect to VPN
  2. Navigate to https://grafana.example.com
  3. Login with admin credentials
  4. Change default password (recommended)

LDAP Integration

LDAP Configuration

Grafana integrates with FreeIPA for user authentication:

# Configured automatically via terraform/grafana.tf
auth.ldap:
  enabled: true
  config_file: /etc/grafana/ldap.toml

ldap.toml:
  host: freeipa.dev.svc.cluster.local
  port: 636
  use_ssl: true
  bind_dn: uid=admin,cn=users,cn=accounts,dc=dev,dc=svc,dc=cluster,dc=local
  search_base_dns: ["cn=users,cn=accounts,dc=dev,dc=svc,dc=cluster,dc=local"]
  search_filter: (uid=%s)

User Login

FreeIPA users can log in with their LDAP credentials:

  • Username: FreeIPA username
  • Password: FreeIPA password

Group Mapping

Users in the FreeIPA admins group are assigned the Grafana Admin role.

Git Dashboard Sync

Grafana dashboards are automatically provisioned from a Git repository using the git-sync sidecar pattern. This is the primary method for managing dashboards.

Configuration

Required variables in terraform.tfvars:

grafana_dashboards_git_enabled = true
grafana_dashboards_git_repo    = "https://github.com/your-org/grafana-dashboards"
grafana_dashboards_git_branch  = "main"
grafana_dashboards_git_token   = "your-github-token"
grafana_dashboards_git_sync_interval = 60  # seconds

How It Works

  1. Git-sync sidecar container runs alongside Grafana
  2. Clones the repository to /var/lib/grafana/dashboards, then pulls updates at the configured interval (60 seconds in the example above)
  3. Grafana auto-discovers JSON dashboard files via provisioning
  4. Changes in Git automatically sync to Grafana
  5. Supports private repositories via GitHub Personal Access Token

Dashboard Repository Structure

Organize your dashboards in a Git repository:

grafana-dashboards/
├── infrastructure/
│   ├── kubernetes-cluster.json
│   ├── docker-containers.json
│   └── linux-host.json
├── applications/
│   ├── headscale-vpn.json
│   └── service-metrics.json
├── gpu/
│   └── nvidia-dcgm.json
└── README.md

Dashboard Format

Each dashboard file must be valid Grafana dashboard JSON:

{
  "id": null,
  "title": "My Dashboard",
  "tags": ["custom"],
  "timezone": "browser",
  "panels": [...],
  "time": {"from": "now-1h", "to": "now"},
  "refresh": "30s"
}
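Checking files before they are committed can save a round trip through git-sync and the Grafana logs. The following is a minimal Python sketch of such a check, not part of this repository: the function names `check_dashboard` and `check_tree` are hypothetical, and the rules (a non-empty `title`, a list-valued `panels`, a null `id`) are assumptions drawn from the example above rather than Grafana's full dashboard schema.

```python
import json
from pathlib import Path


def check_dashboard(text: str) -> list[str]:
    """Return a list of problems found in one dashboard JSON document."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} (line {exc.lineno})"]
    if not isinstance(doc, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    title = doc.get("title")
    if not isinstance(title, str) or not title.strip():
        problems.append("missing or empty 'title'")
    if not isinstance(doc.get("panels", []), list):
        problems.append("'panels' must be a list")
    # Assumption: provisioned dashboards keep "id": null, as in the example.
    if doc.get("id") is not None:
        problems.append("'id' should be null for provisioned dashboards")
    return problems


def check_tree(root: str) -> dict[str, list[str]]:
    """Check every *.json file under root; return only the files with problems."""
    failures = {}
    for path in sorted(Path(root).rglob("*.json")):
        problems = check_dashboard(path.read_text(encoding="utf-8"))
        if problems:
            failures[str(path)] = problems
    return failures
```

One way to use it would be to call `check_tree("grafana-dashboards")` from a CI job and fail the build when the returned mapping is non-empty.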

Testing Dashboard Changes

  1. Commit dashboard JSON to your Git repository
  2. Wait up to 60 seconds for git-sync to pull changes
  3. Refresh Grafana UI to see new dashboards
  4. Check git-sync logs if dashboards don't appear:
kubectl logs -n monitoring grafana-0 -c git-sync -f

Common Operations

Reset Admin Password

kubectl exec -n dev grafana-0 -c grafana -- \
  grafana-cli admin reset-admin-password NewPassword123

Install Plugin

kubectl exec -n dev grafana-0 -c grafana -- \
  grafana-cli plugins install <plugin-name>

# Restart Grafana
kubectl delete pod grafana-0 -n dev

Backup Dashboards

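# List all dashboards (returns UIDs and titles, not the full JSON)
curl -u admin:password "https://grafana.example.com/api/search?type=dash-db"

# Export one dashboard by UID
curl -u admin:password "https://grafana.example.com/api/dashboards/uid/<dashboard-uid>"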
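A full export takes two kinds of request: Grafana's search API (`/api/search?type=dash-db`) lists dashboards, and `/api/dashboards/uid/<uid>` returns the JSON for each one. The sketch below automates that loop in Python using only the standard library. The helper names `make_fetcher` and `backup_dashboards` are hypothetical, basic auth with the admin account is assumed, and pagination of the search results is not handled.

```python
import base64
import json
import urllib.request
from pathlib import Path
from typing import Callable


def make_fetcher(base_url: str, user: str, password: str) -> Callable[[str], object]:
    """Build a function that GETs an API path with basic auth and decodes JSON."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()

    def fetch(path: str):
        request = urllib.request.Request(
            base_url.rstrip("/") + path,
            headers={"Authorization": f"Basic {token}"},
        )
        with urllib.request.urlopen(request, timeout=30) as response:
            return json.load(response)

    return fetch


def backup_dashboards(fetch: Callable[[str], object], out_dir: str) -> list[str]:
    """Save each dashboard found by the search API as <uid>.json; return the UIDs."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for item in fetch("/api/search?type=dash-db"):
        uid = item["uid"]
        # The per-dashboard response wraps the model: {"meta": ..., "dashboard": ...}.
        dashboard = fetch(f"/api/dashboards/uid/{uid}")["dashboard"]
        (target / f"{uid}.json").write_text(
            json.dumps(dashboard, indent=2), encoding="utf-8"
        )
        saved.append(uid)
    return saved
```

For example, `backup_dashboards(make_fetcher("https://grafana.example.com", "admin", password), "backup")` would write one file per dashboard into `backup/`. Passing the fetch function in as an argument keeps the export logic testable without a live Grafana instance.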

Data Sources

Prometheus / Thanos

Grafana automatically connects to the appropriate metrics backend based on your configuration.

When Thanos is ENABLED (thanos_enabled = true):

  • Data Source: Thanos Query
  • URL: http://thanos-query.monitoring.svc.cluster.local:9090
  • Type: Prometheus
  • Access: Server (default)
  • Benefits: Long-term retention (30/90/180 days), downsampling, global query view

When Thanos is DISABLED (thanos_enabled = false, current default):

  • Data Source: Prometheus Server
  • URL: http://prometheus-server.monitoring.svc.cluster.local
  • Type: Prometheus
  • Access: Server (default)
  • Benefits: Simpler setup, lower resource usage, 15-day retention

IMPORTANT: Dashboard queries use PromQL regardless of which backend is configured, so the same queries work unchanged against both Prometheus and Thanos.

Loki

Configured automatically if loki_enabled = true:

  • URL: http://loki.monitoring.svc.cluster.local:3100
  • Type: Loki
  • Access: Server (default)
  • Storage: emptyDir (ephemeral, resets on pod restart)

Tempo

Configured automatically if tempo_enabled = true:

  • URL: http://tempo.monitoring.svc.cluster.local:3200
  • Type: Tempo
  • Access: Server (default)
  • Features: Distributed tracing, traces-to-metrics, traces-to-logs correlations
  • Integration: Receives OpenTelemetry traces from Open-WebUI (gRPC on port 4317)

Troubleshooting

Cannot Access Web UI

# Check pod status
kubectl get pods -n dev grafana-0

# Check ingress
kubectl get ingress -n dev grafana

# Verify VPN connection
tailscale status

LDAP Authentication Failing

# Check Grafana logs
kubectl logs -n dev grafana-0 -c grafana | grep -i ldap

# Verify FreeIPA is accessible
kubectl exec -n dev grafana-0 -c grafana -- \
  nc -zv freeipa.dev.svc.cluster.local 636

# Test LDAP bind
kubectl exec -n dev grafana-0 -c grafana -- \
  ldapsearch -x -H ldaps://freeipa.dev.svc.cluster.local:636 \
  -D "uid=admin,cn=users,cn=accounts,dc=dev,dc=svc,dc=cluster,dc=local" \
  -w "ADMIN_PASSWORD" \
  -b "cn=users,cn=accounts,dc=dev,dc=svc,dc=cluster,dc=local"

Git Sync Not Working

Check git-sync sidecar logs:

kubectl logs -n monitoring grafana-0 -c git-sync -f

Common issues:

  1. Invalid Git token - Verify token has read permissions
  2. Repository not accessible - Check repository URL and network connectivity
  3. Branch name incorrect - Verify branch exists in repository
  4. Authentication failed - Regenerate GitHub Personal Access Token

Verify git-sync environment variables:

kubectl exec -n monitoring grafana-0 -c git-sync -- env | grep GIT_SYNC

Dashboards Not Loading

Check dashboard provisioning:

# Verify provisioning config
kubectl exec -n monitoring grafana-0 -c grafana -- \
  cat /etc/grafana/provisioning/dashboards/dashboards.yaml

# Check dashboard files in volume
kubectl exec -n monitoring grafana-0 -c grafana -- \
  ls -la /var/lib/grafana/dashboards

# Verify Grafana can read dashboards
kubectl logs -n monitoring grafana-0 -c grafana | grep -i dashboard

If dashboards still don't appear:

  1. Verify JSON files are valid Grafana dashboard format
  2. Check file permissions in volume
  3. Restart Grafana pod to reload provisioning:
kubectl delete pod grafana-0 -n monitoring

Dashboards Show "No Data"

Symptoms: All dashboards show "No data" but Grafana is working

Diagnosis:

# Check data source configuration
kubectl exec -n monitoring grafana-0 -c grafana -- \
  cat /etc/grafana/provisioning/datasources/datasources.yaml

# Test the metrics endpoint (when Thanos is enabled, query
# http://thanos-query.monitoring.svc.cluster.local:9090 instead)
kubectl exec -n monitoring grafana-0 -c grafana -- \
  curl -s "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=up"
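The JSON returned by the `up` query is easier to act on once summarized. The following Python sketch counts the scraped targets and lists those reporting down. It assumes the standard Prometheus HTTP API instant-query response shape, and the function name `summarize_up` is hypothetical.

```python
import json


def summarize_up(response_text: str) -> tuple[int, list[dict]]:
    """Return (total targets, label sets of targets whose `up` value is not 1)."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")

    results = body["data"]["result"]
    down = []
    for sample in results:
        # Instant-vector samples look like {"metric": {...}, "value": [ts, "1"]}.
        _, value = sample["value"]
        if float(value) != 1.0:
            down.append(sample["metric"])
    return len(results), down
```

A total of zero suggests the backend has no scrape targets yet, which fits the "Metrics backend not ready" cause, while a non-empty second element points at specific targets that are failing.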

Common Causes:

  1. Data source URL mismatch:

    • Thanos enabled but Grafana pointing to Prometheus
    • Thanos disabled but Grafana pointing to Thanos Query
  2. Dashboard queries have wrong prefix:

    • Check if queries use thanos_ prefix when they shouldn't
  3. Metrics backend not ready:

    • Wait 2-5 minutes after deployment
    • Check Prometheus/Thanos pod logs

Solution:

Verify data source matches your Terraform configuration:

# Check if Thanos is enabled
kubectl get pods -n monitoring | grep thanos

# If Thanos pods exist: data source should be thanos-query:9090
# If no Thanos pods: data source should be prometheus-server
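The comparison described above can be expressed as a small check. This Python sketch takes the data source URL (for example, as read from the provisioned datasources.yaml) and the Thanos setting, and reports a mismatch. The host names are the ones listed in the Data Sources section; the function name `datasource_mismatch` is hypothetical, and only fully qualified service names are recognized.

```python
from typing import Optional
from urllib.parse import urlparse

THANOS_HOST = "thanos-query.monitoring.svc.cluster.local"
PROMETHEUS_HOST = "prometheus-server.monitoring.svc.cluster.local"


def datasource_mismatch(datasource_url: str, thanos_enabled: bool) -> Optional[str]:
    """Return a description of the mismatch, or None when the URL is as expected."""
    host = urlparse(datasource_url).hostname
    expected = THANOS_HOST if thanos_enabled else PROMETHEUS_HOST
    if host == expected:
        return None
    return f"data source points at {host}, expected {expected}"
```

For instance, `datasource_mismatch("http://prometheus-server.monitoring.svc.cluster.local", True)` returns a message, because Thanos is enabled but the data source still targets Prometheus.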

See Monitoring Guide for detailed troubleshooting.

Security

  • Admin password stored as Kubernetes secret
  • LDAPS encryption for LDAP authentication
  • VPN-only access via Tailscale
  • IP allowlisting (100.64.0.0/10)
  • TLS certificates via cert-manager
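
The allowlisted range 100.64.0.0/10 spans 100.64.0.0 through 100.127.255.255, the block from which Tailscale assigns addresses. The Python sketch below shows what membership in that range means, which can help when reading access logs. It is illustrative only: the function name `is_allowed` is hypothetical, and the actual enforcement is performed by the deployment, not by this code.

```python
import ipaddress

# The allowlisted CIDR from the list above.
ALLOWED_NETWORK = ipaddress.ip_network("100.64.0.0/10")


def is_allowed(client_ip: str) -> bool:
    """Return True when client_ip falls inside the allowlisted range."""
    return ipaddress.ip_address(client_ip) in ALLOWED_NETWORK
```

A client at 100.64.0.1 would pass this check, while one at 192.168.1.10 would not.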

Performance Tuning

Increase Resources

grafana_cpu_limit      = "2"
grafana_memory_limit   = "2Gi"

Query Optimization

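  • Keep dashboard time ranges as short as the analysis requires; wider ranges scan more data
  • Limit the number of panels per dashboard, since each panel issues its own queries
  • Optimize Prometheus queries, for example by narrowing label selectors and using recording rules for expensive expressions
  • Enable query caching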
