Prometheus Metrics Emulator (PromEmu) Docker Infrastructure

This directory contains Docker configurations for a complete Prometheus monitoring stack to support the Prometheus Metrics Emulator (PromEmu).

Architecture

┌─────────────────┐    ┌────────────────────┐    ┌─────────────────┐
│   Pushgateway   │◄───│  Emulation Script  │    │     Grafana     │
│   Port: 9091    │    │  (Python)          │    │   Port: 3000    │
└─────────┬───────┘    └────────────────────┘    └─────────┬───────┘
          │                                                │
          │                                                │
          ▼                                                ▼
┌─────────────────┐                              ┌─────────────────┐
│   Prometheus    │◄─────────────────────────────│  Data Storage   │
│   Port: 9090    │                              │   (Volumes)     │
└─────────────────┘                              └─────────────────┘

Services

Pushgateway (Port 9091)

  • Image: prom/pushgateway:v1.11.1
  • Purpose: Receives metrics from emulation script
  • Features:
    • Metrics persisted to disk every 5 minutes
    • Health checks
    • Data persistence in Docker volume
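
To make the "receives metrics" role concrete, here is a minimal sketch of a push using only the Python standard library. It relies on the Pushgateway convention of a PUT to /metrics/job/<job>/<label>/<value> with a text exposition body. The job name promemu_demo and the helper build_push_request are hypothetical and are not part of PromEmu; label values containing a slash need Pushgateway's base64 form and are not handled here.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(base_url: str, job: str, grouping: dict[str, str],
                       samples: dict[str, float]) -> urllib.request.Request:
    """Build (but do not send) a Pushgateway PUT request."""
    # Grouping key path: /metrics/job/<job>/<label>/<value>/...
    path = "/metrics/job/" + quote(job, safe="")
    for label, value in grouping.items():
        path += f"/{quote(label, safe='')}/{quote(value, safe='')}"
    # Text exposition format: one "name value" line per sample,
    # and the body must end with a newline.
    body = "".join(f"{name} {value}\n" for name, value in samples.items())
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


# Hypothetical example values; nothing is sent until urlopen() is called.
request = build_push_request(
    "http://localhost:9091",
    job="promemu_demo",
    grouping={"host": "worker-01"},
    samples={"cpu_usage_percent": 42.5},
)
```

With the stack running, urllib.request.urlopen(request, timeout=5) would send the sample. PUT replaces all metrics in that grouping key, which is usually what a periodic emulator wants.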

Prometheus (Port 9090)

  • Image: prom/prometheus:v3.5.0
  • Purpose: Scrapes metrics from Pushgateway, stores time-series data
  • Features:
    • 7-day retention policy
    • 10GB storage limit
    • Custom alerting rules for emulation metrics
    • API access for external tools

Grafana (Port 3000)

  • Image: grafana/grafana:12.1.0
  • Purpose: Visualization and dashboards
  • Credentials: admin / admin
  • Features:
    • Pre-configured Prometheus datasource
    • Automatic dashboard provisioning
    • Additional plugins (piechart, worldmap)
    • Anonymous viewer access

Quick Start

Prerequisites

  • Docker and Docker Compose installed
  • Ports 3000, 9090, 9091 available

Start Everything

# Start all services
./manage.sh start

# Or using docker-compose directly
docker-compose up -d --build

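Access Services

  • Grafana: http://localhost:3000 (admin / admin)
  • Prometheus: http://localhost:9090
  • Pushgateway: http://localhost:9091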

Check Status

./manage.sh status

Management Script

The manage.sh script provides convenient management commands:

./manage.sh start     # Start all services
./manage.sh stop      # Stop all services
./manage.sh restart   # Restart all services
./manage.sh status    # Show service status
./manage.sh config    # Show current configuration
./manage.sh logs      # Show all logs
./manage.sh logs grafana  # Show specific service logs
./manage.sh backup    # Backup data volumes
./manage.sh cleanup   # Complete cleanup (removes all data)

Configuration Files

Docker Compose

  • docker-compose.yml - Main orchestration file
  • Defines services, networks, volumes, and dependencies

Pushgateway

  • pushgateway/Dockerfile - Custom build with health checks
  • Persistent storage in /data volume
  • Runs as non-root user for security

Prometheus

  • prometheus/Dockerfile - Custom build with config files
  • prometheus/prometheus.yml - Main configuration
  • prometheus/rules/emulation_alerts.yml - Alerting rules
  • Scrapes Pushgateway every 15 seconds (configurable)

Grafana

  • grafana/Dockerfile - Custom build with plugins
  • grafana/provisioning/datasources/ - Auto-configured datasources
  • grafana/provisioning/dashboards/ - Auto-provisioned dashboards
  • grafana/dashboards/emulation_overview.json - Main dashboard

Data Persistence

All data is stored in Docker named volumes:

  • max-prometheus-data - Prometheus time-series data
  • max-grafana-data - Grafana dashboards and settings
  • max-pushgateway-data - Pushgateway metric buffer

Backup Data

./manage.sh backup

Creates a timestamped backup in the backups/ directory.
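
For readers who want to archive a volume without the script, the sketch below builds one common Docker pattern: mount the named volume read-only into a throwaway container and tar it into a host directory. This is an assumption about a reasonable approach, not a description of what manage.sh does internally. The alpine image, the archive naming scheme, and the helper backup_command are all illustrative.

```python
from datetime import datetime
from pathlib import Path

# Volume names as listed in the Data Persistence section.
VOLUMES = ["max-prometheus-data", "max-grafana-data", "max-pushgateway-data"]


def backup_command(volume: str, backup_dir: Path, now: datetime) -> list[str]:
    """Return a docker command that tars one named volume into backup_dir."""
    # Hypothetical naming scheme: <volume>-YYYYMMDD-HHMMSS.tar.gz
    archive = f"{volume}-{now:%Y%m%d-%H%M%S}.tar.gz"
    return [
        "docker", "run", "--rm",
        "-v", f"{volume}:/data:ro",               # source volume, read-only
        "-v", f"{backup_dir.resolve()}:/backup",  # host directory for archives
        "alpine",
        "tar", "czf", f"/backup/{archive}", "-C", "/data", ".",
    ]
```

Each command can be run with subprocess.run(cmd, check=True) for every entry in VOLUMES. Stopping the services first avoids archiving a Prometheus data directory that is being written to.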

Networking

Services communicate via a dedicated Docker network:

  • Network: max-monitoring
  • Internal DNS resolution between services
  • Only necessary ports exposed to host

Health Checks

All services include health checks:

  • Pushgateway: HTTP GET /metrics
  • Prometheus: HTTP GET /-/healthy
  • Grafana: HTTP GET /api/health

Health check parameters:

  • Interval: 30 seconds
  • Timeout: 10 seconds
  • Retries: 3
  • Start period: 10-30 seconds
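
The same endpoints can be probed from the host. The sketch below is a simplified model of the retry behaviour described above: a service counts as unhealthy only after all retries fail. It assumes the ports are published on localhost; http_ok and is_healthy are hypothetical helpers, and Docker's real health check is a continuous state machine rather than a single loop.

```python
import http.client
import time
import urllib.request
from typing import Callable

# Endpoints from the list above, assuming ports are published on localhost.
HEALTH_URLS = {
    "pushgateway": "http://localhost:9091/metrics",
    "prometheus": "http://localhost:9090/-/healthy",
    "grafana": "http://localhost:3000/api/health",
}


def http_ok(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts, and non-2xx HTTP errors.
        return False


def is_healthy(probe: Callable[[], bool], retries: int = 3,
               interval: float = 30.0,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return False only after `retries` consecutive failed probes."""
    for attempt in range(retries):
        if probe():
            return True
        if attempt < retries - 1:
            sleep(interval)  # wait before the next attempt
    return False
```

For example, is_healthy(lambda: http_ok(HEALTH_URLS["grafana"])) waits up to roughly a minute plus timeouts before giving up. The sleep parameter is injectable so the loop can be exercised without real delays.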

Monitoring Emulated Metrics

Prometheus Queries

# CPU usage across all hosts
cpu_usage_percent

# Memory usage for specific host
memory_usage_percent{host="worker-01"}

# Heavy task status
heavy_task_active

# Database connections
db_connections
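
These queries can also be run outside the web UI through the Prometheus HTTP API (GET /api/v1/query). The sketch below builds the request URL and parses an instant-vector response. The helper names are hypothetical, and the base URL assumes the default port mapping from this stack.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against /api/v1/query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(payload: str) -> list[tuple[dict, float]]:
    """Turn an instant-vector JSON response into (labels, value) pairs."""
    doc = json.loads(payload)
    if doc.get("status") != "success":
        raise RuntimeError(f"query failed: {doc.get('error', 'unknown error')}")
    data = doc["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each result carries its labels and a [timestamp, "value"] pair;
    # the value is a string in the JSON and is converted to float here.
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]
```

A typical call fetches instant_query_url("http://localhost:9090", 'memory_usage_percent{host="worker-01"}') with urllib.request.urlopen and passes the decoded body to parse_vector.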

Grafana Dashboards

The included dashboard shows:

  • CPU usage by host (time series)
  • Memory usage by host (time series)
  • Heavy task status (stat panel)
  • Host distribution by role (pie chart)
  • Database connections (time series)
  • Network throughput (time series)

Alerting Rules

Pre-configured alerts:

  • HighCPUUsage: CPU > 80% for 2 minutes
  • CriticalCPUUsage: CPU > 95% for 1 minute
  • HighMemoryUsage: Memory > 85% for 3 minutes
  • HeavyTaskActive: Heavy task started
  • PushgatewayDown: Service unavailable
  • PrometheusDown: Service unavailable
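
The "for N minutes" part of these rules means the condition must hold continuously before the alert fires; a single dip below the threshold resets the timer. The sketch below models that behaviour on a list of samples. It is a simplified illustration that treats each sample as one rule evaluation, not Prometheus's actual implementation, and firing_times is a hypothetical helper.

```python
def firing_times(samples: list[tuple[float, float]], threshold: float,
                 hold_seconds: float) -> list[float]:
    """Return timestamps at which the alert would be firing.

    samples: (timestamp_seconds, value) pairs sorted by time.
    """
    pending_since = None  # when the condition first became true
    firing = []
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts
            if ts - pending_since >= hold_seconds:
                firing.append(ts)
        else:
            pending_since = None  # any dip resets the pending timer
    return firing


# HighCPUUsage-style check: CPU > 80% for 2 minutes, sampled every 30 s.
cpu = [(0, 70), (30, 85), (60, 90), (90, 88), (120, 91),
       (150, 92), (180, 60), (210, 99)]
print(firing_times(cpu, threshold=80, hold_seconds=120))
```

In this series the condition first holds at t=30, so the alert fires only at t=150; the drop at t=180 clears it, and the spike at t=210 starts a new pending period.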

Troubleshooting

Services Won't Start

# Check Docker is running
docker info

# Check port availability
netstat -tulpn | grep -E ':(3000|9090|9091)'

# View service logs
./manage.sh logs [service]
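
Where netstat is not available, a port check can be approximated in Python by attempting a TCP connection. This is a sketch with a known limitation: it only detects something listening on the given host address (127.0.0.1 by default), so a process bound to another interface may be missed. port_in_use is a hypothetical helper.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


def busy_ports(ports=(3000, 9090, 9091)) -> list[int]:
    """List which of the stack's ports are already taken."""
    return [p for p in ports if port_in_use(p)]
```

Before the first start, busy_ports() should return an empty list; after ./manage.sh start it should report all three ports.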

Missing Data

# Verify Pushgateway has metrics
curl http://localhost:9091/metrics

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Verify Grafana datasource (default credentials shown; substitute your own if changed)
curl -u admin:admin http://localhost:3000/api/datasources
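
The targets response is verbose JSON, so it can help to filter it down to the scrape targets that are not healthy. The sketch below parses the body returned by /api/v1/targets and keeps only entries whose health is not "up". unhealthy_targets is a hypothetical helper; the field names follow the Prometheus HTTP API.

```python
import json


def unhealthy_targets(payload: str) -> list[dict]:
    """Summarise active targets whose last scrape did not succeed."""
    doc = json.loads(payload)
    targets = doc["data"]["activeTargets"]
    return [
        {
            "job": t["labels"].get("job", ""),
            "scrape_url": t["scrapeUrl"],
            "last_error": t.get("lastError", ""),
        }
        for t in targets
        if t.get("health") != "up"
    ]
```

An empty result while dashboards still show no data suggests the scrape itself works and the problem lies upstream, for example the emulation script not pushing to Pushgateway.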

Performance Issues

# Monitor resource usage
docker stats

# Check volume sizes
docker system df -v

# Reduce Prometheus retention (current default: 7d)
# Edit .docker-env: PROMETHEUS_RETENTION_TIME=3d

Reset Everything

./manage.sh cleanup
./manage.sh start

Integration with Emulation

The emulation script pushes metrics to Pushgateway:

import asyncio

from core.emulator import MetricsEmulator
from core.config_example import get_example_config


async def main() -> None:
    # Configure to use local Pushgateway
    config = get_example_config()
    config.pushgateway_url = 'http://localhost:9091'

    # Run emulation
    emulator = MetricsEmulator(config)
    await emulator.run_for_duration(1800)  # 30 minutes


# await is only valid inside a coroutine, so run main() on an event loop
asyncio.run(main())

Production Considerations

For production use, consider:

  1. Security:

    • Change default Grafana password
    • Enable HTTPS with reverse proxy
    • Configure authentication
  2. Persistence:

    • Use external storage for volumes
    • Set up backup automation
    • Configure retention policies
  3. Monitoring:

    • Add Alertmanager for notifications
    • Configure external receivers (email, Slack)
    • Monitor the monitoring stack itself
  4. Performance:

    • Tune Prometheus storage settings
    • Configure appropriate resource limits
    • Use SSD storage for better performance