A complete GPU monitoring solution that collects real-time performance metrics from AMD Radeon hardware and visualizes them through a modern observability stack. Built with Prometheus for metrics aggregation, Grafana for dashboards, and a custom Python service for hardware telemetry, all containerized with Docker and deployable to AWS using Terraform.
The platform enables side-by-side comparison of GPU performance under real workloads. In this example, monitoring both the AMD RX 7800 XT and RX 7700 XT during a sustained stress test revealed:
- The 7800 XT delivered equivalent performance while maintaining 10-15% lower GPU utilization
- Improved thermal design kept the 7800 XT running 5-8°C cooler throughout the test
- Despite consuming ~15W more power, the 7800 XT achieved 20% better performance per watt
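Comparisons like these can be read straight off Grafana panels; a performance-per-watt panel, for instance, might divide a utilization (or throughput) series by a power series. The metric names and `card` label below are illustrative assumptions, not the project's exact series:

```promql
# Assumed metric and label names, for illustration only
avg_over_time(gpu_utilization{card="rx7800xt"}[5m])
  / avg_over_time(gpu_power_watts{card="rx7800xt"}[5m])
```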
Alertmanager is configured to watch for multiple critical conditions (high temperature, power spikes, low utilization). This example demonstrates the GPU Idle alert: when the GPU goes idle, a notification is sent via email and webhook, allowing external systems to automatically queue new workloads.
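As an illustration, the GPU Idle rule in `alert_rules.yml` might look like the sketch below. The metric name `gpu_utilization` and the thresholds are assumptions for this example, not the repository's exact rule:

```yaml
groups:
  - name: gpu_alerts
    rules:
      - alert: GPUIdle
        # Assumed metric name; fires when utilization stays under 5% for 2 minutes
        expr: avg_over_time(gpu_utilization[2m]) < 5
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "GPU on {{ $labels.instance }} is idle"
          description: "Utilization below 5% for 2m; safe to queue a new workload."
```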
| Component | Technology | Functionality |
|---|---|---|
| Metrics Exporter | Custom Python (gpu_service.py) | Extracts raw GPU telemetry via ROCm SMI. |
| Metrics Endpoint | Python (FastAPI/Uvicorn) | Acts as a central API/aggregator; exposes metrics on /metrics for Prometheus to scrape. |
| Monitoring & Alerting | Prometheus (v3.8.0) & Alertmanager (v0.29.0) | Handles time-series data storage, rule evaluation (alert_rules.yml), and email/webhook notifications. |
| Visualization | Grafana | Dashboards for real-time comparative analysis. |
| Deployment | Docker/Docker Compose | Defines the four-container stack and the isolated gpu-observer_monitoring network. |
| Infrastructure | AWS EC2 (t3.small) & Terraform | Managed infrastructure as code; automated SSH and secure port access. |
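A `docker-compose.yml` for such a stack might be sketched as follows. Only the Prometheus/Alertmanager versions and the `gpu-observer_monitoring` network name come from the table above; the service names, the `uvicorn` command, and the port mappings are assumptions. The GPU exporter (`gpu_service.py`) needs hardware access, so in this sketch it is presumed to run on the GPU host and push into the stack:

```yaml
services:
  cloud_service:                      # FastAPI/Uvicorn metrics endpoint
    build: .
    command: uvicorn cloud_service:app --host 0.0.0.0 --port 8000
    networks: [monitoring]
  prometheus:
    image: prom/prometheus:v3.8.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml
    ports: ["9090:9090"]
    networks: [monitoring]
  alertmanager:
    image: prom/alertmanager:v0.29.0
    ports: ["9093:9093"]
    networks: [monitoring]
  grafana:
    image: grafana/grafana
    ports: ["3000:3000"]
    networks: [monitoring]

networks:
  monitoring:
    name: gpu-observer_monitoring
```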
The core of the architecture is a Docker-defined pipeline where the GPU service pushes raw telemetry to the cloud service (cloud_service.py, FastAPI),
which then exposes a Prometheus-compliant endpoint. Prometheus scrapes this data on the shared Docker network, and the results are visualized
in Grafana. Alerts are routed to Alertmanager, which handles notifications via authenticated SMTP and an event-driven webhook.
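The /metrics endpoint ultimately returns plain text in the Prometheus exposition format. Below is a minimal sketch of just that formatting step; the metric names are assumptions, and the real FastAPI wiring lives in cloud_service.py:

```python
def to_prometheus_text(samples: dict) -> str:
    """Render GPU samples as Prometheus text-exposition lines.

    `samples` maps a metric name to {label_string: value}, e.g.
    {"gpu_utilization": {'card="0"': 37.0}}.  The metric names used
    here are assumptions for the sketch, not the service's exact names.
    """
    lines = []
    for metric, series in samples.items():
        lines.append(f"# TYPE {metric} gauge")
        for labels, value in series.items():
            lines.append(f"{metric}{{{labels}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_prometheus_text({
        "gpu_utilization": {'card="0"': 37.0},
        "gpu_temperature_celsius": {'card="0"': 61.0},
    }))
```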

- Automated Alerting Pipeline: Configured Alertmanager to reliably send email alerts via secure SMTP (Port 587) and simultaneously push real-time alerts to a webhook endpoint (/alert in cloud_service.py).
- Custom Metrics & Integration: Developed a Python service to integrate with ROCm SMI and expose metrics in a Prometheus format.
- Infrastructure as Code (IaC): Used Terraform to define and deploy the AWS EC2 instance, security group, and automated startup script, ensuring reproducible environments.
- Secure Credential Handling: Implemented a startup script to dynamically generate the final alertmanager.yml from a template using environment variables ($EMAIL_PASS) to prevent hardcoding secrets.
- Low-Latency Metrics Pipeline: Engineered a data collection system that scrapes critical GPU telemetry (utilization, temperature, power) at 5-second intervals, enabling near-real-time hardware performance monitoring.
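The 5-second cadence corresponds to a Prometheus scrape configuration along these lines (the job name and target port are assumptions; on the shared Docker network the endpoint is reachable by container name):

```yaml
global:
  scrape_interval: 5s        # near-real-time GPU telemetry
  evaluation_interval: 5s

rule_files:
  - alert_rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: gpu_metrics                    # assumed job name
    static_configs:
      - targets: ["cloud_service:8000"]      # /metrics on the Docker network
```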
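The ROCm SMI integration boils down to turning SMI output into plain numbers per card. A sketch of that parsing step is below; the JSON field names ("GPU use (%)", etc.) are assumptions about `rocm-smi --json`-style output, and gpu_service.py may read different fields:

```python
import json


def parse_rocm_smi(json_text: str) -> dict:
    """Parse rocm-smi JSON-style output into plain floats per card.

    The field names below are assumptions for this sketch, not a
    guaranteed rocm-smi schema.
    """
    raw = json.loads(json_text)
    cards = {}
    for card, fields in raw.items():
        cards[card] = {
            "utilization": float(fields["GPU use (%)"]),
            "temperature": float(fields["Temperature (Sensor edge) (C)"]),
            "power": float(fields["Average Graphics Package Power (W)"]),
        }
    return cards


if __name__ == "__main__":
    sample = ('{"card0": {"GPU use (%)": "37", '
              '"Temperature (Sensor edge) (C)": "61.0", '
              '"Average Graphics Package Power (W)": "180.0"}}')
    print(parse_rocm_smi(sample))
```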
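The secrets-templating step can be illustrated as below. This self-contained demo writes a template with a `${EMAIL_PASS}` placeholder and renders it with `sed`; the file names, the use of `sed` (rather than, say, `envsubst`), and the inline demo secret are assumptions for the sketch, while the real startup script reads the secret from the environment:

```shell
#!/bin/sh
set -eu

# Demo only: a real startup script would ship this template and take
# EMAIL_PASS from the container environment, never hardcode it.
cat > alertmanager.yml.tmpl <<'EOF'
global:
  smtp_auth_password: '${EMAIL_PASS}'
EOF

EMAIL_PASS="example-secret"

# Substitute the placeholder to produce the final config
sed "s|\${EMAIL_PASS}|${EMAIL_PASS}|g" alertmanager.yml.tmpl > alertmanager.yml
cat alertmanager.yml
```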