Description
Current GPU health checks and telemetry collection rely on NVIDIA NVML bindings.
This tightly couples GCM’s monitoring and health validation logic to NVIDIA devices and assumes homogeneous GPU clusters.
Modern HPC and AI training environments increasingly deploy heterogeneous accelerator fleets, including:
AMD GPUs (ROCm)
Intel GPUs (Level Zero / oneAPI)
TPU nodes
AWS Trainium / Inferentia
Vendor-specific accelerators
Without an abstraction layer, supporting additional accelerators requires duplicating health-check logic and scattering vendor-specific conditionals across the codebase. This makes feature parity across vendors difficult to achieve and increases maintenance overhead.
Proposed abstraction: an AcceleratorBackend interface exposing:
get_identity()
get_temperature()
get_utilization()
get_memory_status()
get_power_usage()
get_clock_state()
get_ecc_errors()
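The interface above could be sketched as a Python abstract base class. This is a minimal illustration, not the actual implementation; the return types, units, and the MemoryStatus helper are assumptions made for the sketch:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class MemoryStatus:
    """Illustrative return type; the field layout is an assumption."""
    total_bytes: int
    used_bytes: int


class AcceleratorBackend(ABC):
    """Vendor-neutral monitoring interface; method names mirror the proposal."""

    @abstractmethod
    def get_identity(self) -> str:
        """Stable device identifier (e.g. UUID or PCI bus ID)."""

    @abstractmethod
    def get_temperature(self) -> float:
        """Die temperature in degrees Celsius."""

    @abstractmethod
    def get_utilization(self) -> float:
        """Compute utilization as a fraction in [0.0, 1.0]."""

    @abstractmethod
    def get_memory_status(self) -> MemoryStatus:
        """Device memory occupancy."""

    @abstractmethod
    def get_power_usage(self) -> float:
        """Board power draw in watts."""

    @abstractmethod
    def get_clock_state(self) -> int:
        """Current core clock in MHz."""

    @abstractmethod
    def get_ecc_errors(self) -> int:
        """Uncorrected ECC error count since boot."""
```

Health-check and telemetry code would then depend only on this interface (e.g. `backend.get_temperature() < threshold`), so adding a vendor means adding one backend class rather than touching every check.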
Implementations:
NVMLBackend (existing behavior)
ROCmBackend (AMD)
LevelZeroBackend (Intel)
Future capability-limited backends:
TPUBackend
NeuronBackend (Trainium / Inferentia)
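Capability-limited backends could report unsupported metrics explicitly rather than raising vendor-specific errors, so generic health checks can skip them uniformly. One hypothetical pattern (the class, the `supports()` method, and the supported-metric set below are illustrative assumptions, not part of any existing API):

```python
from typing import Optional


class NeuronBackend:
    """Hypothetical sketch of a capability-limited backend (Trainium/Inferentia)."""

    # Metrics this accelerator family is assumed to report; illustrative only.
    SUPPORTED_METRICS = frozenset(
        {"identity", "temperature", "utilization", "memory_status"}
    )

    def supports(self, metric: str) -> bool:
        """Let callers discover capabilities before querying a metric."""
        return metric in self.SUPPORTED_METRICS

    def get_ecc_errors(self) -> Optional[int]:
        # ECC counters are not exposed by this backend in the sketch,
        # so the metric is reported as unavailable rather than raising.
        return None
```

A generic health-check loop can then guard each probe with `if backend.supports("ecc_errors"): ...`, keeping vendor conditionals out of the shared logic.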
Alternatives
No response
Additional context
No response