
[Enhancement] Heterogeneous accelerator support (AMD, Intel, TPU, custom) in health checks and telemetry #74

@theap06

Description


Current GPU health checks and telemetry collection rely on NVIDIA NVML bindings.
This tightly couples GCM's monitoring and health-validation logic to NVIDIA devices and assumes homogeneous GPU clusters.

Modern HPC and AI training environments increasingly deploy heterogeneous accelerator fleets, including:

AMD GPUs (ROCm)

Intel GPUs (Level Zero / oneAPI)

TPU nodes

AWS Trainium / Inferentia

Vendor-specific accelerators

Without an abstraction layer, supporting additional accelerators requires duplicating health check logic and scattering vendor-specific conditionals across the codebase. This makes feature parity difficult and increases maintenance overhead.

Proposed interface:

```
AcceleratorBackend
  get_identity()
  get_temperature()
  get_utilization()
  get_memory_status()
  get_power_usage()
  get_clock_state()
  get_ecc_errors()
```
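The contract above could be sketched as a Python abstract base class. The return types, units, and docstrings below are illustrative assumptions, not the existing GCM API:

```python
from abc import ABC, abstractmethod


class AcceleratorBackend(ABC):
    """Vendor-neutral interface for accelerator health and telemetry queries."""

    @abstractmethod
    def get_identity(self) -> dict:
        """Device identity: vendor, model, UUID, driver version."""

    @abstractmethod
    def get_temperature(self) -> float:
        """Current device temperature in degrees Celsius."""

    @abstractmethod
    def get_utilization(self) -> float:
        """Compute utilization as a fraction in [0.0, 1.0]."""

    @abstractmethod
    def get_memory_status(self) -> dict:
        """Total/used/free device memory in bytes."""

    @abstractmethod
    def get_power_usage(self) -> float:
        """Current power draw in watts."""

    @abstractmethod
    def get_clock_state(self) -> dict:
        """Current compute/memory clock frequencies in MHz."""

    @abstractmethod
    def get_ecc_errors(self) -> dict:
        """Correctable/uncorrectable ECC error counts since boot."""
```

Health checks and telemetry collectors would then depend only on this interface, with the concrete backend selected at startup based on the detected vendor runtime.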

Implementations:

NVMLBackend (existing behavior)

ROCmBackend (AMD)

LevelZeroBackend (Intel)

Future capability-limited backends:

TPUBackend

NeuronBackend (Trainium / Inferentia)

Alternatives considered

No response

Additional context

No response

Metadata

Assignees: No one assigned
Labels: enhancement (New feature or request)
Type: None
Projects: No projects
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests