
[Enhancement] Heterogeneous accelerator support (AMD, Intel, TPU, custom) in health checks and telemetry #74

@theap06

Description


Current GPU health checks and telemetry collection rely on NVIDIA NVML bindings.
This tightly couples GCM's monitoring and health-validation logic to NVIDIA devices and assumes homogeneous GPU clusters.

Modern HPC and AI training environments increasingly deploy heterogeneous accelerator fleets, including:

AMD GPUs (ROCm)

Intel GPUs (Level Zero / oneAPI)

TPU nodes

AWS Trainium / Inferentia

Vendor-specific accelerators

Without an abstraction layer, supporting additional accelerators requires duplicating health check logic and scattering vendor-specific conditionals across the codebase. This makes feature parity difficult and increases maintenance overhead.

Proposed interface:

```
AcceleratorBackend
  get_identity()
  get_temperature()
  get_utilization()
  get_memory_status()
  get_power_usage()
  get_clock_state()
  get_ecc_errors()
```
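The contract above could be sketched as a Python abstract base class. The return types, units, and docstrings below are illustrative assumptions, not the existing GCM API:

```python
from abc import ABC, abstractmethod


class AcceleratorBackend(ABC):
    """Vendor-neutral interface for accelerator health and telemetry queries."""

    @abstractmethod
    def get_identity(self) -> dict:
        """Device identity: vendor, model, UUID, driver version."""

    @abstractmethod
    def get_temperature(self) -> float:
        """Current device temperature in degrees Celsius."""

    @abstractmethod
    def get_utilization(self) -> float:
        """Compute utilization as a fraction in [0.0, 1.0]."""

    @abstractmethod
    def get_memory_status(self) -> dict:
        """Total/used/free device memory in bytes."""

    @abstractmethod
    def get_power_usage(self) -> float:
        """Current power draw in watts."""

    @abstractmethod
    def get_clock_state(self) -> dict:
        """Current compute/memory clock frequencies in MHz."""

    @abstractmethod
    def get_ecc_errors(self) -> dict:
        """Correctable/uncorrectable ECC error counts since boot."""
```

Health checks and telemetry collectors would then depend only on this interface, with the concrete backend selected at startup based on the detected vendor runtime.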

Implementations:

NVMLBackend (existing behavior)

ROCmBackend (AMD)

LevelZeroBackend (Intel)

Future capability-limited backends:

TPUBackend

NeuronBackend (Trainium / Inferentia)

Alternatives considered

No response

Additional context

No response

Metadata

Assignees: No one assigned
Labels: enhancement (New feature or request)
Type: None
Projects: No projects
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests