[RFE] Add support for running DCGM with NVML injection to allow simulation of GPU health checks and diags

We have several GPU health-checking and monitoring components built on top of DCGM. To test these components, DCGM needs to be deployed with the environment variable `NVML_INJECTION_MODE=True` set. This also allows GPU errors to be injected using `dcgmi test`. An example implementation is available at https://github.com/NVIDIA/NVSentinel/pull/112/files
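As a rough illustration of the configuration described above, the sketch below adds `NVML_INJECTION_MODE=True` to a Kubernetes-style container spec represented as a plain Python dict. The helper name `with_nvml_injection`, the container name, and the image string are hypothetical placeholders, not taken from the fake GPU operator or from the linked PR; how the operator would actually template its DCGM deployment is an open question for this request.

```python
import json


def with_nvml_injection(container: dict) -> dict:
    """Return a copy of a container spec with NVML_INJECTION_MODE=True set.

    Any existing NVML_INJECTION_MODE entry is replaced so the result
    contains exactly one entry for that variable.
    """
    env = [
        entry
        for entry in container.get("env", [])
        if entry.get("name") != "NVML_INJECTION_MODE"
    ]
    env.append({"name": "NVML_INJECTION_MODE", "value": "True"})
    return {**container, "env": env}


# Hypothetical container entry; the name and image are placeholders.
dcgm_container = {"name": "dcgm", "image": "example.invalid/dcgm:placeholder"}

patched = with_nvml_injection(dcgm_container)
print(json.dumps(patched, indent=2))
```

The helper returns a new dict rather than mutating its input, so the same base spec can be reused for deployments with and without injection.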

Would it be possible to include support for DCGM in the fake GPU operator?

Issue #136
