
[RFE] Add support for running DCGM with NVML injection to allow simulation of GPU health checks and diags #136

@lalitadithya


We have several GPU health-checking and monitoring components built on top of DCGM. To test these components, DCGM needs to be deployed with the environment variable NVML_INJECTION_MODE=True set. Running in this mode also allows GPU errors to be injected using dcgmi test. An example implementation is available at https://github.com/NVIDIA/NVSentinel/pull/112/files

Would it be possible to include support for DCGM in the fake GPU operator?
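
For illustration, here is a rough sketch of what "deployed with NVML_INJECTION_MODE=True set" could look like when the DCGM host engine is started from a wrapper script. It assumes the variable is read from the process environment, and that the host engine binary is called nv-hostengine and is on PATH. The helper names injection_env and start_hostengine are hypothetical and are not taken from the linked PR or from the fake GPU operator.

```python
import os
import subprocess


def injection_env(base=None):
    """Return a copy of the environment with NVML injection enabled.

    The variable name and value come from the issue text; treating it as
    a process environment variable is an assumption of this sketch.
    """
    env = dict(os.environ if base is None else base)
    env["NVML_INJECTION_MODE"] = "True"
    return env


def start_hostengine(command=("nv-hostengine",), base_env=None):
    """Start the DCGM host engine with injection mode set.

    The default command is an assumption; pass whatever command the
    deployment actually uses to start DCGM.
    """
    return subprocess.Popen(list(command), env=injection_env(base_env))
```

In a Kubernetes deployment the same setting would more likely be an env entry on the DCGM container, but the idea is the same: the variable has to be present in the environment of the process that loads NVML.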
