We have several GPU health-checking and monitoring components that are built on top of DCGM. To test these components, DCGM needs to be deployed with the variable NVML_INJECTION_MODE=True set. With that mode enabled, GPU errors can also be injected using dcgmi test. An example implementation is available at https://github.com/NVIDIA/NVSentinel/pull/112/files.
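To make the test flow concrete, here is a rough sketch of how a test harness might prepare the injection setup. It is not taken from the linked PR. It assumes that NVML_INJECTION_MODE is read from the environment of the DCGM host engine process, and that errors are injected with `dcgmi test --inject`. The helper names `injection_env` and `inject_command` are hypothetical, and the field ID and value in the usage line are placeholders rather than recommended settings.

```python
import os
import shlex


def injection_env(base=None):
    """Return a copy of an environment mapping with NVML injection enabled.

    Assumption: DCGM picks up NVML_INJECTION_MODE from the environment of
    the process it is started in.
    """
    env = dict(os.environ if base is None else base)
    env["NVML_INJECTION_MODE"] = "True"
    return env


def inject_command(gpu_id, field_id, value):
    """Build the argv for a `dcgmi test --inject` call without running it."""
    return [
        "dcgmi", "test", "--inject",
        "--gpuid", str(gpu_id),
        "-f", str(field_id),  # DCGM field ID to override (placeholder in the demo)
        "-v", str(value),     # value to inject for that field (placeholder in the demo)
    ]


if __name__ == "__main__":
    # Print the command instead of executing it, so the sketch runs anywhere.
    print(shlex.join(inject_command(0, 230, 79)))
```

A harness could pass `injection_env()` as the `env` argument of `subprocess.run` when launching DCGM, and run the list returned by `inject_command` the same way once the host engine is up.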
Would it be possible to include support for DCGM in the fake GPU operator?