How to evaluate if the nvidia driver is available on the node #16
Description
With #14 we introduced a direct dependency on the nvidia-gpu-operator.
Dependabot flagged the dependency for potential vulnerabilities:
- https://github.com/CoHDI/composable-resource-operator/security/dependabot/10
- https://github.com/CoHDI/composable-resource-operator/security/dependabot/11
The goal of this issue is to discuss how to proceed.
I looked at the source code: the operator checks the nvidia gpu-operator cluster policy to determine whether the driver is enabled.
Here are some of the calls:
- https://github.com/CoHDI/composable-resource-operator/blob/main/internal/utils/gpus.go#L91
- https://github.com/CoHDI/composable-resource-operator/blob/main/internal/utils/gpus.go#L192
While checking the cluster policy gives us hints about the gpu-operator configuration, it does not ensure that the driver is actually installed. Errors during the installation might result in the driver not being present.
Question: Should we verify if the driver is enabled by inspecting the current status of the node?
For instance, a file like /proc/driver/nvidia/version can give us insight into the current state of the node. As a bonus, this would also remove the direct dependency on the nvidia gpu-operator.