feat: GPU operator should skip driver installation if drivers already exist and install otherwise #475
Description
Prerequisites
- I searched existing issues
Feature Summary
GPU operator can install the GPU drivers itself, controlled by the driver.enabled flag. But providers like EKS, GKE, and AKS can also ship their own managed GPU driver, and the two conflict if GPU operator tries to install its own driver on top. Conversely, if the provider doesn't install a driver and GPU operator also skips installation, the nodes end up with no driver and the cluster fails too. I'd like AICR to detect whether the GPU nodes already have a pre-installed driver and use that to decide how to set the driver.enabled flag in the GPU operator values. The idea is that AICR just makes the cluster conformant and ready to run dynamo, etc.
As of now, the Helm values for GKE always set driver.enabled to false, while the templates for AKS set driver.enabled to true. This feature would remove the inconsistency between providers and ensure that the drivers and the operator work regardless of what the user brings in their cluster.
Note: A cluster where some nodes have GPU drivers installed while others do not could be a tricky situation. For example, if you create one AKS node pool with the --gpu-driver None flag and another AKS node pool with the --gpu-driver Install flag, we couldn't just query one node for the driver and assume the result applies to all nodes.
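To make the mixed-pool case concrete, here is a rough sketch of the decision logic, not a proposal for the actual implementation. It assumes some per-node probe already exists and that each GPU node can be mapped to a node pool; `GpuNode`, `driver_preinstalled`, `driver_enabled_by_pool`, and `cluster_driver_enabled` are hypothetical names made up for illustration. Only the driver.enabled flag itself comes from GPU operator.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional


class GpuNode(NamedTuple):
    name: str
    pool: str                  # hypothetical: node pool / node group name
    driver_preinstalled: bool  # hypothetical: result of a per-node driver probe


def driver_enabled_by_pool(nodes: Iterable[GpuNode]) -> Dict[str, bool]:
    """Return, for each pool, the value driver.enabled would need there."""
    by_pool = defaultdict(list)
    for node in nodes:
        by_pool[node.pool].append(node.driver_preinstalled)

    decisions = {}
    for pool, flags in by_pool.items():
        if all(flags):
            decisions[pool] = False   # provider driver present: skip install
        elif not any(flags):
            decisions[pool] = True    # no driver anywhere: operator installs it
        else:
            raise ValueError(f"pool {pool!r} mixes nodes with and without a driver")
    return decisions


def cluster_driver_enabled(nodes: Iterable[GpuNode]) -> Optional[bool]:
    """Return one cluster-wide value, or None if the pools disagree."""
    values = set(driver_enabled_by_pool(nodes).values())
    if not values:
        raise ValueError("no GPU nodes found")
    return values.pop() if len(values) == 1 else None


nodes = [
    GpuNode("n1", "pool-a", True),
    GpuNode("n2", "pool-a", True),
    GpuNode("n3", "pool-b", False),
]
print(driver_enabled_by_pool(nodes))
print(cluster_driver_enabled(nodes))
```

A `None` result from `cluster_driver_enabled` is the tricky situation described above: a single driver.enabled value can't be right for every pool, so AICR would need to either handle pools separately or report the mismatch to the user.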
Success Criteria
When you run AICR and deploy to a cluster with a pre-installed driver, GPU operator works and there is no driver conflict. When you run AICR and deploy to a cluster without a pre-installed driver, it installs GPU operator along with the driver.
Alternatives Considered
No response
Component
Recipe engine / data
Priority
Important (would improve my workflow)
Compatibility / Breaking Changes
No response
Operational Considerations
No response
Are you willing to contribute?
Yes, I can open a PR