feat: GPU operator should skip driver installation if drivers already exist and install otherwise #475

@Jont828

Prerequisites

  • I searched existing issues

Feature Summary

GPU operator can install the GPU drivers itself, controlled by the driver.enabled flag. However, providers like EKS, GKE, and AKS can also ship their own managed GPU driver, and GPU operator will conflict with it if it tries to install its own. Conversely, if the provider doesn't install a driver and GPU operator also skips installation, the cluster fails too. I'd like AICR to detect whether the GPU nodes already have a pre-installed driver and use that to set the driver.enabled flag in the GPU operator values. The idea is that AICR simply makes the cluster conformant and ready to run dynamo, etc.
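A minimal sketch of the mapping described above, assuming AICR has already determined whether a driver is pre-installed. The function name `gpu_operator_driver_values` and its boolean input are hypothetical; how the detection itself is done is not covered here.

```python
from typing import Dict


def gpu_operator_driver_values(driver_preinstalled: bool) -> Dict[str, Dict[str, bool]]:
    """Build a Helm values override for GPU operator's driver.enabled flag.

    `driver_preinstalled` is a hypothetical input: the result of whatever
    detection AICR performs on the GPU nodes.
    """
    # Let GPU operator install the driver only when the nodes lack one,
    # so it never competes with a provider-managed driver.
    return {"driver": {"enabled": not driver_preinstalled}}


if __name__ == "__main__":
    print(gpu_operator_driver_values(True))
    print(gpu_operator_driver_values(False))
```

The returned dictionary mirrors the nesting of the driver.enabled key, so it could be merged into the rest of the GPU operator values before rendering.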

As of now, the Helm values for GKE always set driver.enabled to false, while the templates for AKS set driver.enabled to true. This feature would remove the inconsistency between providers and ensure that the drivers and operator work regardless of what the user brings in their cluster.

Note: If some nodes have GPU drivers installed while others do not, that could be a tricky situation. For example, if you create one AKS node pool with the --gpu-driver None flag and another AKS node pool with the --gpu-driver Install flag, we can't just query one node to check for the driver and assume the result applies to all nodes.
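One way to surface the mixed case is to aggregate per-node detection results by node pool instead of sampling a single node. The sketch below assumes a hypothetical `GpuNode` record whose `has_driver` field was filled in by some earlier detection step; the state names ("preinstalled", "missing", "mixed", "unknown") are illustrative, not existing AICR terms.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class GpuNode:
    """Hypothetical per-node detection result."""
    name: str
    pool: str
    has_driver: bool


def driver_state_by_pool(nodes: Iterable[GpuNode]) -> Dict[str, str]:
    """Classify each node pool as 'preinstalled', 'missing', or 'mixed'."""
    seen: Dict[str, Set[bool]] = defaultdict(set)
    for node in nodes:
        seen[node.pool].add(node.has_driver)

    states: Dict[str, str] = {}
    for pool, flags in seen.items():
        if flags == {True}:
            states[pool] = "preinstalled"
        elif flags == {False}:
            states[pool] = "missing"
        else:
            states[pool] = "mixed"
    return states


def cluster_driver_state(nodes: Iterable[GpuNode]) -> str:
    """Summarize the whole cluster; any disagreement is reported as 'mixed'."""
    states = set(driver_state_by_pool(nodes).values())
    if not states:
        return "unknown"  # no GPU nodes were inspected
    if states == {"preinstalled"}:
        return "preinstalled"
    if states == {"missing"}:
        return "missing"
    return "mixed"


if __name__ == "__main__":
    nodes: List[GpuNode] = [
        GpuNode("node-a", "pool-none", has_driver=False),
        GpuNode("node-b", "pool-install", has_driver=True),
    ]
    print(driver_state_by_pool(nodes))
    print(cluster_driver_state(nodes))
```

With this shape, a single cluster-wide driver.enabled value is only safe for the "preinstalled" and "missing" results. A "mixed" result is the tricky situation from the note and would need its own handling, which this sketch deliberately leaves open.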

Success Criteria

When you run AICR and deploy to a cluster with a pre-installed driver, GPU operator will work and there won't be a driver conflict. When you run AICR and deploy to a cluster without a pre-installed driver, it will install GPU operator along with the driver.

Alternatives Considered

No response

Component

Recipe engine / data

Priority

Important (would improve my workflow)

Compatibility / Breaking Changes

No response

Operational Considerations

No response

Are you willing to contribute?

Yes, I can open a PR

Labels: enhancement (New feature or request)
