-
Notifications
You must be signed in to change notification settings - Fork 11
Support clusterpolicy driver spec updates #176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Support clusterpolicy driver spec updates #176
Conversation
Added option to specify in clusterpolicy Driver spec parameters before deploying on the cluster via these optional environment variables: - NVIDIAGPU_GPU_DRIVER_IMAGE - NVIDIAGPU_GPU_DRIVER_REPO - NVIDIAGPU_GPU_DRIVER_VERSION - NVIDIAGPU_GPU_DRIVER_ENABLE_RDMA modified: README.md modified: internal/nvidiagpuconfig/config.go modified: pkg/nvidiagpu/clusterpolicy.go modified: tests/nvidiagpu/deploy-gpu-test.go
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: wabouhamad The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
WalkthroughThe changes introduce new environment variables and configuration fields to support customization of NVIDIA GPU driver parameters, including image, repository, version, and RDMA enablement. Documentation is updated, configuration structs are extended, a utility function is added, and test logic is enhanced to apply these customizations during GPU operator deployment tests. Changes
Sequence Diagram(s)sequenceDiagram
participant Env as Environment
participant Test as deploy-gpu-test.go
participant Config as NvidiaGPUConfig
participant ClusterPolicy as ClusterPolicy Spec
Env->>Test: Provides GPU driver env variables
Test->>Config: Reads env variables into config fields
Test->>Test: Sets updateGPUDriverSpec flag if any variable set
Test->>ClusterPolicy: Updates driver image, repo, version, RDMA if flag set
ClusterPolicy-->>Test: ClusterPolicy spec updated for deployment
Suggested labels
Suggested reviewers
Poem
Note ⚡️ AI Code Reviews for VS Code, Cursor, WindsurfCodeRabbit now has a plugin for VS Code, Cursor and Windsurf. This brings AI code reviews directly in the code editor. Each commit is reviewed immediately, finding bugs before the PR is raised. Seamless context handoff to your AI code agent ensures that you can easily incorporate review feedback. Note ⚡️ Faster reviews with cachingCodeRabbit now supports caching for code and dependencies, helping speed up reviews. This means quicker feedback, reduced wait times, and a smoother review experience overall. Cached data is encrypted and stored securely. This feature will be automatically enabled for all accounts on May 16th. To opt out, configure ✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
|
/unnc @ybettan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (5)
pkg/nvidiagpu/clusterpolicy.go (1)
228-231: Consider movingBoolPtrto a shared utilities package and making it generic
BoolPtris a handy helper, but:
The project already contains several utility helpers – co-locating this one under a generic
pkg/util(or similar) avoids scattering helpers across domain packages (nvidiagpuis a business-domain package, not a util package).With Go 1.18 you can implement a single generic helper that works for any value type:
// util/pointer.go // Ptr returns a pointer to the supplied value. func Ptr[T any](v T) *T { return &v }That removes the need for one‐off helpers (
StringPtr,IntPtr,BoolPtr, …) and stays 100 % type-safe.If refactoring is not feasible now, at least add a short godoc style comment starting with the function name (to appease
golint).README.md (2)
82-86: Unify bullet style – fixes markdown-lint (MD004)The surrounding list uses asterisks; these lines switched to dashes which triggers the linter.
-* `NVIDIAGPU_GPU_DRIVER_IMAGE`: specific GPU driver image specified in clusterPolicy - _optional_ -* `NVIDIAGPU_GPU_DRIVER_REPO`: specific GPU driver image repository specified in clusterPolicy - _optional_ -* `NVIDIAGPU_GPU_DRIVER_VERSION`: specific GPU driver version specified in clusterPolicy - _optional_ -* `NVIDIAGPU_GPU_DRIVER_ENABLE_RDMA`: option to enable GPUDirect RDMA in clusterpolicy. Default value is false - _optional_ +* `NVIDIAGPU_GPU_DRIVER_IMAGE`: specific GPU driver image specified in clusterPolicy - _optional_ +* `NVIDIAGPU_GPU_DRIVER_REPO`: specific GPU driver image repository specified in clusterPolicy - _optional_ +* `NVIDIAGPU_GPU_DRIVER_VERSION`: specific GPU driver version specified in clusterPolicy - _optional_ +* `NVIDIAGPU_GPU_DRIVER_ENABLE_RDMA`: option to enable GPUDirect RDMA in clusterpolicy. Default value is false - _optional_🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
82-82: Unordered list style
Expected: asterisk; Actual: dash(MD004, ul-style)
83-83: Unordered list style
Expected: asterisk; Actual: dash(MD004, ul-style)
84-84: Unordered list style
Expected: asterisk; Actual: dash(MD004, ul-style)
85-85: Unordered list style
Expected: asterisk; Actual: dash(MD004, ul-style)
170-176: Same bullet-style inconsistency further downFor consistency apply the same change to the commented example block.
tests/nvidiagpu/deploy-gpu-test.go (2)
89-97: Prefer explicit zero values over sentinel stringUndefinedValue
gpuDriverImage,gpuDriverRepo, andgpuDriverVersionare initialised with the sentinelUndefinedValue.
Because these are plain strings, the empty string already signals “unset”, which simplifies later checks:- gpuDriverImage = UndefinedValue - gpuDriverRepo = UndefinedValue - gpuDriverVersion = UndefinedValue + gpuDriverImage string + gpuDriverRepo string + gpuDriverVersion stringThis avoids accidental collisions when
UndefinedValueis used for other purposes.
206-245: Repeated flag-setting logic can be collapsedThe four nearly identical blocks (
GPUDriverImage,…Repo,…Version,…EnableRDMA) all:
- Check the env-derived value,
- Log,
- Assign to a package variable,
- Flip
updateGPUDriverSpec = true.This repetition is error-prone (easy to forget one field next time) and bloats the setup.
Consider a helper:func setIfPresent[T comparable](val T, target *T) bool { var zero T if val != zero { *target = val return true } return false } updated := setIfPresent(nvidiaGPUConfig.GPUDriverImage, &gpuDriverImage) updated = setIfPresent(nvidiaGPUConfig.GPUDriverRepo, &gpuDriverRepo) || updated updated = setIfPresent(nvidiaGPUConfig.GPUDriverVersion, &gpuDriverVersion) || updated if nvidiaGPUConfig.GPUDriverEnableRDMA { gpuDriverEnableRDMA = true updated = true } updateGPUDriverSpec = updatedThis trims ~40 lines and keeps the intent crystal clear.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
README.md(2 hunks)internal/nvidiagpuconfig/config.go(1 hunks)pkg/nvidiagpu/clusterpolicy.go(1 hunks)tests/nvidiagpu/deploy-gpu-test.go(6 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
README.md
82-82: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
83-83: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
84-84: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
85-85: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
🔇 Additional comments (1)
internal/nvidiagpuconfig/config.go (1)
18-21: Double-prefix risk withenvconfig.Process("nvidiagpu_", …)The struct tags already contain the full variable name (
NVIDIAGPU_GPU_DRIVER_IMAGE, …).
Becauseenvconfig.Process("nvidiagpu_", cfg)automatically adds an underscore after the prefix, the effective lookup key becomes
NVIDIAGPU__GPU_DRIVER_IMAGE(note the double underscore) which will never match the intended environment variable.This pattern existed before, but the four new fields inherit the same issue – so look-ups will silently fail.
Two possible fixes:
- if err := envconfig.Process("nvidiagpu_", cfg); err != nil { + if err := envconfig.Process("NVIDIAGPU", cfg); err != nil { // let envconfig add the underscoreor keep the current call and drop the prefix in the struct tags:
- GPUDriverImage string `envconfig:"NVIDIAGPU_GPU_DRIVER_IMAGE"` + GPUDriverImage string `envconfig:"GPU_DRIVER_IMAGE"`Please verify which pattern the rest of the code relies on before merging.
| if clusterPolicyBuilder.Definition.Spec.Driver.GPUDirectRDMA == nil { | ||
| clusterPolicyBuilder.Definition.Spec.Driver.GPUDirectRDMA = &nvidiagpuv1.GPUDirectRDMASpec{} | ||
| } | ||
|
|
||
| // Now it's safe to set the Enabled field | ||
| clusterPolicyBuilder.Definition.Spec.Driver.GPUDirectRDMA.Enabled = | ||
| nvidiagpu.BoolPtr(gpuDriverEnableRDMA) | ||
| } | ||
|
|
||
| if gpuDriverImage != UndefinedValue { | ||
| glog.V(gpuparams.GpuLogLevel).Infof("Updating ClusterPolicy object driver image param "+ | ||
| "to '%s'", gpuDriverImage) | ||
| clusterPolicyBuilder.Definition.Spec.Driver.Image = gpuDriverImage | ||
| } | ||
|
|
||
| if gpuDriverRepo != UndefinedValue { | ||
| glog.V(gpuparams.GpuLogLevel).Infof("Updating ClusterPolicy object driver repository "+ | ||
| "param to '%s'", gpuDriverRepo) | ||
| clusterPolicyBuilder.Definition.Spec.Driver.Repository = gpuDriverRepo | ||
| } | ||
|
|
||
| if gpuDriverVersion != UndefinedValue { | ||
| glog.V(gpuparams.GpuLogLevel).Infof("Updating ClusterPolicy object driver version param "+ | ||
| "to '%s'", gpuDriverVersion) | ||
| clusterPolicyBuilder.Definition.Spec.Driver.Version = gpuDriverVersion | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nil-safety before modifying nested structs
clusterPolicyBuilder.Definition.Spec.Driver is dereferenced without a nil check.
If the ALM example ever omits the driver stanza, the test will panic.
- if clusterPolicyBuilder.Definition.Spec.Driver.GPUDirectRDMA == nil {
- clusterPolicyBuilder.Definition.Spec.Driver.GPUDirectRDMA = &nvidiagpuv1.GPUDirectRDMASpec{}
- }
+driver := &clusterPolicyBuilder.Definition.Spec.Driver
+if driver == nil {
+ driver = &nvidiagpuv1.DriverSpec{}
+ clusterPolicyBuilder.Definition.Spec.Driver = *driver
+}
+if driver.GPUDirectRDMA == nil {
+ driver.GPUDirectRDMA = &nvidiagpuv1.GPUDirectRDMASpec{}
+}Same precaution applies to the later Image, Repository, and Version assignments.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In tests/nvidiagpu/deploy-gpu-test.go around lines 693 to 719, the code
dereferences clusterPolicyBuilder.Definition.Spec.Driver without checking if
Driver is nil, which can cause a panic if the driver stanza is missing. Add a
nil check for clusterPolicyBuilder.Definition.Spec.Driver before accessing or
modifying its fields, and initialize it if nil. Apply the same nil-safety check
before setting the Image, Repository, and Version fields to prevent panics.
|
/uncc @fabiendupont |
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
1 similar comment
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@wabouhamad: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Added option to specify in clusterpolicy Driver spec parameters before deploying on the cluster via these optional environment variables:
modified: README.md
modified: internal/nvidiagpuconfig/config.go
modified: pkg/nvidiagpu/clusterpolicy.go
modified: tests/nvidiagpu/deploy-gpu-test.go
Summary by CodeRabbit