@wabouhamad
Collaborator

@wabouhamad wabouhamad commented May 16, 2025

Added an option to set the ClusterPolicy driver spec parameters before deploying on the cluster, via these optional environment variables:

  • NVIDIAGPU_GPU_DRIVER_IMAGE
  • NVIDIAGPU_GPU_DRIVER_REPO
  • NVIDIAGPU_GPU_DRIVER_VERSION
  • NVIDIAGPU_GPU_DRIVER_ENABLE_RDMA

modified: README.md
modified: internal/nvidiagpuconfig/config.go
modified: pkg/nvidiagpu/clusterpolicy.go
modified: tests/nvidiagpu/deploy-gpu-test.go

Summary by CodeRabbit

  • New Features
    • Added support for specifying GPU driver image, repository, version, and GPUDirect RDMA enablement via environment variables for NVIDIA GPU Operator deployments.
  • Documentation
    • Updated usage instructions and examples in the documentation to reflect new optional environment variables for GPU driver configuration.
  • Tests
    • Enhanced deployment tests to allow customization of GPU driver parameters using environment variables.

@openshift-ci

openshift-ci bot commented May 16, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wabouhamad

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Contributor

coderabbitai bot commented May 16, 2025

Walkthrough

The changes introduce new environment variables and configuration fields to support customization of NVIDIA GPU driver parameters, including image, repository, version, and RDMA enablement. Documentation is updated, configuration structs are extended, a utility function is added, and test logic is enhanced to apply these customizations during GPU operator deployment tests.

Changes

File(s) | Change Summary
--- | ---
README.md | Added documentation for four new optional NVIDIA GPU Operator environment variables, including example usage and notes.
internal/nvidiagpuconfig/config.go | Added four new fields to the NvidiaGPUConfig struct for GPU driver image, repository, version, and RDMA enablement, all configurable via environment variables.
pkg/nvidiagpu/clusterpolicy.go | Introduced a new utility function BoolPtr to return a pointer to a boolean value.
tests/nvidiagpu/deploy-gpu-test.go | Added variables and logic to read new GPU driver environment variables and update the ClusterPolicy driver spec accordingly during deployment tests. Improved logging formatting.

Sequence Diagram(s)

sequenceDiagram
    participant Env as Environment
    participant Test as deploy-gpu-test.go
    participant Config as NvidiaGPUConfig
    participant ClusterPolicy as ClusterPolicy Spec

    Env->>Test: Provides GPU driver env variables
    Test->>Config: Reads env variables into config fields
    Test->>Test: Sets updateGPUDriverSpec flag if any variable set
    Test->>ClusterPolicy: Updates driver image, repo, version, RDMA if flag set
    ClusterPolicy-->>Test: ClusterPolicy spec updated for deployment
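
The flow in the diagram can be sketched in Go as follows. This is a minimal, standard-library-only illustration and not the PR's code: `driverConfig`, `driverSpec`, `loadDriverConfig` and `applyDriverConfig` are hypothetical stand-ins for `NvidiaGPUConfig`, the ClusterPolicy driver spec and the test logic, and the project itself reads the variables through envconfig rather than through a lookup function.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// driverConfig is a hypothetical stand-in for the four new NvidiaGPUConfig fields.
type driverConfig struct {
	Image, Repo, Version string
	EnableRDMA           bool
}

// driverSpec is a hypothetical stand-in for the ClusterPolicy driver spec.
type driverSpec struct {
	Image, Repository, Version string
	RDMAEnabled                *bool
}

// loadDriverConfig reads the optional variables through lookup
// (os.LookupEnv in main, a map-backed function in tests).
func loadDriverConfig(lookup func(string) (string, bool)) driverConfig {
	get := func(key string) string {
		v, _ := lookup(key)
		return v
	}
	// An unset or unparsable value leaves RDMA disabled (documented default: false).
	rdma, _ := strconv.ParseBool(get("NVIDIAGPU_GPU_DRIVER_ENABLE_RDMA"))
	return driverConfig{
		Image:      get("NVIDIAGPU_GPU_DRIVER_IMAGE"),
		Repo:       get("NVIDIAGPU_GPU_DRIVER_REPO"),
		Version:    get("NVIDIAGPU_GPU_DRIVER_VERSION"),
		EnableRDMA: rdma,
	}
}

// applyDriverConfig copies only the values that were set and reports whether
// anything changed (the updateGPUDriverSpec flag in the diagram).
func applyDriverConfig(cfg driverConfig, spec *driverSpec) bool {
	updated := false
	if cfg.Image != "" {
		spec.Image = cfg.Image
		updated = true
	}
	if cfg.Repo != "" {
		spec.Repository = cfg.Repo
		updated = true
	}
	if cfg.Version != "" {
		spec.Version = cfg.Version
		updated = true
	}
	if cfg.EnableRDMA {
		enabled := true
		spec.RDMAEnabled = &enabled
		updated = true
	}
	return updated
}

func main() {
	// Placeholder defaults standing in for the values from the ALM example.
	spec := driverSpec{Image: "driver", Version: "default"}
	updated := applyDriverConfig(loadDriverConfig(os.LookupEnv), &spec)
	fmt.Printf("updated=%v image=%q repo=%q version=%q\n",
		updated, spec.Image, spec.Repository, spec.Version)
}
```

Passing the lookup function in, instead of calling `os.LookupEnv` directly, keeps the sketch testable without touching the process environment.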

Suggested labels

approved, lgtm

Suggested reviewers

  • fabiendupont
  • ggordaniRed

Poem

A bunny hopped into the code,
With GPU drivers in its load.
Now image, repo, version—check!
RDMA too, what the heck!
With docs and tests all up to date,
This hare declares the changes great! 🐇✨

@wabouhamad
Collaborator Author

/unnc @ybettan
/cc @empovit @TomerNewman

@openshift-ci openshift-ci bot requested review from TomerNewman and empovit May 16, 2025 04:23
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (5)
pkg/nvidiagpu/clusterpolicy.go (1)

228-231: Consider moving BoolPtr to a shared utilities package and making it generic

BoolPtr is a handy helper, but:

  1. The project already contains several utility helpers – co-locating this one under a generic pkg/util (or similar) avoids scattering helpers across domain packages (nvidiagpu is a business-domain package, not a util package).

  2. With Go 1.18 or later, generics let you implement a single helper that works for any value type:

// util/pointer.go
// Ptr returns a pointer to the supplied value.
func Ptr[T any](v T) *T { return &v }

That removes the need for one-off helpers (StringPtr, IntPtr, BoolPtr, …) and stays 100% type-safe.

If refactoring is not feasible now, at least add a short godoc-style comment starting with the function name (to appease golint).
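
As a usage sketch (assuming Go 1.18 or later; `rdmaSpec` is a hypothetical stand-in for a spec type with an optional boolean field), the generic helper can take the place of `BoolPtr` at call sites:

```go
package main

import "fmt"

// Ptr returns a pointer to a copy of v.
func Ptr[T any](v T) *T { return &v }

// rdmaSpec is a hypothetical stand-in for a spec with an optional boolean field.
type rdmaSpec struct {
	Enabled *bool
}

func main() {
	spec := rdmaSpec{Enabled: Ptr(true)} // where BoolPtr(true) was used before
	name := Ptr("driver")                // the same helper covers strings
	fmt.Println(*spec.Enabled, *name)
}
```

Because the argument is passed by value, the returned pointer refers to a copy, so later writes through it do not affect the caller's original variable.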

README.md (2)

82-86: Unify bullet style – fixes markdown-lint (MD004)

The surrounding list uses asterisks; these lines switch to dashes, which triggers the linter.

-- `NVIDIAGPU_GPU_DRIVER_IMAGE`: specific GPU driver image specified in clusterPolicy - _optional_
-- `NVIDIAGPU_GPU_DRIVER_REPO`: specific GPU driver image repository specified in clusterPolicy - _optional_
-- `NVIDIAGPU_GPU_DRIVER_VERSION`: specific GPU driver version specified in clusterPolicy - _optional_
-- `NVIDIAGPU_GPU_DRIVER_ENABLE_RDMA`: option to enable GPUDirect RDMA in clusterpolicy.  Default value is false - _optional_
+* `NVIDIAGPU_GPU_DRIVER_IMAGE`: specific GPU driver image specified in clusterPolicy - _optional_
+* `NVIDIAGPU_GPU_DRIVER_REPO`: specific GPU driver image repository specified in clusterPolicy - _optional_
+* `NVIDIAGPU_GPU_DRIVER_VERSION`: specific GPU driver version specified in clusterPolicy - _optional_
+* `NVIDIAGPU_GPU_DRIVER_ENABLE_RDMA`: option to enable GPUDirect RDMA in clusterpolicy.  Default value is false - _optional_
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

82-82: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


83-83: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


84-84: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


85-85: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


170-176: Same bullet-style inconsistency further down

For consistency apply the same change to the commented example block.

tests/nvidiagpu/deploy-gpu-test.go (2)

89-97: Prefer explicit zero values over sentinel string UndefinedValue

gpuDriverImage, gpuDriverRepo, and gpuDriverVersion are initialised with the sentinel UndefinedValue.
Because these are plain strings, the empty string already signals “unset”, which simplifies later checks:

- gpuDriverImage      = UndefinedValue
- gpuDriverRepo       = UndefinedValue
- gpuDriverVersion    = UndefinedValue
+ gpuDriverImage      string
+ gpuDriverRepo       string
+ gpuDriverVersion    string

This avoids accidental collisions when UndefinedValue is used for other purposes.


206-245: Repeated flag-setting logic can be collapsed

The four nearly identical blocks (GPUDriverImage, …Repo, …Version, …EnableRDMA) all:

  1. Check the env-derived value,
  2. Log,
  3. Assign to a package variable,
  4. Flip updateGPUDriverSpec = true.

This repetition is error-prone (easy to forget one field next time) and bloats the setup.
Consider a helper:

func setIfPresent[T comparable](val T, target *T) bool {
    var zero T
    if val != zero {
        *target = val
        return true
    }
    return false
}

updated := setIfPresent(nvidiaGPUConfig.GPUDriverImage, &gpuDriverImage)
updated = setIfPresent(nvidiaGPUConfig.GPUDriverRepo, &gpuDriverRepo) || updated
updated = setIfPresent(nvidiaGPUConfig.GPUDriverVersion, &gpuDriverVersion) || updated
if nvidiaGPUConfig.GPUDriverEnableRDMA {
    gpuDriverEnableRDMA = true
    updated = true
}
updateGPUDriverSpec = updated

This trims ~40 lines and keeps the intent crystal clear.
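
One detail of the snippet above is the operand order in `setIfPresent(...) || updated`. Go's `||` short-circuits, so writing `updated || setIfPresent(...)` would skip the call, and with it the assignment, once `updated` is already true. A small self-contained check, using hypothetical variable names:

```go
package main

import "fmt"

// setIfPresent copies val into target when val is not the zero value
// and reports whether it did so.
func setIfPresent[T comparable](val T, target *T) bool {
	var zero T
	if val != zero {
		*target = val
		return true
	}
	return false
}

func main() {
	var image, repo string

	// Call first, then OR with the flag: both helpers run.
	updated := setIfPresent("img", &image)
	updated = setIfPresent("repo", &repo) || updated
	fmt.Println(updated, image, repo) // prints: true img repo

	// Flag first: the helper is never called once the flag is true,
	// so skipped keeps its zero value.
	var skipped string
	updated = updated || setIfPresent("repo", &skipped)
	fmt.Printf("%v %q\n", updated, skipped) // prints: true ""
}
```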

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between edcec37 and 5ad2503.

📒 Files selected for processing (4)
  • README.md (2 hunks)
  • internal/nvidiagpuconfig/config.go (1 hunks)
  • pkg/nvidiagpu/clusterpolicy.go (1 hunks)
  • tests/nvidiagpu/deploy-gpu-test.go (6 hunks)

🔇 Additional comments (1)
internal/nvidiagpuconfig/config.go (1)

18-21: Double-prefix risk with envconfig.Process("nvidiagpu_", …)

The struct tags already contain the full variable name (NVIDIAGPU_GPU_DRIVER_IMAGE, …).
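Because envconfig.Process("nvidiagpu_", cfg) joins the prefix and the tag name with an underscore, the prefixed lookup key becomes
NVIDIAGPU__NVIDIAGPU_GPU_DRIVER_IMAGE (note the double underscore and the repeated prefix), which does not match the intended environment variable. If the library falls back to the bare tag name when the prefixed key is unset (kelseyhightower/envconfig does this for fields with an explicit envconfig tag), the variable is still found, but only through that fallback.

This pattern existed before, and the four new fields inherit it, so the prefix argument is misleading at best and look-ups may fail silently if the fallback does not apply.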

Two possible fixes:

- if err := envconfig.Process("nvidiagpu_", cfg); err != nil {
+ if err := envconfig.Process("NVIDIAGPU", cfg); err != nil { // let envconfig add the underscore

or keep the current call and drop the prefix in the struct tags:

- GPUDriverImage string `envconfig:"NVIDIAGPU_GPU_DRIVER_IMAGE"`
+ GPUDriverImage string `envconfig:"GPU_DRIVER_IMAGE"`

Please verify which pattern the rest of the code relies on before merging.
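
To make the lookup concrete, the sketch below emulates how keys are composed, under the assumption that the library is kelseyhightower/envconfig: prefix and tag are joined with an underscore and upper-cased, and a field with an explicit `envconfig` tag is also looked up under the bare tag name. `candidateKeys` is a hypothetical helper written for this illustration, not part of the library, and the behaviour should be confirmed against the version the project pins.

```go
package main

import (
	"fmt"
	"strings"
)

// candidateKeys returns the prefixed key and the bare-tag fallback that the
// assumed lookup would try, in that order.
func candidateKeys(prefix, tag string) (primary, fallback string) {
	fallback = strings.ToUpper(tag)
	primary = fallback
	if prefix != "" {
		primary = strings.ToUpper(prefix + "_" + tag)
	}
	return primary, fallback
}

func main() {
	cases := []struct{ prefix, tag string }{
		{"nvidiagpu_", "NVIDIAGPU_GPU_DRIVER_IMAGE"}, // call and tag as described in this comment
		{"NVIDIAGPU", "GPU_DRIVER_IMAGE"},            // prefix stated once, short tag
	}
	for _, c := range cases {
		primary, fallback := candidateKeys(c.prefix, c.tag)
		fmt.Printf("prefix=%q tag=%q -> %s (fallback %s)\n", c.prefix, c.tag, primary, fallback)
	}
}
```

Under that assumption, the first case only resolves through the fallback, while the second resolves on the prefixed key itself.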

Comment on lines +693 to +719
        if clusterPolicyBuilder.Definition.Spec.Driver.GPUDirectRDMA == nil {
            clusterPolicyBuilder.Definition.Spec.Driver.GPUDirectRDMA = &nvidiagpuv1.GPUDirectRDMASpec{}
        }

        // Now it's safe to set the Enabled field
        clusterPolicyBuilder.Definition.Spec.Driver.GPUDirectRDMA.Enabled =
            nvidiagpu.BoolPtr(gpuDriverEnableRDMA)
    }

    if gpuDriverImage != UndefinedValue {
        glog.V(gpuparams.GpuLogLevel).Infof("Updating ClusterPolicy object driver image param "+
            "to '%s'", gpuDriverImage)
        clusterPolicyBuilder.Definition.Spec.Driver.Image = gpuDriverImage
    }

    if gpuDriverRepo != UndefinedValue {
        glog.V(gpuparams.GpuLogLevel).Infof("Updating ClusterPolicy object driver repository "+
            "param to '%s'", gpuDriverRepo)
        clusterPolicyBuilder.Definition.Spec.Driver.Repository = gpuDriverRepo
    }

    if gpuDriverVersion != UndefinedValue {
        glog.V(gpuparams.GpuLogLevel).Infof("Updating ClusterPolicy object driver version param "+
            "to '%s'", gpuDriverVersion)
        clusterPolicyBuilder.Definition.Spec.Driver.Version = gpuDriverVersion
    }
}

⚠️ Potential issue

Nil-safety before modifying nested structs

clusterPolicyBuilder.Definition.Spec.Driver is accessed without a nil check.
If Driver is a pointer field and the ALM example ever omits the driver stanza, the test will panic. If it is a value field, as the assignments in this snippet suggest, it can never be nil: the address of a struct field is always non-nil, so a `driver == nil` branch would be dead code and only the nested GPUDirectRDMA pointer needs the guard. For the value-field case, a shorter form is:

- if clusterPolicyBuilder.Definition.Spec.Driver.GPUDirectRDMA == nil {
-     clusterPolicyBuilder.Definition.Spec.Driver.GPUDirectRDMA = &nvidiagpuv1.GPUDirectRDMASpec{}
- }
+driver := &clusterPolicyBuilder.Definition.Spec.Driver
+if driver.GPUDirectRDMA == nil {
+    driver.GPUDirectRDMA = &nvidiagpuv1.GPUDirectRDMASpec{}
+}

The later Image, Repository, and Version assignments can then go through the same driver pointer.
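
Which nil checks are needed depends on whether the fields are values or pointers. The sketch below uses hypothetical stand-in types (`policySpec`, `driverSpec`, `rdmaSpec`), with the driver held by value and the RDMA settings held by pointer; the real `nvidiagpuv1` types should be checked before relying on it. With that layout the address of the driver field is never nil, so only the nested pointer needs guarding.

```go
package main

import "fmt"

// rdmaSpec, driverSpec and policySpec are hypothetical stand-ins for the
// ClusterPolicy types; the field layout is an assumption.
type rdmaSpec struct{ Enabled *bool }

type driverSpec struct {
	Image         string
	GPUDirectRDMA *rdmaSpec
}

type policySpec struct{ Driver driverSpec }

// enableRDMA guards the only pointer on the path before writing through it.
func enableRDMA(spec *policySpec, enabled bool) {
	driver := &spec.Driver // address of a value field: never nil
	if driver.GPUDirectRDMA == nil {
		driver.GPUDirectRDMA = &rdmaSpec{}
	}
	driver.GPUDirectRDMA.Enabled = &enabled
}

func main() {
	var spec policySpec // zero value: GPUDirectRDMA is nil
	enableRDMA(&spec, true)
	fmt.Println(*spec.Driver.GPUDirectRDMA.Enabled) // prints: true
}
```

If the driver field were itself a pointer, it would need its own nil check and the new value would have to be stored back on the spec before any nested field is written.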

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tests/nvidiagpu/deploy-gpu-test.go around lines 693 to 719, the code
dereferences clusterPolicyBuilder.Definition.Spec.Driver without checking if
Driver is nil, which can cause a panic if the driver stanza is missing. Add a
nil check for clusterPolicyBuilder.Definition.Spec.Driver before accessing or
modifying its fields, and initialize it if nil. Apply the same nil-safety check
before setting the Image, Repository, and Version fields to prevent panics.

@wabouhamad
Collaborator Author

/uncc @fabiendupont

@openshift-ci openshift-ci bot removed the request for review from fabiendupont May 16, 2025 04:47
@openshift-merge-robot
Collaborator

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci

openshift-ci bot commented Oct 19, 2025

@wabouhamad: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
ci/prow/lint | 5ad2503 | link | true | /test lint

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
