
Conversation

@vtruhpon (Collaborator) commented Dec 11, 2025

Summary by CodeRabbit

  • New Features

    • Complete MIG testing/workflow support including MIG-enabled GPU-burn workloads, node label management, MIG profile handling, new GPU log level constants, node-label and cluster-policy polling helpers, and a configmap deletion wait.
  • Tests

    • New end-to-end MIG test suite and expanded MIG test paths for single/mixed MIG profiles.
  • Bug Fixes

    • Improved error handling when no matching pods are found.
  • Documentation

    • Expanded MIG testing docs, env vars, examples, and NO_COLOR logging guidance.


coderabbitai bot commented Dec 11, 2025

Walkthrough

Adds MIG (Multi-Instance GPU) support: new MIG-aware GPU-burn pod creation, MIG profile discovery/parsing, node labeling and ClusterPolicy orchestration, polling/wait helpers, test suites and test integrations, config/constant additions, docs, and cleanup/logging utilities.

Changes

Cohort / File(s) Summary
MIG core & orchestration
pkg/mig/mig.go
New large implementation for MIG workflows: MIG profile discovery/parsing, selection, node label management, ClusterPolicy updates and readiness polling, workload provisioning (namespaces/configmaps/pods), pod exec/log collection, validation, and many cleanup helpers; adds MIGProfileInfo and multiple public APIs.
Tests & suites
tests/mig/mig_suite_test.go, tests/mig/mig-test.go, tests/nvidiagpu/deploy-gpu-test.go
New Ginkgo MIG suite and tests; integrates MIG flows into GPU tests, adds lifecycle hooks, MIG gating and cleanup logic, and MIG-specific test cases and helpers.
GPU-burn pod creation
internal/gpu-burn/gpu-burn.go
Added CreateGPUBurnPodWithMIG(...) which builds a pod requesting MIG resources (nvidia.com/<migProfile>), configmap entrypoint, node selectors, tolerations, security context, and container settings.
Configuration & params
internal/nvidiagpuconfig/config.go, internal/tsparams/mig-vars.go
Added SingleMIGProfile and MixedMIGProfile to NvidiaGPUConfig; introduced MigLabels, MigReporterNamespacesToDump, and MigReporterCRDsToDump for MIG test reporting.
Constants & log levels
internal/gpuparams/const.go, pkg/nvidiagpu/consts.go
Added Gpu10LogLevel, Gpu50LogLevel, Gpu100LogLevel and new timing constants for ClusterPolicy/label polling (intervals/timeouts).
Wait / polling helpers
internal/wait/wait.go
New exported pollers: ClusterPolicyNotReady and NodeLabelExists to wait for policy state and node label presence.
Node builder
pkg/nodes/node.go
Added WithLabel(key, value string) *Builder to set/overwrite node labels on builder; updated comment for WithNewLabel.
Get & pod utilities
internal/get/get.go, pkg/pod/pod.go
GetFirstPodNameWithLabel now propagates pod-list errors and returns an explicit error when no pods match; WaitUntilDeleted uses a polling context when fetching pods.
ConfigMap lifecycle
pkg/configmap/configmap.go
Added Builder.WaitUntilDeleted(timeout time.Duration) error to poll for configmap deletion.
Docs & README
README.md
Expanded MIG testing documentation, new env var docs (NVIDIAGPU_SINGLE_MIG_PROFILE), examples, run and cleanup instructions.
Version mapping
workflows/gpu_operator_versions/versions.json
Removed OpenShift 4.21 rc entry from versions mapping.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Suite as MIG Test Suite
    participant API as Kubernetes API
    participant Nodes as Worker Nodes
    participant Policy as GPU Operator ClusterPolicy
    participant Pod as GPU-burn Pod
    participant Logs as Log Collector

    Suite->>API: list nodes (nodeSelector)
    API-->>Suite: node list
    Suite->>Suite: parse/select MIG profile(s)
    Suite->>API: label nodes for chosen MIG strategy
    API-->>Nodes: labels applied
    Suite->>Policy: fetch ClusterPolicy (capture resourceVersion)
    Suite->>Policy: update ClusterPolicy with MIG config
    Policy-->>Suite: updated resourceVersion/state
    Suite->>Policy: poll ClusterPolicy readiness / notReady
    Suite->>API: poll until node label exists
    Suite->>API: create namespace & configmap for burn
    Suite->>API: create GPU-burn Pod requesting nvidia.com/<migProfile>
    Pod->>Pod: run burn across MIG instances
    Pod-->>Logs: emit per-slice logs
    Suite->>Pod: wait for completion
    Suite->>API: fetch pod logs
    Suite->>Suite: parse & validate logs per MIG instance
    Suite->>API: cleanup pod/configmap/namespace and reset labels/ClusterPolicy

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • Nfd refractoring #55 — Overlaps modifications to tests/nvidiagpu/deploy-gpu-test.go and test wiring.
  • Nfd refactoring #80 — Related changes around NFD integration and test configuration used by the new MIG tests.

Suggested labels

lgtm, ok-to-test

Suggested reviewers

  • empovit
  • wabouhamad
  • TomerNewman

Poem

🐰
I hopped where MIG labels grow,
Sliced the GPUs in tidy row,
ClusterPolicy drummed the beat,
Pods burned warm — the logs were sweet,
Rabbit cheers — the tests all glow!

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 inconclusive)
  • Title check ❓ Inconclusive: The title "Single mig first commit" is only loosely related to the changeset. It references MIG, a core feature added here, but "first commit" is vague and does not convey the substance of the work. Consider a more descriptive title such as "Add MIG (Multi-Instance GPU) support for GPU burn testing" or "Implement GPU burn testing with MIG profile support".
✅ Passed checks (2 passed)
  • Description check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Docstring coverage ✅ Passed: Docstring coverage is 95.65%, above the required threshold of 80.00%.



coderabbitai bot left a comment

Actionable comments posted: 11

🧹 Nitpick comments (17)
internal/gpuparams/gpuvars.go (1)

7-28: Remove commented-out code before merging.

These commented-out function definitions add noise without value. If they're needed in the future, they can be added at that time. Version control preserves history, so there's no need to keep dead code in comments.

internal/gpu-burn/gpu-burn.go (2)

61-63: Consider adding input validation for migProfile and migCount.

An empty migProfile would create an invalid resource name nvidia.com/, and a non-positive migCount would result in invalid resource requests. Defensive validation would prevent subtle runtime failures.

 func CreateGPUBurnPodWithMIG(apiClient *clients.Settings, podName, podNamespace string,
 	gpuBurnImage string, migProfile string, migCount int, timeout time.Duration) (*corev1.Pod, error) {
+	if migProfile == "" {
+		return nil, fmt.Errorf("migProfile cannot be empty")
+	}
+	if migCount <= 0 {
+		return nil, fmt.Errorf("migCount must be positive, got %d", migCount)
+	}
+
 	var volumeDefaultMode int32 = 0777

62-63: The timeout parameter is unused.

The timeout parameter is declared but never referenced in the function body. Either use it (e.g., for pod readiness waiting) or remove it to avoid confusion. Note that CreateGPUBurnPod has the same issue.

tests/mig/mig_suite_test.go (1)

23-23: Consider using tsparams.MigLabels for consistency.

The mig-vars.go file defines MigLabels specifically for MIG tests (which extends gpuparams.Labels with LabelSuite). Using tsparams.Labels here may not include MIG-specific labeling. Verify if this is intentional.

-	RunSpecs(t, "MIG", Label(tsparams.Labels...), reporterConfig)
+	RunSpecs(t, "MIG", Label(tsparams.MigLabels...), reporterConfig)
tests/mig/mig-test.go (3)

3-42: Clean up commented-out imports.

Multiple commented-out imports indicate incomplete refactoring. Consider removing them to improve code clarity, or add a TODO comment explaining why they're preserved.


64-68: Unused variables declared.

ScaleCluster, UseSingleMIGProfile, UseMixedMIGProfile, SingleMigProfile, and MixedMigProfile are declared but never used in this file. Remove them or add a TODO if they're planned for future use.

-	ScaleCluster        = false
-	UseSingleMIGProfile = false
-	UseMixedMIGProfile  = false
-	SingleMigProfile    = UndefinedValue
-	MixedMigProfile     = UndefinedValue

116-119: Use Skip() for unimplemented tests.

The placeholder test should use Ginkgo's Skip() function to clearly indicate it's not implemented, rather than just logging. This provides better test reporting.

 		It("Test GPU Burn with mixed strategy MIG Configuration", Label("gpu-mixed-mig"), func() {
-			glog.V(gpuparams.Gpu10LogLevel).Infof("gpu-mixed-mig testcase not yet implemented")
-			// mig.TestMixedMIGGPUBurn(nvidiaGPUConfig, burn, BurnImageName, WorkerNodeSelector, cleanupAfterTest)
+			Skip("gpu-mixed-mig testcase not yet implemented")
 		})
tests/nvidiagpu/deploy-gpu-test.go (3)

1184-1205: Remove commented-out code.

The shouldKeepOperator function is entirely commented out. If it's been replaced by mig.ShouldKeepOperator, remove this dead code to improve maintainability.


1267-1271: Redundant return after Skip.

Skip() in Ginkgo panics to abort the test, so the return statement on line 1270 is unreachable. Remove it for clarity.

 	if len(migCapabilities) == 0 {
 		glog.V(gpuparams.GpuLogLevel).Infof("No MIG configurations available")
 		Skip("No MIG configurations found")
-		return
 	}

1111-1182: Consider extracting cleanupGPUOperatorResources to shared package.

This function largely duplicates mig.CleanupGPUOperatorResources (shown in relevant snippets at pkg/mig/mig.go lines 186-254). Consider reusing the existing function or consolidating to avoid drift.

internal/get/get.go (2)

130-144: Hardcoded namespace may limit flexibility.

The namespace "nvidia-gpu-operator" is hardcoded. If the GPU operator is installed in a different namespace (e.g., custom installation), this function will fail to discover MIG capabilities. Consider making this configurable or deriving it from the cluster configuration.


315-333: Fragile state management between line1 and line2 parsing.

The exclude flag is used to track whether to skip line2 after an excluded line1. However, if the input contains unexpected patterns (e.g., malformed output, extra blank lines between pairs), the line2 regex could match an unrelated line and incorrectly update the last profile.

Consider a more robust approach like tracking expected line2 state explicitly or using a multiline regex to match both lines together.
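
A sketch of the multiline-regex alternative. The two-line sample format below is illustrative only (the real MIG capability output differs), but it shows how consuming both lines of a pair in a single match prevents a stray continuation line from attaching to the wrong profile:

```go
package main

import (
	"fmt"
	"regexp"
)

// matchProfilePairs matches a profile header line and its continuation
// line together, so an unrelated line between pairs can never be
// mistaken for a continuation. The line shapes are hypothetical.
func matchProfilePairs(out string) [][]string {
	// (?m) makes ^/$ anchor at line boundaries; the pattern spans
	// exactly two consecutive lines per match.
	pairRe := regexp.MustCompile(`(?m)^\s*MIG (\S+)\s*$\n^\s*count: (\d+)\s*$`)
	return pairRe.FindAllStringSubmatch(out, -1)
}

func main() {
	sample := "MIG 1g.5gb\ncount: 7\nunrelated noise\nMIG 2g.10gb\ncount: 3\n"
	for _, m := range matchProfilePairs(sample) {
		fmt.Printf("%s=%s\n", m[1], m[2]) // prints "1g.5gb=7" then "2g.10gb=3"
	}
}
```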

pkg/mig/mig.go (5)

185-255: Consider softer error handling in cleanup function.

Using Expect assertions throughout cleanup (lines 204, 216, 226, etc.) will abort cleanup on the first error, potentially leaving resources behind. Consider accumulating errors and reporting them at the end, or logging warnings instead.


273-286: Remove commented-out code and consider exact label matching.

  1. The commented-out code block (lines 273-279) should be removed before merging.
  2. Using strings.Contains(filterQuery, label) can cause false positives - e.g., label "mig" would match filter "nomig". Consider using word boundaries or the commented approach with MatchesLabelFilter.
-	// for _, label := range labelsToCheck {
-	// 	glog.V(gpuparams.GpuLogLevel).Infof("Checking if label %s is present in CurrentSpecReport", label)
-	// 	if matches, _ := specReport.MatchesLabelFilter(label); matches {
-	// 		glog.V(gpuparams.Gpu100LogLevel).Infof("Label %s is present in CurrentSpecReport", label)
-	// 		return true
-	// 	}
-	// }
 	for _, label := range labelsToCheck {
 		glog.V(gpuparams.GpuLogLevel).Infof("Checking if label %s is present in Ginkgo label filter", label)
-		if strings.Contains(filterQuery, label) {
+		// Use word boundary check to avoid partial matches
+		labelPattern := regexp.MustCompile(`\b` + regexp.QuoteMeta(label) + `\b`)
+		if labelPattern.MatchString(filterQuery) {
 			glog.V(gpuparams.Gpu100LogLevel).Infof("Label %s is present in Ginkgo label filter", label)
 			return true
 		}
 	}
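
To see why the word-boundary check matters, a minimal runnable comparison (the label and filter strings are made up for illustration; note that `\b` treats hyphens as word boundaries, so hyphenated labels like "gpu-mixed-mig" would need extra care):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// hasLabel reports whether label appears as a whole word in filter,
// avoiding the substring false positive that strings.Contains allows.
func hasLabel(filter, label string) bool {
	re := regexp.MustCompile(`\b` + regexp.QuoteMeta(label) + `\b`)
	return re.MatchString(filter)
}

func main() {
	filter := "nomig && !slow"
	fmt.Println(strings.Contains(filter, "mig")) // true: false positive on "nomig"
	fmt.Println(hasLabel(filter, "mig"))         // false: whole-word match rejects it
}
```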

402-419: Inefficient: Two Update() calls per node.

Each node is updated twice - once for the strategy label and once for the config label. Consider setting both labels before calling Update() to reduce API calls.

 		for _, nodeBuilder := range nodeBuilders {
 			glog.V(gpuparams.GpuLogLevel).Infof("Setting MIG %s strategy label on node '%s' (overwrite=true)", strategy, nodeBuilder.Definition.Name)
 			nodeBuilder = nodeBuilder.WithLabel("nvidia.com/mig.strategy", strategy)
-			_, err = nodeBuilder.Update()
-			if err != nil {
-				glog.V(gpuparams.GpuLogLevel).Infof("Failed to update node '%s' with MIG label: %v, error: %v", nodeBuilder.Definition.Name, strategy, err)
-			} else {
-				glog.V(gpuparams.GpuLogLevel).Infof("Successfully set MIG %s strategy label on node '%s'", strategy, nodeBuilder.Definition.Name)
-			}
 			glog.V(gpuparams.GpuLogLevel).Infof("Setting MIG configuration label %s on node '%s' (overwrite=true)", MigProfile, nodeBuilder.Definition.Name)
 			nodeBuilder = nodeBuilder.WithLabel("nvidia.com/mig.config", MigProfile)
 			_, err = nodeBuilder.Update()
 			if err != nil {
-				glog.V(gpuparams.GpuLogLevel).Infof("Failed to update node '%s' with MIG label: %v", nodeBuilder.Definition.Name, err)
+				glog.V(gpuparams.GpuLogLevel).Infof("Failed to update node '%s' with MIG labels: %v", nodeBuilder.Definition.Name, err)
 			} else {
-				glog.V(gpuparams.GpuLogLevel).Infof("Successfully set MIG configuration label on node '%s' with %s", nodeBuilder.Definition.Name, MigProfile)
+				glog.V(gpuparams.GpuLogLevel).Infof("Successfully set MIG labels on node '%s' (strategy=%s, config=%s)", nodeBuilder.Definition.Name, strategy, MigProfile)
 			}
 		}

744-752: Use BeTrue() instead of ToNot(BeFalse()) for clarity.

The double negation ToNot(BeFalse()) is harder to read than To(BeTrue()). Both are functionally equivalent.

 	for i := 0; i < migInstanceCount; i++ {
 		match1Mig := strings.Contains(gpuBurnMigLogs, fmt.Sprintf("GPU %d: OK", i))
 		glog.V(gpuparams.Gpu10LogLevel).Infof("Checking if GPU %d: OK is present in logs: %v", i, match1Mig)
-		Expect(match1Mig).ToNot(BeFalse(), "gpu-burn pod execution with MIG was FAILED")
+		Expect(match1Mig).To(BeTrue(), "gpu-burn pod execution with MIG was FAILED for GPU %d", i)
 	}
 	match2Mig := strings.Contains(gpuBurnMigLogs, "100.0%  proc'd:")

-	Expect(match2Mig).ToNot(BeFalse(), "gpu-burn pod execution with MIG was FAILED")
+	Expect(match2Mig).To(BeTrue(), "gpu-burn pod execution with MIG did not reach 100%% processed")

761-775: Remove commented-out code before merging.

The commented functions colorlogs() and clog() should be removed. If needed later, they can be retrieved from version control.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9e375d6 and 2d61ebb.

📒 Files selected for processing (12)
  • .gitignore (1 hunks)
  • internal/get/get.go (3 hunks)
  • internal/gpu-burn/gpu-burn.go (2 hunks)
  • internal/gpuparams/const.go (1 hunks)
  • internal/gpuparams/gpuvars.go (1 hunks)
  • internal/nvidiagpuconfig/config.go (1 hunks)
  • internal/tsparams/mig-vars.go (1 hunks)
  • pkg/mig/mig.go (1 hunks)
  • pkg/nodes/node.go (3 hunks)
  • tests/mig/mig-test.go (1 hunks)
  • tests/mig/mig_suite_test.go (1 hunks)
  • tests/nvidiagpu/deploy-gpu-test.go (17 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
tests/mig/mig_suite_test.go (3)
internal/reporter/reporter.go (1)
  • ReportIfFailed (54-90)
internal/tsparams/mig-vars.go (2)
  • MigReporterNamespacesToDump (14-18)
  • MigReporterCRDsToDump (21-23)
pkg/clients/clients.go (1)
  • SetScheme (130-168)
tests/mig/mig-test.go (9)
pkg/operatorconfig/config.go (1)
  • NewCustomConfig (18-27)
pkg/nvidiagpu/gpuBurnConfig.go (1)
  • NewDefaultGPUBurnConfig (15-22)
pkg/nvidiagpu/consts.go (1)
  • NvidiaGPULabel (8-8)
internal/nvidiagpuconfig/config.go (2)
  • NvidiaGPUConfig (9-21)
  • NewNvidiaGPUConfig (25-37)
pkg/global/const.go (1)
  • UndefinedValue (5-5)
internal/tsparams/consts.go (1)
  • LabelSuite (5-5)
internal/gpuparams/const.go (3)
  • Gpu100LogLevel (11-11)
  • Gpu10LogLevel (9-9)
  • GpuLogLevel (8-8)
pkg/mig/mig.go (2)
  • ReportOpenShiftVersionAndEnsureNFD (152-164)
  • CleanupGPUOperatorResources (187-255)
internal/inittools/inittools.go (1)
  • APIClient (20-20)
internal/tsparams/mig-vars.go (1)
internal/tsparams/consts.go (1)
  • LabelSuite (5-5)
pkg/nodes/node.go (1)
internal/gpuparams/const.go (2)
  • Gpu100LogLevel (11-11)
  • Gpu10LogLevel (9-9)
tests/nvidiagpu/deploy-gpu-test.go (2)
pkg/mig/mig.go (1)
  • ShouldKeepOperator (258-289)
pkg/global/const.go (1)
  • UndefinedValue (5-5)
pkg/mig/mig.go (1)
internal/get/get.go (4)
  • MIGProfileInfo (185-201)
  • GetFirstPodNameWithLabel (62-79)
  • MIGCapabilities (114-182)
  • GetClusterArchitecture (82-109)
🔇 Additional comments (16)
.gitignore (1)

5-6: LGTM!

The additions are appropriate: /archive/ excludes archived data from tracking, and *.test excludes compiled Go test binaries (standard practice for Go projects).

internal/gpuparams/const.go (1)

8-11: LGTM!

The new log level constants provide appropriate granularity for GPU-related logging and follow the existing naming convention.

pkg/nodes/node.go (1)

235-258: Good addition for overwrite-capable label setting.

The WithLabel method properly complements WithNewLabel by providing explicit overwrite semantics. The implementation correctly handles nil label maps and follows the builder pattern consistently.

internal/gpu-burn/gpu-burn.go (1)

82-145: Good security posture and pod configuration.

The pod spec follows security best practices: RunAsNonRoot, SeccompProfile: RuntimeDefault, dropping all capabilities, and setting both resource limits and requests for guaranteed QoS class.

internal/nvidiagpuconfig/config.go (1)

19-20: LGTM!

The new MIG profile configuration fields follow the existing pattern and naming conventions in the struct. The envconfig tags are consistent with the established prefix format.

tests/mig/mig_suite_test.go (1)

26-31: LGTM!

The JustAfterEach hook correctly uses the MIG-specific reporter configuration (MigReporterNamespacesToDump, MigReporterCRDsToDump) for failure reporting, which aligns with the definitions in mig-vars.go.

tests/mig/mig-test.go (1)

15-15: Verify blank import is necessary.

The blank import _ "github.com/rh-ecosystem-edge/nvidia-ci/pkg/clients" suggests side-effect initialization. Confirm this is required; if not, remove it to avoid unnecessary dependencies.

internal/tsparams/mig-vars.go (1)

9-23: LGTM!

The MIG-specific test parameters are well-defined. The MigLabels properly extends the base labels, and the reporter configuration includes the relevant namespaces and CRDs for MIG test diagnostics.

tests/nvidiagpu/deploy-gpu-test.go (1)

1102-1108: LGTM!

The new MIG test entries are properly labeled and delegate to the appropriate test functions.

internal/get/get.go (4)

3-19: LGTM - Imports are appropriately updated for new functionality.

The added imports (context, regexp, strconv, strings, corev1) are all required for the new MIG capability discovery functions.


62-79: LGTM - Improved error handling for pod listing.

The explicit error handling and logging for empty pod lists improves debuggability. Minor note: Line 78 returns err which is always nil at that point; consider returning nil explicitly for clarity.


184-201: LGTM - Well-documented struct for MIG profile information.

The MIGProfileInfo struct is well-documented with clear comments explaining each field's purpose.


205-258: LGTM - Robust command execution with proper status checks.

The function properly validates pod and container readiness before execution, and appropriately handles non-zero exit codes by returning output when available.

pkg/mig/mig.go (3)

1-51: LGTM - Package setup and imports.

Imports are comprehensive for the MIG testing functionality. Color constants for logging are a nice touch for observability.


166-183: LGTM - Clean label checking implementation.

The function properly handles comma-separated labels with whitespace trimming.
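
For reference, comma-separated label parsing with whitespace trimming can be sketched as follows. This is a generic stand-in, not the actual implementation in pkg/mig/mig.go:

```go
package main

import (
	"fmt"
	"strings"
)

// parseLabels splits a comma-separated label list, trims surrounding
// whitespace from each entry, and drops empty entries.
func parseLabels(raw string) []string {
	var out []string
	for _, part := range strings.Split(raw, ",") {
		if label := strings.TrimSpace(part); label != "" {
			out = append(out, label)
		}
	}
	return out
}

func main() {
	fmt.Println(parseLabels(" gpu-burn-mig , single-mig ,,mig "))
}
```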


351-385: LGTM - Proper MIG profile selection with bounds checking.

The function handles random selection when no profile is specified and gracefully handles out-of-range indices. Note: since Go 1.20, the global random source is automatically seeded.

coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (9)
pkg/nodes/node.go (1)

11-11: New import adds coupling to internal package.

The gpuparams import introduces a dependency from a general-purpose pkg/nodes package to an internal GPU-specific package. This creates tight coupling. Consider using the standard log level 100 (as used elsewhere in this file) or passing the log level as a parameter.

-	"github.com/rh-ecosystem-edge/nvidia-ci/internal/gpuparams"

Then in the WithLabel method, use glog.V(100) consistently with the rest of the file.

tests/nvidiagpu/deploy-gpu-test.go (3)

1114-1123: Remove commented-out code.

The commented HasLabel("cleanup") block (lines 1116-1122) appears to be leftover from development. Either implement this feature or remove the dead code.

 func cleanupGPUOperatorResources() {
 	if !cleanupAfterTest {
-		// if HasLabel("cleanup") {
-		// 	glog.V(gpuparams.GpuLogLevel).Infof("Cleanup is enabled via cleanup label, running cleanup")
-		// } else {
-		// 	glog.V(gpuparams.GpuLogLevel).Infof("Cleanup is disabled, skipping GPU operator cleanup")
-		// 	return
-		// }
 		return
 	}

1184-1205: Remove commented-out function.

The shouldKeepOperator function is entirely commented out and appears to be superseded by mig.ShouldKeepOperator. Remove this dead code to improve maintainability.


1102-1108: Consider consolidating duplicate MIG test implementations.

There are two test cases for single MIG strategy:

  • gpu-burn-mig (line 1102) uses local testGPUBurnWithMIGConfiguration()
  • single-mig (line 1106) uses mig.TestSingleMIGGPUBurn() from pkg/mig

If both test the same scenario, consider removing one to reduce maintenance burden. If they test different aspects, add comments clarifying the distinction.

internal/gpu-burn/gpu-burn.go (1)

62-63: Unused timeout parameter.

The timeout parameter is declared but never used in CreateGPUBurnPodWithMIG. Either use it (e.g., for context deadline) or remove it to match the function's actual behavior.

 func CreateGPUBurnPodWithMIG(apiClient *clients.Settings, podName, podNamespace string,
-	gpuBurnImage string, migProfile string, migCount int, timeout time.Duration) (*corev1.Pod, error) {
+	gpuBurnImage string, migProfile string, migCount int) (*corev1.Pod, error) {

Note: The same issue exists in CreateGPUBurnPod (line 151), so this may be intentional for API consistency. If so, consider documenting why the parameter exists.

tests/mig/mig-test.go (4)

3-42: Clean up commented-out imports.

Many imports are commented out (lines 4-9, 13, 19, 22, 30-36, 40-41). This suggests work-in-progress code. Remove unused commented imports before merging to keep the code clean.

 import (
-	// "context"
-	// "encoding/json"
-	// "fmt"
-	// "math/rand"
-	// "strings"
-	// "time"
-
 	"github.com/rh-ecosystem-edge/nvidia-ci/internal/inittools"
-
-	// "github.com/rh-ecosystem-edge/nvidia-ci/internal/nfd"
 	"github.com/rh-ecosystem-edge/nvidia-ci/internal/nvidiagpuconfig"
 	_ "github.com/rh-ecosystem-edge/nvidia-ci/pkg/clients"
 	. "github.com/rh-ecosystem-edge/nvidia-ci/pkg/global"
 	"github.com/rh-ecosystem-edge/nvidia-ci/pkg/mig"
-
-	// "github.com/rh-ecosystem-edge/nvidia-ci/pkg/mig"
 	nfd "github.com/rh-ecosystem-edge/nvidia-ci/pkg/nfd"
-
-	// "github.com/rh-ecosystem-edge/nvidia-ci/pkg/nodes"
 	"github.com/rh-ecosystem-edge/nvidia-ci/pkg/nvidiagpu"
 	"github.com/rh-ecosystem-edge/nvidia-ci/pkg/operatorconfig"
 
 	"github.com/golang/glog"
 	. "github.com/onsi/ginkgo/v2"
 	. "github.com/onsi/gomega"
-
-	// "github.com/rh-ecosystem-edge/nvidia-ci/pkg/configmap"
-
-	// "github.com/rh-ecosystem-edge/nvidia-ci/pkg/namespace"
-	// "github.com/rh-ecosystem-edge/nvidia-ci/pkg/pod"
-
-	// "github.com/operator-framework/api/pkg/operators/v1alpha1"
-	// "github.com/rh-ecosystem-edge/nvidia-ci/internal/check"
-
 	"github.com/rh-ecosystem-edge/nvidia-ci/internal/gpuparams"
 	"github.com/rh-ecosystem-edge/nvidia-ci/internal/tsparams"
-	// "github.com/rh-ecosystem-edge/nvidia-ci/internal/wait"
-	// corev1 "k8s.io/api/core/v1"
 )

50-58: Consider centralizing shared test configuration.

WorkerNodeSelector and BurnImageName are duplicated from tests/nvidiagpu/deploy-gpu-test.go. Consider moving these to a shared test utilities package to avoid drift between test suites.


64-73: Remove or use declared variables.

Several package-level variables are declared but never used:

  • ScaleCluster (line 64)
  • UseSingleMIGProfile (line 65)
  • UseMixedMIGProfile (line 66)
  • SingleMigProfile (line 67)
  • MixedMigProfile (line 68)

Additionally, lines 71-72 contain commented-out variable declarations. Either implement the intended functionality or remove these to avoid confusion.


113-116: Mark placeholder test as pending or skip it.

The mixed MIG test (line 113-116) only logs a message and doesn't perform any assertions. This will pass silently and may give false confidence. Either:

  1. Use Skip("not yet implemented") to clearly indicate it's not ready
  2. Use Ginkgo's PIt (pending It) to mark it as pending
  3. Remove it until implementation is ready
-		It("Test GPU Burn with mixed strategy MIG Configuration", Label("gpu-mixed-mig"), func() {
-			glog.V(gpuparams.Gpu10LogLevel).Infof("gpu-mixed-mig testcase not yet implemented")
-			// mig.TestMixedMIGGPUBurn(nvidiaGPUConfig, burn, BurnImageName, WorkerNodeSelector, cleanupAfterTest)
-		})
+		PIt("Test GPU Burn with mixed strategy MIG Configuration", Label("gpu-mixed-mig"), func() {
+			// TODO: Implement mixed MIG test
+			// mig.TestMixedMIGGPUBurn(nvidiaGPUConfig, burn, BurnImageName, WorkerNodeSelector, cleanupAfterTest)
+		})
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between abe719e and 5e1bc7d.

📒 Files selected for processing (5)
  • internal/gpu-burn/gpu-burn.go (2 hunks)
  • pkg/mig/mig.go (1 hunks)
  • pkg/nodes/node.go (3 hunks)
  • tests/mig/mig-test.go (1 hunks)
  • tests/nvidiagpu/deploy-gpu-test.go (17 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/mig/mig.go
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:56.555Z
Learning: In Kubernetes with NVIDIA GPUs, MIG single strategy profile uses the resource name "nvidia.com/gpu", while MIG mixed strategy profile uses specific resource names like "nvidia.com/mig-1g.5gb" based on the MIG profile.
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
📚 Learning: 2025-12-15T13:26:55.741Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.

Applied to files:

  • pkg/nodes/node.go
  • tests/mig/mig-test.go
  • internal/gpu-burn/gpu-burn.go
📚 Learning: 2025-12-12T10:59:47.614Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:47.614Z
Learning: When coding Kubernetes/NVIDIA MIG handling in Go, distinguish MIG profiles by their resource names: use 'nvidia.com/gpu' for the single MIG strategy, and use the MIG-profile-specific resources such as 'nvidia.com/mig-1g.5gb' for the mixed strategy. Implement checks or constants that reference these exact resource names, and avoid assuming a single generic resource name. Consider centralizing these names (e.g., constants or a small map) and unit-test logic that validates resource-name selection for each MIG profile to prevent misclassification.

Applied to files:

  • pkg/nodes/node.go
  • tests/nvidiagpu/deploy-gpu-test.go
  • tests/mig/mig-test.go
  • internal/gpu-burn/gpu-burn.go
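
A small sketch of the centralization this learning suggests: one function owning the strategy-to-resource-name mapping, so callers cannot misclassify profiles. The helper name is hypothetical; the resource-name convention (plain "nvidia.com/gpu" for single strategy, "nvidia.com/mig-<profile>" for mixed) is taken from the learning above:

```go
package main

import (
	"errors"
	"fmt"
)

// migResourceName returns the Kubernetes extended-resource name for a
// MIG strategy: single strategy exposes plain nvidia.com/gpu, while
// mixed strategy exposes a per-profile resource like
// nvidia.com/mig-1g.5gb.
func migResourceName(strategy, profile string) (string, error) {
	switch strategy {
	case "single":
		return "nvidia.com/gpu", nil
	case "mixed":
		if profile == "" {
			return "", errors.New("mixed strategy requires a MIG profile")
		}
		return "nvidia.com/mig-" + profile, nil
	default:
		return "", fmt.Errorf("unknown MIG strategy %q", strategy)
	}
}

func main() {
	name, _ := migResourceName("single", "")
	fmt.Println(name) // prints "nvidia.com/gpu"
	name, _ = migResourceName("mixed", "1g.5gb")
	fmt.Println(name) // prints "nvidia.com/mig-1g.5gb"
}
```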
📚 Learning: 2025-12-15T13:26:50.784Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:50.784Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, during cleanup of MIG labels nvidia.com/mig.strategy and nvidia.com/mig.config on GPU nodes after tests, use RemoveLabel to remove these labels instead of assigning invalid values. Only valid mig.strategy values are 'single' or 'mixed'.

Applied to files:

  • tests/nvidiagpu/deploy-gpu-test.go
📚 Learning: 2025-06-18T14:59:40.987Z
Learnt from: wabouhamad
Repo: rh-ecosystem-edge/nvidia-ci PR: 205
File: tests/nvidianetwork/deploy-nno-test.go:1411-1414
Timestamp: 2025-06-18T14:59:40.987Z
Learning: In SRIOV RDMA testcases in tests/nvidianetwork/deploy-nno-test.go, the string "sriov" is intentionally passed as the device parameter to CreateRdmaWorkloadPod instead of the actual mlx5 device ID. The workload container is designed to dynamically discover the correct mlx5 device for SRIOV cases at runtime.

Applied to files:

  • tests/mig/mig-test.go
🧬 Code graph analysis (3)
tests/nvidiagpu/deploy-gpu-test.go (3)
pkg/mig/mig.go (1)
  • ShouldKeepOperator (264-295)
pkg/global/const.go (1)
  • UndefinedValue (5-5)
internal/wait/wait.go (1)
  • ClusterPolicyReady (17-46)
tests/mig/mig-test.go (10)
pkg/operatorconfig/config.go (1)
  • NewCustomConfig (18-27)
pkg/nvidiagpu/gpuBurnConfig.go (1)
  • NewDefaultGPUBurnConfig (15-22)
internal/nvidiagpuconfig/config.go (2)
  • NvidiaGPUConfig (9-21)
  • NewNvidiaGPUConfig (25-37)
pkg/global/const.go (1)
  • UndefinedValue (5-5)
internal/tsparams/consts.go (1)
  • LabelSuite (5-5)
vendor/github.com/golang/glog/glog.go (2)
  • V (401-403)
  • Infof (507-509)
internal/gpuparams/const.go (3)
  • Gpu10LogLevel (9-9)
  • Gpu100LogLevel (11-11)
  • GpuLogLevel (8-8)
pkg/mig/mig.go (2)
  • ReportOpenShiftVersionAndEnsureNFD (158-170)
  • CleanupGPUOperatorResources (193-261)
internal/inittools/inittools.go (1)
  • APIClient (20-20)
tests/nvidiagpu/deploy-gpu-test.go (1)
  • BurnImageName (62-65)
internal/gpu-burn/gpu-burn.go (3)
pkg/clients/clients.go (1)
  • Settings (45-60)
internal/gpuparams/gpuvars.go (1)
  • Labels (5-5)
internal/tsparams/gpu-vars.go (1)
  • Labels (11-11)
🔇 Additional comments (8)
pkg/nodes/node.go (2)

203-205: Good documentation clarification.

The updated comment clearly distinguishes WithNewLabel (no overwrite) from the new WithLabel (overwrites). This helps API consumers choose the right method.


235-257: LGTM - Clean implementation of overwrite-capable label setter.

The method correctly:

  • Validates input (empty key check)
  • Initializes the Labels map if nil
  • Unconditionally sets the label value (overwrite behavior)
  • Returns the builder for chaining

Minor note: The error logging order (errorMsg set before logging at line 245-246) differs slightly from WithNewLabel (log then errorMsg at lines 213-214), but this doesn't affect functionality.
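As a sketch of the overwrite semantics being reviewed (simplified stand-in; the real pkg/nodes.Builder carries API-client state and a structured error field, so these names are illustrative only):

```go
package main

import "fmt"

// builder is a stripped-down stand-in for pkg/nodes.Builder.
type builder struct {
	Labels map[string]string
	errMsg string
}

// WithLabel sets (and overwrites) a label, mirroring the reviewed behavior:
// validate the key, lazily initialize the map, set unconditionally, return
// the builder for chaining.
func (b *builder) WithLabel(key, value string) *builder {
	if key == "" {
		b.errMsg = "label key cannot be empty"
		return b
	}
	if b.Labels == nil {
		b.Labels = map[string]string{}
	}
	b.Labels[key] = value
	return b
}

func main() {
	b := &builder{}
	b.WithLabel("nvidia.com/mig.strategy", "single").
		WithLabel("nvidia.com/mig.strategy", "mixed") // overwrites
	fmt.Println(b.Labels["nvidia.com/mig.strategy"]) // mixed
}
```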

tests/nvidiagpu/deploy-gpu-test.go (4)

93-93: Package-level slice initialization looks correct.

The labelsToCheck is initialized as an empty slice and populated in BeforeAll (line 154). This avoids issues with package-level initialization order.


153-162: Good test coordination pattern.

The labelsToCheck array and ShouldKeepOperator integration properly prevents premature cleanup when subsequent MIG or upgrade tests need the operator to remain installed.


1273-1282: Random MIG profile selection may complicate debugging.

Using rand.Intn for MIG profile selection provides good test variety, but failures may be harder to reproduce. Consider logging the selected index prominently or allowing an environment variable override for deterministic testing.

The code at line 1276 already logs the selected index, which helps.

Consider verifying if the NVIDIAGPU_SINGLE_MIG_PROFILE env var from nvidiaGPUConfig.SingleMIGProfile is intended to override this random selection. The current code doesn't appear to use it here.


1592-1605: LGTM - MIG-aware log validation.

The validation loop correctly checks for "GPU X: OK" for each available MIG slice (line 1595-1600), and separately verifies the completion message "100.0% proc'd:" (line 1602). This properly validates MIG workload distribution.
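The validation described above could be sketched like this (the function name and signature are assumptions for illustration, not the test's exact code):

```go
package main

import (
	"fmt"
	"strings"
)

// validateBurnLogs checks for one "GPU i: OK" line per MIG slice, plus
// the gpu-burn completion marker, as the reviewed loop does.
func validateBurnLogs(logs string, migSlices int) error {
	for i := 0; i < migSlices; i++ {
		marker := fmt.Sprintf("GPU %d: OK", i)
		if !strings.Contains(logs, marker) {
			return fmt.Errorf("missing %q in gpu-burn logs", marker)
		}
	}
	if !strings.Contains(logs, "100.0% proc'd:") {
		return fmt.Errorf("gpu-burn did not report completion")
	}
	return nil
}

func main() {
	logs := "GPU 0: OK\nGPU 1: OK\n100.0% proc'd: 1234\n"
	fmt.Println(validateBurnLogs(logs, 2)) // <nil>
}
```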

internal/gpu-burn/gpu-burn.go (2)

70-73: LGTM - Correct MIG resource naming.

The comment accurately explains the MIG resource naming convention:

  • Single strategy: migProfile="gpu" → resource "nvidia.com/gpu"
  • Mixed strategy: migProfile="mig-1g.5gb" → resource "nvidia.com/mig-1g.5gb"

This aligns with the retrieved learnings about NVIDIA MIG resource naming.
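A minimal sketch of this naming rule (the helper name migResourceName is an assumption; the PR builds the resource name inline):

```go
package main

import "fmt"

// migResourceName maps a MIG profile string to the Kubernetes extended
// resource name the workload pod requests. Single strategy passes "gpu"
// and yields the plain GPU resource; mixed strategy passes a
// profile-specific name such as "mig-1g.5gb".
func migResourceName(migProfile string) string {
	return fmt.Sprintf("nvidia.com/%s", migProfile)
}

func main() {
	fmt.Println(migResourceName("gpu"))        // single strategy
	fmt.Println(migResourceName("mig-1g.5gb")) // mixed strategy
}
```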


75-146: LGTM - Well-structured MIG Pod configuration.

The Pod configuration correctly includes:

  • MIG resource limits and requests using the constructed resource name
  • Appropriate security context (non-root, seccomp profile, dropped capabilities)
  • GPU tolerations for scheduling
  • ConfigMap volume mount for the entrypoint script
  • Node selector requiring GPU-present worker nodes

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (4)
pkg/mig/mig.go (4)

279-285: Remove commented-out code.

The commented-out loop (lines 279-285) appears to be an alternative implementation using MatchesLabelFilter. If this approach is no longer needed, remove it to improve code clarity.

Apply this diff:

 	// Check if test has any of these labels
 
-	// for _, label := range labelsToCheck {
-	// 	glog.V(gpuparams.GpuLogLevel).Infof("Checking if label %s is present in CurrentSpecReport", label)
-	// 	if matches, _ := specReport.MatchesLabelFilter(label); matches {
-	// 		glog.V(gpuparams.Gpu100LogLevel).Infof("Label %s is present in CurrentSpecReport", label)
-	// 		return true
-	// 	}
-	// }
 	for _, label := range labelsToCheck {

408-425: Consider combining label updates into a single API call per node.

Currently, each node receives two separate Update() calls—one for nvidia.com/mig.strategy and another for nvidia.com/mig.config. This results in two network round-trips per node. Chain both WithLabel() calls before updating to reduce API calls.

Apply this diff:

 		for _, nodeBuilder := range nodeBuilders {
 			glog.V(gpuparams.GpuLogLevel).Infof("Setting MIG %s strategy label on node '%s' (overwrite=true)", strategy, nodeBuilder.Definition.Name)
-			nodeBuilder = nodeBuilder.WithLabel("nvidia.com/mig.strategy", strategy)
-			_, err = nodeBuilder.Update()
-			if err != nil {
-				glog.V(gpuparams.GpuLogLevel).Infof("Failed to update node '%s' with MIG label: %v, error: %v", nodeBuilder.Definition.Name, strategy, err)
-			} else {
-				glog.V(gpuparams.GpuLogLevel).Infof("Successfully set MIG %s strategy label on node '%s'", strategy, nodeBuilder.Definition.Name)
-			}
 			glog.V(gpuparams.GpuLogLevel).Infof("Setting MIG configuration label %s on node '%s' (overwrite=true)", MigProfile, nodeBuilder.Definition.Name)
+			nodeBuilder = nodeBuilder.WithLabel("nvidia.com/mig.strategy", strategy)
 			nodeBuilder = nodeBuilder.WithLabel("nvidia.com/mig.config", MigProfile)
 			_, err = nodeBuilder.Update()
 			if err != nil {
-				glog.V(gpuparams.GpuLogLevel).Infof("Failed to update node '%s' with MIG label: %v", nodeBuilder.Definition.Name, err)
+				glog.V(gpuparams.GpuLogLevel).Infof("Failed to update node '%s' with MIG labels: %v", nodeBuilder.Definition.Name, err)
 			} else {
-				glog.V(gpuparams.GpuLogLevel).Infof("Successfully set MIG configuration label on node '%s' with %s", nodeBuilder.Definition.Name, MigProfile)
+				glog.V(gpuparams.GpuLogLevel).Infof("Successfully set MIG %s strategy and %s configuration labels on node '%s'", strategy, MigProfile, nodeBuilder.Definition.Name)
 			}
 		}

576-585: Remove redundant error check.

Line 584's Expect() assertion is redundant: if err is non-nil, line 578 returns early; otherwise err is nil and the Expect() always succeeds. Remove it.

Apply this diff:

 	glog.V(gpuparams.Gpu10LogLevel).Infof("Getting cluster architecture from nodes with "+
 		"WorkerNodeSelector: %v", WorkerNodeSelector)
 	clusterArch, err := get.GetClusterArchitecture(inittools.APIClient, WorkerNodeSelector)
 	if err != nil {
 		return "", err
 	}
 
 	glog.V(gpuparams.Gpu10LogLevel).Infof("cluster architecture for GPU enabled worker node is: %s",
 		clusterArch)
 
-	Expect(err).ToNot(HaveOccurred(), "error configuring MIG strategy and getting cluster architecture: %v", err)
 	time.Sleep(30 * time.Second)
 
 	return clusterArch, nil

767-781: Consider removing commented-out example code.

The commented functions colorlogs() and clog() appear to be test/example code that's no longer needed. Remove them to keep the codebase clean.

Apply this diff:

 	return fmt.Sprintf("%s%s%s", color, message, colorReset)
 }
-
-// // colorlogs prints sample colored log messages for testing color output.
-// // This function demonstrates all available color options for console highlighting.
-// func colorlogs() {
-// 	glog.V(101).Infof("%s", colorLog(colorCyan+colorBold, "Cyan"))
-// 	glog.V(101).Infof("%s", colorLog(colorGreen+colorBold, "Green"))
-// 	glog.V(101).Infof("%s", colorLog(colorYellow+colorBold, "Yellow"))
-// 	glog.V(101).Infof("%s", colorLog(colorBlue+colorBold, "Blue"))
-// 	glog.V(101).Infof("%s", colorLog(colorMagenta+colorBold, "Magenta"))
-// 	glog.V(101).Infof("%s", colorLog(colorWhite+colorBold, "White"))
-// 	glog.V(101).Infof("%s", colorLog(colorRed+colorBold, "Red"))
-// }
-
-// func clog(verbosity int32, color, message string, args ...interface{}) {
-// 	glog.V(glog.Level(verbosity)).Infof("%s", Cl(color, fmt.Sprintf(message, args...)))
-// }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5e1bc7d and ca48471.

📒 Files selected for processing (1)
  • pkg/mig/mig.go (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:56.555Z
Learning: In Kubernetes with NVIDIA GPUs, MIG single strategy profile uses the resource name "nvidia.com/gpu", while MIG mixed strategy profile uses specific resource names like "nvidia.com/mig-1g.5gb" based on the MIG profile.
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
📚 Learning: 2025-12-15T13:26:55.741Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.

Applied to files:

  • pkg/mig/mig.go
📚 Learning: 2025-12-12T10:59:47.614Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:47.614Z
Learning: When coding Kubernetes/NVIDIA MIG handling in Go, distinguish MIG profiles by their resource names: use 'nvidia.com/gpu' for the single MIG strategy, and use the MIG-profile-specific resources such as 'nvidia.com/mig-1g.5gb' for the mixed strategy. Implement checks or constants that reference these exact resource names, and avoid assuming a single generic resource name. Consider centralizing these names (e.g., constants or a small map) and unit-test logic that validates resource-name selection for each MIG profile to prevent misclassification.

Applied to files:

  • pkg/mig/mig.go
🧬 Code graph analysis (1)
pkg/mig/mig.go (8)
internal/nvidiagpuconfig/config.go (1)
  • NvidiaGPUConfig (9-21)
internal/get/get.go (2)
  • MIGProfileInfo (185-201)
  • MIGCapabilities (114-182)
internal/inittools/inittools.go (1)
  • GetOpenShiftVersion (47-69)
pkg/nodes/node.go (2)
  • Pull (119-138)
  • Builder (30-36)
pkg/olm/subscription.go (1)
  • PullSubscription (237-270)
internal/wait/wait.go (1)
  • ClusterPolicyReady (17-46)
internal/check/check.go (1)
  • NodeWithLabel (62-87)
internal/gpu-burn/gpu-burn.go (1)
  • CreateGPUBurnConfigMap (38-59)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
pkg/mig/mig.go (1)

777-791: Consider removing commented-out code.

The commented-out functions colorlogs and clog at the end of the file could be removed if they're not needed for future reference, or moved to a separate utilities file if they might be useful later.

Apply this diff if you want to remove them:

-// // colorlogs prints sample colored log messages for testing color output.
-// // This function demonstrates all available color options for console highlighting.
-// func colorlogs() {
-// 	glog.V(101).Infof("%s", colorLog(colorCyan+colorBold, "Cyan"))
-// 	glog.V(101).Infof("%s", colorLog(colorGreen+colorBold, "Green"))
-// 	glog.V(101).Infof("%s", colorLog(colorYellow+colorBold, "Yellow"))
-// 	glog.V(101).Infof("%s", colorLog(colorBlue+colorBold, "Blue"))
-// 	glog.V(101).Infof("%s", colorLog(colorMagenta+colorBold, "Magenta"))
-// 	glog.V(101).Infof("%s", colorLog(colorWhite+colorBold, "White"))
-// 	glog.V(101).Infof("%s", colorLog(colorRed+colorBold, "Red"))
-// }
-
-// func clog(verbosity int32, color, message string, args ...interface{}) {
-// 	glog.V(glog.Level(verbosity)).Infof("%s", Cl(color, fmt.Sprintf(message, args...)))
-// }
tests/mig/mig-test.go (1)

3-42: Consider cleaning up commented imports.

The test file contains many commented-out imports (lines 4-41). If these aren't needed for reference or upcoming development, consider removing them to improve code clarity.

Apply this diff to remove unnecessary commented imports:

 import (
-	// "context"
-	// "encoding/json"
-	// "fmt"
-	// "math/rand"
-	// "strings"
-	// "time"
-
 	"github.com/rh-ecosystem-edge/nvidia-ci/internal/inittools"
-
-	// "github.com/rh-ecosystem-edge/nvidia-ci/internal/nfd"
 	"github.com/rh-ecosystem-edge/nvidia-ci/internal/nvidiagpuconfig"
 	_ "github.com/rh-ecosystem-edge/nvidia-ci/pkg/clients"
 	. "github.com/rh-ecosystem-edge/nvidia-ci/pkg/global"
 	"github.com/rh-ecosystem-edge/nvidia-ci/pkg/mig"
-
-	// "github.com/rh-ecosystem-edge/nvidia-ci/pkg/mig"
 	nfd "github.com/rh-ecosystem-edge/nvidia-ci/pkg/nfd"
-
-	// "github.com/rh-ecosystem-edge/nvidia-ci/pkg/nodes"
 	"github.com/rh-ecosystem-edge/nvidia-ci/pkg/nvidiagpu"
 	"github.com/rh-ecosystem-edge/nvidia-ci/pkg/operatorconfig"
 
 	"github.com/golang/glog"
 	. "github.com/onsi/ginkgo/v2"
 	. "github.com/onsi/gomega"
-
-	// "github.com/rh-ecosystem-edge/nvidia-ci/pkg/configmap"
-
-	// "github.com/rh-ecosystem-edge/nvidia-ci/pkg/namespace"
-	// "github.com/rh-ecosystem-edge/nvidia-ci/pkg/pod"
-
-	// "github.com/operator-framework/api/pkg/operators/v1alpha1"
-	// "github.com/rh-ecosystem-edge/nvidia-ci/internal/check"
-
 	"github.com/rh-ecosystem-edge/nvidia-ci/internal/gpuparams"
 	"github.com/rh-ecosystem-edge/nvidia-ci/internal/tsparams"
-	// "github.com/rh-ecosystem-edge/nvidia-ci/internal/wait"
-	// corev1 "k8s.io/api/core/v1"
 )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ca48471 and 9b0fc15.

📒 Files selected for processing (2)
  • pkg/mig/mig.go (1 hunks)
  • tests/mig/mig-test.go (1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:56.555Z
Learning: In Kubernetes with NVIDIA GPUs, MIG single strategy profile uses the resource name "nvidia.com/gpu", while MIG mixed strategy profile uses specific resource names like "nvidia.com/mig-1g.5gb" based on the MIG profile.
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
📚 Learning: 2025-12-15T13:26:55.741Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.

Applied to files:

  • tests/mig/mig-test.go
  • pkg/mig/mig.go
📚 Learning: 2025-06-18T14:59:40.987Z
Learnt from: wabouhamad
Repo: rh-ecosystem-edge/nvidia-ci PR: 205
File: tests/nvidianetwork/deploy-nno-test.go:1411-1414
Timestamp: 2025-06-18T14:59:40.987Z
Learning: In SRIOV RDMA testcases in tests/nvidianetwork/deploy-nno-test.go, the string "sriov" is intentionally passed as the device parameter to CreateRdmaWorkloadPod instead of the actual mlx5 device ID. The workload container is designed to dynamically discover the correct mlx5 device for SRIOV cases at runtime.

Applied to files:

  • tests/mig/mig-test.go
📚 Learning: 2025-12-12T10:59:47.614Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:47.614Z
Learning: When coding Kubernetes/NVIDIA MIG handling in Go, distinguish MIG profiles by their resource names: use 'nvidia.com/gpu' for the single MIG strategy, and use the MIG-profile-specific resources such as 'nvidia.com/mig-1g.5gb' for the mixed strategy. Implement checks or constants that reference these exact resource names, and avoid assuming a single generic resource name. Consider centralizing these names (e.g., constants or a small map) and unit-test logic that validates resource-name selection for each MIG profile to prevent misclassification.

Applied to files:

  • tests/mig/mig-test.go
  • pkg/mig/mig.go
📚 Learning: 2025-03-09T17:13:24.077Z
Learnt from: empovit
Repo: rh-ecosystem-edge/nvidia-ci PR: 81
File: workflows/nvidia_gpu_operator.py:10-33
Timestamp: 2025-03-09T17:13:24.077Z
Learning: In the nvidia-ci repository, the API interaction functions should use fail-fast behavior rather than graceful error handling. The workflow should fail explicitly if APIs change or don't return expected data (like authentication tokens or tags), as this makes problems immediately visible.

Applied to files:

  • pkg/mig/mig.go
🧬 Code graph analysis (1)
pkg/mig/mig.go (8)
internal/nvidiagpuconfig/config.go (1)
  • NvidiaGPUConfig (9-21)
internal/get/get.go (2)
  • MIGProfileInfo (185-201)
  • GetFirstPodNameWithLabel (62-79)
internal/gpuparams/const.go (3)
  • Gpu10LogLevel (9-9)
  • Gpu100LogLevel (11-11)
  • GpuLogLevel (8-8)
pkg/global/const.go (1)
  • OpenShiftVersionFile (7-7)
pkg/nodes/node.go (2)
  • Pull (119-138)
  • Builder (30-36)
pkg/olm/subscription.go (1)
  • PullSubscription (237-270)
internal/wait/wait.go (1)
  • ClusterPolicyReady (17-46)
internal/check/check.go (1)
  • NodeWithLabel (62-87)
🔇 Additional comments (4)
pkg/mig/mig.go (3)

52-162: LGTM! Comprehensive MIG test orchestration.

The main test function properly orchestrates the MIG GPU burn test flow with appropriate error handling, cleanup defers, and step-by-step validation. All previously identified issues (defer assertions, error checking) have been addressed.


469-508: Correct label cleanup implementation.

The function properly removes the nvidia.com/mig.strategy label using RemoveLabel (line 478) and sets nvidia.com/mig.config to "all-disabled" (line 486), which aligns with the valid values for these labels as documented in the learnings.


681-769: LGTM! Pod deployment and validation logic is sound.

The pod deployment, completion waiting, log retrieval, and validation functions are well-structured. All format string issues from previous reviews have been corrected, and the validation logic properly checks for GPU success indicators across all MIG instances.

tests/mig/mig-test.go (1)

75-119: Well-structured test suite with proper lifecycle management.

The test suite properly initializes configuration in BeforeAll (with nil check on line 82) and performs cleanup in AfterAll. The single MIG test case is implemented, and the mixed MIG test case is appropriately marked as not yet implemented.

.gitignore Outdated
Comment on lines 5 to 6
/archive/
*.test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are these? They don't look like common patterns to exclude. If it's something only you use, please find a different solution, e.g. change the global Git config.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mig.test seems to be a binary that was produced by some command (go test -c ./tests/mig).
archive was a folder to which I moved some log files from previous days that I didn't want to discard yet.

Neither is necessary in general; I can use other locations for the archive folder, and mig.test or any *.test binaries can be deleted manually before committing anything.

Comment on lines 66 to 71
glog.V(gpuparams.GpuLogLevel).Infof("error listing pods in namespace %s with label selector %s: %v", podNamespace, podLabelSelector, err)
return "", err
}

if len(podList) == 0 {
glog.V(gpuparams.GpuLogLevel).Infof("no pods found in namespace %s with label selector %s", podNamespace, podLabelSelector)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logging is actually a bad pattern, although it's widely used. Here, you already add useful context to the error message and pass it on. You don't know how the error will be used, but it will likely be acted upon or displayed to the user, or propagated until it is acted upon or displayed (and so on). If the client code follows the same error-handling pattern, this error will be logged again, and again. And notice it's not a debug message (which is normally turned off and is there only in case something needs fixing), but an Info. So we'll end up with a lot of log messages related to the same error. Two issues:

  1. Too many log messages to process in a meaningful way (by humans).
  2. Multiple log messages related to the same error, but with different wording, make it look like there were multiple errors.

So, I'd say leave it up to the client code to handle errors your code doesn't know how to handle. It may log and do something useful, or it may add useful context to the error and propagate it further.
I encourage you to read about best practices for logging and prove me wrong :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have used those logs a lot to keep track on how the execution goes and I did clean up lots of them before I committed. Some of the logs are generated by cursor (I recognize and remember that this part is cursor generated), and here they were for 2 seemingly different scenarios, although maybe theoretical.

I'll remove those extra logs, they are of no interest.

}

// MIGCapabilities queries GPU hardware directly using nvidia-smi
// to discover MIG capabilities. This is a fallback when GFD labels are not available.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we fall back, or just fail the test if what's expected is not there (the GFD labels in this case)?
If we're testing this feature end-to-end, we may miss a real customer problem by implementing this fallback, IIUC.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the MIG profile is not found, I have tested two scenarios with the mig package:

  1. gpu-operator is installed, MIG is disabled
    --> normal situation, the test case selects the MIG profile and continues
  2. gpu-operator is not installed
    --> the test case does not crash, it skips the test

In this case I need to see the nvidia-smi printout. I couldn't see another way to get it.

Comments may be distracting; maybe just leave out the fallback comment?

firstNode := nodeBuilder[0]
nodeName := firstNode.Object.Name

// Find a driver pod or GFD pod on this node to query hardware
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, why do we need a GFD pod? Can't we use the nvidia.com/mig.capable node label?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need the printout of nvidia-smi to get the profiles:

$ oc exec -ti -n nvidia-gpu-operator nvidia-driver-daemonset-9.6.20250925-0-9fq7p -- nvidia-smi mig -lgip | grep 'MIG|GPU' | grep -v '+me'
| GPU instance profiles:                                       |
| GPU  Name         ID  Instances  Memory  P2P  SM  DEC  ENC  |
|  0   MIG 1g.5gb   19     7/7      4.75   No   14   0    0   |
|  0   MIG 1g.10gb  15     4/4      9.75   No   14   1    0   |
|  0   MIG 2g.10gb  14     3/3      9.75   No   28   1    0   |
|  0   MIG 3g.20gb   9     2/2     19.62   No   42   2    0   |
|  0   MIG 4g.20gb   5     1/1     19.62   No   56   2    0   |
|  0   MIG 7g.40gb   0     1/1     39.38   No   98   5    0   |
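A minimal sketch of how one data line of that printout could be parsed (the regex, struct, and function names are assumptions for illustration, not the PR's parseMIGProfiles implementation):

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// profileInfo is an illustrative struct; the PR uses MIGProfileInfo.
type profileInfo struct {
	Name  string // e.g. "1g.5gb"
	ID    int
	Free  int
	Total int
}

// lineRe matches one profile line of `nvidia-smi mig -lgip`, e.g.:
// |   0  MIG 1g.5gb   19   7/7   4.75   No   14   0   0 |
var lineRe = regexp.MustCompile(`MIG\s+(\S+)\s+(\d+)\s+(\d+)/(\d+)`)

func parseProfileLine(line string) (profileInfo, bool) {
	m := lineRe.FindStringSubmatch(line)
	if m == nil {
		return profileInfo{}, false
	}
	id, _ := strconv.Atoi(m[2])
	free, _ := strconv.Atoi(m[3])
	total, _ := strconv.Atoi(m[4])
	return profileInfo{Name: m[1], ID: id, Free: free, Total: total}, true
}

func main() {
	p, ok := parseProfileLine("| 0 MIG 1g.5gb 19 7/7 4.75 No 14 0 0 |")
	fmt.Println(ok, p.Name, p.ID, p.Free, p.Total) // true 1g.5gb 19 7 7
}
```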

// MIGCapabilities queries GPU hardware directly using nvidia-smi
// to discover MIG capabilities. This is a fallback when GFD labels are not available.
// Returns true if MIG is supported, along with available MIG instance profiles.
func MIGCapabilities(apiClient *clients.Settings, nodeSelector map[string]string) (bool, []MIGProfileInfo, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm strongly against putting this in the get package. Code should be grouped around what domain it operates on, not which actions it performs. Unless it's a common operation like, for instance, get from K8s API (get pod/node/deployment/.../anything). Even then, if it's get pod, get node, ... they probably belong in "pod", "node", etc. (and not "get").

We didn't have time to refactor this, but let's not make it worse 😉

All MIG-related code should probably be somewhere in "gpu", or "mig", or "gpu/mig".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, moving this and related parts into mig.go


// Query MIG capabilities using nvidia-smi
// First, try to get MIG instance profiles directly (works even if MIG mode is not enabled)
cmd := []string{"nvidia-smi", "mig", "-lgip"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all this nvidia-smi code can be very useful and should be made reusable: finding the right pod, running the command. The function would be generic, receiving command arguments and returning the output.

Regardless of that, this entire function seems too long. It includes a few steps that would be easier to understand and maintain as separate functions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, we have to check further how it could be shortened. Or did you mean that it should be split into smaller pieces?

The functions can be split further to be reusable for other purposes, as they are currently quite specific to this particular case.
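A generic helper along the lines the reviewer suggests might look like this (the interface, names, and signature are assumptions for the sketch; the real code delegates to podBuilder.ExecCommand):

```go
package main

import (
	"fmt"
	"strings"
)

// executor abstracts "run a command in a pod"; in the real code this
// would be backed by podBuilder.ExecCommand.
type executor interface {
	Exec(namespace, pod string, cmd []string) (string, error)
}

// runNvidiaSMI is a hypothetical reusable wrapper: it only builds the
// argument list, delegates, and returns the raw output with context on
// failure. Callers parse the output themselves.
func runNvidiaSMI(e executor, namespace, pod string, args ...string) (string, error) {
	cmd := append([]string{"nvidia-smi"}, args...)
	out, err := e.Exec(namespace, pod, cmd)
	if err != nil {
		return "", fmt.Errorf("running %q in %s/%s: %w",
			strings.Join(cmd, " "), namespace, pod, err)
	}
	return out, nil
}

// fakeExec stands in for a real pod in this sketch; it echoes the command.
type fakeExec struct{}

func (fakeExec) Exec(ns, pod string, cmd []string) (string, error) {
	return strings.Join(cmd, " "), nil
}

func main() {
	out, _ := runNvidiaSMI(fakeExec{}, "nvidia-gpu-operator", "driver-pod", "mig", "-lgip")
	fmt.Println(out) // nvidia-smi mig -lgip
}
```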

}
return true, profiles, nil
} else {
// Even if command failed, check if we got any output (non-zero exit codes still produce output)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it's useful. What kind of output will it produce if it failed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was some specific situation where I remember seeing an error code while running the case, and when I manually checked it, it was returning a valid printout. I'm not sure if it returned a non-zero return code when I checked it manually. I guess this may be unnecessary, but I tend to put code into places where the case previously crashed or failed (to prevent the crash or failure).

// Internal functions serving the external functions

// execCommandInPod executes a command in a pod and returns the output
func execCommandInPod(apiClient *clients.Settings, podName, namespace string, command []string) (string, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one can be useful as well. Don't we already have something like it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch :) I had taken mps_test.go as a basis for mig package and tests and there is executeCommandInPod in mps_test.go

I'll change that

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the execCommandInPod is calling the podBuilder.ExecCommand. Even though the function name is quite descriptive in this case, I think I will change it so that it wouldn't be misleading in any way.

// Labels represents the range of labels that can be used for test cases selection.
Labels = []string{"nvidia-ci", Label}
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any commented-out code should be removed. We either need it, or we don't, no in-between :) Don't worry, we'll have it in Git history anyway.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing, did not notice earlier

pkg/mig/mig.go Outdated
Comment on lines 42 to 49
// colorRed = "\033[31m"
// colorGreen = "\033[32m"
// colorYellow = "\033[33m"
// colorBlue = "\033[34m"
// colorMagenta = "\033[35m"
colorCyan = "\033[36m"
// colorWhite = "\033[37m"
colorBold = "\033[1m"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What console? Will it be capable of displaying colors? Or will it be a Prow log file with gibberish that only confuses the readers? :) Here, too, I think we should remove all unused/commented-out code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(screenshot: example of colored log output in the terminal)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above is an example of the output. I don't know how it works on Prow.

@empovit
Copy link
Member

empovit commented Dec 16, 2025

@vtruhpon great first PR! You've done a lot of work. I had a few questions/comments, PTAL.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (5)
tests/nvidiagpu/deploy-gpu-test.go (2)

93-95: Consider defining timeout constants in a shared location.

The QuickTimeout and QuickInterval constants are defined at package scope here, but similar constants also appear in pkg/mig/mig.go (lines 45-46). This duplication can lead to inconsistencies if timeout values need adjustment.

Consider centralizing MIG-related timing constants

Move these constants to internal/gpuparams/const.go or a dedicated MIG parameters file to ensure consistent timeout behavior across the codebase.


1235-1593: Refactor testWorkloadWithSingleMig to improve maintainability.

The testWorkloadWithSingleMig function spans 358 lines and handles multiple responsibilities: MIG profile selection, node labeling, ClusterPolicy orchestration, namespace/configmap/pod creation, and log validation. This makes it difficult to test individual components and understand the flow.

Recommended refactoring approach

The function could be split into well-defined phases using helper functions (similar to what's done in pkg/mig/mig.go):

  1. validateMIGCapability() - Check mig.capable label
  2. setupMIGConfiguration() - Query profiles, set labels, wait for ClusterPolicy
  3. deployMIGWorkload() - Create namespace, configmap, pod
  4. validateMIGWorkloadExecution() - Wait for completion, parse logs

This would align with the single responsibility principle and make the test flow clearer. Note that pkg/mig/mig.go already provides many of these helper functions (e.g., CheckMIGCapabilityOnNodes, SelectMigProfile, SetMIGLabelsOnNodes, etc.) which could potentially be reused here.
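The four phases above can be sketched as a short pipeline that fails fast on the first error. The phase bodies here are stubs (they only record their name); in the real test each would wrap the corresponding pkg/mig helpers:

```go
// Sketch of testWorkloadWithSingleMig split into named phases.
package main

import "fmt"

type migTestCtx struct{ steps []string }

func (c *migTestCtx) validateMIGCapability() error { c.steps = append(c.steps, "capability"); return nil }
func (c *migTestCtx) setupMIGConfiguration() error { c.steps = append(c.steps, "setup"); return nil }
func (c *migTestCtx) deployMIGWorkload() error     { c.steps = append(c.steps, "deploy"); return nil }
func (c *migTestCtx) validateMIGWorkloadExecution() error {
	c.steps = append(c.steps, "validate")
	return nil
}

// run replaces the 358-line body with an ordered, fail-fast pipeline.
func (c *migTestCtx) run() error {
	for _, phase := range []func() error{
		c.validateMIGCapability,
		c.setupMIGConfiguration,
		c.deployMIGWorkload,
		c.validateMIGWorkloadExecution,
	} {
		if err := phase(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	c := &migTestCtx{}
	if err := c.run(); err != nil {
		fmt.Println("failed:", err)
		return
	}
	fmt.Println(c.steps) // → [capability setup deploy validate]
}
```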

pkg/mig/mig.go (3)

36-42: Colorized output may not display correctly in all environments.

The ANSI color codes (lines 36-42) are used throughout via colorLog() (lines 610-614). As noted in a past comment, these may render as gibberish in Prow logs or other non-terminal environments that don't support ANSI escape sequences.

Optional: Make colors conditional

Consider detecting terminal capability or providing an environment variable to disable colors:

var useColors = os.Getenv("NO_COLOR") == "" && isTerminal()

func colorLog(color, message string) string {
    if !useColors {
        return message
    }
    return fmt.Sprintf("%s%s%s", color, message, colorReset)
}

This ensures logs remain readable in all environments while preserving color support where available.

Also applies to: 610-614


711-785: Complex regex parsing in parseMIGProfiles requires careful testing.

The parseMIGProfiles function uses multiple regex patterns to parse nvidia-smi mig -lgip output across two lines per profile. The logic handles exclusion of +me profiles and maintains state between line1 and line2 parsing. While the implementation appears sound, this type of parsing is fragile and sensitive to nvidia-smi output format changes.

Recommendations for robustness
  1. Add unit tests with sample nvidia-smi outputs covering various GPU types and MIG configurations
  2. Consider adding validation that profiles array is not empty before returning
  3. Log a warning if parsing fails to match expected patterns rather than silently returning empty array

Example validation:

if len(profiles) == 0 {
    glog.Warningf("No MIG profiles parsed from nvidia-smi output. Output was: %s", output)
}
return profiles

Please ensure comprehensive testing of parseMIGProfiles with various GPU hardware and nvidia-smi versions to catch potential parsing issues early.
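A table-style unit test can pin the parsing down against a captured sample. This is a self-contained sketch, not the PR's parseMIGProfiles: the sample lines approximate `nvidia-smi mig -lgip` output, and the regex only covers the first line of each profile entry:

```go
// Sketch of regression-testing MIG profile parsing against sample output,
// including the +me (media extension) exclusion.
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

type migProfile struct {
	Name  string
	Free  int
	Total int
}

// Matches e.g. "MIG 1g.5gb        19     7/7" — name, profile ID, Free/Total.
var profileLine = regexp.MustCompile(`MIG\s+(\d+g\.\d+gb)\s+\d+\s+(\d+)/(\d+)`)

func parseProfiles(output string) []migProfile {
	var profiles []migProfile
	for _, line := range strings.Split(output, "\n") {
		if strings.Contains(line, "+me") { // skip media-extension profiles
			continue
		}
		m := profileLine.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		free, _ := strconv.Atoi(m[2])
		total, _ := strconv.Atoi(m[3])
		profiles = append(profiles, migProfile{Name: m[1], Free: free, Total: total})
	}
	return profiles
}

const sample = `
|   0  MIG 1g.5gb        19     7/7        4.75       No     14     0     0   |
|   0  MIG 1g.5gb+me     20     1/1        4.75       No     14     1     0   |
|   0  MIG 2g.10gb       14     3/3        9.75       No     28     1     0   |`

func main() {
	for _, p := range parseProfiles(sample) {
		fmt.Printf("%s free=%d total=%d\n", p.Name, p.Free, p.Total)
	}
}
```

Adding samples from different GPU models (A100, H100) to such a table would catch format drift before it reaches CI.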


679-707: ExecCmdInPod has robust validation but consider timeout.

The function properly validates pod and container readiness before executing commands. However, the ExecCommand method may hang indefinitely if the command doesn't complete. Consider whether a timeout context would be appropriate for long-running or potentially stuck commands.

Optional: Add command timeout

For production-grade robustness, consider adding a timeout parameter:

func ExecCmdInPod(apiClient *clients.Settings, podName, namespace string, command []string, timeout time.Duration) (string, error) {
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()
    // Use ctx in ExecCommand if the underlying implementation supports it
    ...
}

This would prevent test hangs if nvidia-smi becomes unresponsive.

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9b0fc15 and 724461d.

📒 Files selected for processing (7)
  • README.md (5 hunks)
  • internal/get/get.go (2 hunks)
  • internal/wait/wait.go (3 hunks)
  • pkg/configmap/configmap.go (3 hunks)
  • pkg/mig/mig.go (1 hunks)
  • tests/mig/mig-test.go (1 hunks)
  • tests/nvidiagpu/deploy-gpu-test.go (20 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • internal/get/get.go
  • tests/mig/mig-test.go
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:56.555Z
Learning: In Kubernetes with NVIDIA GPUs, MIG single strategy profile uses the resource name "nvidia.com/gpu", while MIG mixed strategy profile uses specific resource names like "nvidia.com/mig-1g.5gb" based on the MIG profile.
📚 Learning: 2025-12-15T13:26:55.741Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.

Applied to files:

  • README.md
  • internal/wait/wait.go
  • pkg/mig/mig.go
📚 Learning: 2025-12-12T10:59:56.555Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:56.555Z
Learning: In Kubernetes with NVIDIA GPUs, MIG single strategy profile uses the resource name "nvidia.com/gpu", while MIG mixed strategy profile uses specific resource names like "nvidia.com/mig-1g.5gb" based on the MIG profile.

Applied to files:

  • README.md
📚 Learning: 2025-06-18T14:59:40.987Z
Learnt from: wabouhamad
Repo: rh-ecosystem-edge/nvidia-ci PR: 205
File: tests/nvidianetwork/deploy-nno-test.go:1411-1414
Timestamp: 2025-06-18T14:59:40.987Z
Learning: In SRIOV RDMA testcases in tests/nvidianetwork/deploy-nno-test.go, the string "sriov" is intentionally passed as the device parameter to CreateRdmaWorkloadPod instead of the actual mlx5 device ID. The workload container is designed to dynamically discover the correct mlx5 device for SRIOV cases at runtime.

Applied to files:

  • README.md
📚 Learning: 2025-07-20T08:18:03.794Z
Learnt from: TomerNewman
Repo: rh-ecosystem-edge/nvidia-ci PR: 226
File: scripts/gpu-operator-tests-must-gather.sh:1-2
Timestamp: 2025-07-20T08:18:03.794Z
Learning: In the nvidia-ci repository, must-gather scripts should use best-effort execution rather than fail-fast behavior. Unlike workflow scripts that should fail immediately on errors, diagnostic collection scripts should continue running even if one component fails, to maximize the amount of diagnostic data collected. For example, if NFD must-gather fails, the GPU must-gather should still run.

Applied to files:

  • README.md
📚 Learning: 2025-11-09T13:12:43.631Z
Learnt from: empovit
Repo: rh-ecosystem-edge/nvidia-ci PR: 359
File: workflows/gpu_operator_versions/update_versions.py:100-106
Timestamp: 2025-11-09T13:12:43.631Z
Learning: In the nvidia-ci repository's `workflows/gpu_operator_versions/update_versions.py`, the `gpu_releases` variable is intentionally limited to the latest 2 GPU operator versions (via `get_latest_versions(list(gpu_versions.keys()), 2)` in main()). For maintenance OCP versions with pinned GPU operators, if the pinned version is not among the latest 2 releases, the test is intentionally skipped with a warning rather than falling back to test with any available version. This is because testing with outdated GPU operators would fail in their infrastructure, and they prefer to notice missing tests rather than deal with failed test runs, especially since GitHub organization repositories don't send notification emails about failed workflows.

Applied to files:

  • README.md
📚 Learning: 2025-12-12T10:59:47.614Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:47.614Z
Learning: When coding Kubernetes/NVIDIA MIG handling in Go, distinguish MIG profiles by their resource names: use 'nvidia.com/gpu' for the single MIG strategy, and use the MIG-profile-specific resources such as 'nvidia.com/mig-1g.5gb' for the mixed strategy. Implement checks or constants that reference these exact resource names, and avoid assuming a single generic resource name. Consider centralizing these names (e.g., constants or a small map) and unit-test logic that validates resource-name selection for each MIG profile to prevent misclassification.

Applied to files:

  • internal/wait/wait.go
  • pkg/configmap/configmap.go
  • tests/nvidiagpu/deploy-gpu-test.go
  • pkg/mig/mig.go
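The centralization suggested in this learning can be as small as one function. A sketch (`migResourceName` is a hypothetical name; the strategy values and resource names come from the learning itself):

```go
// Sketch: single source of truth for MIG resource-name selection.
package main

import "fmt"

// migResourceName maps a MIG strategy to the Kubernetes resource name a
// workload must request: "single" exposes nvidia.com/gpu, "mixed" exposes a
// profile-specific resource such as nvidia.com/mig-1g.5gb.
func migResourceName(strategy, profile string) (string, error) {
	switch strategy {
	case "single":
		return "nvidia.com/gpu", nil
	case "mixed":
		return "nvidia.com/mig-" + profile, nil
	default:
		return "", fmt.Errorf("invalid MIG strategy %q (want \"single\" or \"mixed\")", strategy)
	}
}

func main() {
	for _, tc := range []struct{ strategy, profile string }{
		{"single", "1g.5gb"},
		{"mixed", "1g.5gb"},
	} {
		name, err := migResourceName(tc.strategy, tc.profile)
		fmt.Println(name, err)
	}
}
```

Rejecting unknown strategies at this choke point also prevents the invalid-label-value class of bugs called out in the cleanup learnings above.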
📚 Learning: 2025-12-15T13:26:50.784Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:50.784Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, during cleanup of MIG labels nvidia.com/mig.strategy and nvidia.com/mig.config on GPU nodes after tests, use RemoveLabel to remove these labels instead of assigning invalid values. Only valid mig.strategy values are 'single' or 'mixed'.

Applied to files:

  • tests/nvidiagpu/deploy-gpu-test.go
📚 Learning: 2025-03-09T17:13:24.077Z
Learnt from: empovit
Repo: rh-ecosystem-edge/nvidia-ci PR: 81
File: workflows/nvidia_gpu_operator.py:10-33
Timestamp: 2025-03-09T17:13:24.077Z
Learning: In the nvidia-ci repository, the API interaction functions should use fail-fast behavior rather than graceful error handling. The workflow should fail explicitly if APIs change or don't return expected data (like authentication tokens or tags), as this makes problems immediately visible.

Applied to files:

  • pkg/mig/mig.go
🧬 Code graph analysis (2)
internal/wait/wait.go (2)
pkg/nodes/node.go (1)
  • Pull (119-138)
pkg/nodes/list.go (1)
  • List (20-58)
pkg/configmap/configmap.go (5)
pkg/namespace/namespace.go (1)
  • Builder (23-32)
pkg/nfd/nodefeaturediscovery.go (1)
  • Builder (22-32)
pkg/deployment/deployment.go (1)
  • Builder (25-34)
pkg/nvidiagpu/clusterpolicy.go (1)
  • Builder (23-33)
pkg/pod/pod.go (1)
  • Builder (33-42)
🪛 markdownlint-cli2 (0.18.1)
README.md

161-161: Heading style
Expected: setext; Actual: atx

(MD003, heading-style)


167-167: Heading style
Expected: setext; Actual: atx

(MD003, heading-style)


170-170: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


171-171: Dollar signs used before commands without showing output

(MD014, commands-show-output)


172-172: Dollar signs used before commands without showing output

(MD014, commands-show-output)


173-173: Dollar signs used before commands without showing output

(MD014, commands-show-output)


174-174: Dollar signs used before commands without showing output

(MD014, commands-show-output)


175-175: Dollar signs used before commands without showing output

(MD014, commands-show-output)


176-176: Dollar signs used before commands without showing output

(MD014, commands-show-output)


177-177: Dollar signs used before commands without showing output

(MD014, commands-show-output)


178-178: Dollar signs used before commands without showing output

(MD014, commands-show-output)


179-179: Dollar signs used before commands without showing output

(MD014, commands-show-output)


180-180: Dollar signs used before commands without showing output

(MD014, commands-show-output)


186-186: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


187-187: Dollar signs used before commands without showing output

(MD014, commands-show-output)


188-188: Dollar signs used before commands without showing output

(MD014, commands-show-output)


189-189: Dollar signs used before commands without showing output

(MD014, commands-show-output)


190-190: Dollar signs used before commands without showing output

(MD014, commands-show-output)


191-191: Dollar signs used before commands without showing output

(MD014, commands-show-output)


192-192: Dollar signs used before commands without showing output

(MD014, commands-show-output)


193-193: Dollar signs used before commands without showing output

(MD014, commands-show-output)


194-194: Dollar signs used before commands without showing output

(MD014, commands-show-output)


195-195: Dollar signs used before commands without showing output

(MD014, commands-show-output)


196-196: Dollar signs used before commands without showing output

(MD014, commands-show-output)


199-199: Heading style
Expected: setext; Actual: atx

(MD003, heading-style)


203-203: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


204-204: Dollar signs used before commands without showing output

(MD014, commands-show-output)


207-207: Heading style
Expected: setext; Actual: atx

(MD003, heading-style)


210-210: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (8)
pkg/configmap/configmap.go (1)

146-177: LGTM! Clean implementation of WaitUntilDeleted.

The new WaitUntilDeleted method follows established patterns from similar builders (pod, namespace, deployment) and properly validates, polls, and handles errors. The use of wait.PollUntilContextTimeout with IsNotFound check is the standard approach for resource deletion verification.

internal/wait/wait.go (2)

51-82: LGTM! ClusterPolicyNotReady properly mirrors ClusterPolicyReady.

The new function correctly polls for the "notReady" state transition, following the same pattern as the existing ClusterPolicyReady function. Error handling and logging are appropriate.


143-171: LGTM! NodeLabelExists implements proper label polling.

The function correctly uses nodes.List with a LabelSelector and iterates through nodes to find a matching label key/value pair. The polling logic and error handling are sound.

README.md (1)

85-206: LGTM! Comprehensive MIG testing documentation.

The new MIG testing section provides clear step-by-step instructions, environment variable documentation, and concrete examples. The documentation properly explains the workflow and aligns well with the code changes in this PR.

Note: The markdownlint warnings flagged by static analysis are minor style issues (heading style, fenced code language specification, dollar signs in commands) that don't impact readability and can be addressed separately if desired.

pkg/mig/mig.go (3)

60-191: LGTM! TestSingleMIGGPUWorkload provides clear MIG test orchestration.

The main test function properly orchestrates the MIG testing workflow with clear phases:

  1. MIG capability validation
  2. Resource cleanup
  3. MIG profile selection
  4. Node labeling and ClusterPolicy readiness
  5. Workload deployment and validation

The use of helper functions (CleanupWorkloadResources, SelectMigProfile, SetMIGLabelsOnNodes, etc.) makes the flow readable and maintainable. Error handling with Expect is appropriate for test code, and the defer block properly handles cleanup.


195-284: LGTM! Cleanup helper functions follow consistent patterns.

The cleanup functions (cleanupClusterPolicy, cleanupCSV, cleanupSubscription, cleanupOperatorGroup, cleanupGPUOperatorNamespace, cleanupGPUBurnNamespace) properly check for resource existence before deletion and handle errors appropriately. The pattern of Pull/Check/Delete is consistent and safe.


287-311: ShouldKeepOperator label checking logic is clear.

The function correctly retrieves the Ginkgo label filter and checks if any of the specified labels are present. The logging is verbose enough for debugging. The implementation properly handles the case where tests with specific labels (like "operator-upgrade", "gpu-burn-mig", "single-mig") should preserve the operator state.

tests/nvidiagpu/deploy-gpu-test.go (1)

1388-1388: "all-disabled" is a valid MIG configuration profile and the correct value for this test setup.

"all-disabled" is a valid and default MIG configuration for disabling MIG on nodes. This code correctly sets the MIG configuration label as part of test setup to establish a known baseline state before running tests.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
pkg/pod/pod.go (1)

284-314: Good improvement to context propagation.

The change on line 295 correctly uses the context provided by the polling function instead of context.TODO(), enabling proper cancellation and timeout handling.

However, similar context handling issues exist in other Wait methods that should be addressed for consistency:

  • Line 274 in WaitUntilInStatus
  • Line 340 in WaitUntilCondition

Both still use context.TODO() instead of the ctx parameter provided by their respective wait.PollUntilContextTimeout calls.

🔎 Proposed fix for consistent context usage

For WaitUntilInStatus method:

 return wait.PollUntilContextTimeout(
   context.TODO(), time.Second, timeout, true, func(ctx context.Context) (bool, error) {
     updatePod, err := builder.apiClient.Pods(builder.Object.Namespace).Get(
-      context.TODO(), builder.Object.Name, metav1.GetOptions{})
+      ctx, builder.Object.Name, metav1.GetOptions{})
     if err != nil {
       return false, nil
     }

For WaitUntilCondition method:

 return wait.PollUntilContextTimeout(
   context.TODO(), time.Second, timeout, true, func(ctx context.Context) (bool, error) {
     updatePod, err := builder.apiClient.Pods(builder.Object.Namespace).Get(
-      context.TODO(), builder.Object.Name, metav1.GetOptions{})
+      ctx, builder.Object.Name, metav1.GetOptions{})
     if err != nil {
       return false, nil
     }
📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 724461d and 313a442.

📒 Files selected for processing (4)
  • pkg/mig/mig.go (1 hunks)
  • pkg/nvidiagpu/consts.go (2 hunks)
  • pkg/pod/pod.go (1 hunks)
  • tests/nvidiagpu/deploy-gpu-test.go (20 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:56.555Z
Learning: In Kubernetes with NVIDIA GPUs, MIG single strategy profile uses the resource name "nvidia.com/gpu", while MIG mixed strategy profile uses specific resource names like "nvidia.com/mig-1g.5gb" based on the MIG profile.
📚 Learning: 2025-12-15T13:26:50.784Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:50.784Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, during cleanup of MIG labels nvidia.com/mig.strategy and nvidia.com/mig.config on GPU nodes after tests, use RemoveLabel to remove these labels instead of assigning invalid values. Only valid mig.strategy values are 'single' or 'mixed'.

Applied to files:

  • tests/nvidiagpu/deploy-gpu-test.go
📚 Learning: 2025-12-12T10:59:47.614Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:47.614Z
Learning: When coding Kubernetes/NVIDIA MIG handling in Go, distinguish MIG profiles by their resource names: use 'nvidia.com/gpu' for the single MIG strategy, and use the MIG-profile-specific resources such as 'nvidia.com/mig-1g.5gb' for the mixed strategy. Implement checks or constants that reference these exact resource names, and avoid assuming a single generic resource name. Consider centralizing these names (e.g., constants or a small map) and unit-test logic that validates resource-name selection for each MIG profile to prevent misclassification.

Applied to files:

  • tests/nvidiagpu/deploy-gpu-test.go
  • pkg/pod/pod.go
  • pkg/nvidiagpu/consts.go
  • pkg/mig/mig.go
📚 Learning: 2025-12-15T13:26:55.741Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.

Applied to files:

  • pkg/nvidiagpu/consts.go
  • pkg/mig/mig.go
📚 Learning: 2025-03-09T17:13:24.077Z
Learnt from: empovit
Repo: rh-ecosystem-edge/nvidia-ci PR: 81
File: workflows/nvidia_gpu_operator.py:10-33
Timestamp: 2025-03-09T17:13:24.077Z
Learning: In the nvidia-ci repository, the API interaction functions should use fail-fast behavior rather than graceful error handling. The workflow should fail explicitly if APIs change or don't return expected data (like authentication tokens or tags), as this makes problems immediately visible.

Applied to files:

  • pkg/mig/mig.go
🧬 Code graph analysis (1)
tests/nvidiagpu/deploy-gpu-test.go (1)
pkg/mig/mig.go (1)
  • ShouldKeepOperator (283-307)
🔇 Additional comments (6)
pkg/nvidiagpu/consts.go (1)

66-70: LGTM! Well-structured constants for MIG polling operations.

The new time constants for ClusterPolicy readiness checks and label validation are properly defined and follow the existing naming conventions in this file. The intervals (15s) and timeouts (3min) are reasonable for polling MIG-related state transitions.

tests/nvidiagpu/deploy-gpu-test.go (2)

151-162: LGTM! Proper cleanup gating logic.

The integration of labelsToCheck and mig.ShouldKeepOperator provides fine-grained control over cleanup based on test labels. This prevents premature cleanup when running operator-upgrade or MIG-related tests.


1122-1232: LGTM! Well-structured cleanup helpers.

The cleanup functions are properly decomposed and each checks for resource existence before deletion. Using Expect assertions is appropriate for test code to ensure cleanup failures are visible.

pkg/mig/mig.go (3)

589-596: LGTM! Proper NO_COLOR support.

The colorLog function correctly respects the NO_COLOR environment variable convention, which addresses concerns about ANSI escape codes in CI/Prow logs. When NO_COLOR is set, plain text is returned; otherwise, colored output is generated.


659-707: LGTM! Robust command execution with proper timeout handling.

The ExecCmdInPod function implements proper timeout handling using goroutines and channels, validates pod/container state before execution, and provides clear error messages. The defensive checks (pod running, container ready, non-empty output) prevent common failure modes.


709-786: LGTM! Comprehensive MIG profile parsing.

The parseMIGProfiles function correctly parses the two-line format from nvidia-smi mig -lgip output using appropriate regex patterns. The exclusion of +me (media extension) profiles and the handling of both lines to populate MIGProfileInfo structs is well-implemented.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (3)
pkg/mig/mig.go (2)

397-410: Remove contradictory log message.

Line 403 logs "NOT setting or removing MIG strategy label on node" immediately after successfully setting the nvidia.com/mig.strategy label on lines 399-402. This contradictory message is confusing.


415-437: Remove mig.strategy label during cleanup.

ResetMIGLabelsToDisabled sets nvidia.com/mig.config to "all-disabled" but doesn't remove the nvidia.com/mig.strategy label. Based on learnings, during cleanup you should remove the mig.strategy label entirely since the only valid values are "single" or "mixed".

tests/nvidiagpu/deploy-gpu-test.go (1)

1376-1402: Remove mig.strategy label during cleanup.

The cleanup defer sets nvidia.com/mig.config to "all-disabled" but doesn't remove the nvidia.com/mig.strategy label. Based on learnings, during cleanup you should remove the mig.strategy label entirely using RemoveLabel method.

🧹 Nitpick comments (7)
pkg/mig/mig.go (3)

583-596: Remove commented-out code.

The commented-out colorLog function (lines 585-587) should be removed since there's a working replacement below it.

🔎 Suggested fix
 // colorLog returns a colored log message with ANSI escape codes
 // Usage: glog.V(level).Infof("%s", colorLog(colorCyan+colorBold, "Your message"))
-// func colorLog(color, message string) string {
-// 	return fmt.Sprintf("%s%s%s", color, message, colorReset)
-// }
 
 var useColors = os.Getenv("NO_COLOR") == ""
 
 func colorLog(color, message string) string {

692-706: Potential goroutine leak on timeout.

When the context times out (lines 698-699), the goroutine spawned at line 692 keeps running because podBuilder.ExecCommand cannot be cancelled; it will eventually complete, but its result is discarded. This is a minor resource concern, though acceptable given the test context.

Consider adding a comment documenting this behavior, or using a buffered channel to ensure the goroutine can always send its result:

// Note: On timeout, the spawned goroutine continues until ExecCommand completes,
// but its result is discarded. This is acceptable in test contexts.

722-724: Minor: Remove redundant type declaration.

The explicit bool type on line 723 is redundant since it's inferred from the initialization value.

🔎 Suggested fix
 	excludeRegex := regexp.MustCompile(`\|\s+\d+\s+MIG\s+\d+g\.\d+gb\+me`)
 	flavor := "gpu"
-	var exclude bool = true
+	exclude := true
tests/nvidiagpu/deploy-gpu-test.go (4)

1137-1193: Cleanup functions will fail if resources don't exist.

Unlike the cleanup functions in pkg/mig/mig.go (which check if err == nil && builder.Exists() before deleting), these cleanup functions use Expect(err).ToNot(HaveOccurred()) after pulling resources, causing the test to fail if resources don't exist. This is fragile for cleanup scenarios where resources may have already been deleted or never created.

Consider adding existence checks or using the same pattern as pkg/mig/mig.go:

🔎 Suggested fix for cleanupClusterPolicy
 func cleanupClusterPolicy() {
 	By("Deleting ClusterPolicy")
-	// Since this is out of defer functions, the ClusterPolicy need to be pulled before deleting it.
 	clusterPolicyBuilder, err := nvidiagpu.Pull(inittools.APIClient, nvidiagpu.ClusterPolicyName)
-	Expect(err).ToNot(HaveOccurred(), "Error pulling ClusterPolicy: %v", err)
-	_, err = clusterPolicyBuilder.Delete()
-	Expect(err).ToNot(HaveOccurred(), "Error deleting ClusterPolicy: %v", err)
-	glog.V(gpuparams.GpuLogLevel).Infof("ClusterPolicy deleted successfully")
+	if err == nil && clusterPolicyBuilder.Exists() {
+		_, err = clusterPolicyBuilder.Delete()
+		Expect(err).ToNot(HaveOccurred(), "Error deleting ClusterPolicy: %v", err)
+		glog.V(gpuparams.GpuLogLevel).Infof("ClusterPolicy deleted successfully")
+	} else {
+		glog.V(gpuparams.GpuLogLevel).Infof("ClusterPolicy not found or already deleted")
+	}
 }

Apply similar pattern to cleanupSubscription, cleanupOperatorGroup, and cleanupGPUOperatorNamespace.


1316-1354: Code duplication with pkg/mig/mig.go:SetMIGLabelsOnNodes.

This node labeling logic (lines 1326-1354) duplicates functionality from SetMIGLabelsOnNodes in pkg/mig/mig.go. Consider refactoring to use the shared function from the mig package, or consolidating the improved logic (which checks existing label values) into the mig package and reusing it here.

The logic here is actually more thorough (checking if label exists with correct value before updating), so consider enhancing mig.SetMIGLabelsOnNodes with this logic and calling it from both places.


1508-1516: Consider placing sleep after error check.

The time.Sleep on line 1513 runs before the Expect check on line 1514. If deletion fails, the test will wait 15 seconds before the assertion fails. Consider moving the sleep after the assertion for faster failure detection.

🔎 Suggested fix
 	defer func() {
 		defer GinkgoRecover()
 		glog.V(gpuparams.Gpu100LogLevel).Infof("defer11 (configmapBuilder) sleeping for 15 seconds")
 		if cleanupAfterTest {
 			err := configmapBuilder.Delete()
-			time.Sleep(time.Second * 15)
 			Expect(err).ToNot(HaveOccurred())
+			time.Sleep(time.Second * 15)
 		}
 	}()

1577-1591: Remove commented-out code.

Line 1582 and 1586 contain commented-out code that should be removed for clarity.

🔎 Suggested fix
 	// Need to add checking for other possible GPU's
 	By("Parse the gpu-burn pod logs and check for successful execution with MIG")
-	// migCapabilities[useMigIndex].Available
 	for i := 0; i < migCapabilities[useMigIndex].Total; i++ {
 		match1Mig := strings.Contains(gpuBurnMigLogs, fmt.Sprintf("GPU %d: OK", i))
-		// match2Mig := strings.Contains(gpuBurnMigLogs, fmt.Sprintf("100.0%  proc'd: %d", i))
 		glog.V(gpuparams.Gpu10LogLevel).Infof("Checking if GPU %d: OK is present in logs: %v", i, match1Mig)
 		Expect(match1Mig).ToNot(BeFalse(), "gpu-burn pod execution with MIG was FAILED")
 	}
-	// match1Mig := strings.Contains(gpuBurnMigLogs, "GPU 0: OK")
 	match2Mig := strings.Contains(gpuBurnMigLogs, "100.0%  proc'd:")
📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 313a442 and 7a48571.

📒 Files selected for processing (2)
  • pkg/mig/mig.go (1 hunks)
  • tests/nvidiagpu/deploy-gpu-test.go (20 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:56.555Z
Learning: In Kubernetes with NVIDIA GPUs, MIG single strategy profile uses the resource name "nvidia.com/gpu", while MIG mixed strategy profile uses specific resource names like "nvidia.com/mig-1g.5gb" based on the MIG profile.
📚 Learning: 2025-12-15T13:26:50.784Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:50.784Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, during cleanup of MIG labels nvidia.com/mig.strategy and nvidia.com/mig.config on GPU nodes after tests, use RemoveLabel to remove these labels instead of assigning invalid values. Only valid mig.strategy values are 'single' or 'mixed'.

Applied to files:

  • tests/nvidiagpu/deploy-gpu-test.go
📚 Learning: 2025-12-12T10:59:47.614Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:47.614Z
Learning: When coding Kubernetes/NVIDIA MIG handling in Go, distinguish MIG profiles by their resource names: use 'nvidia.com/gpu' for the single MIG strategy, and use the MIG-profile-specific resources such as 'nvidia.com/mig-1g.5gb' for the mixed strategy. Implement checks or constants that reference these exact resource names, and avoid assuming a single generic resource name. Consider centralizing these names (e.g., constants or a small map) and unit-test logic that validates resource-name selection for each MIG profile to prevent misclassification.

Applied to files:

  • tests/nvidiagpu/deploy-gpu-test.go
  • pkg/mig/mig.go
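The learning above suggests centralizing the strategy-to-resource-name mapping. A minimal sketch of such a helper (the name `migResourceName` is hypothetical, not part of the real `pkg/mig` API):

```go
package main

import "fmt"

// migResourceName returns the Kubernetes extended-resource name to request
// for a given MIG strategy and profile. Under the "single" strategy all
// slices are exposed as the generic "nvidia.com/gpu" resource; under the
// "mixed" strategy each profile gets its own resource name, e.g.
// "nvidia.com/mig-1g.5gb".
func migResourceName(strategy, profile string) (string, error) {
	switch strategy {
	case "single":
		return "nvidia.com/gpu", nil
	case "mixed":
		return "nvidia.com/mig-" + profile, nil
	default:
		return "", fmt.Errorf("unknown MIG strategy %q (want \"single\" or \"mixed\")", strategy)
	}
}

func main() {
	single, _ := migResourceName("single", "1g.5gb")
	mixed, _ := migResourceName("mixed", "1g.5gb")
	fmt.Println(single) // nvidia.com/gpu
	fmt.Println(mixed)  // nvidia.com/mig-1g.5gb
}
```

Centralizing the names like this makes the selection logic unit-testable and prevents the misclassification the learning warns about.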
📚 Learning: 2025-12-15T13:26:55.741Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.

Applied to files:

  • pkg/mig/mig.go
📚 Learning: 2025-03-09T17:13:24.077Z
Learnt from: empovit
Repo: rh-ecosystem-edge/nvidia-ci PR: 81
File: workflows/nvidia_gpu_operator.py:10-33
Timestamp: 2025-03-09T17:13:24.077Z
Learning: In the nvidia-ci repository, the API interaction functions should use fail-fast behavior rather than graceful error handling. The workflow should fail explicitly if APIs change or don't return expected data (like authentication tokens or tags), as this makes problems immediately visible.

Applied to files:

  • pkg/mig/mig.go
🧬 Code graph analysis (2)
tests/nvidiagpu/deploy-gpu-test.go (8)
pkg/mig/mig.go (2)
  • ShouldKeepOperator (283-307)
  • MIGProfiles (601-636)
internal/get/get.go (1)
  • GetFirstPodNameWithLabel (56-71)
internal/networkparams/const.go (1)
  • Label (5-5)
tests/nvidianetwork/deploy-nno-test.go (1)
  • WorkerNodeSelector (41-44)
internal/gpuparams/gpuvars.go (1)
  • Labels (5-5)
internal/networkparams/networkvars.go (1)
  • Labels (5-5)
internal/tsparams/gpu-vars.go (1)
  • Labels (11-11)
internal/wait/wait.go (2)
  • ClusterPolicyNotReady (52-82)
  • ClusterPolicyReady (20-49)
pkg/mig/mig.go (8)
internal/wait/wait.go (1)
  • NodeLabelExists (144-171)
pkg/configmap/configmap.go (3)
  • Pull (36-69)
  • NewBuilder (72-103)
  • Builder (21-30)
pkg/nodes/node.go (2)
  • Pull (119-138)
  • Builder (30-36)
pkg/pod/pod.go (3)
  • Pull (91-130)
  • NewBuilder (48-88)
  • Builder (33-42)
vendor/golang.org/x/net/html/atom/table.go (2)
  • Object (190-190)
  • Name (181-181)
internal/gpu-burn/gpu-burn.go (1)
  • CreateGPUBurnConfigMap (38-59)
pkg/olm/operatorgroup.go (1)
  • PullOperatorGroup (142-176)
internal/get/get.go (1)
  • GetClusterArchitecture (74-101)
🔇 Additional comments (3)
pkg/mig/mig.go (2)

189-208: LGTM!

The cleanup orchestration is well-structured with individual helper functions for each resource type, making the code modular and maintainable.


498-532: LGTM!

The DeployGPUWorkload function properly creates the MIG-configured pod, creates it via the API client, and retrieves it for further operations. Good separation of concerns.

tests/nvidiagpu/deploy-gpu-test.go (1)

1111-1119: LGTM!

The two MIG test cases provide good flexibility - one for in-file testing and one delegating to the mig package. The cleanup flag is correctly computed for the package-level test.

Comment on lines 262 to 272
AfterAll(func() {

glog.V(gpuparams.Gpu10LogLevel).Infof("cleanup in AfterAll")
if nfdInstance.CleanupAfterInstall && cleanupAfterTest {
err := nfd.Cleanup(inittools.APIClient)
Expect(err).ToNot(HaveOccurred(), "Error cleaning up NFD resources: %v", err)
}
// Cleanup GPU Operator Resources, if requested
if cleanupAfterTest {
cleanupGPUOperatorResources()
}
})
Contributor

@coderabbitai coderabbitai bot Dec 20, 2025


⚠️ Potential issue | 🟡 Minor

Inconsistent cleanup logic: AfterAll doesn't check ShouldKeepOperator.

The defer blocks throughout the test check cleanupAfterTest && !mig.ShouldKeepOperator(labelsToCheck) before cleanup, but AfterAll at line 270 only checks cleanupAfterTest. This could cause the GPU Operator resources to be cleaned up even when specific labels indicate they should be kept.

🔎 Suggested fix
 		AfterAll(func() {
 			glog.V(gpuparams.Gpu10LogLevel).Infof("cleanup in AfterAll")
 			if nfdInstance.CleanupAfterInstall && cleanupAfterTest {
 				err := nfd.Cleanup(inittools.APIClient)
 				Expect(err).ToNot(HaveOccurred(), "Error cleaning up NFD resources: %v", err)
 			}
 			// Cleanup GPU Operator Resources, if requested
-			if cleanupAfterTest {
+			if cleanupAfterTest && !mig.ShouldKeepOperator(labelsToCheck) {
 				cleanupGPUOperatorResources()
 			}
 		})
🤖 Prompt for AI Agents
In tests/nvidiagpu/deploy-gpu-test.go around lines 262 to 272, the AfterAll
block only checks cleanupAfterTest before calling cleanupGPUOperatorResources(),
which is inconsistent with other defer blocks that also skip cleanup when
mig.ShouldKeepOperator(labelsToCheck) is true; update the AfterAll logic to
mirror the defers by calling cleanupGPUOperatorResources() only when
cleanupAfterTest && !mig.ShouldKeepOperator(labelsToCheck) so operator resources
are preserved when labels indicate they should be kept.

Collaborator Author

This is the desired outcome. cleanupAfterTest is set from an environment variable, and it supersedes any other setting at the very end of the execution.

Contributor

@vtruhpon, thank you for the clarification! I understand now that this is an intentional design decision. The cleanupAfterTest environment variable is meant to be the final authority in the AfterAll block, superseding the label-based logic that applies during test execution. This makes sense for ensuring a clean environment at the very end when explicitly requested via the environment variable.

I've noted this design pattern for future reviews. Thanks for explaining the rationale!


✏️ Learnings added
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:262-272
Timestamp: 2025-12-21T00:43:32.955Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, the AfterAll block intentionally uses only cleanupAfterTest (without checking mig.ShouldKeepOperator) when calling cleanupGPUOperatorResources(). This is by design: the cleanupAfterTest environment variable (NVIDIAGPU_CLEANUP) supersedes label-based cleanup logic at the very end of test execution, while defer blocks during test execution check both cleanupAfterTest && !mig.ShouldKeepOperator(labelsToCheck).

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.

Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:56.555Z
Learning: In Kubernetes with NVIDIA GPUs, MIG single strategy profile uses the resource name "nvidia.com/gpu", while MIG mixed strategy profile uses specific resource names like "nvidia.com/mig-1g.5gb" based on the MIG profile.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
tests/nvidiagpu/deploy-gpu-test.go (1)

1384-1410: Remove mig.strategy label during cleanup.

The cleanup defer block sets nvidia.com/mig.config to "all-disabled" but does not remove the nvidia.com/mig.strategy label. Line 1390 logs "NOT setting or removing MIG strategy label", which describes the current behavior but may not be the desired cleanup pattern.

Based on learnings, during cleanup the mig.strategy label should be removed entirely since the only valid values are "single" or "mixed". Consider adding a RemoveLabel call for nvidia.com/mig.strategy before setting mig.config.

🔎 Suggested fix
 	defer func() {
 		defer GinkgoRecover()
 		glog.V(gpuparams.Gpu100LogLevel).Infof("Setting MIG strategy label on GPU worker nodes")
 		nodeBuilders, err := nodes.List(inittools.APIClient, metav1.ListOptions{LabelSelector: labels.Set(WorkerNodeSelector).String()})
 		Expect(err).ToNot(HaveOccurred(), "Error listing worker nodes: %v", err)
 		for _, nodeBuilder := range nodeBuilders {
-			glog.V(gpuparams.Gpu100LogLevel).Infof("NOT setting or removing MIG strategy label on node '%s')", nodeBuilder.Definition.Name)
+			glog.V(gpuparams.Gpu100LogLevel).Infof("Removing MIG strategy label from node '%s'", nodeBuilder.Definition.Name)
+			nodeBuilder = nodeBuilder.RemoveLabel("nvidia.com/mig.strategy", "")
+			_, err = nodeBuilder.Update()
+			Expect(err).ToNot(HaveOccurred(), "Error removing MIG strategy label from node '%s': %v", nodeBuilder.Definition.Name, err)
 
 			glog.V(gpuparams.Gpu100LogLevel).Infof("Setting MIG configuration label on node '%s' (overwrite=true)", nodeBuilder.Definition.Name)
 			nodeBuilder = nodeBuilder.WithLabel("nvidia.com/mig.config", "all-disabled")

Based on learnings, use RemoveLabel for mig.strategy during cleanup.

🧹 Nitpick comments (1)
pkg/mig/mig.go (1)

38-44: Optional: Fix typo in color constants comment.

Line 39 has a minor typo in the comment: /033[31m should be \033[31m (backslash instead of forward slash) for consistency with the actual ANSI escape codes used.

🔎 Suggested fix
 // ANSI color constants for console output highlighting
-// colors are /033[31m - red through /033[37m - white
+// colors are \033[31m - red through \033[37m - white
 const (
 	colorReset = "\033[0m"
 	colorCyan  = "\033[36m"
 	colorBold  = "\033[1m"
 )
📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7a48571 and 6c6f0b9.

📒 Files selected for processing (2)
  • pkg/mig/mig.go (1 hunks)
  • tests/nvidiagpu/deploy-gpu-test.go (20 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:56.555Z
Learning: In Kubernetes with NVIDIA GPUs, MIG single strategy profile uses the resource name "nvidia.com/gpu", while MIG mixed strategy profile uses specific resource names like "nvidia.com/mig-1g.5gb" based on the MIG profile.
📚 Learning: 2025-12-15T13:26:50.784Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:50.784Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, during cleanup of MIG labels nvidia.com/mig.strategy and nvidia.com/mig.config on GPU nodes after tests, use RemoveLabel to remove these labels instead of assigning invalid values. Only valid mig.strategy values are 'single' or 'mixed'.

Applied to files:

  • tests/nvidiagpu/deploy-gpu-test.go
📚 Learning: 2025-12-21T00:43:32.955Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:262-272
Timestamp: 2025-12-21T00:43:32.955Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, the AfterAll block intentionally uses only cleanupAfterTest (without checking mig.ShouldKeepOperator) when calling cleanupGPUOperatorResources(). This is by design: the cleanupAfterTest environment variable (NVIDIAGPU_CLEANUP) supersedes label-based cleanup logic at the very end of test execution, while defer blocks during test execution check both cleanupAfterTest && !mig.ShouldKeepOperator(labelsToCheck).

Applied to files:

  • tests/nvidiagpu/deploy-gpu-test.go
  • pkg/mig/mig.go
📚 Learning: 2025-12-12T10:59:47.614Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:47.614Z
Learning: When coding Kubernetes/NVIDIA MIG handling in Go, distinguish MIG profiles by their resource names: use 'nvidia.com/gpu' for the single MIG strategy, and use the MIG-profile-specific resources such as 'nvidia.com/mig-1g.5gb' for the mixed strategy. Implement checks or constants that reference these exact resource names, and avoid assuming a single generic resource name. Consider centralizing these names (e.g., constants or a small map) and unit-test logic that validates resource-name selection for each MIG profile to prevent misclassification.

Applied to files:

  • tests/nvidiagpu/deploy-gpu-test.go
  • pkg/mig/mig.go
📚 Learning: 2025-12-15T13:26:55.741Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.

Applied to files:

  • pkg/mig/mig.go
📚 Learning: 2025-03-09T17:13:24.077Z
Learnt from: empovit
Repo: rh-ecosystem-edge/nvidia-ci PR: 81
File: workflows/nvidia_gpu_operator.py:10-33
Timestamp: 2025-03-09T17:13:24.077Z
Learning: In the nvidia-ci repository, the API interaction functions should use fail-fast behavior rather than graceful error handling. The workflow should fail explicitly if APIs change or don't return expected data (like authentication tokens or tags), as this makes problems immediately visible.

Applied to files:

  • pkg/mig/mig.go
🧬 Code graph analysis (1)
tests/nvidiagpu/deploy-gpu-test.go (5)
pkg/mig/mig.go (1)
  • ShouldKeepOperator (282-306)
pkg/nodes/node.go (2)
  • Pull (119-138)
  • Builder (30-36)
pkg/nvidiagpu/clusterpolicy.go (2)
  • Pull (127-152)
  • Builder (23-33)
internal/get/get.go (1)
  • GetFirstPodNameWithLabel (56-71)
internal/wait/wait.go (1)
  • NodeLabelExists (144-171)
🔇 Additional comments (1)
pkg/mig/mig.go (1)

413-436: No changes needed. The function correctly sets nvidia.com/mig.config to "all-disabled", which is the valid and official way to disable MIG on nodes per NVIDIA GPU Operator documentation. The mig.strategy label management is handled at the ClusterPolicy level rather than node labels, so the function's approach is consistent with NVIDIA's recommended patterns.

@vtruhpon vtruhpon requested a review from empovit January 5, 2026 08:12
@empovit
Member

empovit commented Jan 6, 2026

Thank you @vtruhpon! That's a lot of hard work.

@empovit
Member

empovit commented Jan 6, 2026

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jan 6, 2026
@openshift-ci

openshift-ci bot commented Jan 6, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: empovit, vtruhpon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@empovit
Member

empovit commented Jan 6, 2026

/hold

Looks like the PR needs rebasing. Also, let's squash the commits.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
internal/gpu-burn/gpu-burn.go (1)

61-63: Remove the unused timeout parameter.

The timeout parameter is declared but never used in the function body. This differs from CreateGPUBurnPod which also doesn't use its timeout parameter, but carrying forward this pattern adds confusion.

♻️ Proposed fix
-func CreateGPUBurnPodWithMIG(apiClient *clients.Settings, podName, podNamespace string,
-	gpuBurnImage string, migProfile string, migCount int, timeout time.Duration) (*corev1.Pod, error) {
+func CreateGPUBurnPodWithMIG(apiClient *clients.Settings, podName, podNamespace string,
+	gpuBurnImage string, migProfile string, migCount int) (*corev1.Pod, error) {
README.md (1)

162-207: Optional markdown formatting improvements.

Static analysis flagged several minor formatting issues in the new MIG testing section:

  • Lines 162, 168, 200, 208: Heading style inconsistency (use setext style for consistency with the rest of the doc)
  • Lines 171-181, 187-197, 204-205: Code blocks missing language specifiers; commands prefixed with dollar signs without showing output

These are purely cosmetic and don't affect functionality. Consider addressing them if you're updating the formatting across the entire README.

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6c6f0b9 and 6866d2c.

📒 Files selected for processing (15)
  • README.md
  • internal/get/get.go
  • internal/gpu-burn/gpu-burn.go
  • internal/gpuparams/const.go
  • internal/nvidiagpuconfig/config.go
  • internal/tsparams/mig-vars.go
  • internal/wait/wait.go
  • pkg/configmap/configmap.go
  • pkg/mig/mig.go
  • pkg/nodes/node.go
  • pkg/nvidiagpu/consts.go
  • pkg/pod/pod.go
  • tests/mig/mig-test.go
  • tests/mig/mig_suite_test.go
  • tests/nvidiagpu/deploy-gpu-test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/pod/pod.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • internal/get/get.go
  • pkg/nvidiagpu/consts.go
  • internal/gpuparams/const.go
  • pkg/nodes/node.go
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:56.555Z
Learning: In Kubernetes with NVIDIA GPUs, MIG single strategy profile uses the resource name "nvidia.com/gpu", while MIG mixed strategy profile uses specific resource names like "nvidia.com/mig-1g.5gb" based on the MIG profile.
📚 Learning: 2025-12-12T10:59:47.614Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:47.614Z
Learning: When coding Kubernetes/NVIDIA MIG handling in Go, distinguish MIG profiles by their resource names: use 'nvidia.com/gpu' for the single MIG strategy, and use the MIG-profile-specific resources such as 'nvidia.com/mig-1g.5gb' for the mixed strategy. Implement checks or constants that reference these exact resource names, and avoid assuming a single generic resource name. Consider centralizing these names (e.g., constants or a small map) and unit-test logic that validates resource-name selection for each MIG profile to prevent misclassification.

Applied to files:

  • pkg/configmap/configmap.go
  • internal/nvidiagpuconfig/config.go
  • internal/gpu-burn/gpu-burn.go
  • internal/wait/wait.go
  • tests/mig/mig-test.go
  • tests/mig/mig_suite_test.go
  • internal/tsparams/mig-vars.go
  • tests/nvidiagpu/deploy-gpu-test.go
  • pkg/mig/mig.go
📚 Learning: 2025-12-15T13:26:55.741Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.

Applied to files:

  • internal/nvidiagpuconfig/config.go
  • internal/gpu-burn/gpu-burn.go
  • internal/wait/wait.go
  • tests/mig/mig-test.go
  • tests/mig/mig_suite_test.go
  • internal/tsparams/mig-vars.go
  • README.md
  • pkg/mig/mig.go
📚 Learning: 2025-12-21T00:43:37.494Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:262-272
Timestamp: 2025-12-21T00:43:37.494Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, the AfterAll block intentionally uses only cleanupAfterTest (without checking mig.ShouldKeepOperator) when calling cleanupGPUOperatorResources(). This is by design: the cleanupAfterTest environment variable (NVIDIAGPU_CLEANUP) supersedes label-based cleanup logic at the very end of test execution, while defer blocks during test execution check both cleanupAfterTest && !mig.ShouldKeepOperator(labelsToCheck).

Applied to files:

  • tests/mig/mig-test.go
  • tests/mig/mig_suite_test.go
  • internal/tsparams/mig-vars.go
  • README.md
  • tests/nvidiagpu/deploy-gpu-test.go
  • pkg/mig/mig.go
📚 Learning: 2025-06-18T14:59:40.987Z
Learnt from: wabouhamad
Repo: rh-ecosystem-edge/nvidia-ci PR: 205
File: tests/nvidianetwork/deploy-nno-test.go:1411-1414
Timestamp: 2025-06-18T14:59:40.987Z
Learning: In SRIOV RDMA testcases in tests/nvidianetwork/deploy-nno-test.go, the string "sriov" is intentionally passed as the device parameter to CreateRdmaWorkloadPod instead of the actual mlx5 device ID. The workload container is designed to dynamically discover the correct mlx5 device for SRIOV cases at runtime.

Applied to files:

  • tests/mig/mig-test.go
  • README.md
📚 Learning: 2025-12-12T10:59:56.555Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:56.555Z
Learning: In Kubernetes with NVIDIA GPUs, MIG single strategy profile uses the resource name "nvidia.com/gpu", while MIG mixed strategy profile uses specific resource names like "nvidia.com/mig-1g.5gb" based on the MIG profile.

Applied to files:

  • README.md
📚 Learning: 2025-07-20T08:18:03.794Z
Learnt from: TomerNewman
Repo: rh-ecosystem-edge/nvidia-ci PR: 226
File: scripts/gpu-operator-tests-must-gather.sh:1-2
Timestamp: 2025-07-20T08:18:03.794Z
Learning: In the nvidia-ci repository, must-gather scripts should use best-effort execution rather than fail-fast behavior. Unlike workflow scripts that should fail immediately on errors, diagnostic collection scripts should continue running even if one component fails, to maximize the amount of diagnostic data collected. For example, if NFD must-gather fails, the GPU must-gather should still run.

Applied to files:

  • README.md
📚 Learning: 2025-11-09T13:12:43.631Z
Learnt from: empovit
Repo: rh-ecosystem-edge/nvidia-ci PR: 359
File: workflows/gpu_operator_versions/update_versions.py:100-106
Timestamp: 2025-11-09T13:12:43.631Z
Learning: In the nvidia-ci repository's `workflows/gpu_operator_versions/update_versions.py`, the `gpu_releases` variable is intentionally limited to the latest 2 GPU operator versions (via `get_latest_versions(list(gpu_versions.keys()), 2)` in main()). For maintenance OCP versions with pinned GPU operators, if the pinned version is not among the latest 2 releases, the test is intentionally skipped with a warning rather than falling back to test with any available version. This is because testing with outdated GPU operators would fail in their infrastructure, and they prefer to notice missing tests rather than deal with failed test runs, especially since GitHub organization repositories don't send notification emails about failed workflows.

Applied to files:

  • README.md
📚 Learning: 2025-12-15T13:26:50.784Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:50.784Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, during cleanup of MIG labels nvidia.com/mig.strategy and nvidia.com/mig.config on GPU nodes after tests, use RemoveLabel to remove these labels instead of assigning invalid values. Only valid mig.strategy values are 'single' or 'mixed'.

Applied to files:

  • tests/nvidiagpu/deploy-gpu-test.go
📚 Learning: 2025-03-09T17:13:24.077Z
Learnt from: empovit
Repo: rh-ecosystem-edge/nvidia-ci PR: 81
File: workflows/nvidia_gpu_operator.py:10-33
Timestamp: 2025-03-09T17:13:24.077Z
Learning: In the nvidia-ci repository, the API interaction functions should use fail-fast behavior rather than graceful error handling. The workflow should fail explicitly if APIs change or don't return expected data (like authentication tokens or tags), as this makes problems immediately visible.

Applied to files:

  • pkg/mig/mig.go
🧬 Code graph analysis (7)
pkg/configmap/configmap.go (2)
pkg/pod/pod.go (1)
  • Builder (33-42)
pkg/namespace/namespace.go (1)
  • Builder (23-32)
internal/wait/wait.go (3)
pkg/nodes/node.go (1)
  • Pull (119-138)
pkg/nvidiagpu/clusterpolicy.go (1)
  • Pull (127-152)
pkg/nodes/list.go (1)
  • List (20-58)
tests/mig/mig-test.go (9)
pkg/operatorconfig/config.go (2)
  • NewCustomConfig (18-27)
  • CustomConfig (7-14)
pkg/nvidiagpu/gpuBurnConfig.go (1)
  • NewDefaultGPUBurnConfig (15-22)
pkg/nvidiagpu/consts.go (1)
  • NvidiaGPULabel (8-8)
pkg/global/const.go (2)
  • UndefinedValue (5-5)
  • OpenShiftVersionFile (7-7)
internal/gpuparams/const.go (4)
  • Label (5-5)
  • Gpu10LogLevel (9-9)
  • Gpu100LogLevel (11-11)
  • GpuLogLevel (8-8)
internal/tsparams/consts.go (1)
  • LabelSuite (5-5)
internal/inittools/inittools.go (1)
  • GetOpenShiftVersion (47-69)
tests/nvidiagpu/deploy-gpu-test.go (1)
  • BurnImageName (62-65)
pkg/nfd/utils.go (1)
  • EnsureNFDIsInstalled (27-105)
tests/mig/mig_suite_test.go (3)
internal/reporter/reporter.go (1)
  • ReportIfFailed (54-90)
internal/tsparams/mig-vars.go (2)
  • MigReporterNamespacesToDump (14-18)
  • MigReporterCRDsToDump (21-23)
pkg/clients/clients.go (1)
  • SetScheme (130-168)
internal/tsparams/mig-vars.go (1)
internal/tsparams/consts.go (1)
  • LabelSuite (5-5)
tests/nvidiagpu/deploy-gpu-test.go (3)
pkg/mig/mig.go (1)
  • ShouldKeepOperator (282-306)
internal/gpuparams/const.go (1)
  • Label (5-5)
internal/inittools/inittools.go (1)
  • GeneralConfig (22-22)
pkg/mig/mig.go (4)
internal/nvidiagpuconfig/config.go (1)
  • NvidiaGPUConfig (9-21)
internal/wait/wait.go (2)
  • ClusterPolicyNotReady (52-82)
  • ClusterPolicyReady (20-49)
internal/gpu-burn/gpu-burn.go (1)
  • CreateGPUBurnConfigMap (38-59)
internal/get/get.go (1)
  • GetClusterArchitecture (74-101)
🪛 LanguageTool
README.md

[uncategorized] ~68-~68: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...t - optional - TEST_TRACE: includes full stack trace from ginkgo tests when a failure ...

(EN_COMPOUND_ADJECTIVE_INTERNAL)

🪛 markdownlint-cli2 (0.18.1)
README.md

162, 168, 200, 208: Heading style
Expected: setext; Actual: atx

(MD003, heading-style)


171, 187, 204, 211: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


172-181, 188-197, 205: Dollar signs used before commands without showing output

(MD014, commands-show-output)

🔇 Additional comments (10)
internal/nvidiagpuconfig/config.go (1)

19-20: LGTM!

The new MIG configuration fields follow the existing pattern and are properly tagged for environment variable configuration.

internal/gpu-burn/gpu-burn.go (2)

70-73: LGTM!

The comment and resource name construction correctly distinguish between single strategy MIGs (using "gpu" → "nvidia.com/gpu") and mixed strategy MIGs (using profile names like "mig-1g.5gb" → "nvidia.com/mig-1g.5gb").

Based on learnings, this correctly implements the MIG resource naming convention.


75-147: LGTM!

The Pod specification is well-configured with:

  • Proper security context (non-root, seccomp, dropped capabilities)
  • Correct MIG resource requests/limits using the constructed resource name
  • Appropriate tolerations for GPU taints
  • Valid ConfigMap volume mount for the entrypoint script
internal/wait/wait.go (2)

51-82: LGTM!

The ClusterPolicyNotReady function correctly mirrors ClusterPolicyReady and properly checks for the "notReady" state with appropriate error handling and logging.


143-171: LGTM!

The NodeLabelExists function properly polls for the specified label on nodes matching the selector, with correct error handling and appropriate logging at different verbosity levels.

pkg/configmap/configmap.go (1)

146-177: LGTM!

The WaitUntilDeleted method correctly implements polling until the ConfigMap is deleted, with proper error handling for NotFound (success case) and other errors (propagated to caller).

tests/mig/mig_suite_test.go (1)

1-31: LGTM!

The MIG test suite follows the standard Ginkgo/Gomega pattern with proper:

  • Reporter configuration using inittools.GeneralConfig
  • Suite labeling with tsparams.Labels
  • Failed test reporting in JustAfterEach hook
tests/mig/mig-test.go (1)

46-89: LGTM!

The test suite structure is well-organized with proper lifecycle hooks. The nil check for nvidiaGPUConfig and initialization in BeforeAll address previous review concerns. The cleanup logic correctly uses both cleanupAfterTest and CleanupGPUOperatorResources for proper resource management.

internal/tsparams/mig-vars.go (1)

1-24: LGTM!

Clean variable declarations for MIG test configuration and reporting. The structure is consistent with existing patterns in the tsparams package.

pkg/mig/mig.go (1)

57-186: LGTM!

The TestSingleMIGGPUWorkload function is well-structured with clear separation of concerns. It properly:

  • Validates MIG capability
  • Cleans up previous resources
  • Selects and configures MIG profiles
  • Waits for cluster policy transitions
  • Deploys and validates GPU workload
  • Includes appropriate cleanup defers

The implementation addresses previous review concerns regarding error handling in defer blocks and format string arguments.
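The cleanup-defer pattern praised above relies on Go running deferred calls in LIFO order, so resources are torn down in reverse order of creation. A minimal sketch with generic step names (not the PR's actual code):

```go
package main

import "fmt"

// runSteps simulates the provision-then-cleanup flow: each setup step defers
// its own teardown, and because defers run LIFO, teardown happens in reverse
// order of creation (pod before configmap before namespace).
func runSteps(steps []string) []string {
	var order []string
	func() {
		for _, s := range steps {
			order = append(order, "create "+s)
			defer func(name string) { order = append(order, "delete "+name) }(s)
		}
	}()
	return order
}

func main() {
	for _, e := range runSteps([]string{"namespace", "configmap", "pod"}) {
		fmt.Println(e)
	}
}
```

This ordering is why deferring cleanup right after each successful creation step is safer than a single cleanup block at the end: a failure partway through still tears down exactly what was created, in the right order.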

- Added MIG testcase implementation in tests/mig/
- Created mig package (pkg/mig/mig.go) with MIG-related functions
- Added MIG variables to internal/tsparams/mig-vars.go
- Updated README.md with MIG testcase documentation
- Enhanced configmap package with WaitUntilDeleted method
- Added wait.ClusterPolicyNotReady and wait.NodeLabelExists functions
- Moved timeout and interval constants to pkg/nvidiagpu/consts.go
- Improved error handling and code cleanup throughout
- Updated test labels to use 'single-mig' for MIG testcase execution

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
NO_COLOR parameter behavior change (previously, NO_COLOR=false behaved the same as NO_COLOR=true)

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
@empovit
Member

empovit commented Jan 7, 2026

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jan 7, 2026
@empovit
Member

empovit commented Jan 7, 2026

/unhold

@empovit
Member

empovit commented Jan 7, 2026

/test

@openshift-ci

openshift-ci bot commented Jan 7, 2026

@empovit: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

/test 4.12-stable-images
/test 4.12-stable-nvidia-gpu-operator-e2e-25-3-x
/test 4.12-stable-nvidia-gpu-operator-e2e-master
/test 4.13-stable-images
/test 4.13-stable-nvidia-gpu-operator-e2e-25-3-x
/test 4.13-stable-nvidia-gpu-operator-e2e-master
/test 4.14-stable-images
/test 4.14-stable-nvidia-gpu-operator-e2e-25-10-x
/test 4.14-stable-nvidia-gpu-operator-e2e-25-3-x
/test 4.14-stable-nvidia-gpu-operator-e2e-master
/test 4.15-stable-images
/test 4.15-stable-nvidia-gpu-operator-e2e-25-10-x
/test 4.15-stable-nvidia-gpu-operator-e2e-25-3-x
/test 4.15-stable-nvidia-gpu-operator-e2e-master
/test 4.16-stable-images
/test 4.16-stable-nvidia-gpu-operator-e2e-25-10-x
/test 4.16-stable-nvidia-gpu-operator-e2e-25-3-x
/test 4.16-stable-nvidia-gpu-operator-e2e-master
/test 4.17-stable-images
/test 4.17-stable-nvidia-gpu-operator-e2e-25-10-x
/test 4.17-stable-nvidia-gpu-operator-e2e-25-3-x
/test 4.17-stable-nvidia-gpu-operator-e2e-master
/test 4.18-rt-images
/test 4.18-rt-nvidia-gpu-operator-e2e-rt-25-10-x
/test 4.18-stable-images
/test 4.18-stable-nvidia-gpu-operator-e2e-25-10-x
/test 4.18-stable-nvidia-gpu-operator-e2e-25-3-x
/test 4.18-stable-nvidia-gpu-operator-e2e-master
/test 4.19-stable-images
/test 4.19-stable-nvidia-gpu-operator-e2e-25-10-x
/test 4.19-stable-nvidia-gpu-operator-e2e-25-3-x
/test 4.19-stable-nvidia-gpu-operator-e2e-master
/test 4.20-stable-images
/test 4.20-stable-nvidia-gpu-operator-e2e-25-10-x
/test 4.20-stable-nvidia-gpu-operator-e2e-25-3-x
/test 4.20-stable-nvidia-gpu-operator-e2e-master
/test 4.21-stable-images
/test 4.21-stable-nvidia-gpu-operator-e2e-25-10-x
/test 4.21-stable-nvidia-gpu-operator-e2e-25-3-x
/test 4.21-stable-nvidia-gpu-operator-e2e-master
/test doca4-images
/test doca4-nvidia-network-operator-legacy-sriov-rdma
/test doca4-nvidia-network-operator-shared-device-eth-rdma
/test images
/test lint

Use /test all to run the following jobs that were automatically triggered:

pull-ci-rh-ecosystem-edge-nvidia-ci-main-images
pull-ci-rh-ecosystem-edge-nvidia-ci-main-lint

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@empovit
Member

empovit commented Jan 7, 2026

/test 4.20-stable-nvidia-gpu-operator-e2e-25-10-x

@openshift-ci

openshift-ci bot commented Jan 7, 2026

@vtruhpon: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/4.20-stable-nvidia-gpu-operator-e2e-25-10-x
Commit: c568ada
Required: true
Rerun command: /test 4.20-stable-nvidia-gpu-operator-e2e-25-10-x

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

…failing if it is missing or false

Implementation is done for both gpu-burn-mig and single-mig.

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
@openshift-ci openshift-ci bot removed the lgtm label Jan 7, 2026
@openshift-ci

openshift-ci bot commented Jan 7, 2026

New changes are detected. LGTM label has been removed.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @pkg/mig/mig.go:
- Around line 422-445: ResetMIGLabelsToDisabled currently sets
"nvidia.com/mig.config" to "all-disabled" but leaves stale
"nvidia.com/mig.strategy" labels; update the loop in ResetMIGLabelsToDisabled to
also remove the strategy label by calling the node builder's RemoveLabel (e.g.,
nodeBuilder = nodeBuilder.RemoveLabel("nvidia.com/mig.strategy")) before calling
Update(), keep the existing Expect(err)... checks and logging around the
Update(), and ensure the removal happens for each node so the operator can
reapply the correct strategy later.

In @tests/nvidiagpu/deploy-gpu-test.go:
- Around line 1386-1412: The cleanup defer block currently sets the
"nvidia.com/mig.config" label to "all-disabled" but leaves any existing
"nvidia.com/mig.strategy" label in place; update the defer cleanup (the
anonymous defer containing nodeBuilders loop and calls to nodeBuilder.WithLabel
and nodeBuilder.Update) to first call
nodeBuilder.RemoveLabel("nvidia.com/mig.strategy", "") (or the appropriate
RemoveLabel signature) for each nodeBuilder, then proceed to set
"nvidia.com/mig.config" via nodeBuilder.WithLabel and call nodeBuilder.Update;
ensure you check and Expect(err) after RemoveLabel and before setting mig.config
so failures are surfaced.
📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c568ada and c387882.

📒 Files selected for processing (3)
  • internal/wait/wait.go
  • pkg/mig/mig.go
  • tests/nvidiagpu/deploy-gpu-test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/wait/wait.go
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:56.555Z
Learning: In Kubernetes with NVIDIA GPUs, MIG single strategy profile uses the resource name "nvidia.com/gpu", while MIG mixed strategy profile uses specific resource names like "nvidia.com/mig-1g.5gb" based on the MIG profile.
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
📚 Learning: 2025-12-15T13:26:50.784Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:50.784Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, during cleanup of MIG labels nvidia.com/mig.strategy and nvidia.com/mig.config on GPU nodes after tests, use RemoveLabel to remove these labels instead of assigning invalid values. Only valid mig.strategy values are 'single' or 'mixed'.

Applied to files:

  • tests/nvidiagpu/deploy-gpu-test.go
📚 Learning: 2025-12-21T00:43:37.494Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:262-272
Timestamp: 2025-12-21T00:43:37.494Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, the AfterAll block intentionally uses only cleanupAfterTest (without checking mig.ShouldKeepOperator) when calling cleanupGPUOperatorResources(). This is by design: the cleanupAfterTest environment variable (NVIDIAGPU_CLEANUP) supersedes label-based cleanup logic at the very end of test execution, while defer blocks during test execution check both cleanupAfterTest && !mig.ShouldKeepOperator(labelsToCheck).

Applied to files:

  • tests/nvidiagpu/deploy-gpu-test.go
  • pkg/mig/mig.go
📚 Learning: 2025-12-12T10:59:47.614Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:47.614Z
Learning: When coding Kubernetes/NVIDIA MIG handling in Go, distinguish MIG profiles by their resource names: use 'nvidia.com/gpu' for the single MIG strategy, and use the MIG-profile-specific resources such as 'nvidia.com/mig-1g.5gb' for the mixed strategy. Implement checks or constants that reference these exact resource names, and avoid assuming a single generic resource name. Consider centralizing these names (e.g., constants or a small map) and unit-test logic that validates resource-name selection for each MIG profile to prevent misclassification.

Applied to files:

  • tests/nvidiagpu/deploy-gpu-test.go
  • pkg/mig/mig.go
📚 Learning: 2025-12-15T13:26:55.741Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.

Applied to files:

  • pkg/mig/mig.go
📚 Learning: 2025-03-09T17:13:24.077Z
Learnt from: empovit
Repo: rh-ecosystem-edge/nvidia-ci PR: 81
File: workflows/nvidia_gpu_operator.py:10-33
Timestamp: 2025-03-09T17:13:24.077Z
Learning: In the nvidia-ci repository, the API interaction functions should use fail-fast behavior rather than graceful error handling. The workflow should fail explicitly if APIs change or don't return expected data (like authentication tokens or tags), as this makes problems immediately visible.

Applied to files:

  • pkg/mig/mig.go

Comment on lines +422 to +445
// ResetMIGLabelsToDisabled sets MIG strategy and configuration labels to "all-disabled" on GPU worker nodes.
func ResetMIGLabelsToDisabled(WorkerNodeSelector map[string]string) {
	glog.V(gpuparams.Gpu10LogLevel).Infof("%s", colorLog(colorCyan+colorBold, "Reset MIG labels to disabled"))
	nodeBuilders, err := nodes.List(inittools.APIClient, metav1.ListOptions{LabelSelector: labels.Set(WorkerNodeSelector).String()})
	Expect(err).ToNot(HaveOccurred(), "Error listing worker nodes: %v", err)
	for _, nodeBuilder := range nodeBuilders {
		glog.V(gpuparams.Gpu10LogLevel).Infof("Setting MIG configuration label to 'all-disabled' on node '%s' (overwrite=true)", nodeBuilder.Definition.Name)
		nodeBuilder = nodeBuilder.WithLabel("nvidia.com/mig.config", "all-disabled")
		_, err = nodeBuilder.Update()
		Expect(err).ToNot(HaveOccurred(), "Error updating node '%s' with MIG label: %v", nodeBuilder.Definition.Name, err)
		glog.V(gpuparams.Gpu10LogLevel).Infof("Successfully set MIG configuration label on node '%s'", nodeBuilder.Definition.Name)
		// Nitpick comment: Deleting strategy label does not help, it reappears after a while on its own
	}

	// Wait for ClusterPolicy to be notReady
	_ = wait.ClusterPolicyNotReady(inittools.APIClient, nvidiagpu.ClusterPolicyName,
		nvidiagpu.ClusterPolicyNotReadyCheckInterval, nvidiagpu.ClusterPolicyNotReadyTimeout)

	glog.V(gpuparams.GpuLogLevel).Infof("Waiting for ClusterPolicy to be ready after setting MIG node labels")
	err = wait.ClusterPolicyReady(inittools.APIClient, nvidiagpu.ClusterPolicyName,
		nvidiagpu.ClusterPolicyReadyCheckInterval, nvidiagpu.ClusterPolicyReadyTimeout)
	Expect(err).ToNot(HaveOccurred(), "Error waiting for ClusterPolicy to be ready after node label changes: %v", err)
	glog.V(gpuparams.GpuLogLevel).Infof("ClusterPolicy is ready after node label changes")
}
Contributor

🛠️ Refactor suggestion | 🟠 Major

Remove mig.strategy label during cleanup.

The ResetMIGLabelsToDisabled function sets nvidia.com/mig.config to "all-disabled" but doesn't remove the nvidia.com/mig.strategy label. Based on learnings, during cleanup you should remove the mig.strategy label entirely since the only valid values are "single" or "mixed", and leaving it with a stale value could cause issues in subsequent tests.

The comment at line 433 mentions that "Deleting strategy label does not help, it reappears after a while on its own" — however, the proper cleanup approach is still to remove it and let the operator re-apply it as needed during the next configuration.

Based on learnings, use RemoveLabel for mig.strategy during cleanup.

🤖 Prompt for AI Agents
In @pkg/mig/mig.go around lines 422 - 445, ResetMIGLabelsToDisabled currently
sets "nvidia.com/mig.config" to "all-disabled" but leaves stale
"nvidia.com/mig.strategy" labels; update the loop in ResetMIGLabelsToDisabled to
also remove the strategy label by calling the node builder's RemoveLabel (e.g.,
nodeBuilder = nodeBuilder.RemoveLabel("nvidia.com/mig.strategy")) before calling
Update(), keep the existing Expect(err)... checks and logging around the
Update(), and ensure the removal happens for each node so the operator can
reapply the correct strategy later.

Comment on lines +1386 to +1412
	defer func() {
		defer GinkgoRecover()
		glog.V(gpuparams.Gpu100LogLevel).Infof("Setting MIG strategy label on GPU worker nodes")
		nodeBuilders, err := nodes.List(inittools.APIClient, metav1.ListOptions{LabelSelector: labels.Set(WorkerNodeSelector).String()})
		Expect(err).ToNot(HaveOccurred(), "Error listing worker nodes: %v", err)
		for _, nodeBuilder := range nodeBuilders {
			glog.V(gpuparams.Gpu100LogLevel).Infof("NOT setting or removing MIG strategy label on node '%s')", nodeBuilder.Definition.Name)

			glog.V(gpuparams.Gpu100LogLevel).Infof("Setting MIG configuration label on node '%s' (overwrite=true)", nodeBuilder.Definition.Name)
			nodeBuilder = nodeBuilder.WithLabel("nvidia.com/mig.config", "all-disabled")
			_, err = nodeBuilder.Update()
			Expect(err).ToNot(HaveOccurred(), "Error updating node '%s' with MIG label: %v", nodeBuilder.Definition.Name, err)
			glog.V(gpuparams.Gpu10LogLevel).Infof("Successfully set MIG configuration label on node '%s'", nodeBuilder.Definition.Name)
		}

		By("Wait for GPU operator to reconcile after setting MIG node labels")
		glog.V(gpuparams.GpuLogLevel).Infof("Sleeping for 30 seconds to allow GPU operator to process node label changes")
		time.Sleep(30 * time.Second)
		// Sleeping to ensure ClusterPolicy gets notReady before entering the next step

		By(fmt.Sprintf("Wait up to %s for ClusterPolicy to be ready after node label changes", nvidiagpu.ClusterPolicyReadyTimeout))
		glog.V(gpuparams.GpuLogLevel).Infof("Waiting for ClusterPolicy to be ready after setting MIG node labels")
		err = wait.ClusterPolicyReady(inittools.APIClient, nvidiagpu.ClusterPolicyName,
			nvidiagpu.ClusterPolicyReadyCheckInterval, nvidiagpu.ClusterPolicyReadyTimeout)
		Expect(err).ToNot(HaveOccurred(), "Error waiting for ClusterPolicy to be ready: %v", err)
		glog.V(gpuparams.GpuLogLevel).Infof("ClusterPolicy is ready after node label changes")
	}()
Contributor

@coderabbitai coderabbitai bot Jan 7, 2026

🛠️ Refactor suggestion | 🟠 Major

Remove mig.strategy label during cleanup instead of leaving it.

The cleanup defer sets nvidia.com/mig.config to "all-disabled" but doesn't remove the nvidia.com/mig.strategy label. Based on learnings, during cleanup you should remove the mig.strategy label entirely rather than leaving it with a potentially stale value, as the only valid values for mig.strategy are "single" or "mixed".

Consider using nodeBuilder.RemoveLabel("nvidia.com/mig.strategy", "") before setting the mig.config label to ensure clean state for subsequent tests.

Based on learnings, use RemoveLabel for mig.strategy during cleanup.

🤖 Prompt for AI Agents
In @tests/nvidiagpu/deploy-gpu-test.go around lines 1386 - 1412, The cleanup
defer block currently sets the "nvidia.com/mig.config" label to "all-disabled"
but leaves any existing "nvidia.com/mig.strategy" label in place; update the
defer cleanup (the anonymous defer containing nodeBuilders loop and calls to
nodeBuilder.WithLabel and nodeBuilder.Update) to first call
nodeBuilder.RemoveLabel("nvidia.com/mig.strategy", "") (or the appropriate
RemoveLabel signature) for each nodeBuilder, then proceed to set
"nvidia.com/mig.config" via nodeBuilder.WithLabel and call nodeBuilder.Update;
ensure you check and Expect(err) after RemoveLabel and before setting mig.config
so failures are surfaced.

Collaborator Author

I had an earlier implementation that removed the label, but since it reappeared on its own, I dropped the removal.

Contributor

@vtruhpon, thank you for that clarification! That makes sense — if the GPU Operator or another controller automatically re-applies the nvidia.com/mig.strategy label, then attempting to remove it during cleanup would indeed be ineffective. Your current implementation is the right approach given that behavior.

I appreciate you sharing this context about the label's automatic management.


✏️ Learnings added
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1386-1412
Timestamp: 2026-01-07T14:49:46.309Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, during MIG cleanup after tests, the nvidia.com/mig.strategy label should not be removed because it gets automatically re-applied by the GPU Operator. Setting nvidia.com/mig.config to "all-disabled" is sufficient for cleanup.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.

Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:262-272
Timestamp: 2025-12-21T00:43:37.494Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, the AfterAll block intentionally uses only cleanupAfterTest (without checking mig.ShouldKeepOperator) when calling cleanupGPUOperatorResources(). This is by design: the cleanupAfterTest environment variable (NVIDIAGPU_CLEANUP) supersedes label-based cleanup logic at the very end of test execution, while defer blocks during test execution check both cleanupAfterTest && !mig.ShouldKeepOperator(labelsToCheck).

Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:56.555Z
Learning: In Kubernetes with NVIDIA GPUs, MIG single strategy profile uses the resource name "nvidia.com/gpu", while MIG mixed strategy profile uses specific resource names like "nvidia.com/mig-1g.5gb" based on the MIG profile.

@vtruhpon
Collaborator Author

vtruhpon commented Jan 7, 2026

/test 4.20-stable-nvidia-gpu-operator-e2e-25-10-x
