Single mig first commit #391
Walkthrough

Adds MIG (Multi-Instance GPU) support: new MIG-aware GPU-burn pod creation, MIG profile discovery/parsing, node labeling and ClusterPolicy orchestration, polling/wait helpers, test suites and test integrations, config/constant additions, docs, and cleanup/logging utilities.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Suite as MIG Test Suite
    participant API as Kubernetes API
    participant Nodes as Worker Nodes
    participant Policy as GPU Operator ClusterPolicy
    participant Pod as GPU-burn Pod
    participant Logs as Log Collector
    Suite->>API: list nodes (nodeSelector)
    API-->>Suite: node list
    Suite->>Suite: parse/select MIG profile(s)
    Suite->>API: label nodes for chosen MIG strategy
    API-->>Nodes: labels applied
    Suite->>Policy: fetch ClusterPolicy (capture resourceVersion)
    Suite->>Policy: update ClusterPolicy with MIG config
    Policy-->>Suite: updated resourceVersion/state
    Suite->>Policy: poll ClusterPolicy readiness / notReady
    Suite->>API: poll until node label exists
    Suite->>API: create namespace & configmap for burn
    Suite->>API: create GPU-burn Pod requesting nvidia.com/<migProfile>
    Pod->>Pod: run burn across MIG instances
    Pod-->>Logs: emit per-slice logs
    Suite->>Pod: wait for completion
    Suite->>API: fetch pod logs
    Suite->>Suite: parse & validate logs per MIG instance
    Suite->>API: cleanup pod/configmap/namespace and reset labels/ClusterPolicy
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 11
🧹 Nitpick comments (17)
internal/gpuparams/gpuvars.go (1)
7-28: Remove commented-out code before merging.

These commented-out function definitions add noise without value. If they're needed in the future, they can be added at that time. Version control preserves history, so there's no need to keep dead code in comments.
internal/gpu-burn/gpu-burn.go (2)
61-63: Consider adding input validation for `migProfile` and `migCount`.

An empty `migProfile` would create an invalid resource name `nvidia.com/`, and a non-positive `migCount` would result in invalid resource requests. Defensive validation would prevent subtle runtime failures.

```diff
 func CreateGPUBurnPodWithMIG(apiClient *clients.Settings, podName, podNamespace string,
 	gpuBurnImage string, migProfile string, migCount int, timeout time.Duration) (*corev1.Pod, error) {
+	if migProfile == "" {
+		return nil, fmt.Errorf("migProfile cannot be empty")
+	}
+	if migCount <= 0 {
+		return nil, fmt.Errorf("migCount must be positive, got %d", migCount)
+	}
+
 	var volumeDefaultMode int32 = 0777
```
62-63: The `timeout` parameter is unused.

The `timeout` parameter is declared but never referenced in the function body. Either use it (e.g., for pod readiness waiting) or remove it to avoid confusion. Note that `CreateGPUBurnPod` has the same issue.

tests/mig/mig_suite_test.go (1)
23-23: Consider using `tsparams.MigLabels` for consistency.

The `mig-vars.go` file defines `MigLabels` specifically for MIG tests (which extends `gpuparams.Labels` with `LabelSuite`). Using `tsparams.Labels` here may not include MIG-specific labeling. Verify if this is intentional.

```diff
-	RunSpecs(t, "MIG", Label(tsparams.Labels...), reporterConfig)
+	RunSpecs(t, "MIG", Label(tsparams.MigLabels...), reporterConfig)
```

tests/mig/mig-test.go (3)
3-42: Clean up commented-out imports.

Multiple commented-out imports indicate incomplete refactoring. Consider removing them to improve code clarity, or add a TODO comment explaining why they're preserved.
64-68: Unused variables declared.

`ScaleCluster`, `UseSingleMIGProfile`, `UseMixedMIGProfile`, `SingleMigProfile`, and `MixedMigProfile` are declared but never used in this file. Remove them or add a TODO if they're planned for future use.

```diff
-	ScaleCluster = false
-	UseSingleMIGProfile = false
-	UseMixedMIGProfile = false
-	SingleMigProfile = UndefinedValue
-	MixedMigProfile = UndefinedValue
```
116-119: Use `Skip()` for unimplemented tests.

The placeholder test should use Ginkgo's `Skip()` function to clearly indicate it's not implemented, rather than just logging. This provides better test reporting.

```diff
 It("Test GPU Burn with mixed strategy MIG Configuration", Label("gpu-mixed-mig"), func() {
-	glog.V(gpuparams.Gpu10LogLevel).Infof("gpu-mixed-mig testcase not yet implemented")
-	// mig.TestMixedMIGGPUBurn(nvidiaGPUConfig, burn, BurnImageName, WorkerNodeSelector, cleanupAfterTest)
+	Skip("gpu-mixed-mig testcase not yet implemented")
 })
```

tests/nvidiagpu/deploy-gpu-test.go (3)
1184-1205: Remove commented-out code.

The `shouldKeepOperator` function is entirely commented out. If it's been replaced by `mig.ShouldKeepOperator`, remove this dead code to improve maintainability.
1267-1271: Redundant return after `Skip()`.

`Skip()` in Ginkgo panics to abort the test, so the `return` statement on line 1270 is unreachable. Remove it for clarity.

```diff
 if len(migCapabilities) == 0 {
 	glog.V(gpuparams.GpuLogLevel).Infof("No MIG configurations available")
 	Skip("No MIG configurations found")
-	return
 }
```
1111-1182: Consider extracting `cleanupGPUOperatorResources` to a shared package.

This function largely duplicates `mig.CleanupGPUOperatorResources` (shown in relevant snippets at `pkg/mig/mig.go` lines 186-254). Consider reusing the existing function or consolidating to avoid drift.

internal/get/get.go (2)
130-144: Hardcoded namespace may limit flexibility.

The namespace `"nvidia-gpu-operator"` is hardcoded. If the GPU operator is installed in a different namespace (e.g., a custom installation), this function will fail to discover MIG capabilities. Consider making this configurable or deriving it from the cluster configuration.
315-333: Fragile state management between line1 and line2 parsing.

The `exclude` flag is used to track whether to skip line2 after an excluded line1. However, if the input contains unexpected patterns (e.g., malformed output, extra blank lines between pairs), the line2 regex could match an unrelated line and incorrectly update the last profile.
excludeflag is used to track whether to skip line2 after an excluded line1. However, if the input contains unexpected patterns (e.g., malformed output, extra blank lines between pairs), the line2 regex could match an unrelated line and incorrectly update the last profile.Consider a more robust approach like tracking expected line2 state explicitly or using a multiline regex to match both lines together.
pkg/mig/mig.go (5)
185-255: Consider softer error handling in the cleanup function.

Using `Expect` assertions throughout cleanup (lines 204, 216, 226, etc.) will abort cleanup on the first error, potentially leaving resources behind. Consider accumulating errors and reporting them at the end, or logging warnings instead.
273-286: Remove commented-out code and consider exact label matching.
- The commented-out code block (lines 273-279) should be removed before merging.
- Using `strings.Contains(filterQuery, label)` can cause false positives: e.g., label "mig" would match filter "nomig". Consider using word boundaries or the commented approach with `MatchesLabelFilter`.

```diff
-	// for _, label := range labelsToCheck {
-	//	glog.V(gpuparams.GpuLogLevel).Infof("Checking if label %s is present in CurrentSpecReport", label)
-	//	if matches, _ := specReport.MatchesLabelFilter(label); matches {
-	//		glog.V(gpuparams.Gpu100LogLevel).Infof("Label %s is present in CurrentSpecReport", label)
-	//		return true
-	//	}
-	// }
 	for _, label := range labelsToCheck {
 		glog.V(gpuparams.GpuLogLevel).Infof("Checking if label %s is present in Ginkgo label filter", label)
-		if strings.Contains(filterQuery, label) {
+		// Use word boundary check to avoid partial matches
+		labelPattern := regexp.MustCompile(`\b` + regexp.QuoteMeta(label) + `\b`)
+		if labelPattern.MatchString(filterQuery) {
 			glog.V(gpuparams.Gpu100LogLevel).Infof("Label %s is present in Ginkgo label filter", label)
 			return true
 		}
 	}
```
402-419: Inefficient: two `Update()` calls per node.

Each node is updated twice, once for the strategy label and once for the config label. Consider setting both labels before calling `Update()` to reduce API calls.

```diff
 for _, nodeBuilder := range nodeBuilders {
 	glog.V(gpuparams.GpuLogLevel).Infof("Setting MIG %s strategy label on node '%s' (overwrite=true)", strategy, nodeBuilder.Definition.Name)
 	nodeBuilder = nodeBuilder.WithLabel("nvidia.com/mig.strategy", strategy)
-	_, err = nodeBuilder.Update()
-	if err != nil {
-		glog.V(gpuparams.GpuLogLevel).Infof("Failed to update node '%s' with MIG label: %v, error: %v", nodeBuilder.Definition.Name, strategy, err)
-	} else {
-		glog.V(gpuparams.GpuLogLevel).Infof("Successfully set MIG %s strategy label on node '%s'", strategy, nodeBuilder.Definition.Name)
-	}
 	glog.V(gpuparams.GpuLogLevel).Infof("Setting MIG configuration label %s on node '%s' (overwrite=true)", MigProfile, nodeBuilder.Definition.Name)
 	nodeBuilder = nodeBuilder.WithLabel("nvidia.com/mig.config", MigProfile)
 	_, err = nodeBuilder.Update()
 	if err != nil {
-		glog.V(gpuparams.GpuLogLevel).Infof("Failed to update node '%s' with MIG label: %v", nodeBuilder.Definition.Name, err)
+		glog.V(gpuparams.GpuLogLevel).Infof("Failed to update node '%s' with MIG labels: %v", nodeBuilder.Definition.Name, err)
 	} else {
-		glog.V(gpuparams.GpuLogLevel).Infof("Successfully set MIG configuration label on node '%s' with %s", nodeBuilder.Definition.Name, MigProfile)
+		glog.V(gpuparams.GpuLogLevel).Infof("Successfully set MIG labels on node '%s' (strategy=%s, config=%s)", nodeBuilder.Definition.Name, strategy, MigProfile)
 	}
 }
```
744-752: Use `BeTrue()` instead of `ToNot(BeFalse())` for clarity.

The double negation `ToNot(BeFalse())` is harder to read than `To(BeTrue())`. Both are functionally equivalent.

```diff
 for i := 0; i < migInstanceCount; i++ {
 	match1Mig := strings.Contains(gpuBurnMigLogs, fmt.Sprintf("GPU %d: OK", i))
 	glog.V(gpuparams.Gpu10LogLevel).Infof("Checking if GPU %d: OK is present in logs: %v", i, match1Mig)
-	Expect(match1Mig).ToNot(BeFalse(), "gpu-burn pod execution with MIG was FAILED")
+	Expect(match1Mig).To(BeTrue(), "gpu-burn pod execution with MIG was FAILED for GPU %d", i)
 }
 match2Mig := strings.Contains(gpuBurnMigLogs, "100.0% proc'd:")
-Expect(match2Mig).ToNot(BeFalse(), "gpu-burn pod execution with MIG was FAILED")
+Expect(match2Mig).To(BeTrue(), "gpu-burn pod execution with MIG did not reach 100%% processed")
```
761-775: Remove commented-out code before merging.

The commented functions `colorlogs()` and `clog()` should be removed. If needed later, they can be retrieved from version control.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- .gitignore (1 hunks)
- internal/get/get.go (3 hunks)
- internal/gpu-burn/gpu-burn.go (2 hunks)
- internal/gpuparams/const.go (1 hunks)
- internal/gpuparams/gpuvars.go (1 hunks)
- internal/nvidiagpuconfig/config.go (1 hunks)
- internal/tsparams/mig-vars.go (1 hunks)
- pkg/mig/mig.go (1 hunks)
- pkg/nodes/node.go (3 hunks)
- tests/mig/mig-test.go (1 hunks)
- tests/mig/mig_suite_test.go (1 hunks)
- tests/nvidiagpu/deploy-gpu-test.go (17 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
tests/mig/mig_suite_test.go (3)
- internal/reporter/reporter.go (1): ReportIfFailed (54-90)
- internal/tsparams/mig-vars.go (2): MigReporterNamespacesToDump (14-18), MigReporterCRDsToDump (21-23)
- pkg/clients/clients.go (1): SetScheme (130-168)

tests/mig/mig-test.go (9)
- pkg/operatorconfig/config.go (1): NewCustomConfig (18-27)
- pkg/nvidiagpu/gpuBurnConfig.go (1): NewDefaultGPUBurnConfig (15-22)
- pkg/nvidiagpu/consts.go (1): NvidiaGPULabel (8-8)
- internal/nvidiagpuconfig/config.go (2): NvidiaGPUConfig (9-21), NewNvidiaGPUConfig (25-37)
- pkg/global/const.go (1): UndefinedValue (5-5)
- internal/tsparams/consts.go (1): LabelSuite (5-5)
- internal/gpuparams/const.go (3): Gpu100LogLevel (11-11), Gpu10LogLevel (9-9), GpuLogLevel (8-8)
- pkg/mig/mig.go (2): ReportOpenShiftVersionAndEnsureNFD (152-164), CleanupGPUOperatorResources (187-255)
- internal/inittools/inittools.go (1): APIClient (20-20)

internal/tsparams/mig-vars.go (1)
- internal/tsparams/consts.go (1): LabelSuite (5-5)

pkg/nodes/node.go (1)
- internal/gpuparams/const.go (2): Gpu100LogLevel (11-11), Gpu10LogLevel (9-9)

tests/nvidiagpu/deploy-gpu-test.go (2)
- pkg/mig/mig.go (1): ShouldKeepOperator (258-289)
- pkg/global/const.go (1): UndefinedValue (5-5)

pkg/mig/mig.go (1)
- internal/get/get.go (4): MIGProfileInfo (185-201), GetFirstPodNameWithLabel (62-79), MIGCapabilities (114-182), GetClusterArchitecture (82-109)
🔇 Additional comments (16)
.gitignore (1)
5-6: LGTM! The additions are appropriate: `/archive/` excludes archived data from tracking, and `*.test` excludes compiled Go test binaries (standard practice for Go projects).

internal/gpuparams/const.go (1)
8-11: LGTM! The new log level constants provide appropriate granularity for GPU-related logging and follow the existing naming convention.
pkg/nodes/node.go (1)
235-258: Good addition for overwrite-capable label setting.

The `WithLabel` method properly complements `WithNewLabel` by providing explicit overwrite semantics. The implementation correctly handles nil label maps and follows the builder pattern consistently.
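The behavior being praised, lazy map initialization plus overwrite semantics in a chainable builder, can be sketched as below; the stripped-down `Node` type is a stand-in, not the repository's actual builder.

```go
package main

import "fmt"

// Node is a minimal stand-in for the repo's node builder; only the
// label-handling behavior described in the review is sketched here.
type Node struct {
	Labels map[string]string
}

// WithLabel sets key=value, overwriting any existing value and lazily
// initializing the map, then returns the builder for chaining.
func (n *Node) WithLabel(key, value string) *Node {
	if key == "" {
		return n // real code would record an error; omitted in this sketch
	}
	if n.Labels == nil {
		n.Labels = map[string]string{}
	}
	n.Labels[key] = value
	return n
}

func main() {
	n := (&Node{}).
		WithLabel("nvidia.com/mig.strategy", "single").
		WithLabel("nvidia.com/mig.strategy", "mixed") // overwrites the first value
	fmt.Println(n.Labels["nvidia.com/mig.strategy"])
}
```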
82-145: Good security posture and pod configuration.

The pod spec follows security best practices: `RunAsNonRoot`, `SeccompProfile: RuntimeDefault`, dropping all capabilities, and setting both resource limits and requests for a guaranteed QoS class.

internal/nvidiagpuconfig/config.go (1)
19-20: LGTM! The new MIG profile configuration fields follow the existing pattern and naming conventions in the struct. The envconfig tags are consistent with the established prefix format.
tests/mig/mig_suite_test.go (1)
26-31: LGTM! The `JustAfterEach` hook correctly uses the MIG-specific reporter configuration (`MigReporterNamespacesToDump`, `MigReporterCRDsToDump`) for failure reporting, which aligns with the definitions in `mig-vars.go`.

tests/mig/mig-test.go (1)
15-15: Verify the blank import is necessary.

The blank import `_ "github.com/rh-ecosystem-edge/nvidia-ci/pkg/clients"` suggests side-effect initialization. Confirm this is required; if not, remove it to avoid unnecessary dependencies.

internal/tsparams/mig-vars.go (1)
9-23: LGTM! The MIG-specific test parameters are well-defined. The `MigLabels` properly extends the base labels, and the reporter configuration includes the relevant namespaces and CRDs for MIG test diagnostics.

tests/nvidiagpu/deploy-gpu-test.go (1)
1102-1108: LGTM! The new MIG test entries are properly labeled and delegate to the appropriate test functions.
internal/get/get.go (4)
3-19: LGTM. Imports are appropriately updated for new functionality. The added imports (`context`, `regexp`, `strconv`, `strings`, `corev1`) are all required for the new MIG capability discovery functions.
62-79: LGTM. Improved error handling for pod listing. The explicit error handling and logging for empty pod lists improves debuggability. Minor note: line 78 returns `err`, which is always `nil` at that point; consider returning `nil` explicitly for clarity.
184-201: LGTM. Well-documented struct for MIG profile information. The `MIGProfileInfo` struct has clear comments explaining each field's purpose.
205-258: LGTM. Robust command execution with proper status checks. The function properly validates pod and container readiness before execution, and appropriately handles non-zero exit codes by returning output when available.
pkg/mig/mig.go (3)
1-51: LGTM. Package setup and imports are comprehensive for the MIG testing functionality. Color constants for logging are a nice touch for observability.
166-183: LGTM. Clean label checking implementation. The function properly handles comma-separated labels with whitespace trimming.
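That split-and-trim pattern also avoids the substring false positives raised elsewhere in this review ("mig" matching "nomig"). A minimal sketch, where the `hasLabel` name is an assumption rather than the repository's identifier:

```go
package main

import (
	"fmt"
	"strings"
)

// hasLabel reports whether target appears as an exact entry in a
// comma-separated label list, trimming surrounding whitespace from each
// entry. Exact comparison per entry prevents partial-string matches.
func hasLabel(labelList, target string) bool {
	for _, l := range strings.Split(labelList, ",") {
		if strings.TrimSpace(l) == target {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasLabel("gpu-burn-mig, single-mig , cleanup", "single-mig"))
	fmt.Println(hasLabel("nomig", "mig")) // no substring false positive
}
```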
351-385: LGTM. Proper MIG profile selection with bounds checking. The function handles random selection when no profile is specified and gracefully handles out-of-range indices. Note: since Go 1.20, the global random source is automatically seeded.
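The selection pattern described, honor an explicit in-range index, otherwise fall back to a random choice, can be sketched as below; `selectProfile` and its signature are illustrative, not the repository's API.

```go
package main

import (
	"fmt"
	"math/rand"
)

// selectProfile returns profiles[idx] when idx is in range, otherwise a
// random element. Since Go 1.20 the global rand source is seeded
// automatically, so no rand.Seed call is needed.
func selectProfile(profiles []string, idx int) string {
	if idx < 0 || idx >= len(profiles) {
		idx = rand.Intn(len(profiles)) // random fallback stays in bounds
	}
	return profiles[idx]
}

func main() {
	profiles := []string{"mig-1g.5gb", "mig-2g.10gb", "mig-3g.20gb"}
	// Deterministic when an in-range index is supplied, e.g. via an env override.
	fmt.Println(selectProfile(profiles, 1))
}
```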
Actionable comments posted: 1
🧹 Nitpick comments (9)
pkg/nodes/node.go (1)
11-11: New import adds coupling to an internal package.

The `gpuparams` import introduces a dependency from a general-purpose `pkg/nodes` package to an internal GPU-specific package. This creates tight coupling. Consider using the standard log level `100` (as used elsewhere in this file) or passing the log level as a parameter.

```diff
-	"github.com/rh-ecosystem-edge/nvidia-ci/internal/gpuparams"
```

Then in the `WithLabel` method, use `glog.V(100)` consistently with the rest of the file.

tests/nvidiagpu/deploy-gpu-test.go (3)
1114-1123: Remove commented-out code.

The commented `HasLabel("cleanup")` block (lines 1116-1122) appears to be leftover from development. Either implement this feature or remove the dead code.

```diff
 func cleanupGPUOperatorResources() {
 	if !cleanupAfterTest {
-		// if HasLabel("cleanup") {
-		//	glog.V(gpuparams.GpuLogLevel).Infof("Cleanup is enabled via cleanup label, running cleanup")
-		// } else {
-		//	glog.V(gpuparams.GpuLogLevel).Infof("Cleanup is disabled, skipping GPU operator cleanup")
-		//	return
-		// }
 		return
 	}
```
1184-1205: Remove the commented-out function.

The `shouldKeepOperator` function is entirely commented out and appears to be superseded by `mig.ShouldKeepOperator`. Remove this dead code to improve maintainability.
1102-1108: Consider consolidating duplicate MIG test implementations.

There are two test cases for the single MIG strategy:
- `gpu-burn-mig` (line 1102) uses the local `testGPUBurnWithMIGConfiguration()`
- `single-mig` (line 1106) uses `mig.TestSingleMIGGPUBurn()` from `pkg/mig`

If both test the same scenario, consider removing one to reduce maintenance burden. If they test different aspects, add comments clarifying the distinction.
internal/gpu-burn/gpu-burn.go (1)
62-63: Unused `timeout` parameter.

The `timeout` parameter is declared but never used in `CreateGPUBurnPodWithMIG`. Either use it (e.g., for a context deadline) or remove it to match the function's actual behavior.

```diff
 func CreateGPUBurnPodWithMIG(apiClient *clients.Settings, podName, podNamespace string,
-	gpuBurnImage string, migProfile string, migCount int, timeout time.Duration) (*corev1.Pod, error) {
+	gpuBurnImage string, migProfile string, migCount int) (*corev1.Pod, error) {
```

Note: the same issue exists in `CreateGPUBurnPod` (line 151), so this may be intentional for API consistency. If so, consider documenting why the parameter exists.

tests/mig/mig-test.go (4)
3-42: Clean up commented-out imports.

Many imports are commented out (lines 4-9, 13, 19, 22, 30-36, 40-41). This suggests work-in-progress code. Remove unused commented imports before merging to keep the code clean.

```diff
 import (
-	// "context"
-	// "encoding/json"
-	// "fmt"
-	// "math/rand"
-	// "strings"
-	// "time"
 	"github.com/rh-ecosystem-edge/nvidia-ci/internal/inittools"
-
-	// "github.com/rh-ecosystem-edge/nvidia-ci/internal/nfd"
 	"github.com/rh-ecosystem-edge/nvidia-ci/internal/nvidiagpuconfig"
 	_ "github.com/rh-ecosystem-edge/nvidia-ci/pkg/clients"
 	. "github.com/rh-ecosystem-edge/nvidia-ci/pkg/global"
 	"github.com/rh-ecosystem-edge/nvidia-ci/pkg/mig"
-
-	// "github.com/rh-ecosystem-edge/nvidia-ci/pkg/mig"
 	nfd "github.com/rh-ecosystem-edge/nvidia-ci/pkg/nfd"
-
-	// "github.com/rh-ecosystem-edge/nvidia-ci/pkg/nodes"
 	"github.com/rh-ecosystem-edge/nvidia-ci/pkg/nvidiagpu"
 	"github.com/rh-ecosystem-edge/nvidia-ci/pkg/operatorconfig"

 	"github.com/golang/glog"
 	. "github.com/onsi/ginkgo/v2"
 	. "github.com/onsi/gomega"
-
-	// "github.com/rh-ecosystem-edge/nvidia-ci/pkg/configmap"
-
-	// "github.com/rh-ecosystem-edge/nvidia-ci/pkg/namespace"
-	// "github.com/rh-ecosystem-edge/nvidia-ci/pkg/pod"
-
-	// "github.com/operator-framework/api/pkg/operators/v1alpha1"
-	// "github.com/rh-ecosystem-edge/nvidia-ci/internal/check"
-
 	"github.com/rh-ecosystem-edge/nvidia-ci/internal/gpuparams"
 	"github.com/rh-ecosystem-edge/nvidia-ci/internal/tsparams"
-	// "github.com/rh-ecosystem-edge/nvidia-ci/internal/wait"
-	// corev1 "k8s.io/api/core/v1"
 )
```
50-58: Consider centralizing shared test configuration.

`WorkerNodeSelector` and `BurnImageName` are duplicated from `tests/nvidiagpu/deploy-gpu-test.go`. Consider moving these to a shared test utilities package to avoid drift between test suites.
64-73: Remove or use declared variables.

Several package-level variables are declared but never used:
- `ScaleCluster` (line 64)
- `UseSingleMIGProfile` (line 65)
- `UseMixedMIGProfile` (line 66)
- `SingleMigProfile` (line 67)
- `MixedMigProfile` (line 68)

Additionally, lines 71-72 contain commented-out variable declarations. Either implement the intended functionality or remove these to avoid confusion.
113-116: Mark the placeholder test as pending or skip it.

The mixed MIG test (lines 113-116) only logs a message and doesn't perform any assertions. This will pass silently and may give false confidence. Either:
- Use `Skip("not yet implemented")` to clearly indicate it's not ready
- Use Ginkgo's `PIt` (pending It) to mark it as pending
- Remove it until the implementation is ready

```diff
-	It("Test GPU Burn with mixed strategy MIG Configuration", Label("gpu-mixed-mig"), func() {
-		glog.V(gpuparams.Gpu10LogLevel).Infof("gpu-mixed-mig testcase not yet implemented")
-		// mig.TestMixedMIGGPUBurn(nvidiaGPUConfig, burn, BurnImageName, WorkerNodeSelector, cleanupAfterTest)
-	})
+	PIt("Test GPU Burn with mixed strategy MIG Configuration", Label("gpu-mixed-mig"), func() {
+		// TODO: Implement mixed MIG test
+		// mig.TestMixedMIGGPUBurn(nvidiaGPUConfig, burn, BurnImageName, WorkerNodeSelector, cleanupAfterTest)
+	})
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- internal/gpu-burn/gpu-burn.go (2 hunks)
- pkg/mig/mig.go (1 hunks)
- pkg/nodes/node.go (3 hunks)
- tests/mig/mig-test.go (1 hunks)
- tests/nvidiagpu/deploy-gpu-test.go (17 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/mig/mig.go
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:56.555Z
Learning: In Kubernetes with NVIDIA GPUs, MIG single strategy profile uses the resource name "nvidia.com/gpu", while MIG mixed strategy profile uses specific resource names like "nvidia.com/mig-1g.5gb" based on the MIG profile.
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
📚 Learning: 2025-12-15T13:26:55.741Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
Applied to files:
- pkg/nodes/node.go
- tests/mig/mig-test.go
- internal/gpu-burn/gpu-burn.go
📚 Learning: 2025-12-12T10:59:47.614Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:47.614Z
Learning: When coding Kubernetes/NVIDIA MIG handling in Go, distinguish MIG profiles by their resource names: use 'nvidia.com/gpu' for the single MIG strategy, and use the MIG-profile-specific resources such as 'nvidia.com/mig-1g.5gb' for the mixed strategy. Implement checks or constants that reference these exact resource names, and avoid assuming a single generic resource name. Consider centralizing these names (e.g., constants or a small map) and unit-test logic that validates resource-name selection for each MIG profile to prevent misclassification.
Applied to files:
- pkg/nodes/node.go
- tests/nvidiagpu/deploy-gpu-test.go
- tests/mig/mig-test.go
- internal/gpu-burn/gpu-burn.go
📚 Learning: 2025-12-15T13:26:50.784Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:50.784Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, during cleanup of MIG labels nvidia.com/mig.strategy and nvidia.com/mig.config on GPU nodes after tests, use RemoveLabel to remove these labels instead of assigning invalid values. Only valid mig.strategy values are 'single' or 'mixed'.
Applied to files:
tests/nvidiagpu/deploy-gpu-test.go
📚 Learning: 2025-06-18T14:59:40.987Z
Learnt from: wabouhamad
Repo: rh-ecosystem-edge/nvidia-ci PR: 205
File: tests/nvidianetwork/deploy-nno-test.go:1411-1414
Timestamp: 2025-06-18T14:59:40.987Z
Learning: In SRIOV RDMA testcases in tests/nvidianetwork/deploy-nno-test.go, the string "sriov" is intentionally passed as the device parameter to CreateRdmaWorkloadPod instead of the actual mlx5 device ID. The workload container is designed to dynamically discover the correct mlx5 device for SRIOV cases at runtime.
Applied to files:
tests/mig/mig-test.go
🧬 Code graph analysis (3)
tests/nvidiagpu/deploy-gpu-test.go (3)
- pkg/mig/mig.go (1): ShouldKeepOperator (264-295)
- pkg/global/const.go (1): UndefinedValue (5-5)
- internal/wait/wait.go (1): ClusterPolicyReady (17-46)

tests/mig/mig-test.go (10)
- pkg/operatorconfig/config.go (1): NewCustomConfig (18-27)
- pkg/nvidiagpu/gpuBurnConfig.go (1): NewDefaultGPUBurnConfig (15-22)
- internal/nvidiagpuconfig/config.go (2): NvidiaGPUConfig (9-21), NewNvidiaGPUConfig (25-37)
- pkg/global/const.go (1): UndefinedValue (5-5)
- internal/tsparams/consts.go (1): LabelSuite (5-5)
- vendor/github.com/golang/glog/glog.go (2): V (401-403), Infof (507-509)
- internal/gpuparams/const.go (3): Gpu10LogLevel (9-9), Gpu100LogLevel (11-11), GpuLogLevel (8-8)
- pkg/mig/mig.go (2): ReportOpenShiftVersionAndEnsureNFD (158-170), CleanupGPUOperatorResources (193-261)
- internal/inittools/inittools.go (1): APIClient (20-20)
- tests/nvidiagpu/deploy-gpu-test.go (1): BurnImageName (62-65)

internal/gpu-burn/gpu-burn.go (3)
- pkg/clients/clients.go (1): Settings (45-60)
- internal/gpuparams/gpuvars.go (1): Labels (5-5)
- internal/tsparams/gpu-vars.go (1): Labels (11-11)
🔇 Additional comments (8)
pkg/nodes/node.go (2)
203-205: Good documentation clarification.

The updated comment clearly distinguishes `WithNewLabel` (no overwrite) from the new `WithLabel` (overwrites). This helps API consumers choose the right method.
235-257: LGTM. Clean implementation of the overwrite-capable label setter.

The method correctly:
- Validates input (empty key check)
- Initializes the Labels map if nil
- Unconditionally sets the label value (overwrite behavior)
- Returns the builder for chaining

Minor note: the error logging order (`errorMsg` set before logging at lines 245-246) differs slightly from `WithNewLabel` (log then `errorMsg` at lines 213-214), but this doesn't affect functionality.

tests/nvidiagpu/deploy-gpu-test.go (4)
93-93: Package-level slice initialization looks correct.

The `labelsToCheck` slice is initialized as empty and populated in `BeforeAll` (line 154). This avoids issues with package-level initialization order.
153-162: Good test coordination pattern.

The `labelsToCheck` array and `ShouldKeepOperator` integration properly prevent premature cleanup when subsequent MIG or upgrade tests need the operator to remain installed.
1273-1282: Random MIG profile selection may complicate debugging.

Using `rand.Intn` for MIG profile selection provides good test variety, but failures may be harder to reproduce. Consider logging the selected index prominently or allowing an environment variable override for deterministic testing. The code at line 1276 already logs the selected index, which helps.

Consider verifying whether the `NVIDIAGPU_SINGLE_MIG_PROFILE` env var from `nvidiaGPUConfig.SingleMIGProfile` is intended to override this random selection. The current code doesn't appear to use it here.
1592-1605: LGTM. MIG-aware log validation. The validation loop correctly checks for "GPU X: OK" for each available MIG slice (lines 1595-1600), and separately verifies the completion message "100.0% proc'd:" (line 1602). This properly validates MIG workload distribution.
internal/gpu-burn/gpu-burn.go (2)
70-73: LGTM. Correct MIG resource naming.

The comment accurately explains the MIG resource naming convention:
- Single strategy: `migProfile="gpu"` → `"nvidia.com/gpu"`
- Mixed strategy: `migProfile="mig-1g.5gb"` → `"nvidia.com/mig-1g.5gb"`

This aligns with the retrieved learnings about NVIDIA MIG resource naming.
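That naming convention can be captured in a small helper, in line with the earlier suggestion to centralize these resource names; `migResourceName` is a hypothetical sketch, not the repository's code.

```go
package main

import "fmt"

// migResourceName builds the Kubernetes extended-resource name a GPU workload
// requests: the single strategy exposes the generic "nvidia.com/gpu", while
// the mixed strategy exposes per-profile resources such as
// "nvidia.com/mig-1g.5gb". Centralizing this avoids ad-hoc string building.
func migResourceName(migProfile string) string {
	return "nvidia.com/" + migProfile
}

func main() {
	fmt.Println(migResourceName("gpu"))        // single strategy
	fmt.Println(migResourceName("mig-1g.5gb")) // mixed strategy
}
```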
75-146: LGTM. Well-structured MIG Pod configuration.

The Pod configuration correctly includes:
- MIG resource limits and requests using the constructed resource name
- Appropriate security context (non-root, seccomp profile, dropped capabilities)
- GPU tolerations for scheduling
- ConfigMap volume mount for the entrypoint script
- Node selector requiring GPU-present worker nodes
Actionable comments posted: 3
🧹 Nitpick comments (4)
pkg/mig/mig.go (4)
279-285: Remove commented-out code.

The commented-out loop (lines 279-285) appears to be an alternative implementation using `MatchesLabelFilter`. If this approach is no longer needed, remove it to improve code clarity.

Apply this diff:

```diff
 	// Check if test has any of these labels
-	// for _, label := range labelsToCheck {
-	//	glog.V(gpuparams.GpuLogLevel).Infof("Checking if label %s is present in CurrentSpecReport", label)
-	//	if matches, _ := specReport.MatchesLabelFilter(label); matches {
-	//		glog.V(gpuparams.Gpu100LogLevel).Infof("Label %s is present in CurrentSpecReport", label)
-	//		return true
-	//	}
-	// }
 	for _, label := range labelsToCheck {
```
408-425: Consider combining label updates into a single API call per node.

Currently, each node receives two separate `Update()` calls: one for `nvidia.com/mig.strategy` and another for `nvidia.com/mig.config`. This results in two network round-trips per node. Chain both `WithLabel()` calls before updating to reduce API calls.

Apply this diff:

```diff
 for _, nodeBuilder := range nodeBuilders {
 	glog.V(gpuparams.GpuLogLevel).Infof("Setting MIG %s strategy label on node '%s' (overwrite=true)", strategy, nodeBuilder.Definition.Name)
-	nodeBuilder = nodeBuilder.WithLabel("nvidia.com/mig.strategy", strategy)
-	_, err = nodeBuilder.Update()
-	if err != nil {
-		glog.V(gpuparams.GpuLogLevel).Infof("Failed to update node '%s' with MIG label: %v, error: %v", nodeBuilder.Definition.Name, strategy, err)
-	} else {
-		glog.V(gpuparams.GpuLogLevel).Infof("Successfully set MIG %s strategy label on node '%s'", strategy, nodeBuilder.Definition.Name)
-	}
 	glog.V(gpuparams.GpuLogLevel).Infof("Setting MIG configuration label %s on node '%s' (overwrite=true)", MigProfile, nodeBuilder.Definition.Name)
+	nodeBuilder = nodeBuilder.WithLabel("nvidia.com/mig.strategy", strategy)
 	nodeBuilder = nodeBuilder.WithLabel("nvidia.com/mig.config", MigProfile)
 	_, err = nodeBuilder.Update()
 	if err != nil {
-		glog.V(gpuparams.GpuLogLevel).Infof("Failed to update node '%s' with MIG label: %v", nodeBuilder.Definition.Name, err)
+		glog.V(gpuparams.GpuLogLevel).Infof("Failed to update node '%s' with MIG labels: %v", nodeBuilder.Definition.Name, err)
 	} else {
-		glog.V(gpuparams.GpuLogLevel).Infof("Successfully set MIG configuration label on node '%s' with %s", nodeBuilder.Definition.Name, MigProfile)
+		glog.V(gpuparams.GpuLogLevel).Infof("Successfully set MIG %s strategy and %s configuration labels on node '%s'", strategy, MigProfile, nodeBuilder.Definition.Name)
 	}
 }
```
576-585: Remove unreachable error check.

Line 584's `Expect()` assertion is unreachable: if `err` is non-nil, line 578 returns early; otherwise `err` is nil and the `Expect()` always succeeds. Remove the redundant check.

Apply this diff:

```diff
 glog.V(gpuparams.Gpu10LogLevel).Infof("Getting cluster architecture from nodes with "+
 	"WorkerNodeSelector: %v", WorkerNodeSelector)
 clusterArch, err := get.GetClusterArchitecture(inittools.APIClient, WorkerNodeSelector)
 if err != nil {
 	return "", err
 }
 glog.V(gpuparams.Gpu10LogLevel).Infof("cluster architecture for GPU enabled worker node is: %s", clusterArch)
-Expect(err).ToNot(HaveOccurred(), "error configuring MIG strategy and getting cluster architecture: %v", err)
 time.Sleep(30 * time.Second)
 return clusterArch, nil
```
767-781: Consider removing commented-out example code.

The commented functions `colorlogs()` and `clog()` appear to be test/example code that's no longer needed. Remove them to keep the codebase clean.

Apply this diff:

```diff
 	return fmt.Sprintf("%s%s%s", color, message, colorReset)
 }
-
-// // colorlogs prints sample colored log messages for testing color output.
-// // This function demonstrates all available color options for console highlighting.
-// func colorlogs() {
-// 	glog.V(101).Infof("%s", colorLog(colorCyan+colorBold, "Cyan"))
-// 	glog.V(101).Infof("%s", colorLog(colorGreen+colorBold, "Green"))
-// 	glog.V(101).Infof("%s", colorLog(colorYellow+colorBold, "Yellow"))
-// 	glog.V(101).Infof("%s", colorLog(colorBlue+colorBold, "Blue"))
-// 	glog.V(101).Infof("%s", colorLog(colorMagenta+colorBold, "Magenta"))
-// 	glog.V(101).Infof("%s", colorLog(colorWhite+colorBold, "White"))
-// 	glog.V(101).Infof("%s", colorLog(colorRed+colorBold, "Red"))
-// }
-
-// func clog(verbosity int32, color, message string, args ...interface{}) {
-// 	glog.V(glog.Level(verbosity)).Infof("%s", Cl(color, fmt.Sprintf(message, args...)))
-// }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
pkg/mig/mig.go (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:56.555Z
Learning: In Kubernetes with NVIDIA GPUs, MIG single strategy profile uses the resource name "nvidia.com/gpu", while MIG mixed strategy profile uses specific resource names like "nvidia.com/mig-1g.5gb" based on the MIG profile.
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
📚 Learning: 2025-12-15T13:26:55.741Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
Applied to files:
pkg/mig/mig.go
📚 Learning: 2025-12-12T10:59:47.614Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:47.614Z
Learning: When coding Kubernetes/NVIDIA MIG handling in Go, distinguish MIG profiles by their resource names: use 'nvidia.com/gpu' for the single MIG strategy, and use the MIG-profile-specific resources such as 'nvidia.com/mig-1g.5gb' for the mixed strategy. Implement checks or constants that reference these exact resource names, and avoid assuming a single generic resource name. Consider centralizing these names (e.g., constants or a small map) and unit-test logic that validates resource-name selection for each MIG profile to prevent misclassification.
Applied to files:
pkg/mig/mig.go
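The centralization this learning recommends can be sketched as a small pure function with unit-testable logic. The name `migResourceName` is hypothetical, not existing repo API:

```go
package main

import "fmt"

// migResourceName returns the extended resource name a workload must request
// for a given MIG strategy, centralizing the exact names the learning mentions.
// This is a hypothetical helper, not part of the nvidia-ci codebase.
func migResourceName(strategy, profile string) string {
	if strategy == "single" {
		// Single strategy exposes every MIG slice as a generic GPU.
		return "nvidia.com/gpu"
	}
	// Mixed strategy exposes one resource per profile, e.g. nvidia.com/mig-1g.5gb.
	return "nvidia.com/mig-" + profile
}

func main() {
	fmt.Println(migResourceName("single", "1g.5gb")) // nvidia.com/gpu
	fmt.Println(migResourceName("mixed", "1g.5gb"))  // nvidia.com/mig-1g.5gb
}
```

Keeping both names behind one function makes misclassification a unit-test failure rather than a cluster debugging session.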
🧬 Code graph analysis (1)
pkg/mig/mig.go (8)
internal/nvidiagpuconfig/config.go (1): `NvidiaGPUConfig` (9-21)
internal/get/get.go (2): `MIGProfileInfo` (185-201), `MIGCapabilities` (114-182)
internal/inittools/inittools.go (1): `GetOpenShiftVersion` (47-69)
pkg/nodes/node.go (2): `Pull` (119-138), `Builder` (30-36)
pkg/olm/subscription.go (1): `PullSubscription` (237-270)
internal/wait/wait.go (1): `ClusterPolicyReady` (17-46)
internal/check/check.go (1): `NodeWithLabel` (62-87)
internal/gpu-burn/gpu-burn.go (1): `CreateGPUBurnConfigMap` (38-59)
Actionable comments posted: 0
🧹 Nitpick comments (2)
pkg/mig/mig.go (1)
777-791: Consider removing commented-out code.

The commented-out functions `colorlogs` and `clog` at the end of the file could be removed if they're not needed for future reference, or moved to a separate utilities file if they might be useful later.

Apply this diff if you want to remove them:

```diff
-// // colorlogs prints sample colored log messages for testing color output.
-// // This function demonstrates all available color options for console highlighting.
-// func colorlogs() {
-// 	glog.V(101).Infof("%s", colorLog(colorCyan+colorBold, "Cyan"))
-// 	glog.V(101).Infof("%s", colorLog(colorGreen+colorBold, "Green"))
-// 	glog.V(101).Infof("%s", colorLog(colorYellow+colorBold, "Yellow"))
-// 	glog.V(101).Infof("%s", colorLog(colorBlue+colorBold, "Blue"))
-// 	glog.V(101).Infof("%s", colorLog(colorMagenta+colorBold, "Magenta"))
-// 	glog.V(101).Infof("%s", colorLog(colorWhite+colorBold, "White"))
-// 	glog.V(101).Infof("%s", colorLog(colorRed+colorBold, "Red"))
-// }
-
-// func clog(verbosity int32, color, message string, args ...interface{}) {
-// 	glog.V(glog.Level(verbosity)).Infof("%s", Cl(color, fmt.Sprintf(message, args...)))
-// }
```

tests/mig/mig-test.go (1)

3-42: Consider cleaning up commented imports.

The test file contains many commented-out imports (lines 4-41). If these aren't needed for reference or upcoming development, consider removing them to improve code clarity.

Apply this diff to remove unnecessary commented imports:

```diff
 import (
-	// "context"
-	// "encoding/json"
-	// "fmt"
-	// "math/rand"
-	// "strings"
-	// "time"
-
 	"github.com/rh-ecosystem-edge/nvidia-ci/internal/inittools"
-
-	// "github.com/rh-ecosystem-edge/nvidia-ci/internal/nfd"
 	"github.com/rh-ecosystem-edge/nvidia-ci/internal/nvidiagpuconfig"
 	_ "github.com/rh-ecosystem-edge/nvidia-ci/pkg/clients"
 	. "github.com/rh-ecosystem-edge/nvidia-ci/pkg/global"
 	"github.com/rh-ecosystem-edge/nvidia-ci/pkg/mig"
-
-	// "github.com/rh-ecosystem-edge/nvidia-ci/pkg/mig"
 	nfd "github.com/rh-ecosystem-edge/nvidia-ci/pkg/nfd"
-
-	// "github.com/rh-ecosystem-edge/nvidia-ci/pkg/nodes"
 	"github.com/rh-ecosystem-edge/nvidia-ci/pkg/nvidiagpu"
 	"github.com/rh-ecosystem-edge/nvidia-ci/pkg/operatorconfig"

 	"github.com/golang/glog"
 	. "github.com/onsi/ginkgo/v2"
 	. "github.com/onsi/gomega"
-
-	// "github.com/rh-ecosystem-edge/nvidia-ci/pkg/configmap"
-
-	// "github.com/rh-ecosystem-edge/nvidia-ci/pkg/namespace"
-	// "github.com/rh-ecosystem-edge/nvidia-ci/pkg/pod"
-
-	// "github.com/operator-framework/api/pkg/operators/v1alpha1"
-	// "github.com/rh-ecosystem-edge/nvidia-ci/internal/check"
-
 	"github.com/rh-ecosystem-edge/nvidia-ci/internal/gpuparams"
 	"github.com/rh-ecosystem-edge/nvidia-ci/internal/tsparams"
-
-	// "github.com/rh-ecosystem-edge/nvidia-ci/internal/wait"
-	// corev1 "k8s.io/api/core/v1"
 )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
pkg/mig/mig.go (1 hunks)
tests/mig/mig-test.go (1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:56.555Z
Learning: In Kubernetes with NVIDIA GPUs, MIG single strategy profile uses the resource name "nvidia.com/gpu", while MIG mixed strategy profile uses specific resource names like "nvidia.com/mig-1g.5gb" based on the MIG profile.
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
📚 Learning: 2025-12-15T13:26:55.741Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
Applied to files:
tests/mig/mig-test.go
pkg/mig/mig.go
📚 Learning: 2025-06-18T14:59:40.987Z
Learnt from: wabouhamad
Repo: rh-ecosystem-edge/nvidia-ci PR: 205
File: tests/nvidianetwork/deploy-nno-test.go:1411-1414
Timestamp: 2025-06-18T14:59:40.987Z
Learning: In SRIOV RDMA testcases in tests/nvidianetwork/deploy-nno-test.go, the string "sriov" is intentionally passed as the device parameter to CreateRdmaWorkloadPod instead of the actual mlx5 device ID. The workload container is designed to dynamically discover the correct mlx5 device for SRIOV cases at runtime.
Applied to files:
tests/mig/mig-test.go
📚 Learning: 2025-12-12T10:59:47.614Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:47.614Z
Learning: When coding Kubernetes/NVIDIA MIG handling in Go, distinguish MIG profiles by their resource names: use 'nvidia.com/gpu' for the single MIG strategy, and use the MIG-profile-specific resources such as 'nvidia.com/mig-1g.5gb' for the mixed strategy. Implement checks or constants that reference these exact resource names, and avoid assuming a single generic resource name. Consider centralizing these names (e.g., constants or a small map) and unit-test logic that validates resource-name selection for each MIG profile to prevent misclassification.
Applied to files:
tests/mig/mig-test.go
pkg/mig/mig.go
📚 Learning: 2025-03-09T17:13:24.077Z
Learnt from: empovit
Repo: rh-ecosystem-edge/nvidia-ci PR: 81
File: workflows/nvidia_gpu_operator.py:10-33
Timestamp: 2025-03-09T17:13:24.077Z
Learning: In the nvidia-ci repository, the API interaction functions should use fail-fast behavior rather than graceful error handling. The workflow should fail explicitly if APIs change or don't return expected data (like authentication tokens or tags), as this makes problems immediately visible.
Applied to files:
pkg/mig/mig.go
🧬 Code graph analysis (1)
pkg/mig/mig.go (8)
internal/nvidiagpuconfig/config.go (1): `NvidiaGPUConfig` (9-21)
internal/get/get.go (2): `MIGProfileInfo` (185-201), `GetFirstPodNameWithLabel` (62-79)
internal/gpuparams/const.go (3): `Gpu10LogLevel` (9), `Gpu100LogLevel` (11), `GpuLogLevel` (8)
pkg/global/const.go (1): `OpenShiftVersionFile` (7)
pkg/nodes/node.go (2): `Pull` (119-138), `Builder` (30-36)
pkg/olm/subscription.go (1): `PullSubscription` (237-270)
internal/wait/wait.go (1): `ClusterPolicyReady` (17-46)
internal/check/check.go (1): `NodeWithLabel` (62-87)
🔇 Additional comments (4)
pkg/mig/mig.go (3)
52-162: LGTM! Comprehensive MIG test orchestration.

The main test function properly orchestrates the MIG GPU burn test flow with appropriate error handling, cleanup defers, and step-by-step validation. All previously identified issues (defer assertions, error checking) have been addressed.

469-508: Correct label cleanup implementation.

The function properly removes the `nvidia.com/mig.strategy` label using `RemoveLabel` (line 478) and sets `nvidia.com/mig.config` to `"all-disabled"` (line 486), which aligns with the valid values for these labels as documented in the learnings.

681-769: LGTM! Pod deployment and validation logic is sound.

The pod deployment, completion waiting, log retrieval, and validation functions are well-structured. All format string issues from previous reviews have been corrected, and the validation logic properly checks for GPU success indicators across all MIG instances.
tests/mig/mig-test.go (1)
75-119: Well-structured test suite with proper lifecycle management.

The test suite properly initializes configuration in `BeforeAll` (with a nil check on line 82) and performs cleanup in `AfterAll`. The single MIG test case is implemented, and the mixed MIG test case is appropriately marked as not yet implemented.
.gitignore
Outdated
```
/archive/
*.test
```
What are these? They don't look like common patterns to exclude. If it's something only you use, please find a different solution, e.g. change the global Git config.
`mig.test` seems to be a binary that was produced with some command (`go test -c ./tests/mig`).
`archive` was a folder where I moved some log files from previous days which I didn't want to discard yet.
Neither is necessary in general; I can use another location for the archive folder, and `mig.test` or any `*.test` binaries can be deleted manually before committing anything.
internal/get/get.go
Outdated
```go
	glog.V(gpuparams.GpuLogLevel).Infof("error listing pods in namespace %s with label selector %s: %v", podNamespace, podLabelSelector, err)
	return "", err
}

if len(podList) == 0 {
	glog.V(gpuparams.GpuLogLevel).Infof("no pods found in namespace %s with label selector %s", podNamespace, podLabelSelector)
```
This logging is actually a bad pattern, although it's widely used. Here, you already add a useful context to the error message, and pass it on. You don't know how the error will be used, but likely it will be acted upon or displayed to the user, or propagated until it will be acted upon or displayed to the user (and so on). If the client code follows the same error handling pattern, this error will be logged again, and again. And notice it's not a debug (which is normally turned off and is there only in case something needs fixing), but an Info. So we'll end up with a lot of log messages related to the same error. Two issues:
- Too many log messages to process in a meaningful way (by humans).
- Multiple log messages related to the same error, but with different wording makes it look like there were multiple errors.
So, I'd say leave it up to the client code to handle errors your code doesn't know how to handle. They may log and do something useful, or they may add useful context to the error and propagate it further.
I encourage you to read about best practices for logging and prove me wrong :)
I have used those logs a lot to keep track of how the execution goes, and I did clean up lots of them before I committed. Some of the logs are generated by Cursor (I recognize and remember that this part is Cursor-generated), and here they were for 2 seemingly different scenarios, although maybe theoretical.
I'll remove those extra logs, they are of no interest.
internal/get/get.go
Outdated
```go
}

// MIGCapabilities queries GPU hardware directly using nvidia-smi
// to discover MIG capabilities. This is a fallback when GFD labels are not available.
```
Should we fall back, or just fail the test if what's expected is not there (the GFD labels in this case)?
If we're testing this feature end-to-end, we can miss a real customer problem by implementing this fallback, IIUC.
If the MIG profile is not found, I have tested 2 scenarios with the mig package:

- gpu-operator is installed, MIG is disabled --> normal situation, the testcase selects the MIG profile and continues
- gpu-operator is not installed --> the testcase does not crash, it skips the test
In this case I need to see the nvidia-smi printout. I couldn't see another way to get it.
Comments may be distracting, maybe just leave out the fallback comment?
internal/get/get.go
Outdated
```go
firstNode := nodeBuilder[0]
nodeName := firstNode.Object.Name

// Find a driver pod or GFD pod on this node to query hardware
```
Sorry why do we need a GFD pod? Can't we use the nvidia.com/mig.capable node label?
I need the printout of nvidia-smi to get the profiles:

```
$ oc exec -ti -n nvidia-gpu-operator nvidia-driver-daemonset-9.6.20250925-0-9fq7p -- nvidia-smi mig -lgip | grep 'MIG|GPU' | grep -v '+me'
| GPU instance profiles:                                           |
| GPU   Name          ID   Instances   Memory   P2P   SM   DEC  ENC|
|   0  MIG 1g.5gb     19     7/7        4.75    No    14    0    0 |
|   0  MIG 1g.10gb    15     4/4        9.75    No    14    1    0 |
|   0  MIG 2g.10gb    14     3/3        9.75    No    28    1    0 |
|   0  MIG 3g.20gb     9     2/2       19.62    No    42    2    0 |
|   0  MIG 4g.20gb     5     1/1       19.62    No    56    2    0 |
|   0  MIG 7g.40gb     0     1/1       39.38    No    98    5    0 |
```
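For illustration, extracting profile names from a printout like the one above could look like this sketch (`parseProfiles` and the regex are assumptions, not the PR's actual implementation):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// migProfileRe matches the profile name column of `nvidia-smi mig -lgip`
// rows, e.g. "MIG 1g.5gb" or "MIG 3g.20gb".
var migProfileRe = regexp.MustCompile(`MIG\s+(\d+g\.\d+gb)`)

// parseProfiles extracts MIG profile names from raw nvidia-smi output,
// skipping "+me" (media extension) variants like the grep above does.
func parseProfiles(output string) []string {
	var profiles []string
	for _, line := range strings.Split(output, "\n") {
		if strings.Contains(line, "+me") {
			continue
		}
		if m := migProfileRe.FindStringSubmatch(line); m != nil {
			profiles = append(profiles, m[1])
		}
	}
	return profiles
}

func main() {
	sample := `|   0  MIG 1g.5gb     19    7/7     4.75  No  14  0  0 |
|   0  MIG 1g.5gb+me  20    1/1     4.75  No  14  1  0 |
|   0  MIG 7g.40gb     0    1/1    39.38  No  98  5  0 |`
	fmt.Println(parseProfiles(sample)) // [1g.5gb 7g.40gb]
}
```

Pure string-in/strings-out parsing like this is easy to unit-test against captured nvidia-smi output, without a cluster.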
internal/get/get.go
Outdated
```go
// MIGCapabilities queries GPU hardware directly using nvidia-smi
// to discover MIG capabilities. This is a fallback when GFD labels are not available.
// Returns true if MIG is supported, along with available MIG instance profiles.
func MIGCapabilities(apiClient *clients.Settings, nodeSelector map[string]string) (bool, []MIGProfileInfo, error) {
```
I'm strongly against putting this in the get package. Code should be grouped around what domain it operates on, not which actions it performs. Unless it's a common operation like, for instance, get from K8s API (get pod/node/deployment/.../anything). Even then, if it's get pod, get node, ... they probably belong in "pod", "node", etc. (and not "get").
We didn't have time to refactor this, but let's not make it worse 😉
All MIG-related code should probably be somewhere in "gpu", or "mig", or "gpu/mig".
OK, moving this and related parts into mig.go
internal/get/get.go
Outdated
```go
// Query MIG capabilities using nvidia-smi
// First, try to get MIG instance profiles directly (works even if MIG mode is not enabled)
cmd := []string{"nvidia-smi", "mig", "-lgip"}
```
I think all this nvidia-smi code can be very useful and should be made reusable: finding the right pod, running the command. The function would be generic, receiving command arguments and returning the output.
Regardless of that, this entire function seems too long. It includes a few steps that would be easier to understand and maintain as separate functions.
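A generic shape for such a helper might look like the sketch below. The `CommandRunner` interface is an assumption standing in for whatever executes commands in a pod (e.g. the repo's `podBuilder.ExecCommand`); the fake implementation makes it exercisable without a cluster:

```go
package main

import (
	"bytes"
	"fmt"
)

// CommandRunner abstracts "execute a command in a pod". This interface is
// hypothetical; in nvidia-ci it would be satisfied by the pod builder type.
type CommandRunner interface {
	ExecCommand(command []string) (bytes.Buffer, error)
}

// RunNvidiaSMI is a reusable wrapper: generic over arguments, returning the
// raw output so callers can parse profiles, capabilities, etc.
func RunNvidiaSMI(r CommandRunner, args ...string) (string, error) {
	cmd := append([]string{"nvidia-smi"}, args...)
	out, err := r.ExecCommand(cmd)
	if err != nil {
		return "", fmt.Errorf("running %v: %w", cmd, err)
	}
	return out.String(), nil
}

// fakeRunner returns canned output, so the helper is testable without GPUs.
type fakeRunner struct{ output string }

func (f fakeRunner) ExecCommand(command []string) (bytes.Buffer, error) {
	var b bytes.Buffer
	b.WriteString(f.output)
	return b, nil
}

func main() {
	out, _ := RunNvidiaSMI(fakeRunner{output: "GPU instance profiles:"}, "mig", "-lgip")
	fmt.Println(out) // GPU instance profiles:
}
```

Splitting "find the pod", "run the command", and "parse the output" into separate functions also addresses the length concern.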
OK, we have to check further how it could be shortened. Or did you mean that it should be split into smaller pieces?
The functions can be split further to be reusable for other purposes, as they are currently quite specific to this particular case.
internal/get/get.go
Outdated
```go
	}
	return true, profiles, nil
} else {
	// Even if command failed, check if we got any output (non-zero exit codes still produce output)
```
I'm not sure it's useful. What kind of output will it produce if it fails?
There was a specific situation where I remember seeing an error code while running the case, and when I checked it manually, it returned a valid printout. I'm not sure whether it also returned a non-zero return code when I checked it manually. This may be unnecessary, but I tend to add code where a case previously crashed or failed (to prevent the crash or failure).
internal/get/get.go
Outdated
```go
// Internal functions serving the external functions

// execCommandInPod executes a command in a pod and returns the output
func execCommandInPod(apiClient *clients.Settings, podName, namespace string, command []string) (string, error) {
```
This one can be useful as well. Don't we already have something like it?
Good catch :) I had taken mps_test.go as a basis for the mig package and tests, and there is an `executeCommandInPod` in mps_test.go.
I'll change that.
Actually, `execCommandInPod` calls `podBuilder.ExecCommand`. Even though the function name is quite descriptive in this case, I think I will change it so that it won't be misleading in any way.
internal/gpuparams/gpuvars.go
Outdated
```go
	// Labels represents the range of labels that can be used for test cases selection.
	Labels = []string{"nvidia-ci", Label}
)
```
Any commented-out code should be removed. We either need it, or we don't, no in-between :) Don't worry, we'll have it in Git history anyway.
removing, did not notice earlier
pkg/mig/mig.go
Outdated
```go
// colorRed     = "\033[31m"
// colorGreen   = "\033[32m"
// colorYellow  = "\033[33m"
// colorBlue    = "\033[34m"
// colorMagenta = "\033[35m"
colorCyan = "\033[36m"
// colorWhite = "\033[37m"
colorBold = "\033[1m"
```
What console? Will it be capable of displaying colors? Or will it be a Prow log file with gibberish that only confuses the readers? :) Here too, I think we should remove all unused/commented-out code.
Above is an example of the output. I don't know how it works on Prow.
@vtruhpon great first PR! You've done a lot of work. I had a few questions/comments, PTAL.
Actionable comments posted: 0
🧹 Nitpick comments (5)
tests/nvidiagpu/deploy-gpu-test.go (2)
93-95: Consider defining timeout constants in a shared location.

The `QuickTimeout` and `QuickInterval` constants are defined at package scope here, but similar constants also appear in `pkg/mig/mig.go` (lines 45-46). This duplication can lead to inconsistencies if timeout values need adjustment.

Consider centralizing MIG-related timing constants

Move these constants to `internal/gpuparams/const.go` or a dedicated MIG parameters file to ensure consistent timeout behavior across the codebase.
1235-1593: Refactor `testWorkloadWithSingleMig` to improve maintainability.

The `testWorkloadWithSingleMig` function spans 358 lines and handles multiple responsibilities: MIG profile selection, node labeling, ClusterPolicy orchestration, namespace/configmap/pod creation, and log validation. This makes it difficult to test individual components and understand the flow.

Recommended refactoring approach

The function could be split into well-defined phases using helper functions (similar to what's done in `pkg/mig/mig.go`):

- `validateMIGCapability()` - check the mig.capable label
- `setupMIGConfiguration()` - query profiles, set labels, wait for ClusterPolicy
- `deployMIGWorkload()` - create namespace, configmap, pod
- `validateMIGWorkloadExecution()` - wait for completion, parse logs

This would align with the single responsibility principle and make the test flow clearer. Note that `pkg/mig/mig.go` already provides many of these helper functions (e.g., `CheckMIGCapabilityOnNodes`, `SelectMigProfile`, `SetMIGLabelsOnNodes`) which could potentially be reused here.

pkg/mig/mig.go (3)
36-42: Colorized output may not display correctly in all environments.

The ANSI color codes (lines 36-42) are used throughout via `colorLog()` (lines 610-614). As noted in a past comment, these may render as gibberish in Prow logs or other non-terminal environments that don't support ANSI escape sequences.

Optional: Make colors conditional

Consider detecting terminal capability or providing an environment variable to disable colors:

```go
var useColors = os.Getenv("NO_COLOR") == "" && isTerminal()

func colorLog(color, message string) string {
	if !useColors {
		return message
	}
	return fmt.Sprintf("%s%s%s", color, message, colorReset)
}
```

This ensures logs remain readable in all environments while preserving color support where available.
Also applies to: 610-614
711-785: Complex regex parsing in `parseMIGProfiles` requires careful testing.

The `parseMIGProfiles` function uses multiple regex patterns to parse `nvidia-smi mig -lgip` output across two lines per profile. The logic handles exclusion of `+me` profiles and maintains state between line1 and line2 parsing. While the implementation appears sound, this type of parsing is fragile and sensitive to nvidia-smi output format changes.

Recommendations for robustness

- Add unit tests with sample nvidia-smi outputs covering various GPU types and MIG configurations
- Consider adding validation that the profiles array is not empty before returning
- Log a warning if parsing fails to match expected patterns rather than silently returning an empty array

Example validation:

```go
if len(profiles) == 0 {
	glog.Warningf("No MIG profiles parsed from nvidia-smi output. Output was: %s", output)
}
return profiles
```

Please ensure comprehensive testing of `parseMIGProfiles` with various GPU hardware and nvidia-smi versions to catch potential parsing issues early.
679-707: `ExecCmdInPod` has robust validation, but consider a timeout.

The function properly validates pod and container readiness before executing commands. However, the `ExecCommand` method may hang indefinitely if the command doesn't complete. Consider whether a timeout context would be appropriate for long-running or potentially stuck commands.

Optional: Add a command timeout

For production-grade robustness, consider adding a timeout parameter:

```go
func ExecCmdInPod(apiClient *clients.Settings, podName, namespace string, command []string, timeout time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	// Use ctx in ExecCommand if the underlying implementation supports it
	...
}
```

This would prevent test hangs if nvidia-smi becomes unresponsive.
📜 Review details
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
README.md (5 hunks)
internal/get/get.go (2 hunks)
internal/wait/wait.go (3 hunks)
pkg/configmap/configmap.go (3 hunks)
pkg/mig/mig.go (1 hunks)
tests/mig/mig-test.go (1 hunks)
tests/nvidiagpu/deploy-gpu-test.go (20 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- internal/get/get.go
- tests/mig/mig-test.go
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:56.555Z
Learning: In Kubernetes with NVIDIA GPUs, MIG single strategy profile uses the resource name "nvidia.com/gpu", while MIG mixed strategy profile uses specific resource names like "nvidia.com/mig-1g.5gb" based on the MIG profile.
📚 Learning: 2025-12-15T13:26:55.741Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
Applied to files:
README.md
internal/wait/wait.go
pkg/mig/mig.go
📚 Learning: 2025-12-12T10:59:56.555Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:56.555Z
Learning: In Kubernetes with NVIDIA GPUs, MIG single strategy profile uses the resource name "nvidia.com/gpu", while MIG mixed strategy profile uses specific resource names like "nvidia.com/mig-1g.5gb" based on the MIG profile.
Applied to files:
README.md
📚 Learning: 2025-06-18T14:59:40.987Z
Learnt from: wabouhamad
Repo: rh-ecosystem-edge/nvidia-ci PR: 205
File: tests/nvidianetwork/deploy-nno-test.go:1411-1414
Timestamp: 2025-06-18T14:59:40.987Z
Learning: In SRIOV RDMA testcases in tests/nvidianetwork/deploy-nno-test.go, the string "sriov" is intentionally passed as the device parameter to CreateRdmaWorkloadPod instead of the actual mlx5 device ID. The workload container is designed to dynamically discover the correct mlx5 device for SRIOV cases at runtime.
Applied to files:
README.md
📚 Learning: 2025-07-20T08:18:03.794Z
Learnt from: TomerNewman
Repo: rh-ecosystem-edge/nvidia-ci PR: 226
File: scripts/gpu-operator-tests-must-gather.sh:1-2
Timestamp: 2025-07-20T08:18:03.794Z
Learning: In the nvidia-ci repository, must-gather scripts should use best-effort execution rather than fail-fast behavior. Unlike workflow scripts that should fail immediately on errors, diagnostic collection scripts should continue running even if one component fails, to maximize the amount of diagnostic data collected. For example, if NFD must-gather fails, the GPU must-gather should still run.
Applied to files:
README.md
📚 Learning: 2025-11-09T13:12:43.631Z
Learnt from: empovit
Repo: rh-ecosystem-edge/nvidia-ci PR: 359
File: workflows/gpu_operator_versions/update_versions.py:100-106
Timestamp: 2025-11-09T13:12:43.631Z
Learning: In the nvidia-ci repository's `workflows/gpu_operator_versions/update_versions.py`, the `gpu_releases` variable is intentionally limited to the latest 2 GPU operator versions (via `get_latest_versions(list(gpu_versions.keys()), 2)` in main()). For maintenance OCP versions with pinned GPU operators, if the pinned version is not among the latest 2 releases, the test is intentionally skipped with a warning rather than falling back to test with any available version. This is because testing with outdated GPU operators would fail in their infrastructure, and they prefer to notice missing tests rather than deal with failed test runs, especially since GitHub organization repositories don't send notification emails about failed workflows.
Applied to files:
README.md
📚 Learning: 2025-12-12T10:59:47.614Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:47.614Z
Learning: When coding Kubernetes/NVIDIA MIG handling in Go, distinguish MIG profiles by their resource names: use 'nvidia.com/gpu' for the single MIG strategy, and use the MIG-profile-specific resources such as 'nvidia.com/mig-1g.5gb' for the mixed strategy. Implement checks or constants that reference these exact resource names, and avoid assuming a single generic resource name. Consider centralizing these names (e.g., constants or a small map) and unit-test logic that validates resource-name selection for each MIG profile to prevent misclassification.
Applied to files:
internal/wait/wait.go, pkg/configmap/configmap.go, tests/nvidiagpu/deploy-gpu-test.go, pkg/mig/mig.go
📚 Learning: 2025-12-15T13:26:50.784Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:50.784Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, during cleanup of MIG labels nvidia.com/mig.strategy and nvidia.com/mig.config on GPU nodes after tests, use RemoveLabel to remove these labels instead of assigning invalid values. Only valid mig.strategy values are 'single' or 'mixed'.
Applied to files:
tests/nvidiagpu/deploy-gpu-test.go
📚 Learning: 2025-03-09T17:13:24.077Z
Learnt from: empovit
Repo: rh-ecosystem-edge/nvidia-ci PR: 81
File: workflows/nvidia_gpu_operator.py:10-33
Timestamp: 2025-03-09T17:13:24.077Z
Learning: In the nvidia-ci repository, the API interaction functions should use fail-fast behavior rather than graceful error handling. The workflow should fail explicitly if APIs change or don't return expected data (like authentication tokens or tags), as this makes problems immediately visible.
Applied to files:
pkg/mig/mig.go
🧬 Code graph analysis (2)
internal/wait/wait.go (2)
pkg/nodes/node.go (1): Pull (119-138)
pkg/nodes/list.go (1): List (20-58)
pkg/configmap/configmap.go (5)
pkg/namespace/namespace.go (1): Builder (23-32)
pkg/nfd/nodefeaturediscovery.go (1): Builder (22-32)
pkg/deployment/deployment.go (1): Builder (25-34)
pkg/nvidiagpu/clusterpolicy.go (1): Builder (23-33)
pkg/pod/pod.go (1): Builder (33-42)
🪛 markdownlint-cli2 (0.18.1)
README.md
161-161: Heading style
Expected: setext; Actual: atx
(MD003, heading-style)
167-167: Heading style
Expected: setext; Actual: atx
(MD003, heading-style)
170-170: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
171-171: Dollar signs used before commands without showing output
(MD014, commands-show-output)
172-172: Dollar signs used before commands without showing output
(MD014, commands-show-output)
173-173: Dollar signs used before commands without showing output
(MD014, commands-show-output)
174-174: Dollar signs used before commands without showing output
(MD014, commands-show-output)
175-175: Dollar signs used before commands without showing output
(MD014, commands-show-output)
176-176: Dollar signs used before commands without showing output
(MD014, commands-show-output)
177-177: Dollar signs used before commands without showing output
(MD014, commands-show-output)
178-178: Dollar signs used before commands without showing output
(MD014, commands-show-output)
179-179: Dollar signs used before commands without showing output
(MD014, commands-show-output)
180-180: Dollar signs used before commands without showing output
(MD014, commands-show-output)
186-186: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
187-187: Dollar signs used before commands without showing output
(MD014, commands-show-output)
188-188: Dollar signs used before commands without showing output
(MD014, commands-show-output)
189-189: Dollar signs used before commands without showing output
(MD014, commands-show-output)
190-190: Dollar signs used before commands without showing output
(MD014, commands-show-output)
191-191: Dollar signs used before commands without showing output
(MD014, commands-show-output)
192-192: Dollar signs used before commands without showing output
(MD014, commands-show-output)
193-193: Dollar signs used before commands without showing output
(MD014, commands-show-output)
194-194: Dollar signs used before commands without showing output
(MD014, commands-show-output)
195-195: Dollar signs used before commands without showing output
(MD014, commands-show-output)
196-196: Dollar signs used before commands without showing output
(MD014, commands-show-output)
199-199: Heading style
Expected: setext; Actual: atx
(MD003, heading-style)
203-203: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
204-204: Dollar signs used before commands without showing output
(MD014, commands-show-output)
207-207: Heading style
Expected: setext; Actual: atx
(MD003, heading-style)
210-210: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔇 Additional comments (8)
pkg/configmap/configmap.go (1)
146-177: LGTM! Clean implementation of WaitUntilDeleted.

The new `WaitUntilDeleted` method follows established patterns from similar builders (pod, namespace, deployment) and properly validates, polls, and handles errors. The use of `wait.PollUntilContextTimeout` with an `IsNotFound` check is the standard approach for resource deletion verification.

internal/wait/wait.go (2)
51-82: LGTM! ClusterPolicyNotReady properly mirrors ClusterPolicyReady.

The new function correctly polls for the "notReady" state transition, following the same pattern as the existing `ClusterPolicyReady` function. Error handling and logging are appropriate.
143-171: LGTM! NodeLabelExists implements proper label polling.

The function correctly uses `nodes.List` with a `LabelSelector` and iterates through nodes to find a matching label key/value pair. The polling logic and error handling are sound.

README.md (1)
85-206: LGTM! Comprehensive MIG testing documentation.

The new MIG testing section provides clear step-by-step instructions, environment variable documentation, and concrete examples. The documentation properly explains the workflow and aligns well with the code changes in this PR.
Note: The markdownlint warnings flagged by static analysis are minor style issues (heading style, fenced code language specification, dollar signs in commands) that don't impact readability and can be addressed separately if desired.
pkg/mig/mig.go (3)
60-191: LGTM! TestSingleMIGGPUWorkload provides clear MIG test orchestration.

The main test function properly orchestrates the MIG testing workflow with clear phases:
- MIG capability validation
- Resource cleanup
- MIG profile selection
- Node labeling and ClusterPolicy readiness
- Workload deployment and validation
The use of helper functions (`CleanupWorkloadResources`, `SelectMigProfile`, `SetMIGLabelsOnNodes`, etc.) makes the flow readable and maintainable. Error handling with `Expect` is appropriate for test code, and the defer block properly handles cleanup.
195-284: LGTM! Cleanup helper functions follow consistent patterns.

The cleanup functions (`cleanupClusterPolicy`, `cleanupCSV`, `cleanupSubscription`, `cleanupOperatorGroup`, `cleanupGPUOperatorNamespace`, `cleanupGPUBurnNamespace`) properly check for resource existence before deletion and handle errors appropriately. The pattern of Pull/Check/Delete is consistent and safe.
287-311: ShouldKeepOperator label checking logic is clear.

The function correctly retrieves the Ginkgo label filter and checks if any of the specified labels are present. The logging is verbose enough for debugging. The implementation properly handles the case where tests with specific labels (like "operator-upgrade", "gpu-burn-mig", "single-mig") should preserve the operator state.
tests/nvidiagpu/deploy-gpu-test.go (1)
1388-1388: "all-disabled" is a valid MIG configuration profile and the correct value for this test setup.

"all-disabled" is a valid and default MIG configuration for disabling MIG on nodes. This code correctly sets the MIG configuration label as part of test setup to establish a known baseline state before running tests.
Actionable comments posted: 3
🧹 Nitpick comments (1)
pkg/pod/pod.go (1)
284-314: Good improvement to context propagation.

The change on line 295 correctly uses the context provided by the polling function instead of `context.TODO()`, enabling proper cancellation and timeout handling.

However, similar context handling issues exist in other Wait methods that should be addressed for consistency:

- Line 274 in `WaitUntilInStatus`
- Line 340 in `WaitUntilCondition`

Both still use `context.TODO()` instead of the `ctx` parameter provided by their respective `wait.PollUntilContextTimeout` calls.

🔎 Proposed fix for consistent context usage

For the `WaitUntilInStatus` method:

```diff
 return wait.PollUntilContextTimeout(
 	context.TODO(), time.Second, timeout, true, func(ctx context.Context) (bool, error) {
 		updatePod, err := builder.apiClient.Pods(builder.Object.Namespace).Get(
-			context.TODO(), builder.Object.Name, metav1.GetOptions{})
+			ctx, builder.Object.Name, metav1.GetOptions{})
 		if err != nil {
 			return false, nil
 		}
```

For the `WaitUntilCondition` method:

```diff
 return wait.PollUntilContextTimeout(
 	context.TODO(), time.Second, timeout, true, func(ctx context.Context) (bool, error) {
 		updatePod, err := builder.apiClient.Pods(builder.Object.Namespace).Get(
-			context.TODO(), builder.Object.Name, metav1.GetOptions{})
+			ctx, builder.Object.Name, metav1.GetOptions{})
 		if err != nil {
 			return false, nil
 		}
```
📜 Review details
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- pkg/mig/mig.go (1 hunks)
- pkg/nvidiagpu/consts.go (2 hunks)
- pkg/pod/pod.go (1 hunks)
- tests/nvidiagpu/deploy-gpu-test.go (20 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:56.555Z
Learning: In Kubernetes with NVIDIA GPUs, MIG single strategy profile uses the resource name "nvidia.com/gpu", while MIG mixed strategy profile uses specific resource names like "nvidia.com/mig-1g.5gb" based on the MIG profile.
📚 Learning: 2025-12-15T13:26:50.784Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:50.784Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, during cleanup of MIG labels nvidia.com/mig.strategy and nvidia.com/mig.config on GPU nodes after tests, use RemoveLabel to remove these labels instead of assigning invalid values. Only valid mig.strategy values are 'single' or 'mixed'.
Applied to files:
tests/nvidiagpu/deploy-gpu-test.go
📚 Learning: 2025-12-12T10:59:47.614Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:47.614Z
Learning: When coding Kubernetes/NVIDIA MIG handling in Go, distinguish MIG profiles by their resource names: use 'nvidia.com/gpu' for the single MIG strategy, and use the MIG-profile-specific resources such as 'nvidia.com/mig-1g.5gb' for the mixed strategy. Implement checks or constants that reference these exact resource names, and avoid assuming a single generic resource name. Consider centralizing these names (e.g., constants or a small map) and unit-test logic that validates resource-name selection for each MIG profile to prevent misclassification.
Applied to files:
tests/nvidiagpu/deploy-gpu-test.go, pkg/pod/pod.go, pkg/nvidiagpu/consts.go, pkg/mig/mig.go
📚 Learning: 2025-12-15T13:26:55.741Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
Applied to files:
pkg/nvidiagpu/consts.go, pkg/mig/mig.go
📚 Learning: 2025-03-09T17:13:24.077Z
Learnt from: empovit
Repo: rh-ecosystem-edge/nvidia-ci PR: 81
File: workflows/nvidia_gpu_operator.py:10-33
Timestamp: 2025-03-09T17:13:24.077Z
Learning: In the nvidia-ci repository, the API interaction functions should use fail-fast behavior rather than graceful error handling. The workflow should fail explicitly if APIs change or don't return expected data (like authentication tokens or tags), as this makes problems immediately visible.
Applied to files:
pkg/mig/mig.go
🧬 Code graph analysis (1)
tests/nvidiagpu/deploy-gpu-test.go (1)
pkg/mig/mig.go (1)
ShouldKeepOperator (283-307)
🔇 Additional comments (6)
pkg/nvidiagpu/consts.go (1)
66-70: LGTM! Well-structured constants for MIG polling operations.

The new time constants for ClusterPolicy readiness checks and label validation are properly defined and follow the existing naming conventions in this file. The intervals (15s) and timeouts (3min) are reasonable for polling MIG-related state transitions.
tests/nvidiagpu/deploy-gpu-test.go (2)
151-162: LGTM! Proper cleanup gating logic.

The integration of `labelsToCheck` and `mig.ShouldKeepOperator` provides fine-grained control over cleanup based on test labels. This prevents premature cleanup when running operator-upgrade or MIG-related tests.
1122-1232: LGTM! Well-structured cleanup helpers.

The cleanup functions are properly decomposed and each checks for resource existence before deletion. Using `Expect` assertions is appropriate for test code to ensure cleanup failures are visible.

pkg/mig/mig.go (3)
589-596: LGTM! Proper NO_COLOR support.

The `colorLog` function correctly respects the `NO_COLOR` environment variable convention, which addresses concerns about ANSI escape codes in CI/Prow logs. When `NO_COLOR` is set, plain text is returned; otherwise, colored output is generated.
659-707: LGTM! Robust command execution with proper timeout handling.

The `ExecCmdInPod` function implements proper timeout handling using goroutines and channels, validates pod/container state before execution, and provides clear error messages. The defensive checks (pod running, container ready, non-empty output) prevent common failure modes.
709-786: LGTM! Comprehensive MIG profile parsing.

The `parseMIGProfiles` function correctly parses the two-line format from `nvidia-smi mig -lgip` output using appropriate regex patterns. The exclusion of `+me` (media extension) profiles and the handling of both lines to populate `MIGProfileInfo` structs is well-implemented.
Actionable comments posted: 4
♻️ Duplicate comments (3)
pkg/mig/mig.go (2)
397-410: Remove contradictory log message.

Line 403 logs "NOT setting or removing MIG strategy label on node" immediately after successfully setting the `nvidia.com/mig.strategy` label on lines 399-402. This contradictory message is confusing.
415-437: Remove mig.strategy label during cleanup.

`ResetMIGLabelsToDisabled` sets `nvidia.com/mig.config` to "all-disabled" but doesn't remove the `nvidia.com/mig.strategy` label. Based on learnings, during cleanup you should remove the `mig.strategy` label entirely since the only valid values are "single" or "mixed".

tests/nvidiagpu/deploy-gpu-test.go (1)
1376-1402: Remove mig.strategy label during cleanup.

The cleanup defer sets `nvidia.com/mig.config` to "all-disabled" but doesn't remove the `nvidia.com/mig.strategy` label. Based on learnings, during cleanup you should remove the `mig.strategy` label entirely using the `RemoveLabel` method.
🧹 Nitpick comments (7)
pkg/mig/mig.go (3)
583-596: Remove commented-out code.

The commented-out `colorLog` function (lines 585-587) should be removed since there's a working replacement below it.

🔎 Suggested fix

```diff
 // colorLog returns a colored log message with ANSI escape codes
 // Usage: glog.V(level).Infof("%s", colorLog(colorCyan+colorBold, "Your message"))
-// func colorLog(color, message string) string {
-//	return fmt.Sprintf("%s%s%s", color, message, colorReset)
-// }

 var useColors = os.Getenv("NO_COLOR") == ""

 func colorLog(color, message string) string {
```
692-706: Potential goroutine leak on timeout.

When the context times out (lines 698-699), the goroutine spawned at line 692 continues running indefinitely since `podBuilder.ExecCommand` cannot be cancelled. The goroutine will eventually complete, but the result is discarded. This is a minor resource concern, though acceptable given the test context.

Consider adding a comment documenting this behavior, or using a buffered channel to ensure the goroutine can always send its result:

```go
// Note: On timeout, the spawned goroutine continues until ExecCommand completes,
// but its result is discarded. This is acceptable in test contexts.
```
722-724: Minor: Remove redundant type declaration.

The explicit `bool` type on line 723 is redundant since it's inferred from the initialization value.

🔎 Suggested fix

```diff
 excludeRegex := regexp.MustCompile(`\|\s+\d+\s+MIG\s+\d+g\.\d+gb\+me`)
 flavor := "gpu"
-var exclude bool = true
+exclude := true
```

tests/nvidiagpu/deploy-gpu-test.go (4)
1137-1193: Cleanup functions will fail if resources don't exist.

Unlike the cleanup functions in `pkg/mig/mig.go` (which check `if err == nil && builder.Exists()` before deleting), these cleanup functions use `Expect(err).ToNot(HaveOccurred())` after pulling resources, causing the test to fail if resources don't exist. This is fragile for cleanup scenarios where resources may have already been deleted or never created.

Consider adding existence checks or using the same pattern as `pkg/mig/mig.go`:

🔎 Suggested fix for cleanupClusterPolicy

```diff
 func cleanupClusterPolicy() {
 	By("Deleting ClusterPolicy")
-	// Since this is out of defer functions, the ClusterPolicy need to be pulled before deleting it.
 	clusterPolicyBuilder, err := nvidiagpu.Pull(inittools.APIClient, nvidiagpu.ClusterPolicyName)
-	Expect(err).ToNot(HaveOccurred(), "Error pulling ClusterPolicy: %v", err)
-	_, err = clusterPolicyBuilder.Delete()
-	Expect(err).ToNot(HaveOccurred(), "Error deleting ClusterPolicy: %v", err)
-	glog.V(gpuparams.GpuLogLevel).Infof("ClusterPolicy deleted successfully")
+	if err == nil && clusterPolicyBuilder.Exists() {
+		_, err = clusterPolicyBuilder.Delete()
+		Expect(err).ToNot(HaveOccurred(), "Error deleting ClusterPolicy: %v", err)
+		glog.V(gpuparams.GpuLogLevel).Infof("ClusterPolicy deleted successfully")
+	} else {
+		glog.V(gpuparams.GpuLogLevel).Infof("ClusterPolicy not found or already deleted")
+	}
 }
```

Apply a similar pattern to `cleanupSubscription`, `cleanupOperatorGroup`, and `cleanupGPUOperatorNamespace`.
1316-1354: Code duplication with pkg/mig/mig.go:SetMIGLabelsOnNodes.

This node labeling logic (lines 1326-1354) duplicates functionality from `SetMIGLabelsOnNodes` in `pkg/mig/mig.go`. Consider refactoring to use the shared function from the mig package, or consolidating the improved logic (which checks existing label values) into the mig package and reusing it here.

The logic here is actually more thorough (checking if the label exists with the correct value before updating), so consider enhancing `mig.SetMIGLabelsOnNodes` with this logic and calling it from both places.
1508-1516: Consider placing sleep after error check.

The `time.Sleep` on line 1513 runs before the `Expect` check on line 1514. If deletion fails, the test will wait 15 seconds before the assertion fails. Consider moving the sleep after the assertion for faster failure detection.

🔎 Suggested fix

```diff
 defer func() {
 	defer GinkgoRecover()
 	glog.V(gpuparams.Gpu100LogLevel).Infof("defer11 (configmapBuilder) sleeping for 15 seconds")
 	if cleanupAfterTest {
 		err := configmapBuilder.Delete()
-		time.Sleep(time.Second * 15)
 		Expect(err).ToNot(HaveOccurred())
+		time.Sleep(time.Second * 15)
 	}
 }()
```
1577-1591: Remove commented-out code.

Lines 1582 and 1586 contain commented-out code that should be removed for clarity.

🔎 Suggested fix

```diff
 // Need to add checking for other possible GPU's
 By("Parse the gpu-burn pod logs and check for successful execution with MIG")
-// migCapabilities[useMigIndex].Available
 for i := 0; i < migCapabilities[useMigIndex].Total; i++ {
 	match1Mig := strings.Contains(gpuBurnMigLogs, fmt.Sprintf("GPU %d: OK", i))
-	// match2Mig := strings.Contains(gpuBurnMigLogs, fmt.Sprintf("100.0% proc'd: %d", i))
 	glog.V(gpuparams.Gpu10LogLevel).Infof("Checking if GPU %d: OK is present in logs: %v", i, match1Mig)
 	Expect(match1Mig).ToNot(BeFalse(), "gpu-burn pod execution with MIG was FAILED")
 }
-// match1Mig := strings.Contains(gpuBurnMigLogs, "GPU 0: OK")
 match2Mig := strings.Contains(gpuBurnMigLogs, "100.0% proc'd:")
```
📜 Review details
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- pkg/mig/mig.go (1 hunks)
- tests/nvidiagpu/deploy-gpu-test.go (20 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:56.555Z
Learning: In Kubernetes with NVIDIA GPUs, MIG single strategy profile uses the resource name "nvidia.com/gpu", while MIG mixed strategy profile uses specific resource names like "nvidia.com/mig-1g.5gb" based on the MIG profile.
📚 Learning: 2025-12-15T13:26:50.784Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:50.784Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, during cleanup of MIG labels nvidia.com/mig.strategy and nvidia.com/mig.config on GPU nodes after tests, use RemoveLabel to remove these labels instead of assigning invalid values. Only valid mig.strategy values are 'single' or 'mixed'.
Applied to files:
tests/nvidiagpu/deploy-gpu-test.go
📚 Learning: 2025-12-12T10:59:47.614Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:47.614Z
Learning: When coding Kubernetes/NVIDIA MIG handling in Go, distinguish MIG profiles by their resource names: use 'nvidia.com/gpu' for the single MIG strategy, and use the MIG-profile-specific resources such as 'nvidia.com/mig-1g.5gb' for the mixed strategy. Implement checks or constants that reference these exact resource names, and avoid assuming a single generic resource name. Consider centralizing these names (e.g., constants or a small map) and unit-test logic that validates resource-name selection for each MIG profile to prevent misclassification.
Applied to files:
tests/nvidiagpu/deploy-gpu-test.go, pkg/mig/mig.go
📚 Learning: 2025-12-15T13:26:55.741Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
Applied to files:
pkg/mig/mig.go
📚 Learning: 2025-03-09T17:13:24.077Z
Learnt from: empovit
Repo: rh-ecosystem-edge/nvidia-ci PR: 81
File: workflows/nvidia_gpu_operator.py:10-33
Timestamp: 2025-03-09T17:13:24.077Z
Learning: In the nvidia-ci repository, the API interaction functions should use fail-fast behavior rather than graceful error handling. The workflow should fail explicitly if APIs change or don't return expected data (like authentication tokens or tags), as this makes problems immediately visible.
Applied to files:
pkg/mig/mig.go
🧬 Code graph analysis (2)
tests/nvidiagpu/deploy-gpu-test.go (8)
pkg/mig/mig.go (2): ShouldKeepOperator (283-307), MIGProfiles (601-636)
internal/get/get.go (1): GetFirstPodNameWithLabel (56-71)
internal/networkparams/const.go (1): Label (5-5)
tests/nvidianetwork/deploy-nno-test.go (1): WorkerNodeSelector (41-44)
internal/gpuparams/gpuvars.go (1): Labels (5-5)
internal/networkparams/networkvars.go (1): Labels (5-5)
internal/tsparams/gpu-vars.go (1): Labels (11-11)
internal/wait/wait.go (2): ClusterPolicyNotReady (52-82), ClusterPolicyReady (20-49)
pkg/mig/mig.go (8)
internal/wait/wait.go (1): NodeLabelExists (144-171)
pkg/configmap/configmap.go (3): Pull (36-69), NewBuilder (72-103), Builder (21-30)
pkg/nodes/node.go (2): Pull (119-138), Builder (30-36)
pkg/pod/pod.go (3): Pull (91-130), NewBuilder (48-88), Builder (33-42)
vendor/golang.org/x/net/html/atom/table.go (2): Object (190-190), Name (181-181)
internal/gpu-burn/gpu-burn.go (1): CreateGPUBurnConfigMap (38-59)
pkg/olm/operatorgroup.go (1): PullOperatorGroup (142-176)
internal/get/get.go (1): GetClusterArchitecture (74-101)
🔇 Additional comments (3)
pkg/mig/mig.go (2)
189-208: LGTM!

The cleanup orchestration is well-structured with individual helper functions for each resource type, making the code modular and maintainable.
498-532: LGTM!

The `DeployGPUWorkload` function properly creates the MIG-configured pod, creates it via the API client, and retrieves it for further operations. Good separation of concerns.

tests/nvidiagpu/deploy-gpu-test.go (1)
1111-1119: LGTM!

The two MIG test cases provide good flexibility - one for in-file testing and one delegating to the mig package. The cleanup flag is correctly computed for the package-level test.
```go
AfterAll(func() {
	glog.V(gpuparams.Gpu10LogLevel).Infof("cleanup in AfterAll")
	if nfdInstance.CleanupAfterInstall && cleanupAfterTest {
		err := nfd.Cleanup(inittools.APIClient)
		Expect(err).ToNot(HaveOccurred(), "Error cleaning up NFD resources: %v", err)
	}
	// Cleanup GPU Operator Resources, if requested
	if cleanupAfterTest {
		cleanupGPUOperatorResources()
	}
})
```
Inconsistent cleanup logic: AfterAll doesn't check ShouldKeepOperator.
The defer blocks throughout the test check cleanupAfterTest && !mig.ShouldKeepOperator(labelsToCheck) before cleanup, but AfterAll at line 270 only checks cleanupAfterTest. This could cause the GPU Operator resources to be cleaned up even when specific labels indicate they should be kept.
🔎 Suggested fix
AfterAll(func() {
glog.V(gpuparams.Gpu10LogLevel).Infof("cleanup in AfterAll")
if nfdInstance.CleanupAfterInstall && cleanupAfterTest {
err := nfd.Cleanup(inittools.APIClient)
Expect(err).ToNot(HaveOccurred(), "Error cleaning up NFD resources: %v", err)
}
// Cleanup GPU Operator Resources, if requested
- if cleanupAfterTest {
+ if cleanupAfterTest && !mig.ShouldKeepOperator(labelsToCheck) {
cleanupGPUOperatorResources()
}
})🤖 Prompt for AI Agents
In tests/nvidiagpu/deploy-gpu-test.go around lines 262 to 272, the AfterAll
block only checks cleanupAfterTest before calling cleanupGPUOperatorResources(),
which is inconsistent with other defer blocks that also skip cleanup when
mig.ShouldKeepOperator(labelsToCheck) is true; update the AfterAll logic to
mirror the defers by calling cleanupGPUOperatorResources() only when
cleanupAfterTest && !mig.ShouldKeepOperator(labelsToCheck) so operator resources
are preserved when labels indicate they should be kept.
This is the desired outcome. cleanupAfterTest is set based on environment variable and it supersedes any other setting in the very end of the execution
@vtruhpon, thank you for the clarification! I understand now that this is an intentional design decision. The cleanupAfterTest environment variable is meant to be the final authority in the AfterAll block, superseding the label-based logic that applies during test execution. This makes sense for ensuring a clean environment at the very end when explicitly requested via the environment variable.
I've noted this design pattern for future reviews. Thanks for explaining the rationale!
✏️ Learnings added
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:262-272
Timestamp: 2025-12-21T00:43:32.955Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, the AfterAll block intentionally uses only cleanupAfterTest (without checking mig.ShouldKeepOperator) when calling cleanupGPUOperatorResources(). This is by design: the cleanupAfterTest environment variable (NVIDIAGPU_CLEANUP) supersedes label-based cleanup logic at the very end of test execution, while defer blocks during test execution check both cleanupAfterTest && !mig.ShouldKeepOperator(labelsToCheck).
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:56.555Z
Learning: In Kubernetes with NVIDIA GPUs, MIG single strategy profile uses the resource name "nvidia.com/gpu", while MIG mixed strategy profile uses specific resource names like "nvidia.com/mig-1g.5gb" based on the MIG profile.
Actionable comments posted: 0
♻️ Duplicate comments (1)
tests/nvidiagpu/deploy-gpu-test.go (1)
1384-1410: Remove the `mig.strategy` label during cleanup. The cleanup defer block sets `nvidia.com/mig.config` to `"all-disabled"` but does not remove the `nvidia.com/mig.strategy` label. Line 1390 logs "NOT setting or removing MIG strategy label", which describes the current behavior but may not be the desired cleanup pattern. Based on learnings, during cleanup the `mig.strategy` label should be removed entirely, since the only valid values are `"single"` or `"mixed"`. Consider adding a `RemoveLabel` call for `nvidia.com/mig.strategy` before setting `mig.config`.

🔎 Suggested fix

```diff
 defer func() {
 	defer GinkgoRecover()
 	glog.V(gpuparams.Gpu100LogLevel).Infof("Setting MIG strategy label on GPU worker nodes")
 	nodeBuilders, err := nodes.List(inittools.APIClient, metav1.ListOptions{LabelSelector: labels.Set(WorkerNodeSelector).String()})
 	Expect(err).ToNot(HaveOccurred(), "Error listing worker nodes: %v", err)
 	for _, nodeBuilder := range nodeBuilders {
-		glog.V(gpuparams.Gpu100LogLevel).Infof("NOT setting or removing MIG strategy label on node '%s')", nodeBuilder.Definition.Name)
+		glog.V(gpuparams.Gpu100LogLevel).Infof("Removing MIG strategy label from node '%s'", nodeBuilder.Definition.Name)
+		nodeBuilder = nodeBuilder.RemoveLabel("nvidia.com/mig.strategy", "")
+		_, err = nodeBuilder.Update()
+		Expect(err).ToNot(HaveOccurred(), "Error removing MIG strategy label from node '%s': %v", nodeBuilder.Definition.Name, err)
 		glog.V(gpuparams.Gpu100LogLevel).Infof("Setting MIG configuration label on node '%s' (overwrite=true)", nodeBuilder.Definition.Name)
 		nodeBuilder = nodeBuilder.WithLabel("nvidia.com/mig.config", "all-disabled")
```

Based on learnings, use `RemoveLabel` for `mig.strategy` during cleanup.
🧹 Nitpick comments (1)
pkg/mig/mig.go (1)
38-44: Optional: fix a typo in the color constants comment. Line 39 writes `/033[31m` where it should be `\033[31m` (backslash instead of forward slash) for consistency with the actual ANSI escape codes used.

🔎 Suggested fix

```diff
 // ANSI color constants for console output highlighting
-// colors are /033[31m - red through /033[37m - white
+// colors are \033[31m - red through \033[37m - white
 const (
 	colorReset = "\033[0m"
 	colorCyan  = "\033[36m"
 	colorBold  = "\033[1m"
 )
```
📜 Review details
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- pkg/mig/mig.go (1 hunks)
- tests/nvidiagpu/deploy-gpu-test.go (20 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:56.555Z
Learning: In Kubernetes with NVIDIA GPUs, MIG single strategy profile uses the resource name "nvidia.com/gpu", while MIG mixed strategy profile uses specific resource names like "nvidia.com/mig-1g.5gb" based on the MIG profile.
📚 Learning: 2025-12-15T13:26:50.784Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:50.784Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, during cleanup of MIG labels nvidia.com/mig.strategy and nvidia.com/mig.config on GPU nodes after tests, use RemoveLabel to remove these labels instead of assigning invalid values. Only valid mig.strategy values are 'single' or 'mixed'.
Applied to files:
tests/nvidiagpu/deploy-gpu-test.go
📚 Learning: 2025-12-21T00:43:32.955Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:262-272
Timestamp: 2025-12-21T00:43:32.955Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, the AfterAll block intentionally uses only cleanupAfterTest (without checking mig.ShouldKeepOperator) when calling cleanupGPUOperatorResources(). This is by design: the cleanupAfterTest environment variable (NVIDIAGPU_CLEANUP) supersedes label-based cleanup logic at the very end of test execution, while defer blocks during test execution check both cleanupAfterTest && !mig.ShouldKeepOperator(labelsToCheck).
Applied to files:
- tests/nvidiagpu/deploy-gpu-test.go
- pkg/mig/mig.go
📚 Learning: 2025-12-12T10:59:47.614Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:47.614Z
Learning: When coding Kubernetes/NVIDIA MIG handling in Go, distinguish MIG profiles by their resource names: use 'nvidia.com/gpu' for the single MIG strategy, and use the MIG-profile-specific resources such as 'nvidia.com/mig-1g.5gb' for the mixed strategy. Implement checks or constants that reference these exact resource names, and avoid assuming a single generic resource name. Consider centralizing these names (e.g., constants or a small map) and unit-test logic that validates resource-name selection for each MIG profile to prevent misclassification.
Applied to files:
- tests/nvidiagpu/deploy-gpu-test.go
- pkg/mig/mig.go
📚 Learning: 2025-12-15T13:26:55.741Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
Applied to files:
pkg/mig/mig.go
📚 Learning: 2025-03-09T17:13:24.077Z
Learnt from: empovit
Repo: rh-ecosystem-edge/nvidia-ci PR: 81
File: workflows/nvidia_gpu_operator.py:10-33
Timestamp: 2025-03-09T17:13:24.077Z
Learning: In the nvidia-ci repository, the API interaction functions should use fail-fast behavior rather than graceful error handling. The workflow should fail explicitly if APIs change or don't return expected data (like authentication tokens or tags), as this makes problems immediately visible.
Applied to files:
pkg/mig/mig.go
🧬 Code graph analysis (1)
tests/nvidiagpu/deploy-gpu-test.go (5)
- pkg/mig/mig.go (1): ShouldKeepOperator (282-306)
- pkg/nodes/node.go (2): Pull (119-138), Builder (30-36)
- pkg/nvidiagpu/clusterpolicy.go (2): Pull (127-152), Builder (23-33)
- internal/get/get.go (1): GetFirstPodNameWithLabel (56-71)
- internal/wait/wait.go (1): NodeLabelExists (144-171)
🔇 Additional comments (1)
pkg/mig/mig.go (1)
413-436: No changes needed. The function correctly sets `nvidia.com/mig.config` to `"all-disabled"`, which is the valid and official way to disable MIG on nodes per NVIDIA GPU Operator documentation. The `mig.strategy` label management is handled at the ClusterPolicy level rather than via node labels, so the function's approach is consistent with NVIDIA's recommended patterns.
Thank you @vtruhpon! That's a lot of hard work.

/lgtm

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: empovit, vtruhpon

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing

/hold Looks like the PR needs rebasing. Also, let's squash the commits.
Actionable comments posted: 0
🧹 Nitpick comments (2)
internal/gpu-burn/gpu-burn.go (1)
61-63: Remove the unused `timeout` parameter. The `timeout` parameter is declared but never used in the function body. This matches `CreateGPUBurnPod`, which also doesn't use its timeout parameter, but carrying the pattern forward adds confusion.

♻️ Proposed fix

```diff
-func CreateGPUBurnPodWithMIG(apiClient *clients.Settings, podName, podNamespace string,
-	gpuBurnImage string, migProfile string, migCount int, timeout time.Duration) (*corev1.Pod, error) {
+func CreateGPUBurnPodWithMIG(apiClient *clients.Settings, podName, podNamespace string,
+	gpuBurnImage string, migProfile string, migCount int) (*corev1.Pod, error) {
```

README.md (1)
162-207: Optional markdown formatting improvements. Static analysis flagged several minor formatting issues in the new MIG testing section:
- Lines 162, 168, 200, 208: Heading style inconsistency (use setext style for consistency with rest of doc)
- Lines 171-181, 187-197, 204-205: Missing language specifiers in code blocks and dollar signs in commands
These are purely cosmetic and don't affect functionality. Consider addressing them if you're updating the formatting across the entire README.
📜 Review details
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
- README.md
- internal/get/get.go
- internal/gpu-burn/gpu-burn.go
- internal/gpuparams/const.go
- internal/nvidiagpuconfig/config.go
- internal/tsparams/mig-vars.go
- internal/wait/wait.go
- pkg/configmap/configmap.go
- pkg/mig/mig.go
- pkg/nodes/node.go
- pkg/nvidiagpu/consts.go
- pkg/pod/pod.go
- tests/mig/mig-test.go
- tests/mig/mig_suite_test.go
- tests/nvidiagpu/deploy-gpu-test.go
✅ Files skipped from review due to trivial changes (1)
- pkg/pod/pod.go
🚧 Files skipped from review as they are similar to previous changes (4)
- internal/get/get.go
- pkg/nvidiagpu/consts.go
- internal/gpuparams/const.go
- pkg/nodes/node.go
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:56.555Z
Learning: In Kubernetes with NVIDIA GPUs, MIG single strategy profile uses the resource name "nvidia.com/gpu", while MIG mixed strategy profile uses specific resource names like "nvidia.com/mig-1g.5gb" based on the MIG profile.
📚 Learning: 2025-12-12T10:59:47.614Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:47.614Z
Learning: When coding Kubernetes/NVIDIA MIG handling in Go, distinguish MIG profiles by their resource names: use 'nvidia.com/gpu' for the single MIG strategy, and use the MIG-profile-specific resources such as 'nvidia.com/mig-1g.5gb' for the mixed strategy. Implement checks or constants that reference these exact resource names, and avoid assuming a single generic resource name. Consider centralizing these names (e.g., constants or a small map) and unit-test logic that validates resource-name selection for each MIG profile to prevent misclassification.
Applied to files:
- pkg/configmap/configmap.go
- internal/nvidiagpuconfig/config.go
- internal/gpu-burn/gpu-burn.go
- internal/wait/wait.go
- tests/mig/mig-test.go
- tests/mig/mig_suite_test.go
- internal/tsparams/mig-vars.go
- tests/nvidiagpu/deploy-gpu-test.go
- pkg/mig/mig.go
📚 Learning: 2025-12-15T13:26:55.741Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
Applied to files:
- internal/nvidiagpuconfig/config.go
- internal/gpu-burn/gpu-burn.go
- internal/wait/wait.go
- tests/mig/mig-test.go
- tests/mig/mig_suite_test.go
- internal/tsparams/mig-vars.go
- README.md
- pkg/mig/mig.go
📚 Learning: 2025-12-21T00:43:37.494Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:262-272
Timestamp: 2025-12-21T00:43:37.494Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, the AfterAll block intentionally uses only cleanupAfterTest (without checking mig.ShouldKeepOperator) when calling cleanupGPUOperatorResources(). This is by design: the cleanupAfterTest environment variable (NVIDIAGPU_CLEANUP) supersedes label-based cleanup logic at the very end of test execution, while defer blocks during test execution check both cleanupAfterTest && !mig.ShouldKeepOperator(labelsToCheck).
Applied to files:
- tests/mig/mig-test.go
- tests/mig/mig_suite_test.go
- internal/tsparams/mig-vars.go
- README.md
- tests/nvidiagpu/deploy-gpu-test.go
- pkg/mig/mig.go
📚 Learning: 2025-06-18T14:59:40.987Z
Learnt from: wabouhamad
Repo: rh-ecosystem-edge/nvidia-ci PR: 205
File: tests/nvidianetwork/deploy-nno-test.go:1411-1414
Timestamp: 2025-06-18T14:59:40.987Z
Learning: In SRIOV RDMA testcases in tests/nvidianetwork/deploy-nno-test.go, the string "sriov" is intentionally passed as the device parameter to CreateRdmaWorkloadPod instead of the actual mlx5 device ID. The workload container is designed to dynamically discover the correct mlx5 device for SRIOV cases at runtime.
Applied to files:
- tests/mig/mig-test.go
- README.md
📚 Learning: 2025-12-12T10:59:56.555Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:56.555Z
Learning: In Kubernetes with NVIDIA GPUs, MIG single strategy profile uses the resource name "nvidia.com/gpu", while MIG mixed strategy profile uses specific resource names like "nvidia.com/mig-1g.5gb" based on the MIG profile.
Applied to files:
README.md
📚 Learning: 2025-07-20T08:18:03.794Z
Learnt from: TomerNewman
Repo: rh-ecosystem-edge/nvidia-ci PR: 226
File: scripts/gpu-operator-tests-must-gather.sh:1-2
Timestamp: 2025-07-20T08:18:03.794Z
Learning: In the nvidia-ci repository, must-gather scripts should use best-effort execution rather than fail-fast behavior. Unlike workflow scripts that should fail immediately on errors, diagnostic collection scripts should continue running even if one component fails, to maximize the amount of diagnostic data collected. For example, if NFD must-gather fails, the GPU must-gather should still run.
Applied to files:
README.md
📚 Learning: 2025-11-09T13:12:43.631Z
Learnt from: empovit
Repo: rh-ecosystem-edge/nvidia-ci PR: 359
File: workflows/gpu_operator_versions/update_versions.py:100-106
Timestamp: 2025-11-09T13:12:43.631Z
Learning: In the nvidia-ci repository's `workflows/gpu_operator_versions/update_versions.py`, the `gpu_releases` variable is intentionally limited to the latest 2 GPU operator versions (via `get_latest_versions(list(gpu_versions.keys()), 2)` in main()). For maintenance OCP versions with pinned GPU operators, if the pinned version is not among the latest 2 releases, the test is intentionally skipped with a warning rather than falling back to test with any available version. This is because testing with outdated GPU operators would fail in their infrastructure, and they prefer to notice missing tests rather than deal with failed test runs, especially since GitHub organization repositories don't send notification emails about failed workflows.
Applied to files:
README.md
📚 Learning: 2025-12-15T13:26:50.784Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:50.784Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, during cleanup of MIG labels nvidia.com/mig.strategy and nvidia.com/mig.config on GPU nodes after tests, use RemoveLabel to remove these labels instead of assigning invalid values. Only valid mig.strategy values are 'single' or 'mixed'.
Applied to files:
tests/nvidiagpu/deploy-gpu-test.go
📚 Learning: 2025-03-09T17:13:24.077Z
Learnt from: empovit
Repo: rh-ecosystem-edge/nvidia-ci PR: 81
File: workflows/nvidia_gpu_operator.py:10-33
Timestamp: 2025-03-09T17:13:24.077Z
Learning: In the nvidia-ci repository, the API interaction functions should use fail-fast behavior rather than graceful error handling. The workflow should fail explicitly if APIs change or don't return expected data (like authentication tokens or tags), as this makes problems immediately visible.
Applied to files:
pkg/mig/mig.go
🧬 Code graph analysis (7)
pkg/configmap/configmap.go (2)
- pkg/pod/pod.go (1): Builder (33-42)
- pkg/namespace/namespace.go (1): Builder (23-32)

internal/wait/wait.go (3)
- pkg/nodes/node.go (1): Pull (119-138)
- pkg/nvidiagpu/clusterpolicy.go (1): Pull (127-152)
- pkg/nodes/list.go (1): List (20-58)

tests/mig/mig-test.go (9)
- pkg/operatorconfig/config.go (2): NewCustomConfig (18-27), CustomConfig (7-14)
- pkg/nvidiagpu/gpuBurnConfig.go (1): NewDefaultGPUBurnConfig (15-22)
- pkg/nvidiagpu/consts.go (1): NvidiaGPULabel (8-8)
- pkg/global/const.go (2): UndefinedValue (5-5), OpenShiftVersionFile (7-7)
- internal/gpuparams/const.go (4): Label (5-5), Gpu10LogLevel (9-9), Gpu100LogLevel (11-11), GpuLogLevel (8-8)
- internal/tsparams/consts.go (1): LabelSuite (5-5)
- internal/inittools/inittools.go (1): GetOpenShiftVersion (47-69)
- tests/nvidiagpu/deploy-gpu-test.go (1): BurnImageName (62-65)
- pkg/nfd/utils.go (1): EnsureNFDIsInstalled (27-105)

tests/mig/mig_suite_test.go (3)
- internal/reporter/reporter.go (1): ReportIfFailed (54-90)
- internal/tsparams/mig-vars.go (2): MigReporterNamespacesToDump (14-18), MigReporterCRDsToDump (21-23)
- pkg/clients/clients.go (1): SetScheme (130-168)

internal/tsparams/mig-vars.go (1)
- internal/tsparams/consts.go (1): LabelSuite (5-5)

tests/nvidiagpu/deploy-gpu-test.go (3)
- pkg/mig/mig.go (1): ShouldKeepOperator (282-306)
- internal/gpuparams/const.go (1): Label (5-5)
- internal/inittools/inittools.go (1): GeneralConfig (22-22)

pkg/mig/mig.go (4)
- internal/nvidiagpuconfig/config.go (1): NvidiaGPUConfig (9-21)
- internal/wait/wait.go (2): ClusterPolicyNotReady (52-82), ClusterPolicyReady (20-49)
- internal/gpu-burn/gpu-burn.go (1): CreateGPUBurnConfigMap (38-59)
- internal/get/get.go (1): GetClusterArchitecture (74-101)
🪛 LanguageTool
README.md
[uncategorized] ~68-~68: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...t - optional - TEST_TRACE: includes full stack trace from ginkgo tests when a failure ...
(EN_COMPOUND_ADJECTIVE_INTERNAL)
🪛 markdownlint-cli2 (0.18.1)
README.md
162-162: Heading style
Expected: setext; Actual: atx
(MD003, heading-style)
168-168: Heading style
Expected: setext; Actual: atx
(MD003, heading-style)
171-171: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
172-172: Dollar signs used before commands without showing output
(MD014, commands-show-output)
173-173: Dollar signs used before commands without showing output
(MD014, commands-show-output)
174-174: Dollar signs used before commands without showing output
(MD014, commands-show-output)
175-175: Dollar signs used before commands without showing output
(MD014, commands-show-output)
176-176: Dollar signs used before commands without showing output
(MD014, commands-show-output)
177-177: Dollar signs used before commands without showing output
(MD014, commands-show-output)
178-178: Dollar signs used before commands without showing output
(MD014, commands-show-output)
179-179: Dollar signs used before commands without showing output
(MD014, commands-show-output)
180-180: Dollar signs used before commands without showing output
(MD014, commands-show-output)
181-181: Dollar signs used before commands without showing output
(MD014, commands-show-output)
187-187: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
188-188: Dollar signs used before commands without showing output
(MD014, commands-show-output)
189-189: Dollar signs used before commands without showing output
(MD014, commands-show-output)
190-190: Dollar signs used before commands without showing output
(MD014, commands-show-output)
191-191: Dollar signs used before commands without showing output
(MD014, commands-show-output)
192-192: Dollar signs used before commands without showing output
(MD014, commands-show-output)
193-193: Dollar signs used before commands without showing output
(MD014, commands-show-output)
194-194: Dollar signs used before commands without showing output
(MD014, commands-show-output)
195-195: Dollar signs used before commands without showing output
(MD014, commands-show-output)
196-196: Dollar signs used before commands without showing output
(MD014, commands-show-output)
197-197: Dollar signs used before commands without showing output
(MD014, commands-show-output)
200-200: Heading style
Expected: setext; Actual: atx
(MD003, heading-style)
204-204: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
205-205: Dollar signs used before commands without showing output
(MD014, commands-show-output)
208-208: Heading style
Expected: setext; Actual: atx
(MD003, heading-style)
211-211: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔇 Additional comments (10)
internal/nvidiagpuconfig/config.go (1)
19-20: LGTM! The new MIG configuration fields follow the existing pattern and are properly tagged for environment variable configuration.
internal/gpu-burn/gpu-burn.go (2)
70-73: LGTM! The comment and resource name construction correctly distinguish between single strategy MIGs (using "gpu" → "nvidia.com/gpu") and mixed strategy MIGs (using profile names like "mig-1g.5gb" → "nvidia.com/mig-1g.5gb").
Based on learnings, this correctly implements the MIG resource naming convention.
75-147: LGTM! The Pod specification is well-configured with:
- Proper security context (non-root, seccomp, dropped capabilities)
- Correct MIG resource requests/limits using the constructed resource name
- Appropriate tolerations for GPU taints
- Valid ConfigMap volume mount for the entrypoint script
internal/wait/wait.go (2)
51-82: LGTM! The `ClusterPolicyNotReady` function correctly mirrors `ClusterPolicyReady` and properly checks for the "notReady" state, with appropriate error handling and logging.
143-171: LGTM! The `NodeLabelExists` function properly polls for the specified label on nodes matching the selector, with correct error handling and appropriate logging at different verbosity levels.

pkg/configmap/configmap.go (1)
146-177: LGTM! The `WaitUntilDeleted` method correctly implements polling until the ConfigMap is deleted, with proper error handling for NotFound (the success case) and other errors (propagated to the caller).
1-31: LGTM!The MIG test suite follows the standard Ginkgo/Gomega pattern with proper:
- Reporter configuration using inittools.GeneralConfig
- Suite labeling with tsparams.Labels
- Failed test reporting in JustAfterEach hook
tests/mig/mig-test.go (1)
46-89: LGTM! The test suite structure is well-organized with proper lifecycle hooks. The nil check for `nvidiaGPUConfig` and its initialization in `BeforeAll` address previous review concerns. The cleanup logic correctly uses both `cleanupAfterTest` and `CleanupGPUOperatorResources` for proper resource management.

internal/tsparams/mig-vars.go (1)
1-24: LGTM! Clean variable declarations for MIG test configuration and reporting. The structure is consistent with existing patterns in the tsparams package.
pkg/mig/mig.go (1)
57-186: LGTM! The `TestSingleMIGGPUWorkload` function is well-structured with clear separation of concerns. It properly:
- Validates MIG capability
- Cleans up previous resources
- Selects and configures MIG profiles
- Waits for cluster policy transitions
- Deploys and validates GPU workload
- Includes appropriate cleanup defers
The implementation addresses previous review concerns regarding error handling in defer blocks and format string arguments.
- Added MIG testcase implementation in tests/mig/
- Created mig package (pkg/mig/mig.go) with MIG-related functions
- Added MIG variables to internal/tsparams/mig-vars.go
- Updated README.md with MIG testcase documentation
- Enhanced configmap package with WaitUntilDeleted method
- Added wait.ClusterPolicyNotReady and wait.NodeLabelExists functions
- Moved timeout and interval constants to pkg/nvidiagpu/consts.go
- Improved error handling and code cleanup throughout
- Updated test labels to use 'single-mig' for MIG testcase execution

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
NO_COLOR parameter behavior change (earlier, NO_COLOR=false would act the same as NO_COLOR=true)

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
/lgtm

/unhold

/test

@empovit: The Use

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

/test 4.20-stable-nvidia-gpu-operator-e2e-25-10-x

@vtruhpon: The following test failed, say

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

…failing if it is missing or false. Implementation is done for both gpu-burn-mig and single-mig.

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED

New changes are detected. LGTM label has been removed.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @pkg/mig/mig.go:
- Around line 422-445: ResetMIGLabelsToDisabled currently sets
"nvidia.com/mig.config" to "all-disabled" but leaves stale
"nvidia.com/mig.strategy" labels; update the loop in ResetMIGLabelsToDisabled to
also remove the strategy label by calling the node builder's RemoveLabel (e.g.,
nodeBuilder = nodeBuilder.RemoveLabel("nvidia.com/mig.strategy")) before calling
Update(), keep the existing Expect(err)... checks and logging around the
Update(), and ensure the removal happens for each node so the operator can
reapply the correct strategy later.
In @tests/nvidiagpu/deploy-gpu-test.go:
- Around line 1386-1412: The cleanup defer block currently sets the
"nvidia.com/mig.config" label to "all-disabled" but leaves any existing
"nvidia.com/mig.strategy" label in place; update the defer cleanup (the
anonymous defer containing nodeBuilders loop and calls to nodeBuilder.WithLabel
and nodeBuilder.Update) to first call
nodeBuilder.RemoveLabel("nvidia.com/mig.strategy", "") (or the appropriate
RemoveLabel signature) for each nodeBuilder, then proceed to set
"nvidia.com/mig.config" via nodeBuilder.WithLabel and call nodeBuilder.Update;
ensure you check and Expect(err) after RemoveLabel and before setting mig.config
so failures are surfaced.
📜 Review details
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- internal/wait/wait.go
- pkg/mig/mig.go
- tests/nvidiagpu/deploy-gpu-test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/wait/wait.go
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:56.555Z
Learning: In Kubernetes with NVIDIA GPUs, MIG single strategy profile uses the resource name "nvidia.com/gpu", while MIG mixed strategy profile uses specific resource names like "nvidia.com/mig-1g.5gb" based on the MIG profile.
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
📚 Learning: 2025-12-15T13:26:50.784Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:50.784Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, during cleanup of MIG labels nvidia.com/mig.strategy and nvidia.com/mig.config on GPU nodes after tests, use RemoveLabel to remove these labels instead of assigning invalid values. Only valid mig.strategy values are 'single' or 'mixed'.
Applied to files:
tests/nvidiagpu/deploy-gpu-test.go
📚 Learning: 2025-12-21T00:43:37.494Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:262-272
Timestamp: 2025-12-21T00:43:37.494Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, the AfterAll block intentionally uses only cleanupAfterTest (without checking mig.ShouldKeepOperator) when calling cleanupGPUOperatorResources(). This is by design: the cleanupAfterTest environment variable (NVIDIAGPU_CLEANUP) supersedes label-based cleanup logic at the very end of test execution, while defer blocks during test execution check both cleanupAfterTest && !mig.ShouldKeepOperator(labelsToCheck).
Applied to files:
- tests/nvidiagpu/deploy-gpu-test.go
- pkg/mig/mig.go
📚 Learning: 2025-12-12T10:59:47.614Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:47.614Z
Learning: When coding Kubernetes/NVIDIA MIG handling in Go, distinguish MIG profiles by their resource names: use 'nvidia.com/gpu' for the single MIG strategy, and use the MIG-profile-specific resources such as 'nvidia.com/mig-1g.5gb' for the mixed strategy. Implement checks or constants that reference these exact resource names, and avoid assuming a single generic resource name. Consider centralizing these names (e.g., constants or a small map) and unit-test logic that validates resource-name selection for each MIG profile to prevent misclassification.
Applied to files:
- tests/nvidiagpu/deploy-gpu-test.go
- pkg/mig/mig.go
📚 Learning: 2025-12-15T13:26:55.741Z
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
Applied to files:
pkg/mig/mig.go
📚 Learning: 2025-03-09T17:13:24.077Z
Learnt from: empovit
Repo: rh-ecosystem-edge/nvidia-ci PR: 81
File: workflows/nvidia_gpu_operator.py:10-33
Timestamp: 2025-03-09T17:13:24.077Z
Learning: In the nvidia-ci repository, the API interaction functions should use fail-fast behavior rather than graceful error handling. The workflow should fail explicitly if APIs change or don't return expected data (like authentication tokens or tags), as this makes problems immediately visible.
Applied to files:
pkg/mig/mig.go
```go
// ResetMIGLabelsToDisabled sets MIG strategy and configuration labels to "all-disabled" on GPU worker nodes.
func ResetMIGLabelsToDisabled(WorkerNodeSelector map[string]string) {
	glog.V(gpuparams.Gpu10LogLevel).Infof("%s", colorLog(colorCyan+colorBold, "Reset MIG labels to disabled"))
	nodeBuilders, err := nodes.List(inittools.APIClient, metav1.ListOptions{LabelSelector: labels.Set(WorkerNodeSelector).String()})
	Expect(err).ToNot(HaveOccurred(), "Error listing worker nodes: %v", err)
	for _, nodeBuilder := range nodeBuilders {
		glog.V(gpuparams.Gpu10LogLevel).Infof("Setting MIG configuration label to 'all-disabled' on node '%s' (overwrite=true)", nodeBuilder.Definition.Name)
		nodeBuilder = nodeBuilder.WithLabel("nvidia.com/mig.config", "all-disabled")
		_, err = nodeBuilder.Update()
		Expect(err).ToNot(HaveOccurred(), "Error updating node '%s' with MIG label: %v", nodeBuilder.Definition.Name, err)
		glog.V(gpuparams.Gpu10LogLevel).Infof("Successfully set MIG configuration label on node '%s'", nodeBuilder.Definition.Name)
		// Nitpick comment: Deleting strategy label does not help, it reappears after a while on its own
	}

	// Wait for ClusterPolicy to be notReady
	_ = wait.ClusterPolicyNotReady(inittools.APIClient, nvidiagpu.ClusterPolicyName,
		nvidiagpu.ClusterPolicyNotReadyCheckInterval, nvidiagpu.ClusterPolicyNotReadyTimeout)

	glog.V(gpuparams.GpuLogLevel).Infof("Waiting for ClusterPolicy to be ready after setting MIG node labels")
	err = wait.ClusterPolicyReady(inittools.APIClient, nvidiagpu.ClusterPolicyName,
		nvidiagpu.ClusterPolicyReadyCheckInterval, nvidiagpu.ClusterPolicyReadyTimeout)
	Expect(err).ToNot(HaveOccurred(), "Error waiting for ClusterPolicy to be ready after node label changes: %v", err)
	glog.V(gpuparams.GpuLogLevel).Infof("ClusterPolicy is ready after node label changes")
}
```
🛠️ Refactor suggestion | 🟠 Major
Remove the mig.strategy label during cleanup.
The ResetMIGLabelsToDisabled function sets nvidia.com/mig.config to "all-disabled" but does not remove the nvidia.com/mig.strategy label. Based on learnings, cleanup should remove the mig.strategy label entirely: its only valid values are "single" or "mixed", and leaving a stale value behind could affect subsequent tests.
The comment at line 433 mentions that "Deleting strategy label does not help, it reappears after a while on its own" — however, the proper cleanup approach is still to remove it and let the operator re-apply it as needed during the next configuration.
Based on learnings, use RemoveLabel for mig.strategy during cleanup.
🤖 Prompt for AI Agents
In @pkg/mig/mig.go around lines 422 - 445, ResetMIGLabelsToDisabled currently
sets "nvidia.com/mig.config" to "all-disabled" but leaves stale
"nvidia.com/mig.strategy" labels; update the loop in ResetMIGLabelsToDisabled to
also remove the strategy label by calling the node builder's RemoveLabel (e.g.,
nodeBuilder = nodeBuilder.RemoveLabel("nvidia.com/mig.strategy")) before calling
Update(), keep the existing Expect(err)... checks and logging around the
Update(), and ensure the removal happens for each node so the operator can
reapply the correct strategy later.
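The suggested cleanup can be sketched as framework-free label-map logic. This is a hypothetical illustration of the intent, not the repo's code — real code would go through the node builder's RemoveLabel/WithLabel methods:

```go
package main

import "fmt"

// cleanupMIGLabels sketches the suggested cleanup: drop nvidia.com/mig.strategy
// entirely (its only valid values are "single" or "mixed", so a stale value
// must not linger) and set nvidia.com/mig.config to "all-disabled".
func cleanupMIGLabels(labels map[string]string) map[string]string {
	out := make(map[string]string, len(labels))
	for k, v := range labels {
		if k == "nvidia.com/mig.strategy" {
			continue // remove rather than overwrite with an invalid value
		}
		out[k] = v
	}
	out["nvidia.com/mig.config"] = "all-disabled"
	return out
}

func main() {
	got := cleanupMIGLabels(map[string]string{
		"nvidia.com/mig.strategy": "single",
		"nvidia.com/mig.config":   "all-1g.5gb",
	})
	fmt.Println(got["nvidia.com/mig.config"]) // all-disabled
	_, hasStrategy := got["nvidia.com/mig.strategy"]
	fmt.Println(hasStrategy) // false
}
```

Note that, as discussed later in this thread, the strategy label may be re-applied automatically by the GPU Operator, so the removal shown here is best-effort.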
```go
defer func() {
	defer GinkgoRecover()
	glog.V(gpuparams.Gpu100LogLevel).Infof("Setting MIG strategy label on GPU worker nodes")
	nodeBuilders, err := nodes.List(inittools.APIClient, metav1.ListOptions{LabelSelector: labels.Set(WorkerNodeSelector).String()})
	Expect(err).ToNot(HaveOccurred(), "Error listing worker nodes: %v", err)
	for _, nodeBuilder := range nodeBuilders {
		glog.V(gpuparams.Gpu100LogLevel).Infof("NOT setting or removing MIG strategy label on node '%s'", nodeBuilder.Definition.Name)

		glog.V(gpuparams.Gpu100LogLevel).Infof("Setting MIG configuration label on node '%s' (overwrite=true)", nodeBuilder.Definition.Name)
		nodeBuilder = nodeBuilder.WithLabel("nvidia.com/mig.config", "all-disabled")
		_, err = nodeBuilder.Update()
		Expect(err).ToNot(HaveOccurred(), "Error updating node '%s' with MIG label: %v", nodeBuilder.Definition.Name, err)
		glog.V(gpuparams.Gpu10LogLevel).Infof("Successfully set MIG configuration label on node '%s'", nodeBuilder.Definition.Name)
	}

	By("Wait for GPU operator to reconcile after setting MIG node labels")
	glog.V(gpuparams.GpuLogLevel).Infof("Sleeping for 30 seconds to allow GPU operator to process node label changes")
	time.Sleep(30 * time.Second)
	// Sleeping to ensure ClusterPolicy gets notReady before entering the next step

	By(fmt.Sprintf("Wait up to %s for ClusterPolicy to be ready after node label changes", nvidiagpu.ClusterPolicyReadyTimeout))
	glog.V(gpuparams.GpuLogLevel).Infof("Waiting for ClusterPolicy to be ready after setting MIG node labels")
	err = wait.ClusterPolicyReady(inittools.APIClient, nvidiagpu.ClusterPolicyName,
		nvidiagpu.ClusterPolicyReadyCheckInterval, nvidiagpu.ClusterPolicyReadyTimeout)
	Expect(err).ToNot(HaveOccurred(), "Error waiting for ClusterPolicy to be ready: %v", err)
	glog.V(gpuparams.GpuLogLevel).Infof("ClusterPolicy is ready after node label changes")
}()
```
🛠️ Refactor suggestion | 🟠 Major
Remove the mig.strategy label during cleanup instead of leaving it.
The cleanup defer sets nvidia.com/mig.config to "all-disabled" but does not remove the nvidia.com/mig.strategy label. Based on learnings, cleanup should remove the mig.strategy label entirely rather than leave a potentially stale value, since its only valid values are "single" or "mixed".
Consider using nodeBuilder.RemoveLabel("nvidia.com/mig.strategy", "") before setting the mig.config label to ensure clean state for subsequent tests.
Based on learnings, use RemoveLabel for mig.strategy during cleanup.
🤖 Prompt for AI Agents
In @tests/nvidiagpu/deploy-gpu-test.go around lines 1386 - 1412, The cleanup
defer block currently sets the "nvidia.com/mig.config" label to "all-disabled"
but leaves any existing "nvidia.com/mig.strategy" label in place; update the
defer cleanup (the anonymous defer containing nodeBuilders loop and calls to
nodeBuilder.WithLabel and nodeBuilder.Update) to first call
nodeBuilder.RemoveLabel("nvidia.com/mig.strategy", "") (or the appropriate
RemoveLabel signature) for each nodeBuilder, then proceed to set
"nvidia.com/mig.config" via nodeBuilder.WithLabel and call nodeBuilder.Update;
ensure you check and Expect(err) after RemoveLabel and before setting mig.config
so failures are surfaced.
I had an earlier implementation where the label was removed, but since it reappeared on its own, I dropped the removal.
@vtruhpon, thank you for that clarification! That makes sense — if the GPU Operator or another controller automatically re-applies the nvidia.com/mig.strategy label, then attempting to remove it during cleanup would indeed be ineffective. Your current implementation is the right approach given that behavior.
I appreciate you sharing this context about the label's automatic management.
✏️ Learnings added
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1386-1412
Timestamp: 2026-01-07T14:49:46.309Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, during MIG cleanup after tests, the nvidia.com/mig.strategy label should not be removed because it gets automatically re-applied by the GPU Operator. Setting nvidia.com/mig.config to "all-disabled" is sufficient for cleanup.
🧠 Learnings used
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:1343-1368
Timestamp: 2025-12-15T13:26:55.741Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, when cleaning up MIG configuration labels (nvidia.com/mig.strategy and nvidia.com/mig.config) from GPU nodes after tests, use the RemoveLabel method instead of setting them to invalid values like "all-disabled". The valid values for nvidia.com/mig.strategy are "single" or "mixed" only.
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: tests/nvidiagpu/deploy-gpu-test.go:262-272
Timestamp: 2025-12-21T00:43:37.494Z
Learning: In tests/nvidiagpu/deploy-gpu-test.go, the AfterAll block intentionally uses only cleanupAfterTest (without checking mig.ShouldKeepOperator) when calling cleanupGPUOperatorResources(). This is by design: the cleanupAfterTest environment variable (NVIDIAGPU_CLEANUP) supersedes label-based cleanup logic at the very end of test execution, while defer blocks during test execution check both cleanupAfterTest && !mig.ShouldKeepOperator(labelsToCheck).
Learnt from: vtruhpon
Repo: rh-ecosystem-edge/nvidia-ci PR: 391
File: internal/gpu-burn/gpu-burn.go:70-72
Timestamp: 2025-12-12T10:59:56.555Z
Learning: In Kubernetes with NVIDIA GPUs, MIG single strategy profile uses the resource name "nvidia.com/gpu", while MIG mixed strategy profile uses specific resource names like "nvidia.com/mig-1g.5gb" based on the MIG profile.
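The two cleanup gates described in the first learning above can be sketched as boolean predicates. This is a hypothetical illustration — `shouldKeepOperator` stands in for `mig.ShouldKeepOperator`, and the function names are not the repo's:

```go
package main

import "fmt"

// shouldCleanupDuringTest models the defer blocks that run during test
// execution: both the NVIDIAGPU_CLEANUP env flag and the keep-operator
// label must allow cleanup.
func shouldCleanupDuringTest(cleanupAfterTest, shouldKeepOperator bool) bool {
	return cleanupAfterTest && !shouldKeepOperator
}

// shouldCleanupAtEnd models the AfterAll block: the NVIDIAGPU_CLEANUP env
// flag supersedes label-based logic at the very end of test execution.
func shouldCleanupAtEnd(cleanupAfterTest bool) bool {
	return cleanupAfterTest
}

func main() {
	fmt.Println(shouldCleanupDuringTest(true, true)) // false: label keeps operator mid-run
	fmt.Println(shouldCleanupAtEnd(true))            // true: env flag wins in AfterAll
}
```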
/test 4.20-stable-nvidia-gpu-operator-e2e-25-10-x

Summary by CodeRabbit
New Features
Tests
Bug Fixes
Documentation