-
Notifications
You must be signed in to change notification settings - Fork 49
ocp-sriov: add comprehensive test documentation #987
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
📝 WalkthroughWalkthroughAdds a CNF-style SR-IOV test framework for OCP: new internal packages (ocpsriovconfig, ocpsriovinittools, sriovocpenv, sriovenv, tsparams changes), device/VF parsing and defaults, many test constants, and a new Ginkgo "basic" SR-IOV test suite. README env var names/formats updated. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~40 minutes Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.5.0)Command failed Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
ec77e65 to
941b840
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
🧹 Nitpick comments (4)
tests/ocp/sriov/internal/tsparams/consts.go (1)
1-70: Well-structured test constants file.The constants are well-documented with appropriate timeout values. The labeling mechanism with a base
Labelsslice that individual tests can augment is a clean pattern for test filtering.A minor note: the MAC addresses
20:04:0f:f1:88:01and20:04:0f:f1:88:02have the locally administered bit set (bit 1 of the first octet,0x02), which is correct for test/virtual interfaces. However, the current values (0x20) have the multicast bit set instead. For strictly correct locally administered unicast MACs, consider using addresses like02:04:0f:f1:88:01.- // TestPodClientMAC is the default MAC address for test client pods. - TestPodClientMAC = "20:04:0f:f1:88:01" - // TestPodServerMAC is the default MAC address for test server pods. - TestPodServerMAC = "20:04:0f:f1:88:02" + // TestPodClientMAC is the default MAC address for test client pods. + TestPodClientMAC = "02:04:0f:f1:88:01" + // TestPodServerMAC is the default MAC address for test server pods. + TestPodServerMAC = "02:04:0f:f1:88:02"tests/ocp/sriov/sriov_suite_test.go (1)
36-80: BeforeSuite handles namespace cleanup well.The logic to detect pre-existing namespaces from previous partial failures and recreate them is robust. However, the namespace label setup is duplicated between lines 49-51 and 65-68. Consider extracting to a helper function.
+func createTestNamespace() *namespace.Builder { + ns := namespace.NewBuilder(APIClient, tsparams.TestNamespaceName) + for key, value := range params.PrivilegedNSLabels { + ns.WithLabel(key, value) + } + return ns +} + var _ = BeforeSuite(func() { // ... validation and cleanup ... By("Creating test namespace with privileged labels") - testNS = namespace.NewBuilder(APIClient, tsparams.TestNamespaceName) - for key, value := range params.PrivilegedNSLabels { - testNS.WithLabel(key, value) - } + testNS = createTestNamespace() _, err = testNS.Create() if err != nil { if apierrors.IsAlreadyExists(err) { // ... delete existing ... - testNS = namespace.NewBuilder(APIClient, tsparams.TestNamespaceName) - for key, value := range params.PrivilegedNSLabels { - testNS.WithLabel(key, value) - } + testNS = createTestNamespace() // ...tests/ocp/sriov/internal/sriovenv/sriovenv.go (2)
716-760: Consider returning accumulated errors for observability.The cleanup function ignores all errors and always returns
nil. While best-effort cleanup is appropriate, consider logging errors at a higher verbosity level or returning an aggregate error for observability.func CleanupLeftoverResources() error { + var errs []error sriovOpNs := SriovOcpConfig.OcpSriovOperatorNamespace klog.V(90).Info("Cleaning up leftover test resources") // Cleanup test namespaces - namespaces, _ := namespace.List(APIClient, metav1.ListOptions{}) + namespaces, err := namespace.List(APIClient, metav1.ListOptions{}) + if err != nil { + klog.V(90).Infof("Warning: failed to list namespaces: %v", err) + } // ... rest of cleanup ... klog.V(90).Info("Cleanup completed") return nil }
732-740: Document the naming convention matched by cleanup regex.The regex
^\d{5}-|\w+dpdknet$matches specific naming patterns. Consider adding a comment explaining what test resource names follow these patterns to help future maintainers understand which resources are targeted for cleanup.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between b1c772b and 941b840fe473a30924a9bdb82b503809368aac2c.
📒 Files selected for processing (10)
tests/ocp/internal/ocpconfig/config.go(2 hunks)tests/ocp/sriov/README.md(1 hunks)tests/ocp/sriov/internal/ocpsriovconfig/config.go(4 hunks)tests/ocp/sriov/internal/ocpsriovconfig/default.yaml(0 hunks)tests/ocp/sriov/internal/sriovenv/sriovenv.go(1 hunks)tests/ocp/sriov/internal/tsparams/consts.go(1 hunks)tests/ocp/sriov/internal/tsparams/sriovvars.go(1 hunks)tests/ocp/sriov/sriov_suite_test.go(1 hunks)tests/ocp/sriov/tests/README.md(1 hunks)tests/ocp/sriov/tests/basic.go(1 hunks)
💤 Files with no reviewable changes (1)
- tests/ocp/sriov/internal/ocpsriovconfig/default.yaml
🧰 Additional context used
🧬 Code graph analysis (2)
tests/ocp/sriov/sriov_suite_test.go (5)
tests/ocp/sriov/internal/ocpsriovconfig/config.go (1)
SriovOcpConfig(21-26)tests/ocp/sriov/internal/ocpsriovinittools/ocpsriovinittools.go (2)
SriovOcpConfig(13-13)APIClient(11-11)tests/ocp/sriov/internal/tsparams/consts.go (1)
Labels(69-69)tests/ocp/sriov/internal/tsparams/ocpsriovvars.go (1)
Labels(14-14)tests/ocp/sriov/internal/sriovenv/sriovenv.go (2)
CleanupLeftoverResources(717-760)CheckSriovOperatorStatus(40-42)
tests/ocp/sriov/internal/ocpsriovconfig/config.go (1)
tests/ocp/sriov/internal/ocpsriovinittools/ocpsriovinittools.go (1)
SriovOcpConfig(13-13)
🪛 LanguageTool
tests/ocp/sriov/tests/README.md
[grammar] ~493-~493: Ensure spelling is correct
Context: ... up resources using DeferCleanup ### AfterAll Hook 1. Policy Cleanup: Removes al...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🔇 Additional comments (22)
tests/ocp/internal/ocpconfig/config.go (1)
38-40: Pointer preservation around YAML decode looks reasonable; confirm null‑intent semanticsCapturing the existing
GeneralConfigpointer and restoring it only when YAML decoding leaves itnilis a pragmatic way to avoid losing the initialized config duringDecode, and should help prevent nil dereferences downstream.One subtle behavioral change to be aware of: if the YAML were ever to explicitly set this embedded pointer to
null, this restoration logic would override that and keep a non‑nilGeneralConfig. If "config cannot be explicitly disabled/null" is the intended contract, this is fine; otherwise you might want a more specific condition for when to restore (e.g., only for an actually empty/default file or based on file size/contents).Also applies to: 52-55
tests/ocp/sriov/README.md (3)
1-10: Comprehensive and well-structured documentation.The README provides excellent coverage of test development guidelines, including directory structure, naming conventions, Ginkgo patterns, and best practices. This will be valuable for maintaining consistency across the SR-IOV test suite.
404-406: Good choice using klog over glog.Using
klogis the correct choice as it's the standard logging library in the Kubernetes ecosystem and provides better integration with Kubernetes tooling.
224-257: Excellent guidance on helper function design.The clear distinction between where Gomega/Ginkgo can be used (test files only) versus helper functions (return errors instead) is a critical best practice that prevents common anti-patterns in test code organization.
tests/ocp/sriov/tests/README.md (3)
1-15: Well-documented test suite README.The documentation provides clear information about the test file, labels, namespace, and container configuration. This will help users understand the test suite quickly.
86-116: Comprehensive environment variable documentation.Good coverage of both mandatory and optional environment variables, including the fallback behavior and default device configurations. The format specification for
SRIOV_DEVICESis particularly helpful.
475-520: Clear test execution flow documentation.The documentation of BeforeAll, test execution patterns, AfterAll hooks, and resource management behavior helps users understand the test lifecycle. The note about automatic cleanup of leftover resources is valuable.
Note: The static analysis hint about "AfterAll" spelling is a false positive—it's a valid Ginkgo hook name.
tests/ocp/sriov/internal/ocpsriovconfig/config.go (3)
28-41: Good decoupling of YAML/env config from embedded structs.The introduction of
sriovYAMLConfigandsriovEnvConfighelper types prevents the YAML decoder from zeroing out embeddedOcpConfig/GeneralConfigpointers during decoding. This is a solid approach to preserve the parent config state.
57-62: Defensive check for runtime.Caller failure.Good addition to handle the edge case where
runtime.Callerfails. This prevents potential panics from using invalid path information.
122-134: Correct conditional application of environment overrides.The pattern of checking for non-empty values before applying ensures that YAML defaults aren't overwritten by empty environment variables. This is the expected behavior for configuration layering.
tests/ocp/sriov/tests/basic.go (6)
22-32: Utility function for NO-CARRIER detection.Clean implementation checking for NO-CARRIER conditions. This enables graceful skipping of devices without physical link.
34-60: Well-structured namespace setup with automatic cleanup.The
setupTestNamespacehelper correctly creates namespaces with privileged labels and registers cleanup viaDeferCleanup. TheEventuallycheck ensures namespace existence before proceeding.
77-99: Proper test suite structure with BeforeAll validation.The test suite correctly uses
Ordered,Label, andContinueOnFailureas documented in the README. TheBeforeAllhook validates operator status and discovers worker nodes before any tests run.
101-119: Robust policy cleanup in AfterAll.Good approach collecting cleanup errors without failing early, then logging warnings for any failures. The final stability wait ensures the cluster is ready after cleanup.
121-157: Consistent test case pattern.The test case structure (device iteration, VF initialization, namespace/network setup, verification, skip on NO-CARRIER) is well-designed and consistently applied across all test cases.
458-465: Consider adding cleanup for NAD polling failure case.The
Eventuallyblock for NAD readiness doesn't have cleanup registered before it. If NAD polling fails, the namespace cleanup will still run, but the network created at line 456 may leave orphaned resources.This is a minor edge case since
DeferCleanupat line 69-74 (viasetupSriovNetwork) should handle it, but verify thatRemoveSriovNetworkhandles non-existent networks gracefully.tests/ocp/sriov/internal/tsparams/sriovvars.go (4)
15-22: Clear DeviceConfig struct definition.The struct captures all necessary device configuration with a useful
SupportsMinTxRateflag for conditional test behavior.
41-75: Robust environment variable parsing.The parser correctly handles edge cases: empty entries are skipped, malformed entries log warnings but don't fail parsing entirely, and all values are trimmed. The format documentation in the comment is helpful.
77-92: Fail-fast behavior for invalid configuration.The panic when
SRIOV_DEVICESis set but no valid entries are parsed is appropriate—it prevents tests from silently running with unintended default configurations when the user explicitly set the variable.
94-113: Proper fallback chain for VF count.The function correctly prioritizes
ECO_OCP_SRIOV_VF_NUMoverSRIOV_VF_NUM, validates for positive integers, logs warnings for invalid values, and falls back to a sensible default of 2.tests/ocp/sriov/internal/sriovenv/sriovenv.go (1)
215-266: Verify single-node VF initialization is intentional.The function iterates through worker nodes but returns immediately after successfully creating a policy on the first node. This means VFs are only initialized on one node rather than all workers. If tests require VFs on multiple nodes, this logic would need adjustment.
tests/ocp/sriov/sriov_suite_test.go (1)
90-96: VerifyReporterNamespacesToDumpandReporterCRDsToDumpare defined in tsparams package.The
tsparams.ReporterNamespacesToDumpandtsparams.ReporterCRDsToDumpvariables are referenced in the test but were not found in the provided consts file. Confirm these are defined in thetsparamspackage or add the definitions if missing.
9999d37 to
7cc85e6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Nitpick comments (4)
tests/ocp/sriov/tests/README.md (1)
11-16: Clarify single “Test Namespace” vs per-test e2e namespacesThe “Test Suite Information” section declares a single test namespace
sriov-tests, while later sections state that each test creates its own namespace with a test case ID prefix. Intests/ocp/sriov/tests/basic.go, pods are created in dynamically named namespaces likee2e-<testID><deviceName>, notsriov-tests.Consider clarifying that:
sriov-testsis the suite-level namespace (where applicable, e.g., reporter config), and- each basic test also creates its own
e2e-*namespace per device.That will help operators know where to look for resources and logs.
Also applies to: 515-521
tests/ocp/sriov/internal/tsparams/ocpsriovvars.go (1)
11-18: IncludingLabelBasicinLabelsaligns with new filtering docsExtending
Labelsto{LabelSuite, LabelBasic}matches the usage inbasic.goand the README examples (ocpsriov && basic). This keeps suite-level label-based filtering coherent across code and docs.You may also want to double-check other suites under
tests/ocp/sriov/to ensure they still behave as expected whenECO_TEST_LABELSis set to this expanded label list.tests/ocp/sriov/internal/tsparams/consts.go (1)
6-57: Good centralization of timeouts and default test dataCollecting label, timeout/interval, MAC/IP, and MTU constants here makes the SR-IOV tests much easier to tune and reason about. The values are consistent with usages in
sriovenv.goandbasic.go.Just make sure any future changes to
DefaultTestMTU,PodReadyTimeout, etc., are mirrored in the test documentation so operators don’t get stale numbers.tests/ocp/sriov/internal/ocpsriovconfig/config.go (1)
26-33: Device defaults andSRIOV_DEVICESparsing look good, but doc should mention the 5th flagThe
DeviceConfig/GetDefaultDevicessetup andparseDevicesFromEnvlogic are reasonable:
- Code defaults include
SupportsMinTxRateper device, used by the VLAN/rate-limiting test to skip NICs that don’t supportminTxRate.SRIOV_DEVICESsupports bothname:deviceid:vendor:interfaceand a 5th optionalsupportsMinTxRateflag (true|1|yes→true, anything else →false).Two follow-ups to keep things coherent:
The SR-IOV basic README currently documents only the 4-field form and lists a different default device set than
GetDefaultDevices()provides. It should be updated to:
- Reflect this exact
GetDefaultDevices()list, and- Mention the optional 5th
supportsMinTxRatefield so users can correctly mark devices that don’t support min TX rate.If
default.yamlalso carries a devices list, consider clearly documenting (or enforcing) which source “wins” (YAML vsSRIOV_DEVICESvs built-in defaults) to avoid surprises when operators override configuration.Also applies to: 76-88, 255-306
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 1dd58b3e8bb9cb90a5e797befbb72c018af18d5e and 9999d379c5eaa2504825f4765687ec2c9927f271.
📒 Files selected for processing (8)
tests/ocp/sriov/README.md(1 hunks)tests/ocp/sriov/internal/ocpsriovconfig/config.go(4 hunks)tests/ocp/sriov/internal/ocpsriovconfig/default.yaml(1 hunks)tests/ocp/sriov/internal/sriovenv/sriovenv.go(1 hunks)tests/ocp/sriov/internal/tsparams/consts.go(2 hunks)tests/ocp/sriov/internal/tsparams/ocpsriovvars.go(2 hunks)tests/ocp/sriov/tests/README.md(1 hunks)tests/ocp/sriov/tests/basic.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/ocp/sriov/internal/ocpsriovconfig/default.yaml
🧰 Additional context used
🧬 Code graph analysis (2)
tests/ocp/sriov/internal/tsparams/ocpsriovvars.go (1)
tests/ocp/sriov/internal/tsparams/consts.go (2)
LabelSuite(10-10)LabelBasic(15-15)
tests/ocp/sriov/tests/basic.go (5)
tests/ocp/sriov/internal/ocpsriovconfig/config.go (1)
DeviceConfig(27-33)tests/cnf/ran/ptp/internal/iface/iface.go (1)
Name(22-22)tests/system-tests/internal/link/link.go (1)
NewBuilder(48-77)tests/ocp/sriov/internal/tsparams/consts.go (7)
NamespaceTimeout(30-30)CleanupTimeout(34-34)LabelBasic(15-15)PodReadyTimeout(32-32)DefaultTestMTU(56-56)NADTimeout(40-40)PollingInterval(28-28)tests/ocp/sriov/internal/sriovenv/sriovenv.go (17)
NetworkOption(88-88)CreateSriovNetwork(140-155)RemoveSriovNetwork(168-177)CheckSriovOperatorStatus(40-42)RemoveSriovPolicy(184-203)WaitForSriovPolicyReady(45-47)InitVF(209-211)WithSpoof(91-95)WithTrust(98-102)CheckVFStatusWithPassTraffic(510-585)WithVLAN(105-109)WithVlanQoS(112-116)WithMinTxRate(119-123)WithMaxTxRate(126-130)WithLinkState(133-137)VerifyLinkStateConfiguration(589-638)GetPciAddress(729-765)
🪛 LanguageTool
tests/ocp/sriov/tests/README.md
[grammar] ~493-~493: Ensure spelling is correct
Context: ... up resources using DeferCleanup ### AfterAll Hook 1. Policy Cleanup: Removes al...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🔇 Additional comments (1)
tests/ocp/sriov/README.md (1)
401-412: Logging wording now matches actual klog usageRenaming this bullet to “klog for verbose logging” aligns the text with the
klog.V(level).Infof()example below; no further changes needed here.
| // initVFWithDevType creates an SR-IOV policy for the specified device type. | ||
| // It iterates through worker nodes and creates a policy on the first node where | ||
| // the device is successfully discovered. Returns (true, nil) on success. | ||
| func initVFWithDevType(name, deviceID, interfaceName, vendor, devType string, vfNum int, | ||
| workerNodes []*nodes.Builder) (bool, error) { | ||
| sriovOpNs := SriovOcpConfig.OcpSriovOperatorNamespace | ||
|
|
||
| // Cleanup existing policy | ||
| _ = RemoveSriovPolicy(name, tsparams.NamespaceTimeout) | ||
|
|
||
| for _, node := range workerNodes { | ||
| nodeName := node.Definition.Name | ||
|
|
||
| // Try to discover the actual interface name | ||
| actualInterface := interfaceName | ||
|
|
||
| // Only discover interface if not explicitly provided | ||
| // This allows users to specify exact interface for multi-port NICs | ||
| if interfaceName == "" && vendor != "" && deviceID != "" { | ||
| if discovered, err := discoverInterfaceName(nodeName, vendor, deviceID); err == nil { | ||
| actualInterface = discovered | ||
| } | ||
| } | ||
|
|
||
| pfSelector := fmt.Sprintf("%s#0-%d", actualInterface, vfNum-1) | ||
|
|
||
| policy := sriov.NewPolicyBuilder(APIClient, name, sriovOpNs, name, vfNum, | ||
| []string{pfSelector}, map[string]string{"kubernetes.io/hostname": nodeName}). | ||
| WithDevType(devType) | ||
|
|
||
| if vendor != "" { | ||
| policy.Definition.Spec.NicSelector.Vendor = vendor | ||
| } | ||
|
|
||
| if deviceID != "" { | ||
| policy.Definition.Spec.NicSelector.DeviceID = deviceID | ||
| } | ||
|
|
||
| if _, err := policy.Create(); err != nil { | ||
| klog.V(90).Infof("Failed to create policy on node %q: %v", nodeName, err) | ||
|
|
||
| continue | ||
| } | ||
|
|
||
| if err := WaitForSriovPolicyReady(tsparams.PolicyApplicationTimeout); err != nil { | ||
| klog.V(90).Infof("Policy not ready on node %q: %v", nodeName, err) | ||
|
|
||
| _ = policy.Delete() | ||
|
|
||
| continue | ||
| } | ||
|
|
||
| klog.V(90).Infof("Successfully created policy %q on node %q", name, nodeName) | ||
|
|
||
| return true, nil | ||
| } | ||
|
|
||
| return false, fmt.Errorf("failed to create policy %q on any node", name) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
initVFWithDevType never returns a non-error “not found” result, making graceful skips impossible
initVFWithDevType currently:
- Cleans up any existing policy,
- Iterates worker nodes trying to create a policy,
- On first success, waits for readiness and returns
(true, nil), - If all attempts fail, returns
falsewith an error:
return false, fmt.Errorf("failed to create policy %q on any node", name)Callers like InitVF / InitDpdkVF in turn are used by tests/basic.go, which expects to:
- Treat
err != nilas a hard failure viaExpect(err).ToNot(HaveOccurred()), and - Use the boolean
resultto decide whether to skip devices and ultimatelySkip("No SR-IOV devices matched...").
Because (false, nil) is never returned, “no matching device on any worker” can only manifest as an error, and tests can’t distinguish “device not present” from genuine failures. That contradicts the README’s intention to skip devices not found on any worker node.
Consider adjusting this API so callers can differentiate:
- Return
(false, nil)(or a sentinel error type) when no suitable NIC is found on any node, and reserve non-nil errors for real misconfigurations (e.g., API failures), or - Add a clear contract (docstring + error type) so tests can detect and
Skip“not found” while still failing on other errors.
This will allow the basic tests and documentation to implement the promised “device skipping” behavior without conflating it with infrastructure failures.
🤖 Prompt for AI Agents
tests/ocp/sriov/internal/sriovenv/sriovenv.go lines 221-279: The function
currently treats "no matching device on any worker" as an error and always
returns (false, error), which prevents callers from distinguishing "not found"
from real failures; change the API so the function returns (false, nil) when no
suitable NIC is discovered on any worker (or alternatively return a
well-documented sentinel error type like ErrDeviceNotFound), and only return
non-nil errors for real failures (API/creation/timeout). Update the function to
detect the "no-match" case and return (false, nil) (or ErrDeviceNotFound)
instead of fmt.Errorf(...), and update callers (InitVF/InitDpdkVF and
tests/basic.go) to check the boolean or sentinel error and Skip when appropriate
while still failing on real errors.
| var ( | ||
| vfNum = SriovOcpConfig.GetVFNum() | ||
| workerNodes []*nodes.Builder | ||
| testData = SriovOcpConfig.GetDevices() | ||
| ) | ||
|
|
||
| BeforeAll(func() { | ||
| By("Checking the SR-IOV operator is running") | ||
| err := sriovenv.CheckSriovOperatorStatus() | ||
| Expect(err).ToNot(HaveOccurred(), "SR-IOV operator is not running") | ||
|
|
||
| By("Discovering worker nodes") | ||
| workerNodes, err = nodes.List(APIClient, | ||
| metav1.ListOptions{LabelSelector: SriovOcpConfig.WorkerLabel}) | ||
| Expect(err).ToNot(HaveOccurred(), "Failed to discover nodes") | ||
| Expect(len(workerNodes)).To(BeNumerically(">", 0), "No worker nodes found") | ||
| }) | ||
|
|
||
| AfterAll(func() { | ||
| By("Cleaning up SR-IOV policies after all tests") | ||
|
|
||
| var cleanupErrors []string | ||
| for _, item := range testData { | ||
| err := sriovenv.RemoveSriovPolicy(item.Name, tsparams.DefaultTimeout) | ||
| if err != nil { | ||
| cleanupErrors = append(cleanupErrors, fmt.Sprintf("policy %q: %v", item.Name, err)) | ||
| } | ||
| } | ||
|
|
||
| if len(cleanupErrors) > 0 { | ||
| klog.Warningf("Some policies failed to clean up: %v", cleanupErrors) | ||
| } | ||
|
|
||
| By("Waiting for post-cleanup cluster stability") | ||
| err := sriovenv.WaitForSriovPolicyReady(tsparams.DefaultTimeout) | ||
| Expect(err).ToNot(HaveOccurred(), "Cluster did not stabilize after cleanup") | ||
| }) | ||
|
|
||
| It("SR-IOV VF with spoof checking enabled", reportxml.ID("25959"), func() { | ||
| caseID := "25959" | ||
| executed := false | ||
|
|
||
| for _, data := range testData { | ||
| By(fmt.Sprintf("Testing device: %s", data.Name)) | ||
|
|
||
| result, err := sriovenv.InitVF(data.Name, data.DeviceID, data.InterfaceName, | ||
| data.Vendor, vfNum, workerNodes) | ||
| Expect(err).ToNot(HaveOccurred(), "Failed to initialize VF for device %q", data.Name) | ||
|
|
||
| if !result { | ||
| By(fmt.Sprintf("Skipping device %q - VF init failed", data.Name)) | ||
|
|
||
| continue | ||
| } | ||
|
|
||
| executed = true | ||
| testNamespace := setupTestNamespace(caseID+"-", data) | ||
| networkName := caseID + "-" + data.Name | ||
| setupSriovNetwork(networkName, data.Name, testNamespace, | ||
| sriovenv.WithSpoof(true), sriovenv.WithTrust(false)) | ||
|
|
||
| By("Verifying VF status with pass traffic") | ||
| err = sriovenv.CheckVFStatusWithPassTraffic(networkName, data.InterfaceName, | ||
| testNamespace, "spoof checking on", tsparams.PodReadyTimeout) | ||
| if isNoCarrierError(err) { | ||
| Skip("Interface has NO-CARRIER status") | ||
| } | ||
|
|
||
| Expect(err).ToNot(HaveOccurred(), "Test verification failed") | ||
| } | ||
|
|
||
| if !executed { | ||
| Skip("No SR-IOV devices matched the requested configuration") | ||
| } | ||
| }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
InitVF error handling prevents “device not found” from being treated as a skip
Across the tests (starting here with Test ID 25959), the pattern is:
result, err := sriovenv.InitVF(...)
Expect(err).ToNot(HaveOccurred(), ...)
if !result {
// skip this device
continue
}
executed = true
...However, initVFWithDevType in sriovenv.go currently only returns:
(true, nil)when a policy is successfully created, or(false, error)when it fails on all nodes.
It never returns (false, nil), so a “device not present on any worker” scenario produces an error and fails the test before the if !result or final if !executed { Skip(...) } paths can run. That contradicts the README, which says devices not found on any worker node are automatically skipped.
Consider one of these options:
- Teach
initVFWithDevType(and thusInitVF/InitDpdkVF) to distinguish “no suitable node/device found” from real failures, e.g. returning(false, nil)or a well-known sentinel error for the former; then the existingresult/executedlogic will work as intended, or - Keep
InitVFreturning an error, but in the tests explicitly detect a “no device found” error (e.g., viaerrors.Isor substring) andSkipin that case instead of unconditionallyExpect(err).ToNot(HaveOccurred()).
Without such a change, running these tests on clusters lacking one or more configured devices will yield hard failures instead of clean skips.
7cc85e6 to
acd5dac
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (2)
tests/ocp/sriov/internal/sriovenv/sriovenv.go (1)
221-279:initVFWithDevTypestill conflates "device not found" with real errors.The previous review correctly identified that this function returns
(false, error)for all failure cases, making it impossible for callers to distinguish between "no matching device on any node" (which should trigger a skip) and actual infrastructure failures (which should fail the test).The function currently returns:
return false, fmt.Errorf("failed to create policy %q on any node", name)Consider returning
(false, nil)when no device matches, reserving errors for genuine failures:+var errNoMatchingDevice = fmt.Errorf("no matching device found on any worker node") func initVFWithDevType(...) (bool, error) { // ... existing logic ... + // Track if any attempt was made (vs no nodes at all) + var attemptedNodes int for _, node := range workerNodes { + attemptedNodes++ // ... policy creation attempts ... } - return false, fmt.Errorf("failed to create policy %q on any node", name) + if attemptedNodes == 0 { + return false, fmt.Errorf("no worker nodes provided") + } + // No matching device - callers can skip + return false, nil }This allows tests to use
Skip()for "device not present" scenarios as documented in the README.tests/ocp/sriov/tests/README.md (1)
420-443: MTU value in documentation still doesn't match code.The README documents MTU as
1800(line 431), buttsparams.DefaultTestMTUinconsts.gois defined as9000. This discrepancy was flagged in a previous review and marked as addressed, but the mismatch persists.Update line 431 to reflect the actual default:
**Configuration**: -- MTU: `1800` +- MTU: `9000` (defined by `DefaultTestMTU`)And update line 438 accordingly:
-- Policy MTU is updated to 1800 +- Policy MTU is updated to 9000
🧹 Nitpick comments (4)
tests/ocp/sriov/internal/tsparams/consts.go (1)
55-56: Consider documenting MTU range constraint.The README mentions "MTU value must be in range 1-9192 as per SR-IOV specification." Consider adding a comment here about the valid range, or defining
MinTestMTUandMaxTestMTUconstants for validation.tests/ocp/sriov/internal/ocpsriovconfig/config.go (1)
308-328: Legacy VFNum parsing has a subtle ordering dependency.The check
if sriovOcpConfig.VFNum != DefaultVFNummeansSRIOV_VF_NUMis only read ifECO_OCP_SRIOV_VF_NUMwasn't set. This is correct but relies on the call order inNewSriovOcpConfig(). Consider adding a brief comment inNewSriovOcpConfig()noting this intentional precedence.tests/ocp/sriov/internal/sriovenv/sriovenv.go (2)
303-356: Policy recreation now preserves additional fields - good improvement.The
UpdateSriovPolicyMTUfunction now correctly preservesDeviceType,Vendor,DeviceID, andPriorityfrom the original spec. This addresses the previous review concern.However, there are still some spec fields that might be lost:
LinkTypeIsRdmaNicSelector.RootDevicesNicSelector.NetFilterExternallyManagedVdpaTypeFor a complete solution, consider copying all non-zero fields from the original spec, or document which fields are preserved.
167-177: Silent error swallowing inRemoveSriovNetwork.Returning
nilwhenPullNetworkfails assumes the network is already deleted, but the error could indicate other issues (e.g., API server unreachable). Consider logging the error or checking for specific "not found" errors:func RemoveSriovNetwork(name string, timeout time.Duration) error { network, err := sriov.PullNetwork(APIClient, name, sriovOpNs) if err != nil { - return nil // Already deleted + klog.V(90).Infof("Network %q not found or error pulling: %v", name, err) + return nil // Assume already deleted } return network.DeleteAndWait(timeout) }
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 9999d379c5eaa2504825f4765687ec2c9927f271 and acd5dacc020bde2aaebbc9622d18f297bde606b0.
📒 Files selected for processing (8)
tests/ocp/sriov/README.md(3 hunks)tests/ocp/sriov/internal/ocpsriovconfig/config.go(4 hunks)tests/ocp/sriov/internal/ocpsriovconfig/default.yaml(1 hunks)tests/ocp/sriov/internal/sriovenv/sriovenv.go(1 hunks)tests/ocp/sriov/internal/tsparams/consts.go(2 hunks)tests/ocp/sriov/internal/tsparams/ocpsriovvars.go(2 hunks)tests/ocp/sriov/tests/README.md(1 hunks)tests/ocp/sriov/tests/basic.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- tests/ocp/sriov/README.md
- tests/ocp/sriov/internal/tsparams/ocpsriovvars.go
- tests/ocp/sriov/internal/ocpsriovconfig/default.yaml
- tests/ocp/sriov/tests/basic.go
🧰 Additional context used
🧬 Code graph analysis (1)
tests/ocp/sriov/internal/ocpsriovconfig/config.go (3)
tests/cnf/ran/ptp/internal/iface/iface.go (1)
Name(22-22)tests/ocp/internal/ocpconfig/config.go (1)
OcpConfig(20-22)tests/ocp/sriov/internal/ocpsriovinittools/ocpsriovinittools.go (1)
SriovOcpConfig(13-13)
🪛 LanguageTool
tests/ocp/sriov/tests/README.md
[grammar] ~493-~493: Ensure spelling is correct
Context: ... up resources using DeferCleanup ### AfterAll Hook 1. Policy Cleanup: Removes al...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🔇 Additional comments (18)
tests/ocp/sriov/tests/README.md (3)
1-15: Clear and well-structured documentation.The overview section effectively introduces the test suite, its purpose, and the key information users need (file location, labels, namespace, container). The structure follows documentation best practices.
16-85: Comprehensive prerequisites section with helpful examples.The cluster requirements and OpenShift configuration sections are thorough, including verification commands and a practical MachineConfig example for DPDK hugepages. This will help users set up their test environment correctly.
492-497: Minor grammar issue: "AfterAll" spelling.Static analysis flagged "AfterAll" - this is actually correct as it refers to the Ginkgo
AfterAllhook. No change needed.tests/ocp/sriov/internal/tsparams/consts.go (4)
14-16: LGTM - Well-defined label constant.The
LabelBasicconstant follows the existing naming convention and provides a clear label for test filtering.
21-44: Reasonable timeout constants with good documentation.The timeout values are well-chosen for SR-IOV operations:
PolicyApplicationTimeoutat 35 minutes matchesMCOWaitTimeout, appropriate for MCP updatesPodReadyTimeoutat 5 minutes allows for SR-IOV device attachment- Polling intervals (2-10 seconds) are reasonable for avoiding API server overload
46-53: Good use of locally administered MAC addresses.The
02:prefix in MAC addresses correctly indicates locally administered addresses, avoiding conflicts with real hardware MAC addresses. The/24CIDR notation for IPs is appropriate for test networks.
64-67: Rate limiting constants look reasonable.The
TestMinTxRate(40 Mbps) andTestMaxTxRate(100 Mbps) provide a valid range for rate limiting tests. Note that some devices (BCM) don't supportminTxRate, which is handled in the test code.tests/ocp/sriov/internal/ocpsriovconfig/config.go (5)
45-75: Excellent documentation for the temporary struct pattern.The detailed comments explaining why separate
sriovYAMLConfigandsriovEnvConfigstructs are needed to avoid yaml.v2's pointer-zeroing behavior is very helpful for future maintainers. This is a known gotcha with yaml.v2 and the workaround is correct.
76-88: Default devices are well-defined with feature flags.The
SupportsMinTxRatefield correctly distinguishes devices that support rate limiting from those that don't (BCM and x710 devices). This enables graceful skipping of unsupported features in tests.
104-113: Initialization order is correct.Setting defaults before reading config files ensures that:
- Defaults are available if config files don't specify values
- Config files can override defaults
- Environment variables can override config file values
The error handling for
runtime.Caller(0)failure is appropriate.
141-157: Defensive accessor methods with safe defaults.
GetDevices()andGetVFNum()properly return defaults when the config is empty or invalid. However,GetDevices()returnsGetDefaultDevices()whenlen(c.Devices) == 0, butc.Devicesis already set to defaults inNewSriovOcpConfig(). This redundancy is harmless but could be simplified.
259-306: Device parsing handles edge cases well.The
parseDevicesFromEnvfunction:
- Trims whitespace from entries and parts
- Handles optional 5th field for
SupportsMinTxRate- Logs warnings for invalid entries without failing
- Only overwrites defaults if at least one valid device is parsed
One minor improvement: the comment at line 301 could be clearer.
tests/ocp/sriov/internal/sriovenv/sriovenv.go (6)
39-42: Simple and effective operator status check.The
CheckSriovOperatorStatusfunction correctly delegates to the existingsriovoperator.IsSriovDeployedutility, maintaining consistency with other test suites.
87-137: Clean functional options pattern for network configuration.The
NetworkOptiontype andWith*functions provide a flexible, extensible API for configuring SR-IOV networks. This follows Go best practices for builder patterns.
509-585: Comprehensive VF status verification with traffic test.The
CheckVFStatusWithPassTrafficfunction properly:
- Creates client/server pods with cleanup via
defer- Verifies interface readiness and carrier status
- Conditionally verifies spoof checking based on description
- Tests actual connectivity with ping
Good defensive cleanup pattern.
683-722: Debug pod execution with proper cleanup.The
executeOnNodefunction correctly:
- Truncates pod name to 63 characters (K8s limit)
- Cleans up existing debug pods before creating new ones
- Uses
deferfor cleanup- Uses
nsenterto execute commands in the host namespaceOne consideration: the 63-character truncation might create collisions for very long node names with similar prefixes.
771-839: Best-effort cleanup with good logging.The
CleanupLeftoverResourcesfunction appropriately:
- Logs warnings but continues on errors (best-effort)
- Uses regex patterns to match test resources
- Cleans up namespaces, networks, and policies
The regex pattern
^\d{5}-|\w+dpdknet$correctly matches test case ID prefixes and DPDK network suffixes.
728-765: PCI address extraction handles namespaced network names.The
GetPciAddressfunction correctly strips the namespace prefix from network names usingstrings.LastIndex. The JSON struct for parsing network status is appropriately minimal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between acd5dacc020bde2aaebbc9622d18f297bde606b0 and 30bb5d969242eb732ce9a179a07f22a95dbb391f.
📒 Files selected for processing (1)
tests/ocp/sriov/tests/README.md(1 hunks)
🧰 Additional context used
🪛 LanguageTool
tests/ocp/sriov/tests/README.md
[grammar] ~493-~493: Ensure spelling is correct
Context: ... up resources using DeferCleanup ### AfterAll Hook 1. Policy Cleanup: Removes al...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🔇 Additional comments (1)
tests/ocp/sriov/tests/README.md (1)
420-443: MTU documentation correctly references DefaultTestMTU.The MTU section now properly documents the value as
9000(defined byDefaultTestMTU), which aligns with the actual test implementation and addresses the previous documentation mismatch concern.
tests/ocp/sriov/tests/README.md
Outdated
| #### Device Configuration | ||
|
|
||
| If `SRIOV_DEVICES` is not set, the test suite uses default device configurations: | ||
|
|
||
| - `e810xxv` (DeviceID: 159b, Vendor: 8086, Interface: eno12409) | ||
| - `e810c` (DeviceID: 1593, Vendor: 8086, Interface: ens2f2) | ||
| - `x710` (DeviceID: 1572, Vendor: 8086, Interface: ens5f0) - NO-CARRIER | ||
| - `bcm57414` (DeviceID: 16d7, Vendor: 14e4, Interface: ens4f1np1) | ||
| - `bcm57508` (DeviceID: 1750, Vendor: 14e4, Interface: ens3f0np0) - NO-CARRIER | ||
| - `e810back` (DeviceID: 1591, Vendor: 8086, Interface: ens4f2) | ||
| - `cx7anl244` (DeviceID: 1021, Vendor: 15b3, Interface: ens2f0np0) | ||
|
|
||
| **Note**: Devices marked with "NO-CARRIER" may be skipped if the interface has no carrier signal. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Inspect GetDefaultDevices() implementation
echo "=== GetDefaultDevices() function ==="
rg -A 150 'func GetDefaultDevices' tests/ocp/sriov/internal/ocpsriovconfig/config.go
echo ""
echo "=== Check for device-related mappings/constants ==="
rg -B 2 -A 5 'map\[.*DeviceConfig|var.*Device' tests/ocp/sriov/internal/ocpsriovconfig/config.goRepository: rh-ecosystem-edge/eco-gotests
Length of output: 4745
🏁 Script executed:
#!/bin/bash
# Also check the README.md file structure
echo "=== README.md device configuration section (lines 100-120) ==="
sed -n '100,120p' tests/ocp/sriov/tests/README.mdRepository: rh-ecosystem-edge/eco-gotests
Length of output: 1123
🏁 Script executed:
#!/bin/bash
# Check for line 493 grammar issue
echo "=== README.md around line 493 ==="
sed -n '490,495p' tests/ocp/sriov/tests/README.mdRepository: rh-ecosystem-edge/eco-gotests
Length of output: 323
Device configuration list is out of sync with GetDefaultDevices() in config.go.
The documented device list has multiple discrepancies with the actual implementation:
- Missing devices:
cx6(ens6f0np0) andcx6dx(ens7f0np0) are not documented - Non-existent device:
e810backappears in documentation but doesn't exist in the code - Interface name mismatches:
bcm57414: documented asens4f1np1, actuallyens3f0bcm57508: documented asens3f0np0, actuallyens4f0cx7anl244: documented asens2f0np0, actuallycx7withens8f0np0
Regenerate the device list directly from GetDefaultDevices() to ensure users following the documentation have accurate information.
🤖 Prompt for AI Agents
In tests/ocp/sriov/tests/README.md around lines 103 to 115, the documented
default device list is out of sync with GetDefaultDevices() in config.go; update
the README by regenerating the device table directly from GetDefaultDevices()
and replacing the listed entries: remove the non-existent e810back, add missing
cx6 (ens6f0np0) and cx6dx (ens7f0np0), and correct interface names for bcm57414
to ens3f0, bcm57508 to ens4f0, and cx7anl244 to cx7 with ens8f0np0; ensure
NO-CARRIER annotations are preserved where applicable and commit the updated
README so docs match the implementation.
252dc0c to
493fc7d
Compare
tests/ocp/sriov/README.md
Outdated
| #### Optional | ||
| - `ECO_OCP_SRIOV_INTERFACE_LIST`: Comma-separated list of SR-IOV interfaces | ||
| - `ECO_OCP_SRIOV_TEST_IMAGE`: Container image for test workloads |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| - `ECO_OCP_SRIOV_TEST_CONTAINER`: Container image for test workloads |
| // MCPStableInterval represents polling interval for MCP stability checks. | ||
| MCPStableInterval = 10 * time.Second | ||
| // NADTimeout represents timeout for NAD creation. | ||
| NADTimeout = 2 * time.Minute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in worst case, this should be 5-10seconds.
if it takes more than that we better consider it a failure.
|
@zhiqiangf |
493fc7d to
5b0f95a
Compare
- Add DeviceConfig struct for SR-IOV device configuration - Add GetSriovDevices() method with ECO_OCP_SRIOV_DEVICES env var support - Add GetVFNum() method with error handling - Move all defaults to default.yaml - Fix suite-level labels to exclude feature-specific labels - Add constants for timeouts, labels, and test parameters
fd0aa7b to
0d69e5b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
♻️ Duplicate comments (2)
tests/ocp/sriov/tests/basic.go (1)
150-158: InitVF error handling still prevents graceful device skipping.The past review comment on this pattern remains applicable.
initVFWithDevTypeinsriovenv.go(line 248) returns(false, fmt.Errorf(...))when no matching device is found on any node, so theExpect(err).ToNot(HaveOccurred())on line 152 will fail before theif !resultskip logic can execute.This pattern is repeated in all 8 test cases. To implement the intended "skip devices not found" behavior, either:
- Modify
initVFWithDevTypeto return(false, nil)for "device not present" scenarios, or- Use a sentinel error type and check with
errors.Isbefore theExpect.tests/ocp/sriov/internal/sriovenv/sriovenv.go (1)
243-249:initVFWithDevTypestill cannot gracefully indicate "device not found".The previous review comment remains unaddressed. Line 248 returns
(false, fmt.Errorf(...))for all failure cases, making it impossible for callers to distinguish "no suitable device on any worker" from genuine API/configuration failures.This prevents the test skip logic in
basic.gofrom working as documented. Consider returning(false, nil)when the device simply isn't present on any node, reserving errors for actual failures.
🧹 Nitpick comments (2)
tests/ocp/sriov/internal/tsparams/consts.go (1)
50-53: Consider consolidating duplicate IP constants.
TestPodClientIPandTestPodServerIPduplicateClientIPv4IPAddressandServerIPv4IPAddressfromocpsriovvars.go. Consider using a single source of truth for these values.tests/ocp/sriov/internal/ocpsriovconfig/config.go (1)
129-134: Consider documenting accepted values for minTxRate.The parsing accepts
true,1, oryes(case-insensitive) as true values, and anything else defaults to false. This is reasonable behavior but could be clearer in the format documentation on line 110-111.Suggested documentation improvement
// parseSriovDevicesEnv parses device configuration from ECO_OCP_SRIOV_DEVICES environment variable. -// Format: "name1:deviceid1:vendor1:interface1[:minTxRate],name2:deviceid2:vendor2:interface2[:minTxRate],..." +// Format: "name1:deviceid1:vendor1:interface1[:minTxRate],name2:deviceid2:vendor2:interface2[:minTxRate],..." +// minTxRate accepts: true/1/yes for enabled, any other value (including omitted) defaults to false. // Example: "e810xxv:159b:8086:ens2f0,e810c:1593:8086:ens2f2".
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 941b840fe473a30924a9bdb82b503809368aac2c and 0d69e5b.
📒 Files selected for processing (7)
tests/ocp/sriov/README.md(2 hunks)tests/ocp/sriov/internal/ocpsriovconfig/config.go(3 hunks)tests/ocp/sriov/internal/ocpsriovconfig/default.yaml(1 hunks)tests/ocp/sriov/internal/sriovenv/sriovenv.go(1 hunks)tests/ocp/sriov/internal/tsparams/consts.go(2 hunks)tests/ocp/sriov/internal/tsparams/ocpsriovvars.go(2 hunks)tests/ocp/sriov/tests/basic.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/ocp/sriov/README.md
🧰 Additional context used
🧬 Code graph analysis (3)
tests/ocp/sriov/internal/ocpsriovconfig/config.go (2)
tests/ocp/internal/ocpconfig/config.go (1)
OcpConfig(20-22)tests/ocp/sriov/internal/ocpsriovinittools/ocpsriovinittools.go (1)
SriovOcpConfig(13-13)
tests/ocp/sriov/internal/tsparams/ocpsriovvars.go (1)
tests/ocp/sriov/internal/tsparams/consts.go (1)
LabelSuite(10-10)
tests/ocp/sriov/internal/sriovenv/sriovenv.go (2)
tests/internal/sriovoperator/sriovenv.go (4)
IsSriovDeployed(32-54)WaitForSriovAndMCPStable(80-100)RemoveAllSriovNetworks(156-176)RemoveAllPoliciesAndWaitForSriovAndMCPStable(180-195)tests/internal/cluster/cluster.go (1)
ExecCmdWithStdout(98-180)
🔇 Additional comments (6)
tests/ocp/sriov/internal/tsparams/ocpsriovvars.go (1)
16-18: LGTM!Clear documentation distinguishing suite-level labels from feature-specific labels. The consolidation of
DefaultTimeouttoconsts.goreduces duplication.tests/ocp/sriov/tests/basic.go (2)
45-72: LGTM!Well-structured helper with proper namespace creation, label application, and cleanup via
DeferCleanup. TheEventuallywait ensures namespace readiness before proceeding.
454-532: DPDK test implementation looks correct.The test properly skips BCM devices with a clear bug reference (OCPBUGS-30909), uses per-device namespaces, and verifies PCI address assignment. The
DeferCleanupcorrectly captures the per-iteration namespace via closure.tests/ocp/sriov/internal/tsparams/consts.go (1)
21-44: Well-organized timeout constants.The timeout values are appropriate for SR-IOV operations.
PolicyApplicationTimeoutat 35 minutes aligns withMCOWaitTimeoutfor MachineConfigPool operations, which is correct since SR-IOV policy changes often trigger node reconfiguration.tests/ocp/sriov/internal/sriovenv/sriovenv.go (2)
109-125: Well-designed functional options pattern for network configuration.The
NetworkOptionpattern withWithSpoof,WithTrust,WithVLAN, etc. provides a clean, extensible API for network creation. TheCreateSriovNetworkfunction composes these options nicely.
713-749: Cleanup function handles errors gracefully.Using warnings instead of returning errors on cleanup failures is appropriate since cleanup should be best-effort and not block test completion. The function properly cleans up test namespaces, networks, and policies.
| // GetVFNum returns the configured number of virtual functions. | ||
| // Returns an error if VFNum is not properly configured. | ||
| func (sriovOcpConfig *SriovOcpConfig) GetVFNum() (int, error) { | ||
| if sriovOcpConfig.VFNum <= 0 { | ||
| return 0, fmt.Errorf( | ||
| "no SR-IOV VFs configured, check env var ECO_OCP_SRIOV_VF_NUM") | ||
| } | ||
|
|
||
| return sriovOcpConfig.VFNum, nil | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Error message could mention both configuration sources.
The error message only references the environment variable, but VFNum can also be set via vf_num in the YAML config file.
Suggested fix
func (sriovOcpConfig *SriovOcpConfig) GetVFNum() (int, error) {
if sriovOcpConfig.VFNum <= 0 {
return 0, fmt.Errorf(
- "no SR-IOV VFs configured, check env var ECO_OCP_SRIOV_VF_NUM")
+ "no SR-IOV VFs configured, check vf_num in YAML or env var ECO_OCP_SRIOV_VF_NUM")
}
return sriovOcpConfig.VFNum, nil
}🤖 Prompt for AI Agents
In tests/ocp/sriov/internal/ocpsriovconfig/config.go around lines 148 to 157,
the error returned when VFNum is not set only mentions the environment variable;
update the error text to reference both configuration sources (the
ECO_OCP_SRIOV_VF_NUM environment variable and the vf_num key in the YAML config)
so callers know where to look; modify the fmt.Errorf call to include a concise
message listing both sources (for example: "no SR-IOV VFs configured; set
ECO_OCP_SRIOV_VF_NUM or vf_num in the YAML config") and keep the same error
return shape.
36be2a6 to
05cd886
Compare
05cd886 to
8995f5e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (3)
tests/ocp/sriov/internal/ocpsriovconfig/config.go (1)
151-160: Error message should mention both configuration sources.The error message at line 156 only references the environment variable, but
VFNumcan also be set viavf_numin the YAML config file. This makes troubleshooting harder for users who configure via YAML.tests/ocp/sriov/tests/basic.go (1)
143-179: InitVF error handling prevents graceful device skipping.The pattern used across all tests (also at lines 181-214, 216-249, 251-284, 286-329, 331-364, 366-407, 409-452, 454-532):
result, err := sriovenv.InitVF(...) Expect(err).ToNot(HaveOccurred(), ...) if !result { continue }However,
initVFWithDevTypeinsriovenv.goline 267 returns(false, fmt.Errorf(...))when no suitable device is found on any worker, never(false, nil). This causesExpect(err).ToNot(HaveOccurred())to fail before the skip logic (if !resultor finalif !executed { Skip(...) }) can execute.This contradicts the README's promise to automatically skip devices not found on any worker node.
tests/ocp/sriov/internal/sriovenv/sriovenv.go (1)
200-268:initVFWithDevTypetreats "device not found" as an error, preventing graceful test skips.At line 267, the function returns
(false, fmt.Errorf(...))when no suitable device is found on any worker node. This prevents callers intests/basic.gofrom distinguishing "device not present" from genuine failures, because:result, err := sriovenv.InitVF(...) Expect(err).ToNot(HaveOccurred()) // Fails here for "not found" if !result { Skip(...) } // Never reachedConsider returning
(false, nil)when no matching NIC is discovered (lines 229-232, 249-251, etc.), reserving non-nil errors for real failures (API errors, timeouts). This will allow tests to implement the promised "device skipping" behavior without conflating "not found" with infrastructure failures.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
tests/ocp/sriov/README.mdtests/ocp/sriov/internal/ocpsriovconfig/config.gotests/ocp/sriov/internal/ocpsriovconfig/default.yamltests/ocp/sriov/internal/sriovenv/sriovenv.gotests/ocp/sriov/internal/tsparams/consts.gotests/ocp/sriov/tests/basic.go
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/ocp/sriov/internal/ocpsriovconfig/default.yaml
- tests/ocp/sriov/internal/tsparams/consts.go
🧰 Additional context used
🧬 Code graph analysis (3)
tests/ocp/sriov/tests/basic.go (3)
tests/ocp/sriov/internal/ocpsriovconfig/config.go (1)
DeviceConfig(26-32)tests/ocp/sriov/internal/tsparams/consts.go (12)
BCMVendorID(72-72)NamespaceTimeout(30-30)CleanupTimeout(34-34)LabelBasic(15-15)PodReadyTimeout(32-32)TestVLAN(61-61)TestVlanQoS(63-63)TestMinTxRate(65-65)TestMaxTxRate(67-67)DefaultTestMTU(56-56)NADTimeout(40-40)PollingInterval(28-28)tests/ocp/sriov/internal/sriovenv/sriovenv.go (17)
NetworkOption(59-59)CreateSriovNetwork(111-126)RemoveSriovNetwork(139-152)WithSpoof(62-66)WithTrust(69-73)WithVLAN(76-80)WithVlanQoS(83-87)WithMinTxRate(90-94)WithMaxTxRate(97-101)WithLinkState(104-108)VerifyLinkStateConfiguration(585-637)UpdateSriovPolicyMTU(295-346)InitDpdkVF(196-198)CreateDpdkTestPod(373-392)DeleteDpdkTestPod(395-408)WaitForPodWithLabelReady(411-448)GetPciAddress(705-740)
tests/ocp/sriov/internal/ocpsriovconfig/config.go (3)
tests/cnf/ran/ptp/internal/iface/iface.go (1)
Name(22-22)tests/ocp/internal/ocpconfig/config.go (1)
OcpConfig(20-22)tests/ocp/sriov/internal/ocpsriovinittools/ocpsriovinittools.go (1)
SriovOcpConfig(13-13)
tests/ocp/sriov/internal/sriovenv/sriovenv.go (3)
tests/internal/sriovoperator/sriovenv.go (4)
IsSriovDeployed(32-54)WaitForSriovAndMCPStable(80-100)RemoveAllSriovNetworks(156-176)RemoveAllPoliciesAndWaitForSriovAndMCPStable(180-195)tests/ocp/sriov/internal/tsparams/consts.go (14)
MCPStableInterval(38-38)NADTimeout(40-40)PollingInterval(28-28)NamespaceTimeout(30-30)PolicyApplicationTimeout(42-42)PodReadyTimeout(32-32)TestPodClientIP(51-51)TestPodClientMAC(47-47)CleanupTimeout(34-34)TestPodServerIP(53-53)TestPodServerMAC(49-49)CarrierWaitTimeout(36-36)TestResourceLabelKey(75-75)TestResourceLabelValue(77-77)tests/internal/cluster/cluster.go (1)
ExecCmdWithStdout(98-180)
- Add SR-IOV environment helper functions using eco-goinfra - Use existing sriovoperator functions for stability checks and cleanup - Add network creation with NetworkOption pattern - Add policy creation and VF initialization helpers - Add pod verification and traffic testing utilities
- Add 9 basic SR-IOV test cases with test IDs as labels - Test spoof checking, trust, VLAN, link state, MTU, min/max tx rate - Add DPDK test with BCM NIC skip (OCPBUGS-30909) - Use constants from tsparams for test values - Add proper cleanup in AfterAll and AfterEach
- Add comprehensive README with test descriptions and configuration - Document environment variables and test labels - Fix NADTimeout to 10 seconds (was 2 minutes)
8995f5e to
e501a4c
Compare
e501a4c to
16dc562
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (2)
tests/ocp/sriov/tests/basic.go (1)
143-179: InitVF error handling still prevents graceful skips for "device not found" scenarios.The past review comment remains applicable. When
InitVFreturns an error (line 152), the test fails immediately viaExpect(err).ToNot(HaveOccurred()). However, the underlyinginitVFWithDevTypereturns(false, error)when no device is found on any node, not(false, nil).This means devices not present on any worker will cause test failures rather than skips, contradicting the intended skip behavior (lines 176-178).
Consider either:
- Modifying
initVFWithDevTypeto return(false, nil)for "no device found" cases- Using a sentinel error and checking with
errors.Isbefore theExpecttests/ocp/sriov/internal/sriovenv/sriovenv.go (1)
200-268: initVFWithDevType returns error for "no device found", preventing graceful skips.As noted in the past review comment, this function returns
(false, fmt.Errorf(...))on line 267 when no node has the device. This causes callers usingExpect(err).ToNot(HaveOccurred())to fail instead of skip.Consider returning
(false, nil)when no suitable node/device is found, reserving errors for actual failures (API errors, timeouts, etc.). This would allow theif !result { continue }andif !executed { Skip(...) }patterns in tests to work as intended.
🧹 Nitpick comments (2)
tests/ocp/sriov/tests/basic.go (1)
47-72: Test namespace should include the test resource label for reliable cleanup.The
setupTestNamespacefunction appliesparams.PrivilegedNSLabelsbut doesn't includetsparams.TestResourceLabelKey. TheCleanupLeftoverResourcesfunction insriovenv.gofirst looks for namespaces with this label before falling back to thee2e-prefix match.Adding the test label ensures consistent cleanup behavior and aligns with the documented cleanup strategy.
🔎 Proposed fix
func setupTestNamespace(testID string, data sriovconfig.DeviceConfig) string { testNamespace := "e2e-" + testID + data.Name By(fmt.Sprintf("Creating test namespace %q", testNamespace)) nsBuilder := namespace.NewBuilder(APIClient, testNamespace) for key, value := range params.PrivilegedNSLabels { nsBuilder.WithLabel(key, value) } + nsBuilder.WithLabel(tsparams.TestResourceLabelKey, tsparams.TestResourceLabelValue) _, err := nsBuilder.Create()tests/ocp/sriov/internal/tsparams/consts.go (1)
25-28: Clarify the distinction between RetryInterval and PollingInterval.Both
RetryInterval(3s) andPollingInterval(2s) serve similar purposes but have slightly different values. Consider adding comments to clarify when to use each, or consolidate to a single polling interval constant for consistency.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
tests/ocp/sriov/README.mdtests/ocp/sriov/internal/sriovenv/sriovenv.gotests/ocp/sriov/internal/tsparams/consts.gotests/ocp/sriov/tests/basic.go
🧰 Additional context used
🧬 Code graph analysis (1)
tests/ocp/sriov/internal/sriovenv/sriovenv.go (2)
tests/internal/sriovoperator/sriovenv.go (4)
IsSriovDeployed(32-54)WaitForSriovAndMCPStable(80-100)RemoveAllSriovNetworks(156-176)RemoveAllPoliciesAndWaitForSriovAndMCPStable(180-195)tests/internal/cluster/cluster.go (1)
ExecCmdWithStdout(98-180)
🔇 Additional comments (20)
tests/ocp/sriov/README.md (2)
34-49: LGTM! Directory structure is well-documented.The updated directory structure accurately reflects the new internal packages (
ocpsriovconfig,ocpsriovinittools,sriovenv,sriovocpenv) and test files. The organization follows the established pattern for the eco-gotests framework.
468-473: Environment variable documentation looks correct.The environment variables are properly documented with format examples. Note that
ECO_OCP_SRIOV_TEST_CONTAINERaligns with the previous review comment suggestion (renamed fromECO_OCP_SRIOV_TEST_IMAGE).tests/ocp/sriov/tests/basic.go (6)
25-36: LGTM! Helper function for NO-CARRIER detection is clean.The
isNoCarrierErrorfunction correctly handles nil errors and checks for both common carrier error messages.
38-43: LGTM! BCM device detection is straightforward.The function correctly identifies Broadcom NICs by both name pattern and vendor ID.
74-88: LGTM! Network setup helper is well-structured.The function correctly creates the SR-IOV network and registers cleanup via
DeferCleanup.
102-121: LGTM! BeforeAll setup is comprehensive.The setup properly validates the SR-IOV operator status, loads configuration, and discovers worker nodes with appropriate assertions.
123-141: AfterAll cleanup continues despite individual errors - good pattern.The cleanup correctly collects errors rather than failing on the first one, ensuring all policies get cleanup attempts. The final stability wait ensures the cluster is ready for subsequent test runs.
454-532: DPDK test case is well-structured with proper cleanup and verification.The test correctly:
- Skips BCM devices with appropriate documentation reference
- Creates NAD and waits for readiness
- Verifies PCI address assignment
- Validates network-status annotation
The
DeferCleanupfor the DPDK pod ensures cleanup even on test failure.tests/ocp/sriov/internal/tsparams/consts.go (3)
14-15: LGTM! New basic test label added.The
LabelBasicconstant provides a clear label for filtering basic SR-IOV test cases.
21-44: LGTM! Timeout constants are well-organized and documented.The timeouts are appropriately sized for their operations:
WaitTimeout(20 min) for long operationsDefaultTimeout(300s) for general operationsPodReadyTimeout(300s) for pod startupNADTimeout(10s) aligns with the previous reviewer's suggestion of 5-10 seconds
46-77: LGTM! Test constants are well-documented.MAC addresses, IP addresses, MTU, VLAN, TX rate limits, and vendor IDs are all properly documented and follow consistent naming conventions.
tests/ocp/sriov/internal/sriovenv/sriovenv.go (9)
37-52: LGTM! Operator and policy readiness checks are clean wrappers.The functions properly delegate to the existing
sriovoperatorpackage functions with appropriate configuration.
58-108: LGTM! NetworkOption pattern provides clean, composable network configuration.The functional options pattern allows flexible network configuration without bloating function signatures.
110-126: LGTM! Network creation with NAD wait is well-implemented.The function creates the network with all options applied and waits for the NAD to be created before returning.
138-152: RemoveSriovNetwork correctly handles already-deleted case.The function properly checks for
IsNotFoundbefore attempting deletion and uses the builder'sDeleteAndWaitmethod.
292-333: MTU update now preserves all spec fields via DeepCopy - previous concern addressed.The implementation correctly stores a deep copy of the entire spec before deletion and applies it to the new policy. This preserves all fields including
IsRdma,LinkType,ESwitchMode,ExternallyManaged,VdpaType, etc.
381-396: DeleteDpdkTestPod handles eco-goinfra error messages correctly.The function checks both
apierrors.IsNotFoundand the eco-goinfra-specific "does not exist" message, ensuring graceful handling when the pod is already deleted.
491-569: CheckVFStatusWithPassTraffic creates thorough connectivity verification.The function:
- Generates unique pod names to avoid collisions
- Creates client/server pods with cleanup via defer
- Verifies interface readiness and carrier status
- Conditionally checks spoof configuration
- Tests actual ping connectivity
734-788: CleanupLeftoverResources provides comprehensive cleanup with good error handling.The function:
- Uses label-based selection first (preferred)
- Falls back to name prefix matching (backwards compatibility)
- Logs warnings instead of failing on individual errors
- Uses existing
sriovoperatorfunctions for network/policy cleanupThis approach is robust for test cleanup scenarios.
758-772: The field selector suggestion is not applicable—Kubernetes field selectors only support exact matching (=, ==, !=) and cannot filter by name prefix.The current implementation (list all namespaces and filter client-side using
strings.HasPrefix) is the correct and only viable approach for this use case. Since the code is test cleanup and already falls back gracefully when listing fails, this implementation is acceptable as-is.Likely an incorrect or invalid review comment.
Add detailed documentation for SR-IOV test suite:
variables, device configuration, and execution examples
Documentation covers:
Summary by CodeRabbit
New Features
Tests
Documentation
Refactor
✏️ Tip: You can customize this high-level summary in your review settings.