ocp-sriov: add comprehensive test documentation #987

zhiqiangf · 2025-12-05T02:53:05Z

Add detailed documentation for SR-IOV test suite:

tests/README.md: Complete test case documentation with environment
variables, device configuration, and execution examples
Updated sriov/README.md reference

Documentation covers:

Test prerequisites and cluster requirements
Environment variable configuration (SRIOV_DEVICES, etc.)
Supported device configurations
Test execution examples and filtering"

Summary by CodeRabbit

New Features
- SR‑IOV config now accepts device lists with per-device options and VF count; new network options for spoof, trust, VLAN, QoS, min/max TX rate and link state.
Tests
- Added a comprehensive SR‑IOV basic test suite covering networks, policies, DPDK, link state, MTU and per-device scenarios.
Documentation
- Environment vars updated: ECO_OCP_SRIOV_TEST_IMAGE renamed to ECO_OCP_SRIOV_TEST_CONTAINER; expanded env var docs for devices, VF count, operator namespace and interface list.
Refactor
- Reorganized SR‑IOV test internals and removed a deprecated default timeout constant.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

coderabbitai · 2025-12-05T02:53:10Z

📝 Walkthrough

Walkthrough

Adds a CNF-style SR-IOV test framework for OCP: new internal packages (ocpsriovconfig, ocpsriovinittools, sriovocpenv, sriovenv, tsparams changes), device/VF parsing and defaults, many test constants, and a new Ginkgo "basic" SR-IOV test suite. README env var names/formats updated.

Changes

Cohort / File(s)	Summary
Documentation `tests/ocp/sriov/README.md`	Updated environment variable names and examples (ECO_OCP_SRIOV_TEST_IMAGE → ECO_OCP_SRIOV_TEST_CONTAINER) and documented new optional env vars (ECO_OCP_SRIOV_DEVICES, ECO_OCP_SRIOV_VF_NUM, ECO_OCP_SRIOV_OPERATOR_NAMESPACE, ECO_OCP_SRIOV_INTERFACE_LIST).
OCP SR-IOV config `tests/ocp/sriov/internal/ocpsriovconfig/config.go`, `tests/ocp/sriov/internal/ocpsriovconfig/default.yaml`	Added `DeviceConfig` type and fields `Devices`, `DevicesEnv`, `VFNum`; parse ECO_OCP_SRIOV_DEVICES; added `GetSriovDevices`, `parseSriovDevicesEnv`, `GetVFNum`; adjusted default path and updated default YAML (vf_num, devices).
Init tools & env helpers `tests/ocp/sriov/internal/ocpsriovinittools/ocpsriovinittools.go`, `tests/ocp/sriov/internal/sriovocpenv/*`, `tests/ocp/sriov/internal/sriovenv/sriovenv.go`	Added init tooling and a large sriovenv package exposing NetworkOption builder, network/NAD/policy lifecycle functions, InitVF/InitDpdkVF, pod creation/deletion, interface/PCI utilities, verification and cleanup helpers.
Test constants & vars `tests/ocp/sriov/internal/tsparams/consts.go`, `tests/ocp/sriov/internal/tsparams/ocpsriovvars.go`	Introduced many new test constants (timeouts, VLAN/QoS/min/max TX, MTU, MAC/IP defaults, vendor IDs, resource label keys); removed `DefaultTimeout` and adjusted Labels documentation.
Test suite `tests/ocp/sriov/tests/basic.go`	New ordered Ginkgo test suite implementing SR-IOV basic scenarios: spoof/trust permutations, VLAN/QoS/min-max TX, MTU updates, DPDK VF flows, link-state checks, per-device iteration with conditional skipping and cleanup orchestration.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Possibly related PRs

ocp-sriov: added SR-IOV reinstallation tests #972 — Overlaps with ocpsriovconfig changes and added parsing/fields.
ocp-sriov: Add SR-IOV test suite and initial configuration #947 — Prior SR-IOV test-suite additions; likely shares related test helpers and flows.
cnf network: add sriovNetwork on AppNS tests #845 — Related modifications to sriovenv and NAD-related helpers.

Suggested reviewers

evgenLevin
ajaggapa
gkopels

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'ocp-sriov: add comprehensive test documentation' accurately summarizes the main change: adding comprehensive documentation for the SR-IOV test suite, including README updates and environment variable documentation.
Docstring Coverage	✅ Passed	Docstring coverage is 95.35% which is sufficient. The required threshold is 80.00%.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Command failed

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (4)

tests/ocp/sriov/internal/tsparams/consts.go (1)
1-70: Well-structured test constants file.

The constants are well-documented with appropriate timeout values. The labeling mechanism with a base Labels slice that individual tests can augment is a clean pattern for test filtering.

A minor note: the MAC addresses 20:04:0f:f1:88:01 and 20:04:0f:f1:88:02 have the locally administered bit set (bit 1 of the first octet, 0x02), which is correct for test/virtual interfaces. However, the current values (0x20) have the multicast bit set instead. For strictly correct locally administered unicast MACs, consider using addresses like 02:04:0f:f1:88:01.
-	// TestPodClientMAC is the default MAC address for test client pods.
-	TestPodClientMAC = "20:04:0f:f1:88:01"
-	// TestPodServerMAC is the default MAC address for test server pods.
-	TestPodServerMAC = "20:04:0f:f1:88:02"
+	// TestPodClientMAC is the default MAC address for test client pods.
+	TestPodClientMAC = "02:04:0f:f1:88:01"
+	// TestPodServerMAC is the default MAC address for test server pods.
+	TestPodServerMAC = "02:04:0f:f1:88:02"
tests/ocp/sriov/sriov_suite_test.go (1)
36-80: BeforeSuite handles namespace cleanup well.

The logic to detect pre-existing namespaces from previous partial failures and recreate them is robust. However, the namespace label setup is duplicated between lines 49-51 and 65-68. Consider extracting to a helper function.
+func createTestNamespace() *namespace.Builder {
+	ns := namespace.NewBuilder(APIClient, tsparams.TestNamespaceName)
+	for key, value := range params.PrivilegedNSLabels {
+		ns.WithLabel(key, value)
+	}
+	return ns
+}
+
 var _ = BeforeSuite(func() {
 	// ... validation and cleanup ...
 
 	By("Creating test namespace with privileged labels")
-	testNS = namespace.NewBuilder(APIClient, tsparams.TestNamespaceName)
-	for key, value := range params.PrivilegedNSLabels {
-		testNS.WithLabel(key, value)
-	}
+	testNS = createTestNamespace()
 
 	_, err = testNS.Create()
 	if err != nil {
 		if apierrors.IsAlreadyExists(err) {
 			// ... delete existing ...
-			testNS = namespace.NewBuilder(APIClient, tsparams.TestNamespaceName)
-			for key, value := range params.PrivilegedNSLabels {
-				testNS.WithLabel(key, value)
-			}
+			testNS = createTestNamespace()
 			// ...
tests/ocp/sriov/internal/sriovenv/sriovenv.go (2)
716-760: Consider returning accumulated errors for observability.

The cleanup function ignores all errors and always returns nil. While best-effort cleanup is appropriate, consider logging errors at a higher verbosity level or returning an aggregate error for observability.
 func CleanupLeftoverResources() error {
+	var errs []error
 	sriovOpNs := SriovOcpConfig.OcpSriovOperatorNamespace
 
 	klog.V(90).Info("Cleaning up leftover test resources")
 
 	// Cleanup test namespaces
-	namespaces, _ := namespace.List(APIClient, metav1.ListOptions{})
+	namespaces, err := namespace.List(APIClient, metav1.ListOptions{})
+	if err != nil {
+		klog.V(90).Infof("Warning: failed to list namespaces: %v", err)
+	}
 	// ... rest of cleanup ...
 
 	klog.V(90).Info("Cleanup completed")
 
 	return nil
 }
732-740: Document the naming convention matched by cleanup regex.

The regex ^\d{5}-|\w+dpdknet$ matches specific naming patterns. Consider adding a comment explaining what test resource names follow these patterns to help future maintainers understand which resources are targeted for cleanup.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b1c772b and 941b840fe473a30924a9bdb82b503809368aac2c.

📒 Files selected for processing (10)

tests/ocp/internal/ocpconfig/config.go (2 hunks)
tests/ocp/sriov/README.md (1 hunks)
tests/ocp/sriov/internal/ocpsriovconfig/config.go (4 hunks)
tests/ocp/sriov/internal/ocpsriovconfig/default.yaml (0 hunks)
tests/ocp/sriov/internal/sriovenv/sriovenv.go (1 hunks)
tests/ocp/sriov/internal/tsparams/consts.go (1 hunks)
tests/ocp/sriov/internal/tsparams/sriovvars.go (1 hunks)
tests/ocp/sriov/sriov_suite_test.go (1 hunks)
tests/ocp/sriov/tests/README.md (1 hunks)
tests/ocp/sriov/tests/basic.go (1 hunks)

💤 Files with no reviewable changes (1)

tests/ocp/sriov/internal/ocpsriovconfig/default.yaml

🧰 Additional context used

🧬 Code graph analysis (2)

tests/ocp/sriov/sriov_suite_test.go (5)

tests/ocp/sriov/internal/ocpsriovconfig/config.go (1)

SriovOcpConfig (21-26)

tests/ocp/sriov/internal/ocpsriovinittools/ocpsriovinittools.go (2)

SriovOcpConfig (13-13)

APIClient (11-11)

tests/ocp/sriov/internal/tsparams/consts.go (1)

Labels (69-69)

tests/ocp/sriov/internal/tsparams/ocpsriovvars.go (1)

Labels (14-14)

tests/ocp/sriov/internal/sriovenv/sriovenv.go (2)

CleanupLeftoverResources (717-760)

CheckSriovOperatorStatus (40-42)

tests/ocp/sriov/internal/ocpsriovconfig/config.go (1)

tests/ocp/sriov/internal/ocpsriovinittools/ocpsriovinittools.go (1)

SriovOcpConfig (13-13)

🪛 LanguageTool

tests/ocp/sriov/tests/README.md

[grammar] ~493-~493: Ensure spelling is correct
Context: ... up resources using DeferCleanup ### AfterAll Hook 1. Policy Cleanup: Removes al...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🔇 Additional comments (22)

tests/ocp/internal/ocpconfig/config.go (1)

38-40: Pointer preservation around YAML decode looks reasonable; confirm null‑intent semantics

Capturing the existing GeneralConfig pointer and restoring it only when YAML decoding leaves it nil is a pragmatic way to avoid losing the initialized config during Decode, and should help prevent nil dereferences downstream.

One subtle behavioral change to be aware of: if the YAML were ever to explicitly set this embedded pointer to null, this restoration logic would override that and keep a non‑nil GeneralConfig. If "config cannot be explicitly disabled/null" is the intended contract, this is fine; otherwise you might want a more specific condition for when to restore (e.g., only for an actually empty/default file or based on file size/contents).

Also applies to: 52-55

tests/ocp/sriov/README.md (3)

1-10: Comprehensive and well-structured documentation.

The README provides excellent coverage of test development guidelines, including directory structure, naming conventions, Ginkgo patterns, and best practices. This will be valuable for maintaining consistency across the SR-IOV test suite.

404-406: Good choice using klog over glog.

Using klog is the correct choice as it's the standard logging library in the Kubernetes ecosystem and provides better integration with Kubernetes tooling.

224-257: Excellent guidance on helper function design.

The clear distinction between where Gomega/Ginkgo can be used (test files only) versus helper functions (return errors instead) is a critical best practice that prevents common anti-patterns in test code organization.

tests/ocp/sriov/tests/README.md (3)

1-15: Well-documented test suite README.

The documentation provides clear information about the test file, labels, namespace, and container configuration. This will help users understand the test suite quickly.

86-116: Comprehensive environment variable documentation.

Good coverage of both mandatory and optional environment variables, including the fallback behavior and default device configurations. The format specification for SRIOV_DEVICES is particularly helpful.

475-520: Clear test execution flow documentation.

The documentation of BeforeAll, test execution patterns, AfterAll hooks, and resource management behavior helps users understand the test lifecycle. The note about automatic cleanup of leftover resources is valuable.

Note: The static analysis hint about "AfterAll" spelling is a false positive—it's a valid Ginkgo hook name.

tests/ocp/sriov/internal/ocpsriovconfig/config.go (3)

28-41: Good decoupling of YAML/env config from embedded structs.

The introduction of sriovYAMLConfig and sriovEnvConfig helper types prevents the YAML decoder from zeroing out embedded OcpConfig/GeneralConfig pointers during decoding. This is a solid approach to preserve the parent config state.

57-62: Defensive check for runtime.Caller failure.

Good addition to handle the edge case where runtime.Caller fails. This prevents potential panics from using invalid path information.

122-134: Correct conditional application of environment overrides.

The pattern of checking for non-empty values before applying ensures that YAML defaults aren't overwritten by empty environment variables. This is the expected behavior for configuration layering.

tests/ocp/sriov/tests/basic.go (6)

22-32: Utility function for NO-CARRIER detection.

Clean implementation checking for NO-CARRIER conditions. This enables graceful skipping of devices without physical link.

34-60: Well-structured namespace setup with automatic cleanup.

The setupTestNamespace helper correctly creates namespaces with privileged labels and registers cleanup via DeferCleanup. The Eventually check ensures namespace existence before proceeding.

77-99: Proper test suite structure with BeforeAll validation.

The test suite correctly uses Ordered, Label, and ContinueOnFailure as documented in the README. The BeforeAll hook validates operator status and discovers worker nodes before any tests run.

101-119: Robust policy cleanup in AfterAll.

Good approach collecting cleanup errors without failing early, then logging warnings for any failures. The final stability wait ensures the cluster is ready after cleanup.

121-157: Consistent test case pattern.

The test case structure (device iteration, VF initialization, namespace/network setup, verification, skip on NO-CARRIER) is well-designed and consistently applied across all test cases.

458-465: Consider adding cleanup for NAD polling failure case.

The Eventually block for NAD readiness doesn't have cleanup registered before it. If NAD polling fails, the namespace cleanup will still run, but the network created at line 456 may leave orphaned resources.

This is a minor edge case since DeferCleanup at line 69-74 (via setupSriovNetwork) should handle it, but verify that RemoveSriovNetwork handles non-existent networks gracefully.

tests/ocp/sriov/internal/tsparams/sriovvars.go (4)

15-22: Clear DeviceConfig struct definition.

The struct captures all necessary device configuration with a useful SupportsMinTxRate flag for conditional test behavior.

41-75: Robust environment variable parsing.

The parser correctly handles edge cases: empty entries are skipped, malformed entries log warnings but don't fail parsing entirely, and all values are trimmed. The format documentation in the comment is helpful.

77-92: Fail-fast behavior for invalid configuration.

The panic when SRIOV_DEVICES is set but no valid entries are parsed is appropriate—it prevents tests from silently running with unintended default configurations when the user explicitly set the variable.

94-113: Proper fallback chain for VF count.

The function correctly prioritizes ECO_OCP_SRIOV_VF_NUM over SRIOV_VF_NUM, validates for positive integers, logs warnings for invalid values, and falls back to a sensible default of 2.

tests/ocp/sriov/internal/sriovenv/sriovenv.go (1)

215-266: Verify single-node VF initialization is intentional.

The function iterates through worker nodes but returns immediately after successfully creating a policy on the first node. This means VFs are only initialized on one node rather than all workers. If tests require VFs on multiple nodes, this logic would need adjustment.

tests/ocp/sriov/sriov_suite_test.go (1)

90-96: Verify ReporterNamespacesToDump and ReporterCRDsToDump are defined in tsparams package.

The tsparams.ReporterNamespacesToDump and tsparams.ReporterCRDsToDump variables are referenced in the test but were not found in the provided consts file. Confirm these are defined in the tsparams package or add the definitions if missing.

tests/ocp/sriov/internal/sriovenv/sriovenv.go

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (4)

tests/ocp/sriov/tests/README.md (1)

11-16: Clarify single “Test Namespace” vs per-test e2e namespaces

The “Test Suite Information” section declares a single test namespace sriov-tests, while later sections state that each test creates its own namespace with a test case ID prefix. In tests/ocp/sriov/tests/basic.go, pods are created in dynamically named namespaces like e2e-<testID><deviceName>, not sriov-tests.

Consider clarifying that:

sriov-tests is the suite-level namespace (where applicable, e.g., reporter config), and

each basic test also creates its own e2e-* namespace per device.

That will help operators know where to look for resources and logs.

Also applies to: 515-521

tests/ocp/sriov/internal/tsparams/ocpsriovvars.go (1)

11-18: Including LabelBasic in Labels aligns with new filtering docs

Extending Labels to {LabelSuite, LabelBasic} matches the usage in basic.go and the README examples (ocpsriov && basic). This keeps suite-level label-based filtering coherent across code and docs.

You may also want to double-check other suites under tests/ocp/sriov/ to ensure they still behave as expected when ECO_TEST_LABELS is set to this expanded label list.

tests/ocp/sriov/internal/tsparams/consts.go (1)

6-57: Good centralization of timeouts and default test data

Collecting label, timeout/interval, MAC/IP, and MTU constants here makes the SR-IOV tests much easier to tune and reason about. The values are consistent with usages in sriovenv.go and basic.go.

Just make sure any future changes to DefaultTestMTU, PodReadyTimeout, etc., are mirrored in the test documentation so operators don’t get stale numbers.

tests/ocp/sriov/internal/ocpsriovconfig/config.go (1)

26-33: Device defaults and SRIOV_DEVICES parsing look good, but doc should mention the 5th flag

The DeviceConfig / GetDefaultDevices setup and parseDevicesFromEnv logic are reasonable:

Code defaults include SupportsMinTxRate per device, used by the VLAN/rate-limiting test to skip NICs that don’t support minTxRate.

SRIOV_DEVICES supports both name:deviceid:vendor:interface and a 5th optional supportsMinTxRate flag (true|1|yes → true, anything else → false).

Two follow-ups to keep things coherent:

The SR-IOV basic README currently documents only the 4-field form and lists a different default device set than GetDefaultDevices() provides. It should be updated to:

Reflect this exact GetDefaultDevices() list, and

Mention the optional 5th supportsMinTxRate field so users can correctly mark devices that don’t support min TX rate.

If default.yaml also carries a devices list, consider clearly documenting (or enforcing) which source “wins” (YAML vs SRIOV_DEVICES vs built-in defaults) to avoid surprises when operators override configuration.

Also applies to: 76-88, 255-306

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1dd58b3e8bb9cb90a5e797befbb72c018af18d5e and 9999d379c5eaa2504825f4765687ec2c9927f271.

📒 Files selected for processing (8)

tests/ocp/sriov/README.md (1 hunks)
tests/ocp/sriov/internal/ocpsriovconfig/config.go (4 hunks)
tests/ocp/sriov/internal/ocpsriovconfig/default.yaml (1 hunks)
tests/ocp/sriov/internal/sriovenv/sriovenv.go (1 hunks)
tests/ocp/sriov/internal/tsparams/consts.go (2 hunks)
tests/ocp/sriov/internal/tsparams/ocpsriovvars.go (2 hunks)
tests/ocp/sriov/tests/README.md (1 hunks)
tests/ocp/sriov/tests/basic.go (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

tests/ocp/sriov/internal/ocpsriovconfig/default.yaml

🧰 Additional context used

🧬 Code graph analysis (2)

tests/ocp/sriov/internal/tsparams/ocpsriovvars.go (1)

tests/ocp/sriov/internal/tsparams/consts.go (2)

LabelSuite (10-10)

LabelBasic (15-15)

tests/ocp/sriov/tests/basic.go (5)

tests/ocp/sriov/internal/ocpsriovconfig/config.go (1)

DeviceConfig (27-33)

tests/cnf/ran/ptp/internal/iface/iface.go (1)

Name (22-22)

tests/system-tests/internal/link/link.go (1)

NewBuilder (48-77)

tests/ocp/sriov/internal/tsparams/consts.go (7)

NamespaceTimeout (30-30)

CleanupTimeout (34-34)

LabelBasic (15-15)

PodReadyTimeout (32-32)

DefaultTestMTU (56-56)

NADTimeout (40-40)

PollingInterval (28-28)

tests/ocp/sriov/internal/sriovenv/sriovenv.go (17)

NetworkOption (88-88)

CreateSriovNetwork (140-155)

RemoveSriovNetwork (168-177)

CheckSriovOperatorStatus (40-42)

RemoveSriovPolicy (184-203)

WaitForSriovPolicyReady (45-47)

InitVF (209-211)

WithSpoof (91-95)

WithTrust (98-102)

CheckVFStatusWithPassTraffic (510-585)

WithVLAN (105-109)

WithVlanQoS (112-116)

WithMinTxRate (119-123)

WithMaxTxRate (126-130)

WithLinkState (133-137)

VerifyLinkStateConfiguration (589-638)

GetPciAddress (729-765)

🪛 LanguageTool

tests/ocp/sriov/tests/README.md

[grammar] ~493-~493: Ensure spelling is correct
Context: ... up resources using DeferCleanup ### AfterAll Hook 1. Policy Cleanup: Removes al...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🔇 Additional comments (1)

tests/ocp/sriov/README.md (1)

401-412: Logging wording now matches actual klog usage

Renaming this bullet to “klog for verbose logging” aligns the text with the klog.V(level).Infof() example below; no further changes needed here.

coderabbitai · 2025-12-11T15:56:52Z

tests/ocp/sriov/internal/sriovenv/sriovenv.go

+// initVFWithDevType creates an SR-IOV policy for the specified device type.
+// It iterates through worker nodes and creates a policy on the first node where
+// the device is successfully discovered. Returns (true, nil) on success.
+func initVFWithDevType(name, deviceID, interfaceName, vendor, devType string, vfNum int,
+	workerNodes []*nodes.Builder) (bool, error) {
+	sriovOpNs := SriovOcpConfig.OcpSriovOperatorNamespace
+
+	// Cleanup existing policy
+	_ = RemoveSriovPolicy(name, tsparams.NamespaceTimeout)
+
+	for _, node := range workerNodes {
+		nodeName := node.Definition.Name
+
+		// Try to discover the actual interface name
+		actualInterface := interfaceName
+
+		// Only discover interface if not explicitly provided
+		// This allows users to specify exact interface for multi-port NICs
+		if interfaceName == "" && vendor != "" && deviceID != "" {
+			if discovered, err := discoverInterfaceName(nodeName, vendor, deviceID); err == nil {
+				actualInterface = discovered
+			}
+		}
+
+		pfSelector := fmt.Sprintf("%s#0-%d", actualInterface, vfNum-1)
+
+		policy := sriov.NewPolicyBuilder(APIClient, name, sriovOpNs, name, vfNum,
+			[]string{pfSelector}, map[string]string{"kubernetes.io/hostname": nodeName}).
+			WithDevType(devType)
+
+		if vendor != "" {
+			policy.Definition.Spec.NicSelector.Vendor = vendor
+		}
+
+		if deviceID != "" {
+			policy.Definition.Spec.NicSelector.DeviceID = deviceID
+		}
+
+		if _, err := policy.Create(); err != nil {
+			klog.V(90).Infof("Failed to create policy on node %q: %v", nodeName, err)
+
+			continue
+		}
+
+		if err := WaitForSriovPolicyReady(tsparams.PolicyApplicationTimeout); err != nil {
+			klog.V(90).Infof("Policy not ready on node %q: %v", nodeName, err)
+
+			_ = policy.Delete()
+
+			continue
+		}
+
+		klog.V(90).Infof("Successfully created policy %q on node %q", name, nodeName)
+
+		return true, nil
+	}
+
+	return false, fmt.Errorf("failed to create policy %q on any node", name)
+}


⚠️ Potential issue | 🟠 Major

initVFWithDevType never returns a non-error “not found” result, making graceful skips impossible

initVFWithDevType currently:

Cleans up any existing policy,

Iterates worker nodes trying to create a policy,

On first success, waits for readiness and returns (true, nil),

If all attempts fail, returns false with an error:

return false, fmt.Errorf("failed to create policy %q on any node", name)

Callers like InitVF / InitDpdkVF in turn are used by tests/basic.go, which expects to:

Treat err != nil as a hard failure via Expect(err).ToNot(HaveOccurred()), and

Use the boolean result to decide whether to skip devices and ultimately Skip("No SR-IOV devices matched...").

Because (false, nil) is never returned, “no matching device on any worker” can only manifest as an error, and tests can’t distinguish “device not present” from genuine failures. That contradicts the README’s intention to skip devices not found on any worker node.

Consider adjusting this API so callers can differentiate:

Return (false, nil) (or a sentinel error type) when no suitable NIC is found on any node, and reserve non-nil errors for real misconfigurations (e.g., API failures), or

Add a clear contract (docstring + error type) so tests can detect and Skip “not found” while still failing on other errors.

This will allow the basic tests and documentation to implement the promised “device skipping” behavior without conflating it with infrastructure failures.

🤖 Prompt for AI Agents

tests/ocp/sriov/internal/sriovenv/sriovenv.go lines 221-279: The function currently treats "no matching device on any worker" as an error and always returns (false, error), which prevents callers from distinguishing "not found" from real failures; change the API so the function returns (false, nil) when no suitable NIC is discovered on any worker (or alternatively return a well-documented sentinel error type like ErrDeviceNotFound), and only return non-nil errors for real failures (API/creation/timeout). Update the function to detect the "no-match" case and return (false, nil) (or ErrDeviceNotFound) instead of fmt.Errorf(...), and update callers (InitVF/InitDpdkVF and tests/basic.go) to check the boolean or sentinel error and Skip when appropriate while still failing on real errors.

coderabbitai · 2025-12-11T15:56:52Z

tests/ocp/sriov/tests/basic.go

+		var (
+			vfNum       = SriovOcpConfig.GetVFNum()
+			workerNodes []*nodes.Builder
+			testData    = SriovOcpConfig.GetDevices()
+		)
+
+		BeforeAll(func() {
+			By("Checking the SR-IOV operator is running")
+			err := sriovenv.CheckSriovOperatorStatus()
+			Expect(err).ToNot(HaveOccurred(), "SR-IOV operator is not running")
+
+			By("Discovering worker nodes")
+			workerNodes, err = nodes.List(APIClient,
+				metav1.ListOptions{LabelSelector: SriovOcpConfig.WorkerLabel})
+			Expect(err).ToNot(HaveOccurred(), "Failed to discover nodes")
+			Expect(len(workerNodes)).To(BeNumerically(">", 0), "No worker nodes found")
+		})
+
+		AfterAll(func() {
+			By("Cleaning up SR-IOV policies after all tests")
+
+			var cleanupErrors []string
+			for _, item := range testData {
+				err := sriovenv.RemoveSriovPolicy(item.Name, tsparams.DefaultTimeout)
+				if err != nil {
+					cleanupErrors = append(cleanupErrors, fmt.Sprintf("policy %q: %v", item.Name, err))
+				}
+			}
+
+			if len(cleanupErrors) > 0 {
+				klog.Warningf("Some policies failed to clean up: %v", cleanupErrors)
+			}
+
+			By("Waiting for post-cleanup cluster stability")
+			err := sriovenv.WaitForSriovPolicyReady(tsparams.DefaultTimeout)
+			Expect(err).ToNot(HaveOccurred(), "Cluster did not stabilize after cleanup")
+		})
+
+		It("SR-IOV VF with spoof checking enabled", reportxml.ID("25959"), func() {
+			caseID := "25959"
+			executed := false
+
+			for _, data := range testData {
+				By(fmt.Sprintf("Testing device: %s", data.Name))
+
+				result, err := sriovenv.InitVF(data.Name, data.DeviceID, data.InterfaceName,
+					data.Vendor, vfNum, workerNodes)
+				Expect(err).ToNot(HaveOccurred(), "Failed to initialize VF for device %q", data.Name)
+
+				if !result {
+					By(fmt.Sprintf("Skipping device %q - VF init failed", data.Name))
+
+					continue
+				}
+
+				executed = true
+				testNamespace := setupTestNamespace(caseID+"-", data)
+				networkName := caseID + "-" + data.Name
+				setupSriovNetwork(networkName, data.Name, testNamespace,
+					sriovenv.WithSpoof(true), sriovenv.WithTrust(false))
+
+				By("Verifying VF status with pass traffic")
+				err = sriovenv.CheckVFStatusWithPassTraffic(networkName, data.InterfaceName,
+					testNamespace, "spoof checking on", tsparams.PodReadyTimeout)
+				if isNoCarrierError(err) {
+					Skip("Interface has NO-CARRIER status")
+				}
+
+				Expect(err).ToNot(HaveOccurred(), "Test verification failed")
+			}
+
+			if !executed {
+				Skip("No SR-IOV devices matched the requested configuration")
+			}
+		})


⚠️ Potential issue | 🟠 Major

InitVF error handling prevents “device not found” from being treated as a skip

Across the tests (starting here with Test ID 25959), the pattern is:

result, err := sriovenv.InitVF(...) Expect(err).ToNot(HaveOccurred(), ...) if !result { // skip this device continue } executed = true ...

However, initVFWithDevType in sriovenv.go currently only returns:

(true, nil) when a policy is successfully created, or

(false, error) when it fails on all nodes.

It never returns (false, nil), so a “device not present on any worker” scenario produces an error and fails the test before the if !result or final if !executed { Skip(...) } paths can run. That contradicts the README, which says devices not found on any worker node are automatically skipped.

Consider one of these options:

Teach initVFWithDevType (and thus InitVF / InitDpdkVF) to distinguish “no suitable node/device found” from real failures, e.g. returning (false, nil) or a well-known sentinel error for the former; then the existing result/executed logic will work as intended, or

Keep InitVF returning an error, but in the tests explicitly detect a “no device found” error (e.g., via errors.Is or substring) and Skip in that case instead of unconditionally Expect(err).ToNot(HaveOccurred()).

Without such a change, running these tests on clusters lacking one or more configured devices will yield hard failures instead of clean skips.

tests/ocp/sriov/tests/README.md

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

tests/ocp/sriov/internal/sriovenv/sriovenv.go (1)
221-279: initVFWithDevType still conflates "device not found" with real errors.

The previous review correctly identified that this function returns (false, error) for all failure cases, making it impossible for callers to distinguish between "no matching device on any node" (which should trigger a skip) and actual infrastructure failures (which should fail the test).

The function currently returns:
return false, fmt.Errorf("failed to create policy %q on any node", name)
Consider returning (false, nil) when no device matches, reserving errors for genuine failures:
+var errNoMatchingDevice = fmt.Errorf("no matching device found on any worker node")

 func initVFWithDevType(...) (bool, error) {
     // ... existing logic ...
     
+    // Track if any attempt was made (vs no nodes at all)
+    var attemptedNodes int
     for _, node := range workerNodes {
+        attemptedNodes++
         // ... policy creation attempts ...
     }
     
-    return false, fmt.Errorf("failed to create policy %q on any node", name)
+    if attemptedNodes == 0 {
+        return false, fmt.Errorf("no worker nodes provided")
+    }
+    // No matching device - callers can skip
+    return false, nil
 }
This allows tests to use Skip() for "device not present" scenarios as documented in the README.
tests/ocp/sriov/tests/README.md (1)
420-443: MTU value in documentation still doesn't match code.

The README documents MTU as 1800 (line 431), but tsparams.DefaultTestMTU in consts.go is defined as 9000. This discrepancy was flagged in a previous review and marked as addressed, but the mismatch persists.

Update line 431 to reflect the actual default:
 **Configuration**:
-- MTU: `1800`
+- MTU: `9000` (defined by `DefaultTestMTU`)
And update line 438 accordingly:
-- Policy MTU is updated to 1800
+- Policy MTU is updated to 9000

🧹 Nitpick comments (4)

tests/ocp/sriov/internal/tsparams/consts.go (1)

55-56: Consider documenting MTU range constraint.

The README mentions "MTU value must be in range 1-9192 as per SR-IOV specification." Consider adding a comment here about the valid range, or defining MinTestMTU and MaxTestMTU constants for validation.

tests/ocp/sriov/internal/ocpsriovconfig/config.go (1)

308-328: Legacy VFNum parsing has a subtle ordering dependency.

The check if sriovOcpConfig.VFNum != DefaultVFNum means SRIOV_VF_NUM is only read if ECO_OCP_SRIOV_VF_NUM wasn't set. This is correct but relies on the call order in NewSriovOcpConfig(). Consider adding a brief comment in NewSriovOcpConfig() noting this intentional precedence.
tests/ocp/sriov/internal/sriovenv/sriovenv.go (2)
303-356: Policy recreation now preserves additional fields - good improvement.

The UpdateSriovPolicyMTU function now correctly preserves DeviceType, Vendor, DeviceID, and Priority from the original spec. This addresses the previous review concern.

However, there are still some spec fields that might be lost:

LinkType

IsRdma

NicSelector.RootDevices

NicSelector.NetFilter

ExternallyManaged

VdpaType

For a complete solution, consider copying all non-zero fields from the original spec, or document which fields are preserved.

167-177: Silent error swallowing in RemoveSriovNetwork.

Returning nil when PullNetwork fails assumes the network is already deleted, but the error could indicate other issues (e.g., API server unreachable). Consider logging the error or checking for specific "not found" errors:
 func RemoveSriovNetwork(name string, timeout time.Duration) error {
     network, err := sriov.PullNetwork(APIClient, name, sriovOpNs)
     if err != nil {
-        return nil // Already deleted
+        klog.V(90).Infof("Network %q not found or error pulling: %v", name, err)
+        return nil // Assume already deleted
     }
     return network.DeleteAndWait(timeout)
 }

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9999d379c5eaa2504825f4765687ec2c9927f271 and acd5dacc020bde2aaebbc9622d18f297bde606b0.

📒 Files selected for processing (8)

tests/ocp/sriov/README.md (3 hunks)
tests/ocp/sriov/internal/ocpsriovconfig/config.go (4 hunks)
tests/ocp/sriov/internal/ocpsriovconfig/default.yaml (1 hunks)
tests/ocp/sriov/internal/sriovenv/sriovenv.go (1 hunks)
tests/ocp/sriov/internal/tsparams/consts.go (2 hunks)
tests/ocp/sriov/internal/tsparams/ocpsriovvars.go (2 hunks)
tests/ocp/sriov/tests/README.md (1 hunks)
tests/ocp/sriov/tests/basic.go (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (4)

tests/ocp/sriov/README.md
tests/ocp/sriov/internal/tsparams/ocpsriovvars.go
tests/ocp/sriov/internal/ocpsriovconfig/default.yaml
tests/ocp/sriov/tests/basic.go

🧰 Additional context used

🧬 Code graph analysis (1)

tests/ocp/sriov/internal/ocpsriovconfig/config.go (3)

tests/cnf/ran/ptp/internal/iface/iface.go (1)

Name (22-22)

tests/ocp/internal/ocpconfig/config.go (1)

OcpConfig (20-22)

tests/ocp/sriov/internal/ocpsriovinittools/ocpsriovinittools.go (1)

SriovOcpConfig (13-13)

🪛 LanguageTool

tests/ocp/sriov/tests/README.md

[grammar] ~493-~493: Ensure spelling is correct
Context: ... up resources using DeferCleanup ### AfterAll Hook 1. Policy Cleanup: Removes al...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🔇 Additional comments (18)

tests/ocp/sriov/tests/README.md (3)

1-15: Clear and well-structured documentation.

The overview section effectively introduces the test suite, its purpose, and the key information users need (file location, labels, namespace, container). The structure follows documentation best practices.

16-85: Comprehensive prerequisites section with helpful examples.

The cluster requirements and OpenShift configuration sections are thorough, including verification commands and a practical MachineConfig example for DPDK hugepages. This will help users set up their test environment correctly.

492-497: Minor grammar issue: "AfterAll" spelling.

Static analysis flagged "AfterAll" - this is actually correct as it refers to the Ginkgo AfterAll hook. No change needed.

tests/ocp/sriov/internal/tsparams/consts.go (4)

14-16: LGTM - Well-defined label constant.

The LabelBasic constant follows the existing naming convention and provides a clear label for test filtering.

21-44: Reasonable timeout constants with good documentation.

The timeout values are well-chosen for SR-IOV operations:

PolicyApplicationTimeout at 35 minutes matches MCOWaitTimeout, appropriate for MCP updates

PodReadyTimeout at 5 minutes allows for SR-IOV device attachment

Polling intervals (2-10 seconds) are reasonable for avoiding API server overload

46-53: Good use of locally administered MAC addresses.

The 02: prefix in MAC addresses correctly indicates locally administered addresses, avoiding conflicts with real hardware MAC addresses. The /24 CIDR notation for IPs is appropriate for test networks.

64-67: Rate limiting constants look reasonable.

The TestMinTxRate (40 Mbps) and TestMaxTxRate (100 Mbps) provide a valid range for rate limiting tests. Note that some devices (BCM) don't support minTxRate, which is handled in the test code.

tests/ocp/sriov/internal/ocpsriovconfig/config.go (5)

45-75: Excellent documentation for the temporary struct pattern.

The detailed comments explaining why separate sriovYAMLConfig and sriovEnvConfig structs are needed to avoid yaml.v2's pointer-zeroing behavior is very helpful for future maintainers. This is a known gotcha with yaml.v2 and the workaround is correct.

76-88: Default devices are well-defined with feature flags.

The SupportsMinTxRate field correctly distinguishes devices that support rate limiting from those that don't (BCM and x710 devices). This enables graceful skipping of unsupported features in tests.

104-113: Initialization order is correct.

Setting defaults before reading config files ensures that:

Defaults are available if config files don't specify values

Config files can override defaults

Environment variables can override config file values

The error handling for runtime.Caller(0) failure is appropriate.

141-157: Defensive accessor methods with safe defaults.

GetDevices() and GetVFNum() properly return defaults when the config is empty or invalid. However, GetDevices() returns GetDefaultDevices() when len(c.Devices) == 0, but c.Devices is already set to defaults in NewSriovOcpConfig(). This redundancy is harmless but could be simplified.

259-306: Device parsing handles edge cases well.

The parseDevicesFromEnv function:

Trims whitespace from entries and parts

Handles optional 5th field for SupportsMinTxRate

Logs warnings for invalid entries without failing

Only overwrites defaults if at least one valid device is parsed

One minor improvement: the comment at line 301 could be clearer.

tests/ocp/sriov/internal/sriovenv/sriovenv.go (6)

39-42: Simple and effective operator status check.

The CheckSriovOperatorStatus function correctly delegates to the existing sriovoperator.IsSriovDeployed utility, maintaining consistency with other test suites.

87-137: Clean functional options pattern for network configuration.

The NetworkOption type and With* functions provide a flexible, extensible API for configuring SR-IOV networks. This follows Go best practices for builder patterns.

509-585: Comprehensive VF status verification with traffic test.

The CheckVFStatusWithPassTraffic function properly:

Creates client/server pods with cleanup via defer

Verifies interface readiness and carrier status

Conditionally verifies spoof checking based on description

Tests actual connectivity with ping

Good defensive cleanup pattern.

683-722: Debug pod execution with proper cleanup.

The executeOnNode function correctly:

Truncates pod name to 63 characters (K8s limit)

Cleans up existing debug pods before creating new ones

Uses defer for cleanup

Uses nsenter to execute commands in the host namespace

One consideration: the 63-character truncation might create collisions for very long node names with similar prefixes.

771-839: Best-effort cleanup with good logging.

The CleanupLeftoverResources function appropriately:

Logs warnings but continues on errors (best-effort)

Uses regex patterns to match test resources

Cleans up namespaces, networks, and policies

The regex pattern ^\d{5}-|\w+dpdknet$ correctly matches test case ID prefixes and DPDK network suffixes.

728-765: PCI address extraction handles namespaced network names.

The GetPciAddress function correctly strips the namespace prefix from network names using strings.LastIndex. The JSON struct for parsing network status is appropriately minimal.

coderabbitai

Actionable comments posted: 2

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between acd5dacc020bde2aaebbc9622d18f297bde606b0 and 30bb5d969242eb732ce9a179a07f22a95dbb391f.

📒 Files selected for processing (1)

tests/ocp/sriov/tests/README.md (1 hunks)

🧰 Additional context used

🪛 LanguageTool

tests/ocp/sriov/tests/README.md

[grammar] ~493-~493: Ensure spelling is correct
Context: ... up resources using DeferCleanup ### AfterAll Hook 1. Policy Cleanup: Removes al...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🔇 Additional comments (1)

tests/ocp/sriov/tests/README.md (1)

420-443: MTU documentation correctly references DefaultTestMTU.

The MTU section now properly documents the value as 9000 (defined by DefaultTestMTU), which aligns with the actual test implementation and addresses the previous documentation mismatch concern.

coderabbitai · 2025-12-12T14:27:11Z

tests/ocp/sriov/tests/README.md

+#### Device Configuration
+
+If `SRIOV_DEVICES` is not set, the test suite uses default device configurations:
+
+- `e810xxv` (DeviceID: 159b, Vendor: 8086, Interface: eno12409)
+- `e810c` (DeviceID: 1593, Vendor: 8086, Interface: ens2f2)
+- `x710` (DeviceID: 1572, Vendor: 8086, Interface: ens5f0) - NO-CARRIER
+- `bcm57414` (DeviceID: 16d7, Vendor: 14e4, Interface: ens4f1np1)
+- `bcm57508` (DeviceID: 1750, Vendor: 14e4, Interface: ens3f0np0) - NO-CARRIER
+- `e810back` (DeviceID: 1591, Vendor: 8086, Interface: ens4f2)
+- `cx7anl244` (DeviceID: 1021, Vendor: 15b3, Interface: ens2f0np0)
+
+**Note**: Devices marked with "NO-CARRIER" may be skipped if the interface has no carrier signal.


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Inspect GetDefaultDevices() implementation echo "=== GetDefaultDevices() function ===" rg -A 150 'func GetDefaultDevices' tests/ocp/sriov/internal/ocpsriovconfig/config.go echo "" echo "=== Check for device-related mappings/constants ===" rg -B 2 -A 5 'map\[.*DeviceConfig|var.*Device' tests/ocp/sriov/internal/ocpsriovconfig/config.go

Repository: rh-ecosystem-edge/eco-gotests

Length of output: 4745

🏁 Script executed:

#!/bin/bash # Also check the README.md file structure echo "=== README.md device configuration section (lines 100-120) ===" sed -n '100,120p' tests/ocp/sriov/tests/README.md

Repository: rh-ecosystem-edge/eco-gotests

Length of output: 1123

🏁 Script executed:

#!/bin/bash # Check for line 493 grammar issue echo "=== README.md around line 493 ===" sed -n '490,495p' tests/ocp/sriov/tests/README.md

Repository: rh-ecosystem-edge/eco-gotests

Length of output: 323

Device configuration list is out of sync with GetDefaultDevices() in config.go.

The documented device list has multiple discrepancies with the actual implementation:

Missing devices: cx6 (ens6f0np0) and cx6dx (ens7f0np0) are not documented

Non-existent device: e810back appears in documentation but doesn't exist in the code

Interface name mismatches:

bcm57414: documented as ens4f1np1, actually ens3f0

bcm57508: documented as ens3f0np0, actually ens4f0

cx7anl244: documented as ens2f0np0, actually cx7 with ens8f0np0

Regenerate the device list directly from GetDefaultDevices() to ensure users following the documentation have accurate information.

🤖 Prompt for AI Agents

In tests/ocp/sriov/tests/README.md around lines 103 to 115, the documented default device list is out of sync with GetDefaultDevices() in config.go; update the README by regenerating the device table directly from GetDefaultDevices() and replacing the listed entries: remove the non-existent e810back, add missing cx6 (ens6f0np0) and cx6dx (ens7f0np0), and correct interface names for bcm57414 to ens3f0, bcm57508 to ens4f0, and cx7anl244 to cx7 with ens8f0np0; ensure NO-CARRIER annotations are preserved where applicable and commit the updated README so docs match the implementation.

tests/ocp/sriov/tests/README.md

ajaggapa · 2025-12-19T11:23:03Z

tests/ocp/sriov/README.md


 #### Optional
 - `ECO_OCP_SRIOV_INTERFACE_LIST`: Comma-separated list of SR-IOV interfaces
 - `ECO_OCP_SRIOV_TEST_IMAGE`: Container image for test workloads


Suggested change

- `ECO_OCP_SRIOV_TEST_CONTAINER`: Container image for test workloads

ajaggapa · 2025-12-19T13:34:47Z

tests/ocp/sriov/internal/tsparams/consts.go

+	// MCPStableInterval represents polling interval for MCP stability checks.
+	MCPStableInterval = 10 * time.Second
+	// NADTimeout represents timeout for NAD creation.
+	NADTimeout = 2 * time.Minute


in worst case, this should be 5-10seconds.
if it takes more than that we better consider it a failure.

ajaggapa · 2025-12-19T13:37:38Z

@zhiqiangf
i see all your open PRs are incremental. So the older comments are still valid on the latest one. Please address them.
Also, it is recommended to have fewer commits on each PR and you can force push changes on the same commit to address reviews

- Add DeviceConfig struct for SR-IOV device configuration - Add GetSriovDevices() method with ECO_OCP_SRIOV_DEVICES env var support - Add GetVFNum() method with error handling - Move all defaults to default.yaml - Fix suite-level labels to exclude feature-specific labels - Add constants for timeouts, labels, and test parameters

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (2)

tests/ocp/sriov/tests/basic.go (1)

150-158: InitVF error handling still prevents graceful device skipping.

The past review comment on this pattern remains applicable. initVFWithDevType in sriovenv.go (line 248) returns (false, fmt.Errorf(...)) when no matching device is found on any node, so the Expect(err).ToNot(HaveOccurred()) on line 152 will fail before the if !result skip logic can execute.

This pattern is repeated in all 8 test cases. To implement the intended "skip devices not found" behavior, either:

Modify initVFWithDevType to return (false, nil) for "device not present" scenarios, or

Use a sentinel error type and check with errors.Is before the Expect.

tests/ocp/sriov/internal/sriovenv/sriovenv.go (1)

243-249: initVFWithDevType still cannot gracefully indicate "device not found".

The previous review comment remains unaddressed. Line 248 returns (false, fmt.Errorf(...)) for all failure cases, making it impossible for callers to distinguish "no suitable device on any worker" from genuine API/configuration failures.

This prevents the test skip logic in basic.go from working as documented. Consider returning (false, nil) when the device simply isn't present on any node, reserving errors for actual failures.

🧹 Nitpick comments (2)

tests/ocp/sriov/internal/tsparams/consts.go (1)

50-53: Consider consolidating duplicate IP constants.

TestPodClientIP and TestPodServerIP duplicate ClientIPv4IPAddress and ServerIPv4IPAddress from ocpsriovvars.go. Consider using a single source of truth for these values.
tests/ocp/sriov/internal/ocpsriovconfig/config.go (1)
129-134: Consider documenting accepted values for minTxRate.

The parsing accepts true, 1, or yes (case-insensitive) as true values, and anything else defaults to false. This is reasonable behavior but could be clearer in the format documentation on line 110-111.
Suggested documentation improvement
 // parseSriovDevicesEnv parses device configuration from ECO_OCP_SRIOV_DEVICES environment variable.
-// Format: "name1:deviceid1:vendor1:interface1[:minTxRate],name2:deviceid2:vendor2:interface2[:minTxRate],..."
+// Format: "name1:deviceid1:vendor1:interface1[:minTxRate],name2:deviceid2:vendor2:interface2[:minTxRate],..."
+// minTxRate accepts: true/1/yes for enabled, any other value (including omitted) defaults to false.
 // Example: "e810xxv:159b:8086:ens2f0,e810c:1593:8086:ens2f2".

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 941b840fe473a30924a9bdb82b503809368aac2c and 0d69e5b.

📒 Files selected for processing (7)

tests/ocp/sriov/README.md (2 hunks)
tests/ocp/sriov/internal/ocpsriovconfig/config.go (3 hunks)
tests/ocp/sriov/internal/ocpsriovconfig/default.yaml (1 hunks)
tests/ocp/sriov/internal/sriovenv/sriovenv.go (1 hunks)
tests/ocp/sriov/internal/tsparams/consts.go (2 hunks)
tests/ocp/sriov/internal/tsparams/ocpsriovvars.go (2 hunks)
tests/ocp/sriov/tests/basic.go (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

tests/ocp/sriov/README.md

🧰 Additional context used

🧬 Code graph analysis (3)

tests/ocp/sriov/internal/ocpsriovconfig/config.go (2)

tests/ocp/internal/ocpconfig/config.go (1)

OcpConfig (20-22)

tests/ocp/sriov/internal/ocpsriovinittools/ocpsriovinittools.go (1)

SriovOcpConfig (13-13)

tests/ocp/sriov/internal/tsparams/ocpsriovvars.go (1)

tests/ocp/sriov/internal/tsparams/consts.go (1)

LabelSuite (10-10)

tests/ocp/sriov/internal/sriovenv/sriovenv.go (2)

tests/internal/sriovoperator/sriovenv.go (4)

IsSriovDeployed (32-54)

WaitForSriovAndMCPStable (80-100)

RemoveAllSriovNetworks (156-176)

RemoveAllPoliciesAndWaitForSriovAndMCPStable (180-195)

tests/internal/cluster/cluster.go (1)

ExecCmdWithStdout (98-180)

🔇 Additional comments (6)

tests/ocp/sriov/internal/tsparams/ocpsriovvars.go (1)

16-18: LGTM!

Clear documentation distinguishing suite-level labels from feature-specific labels. The consolidation of DefaultTimeout to consts.go reduces duplication.

tests/ocp/sriov/tests/basic.go (2)

45-72: LGTM!

Well-structured helper with proper namespace creation, label application, and cleanup via DeferCleanup. The Eventually wait ensures namespace readiness before proceeding.

454-532: DPDK test implementation looks correct.

The test properly skips BCM devices with a clear bug reference (OCPBUGS-30909), uses per-device namespaces, and verifies PCI address assignment. The DeferCleanup correctly captures the per-iteration namespace via closure.

tests/ocp/sriov/internal/tsparams/consts.go (1)

21-44: Well-organized timeout constants.

The timeout values are appropriate for SR-IOV operations. PolicyApplicationTimeout at 35 minutes aligns with MCOWaitTimeout for MachineConfigPool operations, which is correct since SR-IOV policy changes often trigger node reconfiguration.

tests/ocp/sriov/internal/sriovenv/sriovenv.go (2)

109-125: Well-designed functional options pattern for network configuration.

The NetworkOption pattern with WithSpoof, WithTrust, WithVLAN, etc. provides a clean, extensible API for network creation. The CreateSriovNetwork function composes these options nicely.

713-749: Cleanup function handles errors gracefully.

Using warnings instead of returning errors on cleanup failures is appropriate since cleanup should be best-effort and not block test completion. The function properly cleans up test namespaces, networks, and policies.

coderabbitai · 2025-12-19T22:39:59Z

tests/ocp/sriov/internal/ocpsriovconfig/config.go

+// GetVFNum returns the configured number of virtual functions.
+// Returns an error if VFNum is not properly configured.
+func (sriovOcpConfig *SriovOcpConfig) GetVFNum() (int, error) {
+	if sriovOcpConfig.VFNum <= 0 {
+		return 0, fmt.Errorf(
+			"no SR-IOV VFs configured, check env var ECO_OCP_SRIOV_VF_NUM")
+	}
+
+	return sriovOcpConfig.VFNum, nil
+}


⚠️ Potential issue | 🟡 Minor

Error message could mention both configuration sources.

The error message only references the environment variable, but VFNum can also be set via vf_num in the YAML config file.

Suggested fix

func (sriovOcpConfig *SriovOcpConfig) GetVFNum() (int, error) { if sriovOcpConfig.VFNum <= 0 { return 0, fmt.Errorf( - "no SR-IOV VFs configured, check env var ECO_OCP_SRIOV_VF_NUM") + "no SR-IOV VFs configured, check vf_num in YAML or env var ECO_OCP_SRIOV_VF_NUM") } return sriovOcpConfig.VFNum, nil }

🤖 Prompt for AI Agents

In tests/ocp/sriov/internal/ocpsriovconfig/config.go around lines 148 to 157, the error returned when VFNum is not set only mentions the environment variable; update the error text to reference both configuration sources (the ECO_OCP_SRIOV_VF_NUM environment variable and the vf_num key in the YAML config) so callers know where to look; modify the fmt.Errorf call to include a concise message listing both sources (for example: "no SR-IOV VFs configured; set ECO_OCP_SRIOV_VF_NUM or vf_num in the YAML config") and keep the same error return shape.

tests/ocp/sriov/internal/ocpsriovconfig/default.yaml

tests/ocp/sriov/internal/sriovenv/sriovenv.go

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (3)

tests/ocp/sriov/internal/ocpsriovconfig/config.go (1)

151-160: Error message should mention both configuration sources.

The error message at line 156 only references the environment variable, but VFNum can also be set via vf_num in the YAML config file. This makes troubleshooting harder for users who configure via YAML.
tests/ocp/sriov/tests/basic.go (1)
143-179: InitVF error handling prevents graceful device skipping.

The pattern used across all tests (also at lines 181-214, 216-249, 251-284, 286-329, 331-364, 366-407, 409-452, 454-532):
result, err := sriovenv.InitVF(...)
Expect(err).ToNot(HaveOccurred(), ...)
if !result {
    continue
}
However, initVFWithDevType in sriovenv.go line 267 returns (false, fmt.Errorf(...)) when no suitable device is found on any worker, never (false, nil). This causes Expect(err).ToNot(HaveOccurred()) to fail before the skip logic (if !result or final if !executed { Skip(...) }) can execute.

This contradicts the README's promise to automatically skip devices not found on any worker node.
tests/ocp/sriov/internal/sriovenv/sriovenv.go (1)
200-268: initVFWithDevType treats "device not found" as an error, preventing graceful test skips.

At line 267, the function returns (false, fmt.Errorf(...)) when no suitable device is found on any worker node. This prevents callers in tests/basic.go from distinguishing "device not present" from genuine failures, because:
result, err := sriovenv.InitVF(...)
Expect(err).ToNot(HaveOccurred())  // Fails here for "not found"
if !result { Skip(...) }            // Never reached
Consider returning (false, nil) when no matching NIC is discovered (lines 229-232, 249-251, etc.), reserving non-nil errors for real failures (API errors, timeouts). This will allow tests to implement the promised "device skipping" behavior without conflating "not found" with infrastructure failures.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 36be2a6 and 8995f5e.

📒 Files selected for processing (6)

tests/ocp/sriov/README.md
tests/ocp/sriov/internal/ocpsriovconfig/config.go
tests/ocp/sriov/internal/ocpsriovconfig/default.yaml
tests/ocp/sriov/internal/sriovenv/sriovenv.go
tests/ocp/sriov/internal/tsparams/consts.go
tests/ocp/sriov/tests/basic.go

🚧 Files skipped from review as they are similar to previous changes (2)

tests/ocp/sriov/internal/ocpsriovconfig/default.yaml
tests/ocp/sriov/internal/tsparams/consts.go

🧰 Additional context used

🧬 Code graph analysis (3)

tests/ocp/sriov/tests/basic.go (3)

tests/ocp/sriov/internal/ocpsriovconfig/config.go (1)

DeviceConfig (26-32)

tests/ocp/sriov/internal/tsparams/consts.go (12)

BCMVendorID (72-72)

NamespaceTimeout (30-30)

CleanupTimeout (34-34)

LabelBasic (15-15)

PodReadyTimeout (32-32)

TestVLAN (61-61)

TestVlanQoS (63-63)

TestMinTxRate (65-65)

TestMaxTxRate (67-67)

DefaultTestMTU (56-56)

NADTimeout (40-40)

PollingInterval (28-28)

tests/ocp/sriov/internal/sriovenv/sriovenv.go (17)

NetworkOption (59-59)

CreateSriovNetwork (111-126)

RemoveSriovNetwork (139-152)

WithSpoof (62-66)

WithTrust (69-73)

WithVLAN (76-80)

WithVlanQoS (83-87)

WithMinTxRate (90-94)

WithMaxTxRate (97-101)

WithLinkState (104-108)

VerifyLinkStateConfiguration (585-637)

UpdateSriovPolicyMTU (295-346)

InitDpdkVF (196-198)

CreateDpdkTestPod (373-392)

DeleteDpdkTestPod (395-408)

WaitForPodWithLabelReady (411-448)

GetPciAddress (705-740)

tests/ocp/sriov/internal/ocpsriovconfig/config.go (3)

tests/cnf/ran/ptp/internal/iface/iface.go (1)

Name (22-22)

tests/ocp/internal/ocpconfig/config.go (1)

OcpConfig (20-22)

tests/ocp/sriov/internal/ocpsriovinittools/ocpsriovinittools.go (1)

SriovOcpConfig (13-13)

tests/ocp/sriov/internal/sriovenv/sriovenv.go (3)

tests/internal/sriovoperator/sriovenv.go (4)

IsSriovDeployed (32-54)

WaitForSriovAndMCPStable (80-100)

RemoveAllSriovNetworks (156-176)

RemoveAllPoliciesAndWaitForSriovAndMCPStable (180-195)

tests/ocp/sriov/internal/tsparams/consts.go (14)

MCPStableInterval (38-38)

NADTimeout (40-40)

PollingInterval (28-28)

NamespaceTimeout (30-30)

PolicyApplicationTimeout (42-42)

PodReadyTimeout (32-32)

TestPodClientIP (51-51)

TestPodClientMAC (47-47)

CleanupTimeout (34-34)

TestPodServerIP (53-53)

TestPodServerMAC (49-49)

CarrierWaitTimeout (36-36)

TestResourceLabelKey (75-75)

TestResourceLabelValue (77-77)

tests/internal/cluster/cluster.go (1)

ExecCmdWithStdout (98-180)

tests/ocp/sriov/internal/sriovenv/sriovenv.go

- Add SR-IOV environment helper functions using eco-goinfra - Use existing sriovoperator functions for stability checks and cleanup - Add network creation with NetworkOption pattern - Add policy creation and VF initialization helpers - Add pod verification and traffic testing utilities

- Add 9 basic SR-IOV test cases with test IDs as labels - Test spoof checking, trust, VLAN, link state, MTU, min/max tx rate - Add DPDK test with BCM NIC skip (OCPBUGS-30909) - Use constants from tsparams for test values - Add proper cleanup in AfterAll and AfterEach

- Add comprehensive README with test descriptions and configuration - Document environment variables and test labels - Fix NADTimeout to 10 seconds (was 2 minutes)

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (2)

tests/ocp/sriov/tests/basic.go (1)

143-179: InitVF error handling still prevents graceful skips for "device not found" scenarios.

The past review comment remains applicable. When InitVF returns an error (line 152), the test fails immediately via Expect(err).ToNot(HaveOccurred()). However, the underlying initVFWithDevType returns (false, error) when no device is found on any node, not (false, nil).

This means devices not present on any worker will cause test failures rather than skips, contradicting the intended skip behavior (lines 176-178).

Consider either:

Modifying initVFWithDevType to return (false, nil) for "no device found" cases

Using a sentinel error and checking with errors.Is before the Expect

tests/ocp/sriov/internal/sriovenv/sriovenv.go (1)

200-268: initVFWithDevType returns error for "no device found", preventing graceful skips.

As noted in the past review comment, this function returns (false, fmt.Errorf(...)) on line 267 when no node has the device. This causes callers using Expect(err).ToNot(HaveOccurred()) to fail instead of skip.

Consider returning (false, nil) when no suitable node/device is found, reserving errors for actual failures (API errors, timeouts, etc.). This would allow the if !result { continue } and if !executed { Skip(...) } patterns in tests to work as intended.

🧹 Nitpick comments (2)

tests/ocp/sriov/tests/basic.go (1)
47-72: Test namespace should include the test resource label for reliable cleanup.

The setupTestNamespace function applies params.PrivilegedNSLabels but doesn't include tsparams.TestResourceLabelKey. The CleanupLeftoverResources function in sriovenv.go first looks for namespaces with this label before falling back to the e2e- prefix match.

Adding the test label ensures consistent cleanup behavior and aligns with the documented cleanup strategy.
🔎 Proposed fix
 func setupTestNamespace(testID string, data sriovconfig.DeviceConfig) string {
 	testNamespace := "e2e-" + testID + data.Name
 
 	By(fmt.Sprintf("Creating test namespace %q", testNamespace))
 
 	nsBuilder := namespace.NewBuilder(APIClient, testNamespace)
 	for key, value := range params.PrivilegedNSLabels {
 		nsBuilder.WithLabel(key, value)
 	}
+	nsBuilder.WithLabel(tsparams.TestResourceLabelKey, tsparams.TestResourceLabelValue)
 
 	_, err := nsBuilder.Create()
tests/ocp/sriov/internal/tsparams/consts.go (1)

25-28: Clarify the distinction between RetryInterval and PollingInterval.

Both RetryInterval (3s) and PollingInterval (2s) serve similar purposes but have slightly different values. Consider adding comments to clarify when to use each, or consolidate to a single polling interval constant for consistency.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8995f5e and 16dc562.

📒 Files selected for processing (4)

tests/ocp/sriov/README.md
tests/ocp/sriov/internal/sriovenv/sriovenv.go
tests/ocp/sriov/internal/tsparams/consts.go
tests/ocp/sriov/tests/basic.go

🧰 Additional context used

🧬 Code graph analysis (1)

tests/ocp/sriov/internal/sriovenv/sriovenv.go (2)

tests/internal/sriovoperator/sriovenv.go (4)

IsSriovDeployed (32-54)

WaitForSriovAndMCPStable (80-100)

RemoveAllSriovNetworks (156-176)

RemoveAllPoliciesAndWaitForSriovAndMCPStable (180-195)

tests/internal/cluster/cluster.go (1)

ExecCmdWithStdout (98-180)

🔇 Additional comments (20)

tests/ocp/sriov/README.md (2)

34-49: LGTM! Directory structure is well-documented.

The updated directory structure accurately reflects the new internal packages (ocpsriovconfig, ocpsriovinittools, sriovenv, sriovocpenv) and test files. The organization follows the established pattern for the eco-gotests framework.

468-473: Environment variable documentation looks correct.

The environment variables are properly documented with format examples. Note that ECO_OCP_SRIOV_TEST_CONTAINER aligns with the previous review comment suggestion (renamed from ECO_OCP_SRIOV_TEST_IMAGE).

tests/ocp/sriov/tests/basic.go (6)

25-36: LGTM! Helper function for NO-CARRIER detection is clean.

The isNoCarrierError function correctly handles nil errors and checks for both common carrier error messages.

38-43: LGTM! BCM device detection is straightforward.

The function correctly identifies Broadcom NICs by both name pattern and vendor ID.

74-88: LGTM! Network setup helper is well-structured.

The function correctly creates the SR-IOV network and registers cleanup via DeferCleanup.

102-121: LGTM! BeforeAll setup is comprehensive.

The setup properly validates the SR-IOV operator status, loads configuration, and discovers worker nodes with appropriate assertions.

123-141: AfterAll cleanup continues despite individual errors - good pattern.

The cleanup correctly collects errors rather than failing on the first one, ensuring all policies get cleanup attempts. The final stability wait ensures the cluster is ready for subsequent test runs.

454-532: DPDK test case is well-structured with proper cleanup and verification.

The test correctly:

Skips BCM devices with appropriate documentation reference

Creates NAD and waits for readiness

Verifies PCI address assignment

Validates network-status annotation

The DeferCleanup for the DPDK pod ensures cleanup even on test failure.

tests/ocp/sriov/internal/tsparams/consts.go (3)

14-15: LGTM! New basic test label added.

The LabelBasic constant provides a clear label for filtering basic SR-IOV test cases.

21-44: LGTM! Timeout constants are well-organized and documented.

The timeouts are appropriately sized for their operations:

WaitTimeout (20 min) for long operations

DefaultTimeout (300s) for general operations

PodReadyTimeout (300s) for pod startup

NADTimeout (10s) aligns with the previous reviewer's suggestion of 5-10 seconds

46-77: LGTM! Test constants are well-documented.

MAC addresses, IP addresses, MTU, VLAN, TX rate limits, and vendor IDs are all properly documented and follow consistent naming conventions.

tests/ocp/sriov/internal/sriovenv/sriovenv.go (9)

37-52: LGTM! Operator and policy readiness checks are clean wrappers.

The functions properly delegate to the existing sriovoperator package functions with appropriate configuration.

58-108: LGTM! NetworkOption pattern provides clean, composable network configuration.

The functional options pattern allows flexible network configuration without bloating function signatures.

110-126: LGTM! Network creation with NAD wait is well-implemented.

The function creates the network with all options applied and waits for the NAD to be created before returning.

138-152: RemoveSriovNetwork correctly handles already-deleted case.

The function properly checks for IsNotFound before attempting deletion and uses the builder's DeleteAndWait method.

292-333: MTU update now preserves all spec fields via DeepCopy - previous concern addressed.

The implementation correctly stores a deep copy of the entire spec before deletion and applies it to the new policy. This preserves all fields including IsRdma, LinkType, ESwitchMode, ExternallyManaged, VdpaType, etc.

381-396: DeleteDpdkTestPod handles eco-goinfra error messages correctly.

The function checks both apierrors.IsNotFound and the eco-goinfra-specific "does not exist" message, ensuring graceful handling when the pod is already deleted.

491-569: CheckVFStatusWithPassTraffic creates thorough connectivity verification.

The function:

Generates unique pod names to avoid collisions

Creates client/server pods with cleanup via defer

Verifies interface readiness and carrier status

Conditionally checks spoof configuration

Tests actual ping connectivity

734-788: CleanupLeftoverResources provides comprehensive cleanup with good error handling.

The function:

Uses label-based selection first (preferred)

Falls back to name prefix matching (backwards compatibility)

Logs warnings instead of failing on individual errors

Uses existing sriovoperator functions for network/policy cleanup

This approach is robust for test cleanup scenarios.

758-772: The field selector suggestion is not applicable—Kubernetes field selectors only support exact matching (=, ==, !=) and cannot filter by name prefix.

The current implementation (list all namespaces and filter client-side using strings.HasPrefix) is the correct and only viable approach for this use case. Since the code is test cleanup and already falls back gracefully when listing fails, this implementation is acceptable as-is.

Likely an incorrect or invalid review comment.

zhiqiangf force-pushed the pr4-sriov-docs branch 6 times, most recently from ec77e65 to 941b840 Compare December 5, 2025 17:30

zhiqiangf marked this pull request as ready for review December 5, 2025 17:41

coderabbitai bot reviewed Dec 5, 2025

View reviewed changes

zhiqiangf force-pushed the pr4-sriov-docs branch 10 times, most recently from 9999d37 to 7cc85e6 Compare December 11, 2025 15:55

coderabbitai bot reviewed Dec 11, 2025

View reviewed changes

zhiqiangf force-pushed the pr4-sriov-docs branch from 7cc85e6 to acd5dac Compare December 11, 2025 18:07

coderabbitai bot reviewed Dec 11, 2025

View reviewed changes

coderabbitai bot reviewed Dec 12, 2025

View reviewed changes

zhiqiangf force-pushed the pr4-sriov-docs branch 4 times, most recently from 252dc0c to 493fc7d Compare December 13, 2025 00:33

ajaggapa reviewed Dec 19, 2025

View reviewed changes

zhiqiangf force-pushed the pr4-sriov-docs branch from 493fc7d to 5b0f95a Compare December 19, 2025 21:01

zhiqiangf force-pushed the pr4-sriov-docs branch 2 times, most recently from fd0aa7b to 0d69e5b Compare December 19, 2025 22:27

coderabbitai bot reviewed Dec 19, 2025

View reviewed changes

zhiqiangf force-pushed the pr4-sriov-docs branch from 36be2a6 to 05cd886 Compare December 23, 2025 03:08

ocp-sriov: handle io.EOF for empty yaml files

2b82963

zhiqiangf force-pushed the pr4-sriov-docs branch from 05cd886 to 8995f5e Compare December 23, 2025 15:29

coderabbitai bot reviewed Dec 23, 2025

View reviewed changes

tests/ocp/sriov/internal/sriovenv/sriovenv.go Show resolved Hide resolved

root added 3 commits December 23, 2025 11:29

ocp-sriov: add documentation for SR-IOV tests

8f3c907

- Add comprehensive README with test descriptions and configuration - Document environment variables and test labels - Fix NADTimeout to 10 seconds (was 2 minutes)

zhiqiangf force-pushed the pr4-sriov-docs branch from 8995f5e to e501a4c Compare December 23, 2025 16:29

ocp-sriov: update readme with current structure and env vars

16dc562

zhiqiangf force-pushed the pr4-sriov-docs branch from e501a4c to 16dc562 Compare December 23, 2025 19:06

coderabbitai bot reviewed Dec 23, 2025

View reviewed changes


	- `ECO_OCP_SRIOV_TEST_CONTAINER`: Container image for test workloads

ocp-sriov: add comprehensive test documentation #987

Are you sure you want to change the base?

ocp-sriov: add comprehensive test documentation #987

Uh oh!

Conversation

zhiqiangf commented Dec 5, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Dec 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 11, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 11, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ajaggapa Dec 19, 2025

Choose a reason for hiding this comment

Uh oh!

ajaggapa Dec 19, 2025

Choose a reason for hiding this comment

Uh oh!

ajaggapa commented Dec 19, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 19, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

zhiqiangf commented Dec 5, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 5, 2025 •

edited

Loading