Skip to content

Conversation

@zhiqiangf
Copy link

@zhiqiangf zhiqiangf commented Dec 5, 2025

Add detailed documentation for SR-IOV test suite:

  • tests/README.md: Complete test case documentation with environment
    variables, device configuration, and execution examples
  • Updated sriov/README.md reference

Documentation covers:

  • Test prerequisites and cluster requirements
  • Environment variable configuration (SRIOV_DEVICES, etc.)
  • Supported device configurations
  • Test execution examples and filtering"

Summary by CodeRabbit

  • New Features

    • SR‑IOV config now accepts device lists with per-device options and VF count; new network options for spoof, trust, VLAN, QoS, min/max TX rate and link state.
  • Tests

    • Added a comprehensive SR‑IOV basic test suite covering networks, policies, DPDK, link state, MTU and per-device scenarios.
  • Documentation

    • Environment vars updated: ECO_OCP_SRIOV_TEST_IMAGE renamed to ECO_OCP_SRIOV_TEST_CONTAINER; expanded env var docs for devices, VF count, operator namespace and interface list.
  • Refactor

    • Reorganized SR‑IOV test internals and removed a deprecated default timeout constant.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 5, 2025

📝 Walkthrough

Walkthrough

Adds a CNF-style SR-IOV test framework for OCP: new internal packages (ocpsriovconfig, ocpsriovinittools, sriovocpenv, sriovenv, tsparams changes), device/VF parsing and defaults, many test constants, and a new Ginkgo "basic" SR-IOV test suite. README env var names/formats updated.

Changes

Cohort / File(s) Summary
Documentation
tests/ocp/sriov/README.md
Updated environment variable names and examples (ECO_OCP_SRIOV_TEST_IMAGE → ECO_OCP_SRIOV_TEST_CONTAINER) and documented new optional env vars (ECO_OCP_SRIOV_DEVICES, ECO_OCP_SRIOV_VF_NUM, ECO_OCP_SRIOV_OPERATOR_NAMESPACE, ECO_OCP_SRIOV_INTERFACE_LIST).
OCP SR-IOV config
tests/ocp/sriov/internal/ocpsriovconfig/config.go, tests/ocp/sriov/internal/ocpsriovconfig/default.yaml
Added DeviceConfig type and fields Devices, DevicesEnv, VFNum; parse ECO_OCP_SRIOV_DEVICES; added GetSriovDevices, parseSriovDevicesEnv, GetVFNum; adjusted default path and updated default YAML (vf_num, devices).
Init tools & env helpers
tests/ocp/sriov/internal/ocpsriovinittools/ocpsriovinittools.go, tests/ocp/sriov/internal/sriovocpenv/*, tests/ocp/sriov/internal/sriovenv/sriovenv.go
Added init tooling and a large sriovenv package exposing NetworkOption builder, network/NAD/policy lifecycle functions, InitVF/InitDpdkVF, pod creation/deletion, interface/PCI utilities, verification and cleanup helpers.
Test constants & vars
tests/ocp/sriov/internal/tsparams/consts.go, tests/ocp/sriov/internal/tsparams/ocpsriovvars.go
Introduced many new test constants (timeouts, VLAN/QoS/min/max TX, MTU, MAC/IP defaults, vendor IDs, resource label keys); removed DefaultTimeout and adjusted Labels documentation.
Test suite
tests/ocp/sriov/tests/basic.go
New ordered Ginkgo test suite implementing SR-IOV basic scenarios: spoof/trust permutations, VLAN/QoS/min-max TX, MTU updates, DPDK VF flows, link-state checks, per-device iteration with conditional skipping and cleanup orchestration.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Possibly related PRs

Suggested reviewers

  • evgenLevin
  • ajaggapa
  • gkopels

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'ocp-sriov: add comprehensive test documentation' accurately summarizes the main change: adding comprehensive documentation for the SR-IOV test suite, including README updates and environment variable documentation.
Docstring Coverage ✅ Passed Docstring coverage is 95.35% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Command failed


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@zhiqiangf zhiqiangf force-pushed the pr4-sriov-docs branch 6 times, most recently from ec77e65 to 941b840 Compare December 5, 2025 17:30
@zhiqiangf zhiqiangf marked this pull request as ready for review December 5, 2025 17:41
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (4)
tests/ocp/sriov/internal/tsparams/consts.go (1)

1-70: Well-structured test constants file.

The constants are well-documented with appropriate timeout values. The labeling mechanism with a base Labels slice that individual tests can augment is a clean pattern for test filtering.

A minor note: the MAC addresses 20:04:0f:f1:88:01 and 20:04:0f:f1:88:02 have the locally administered bit set (bit 1 of the first octet, 0x02), which is correct for test/virtual interfaces. However, the current values (0x20) have the multicast bit set instead. For strictly correct locally administered unicast MACs, consider using addresses like 02:04:0f:f1:88:01.

-	// TestPodClientMAC is the default MAC address for test client pods.
-	TestPodClientMAC = "20:04:0f:f1:88:01"
-	// TestPodServerMAC is the default MAC address for test server pods.
-	TestPodServerMAC = "20:04:0f:f1:88:02"
+	// TestPodClientMAC is the default MAC address for test client pods.
+	TestPodClientMAC = "02:04:0f:f1:88:01"
+	// TestPodServerMAC is the default MAC address for test server pods.
+	TestPodServerMAC = "02:04:0f:f1:88:02"
tests/ocp/sriov/sriov_suite_test.go (1)

36-80: BeforeSuite handles namespace cleanup well.

The logic to detect pre-existing namespaces from previous partial failures and recreate them is robust. However, the namespace label setup is duplicated between lines 49-51 and 65-68. Consider extracting to a helper function.

+func createTestNamespace() *namespace.Builder {
+	ns := namespace.NewBuilder(APIClient, tsparams.TestNamespaceName)
+	for key, value := range params.PrivilegedNSLabels {
+		ns.WithLabel(key, value)
+	}
+	return ns
+}
+
 var _ = BeforeSuite(func() {
 	// ... validation and cleanup ...
 
 	By("Creating test namespace with privileged labels")
-	testNS = namespace.NewBuilder(APIClient, tsparams.TestNamespaceName)
-	for key, value := range params.PrivilegedNSLabels {
-		testNS.WithLabel(key, value)
-	}
+	testNS = createTestNamespace()
 
 	_, err = testNS.Create()
 	if err != nil {
 		if apierrors.IsAlreadyExists(err) {
 			// ... delete existing ...
-			testNS = namespace.NewBuilder(APIClient, tsparams.TestNamespaceName)
-			for key, value := range params.PrivilegedNSLabels {
-				testNS.WithLabel(key, value)
-			}
+			testNS = createTestNamespace()
 			// ...
tests/ocp/sriov/internal/sriovenv/sriovenv.go (2)

716-760: Consider returning accumulated errors for observability.

The cleanup function ignores all errors and always returns nil. While best-effort cleanup is appropriate, consider logging errors at a higher verbosity level or returning an aggregate error for observability.

 func CleanupLeftoverResources() error {
+	var errs []error
 	sriovOpNs := SriovOcpConfig.OcpSriovOperatorNamespace
 
 	klog.V(90).Info("Cleaning up leftover test resources")
 
 	// Cleanup test namespaces
-	namespaces, _ := namespace.List(APIClient, metav1.ListOptions{})
+	namespaces, err := namespace.List(APIClient, metav1.ListOptions{})
+	if err != nil {
+		klog.V(90).Infof("Warning: failed to list namespaces: %v", err)
+	}
 	// ... rest of cleanup ...
 
 	klog.V(90).Info("Cleanup completed")
 
 	return nil
 }

732-740: Document the naming convention matched by cleanup regex.

The regex ^\d{5}-|\w+dpdknet$ matches specific naming patterns. Consider adding a comment explaining what test resource names follow these patterns to help future maintainers understand which resources are targeted for cleanup.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b1c772b and 941b840fe473a30924a9bdb82b503809368aac2c.

📒 Files selected for processing (10)
  • tests/ocp/internal/ocpconfig/config.go (2 hunks)
  • tests/ocp/sriov/README.md (1 hunks)
  • tests/ocp/sriov/internal/ocpsriovconfig/config.go (4 hunks)
  • tests/ocp/sriov/internal/ocpsriovconfig/default.yaml (0 hunks)
  • tests/ocp/sriov/internal/sriovenv/sriovenv.go (1 hunks)
  • tests/ocp/sriov/internal/tsparams/consts.go (1 hunks)
  • tests/ocp/sriov/internal/tsparams/sriovvars.go (1 hunks)
  • tests/ocp/sriov/sriov_suite_test.go (1 hunks)
  • tests/ocp/sriov/tests/README.md (1 hunks)
  • tests/ocp/sriov/tests/basic.go (1 hunks)
💤 Files with no reviewable changes (1)
  • tests/ocp/sriov/internal/ocpsriovconfig/default.yaml
🧰 Additional context used
🧬 Code graph analysis (2)
tests/ocp/sriov/sriov_suite_test.go (5)
tests/ocp/sriov/internal/ocpsriovconfig/config.go (1)
  • SriovOcpConfig (21-26)
tests/ocp/sriov/internal/ocpsriovinittools/ocpsriovinittools.go (2)
  • SriovOcpConfig (13-13)
  • APIClient (11-11)
tests/ocp/sriov/internal/tsparams/consts.go (1)
  • Labels (69-69)
tests/ocp/sriov/internal/tsparams/ocpsriovvars.go (1)
  • Labels (14-14)
tests/ocp/sriov/internal/sriovenv/sriovenv.go (2)
  • CleanupLeftoverResources (717-760)
  • CheckSriovOperatorStatus (40-42)
tests/ocp/sriov/internal/ocpsriovconfig/config.go (1)
tests/ocp/sriov/internal/ocpsriovinittools/ocpsriovinittools.go (1)
  • SriovOcpConfig (13-13)
🪛 LanguageTool
tests/ocp/sriov/tests/README.md

[grammar] ~493-~493: Ensure spelling is correct
Context: ... up resources using DeferCleanup ### AfterAll Hook 1. Policy Cleanup: Removes al...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🔇 Additional comments (22)
tests/ocp/internal/ocpconfig/config.go (1)

38-40: Pointer preservation around YAML decode looks reasonable; confirm null‑intent semantics

Capturing the existing GeneralConfig pointer and restoring it only when YAML decoding leaves it nil is a pragmatic way to avoid losing the initialized config during Decode, and should help prevent nil dereferences downstream.

One subtle behavioral change to be aware of: if the YAML were ever to explicitly set this embedded pointer to null, this restoration logic would override that and keep a non‑nil GeneralConfig. If "config cannot be explicitly disabled/null" is the intended contract, this is fine; otherwise you might want a more specific condition for when to restore (e.g., only for an actually empty/default file or based on file size/contents).

Also applies to: 52-55

tests/ocp/sriov/README.md (3)

1-10: Comprehensive and well-structured documentation.

The README provides excellent coverage of test development guidelines, including directory structure, naming conventions, Ginkgo patterns, and best practices. This will be valuable for maintaining consistency across the SR-IOV test suite.


404-406: Good choice using klog over glog.

Using klog is the correct choice as it's the standard logging library in the Kubernetes ecosystem and provides better integration with Kubernetes tooling.


224-257: Excellent guidance on helper function design.

The clear distinction between where Gomega/Ginkgo can be used (test files only) versus helper functions (return errors instead) is a critical best practice that prevents common anti-patterns in test code organization.

tests/ocp/sriov/tests/README.md (3)

1-15: Well-documented test suite README.

The documentation provides clear information about the test file, labels, namespace, and container configuration. This will help users understand the test suite quickly.


86-116: Comprehensive environment variable documentation.

Good coverage of both mandatory and optional environment variables, including the fallback behavior and default device configurations. The format specification for SRIOV_DEVICES is particularly helpful.


475-520: Clear test execution flow documentation.

The documentation of BeforeAll, test execution patterns, AfterAll hooks, and resource management behavior helps users understand the test lifecycle. The note about automatic cleanup of leftover resources is valuable.

Note: The static analysis hint about "AfterAll" spelling is a false positive—it's a valid Ginkgo hook name.

tests/ocp/sriov/internal/ocpsriovconfig/config.go (3)

28-41: Good decoupling of YAML/env config from embedded structs.

The introduction of sriovYAMLConfig and sriovEnvConfig helper types prevents the YAML decoder from zeroing out embedded OcpConfig/GeneralConfig pointers during decoding. This is a solid approach to preserve the parent config state.


57-62: Defensive check for runtime.Caller failure.

Good addition to handle the edge case where runtime.Caller fails. This prevents potential panics from using invalid path information.


122-134: Correct conditional application of environment overrides.

The pattern of checking for non-empty values before applying ensures that YAML defaults aren't overwritten by empty environment variables. This is the expected behavior for configuration layering.

tests/ocp/sriov/tests/basic.go (6)

22-32: Utility function for NO-CARRIER detection.

Clean implementation checking for NO-CARRIER conditions. This enables graceful skipping of devices without physical link.


34-60: Well-structured namespace setup with automatic cleanup.

The setupTestNamespace helper correctly creates namespaces with privileged labels and registers cleanup via DeferCleanup. The Eventually check ensures namespace existence before proceeding.


77-99: Proper test suite structure with BeforeAll validation.

The test suite correctly uses Ordered, Label, and ContinueOnFailure as documented in the README. The BeforeAll hook validates operator status and discovers worker nodes before any tests run.


101-119: Robust policy cleanup in AfterAll.

Good approach collecting cleanup errors without failing early, then logging warnings for any failures. The final stability wait ensures the cluster is ready after cleanup.


121-157: Consistent test case pattern.

The test case structure (device iteration, VF initialization, namespace/network setup, verification, skip on NO-CARRIER) is well-designed and consistently applied across all test cases.


458-465: Consider adding cleanup for NAD polling failure case.

The Eventually block for NAD readiness doesn't have cleanup registered before it. If NAD polling fails, the namespace cleanup will still run, but the network created at line 456 may leave orphaned resources.

This is a minor edge case since DeferCleanup at line 69-74 (via setupSriovNetwork) should handle it, but verify that RemoveSriovNetwork handles non-existent networks gracefully.

tests/ocp/sriov/internal/tsparams/sriovvars.go (4)

15-22: Clear DeviceConfig struct definition.

The struct captures all necessary device configuration with a useful SupportsMinTxRate flag for conditional test behavior.


41-75: Robust environment variable parsing.

The parser correctly handles edge cases: empty entries are skipped, malformed entries log warnings but don't fail parsing entirely, and all values are trimmed. The format documentation in the comment is helpful.


77-92: Fail-fast behavior for invalid configuration.

The panic when SRIOV_DEVICES is set but no valid entries are parsed is appropriate—it prevents tests from silently running with unintended default configurations when the user explicitly set the variable.


94-113: Proper fallback chain for VF count.

The function correctly prioritizes ECO_OCP_SRIOV_VF_NUM over SRIOV_VF_NUM, validates for positive integers, logs warnings for invalid values, and falls back to a sensible default of 2.

tests/ocp/sriov/internal/sriovenv/sriovenv.go (1)

215-266: Verify single-node VF initialization is intentional.

The function iterates through worker nodes but returns immediately after successfully creating a policy on the first node. This means VFs are only initialized on one node rather than all workers. If tests require VFs on multiple nodes, this logic would need adjustment.

tests/ocp/sriov/sriov_suite_test.go (1)

90-96: Verify ReporterNamespacesToDump and ReporterCRDsToDump are defined in tsparams package.

The tsparams.ReporterNamespacesToDump and tsparams.ReporterCRDsToDump variables are referenced in the test but were not found in the provided consts file. Confirm these are defined in the tsparams package or add the definitions if missing.

@zhiqiangf zhiqiangf force-pushed the pr4-sriov-docs branch 10 times, most recently from 9999d37 to 7cc85e6 Compare December 11, 2025 15:55
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (4)
tests/ocp/sriov/tests/README.md (1)

11-16: Clarify single “Test Namespace” vs per-test e2e namespaces

The “Test Suite Information” section declares a single test namespace sriov-tests, while later sections state that each test creates its own namespace with a test case ID prefix. In tests/ocp/sriov/tests/basic.go, pods are created in dynamically named namespaces like e2e-<testID><deviceName>, not sriov-tests.

Consider clarifying that:

  • sriov-tests is the suite-level namespace (where applicable, e.g., reporter config), and
  • each basic test also creates its own e2e-* namespace per device.

That will help operators know where to look for resources and logs.

Also applies to: 515-521

tests/ocp/sriov/internal/tsparams/ocpsriovvars.go (1)

11-18: Including LabelBasic in Labels aligns with new filtering docs

Extending Labels to {LabelSuite, LabelBasic} matches the usage in basic.go and the README examples (ocpsriov && basic). This keeps suite-level label-based filtering coherent across code and docs.

You may also want to double-check other suites under tests/ocp/sriov/ to ensure they still behave as expected when ECO_TEST_LABELS is set to this expanded label list.

tests/ocp/sriov/internal/tsparams/consts.go (1)

6-57: Good centralization of timeouts and default test data

Collecting label, timeout/interval, MAC/IP, and MTU constants here makes the SR-IOV tests much easier to tune and reason about. The values are consistent with usages in sriovenv.go and basic.go.

Just make sure any future changes to DefaultTestMTU, PodReadyTimeout, etc., are mirrored in the test documentation so operators don’t get stale numbers.

tests/ocp/sriov/internal/ocpsriovconfig/config.go (1)

26-33: Device defaults and SRIOV_DEVICES parsing look good, but doc should mention the 5th flag

The DeviceConfig / GetDefaultDevices setup and parseDevicesFromEnv logic are reasonable:

  • Code defaults include SupportsMinTxRate per device, used by the VLAN/rate-limiting test to skip NICs that don’t support minTxRate.
  • SRIOV_DEVICES supports both name:deviceid:vendor:interface and a 5th optional supportsMinTxRate flag (true|1|yestrue, anything else → false).

Two follow-ups to keep things coherent:

  1. The SR-IOV basic README currently documents only the 4-field form and lists a different default device set than GetDefaultDevices() provides. It should be updated to:

    • Reflect this exact GetDefaultDevices() list, and
    • Mention the optional 5th supportsMinTxRate field so users can correctly mark devices that don’t support min TX rate.
  2. If default.yaml also carries a devices list, consider clearly documenting (or enforcing) which source “wins” (YAML vs SRIOV_DEVICES vs built-in defaults) to avoid surprises when operators override configuration.

Also applies to: 76-88, 255-306

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1dd58b3e8bb9cb90a5e797befbb72c018af18d5e and 9999d379c5eaa2504825f4765687ec2c9927f271.

📒 Files selected for processing (8)
  • tests/ocp/sriov/README.md (1 hunks)
  • tests/ocp/sriov/internal/ocpsriovconfig/config.go (4 hunks)
  • tests/ocp/sriov/internal/ocpsriovconfig/default.yaml (1 hunks)
  • tests/ocp/sriov/internal/sriovenv/sriovenv.go (1 hunks)
  • tests/ocp/sriov/internal/tsparams/consts.go (2 hunks)
  • tests/ocp/sriov/internal/tsparams/ocpsriovvars.go (2 hunks)
  • tests/ocp/sriov/tests/README.md (1 hunks)
  • tests/ocp/sriov/tests/basic.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/ocp/sriov/internal/ocpsriovconfig/default.yaml
🧰 Additional context used
🧬 Code graph analysis (2)
tests/ocp/sriov/internal/tsparams/ocpsriovvars.go (1)
tests/ocp/sriov/internal/tsparams/consts.go (2)
  • LabelSuite (10-10)
  • LabelBasic (15-15)
tests/ocp/sriov/tests/basic.go (5)
tests/ocp/sriov/internal/ocpsriovconfig/config.go (1)
  • DeviceConfig (27-33)
tests/cnf/ran/ptp/internal/iface/iface.go (1)
  • Name (22-22)
tests/system-tests/internal/link/link.go (1)
  • NewBuilder (48-77)
tests/ocp/sriov/internal/tsparams/consts.go (7)
  • NamespaceTimeout (30-30)
  • CleanupTimeout (34-34)
  • LabelBasic (15-15)
  • PodReadyTimeout (32-32)
  • DefaultTestMTU (56-56)
  • NADTimeout (40-40)
  • PollingInterval (28-28)
tests/ocp/sriov/internal/sriovenv/sriovenv.go (17)
  • NetworkOption (88-88)
  • CreateSriovNetwork (140-155)
  • RemoveSriovNetwork (168-177)
  • CheckSriovOperatorStatus (40-42)
  • RemoveSriovPolicy (184-203)
  • WaitForSriovPolicyReady (45-47)
  • InitVF (209-211)
  • WithSpoof (91-95)
  • WithTrust (98-102)
  • CheckVFStatusWithPassTraffic (510-585)
  • WithVLAN (105-109)
  • WithVlanQoS (112-116)
  • WithMinTxRate (119-123)
  • WithMaxTxRate (126-130)
  • WithLinkState (133-137)
  • VerifyLinkStateConfiguration (589-638)
  • GetPciAddress (729-765)
🪛 LanguageTool
tests/ocp/sriov/tests/README.md

[grammar] ~493-~493: Ensure spelling is correct
Context: ... up resources using DeferCleanup ### AfterAll Hook 1. Policy Cleanup: Removes al...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🔇 Additional comments (1)
tests/ocp/sriov/README.md (1)

401-412: Logging wording now matches actual klog usage

Renaming this bullet to “klog for verbose logging” aligns the text with the klog.V(level).Infof() example below; no further changes needed here.

Comment on lines 221 to 268
// initVFWithDevType creates an SR-IOV policy for the specified device type.
// It iterates through worker nodes and creates a policy on the first node where
// the device is successfully discovered. Returns (true, nil) on success.
func initVFWithDevType(name, deviceID, interfaceName, vendor, devType string, vfNum int,
workerNodes []*nodes.Builder) (bool, error) {
sriovOpNs := SriovOcpConfig.OcpSriovOperatorNamespace

// Cleanup existing policy
_ = RemoveSriovPolicy(name, tsparams.NamespaceTimeout)

for _, node := range workerNodes {
nodeName := node.Definition.Name

// Try to discover the actual interface name
actualInterface := interfaceName

// Only discover interface if not explicitly provided
// This allows users to specify exact interface for multi-port NICs
if interfaceName == "" && vendor != "" && deviceID != "" {
if discovered, err := discoverInterfaceName(nodeName, vendor, deviceID); err == nil {
actualInterface = discovered
}
}

pfSelector := fmt.Sprintf("%s#0-%d", actualInterface, vfNum-1)

policy := sriov.NewPolicyBuilder(APIClient, name, sriovOpNs, name, vfNum,
[]string{pfSelector}, map[string]string{"kubernetes.io/hostname": nodeName}).
WithDevType(devType)

if vendor != "" {
policy.Definition.Spec.NicSelector.Vendor = vendor
}

if deviceID != "" {
policy.Definition.Spec.NicSelector.DeviceID = deviceID
}

if _, err := policy.Create(); err != nil {
klog.V(90).Infof("Failed to create policy on node %q: %v", nodeName, err)

continue
}

if err := WaitForSriovPolicyReady(tsparams.PolicyApplicationTimeout); err != nil {
klog.V(90).Infof("Policy not ready on node %q: %v", nodeName, err)

_ = policy.Delete()

continue
}

klog.V(90).Infof("Successfully created policy %q on node %q", name, nodeName)

return true, nil
}

return false, fmt.Errorf("failed to create policy %q on any node", name)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

initVFWithDevType never returns a non-error “not found” result, making graceful skips impossible

initVFWithDevType currently:

  • Cleans up any existing policy,
  • Iterates worker nodes trying to create a policy,
  • On first success, waits for readiness and returns (true, nil),
  • If all attempts fail, returns false with an error:
return false, fmt.Errorf("failed to create policy %q on any node", name)

Callers like InitVF / InitDpdkVF in turn are used by tests/basic.go, which expects to:

  • Treat err != nil as a hard failure via Expect(err).ToNot(HaveOccurred()), and
  • Use the boolean result to decide whether to skip devices and ultimately Skip("No SR-IOV devices matched...").

Because (false, nil) is never returned, “no matching device on any worker” can only manifest as an error, and tests can’t distinguish “device not present” from genuine failures. That contradicts the README’s intention to skip devices not found on any worker node.

Consider adjusting this API so callers can differentiate:

  • Return (false, nil) (or a sentinel error type) when no suitable NIC is found on any node, and reserve non-nil errors for real misconfigurations (e.g., API failures), or
  • Add a clear contract (docstring + error type) so tests can detect and Skip “not found” while still failing on other errors.

This will allow the basic tests and documentation to implement the promised “device skipping” behavior without conflating it with infrastructure failures.

🤖 Prompt for AI Agents
tests/ocp/sriov/internal/sriovenv/sriovenv.go lines 221-279: The function
currently treats "no matching device on any worker" as an error and always
returns (false, error), which prevents callers from distinguishing "not found"
from real failures; change the API so the function returns (false, nil) when no
suitable NIC is discovered on any worker (or alternatively return a
well-documented sentinel error type like ErrDeviceNotFound), and only return
non-nil errors for real failures (API/creation/timeout). Update the function to
detect the "no-match" case and return (false, nil) (or ErrDeviceNotFound)
instead of fmt.Errorf(...), and update callers (InitVF/InitDpdkVF and
tests/basic.go) to check the boolean or sentinel error and Skip when appropriate
while still failing on real errors.

Comment on lines 84 to 179
var (
vfNum = SriovOcpConfig.GetVFNum()
workerNodes []*nodes.Builder
testData = SriovOcpConfig.GetDevices()
)

BeforeAll(func() {
By("Checking the SR-IOV operator is running")
err := sriovenv.CheckSriovOperatorStatus()
Expect(err).ToNot(HaveOccurred(), "SR-IOV operator is not running")

By("Discovering worker nodes")
workerNodes, err = nodes.List(APIClient,
metav1.ListOptions{LabelSelector: SriovOcpConfig.WorkerLabel})
Expect(err).ToNot(HaveOccurred(), "Failed to discover nodes")
Expect(len(workerNodes)).To(BeNumerically(">", 0), "No worker nodes found")
})

AfterAll(func() {
By("Cleaning up SR-IOV policies after all tests")

var cleanupErrors []string
for _, item := range testData {
err := sriovenv.RemoveSriovPolicy(item.Name, tsparams.DefaultTimeout)
if err != nil {
cleanupErrors = append(cleanupErrors, fmt.Sprintf("policy %q: %v", item.Name, err))
}
}

if len(cleanupErrors) > 0 {
klog.Warningf("Some policies failed to clean up: %v", cleanupErrors)
}

By("Waiting for post-cleanup cluster stability")
err := sriovenv.WaitForSriovPolicyReady(tsparams.DefaultTimeout)
Expect(err).ToNot(HaveOccurred(), "Cluster did not stabilize after cleanup")
})

It("SR-IOV VF with spoof checking enabled", reportxml.ID("25959"), func() {
caseID := "25959"
executed := false

for _, data := range testData {
By(fmt.Sprintf("Testing device: %s", data.Name))

result, err := sriovenv.InitVF(data.Name, data.DeviceID, data.InterfaceName,
data.Vendor, vfNum, workerNodes)
Expect(err).ToNot(HaveOccurred(), "Failed to initialize VF for device %q", data.Name)

if !result {
By(fmt.Sprintf("Skipping device %q - VF init failed", data.Name))

continue
}

executed = true
testNamespace := setupTestNamespace(caseID+"-", data)
networkName := caseID + "-" + data.Name
setupSriovNetwork(networkName, data.Name, testNamespace,
sriovenv.WithSpoof(true), sriovenv.WithTrust(false))

By("Verifying VF status with pass traffic")
err = sriovenv.CheckVFStatusWithPassTraffic(networkName, data.InterfaceName,
testNamespace, "spoof checking on", tsparams.PodReadyTimeout)
if isNoCarrierError(err) {
Skip("Interface has NO-CARRIER status")
}

Expect(err).ToNot(HaveOccurred(), "Test verification failed")
}

if !executed {
Skip("No SR-IOV devices matched the requested configuration")
}
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

InitVF error handling prevents “device not found” from being treated as a skip

Across the tests (starting here with Test ID 25959), the pattern is:

result, err := sriovenv.InitVF(...)
Expect(err).ToNot(HaveOccurred(), ...)
if !result {
    // skip this device
    continue
}
executed = true
...

However, initVFWithDevType in sriovenv.go currently only returns:

  • (true, nil) when a policy is successfully created, or
  • (false, error) when it fails on all nodes.

It never returns (false, nil), so a “device not present on any worker” scenario produces an error and fails the test before the if !result or final if !executed { Skip(...) } paths can run. That contradicts the README, which says devices not found on any worker node are automatically skipped.

Consider one of these options:

  • Teach initVFWithDevType (and thus InitVF / InitDpdkVF) to distinguish “no suitable node/device found” from real failures, e.g. returning (false, nil) or a well-known sentinel error for the former; then the existing result/executed logic will work as intended, or
  • Keep InitVF returning an error, but in the tests explicitly detect a “no device found” error (e.g., via errors.Is or substring) and Skip in that case instead of unconditionally Expect(err).ToNot(HaveOccurred()).

Without such a change, running these tests on clusters lacking one or more configured devices will yield hard failures instead of clean skips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
tests/ocp/sriov/internal/sriovenv/sriovenv.go (1)

221-279: initVFWithDevType still conflates "device not found" with real errors.

The previous review correctly identified that this function returns (false, error) for all failure cases, making it impossible for callers to distinguish between "no matching device on any node" (which should trigger a skip) and actual infrastructure failures (which should fail the test).

The function currently returns:

return false, fmt.Errorf("failed to create policy %q on any node", name)

Consider returning (false, nil) when no device matches, reserving errors for genuine failures:

+var errNoMatchingDevice = fmt.Errorf("no matching device found on any worker node")

 func initVFWithDevType(...) (bool, error) {
     // ... existing logic ...
     
+    // Track if any attempt was made (vs no nodes at all)
+    var attemptedNodes int
     for _, node := range workerNodes {
+        attemptedNodes++
         // ... policy creation attempts ...
     }
     
-    return false, fmt.Errorf("failed to create policy %q on any node", name)
+    if attemptedNodes == 0 {
+        return false, fmt.Errorf("no worker nodes provided")
+    }
+    // No matching device - callers can skip
+    return false, nil
 }

This allows tests to use Skip() for "device not present" scenarios as documented in the README.

tests/ocp/sriov/tests/README.md (1)

420-443: MTU value in documentation still doesn't match code.

The README documents MTU as 1800 (line 431), but tsparams.DefaultTestMTU in consts.go is defined as 9000. This discrepancy was flagged in a previous review and marked as addressed, but the mismatch persists.

Update line 431 to reflect the actual default:

 **Configuration**:
-- MTU: `1800`
+- MTU: `9000` (defined by `DefaultTestMTU`)

And update line 438 accordingly:

-- Policy MTU is updated to 1800
+- Policy MTU is updated to 9000
🧹 Nitpick comments (4)
tests/ocp/sriov/internal/tsparams/consts.go (1)

55-56: Consider documenting MTU range constraint.

The README mentions "MTU value must be in range 1-9192 as per SR-IOV specification." Consider adding a comment here about the valid range, or defining MinTestMTU and MaxTestMTU constants for validation.

tests/ocp/sriov/internal/ocpsriovconfig/config.go (1)

308-328: Legacy VFNum parsing has a subtle ordering dependency.

The check if sriovOcpConfig.VFNum != DefaultVFNum means SRIOV_VF_NUM is only read if ECO_OCP_SRIOV_VF_NUM wasn't set. This is correct but relies on the call order in NewSriovOcpConfig(). Consider adding a brief comment in NewSriovOcpConfig() noting this intentional precedence.

tests/ocp/sriov/internal/sriovenv/sriovenv.go (2)

303-356: Policy recreation now preserves additional fields - good improvement.

The UpdateSriovPolicyMTU function now correctly preserves DeviceType, Vendor, DeviceID, and Priority from the original spec. This addresses the previous review concern.

However, there are still some spec fields that might be lost:

  • LinkType
  • IsRdma
  • NicSelector.RootDevices
  • NicSelector.NetFilter
  • ExternallyManaged
  • VdpaType

For a complete solution, consider copying all non-zero fields from the original spec, or document which fields are preserved.


167-177: Silent error swallowing in RemoveSriovNetwork.

Returning nil when PullNetwork fails assumes the network is already deleted, but the error could indicate other issues (e.g., API server unreachable). Consider logging the error or checking for specific "not found" errors:

 func RemoveSriovNetwork(name string, timeout time.Duration) error {
     network, err := sriov.PullNetwork(APIClient, name, sriovOpNs)
     if err != nil {
-        return nil // Already deleted
+        klog.V(90).Infof("Network %q not found or error pulling: %v", name, err)
+        return nil // Assume already deleted
     }
     return network.DeleteAndWait(timeout)
 }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9999d379c5eaa2504825f4765687ec2c9927f271 and acd5dacc020bde2aaebbc9622d18f297bde606b0.

📒 Files selected for processing (8)
  • tests/ocp/sriov/README.md (3 hunks)
  • tests/ocp/sriov/internal/ocpsriovconfig/config.go (4 hunks)
  • tests/ocp/sriov/internal/ocpsriovconfig/default.yaml (1 hunks)
  • tests/ocp/sriov/internal/sriovenv/sriovenv.go (1 hunks)
  • tests/ocp/sriov/internal/tsparams/consts.go (2 hunks)
  • tests/ocp/sriov/internal/tsparams/ocpsriovvars.go (2 hunks)
  • tests/ocp/sriov/tests/README.md (1 hunks)
  • tests/ocp/sriov/tests/basic.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • tests/ocp/sriov/README.md
  • tests/ocp/sriov/internal/tsparams/ocpsriovvars.go
  • tests/ocp/sriov/internal/ocpsriovconfig/default.yaml
  • tests/ocp/sriov/tests/basic.go
🧰 Additional context used
🧬 Code graph analysis (1)
tests/ocp/sriov/internal/ocpsriovconfig/config.go (3)
tests/cnf/ran/ptp/internal/iface/iface.go (1)
  • Name (22-22)
tests/ocp/internal/ocpconfig/config.go (1)
  • OcpConfig (20-22)
tests/ocp/sriov/internal/ocpsriovinittools/ocpsriovinittools.go (1)
  • SriovOcpConfig (13-13)
🪛 LanguageTool
tests/ocp/sriov/tests/README.md

[grammar] ~493-~493: Ensure spelling is correct
Context: ... up resources using DeferCleanup ### AfterAll Hook 1. Policy Cleanup: Removes al...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🔇 Additional comments (18)
tests/ocp/sriov/tests/README.md (3)

1-15: Clear and well-structured documentation.

The overview section effectively introduces the test suite, its purpose, and the key information users need (file location, labels, namespace, container). The structure follows documentation best practices.


16-85: Comprehensive prerequisites section with helpful examples.

The cluster requirements and OpenShift configuration sections are thorough, including verification commands and a practical MachineConfig example for DPDK hugepages. This will help users set up their test environment correctly.


492-497: Minor grammar issue: "AfterAll" spelling.

Static analysis flagged "AfterAll" - this is actually correct as it refers to the Ginkgo AfterAll hook. No change needed.

tests/ocp/sriov/internal/tsparams/consts.go (4)

14-16: LGTM - Well-defined label constant.

The LabelBasic constant follows the existing naming convention and provides a clear label for test filtering.


21-44: Reasonable timeout constants with good documentation.

The timeout values are well-chosen for SR-IOV operations:

  • PolicyApplicationTimeout at 35 minutes matches MCOWaitTimeout, appropriate for MCP updates
  • PodReadyTimeout at 5 minutes allows for SR-IOV device attachment
  • Polling intervals (2-10 seconds) are reasonable for avoiding API server overload

46-53: Good use of locally administered MAC addresses.

The 02: prefix in MAC addresses correctly indicates locally administered addresses, avoiding conflicts with real hardware MAC addresses. The /24 CIDR notation for IPs is appropriate for test networks.


64-67: Rate limiting constants look reasonable.

The TestMinTxRate (40 Mbps) and TestMaxTxRate (100 Mbps) provide a valid range for rate limiting tests. Note that some devices (BCM) don't support minTxRate, which is handled in the test code.

tests/ocp/sriov/internal/ocpsriovconfig/config.go (5)

45-75: Excellent documentation for the temporary struct pattern.

The detailed comments explaining why separate sriovYAMLConfig and sriovEnvConfig structs are needed to avoid yaml.v2's pointer-zeroing behavior is very helpful for future maintainers. This is a known gotcha with yaml.v2 and the workaround is correct.


76-88: Default devices are well-defined with feature flags.

The SupportsMinTxRate field correctly distinguishes devices that support rate limiting from those that don't (BCM and x710 devices). This enables graceful skipping of unsupported features in tests.


104-113: Initialization order is correct.

Setting defaults before reading config files ensures that:

  1. Defaults are available if config files don't specify values
  2. Config files can override defaults
  3. Environment variables can override config file values

The error handling for runtime.Caller(0) failure is appropriate.


141-157: Defensive accessor methods with safe defaults.

GetDevices() and GetVFNum() properly return defaults when the config is empty or invalid. However, GetDevices() returns GetDefaultDevices() when len(c.Devices) == 0, but c.Devices is already set to defaults in NewSriovOcpConfig(). This redundancy is harmless but could be simplified.


259-306: Device parsing handles edge cases well.

The parseDevicesFromEnv function:

  • Trims whitespace from entries and parts
  • Handles optional 5th field for SupportsMinTxRate
  • Logs warnings for invalid entries without failing
  • Only overwrites defaults if at least one valid device is parsed

One minor improvement: the comment at line 301 could be clearer.

tests/ocp/sriov/internal/sriovenv/sriovenv.go (6)

39-42: Simple and effective operator status check.

The CheckSriovOperatorStatus function correctly delegates to the existing sriovoperator.IsSriovDeployed utility, maintaining consistency with other test suites.


87-137: Clean functional options pattern for network configuration.

The NetworkOption type and With* functions provide a flexible, extensible API for configuring SR-IOV networks. This follows Go best practices for builder patterns.


509-585: Comprehensive VF status verification with traffic test.

The CheckVFStatusWithPassTraffic function properly:

  • Creates client/server pods with cleanup via defer
  • Verifies interface readiness and carrier status
  • Conditionally verifies spoof checking based on description
  • Tests actual connectivity with ping

Good defensive cleanup pattern.


683-722: Debug pod execution with proper cleanup.

The executeOnNode function correctly:

  • Truncates pod name to 63 characters (K8s limit)
  • Cleans up existing debug pods before creating new ones
  • Uses defer for cleanup
  • Uses nsenter to execute commands in the host namespace

One consideration: the 63-character truncation might create collisions for very long node names with similar prefixes.


771-839: Best-effort cleanup with good logging.

The CleanupLeftoverResources function appropriately:

  • Logs warnings but continues on errors (best-effort)
  • Uses regex patterns to match test resources
  • Cleans up namespaces, networks, and policies

The regex pattern ^\d{5}-|\w+dpdknet$ correctly matches test case ID prefixes and DPDK network suffixes.


728-765: PCI address extraction handles namespaced network names.

The GetPciAddress function correctly strips the namespace prefix from network names using strings.LastIndex. The JSON struct for parsing network status is appropriately minimal.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between acd5dacc020bde2aaebbc9622d18f297bde606b0 and 30bb5d969242eb732ce9a179a07f22a95dbb391f.

📒 Files selected for processing (1)
  • tests/ocp/sriov/tests/README.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
tests/ocp/sriov/tests/README.md

[grammar] ~493-~493: Ensure spelling is correct
Context: ... up resources using DeferCleanup ### AfterAll Hook 1. Policy Cleanup: Removes al...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🔇 Additional comments (1)
tests/ocp/sriov/tests/README.md (1)

420-443: MTU documentation correctly references DefaultTestMTU.

The MTU section now properly documents the value as 9000 (defined by DefaultTestMTU), which aligns with the actual test implementation and addresses the previous documentation mismatch concern.

Comment on lines 103 to 115
#### Device Configuration

If `SRIOV_DEVICES` is not set, the test suite uses default device configurations:

- `e810xxv` (DeviceID: 159b, Vendor: 8086, Interface: eno12409)
- `e810c` (DeviceID: 1593, Vendor: 8086, Interface: ens2f2)
- `x710` (DeviceID: 1572, Vendor: 8086, Interface: ens5f0) - NO-CARRIER
- `bcm57414` (DeviceID: 16d7, Vendor: 14e4, Interface: ens4f1np1)
- `bcm57508` (DeviceID: 1750, Vendor: 14e4, Interface: ens3f0np0) - NO-CARRIER
- `e810back` (DeviceID: 1591, Vendor: 8086, Interface: ens4f2)
- `cx7anl244` (DeviceID: 1021, Vendor: 15b3, Interface: ens2f0np0)

**Note**: Devices marked with "NO-CARRIER" may be skipped if the interface has no carrier signal.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect GetDefaultDevices() implementation
echo "=== GetDefaultDevices() function ==="
rg -A 150 'func GetDefaultDevices' tests/ocp/sriov/internal/ocpsriovconfig/config.go

echo ""
echo "=== Check for device-related mappings/constants ==="
rg -B 2 -A 5 'map\[.*DeviceConfig|var.*Device' tests/ocp/sriov/internal/ocpsriovconfig/config.go

Repository: rh-ecosystem-edge/eco-gotests

Length of output: 4745


🏁 Script executed:

#!/bin/bash
# Also check the README.md file structure
echo "=== README.md device configuration section (lines 100-120) ==="
sed -n '100,120p' tests/ocp/sriov/tests/README.md

Repository: rh-ecosystem-edge/eco-gotests

Length of output: 1123


🏁 Script executed:

#!/bin/bash
# Check for line 493 grammar issue
echo "=== README.md around line 493 ==="
sed -n '490,495p' tests/ocp/sriov/tests/README.md

Repository: rh-ecosystem-edge/eco-gotests

Length of output: 323


Device configuration list is out of sync with GetDefaultDevices() in config.go.

The documented device list has multiple discrepancies with the actual implementation:

  • Missing devices: cx6 (ens6f0np0) and cx6dx (ens7f0np0) are not documented
  • Non-existent device: e810back appears in documentation but doesn't exist in the code
  • Interface name mismatches:
    • bcm57414: documented as ens4f1np1, actually ens3f0
    • bcm57508: documented as ens3f0np0, actually ens4f0
    • cx7anl244: documented as ens2f0np0, actually cx7 with ens8f0np0

Regenerate the device list directly from GetDefaultDevices() to ensure users following the documentation have accurate information.

🤖 Prompt for AI Agents
In tests/ocp/sriov/tests/README.md around lines 103 to 115, the documented
default device list is out of sync with GetDefaultDevices() in config.go; update
the README by regenerating the device table directly from GetDefaultDevices()
and replacing the listed entries: remove the non-existent e810back, add missing
cx6 (ens6f0np0) and cx6dx (ens7f0np0), and correct interface names for bcm57414
to ens3f0, bcm57508 to ens4f0, and cx7anl244 to cx7 with ens8f0np0; ensure
NO-CARRIER annotations are preserved where applicable and commit the updated
README so docs match the implementation.

@zhiqiangf zhiqiangf force-pushed the pr4-sriov-docs branch 4 times, most recently from 252dc0c to 493fc7d Compare December 13, 2025 00:33
#### Optional
- `ECO_OCP_SRIOV_INTERFACE_LIST`: Comma-separated list of SR-IOV interfaces
- `ECO_OCP_SRIOV_TEST_IMAGE`: Container image for test workloads
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- `ECO_OCP_SRIOV_TEST_CONTAINER`: Container image for test workloads

// MCPStableInterval represents polling interval for MCP stability checks.
MCPStableInterval = 10 * time.Second
// NADTimeout represents timeout for NAD creation.
NADTimeout = 2 * time.Minute
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in worst case, this should be 5-10seconds.
if it takes more than that we better consider it a failure.

@ajaggapa
Copy link
Collaborator

@zhiqiangf
i see all your open PRs are incremental. So the older comments are still valid on the latest one. Please address them.
Also, it is recommended to have fewer commits on each PR and you can force push changes on the same commit to address reviews

- Add DeviceConfig struct for SR-IOV device configuration
- Add GetSriovDevices() method with ECO_OCP_SRIOV_DEVICES env var support
- Add GetVFNum() method with error handling
- Move all defaults to default.yaml
- Fix suite-level labels to exclude feature-specific labels
- Add constants for timeouts, labels, and test parameters
@zhiqiangf zhiqiangf force-pushed the pr4-sriov-docs branch 2 times, most recently from fd0aa7b to 0d69e5b Compare December 19, 2025 22:27
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (2)
tests/ocp/sriov/tests/basic.go (1)

150-158: InitVF error handling still prevents graceful device skipping.

The past review comment on this pattern remains applicable. initVFWithDevType in sriovenv.go (line 248) returns (false, fmt.Errorf(...)) when no matching device is found on any node, so the Expect(err).ToNot(HaveOccurred()) on line 152 will fail before the if !result skip logic can execute.

This pattern is repeated in all 8 test cases. To implement the intended "skip devices not found" behavior, either:

  1. Modify initVFWithDevType to return (false, nil) for "device not present" scenarios, or
  2. Use a sentinel error type and check with errors.Is before the Expect.
tests/ocp/sriov/internal/sriovenv/sriovenv.go (1)

243-249: initVFWithDevType still cannot gracefully indicate "device not found".

The previous review comment remains unaddressed. Line 248 returns (false, fmt.Errorf(...)) for all failure cases, making it impossible for callers to distinguish "no suitable device on any worker" from genuine API/configuration failures.

This prevents the test skip logic in basic.go from working as documented. Consider returning (false, nil) when the device simply isn't present on any node, reserving errors for actual failures.

🧹 Nitpick comments (2)
tests/ocp/sriov/internal/tsparams/consts.go (1)

50-53: Consider consolidating duplicate IP constants.

TestPodClientIP and TestPodServerIP duplicate ClientIPv4IPAddress and ServerIPv4IPAddress from ocpsriovvars.go. Consider using a single source of truth for these values.

tests/ocp/sriov/internal/ocpsriovconfig/config.go (1)

129-134: Consider documenting accepted values for minTxRate.

The parsing accepts true, 1, or yes (case-insensitive) as true values, and anything else defaults to false. This is reasonable behavior but could be clearer in the format documentation on line 110-111.

Suggested documentation improvement
 // parseSriovDevicesEnv parses device configuration from ECO_OCP_SRIOV_DEVICES environment variable.
-// Format: "name1:deviceid1:vendor1:interface1[:minTxRate],name2:deviceid2:vendor2:interface2[:minTxRate],..."
+// Format: "name1:deviceid1:vendor1:interface1[:minTxRate],name2:deviceid2:vendor2:interface2[:minTxRate],..."
+// minTxRate accepts: true/1/yes for enabled, any other value (including omitted) defaults to false.
 // Example: "e810xxv:159b:8086:ens2f0,e810c:1593:8086:ens2f2".
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 941b840fe473a30924a9bdb82b503809368aac2c and 0d69e5b.

📒 Files selected for processing (7)
  • tests/ocp/sriov/README.md (2 hunks)
  • tests/ocp/sriov/internal/ocpsriovconfig/config.go (3 hunks)
  • tests/ocp/sriov/internal/ocpsriovconfig/default.yaml (1 hunks)
  • tests/ocp/sriov/internal/sriovenv/sriovenv.go (1 hunks)
  • tests/ocp/sriov/internal/tsparams/consts.go (2 hunks)
  • tests/ocp/sriov/internal/tsparams/ocpsriovvars.go (2 hunks)
  • tests/ocp/sriov/tests/basic.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/ocp/sriov/README.md
🧰 Additional context used
🧬 Code graph analysis (3)
tests/ocp/sriov/internal/ocpsriovconfig/config.go (2)
tests/ocp/internal/ocpconfig/config.go (1)
  • OcpConfig (20-22)
tests/ocp/sriov/internal/ocpsriovinittools/ocpsriovinittools.go (1)
  • SriovOcpConfig (13-13)
tests/ocp/sriov/internal/tsparams/ocpsriovvars.go (1)
tests/ocp/sriov/internal/tsparams/consts.go (1)
  • LabelSuite (10-10)
tests/ocp/sriov/internal/sriovenv/sriovenv.go (2)
tests/internal/sriovoperator/sriovenv.go (4)
  • IsSriovDeployed (32-54)
  • WaitForSriovAndMCPStable (80-100)
  • RemoveAllSriovNetworks (156-176)
  • RemoveAllPoliciesAndWaitForSriovAndMCPStable (180-195)
tests/internal/cluster/cluster.go (1)
  • ExecCmdWithStdout (98-180)
🔇 Additional comments (6)
tests/ocp/sriov/internal/tsparams/ocpsriovvars.go (1)

16-18: LGTM!

Clear documentation distinguishing suite-level labels from feature-specific labels. The consolidation of DefaultTimeout to consts.go reduces duplication.

tests/ocp/sriov/tests/basic.go (2)

45-72: LGTM!

Well-structured helper with proper namespace creation, label application, and cleanup via DeferCleanup. The Eventually wait ensures namespace readiness before proceeding.


454-532: DPDK test implementation looks correct.

The test properly skips BCM devices with a clear bug reference (OCPBUGS-30909), uses per-device namespaces, and verifies PCI address assignment. The DeferCleanup correctly captures the per-iteration namespace via closure.

tests/ocp/sriov/internal/tsparams/consts.go (1)

21-44: Well-organized timeout constants.

The timeout values are appropriate for SR-IOV operations. PolicyApplicationTimeout at 35 minutes aligns with MCOWaitTimeout for MachineConfigPool operations, which is correct since SR-IOV policy changes often trigger node reconfiguration.

tests/ocp/sriov/internal/sriovenv/sriovenv.go (2)

109-125: Well-designed functional options pattern for network configuration.

The NetworkOption pattern with WithSpoof, WithTrust, WithVLAN, etc. provides a clean, extensible API for network creation. The CreateSriovNetwork function composes these options nicely.


713-749: Cleanup function handles errors gracefully.

Using warnings instead of returning errors on cleanup failures is appropriate since cleanup should be best-effort and not block test completion. The function properly cleans up test namespaces, networks, and policies.

Comment on lines +148 to +160
// GetVFNum returns the configured number of virtual functions.
// Returns an error if VFNum is not properly configured.
func (sriovOcpConfig *SriovOcpConfig) GetVFNum() (int, error) {
if sriovOcpConfig.VFNum <= 0 {
return 0, fmt.Errorf(
"no SR-IOV VFs configured, check env var ECO_OCP_SRIOV_VF_NUM")
}

return sriovOcpConfig.VFNum, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Error message could mention both configuration sources.

The error message only references the environment variable, but VFNum can also be set via vf_num in the YAML config file.

Suggested fix
 func (sriovOcpConfig *SriovOcpConfig) GetVFNum() (int, error) {
 	if sriovOcpConfig.VFNum <= 0 {
 		return 0, fmt.Errorf(
-			"no SR-IOV VFs configured, check env var ECO_OCP_SRIOV_VF_NUM")
+			"no SR-IOV VFs configured, check vf_num in YAML or env var ECO_OCP_SRIOV_VF_NUM")
 	}
 
 	return sriovOcpConfig.VFNum, nil
 }
🤖 Prompt for AI Agents
In tests/ocp/sriov/internal/ocpsriovconfig/config.go around lines 148 to 157,
the error returned when VFNum is not set only mentions the environment variable;
update the error text to reference both configuration sources (the
ECO_OCP_SRIOV_VF_NUM environment variable and the vf_num key in the YAML config)
so callers know where to look; modify the fmt.Errorf call to include a concise
message listing both sources (for example: "no SR-IOV VFs configured; set
ECO_OCP_SRIOV_VF_NUM or vf_num in the YAML config") and keep the same error
return shape.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (3)
tests/ocp/sriov/internal/ocpsriovconfig/config.go (1)

151-160: Error message should mention both configuration sources.

The error message at line 156 only references the environment variable, but VFNum can also be set via vf_num in the YAML config file. This makes troubleshooting harder for users who configure via YAML.

tests/ocp/sriov/tests/basic.go (1)

143-179: InitVF error handling prevents graceful device skipping.

The pattern used across all tests (also at lines 181-214, 216-249, 251-284, 286-329, 331-364, 366-407, 409-452, 454-532):

result, err := sriovenv.InitVF(...)
Expect(err).ToNot(HaveOccurred(), ...)
if !result {
    continue
}

However, initVFWithDevType in sriovenv.go line 267 returns (false, fmt.Errorf(...)) when no suitable device is found on any worker, never (false, nil). This causes Expect(err).ToNot(HaveOccurred()) to fail before the skip logic (if !result or final if !executed { Skip(...) }) can execute.

This contradicts the README's promise to automatically skip devices not found on any worker node.

tests/ocp/sriov/internal/sriovenv/sriovenv.go (1)

200-268: initVFWithDevType treats "device not found" as an error, preventing graceful test skips.

At line 267, the function returns (false, fmt.Errorf(...)) when no suitable device is found on any worker node. This prevents callers in tests/basic.go from distinguishing "device not present" from genuine failures, because:

result, err := sriovenv.InitVF(...)
Expect(err).ToNot(HaveOccurred())  // Fails here for "not found"
if !result { Skip(...) }            // Never reached

Consider returning (false, nil) when no matching NIC is discovered (lines 229-232, 249-251, etc.), reserving non-nil errors for real failures (API errors, timeouts). This will allow tests to implement the promised "device skipping" behavior without conflating "not found" with infrastructure failures.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 36be2a6 and 8995f5e.

📒 Files selected for processing (6)
  • tests/ocp/sriov/README.md
  • tests/ocp/sriov/internal/ocpsriovconfig/config.go
  • tests/ocp/sriov/internal/ocpsriovconfig/default.yaml
  • tests/ocp/sriov/internal/sriovenv/sriovenv.go
  • tests/ocp/sriov/internal/tsparams/consts.go
  • tests/ocp/sriov/tests/basic.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/ocp/sriov/internal/ocpsriovconfig/default.yaml
  • tests/ocp/sriov/internal/tsparams/consts.go
🧰 Additional context used
🧬 Code graph analysis (3)
tests/ocp/sriov/tests/basic.go (3)
tests/ocp/sriov/internal/ocpsriovconfig/config.go (1)
  • DeviceConfig (26-32)
tests/ocp/sriov/internal/tsparams/consts.go (12)
  • BCMVendorID (72-72)
  • NamespaceTimeout (30-30)
  • CleanupTimeout (34-34)
  • LabelBasic (15-15)
  • PodReadyTimeout (32-32)
  • TestVLAN (61-61)
  • TestVlanQoS (63-63)
  • TestMinTxRate (65-65)
  • TestMaxTxRate (67-67)
  • DefaultTestMTU (56-56)
  • NADTimeout (40-40)
  • PollingInterval (28-28)
tests/ocp/sriov/internal/sriovenv/sriovenv.go (17)
  • NetworkOption (59-59)
  • CreateSriovNetwork (111-126)
  • RemoveSriovNetwork (139-152)
  • WithSpoof (62-66)
  • WithTrust (69-73)
  • WithVLAN (76-80)
  • WithVlanQoS (83-87)
  • WithMinTxRate (90-94)
  • WithMaxTxRate (97-101)
  • WithLinkState (104-108)
  • VerifyLinkStateConfiguration (585-637)
  • UpdateSriovPolicyMTU (295-346)
  • InitDpdkVF (196-198)
  • CreateDpdkTestPod (373-392)
  • DeleteDpdkTestPod (395-408)
  • WaitForPodWithLabelReady (411-448)
  • GetPciAddress (705-740)
tests/ocp/sriov/internal/ocpsriovconfig/config.go (3)
tests/cnf/ran/ptp/internal/iface/iface.go (1)
  • Name (22-22)
tests/ocp/internal/ocpconfig/config.go (1)
  • OcpConfig (20-22)
tests/ocp/sriov/internal/ocpsriovinittools/ocpsriovinittools.go (1)
  • SriovOcpConfig (13-13)
tests/ocp/sriov/internal/sriovenv/sriovenv.go (3)
tests/internal/sriovoperator/sriovenv.go (4)
  • IsSriovDeployed (32-54)
  • WaitForSriovAndMCPStable (80-100)
  • RemoveAllSriovNetworks (156-176)
  • RemoveAllPoliciesAndWaitForSriovAndMCPStable (180-195)
tests/ocp/sriov/internal/tsparams/consts.go (14)
  • MCPStableInterval (38-38)
  • NADTimeout (40-40)
  • PollingInterval (28-28)
  • NamespaceTimeout (30-30)
  • PolicyApplicationTimeout (42-42)
  • PodReadyTimeout (32-32)
  • TestPodClientIP (51-51)
  • TestPodClientMAC (47-47)
  • CleanupTimeout (34-34)
  • TestPodServerIP (53-53)
  • TestPodServerMAC (49-49)
  • CarrierWaitTimeout (36-36)
  • TestResourceLabelKey (75-75)
  • TestResourceLabelValue (77-77)
tests/internal/cluster/cluster.go (1)
  • ExecCmdWithStdout (98-180)

root added 3 commits December 23, 2025 11:29
- Add SR-IOV environment helper functions using eco-goinfra
- Use existing sriovoperator functions for stability checks and cleanup
- Add network creation with NetworkOption pattern
- Add policy creation and VF initialization helpers
- Add pod verification and traffic testing utilities
- Add 9 basic SR-IOV test cases with test IDs as labels
- Test spoof checking, trust, VLAN, link state, MTU, min/max tx rate
- Add DPDK test with BCM NIC skip (OCPBUGS-30909)
- Use constants from tsparams for test values
- Add proper cleanup in AfterAll and AfterEach
- Add comprehensive README with test descriptions and configuration
- Document environment variables and test labels
- Fix NADTimeout to 10 seconds (was 2 minutes)
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (2)
tests/ocp/sriov/tests/basic.go (1)

143-179: InitVF error handling still prevents graceful skips for "device not found" scenarios.

The past review comment remains applicable. When InitVF returns an error (line 152), the test fails immediately via Expect(err).ToNot(HaveOccurred()). However, the underlying initVFWithDevType returns (false, error) when no device is found on any node, not (false, nil).

This means devices not present on any worker will cause test failures rather than skips, contradicting the intended skip behavior (lines 176-178).

Consider either:

  1. Modifying initVFWithDevType to return (false, nil) for "no device found" cases
  2. Using a sentinel error and checking with errors.Is before the Expect
tests/ocp/sriov/internal/sriovenv/sriovenv.go (1)

200-268: initVFWithDevType returns error for "no device found", preventing graceful skips.

As noted in the past review comment, this function returns (false, fmt.Errorf(...)) on line 267 when no node has the device. This causes callers using Expect(err).ToNot(HaveOccurred()) to fail instead of skip.

Consider returning (false, nil) when no suitable node/device is found, reserving errors for actual failures (API errors, timeouts, etc.). This would allow the if !result { continue } and if !executed { Skip(...) } patterns in tests to work as intended.

🧹 Nitpick comments (2)
tests/ocp/sriov/tests/basic.go (1)

47-72: Test namespace should include the test resource label for reliable cleanup.

The setupTestNamespace function applies params.PrivilegedNSLabels but doesn't include tsparams.TestResourceLabelKey. The CleanupLeftoverResources function in sriovenv.go first looks for namespaces with this label before falling back to the e2e- prefix match.

Adding the test label ensures consistent cleanup behavior and aligns with the documented cleanup strategy.

🔎 Proposed fix
 func setupTestNamespace(testID string, data sriovconfig.DeviceConfig) string {
 	testNamespace := "e2e-" + testID + data.Name
 
 	By(fmt.Sprintf("Creating test namespace %q", testNamespace))
 
 	nsBuilder := namespace.NewBuilder(APIClient, testNamespace)
 	for key, value := range params.PrivilegedNSLabels {
 		nsBuilder.WithLabel(key, value)
 	}
+	nsBuilder.WithLabel(tsparams.TestResourceLabelKey, tsparams.TestResourceLabelValue)
 
 	_, err := nsBuilder.Create()
tests/ocp/sriov/internal/tsparams/consts.go (1)

25-28: Clarify the distinction between RetryInterval and PollingInterval.

Both RetryInterval (3s) and PollingInterval (2s) serve similar purposes but have slightly different values. Consider adding comments to clarify when to use each, or consolidate to a single polling interval constant for consistency.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8995f5e and 16dc562.

📒 Files selected for processing (4)
  • tests/ocp/sriov/README.md
  • tests/ocp/sriov/internal/sriovenv/sriovenv.go
  • tests/ocp/sriov/internal/tsparams/consts.go
  • tests/ocp/sriov/tests/basic.go
🧰 Additional context used
🧬 Code graph analysis (1)
tests/ocp/sriov/internal/sriovenv/sriovenv.go (2)
tests/internal/sriovoperator/sriovenv.go (4)
  • IsSriovDeployed (32-54)
  • WaitForSriovAndMCPStable (80-100)
  • RemoveAllSriovNetworks (156-176)
  • RemoveAllPoliciesAndWaitForSriovAndMCPStable (180-195)
tests/internal/cluster/cluster.go (1)
  • ExecCmdWithStdout (98-180)
🔇 Additional comments (20)
tests/ocp/sriov/README.md (2)

34-49: LGTM! Directory structure is well-documented.

The updated directory structure accurately reflects the new internal packages (ocpsriovconfig, ocpsriovinittools, sriovenv, sriovocpenv) and test files. The organization follows the established pattern for the eco-gotests framework.


468-473: Environment variable documentation looks correct.

The environment variables are properly documented with format examples. Note that ECO_OCP_SRIOV_TEST_CONTAINER aligns with the previous review comment suggestion (renamed from ECO_OCP_SRIOV_TEST_IMAGE).

tests/ocp/sriov/tests/basic.go (6)

25-36: LGTM! Helper function for NO-CARRIER detection is clean.

The isNoCarrierError function correctly handles nil errors and checks for both common carrier error messages.


38-43: LGTM! BCM device detection is straightforward.

The function correctly identifies Broadcom NICs by both name pattern and vendor ID.


74-88: LGTM! Network setup helper is well-structured.

The function correctly creates the SR-IOV network and registers cleanup via DeferCleanup.


102-121: LGTM! BeforeAll setup is comprehensive.

The setup properly validates the SR-IOV operator status, loads configuration, and discovers worker nodes with appropriate assertions.


123-141: AfterAll cleanup continues despite individual errors - good pattern.

The cleanup correctly collects errors rather than failing on the first one, ensuring all policies get cleanup attempts. The final stability wait ensures the cluster is ready for subsequent test runs.


454-532: DPDK test case is well-structured with proper cleanup and verification.

The test correctly:

  • Skips BCM devices with appropriate documentation reference
  • Creates NAD and waits for readiness
  • Verifies PCI address assignment
  • Validates network-status annotation

The DeferCleanup for the DPDK pod ensures cleanup even on test failure.

tests/ocp/sriov/internal/tsparams/consts.go (3)

14-15: LGTM! New basic test label added.

The LabelBasic constant provides a clear label for filtering basic SR-IOV test cases.


21-44: LGTM! Timeout constants are well-organized and documented.

The timeouts are appropriately sized for their operations:

  • WaitTimeout (20 min) for long operations
  • DefaultTimeout (300s) for general operations
  • PodReadyTimeout (300s) for pod startup
  • NADTimeout (10s) aligns with the previous reviewer's suggestion of 5-10 seconds

46-77: LGTM! Test constants are well-documented.

MAC addresses, IP addresses, MTU, VLAN, TX rate limits, and vendor IDs are all properly documented and follow consistent naming conventions.

tests/ocp/sriov/internal/sriovenv/sriovenv.go (9)

37-52: LGTM! Operator and policy readiness checks are clean wrappers.

The functions properly delegate to the existing sriovoperator package functions with appropriate configuration.


58-108: LGTM! NetworkOption pattern provides clean, composable network configuration.

The functional options pattern allows flexible network configuration without bloating function signatures.


110-126: LGTM! Network creation with NAD wait is well-implemented.

The function creates the network with all options applied and waits for the NAD to be created before returning.


138-152: RemoveSriovNetwork correctly handles already-deleted case.

The function properly checks for IsNotFound before attempting deletion and uses the builder's DeleteAndWait method.


292-333: MTU update now preserves all spec fields via DeepCopy - previous concern addressed.

The implementation correctly stores a deep copy of the entire spec before deletion and applies it to the new policy. This preserves all fields including IsRdma, LinkType, ESwitchMode, ExternallyManaged, VdpaType, etc.


381-396: DeleteDpdkTestPod handles eco-goinfra error messages correctly.

The function checks both apierrors.IsNotFound and the eco-goinfra-specific "does not exist" message, ensuring graceful handling when the pod is already deleted.


491-569: CheckVFStatusWithPassTraffic creates thorough connectivity verification.

The function:

  • Generates unique pod names to avoid collisions
  • Creates client/server pods with cleanup via defer
  • Verifies interface readiness and carrier status
  • Conditionally checks spoof configuration
  • Tests actual ping connectivity

734-788: CleanupLeftoverResources provides comprehensive cleanup with good error handling.

The function:

  • Uses label-based selection first (preferred)
  • Falls back to name prefix matching (backwards compatibility)
  • Logs warnings instead of failing on individual errors
  • Uses existing sriovoperator functions for network/policy cleanup

This approach is robust for test cleanup scenarios.


758-772: The field selector suggestion is not applicable—Kubernetes field selectors only support exact matching (=, ==, !=) and cannot filter by name prefix.

The current implementation (list all namespaces and filter client-side using strings.HasPrefix) is the correct and only viable approach for this use case. Since the code is test cleanup and already falls back gracefully when listing fails, this implementation is acceptable as-is.

Likely an incorrect or invalid review comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants