-
Notifications
You must be signed in to change notification settings - Fork 49
cnf network: fix NAD creation flake #1012
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
📝 WalkthroughWalkthroughReplaces in-test direct SR‑IOV network Create() calls with a builder-variable pattern and delegates network creation plus NAD readiness waiting to a new exported helper Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
📜 Recent review detailsConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro 📒 Files selected for processing (10)
🚧 Files skipped from review as they are similar to previous changes (3)
🧰 Additional context used🧬 Code graph analysis (4)tests/cnf/core/network/sriov/tests/mlxSecureBoot.go (3)
tests/cnf/core/network/sriov/tests/allmulti.go (2)
tests/cnf/core/network/sriov/tests/metricsExporter.go (2)
tests/cnf/core/network/sriov/tests/paralleldraining.go (3)
🔇 Additional comments (3)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/cnf/core/network/sriov/tests/externalllymanaged.go(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/cnf/core/network/sriov/tests/externalllymanaged.go (3)
tests/cnf/core/network/sriov/internal/tsparams/consts.go (1)
TestNamespaceName(7-7)tests/cnf/core/network/internal/netparam/const.go (1)
LogLevelDebug(31-31)tests/cnf/core/network/sriov/internal/sriovenv/sriovenv.go (1)
CreateSriovNetworkAndWaitForNADCreation(59-68)
| sriovNetworkExtManaged := sriov.NewNetworkBuilder(APIClient, sriovAndResName, NetConfig.SriovOperatorNamespace, | ||
| tsparams.TestNamespaceName, sriovAndResName).WithStaticIpam().WithMacAddressSupport().WithIPAddressSupport(). | ||
| WithLogLevel(netparam.LogLevelDebug).Create() | ||
| WithLogLevel(netparam.LogLevelDebug) | ||
|
|
||
| err = sriovenv.CreateSriovNetworkAndWaitForNADCreation(sriovNetworkExtManaged, 5*time.Second) | ||
| Expect(err).ToNot(HaveOccurred(), "Failed to create SR-IOV network") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find the file and check its size
find . -name "externalllymanaged.go" -type f 2>/dev/null | head -5Repository: rh-ecosystem-edge/eco-gotests
Length of output: 131
🏁 Script executed:
#!/bin/bash
# Check repository structure
git ls-files | grep -i "externally\|sriov.*test" | head -20Repository: rh-ecosystem-edge/eco-gotests
Length of output: 765
🏁 Script executed:
#!/bin/bash
# Search for NetworkBuilder.Create() calls and timeout constants
echo "=== Searching for SR-IOV timeout constants ==="
rg --type go "Timeout|timeout" tests/cnf/core/network/sriov/internal/ | grep -i "nad\|default" | head -20
echo ""
echo "=== Searching for NetworkBuilder patterns ==="
rg --type go "NewNetworkBuilder.*\.Create\(\)" tests/cnf/core/network/sriov/tests/ -A 2 -B 2 | head -50
echo ""
echo "=== Checking for CreateSriovNetwork helper usage ==="
rg --type go "CreateSriovNetworkAndWaitForNADCreation" tests/cnf/core/network/sriov/ | head -20
echo ""
echo "=== Looking for timeout pattern in sriov tests ==="
rg --type go "time\.(Second|Minute)" tests/cnf/core/network/sriov/tests/ | grep -v "Binary" | head -30Repository: rh-ecosystem-edge/eco-gotests
Length of output: 6551
🏁 Script executed:
#!/bin/bash
# Read externalllymanaged.go around lines 407-411 and 441-446
echo "=== Lines 400-420 context ==="
sed -n '400,420p' tests/cnf/core/network/sriov/tests/externalllymanaged.go
echo ""
echo "=== Lines 435-450 context ==="
sed -n '435,450p' tests/cnf/core/network/sriov/tests/externalllymanaged.go
echo ""
echo "=== Check tsparams for WaitTimeout definition ==="
grep -n "WaitTimeout" tests/cnf/core/network/sriov/internal/tsparams/sriovvars.goRepository: rh-ecosystem-edge/eco-gotests
Length of output: 2300
The 5-second NAD creation timeout is inadequate and inconsistent application has been confirmed.
Lines 407-411 in the same file create an SR-IOV network using the builder pattern with .Create() but do not wait for NAD creation, while lines 441-446 now use the new helper with a wait. This inconsistency must be resolved.
More critically, the 5-second timeout is far too aggressive. The codebase defines WaitTimeout = 3 * time.Minute and DefaultTimeout = 300 * time.Second for similar async operations. Using only 5 seconds for NAD creation (60x shorter than the established WaitTimeout) will cause flakiness, especially in resource-constrained environments.
Additionally, the same 5-second timeout appears in webhook-matchConditions.go, indicating the pattern was duplicated.
Required fixes:
- Replace
5*time.Secondwithtsparams.WaitTimeout(3 minutes) - Apply the new helper function to lines 407-411 which currently use
.Create()directly - Check
webhook-matchConditions.goand update its timeout similarly
🤖 Prompt for AI Agents
In tests/cnf/core/network/sriov/tests/externalllymanaged.go around lines 441-446
the NAD creation timeout is set to an inadequate 5*time.Second and is
inconsistent with other code; replace the literal 5*time.Second with
tsparams.WaitTimeout and also update the earlier SR-IOV network creation at
lines ~407-411 to use the CreateSriovNetworkAndWaitForNADCreation helper
(instead of calling .Create() directly) so both paths wait consistently;
additionally, search for the same 5*time.Second usage in
webhook-matchConditions.go and replace it with tsparams.WaitTimeout to align
with the codebase default wait time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anvesh what do you think about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In ideal case, it takes just 1-2 secs to create NAD. So, anything more than 5seconds should be treated as unusual scenario instead of bypassing it with longer wait time.
c5e5bf5 to
44bb06c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
Fix all issues with AI Agents 🤖
In @tests/cnf/core/network/sriov/tests/allmulti.go:
- Around line 483-486: Replace the hardcoded 5*time.Second with the project’s
centralized timeout constant (e.g., DefaultTimeout or TestTimeout) when calling
sriovenv.CreateSriovNetworkAndWaitForNADCreation, and fix the Expect message
formatting by passing a single formatted string (use fmt.Sprintf) that inserts
srIovNetworkObject.Definition.Name and err into the message rather than
providing format verbs and args directly to Expect.
- Around line 331-334: Replace the hardcoded timeout passed to
CreateSriovNetworkAndWaitForNADCreation with the centralized
tsparams.NADWaitTimeout and fix the Expect error message by passing a single
formatted string (use fmt.Sprintf) instead of separate format + args;
specifically update the call that uses srIovNetworkObject and err so it calls
CreateSriovNetworkAndWaitForNADCreation(srIovNetworkObject,
tsparams.NADWaitTimeout) and change Expect(err).ToNot(HaveOccurred(), "fmt",
args...) to Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("failed to create and
wait for NAD creation for Sriov Network %s with error %v",
srIovNetworkObject.Definition.Name, err)).
In @tests/cnf/core/network/sriov/tests/metricsExporter.go:
- Around line 392-395: Replace the hardcoded 5*time.Second with the centralized
timeout constant tsparams.NADWaitTimeout when calling
sriovenv.CreateSriovNetworkAndWaitForNADCreation; also fix the error message
formatting by passing a single formatted string to
Expect(...).ToNot(HaveOccurred()) (e.g., build the message with fmt.Sprintf and
pass that string as the message argument). Ensure the change targets the call to
sriovenv.CreateSriovNetworkAndWaitForNADCreation and the
Expect(err).ToNot(HaveOccurred(...)) usage so the message uses
res.network.Definition.Name and err via fmt.Sprintf.
🧹 Nitpick comments (1)
tests/cnf/core/network/sriov/tests/rdmametricsapi.go (1)
367-372: Use the centralized timeout constant for consistency.Line 367 uses a hardcoded
5*time.Secondinstead oftsparams.NADWaitTimeout. Using the constant improves maintainability and ensures consistent timeout values across the test suite.🔎 Proposed fix
- err := sriovenv.CreateSriovNetworkAndWaitForNADCreation(testNetBuilder, 5*time.Second) + err := sriovenv.CreateSriovNetworkAndWaitForNADCreation(testNetBuilder, tsparams.NADWaitTimeout) Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("failed to create and wait for NAD creation for Sriov Network %s with error %v", netName, err))
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
tests/cnf/core/network/sriov/internal/tsparams/sriovvars.gotests/cnf/core/network/sriov/tests/allmulti.gotests/cnf/core/network/sriov/tests/exposemtu.gotests/cnf/core/network/sriov/tests/externalllymanaged.gotests/cnf/core/network/sriov/tests/metricsExporter.gotests/cnf/core/network/sriov/tests/mlxSecureBoot.gotests/cnf/core/network/sriov/tests/paralleldraining.gotests/cnf/core/network/sriov/tests/qinq.gotests/cnf/core/network/sriov/tests/rdmametricsapi.gotests/cnf/core/network/sriov/tests/webhook-matchConditions.go
🧰 Additional context used
🧬 Code graph analysis (7)
tests/cnf/core/network/sriov/tests/exposemtu.go (1)
tests/cnf/core/network/sriov/internal/sriovenv/sriovenv.go (1)
CreateSriovNetworkAndWaitForNADCreation(59-68)
tests/cnf/core/network/sriov/tests/webhook-matchConditions.go (2)
tests/cnf/core/network/sriov/internal/sriovenv/sriovenv.go (1)
CreateSriovNetworkAndWaitForNADCreation(59-68)tests/cnf/core/network/sriov/internal/tsparams/sriovvars.go (1)
NADWaitTimeout(29-29)
tests/cnf/core/network/sriov/tests/allmulti.go (2)
tests/cnf/core/network/sriov/internal/sriovenv/sriovenv.go (1)
CreateSriovNetworkAndWaitForNADCreation(59-68)tests/cnf/ran/ptp/internal/iface/iface.go (1)
Name(22-22)
tests/cnf/core/network/sriov/tests/qinq.go (1)
tests/cnf/core/network/sriov/internal/sriovenv/sriovenv.go (1)
CreateSriovNetworkAndWaitForNADCreation(59-68)
tests/cnf/core/network/sriov/tests/mlxSecureBoot.go (2)
tests/cnf/core/network/internal/netparam/const.go (1)
LogLevelDebug(31-31)tests/cnf/core/network/sriov/internal/sriovenv/sriovenv.go (1)
CreateSriovNetworkAndWaitForNADCreation(59-68)
tests/cnf/core/network/sriov/tests/rdmametricsapi.go (1)
tests/cnf/core/network/sriov/internal/sriovenv/sriovenv.go (1)
CreateSriovNetworkAndWaitForNADCreation(59-68)
tests/cnf/core/network/sriov/tests/metricsExporter.go (1)
tests/cnf/core/network/sriov/internal/sriovenv/sriovenv.go (1)
CreateSriovNetworkAndWaitForNADCreation(59-68)
🔇 Additional comments (11)
tests/cnf/core/network/sriov/internal/tsparams/sriovvars.go (1)
28-29: LGTM!The new
NADWaitTimeoutconstant centralizes the NAD creation timeout value, improving maintainability.tests/cnf/core/network/sriov/tests/webhook-matchConditions.go (1)
64-64: LGTM!Good use of the
tsparams.NADWaitTimeoutconstant instead of a hardcoded timeout value.tests/cnf/core/network/sriov/tests/mlxSecureBoot.go (1)
144-149: LGTM - Consistent refactor to use centralized NAD wait helper.The change correctly separates the builder construction from creation and uses the new
CreateSriovNetworkAndWaitForNADCreationhelper to ensure NAD readiness before proceeding. This aligns with the PR objective of fixing NAD creation flakes.tests/cnf/core/network/sriov/tests/exposemtu.go (2)
185-191: LGTM - Correct usage of NAD wait helper.The refactor properly separates builder construction from creation and delegates to the centralized helper for NAD readiness waiting.
117-125: Verify: These network creations still use direct.Create()without NAD wait.These two SR-IOV network creations in the "netdev 2 Policies with different MTU" test case still use the old pattern (direct
.Create()) without waiting for NAD readiness. Consider applying the sameCreateSriovNetworkAndWaitForNADCreationhelper here for consistency and to prevent potential NAD creation flakes in this test case as well.tests/cnf/core/network/sriov/tests/qinq.go (3)
378-382: LGTM - Consistent refactor for promiscuous network creation.The builder-variable pattern with NAD wait helper is correctly applied.
698-704: LGTM - Helper function correctly updated.The
defineAndCreateSrIovNetworkWithQinQfunction now uses the centralized NAD creation helper, ensuring consistent behavior across QinQ tests.
1049-1055: LGTM - Promiscuous network creation in helper function updated.The
defineAndCreateSriovNetworksfunction correctly applies the new pattern for the promiscuous network creation.tests/cnf/core/network/sriov/tests/paralleldraining.go (1)
272-279: LGTM - Parallel draining test configuration updated correctly.The SR-IOV network creation now uses the centralized NAD wait helper, which should help prevent flakiness in the parallel draining test setup.
tests/cnf/core/network/sriov/tests/externalllymanaged.go (2)
407-415: LGTM - Bond deployment test SR-IOV network creation updated.The change correctly applies the builder-variable pattern with the NAD wait helper, consistent with other tests in this PR.
444-450: LGTM - Helper functioncreateSriovConfigurationupdated consistently.The centralized NAD creation helper is now used, which should help prevent NAD creation flakes in externally managed SR-IOV tests. The 5-second timeout aligns with the developer's rationale that NAD creation typically completes in 1-2 seconds, making 5 seconds sufficient to catch genuine issues without masking problems with excessively long waits.
6b1efba to
426d13c
Compare
…SriovNetworkAndWaitForNADCreation - Fix NAD creation flake - Replace direct .Create() calls with CreateSriovNetworkAndWaitForNADCreation for SR-IOV networks - Add NADWaitTimeout constant to tsparams package - Replace hardcoded 5*time.Second with tsparams.NADWaitTimeout in CreateSriovNetworkAndWaitForNADCreation calls
426d13c to
33955cd
Compare
|
/lgtm |
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.