-
Notifications
You must be signed in to change notification settings - Fork 52
NO-JIRA: fix: e2e: increase timeouts for storage init / verify replicas #421
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The issue was that GetAWSMachineTemplateByPrefix and DeleteAWSMachineTemplateByPrefix were calling Eventually() without timeout and retry parameters: Eventually(komega.List(templateList, client.InNamespace(namespace))).Should(Succeed(), ...) This uses Gomega's default timeout (1 second), which is far too short when the API server returns a transient "storage is (re)initializing" error (HTTP 429). The new cluster-api-provider-aws v2.10.0 introduces the v1beta2 API version 1, and when CRD storage is reinitializing during API version transitions, these transient errors are expected. The fix adds time.Minute, RetryShort parameters to both Eventually calls, matching the pattern used by other functions in the same file (like GetAWSMachineTemplateByName at line 31). This gives the API server up to 1 minute to complete storage initialization, with 1-second retry intervals.
Storage reinitialization errors continue for 15+ minutes during the v1beta2 transition The cluster connection stabilized around 01:20:08 (~2 minutes in) The caches did eventually populate intermittently The 30-minute timeout provides more buffer for: Storage reinitialization to complete CAPI MachineSet controller to process the scale-up Machine provisioning to complete Note: This is a short-term workaround. The root cause is the prolonged storage instability during the v1beta2 API transition in the AWS provider PR.
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
WalkthroughThese changes adjust timeout and retry policies in e2e test helpers for AWS machine template operations and MachineSet replica verification. The modifications add explicit timeout/retry parameters to list operations and extend wait durations during verification, without altering function signatures or logic flow. Changes
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~5 minutes
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
e2e/machineset_migration_helpers.go (1)
176-176: Extended timeout addresses storage reinitialization delays.The increase from
WaitLongtoWaitOverLongis appropriate given the documented storage reinitialization issues during the v1beta2 AWS provider transition. The 30-minute timeout provides adequate buffer for storage initialization (15+ minutes), controller processing, and machine provisioning.Consider adding an inline comment explaining this is a temporary workaround for the v1beta2 transition storage instability, to help future maintainers understand why such an extended timeout is necessary.
Also applies to: 181-181
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (2)
e2e/framework/machinetemplate.go(2 hunks)e2e/machineset_migration_helpers.go(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
e2e/framework/machinetemplate.go (1)
e2e/framework/framework.go (1)
RetryShort(17-17)
🔇 Additional comments (2)
e2e/framework/machinetemplate.go (2)
59-59: LGTM! Explicit timeout handles storage initialization delays.Adding
time.MinuteandRetryShortparameters addresses the HTTP 429 errors during storage reinitialization. This pattern is consistent with other operations in the file (lines 31, 45, 90) and replaces the insufficient 1-second default timeout.
84-84: LGTM! Consistent timeout pattern for list operations.The explicit timeout parameters mirror the fix at line 59 and ensure the list operation in the deletion path can also tolerate storage reinitialization delays. The consistent application of
time.MinuteandRetryShortacross both functions is appropriate.
|
/label acknowledge-critical-fixes-only This increases e2e timeouts to try and make them more reliable |
|
/approve /verified bypass No need to verify at this point yet I'm interested in running this against openshift/cluster-api-provider-aws#582 |
|
@damdo: The DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: damdo The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@damdo: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@damdo: This pull request explicitly references no jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
fix: e2e: storage init timeouts
The issue was that GetAWSMachineTemplateByPrefix and DeleteAWSMachineTemplateByPrefix were calling Eventually() without timeout and retry parameters:
Eventually(komega.List(templateList, client.InNamespace(namespace))).Should(Succeed(), ...)
This uses Gomega's default timeout (1 second), which is far too short when the API server returns a transient "storage is (re)initializing" error (HTTP 429). The new cluster-api-provider-aws v2.10.0 introduces the v1beta2 API version 1, and when CRD storage is reinitializing during API version transitions, these transient errors are expected.
The fix adds time.Minute, RetryShort parameters to both Eventually calls, matching the pattern used by other functions in the same file (like GetAWSMachineTemplateByName at line 31). This gives the API server up to 1 minute to complete storage initialization, with 1-second retry intervals.
fix: e2e: increase wait for replicas timeouts
Storage reinitialization errors continue for 15+ minutes during the v1beta2 transition
The cluster connection stabilized around 01:20:08 (~2 minutes in)
The caches did eventually populate intermittently
The 30-minute timeout provides more buffer for: Storage reinitialization to complete
CAPI MachineSet controller to process the scale-up
Machine provisioning to complete
Note: This is a short-term workaround. The root cause is the prolonged storage instability during the v1beta2 API transition in the AWS provider PR.
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.