
Conversation

@damdo damdo commented Dec 6, 2025

  • fix: e2e: storage init timeouts
    The issue was that GetAWSMachineTemplateByPrefix and DeleteAWSMachineTemplateByPrefix were calling Eventually() without timeout and retry parameters:
    Eventually(komega.List(templateList, client.InNamespace(namespace))).Should(Succeed(), ...)
    This uses Gomega's default timeout (1 second), which is far too short when the API server returns a transient "storage is (re)initializing" error (HTTP 429). The new cluster-api-provider-aws v2.10.0 introduces the v1beta2 API version, and while CRD storage is reinitializing during the API version transition, these transient errors are expected.
    The fix adds time.Minute, RetryShort parameters to both Eventually calls, matching the pattern used by other functions in the same file (like GetAWSMachineTemplateByName at line 31). This gives the API server up to 1 minute to complete storage initialization, with 1-second retry intervals.

  • fix: e2e: increase wait for replicas timeouts
    Observations from the failing run:
    • Storage reinitialization errors continue for 15+ minutes during the v1beta2 transition.
    • The cluster connection stabilized around 01:20:08 (~2 minutes in).
    • The caches did eventually populate, intermittently.
    The 30-minute timeout provides more buffer for:
    • storage reinitialization to complete,
    • the CAPI MachineSet controller to process the scale-up,
    • machine provisioning to complete.
    Note: This is a short-term workaround. The root cause is the prolonged storage instability during the v1beta2 API transition in the AWS provider PR.

Summary by CodeRabbit

  • Chores
    • Extended timeout and retry policies for template list operations to improve reliability during listing.
    • Increased wait duration for replica verification to ensure stability during migration processes.


damdo added 2 commits December 6, 2025 10:17
@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode


coderabbitai bot commented Dec 6, 2025

Walkthrough

These changes adjust timeout and retry policies in e2e test helpers for AWS machine template operations and MachineSet replica verification. The modifications add explicit timeout/retry parameters to list operations and extend wait durations during verification, without altering function signatures or logic flow.

Changes

Cohort / File(s) Summary
E2E Test Helper Timeout/Retry Adjustments
e2e/framework/machinetemplate.go, e2e/machineset_migration_helpers.go
Added explicit timeout (time.Minute) and retry policy (RetryShort) to AWSMachineTemplate list operations in GetAWSMachineTemplateByPrefix and DeleteAWSMachineTemplateByPrefix. Changed MachineSet replica verification wait duration from WaitLong to WaitOverLong for both MAPI and CAPI MachineSets.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

  • Simple parameter additions to existing function calls (time.Minute, RetryShort)
  • Straightforward timing constant substitution (WaitLong → WaitOverLong)
  • No logic changes or new functionality introduced

Poem

🐰 A hop through the timeouts, so patient and slow,
Retries and waits now with steadier flow,
The templates list with more time to spare,
MachineSet dancers twirl in the air!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Title check ✅ Passed The title clearly relates to the main changes: increasing timeouts for e2e tests to handle storage initialization and replica verification delays.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
e2e/machineset_migration_helpers.go (1)

176-176: Extended timeout addresses storage reinitialization delays.

The increase from WaitLong to WaitOverLong is appropriate given the documented storage reinitialization issues during the v1beta2 AWS provider transition. The 30-minute timeout provides adequate buffer for storage initialization (15+ minutes), controller processing, and machine provisioning.

Consider adding an inline comment explaining this is a temporary workaround for the v1beta2 transition storage instability, to help future maintainers understand why such an extended timeout is necessary.

Also applies to: 181-181

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 46fc1aa and 88ea197.

📒 Files selected for processing (2)
  • e2e/framework/machinetemplate.go (2 hunks)
  • e2e/machineset_migration_helpers.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
e2e/framework/machinetemplate.go (1)
e2e/framework/framework.go (1)
  • RetryShort (17-17)
🔇 Additional comments (2)
e2e/framework/machinetemplate.go (2)

59-59: LGTM! Explicit timeout handles storage initialization delays.

Adding time.Minute and RetryShort parameters addresses the HTTP 429 errors during storage reinitialization. This pattern is consistent with other operations in the file (lines 31, 45, 90) and replaces the insufficient 1-second default timeout.


84-84: LGTM! Consistent timeout pattern for list operations.

The explicit timeout parameters mirror the fix at line 59 and ensure the list operation in the deletion path can also tolerate storage reinitialization delays. The consistent application of time.Minute and RetryShort across both functions is appropriate.


damdo commented Dec 6, 2025

/label acknowledge-critical-fixes-only

This increases e2e timeouts to make them more reliable.

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label Dec 6, 2025

damdo commented Dec 6, 2025

/approve

/verified bypass

No need to verify at this point yet, but I'm interested in running this against openshift/cluster-api-provider-aws#582

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Dec 6, 2025
@openshift-ci-robot

@damdo: The verified label has been added.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


openshift-ci bot commented Dec 6, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: damdo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 6, 2025

openshift-ci bot commented Dec 6, 2025

@damdo: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@damdo damdo changed the title fix: e2e increase storage init / verify replicas timeouts NO-JIRA: fix: e2e: increase timeouts for storage init / verify replicas Dec 6, 2025
@openshift-ci-robot

@damdo: This pull request explicitly references no jira issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Dec 6, 2025
@damdo damdo merged commit a52cfc1 into openshift:main Dec 6, 2025
8 of 25 checks passed
