
Conversation

@damdo damdo commented Dec 6, 2025

  • fix: e2e: storage init timeouts
    The issue was that GetAWSMachineTemplateByPrefix and DeleteAWSMachineTemplateByPrefix were calling Eventually() without timeout and retry parameters:
    Eventually(komega.List(templateList, client.InNamespace(namespace))).Should(Succeed(), ...)
    This uses Gomega's default timeout (1 second), which is far too short when the API server returns a transient "storage is (re)initializing" error (HTTP 429). The new cluster-api-provider-aws v2.10.0 introduces the v1beta2 API version, and while CRD storage is reinitializing during the API version transition, these transient errors are expected.
    The fix adds time.Minute, RetryShort parameters to both Eventually calls, matching the pattern used by other functions in the same file (like GetAWSMachineTemplateByName at line 31). This gives the API server up to 1 minute to complete storage initialization, with 1-second retry intervals.

  • fix: e2e: increase wait for replicas timeouts
    Observations from the failing run:
    • Storage reinitialization errors continue for 15+ minutes during the v1beta2 transition.
    • The cluster connection stabilized around 01:20:08 (~2 minutes in).
    • The caches did eventually populate, intermittently.
    The 30-minute timeout provides more buffer for:
    • storage reinitialization to complete,
    • the CAPI MachineSet controller to process the scale-up,
    • machine provisioning to complete.
    Note: This is a short-term workaround. The root cause is the prolonged storage instability during the v1beta2 API transition in the AWS provider PR.

Summary by CodeRabbit

  • Chores
    • Extended timeout and retry policies for template list operations to improve reliability during listing.
    • Increased wait duration for replica verification to ensure stability during migration processes.


damdo added 2 commits December 6, 2025 10:17
@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode


coderabbitai bot commented Dec 6, 2025

Walkthrough

These changes adjust timeout and retry policies in e2e test helpers for AWS machine template operations and MachineSet replica verification. The modifications add explicit timeout/retry parameters to list operations and extend wait durations during verification, without altering function signatures or logic flow.

Changes

Cohort / File(s) Summary
E2E Test Helper Timeout/Retry Adjustments
e2e/framework/machinetemplate.go, e2e/machineset_migration_helpers.go
Added explicit timeout (time.Minute) and retry policy (RetryShort) to AWSMachineTemplate list operations in GetAWSMachineTemplateByPrefix and DeleteAWSMachineTemplateByPrefix. Changed MachineSet replica verification wait duration from WaitLong to WaitOverLong for both MAPI and CAPI MachineSets.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

  • Simple parameter additions to existing function calls (time.Minute, RetryShort)
  • Straightforward timing constant substitution (WaitLong → WaitOverLong)
  • No logic changes or new functionality introduced

Poem

🐰 A hop through the timeouts, so patient and slow,
Retries and waits now with steadier flow,
The templates list with more time to spare,
MachineSet dancers twirl in the air!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Title check ✅ Passed The title clearly relates to the main changes: increasing timeouts for e2e tests to handle storage initialization and replica verification delays.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
e2e/machineset_migration_helpers.go (1)

176-176: Extended timeout addresses storage reinitialization delays.

The increase from WaitLong to WaitOverLong is appropriate given the documented storage reinitialization issues during the v1beta2 AWS provider transition. The 30-minute timeout provides adequate buffer for storage initialization (15+ minutes), controller processing, and machine provisioning.

Consider adding an inline comment explaining this is a temporary workaround for the v1beta2 transition storage instability, to help future maintainers understand why such an extended timeout is necessary.

Also applies to: 181-181

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 46fc1aa and 88ea197.

📒 Files selected for processing (2)
  • e2e/framework/machinetemplate.go (2 hunks)
  • e2e/machineset_migration_helpers.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
e2e/framework/machinetemplate.go (1)
e2e/framework/framework.go (1)
  • RetryShort (17-17)
🔇 Additional comments (2)
e2e/framework/machinetemplate.go (2)

59-59: LGTM! Explicit timeout handles storage initialization delays.

Adding time.Minute and RetryShort parameters addresses the HTTP 429 errors during storage reinitialization. This pattern is consistent with other operations in the file (lines 31, 45, 90) and replaces the insufficient 1-second default timeout.


84-84: LGTM! Consistent timeout pattern for list operations.

The explicit timeout parameters mirror the fix at line 59 and ensure the list operation in the deletion path can also tolerate storage reinitialization delays. The consistent application of time.Minute and RetryShort across both functions is appropriate.


damdo commented Dec 6, 2025

/label acknowledge-critical-fixes-only

This increases e2e timeouts to make them more reliable.

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label Dec 6, 2025

damdo commented Dec 6, 2025

/approve

/verified bypass

No need to verify at this point yet, but I'm interested in running this against openshift/cluster-api-provider-aws#582

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Dec 6, 2025
@openshift-ci-robot

@damdo: The verified label has been added.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


openshift-ci bot commented Dec 6, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: damdo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 6, 2025

openshift-ci bot commented Dec 6, 2025

@damdo: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@damdo damdo changed the title fix: e2e increase storage init / verify replicas timeouts NO-JIRA: fix: e2e: increase timeouts for storage init / verify replicas Dec 6, 2025
@openshift-ci-robot

@damdo: This pull request explicitly references no jira issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Dec 6, 2025
@damdo damdo merged commit a52cfc1 into openshift:main Dec 6, 2025
8 of 25 checks passed
