
Add retry logic for API server transient errors and apply across TaskRun reconciler #611

Open
jangel97 wants to merge 1 commit into konflux-ci:main from jangel97:fix/mpc-api-timeout-retry


@jangel97 commented Oct 28, 2025

Add retry logic to make MPC resilient to transient API server failures

MPC failed to allocate hosts when the API server returned "etcdserver: request timed out", which caused user builds to fail.

This change:

  • Introduces retry.go with RetryOnTransientAPIError, ListWithRetry, GetWithRetry, and UpdateWithRetry helpers using exponential backoff.
  • Extends UpdateTaskRunWithRetry to handle both conflicts and transient errors, capped at 30s total retry duration.
  • Replaces direct client.List/Get/Update calls in dynamicpool, hostpool, local, and taskrun with retry-enabled wrappers.
  • Adds unit tests in retry_test.go for transient error detection, backoff behaviour, context cancellation, and helper functions.

…Run reconciler

Signed-off-by: Jose Angel Morena <jmorenas@redhat.com>
@mshaposhnik reopened this Feb 12, 2026

snyk-io bot commented Feb 12, 2026

Snyk checks have passed. No issues have been found so far.

| Scanner | Critical | High | Medium | Low | Total (0) |
| --- | --- | --- | --- | --- | --- |
| Open Source Security | 0 | 0 | 0 | 0 | 0 issues |
| Licenses | 0 | 0 | 0 | 0 | 0 issues |
| Code Security | 0 | 0 | 0 | 0 | 0 issues |

