
Handle transient model-capacity failures with retry/fallback #15

Merged: vkehfdl1 merged 2 commits into main from Feature/#14 on Apr 3, 2026



@vkehfdl1 (Contributor) commented Apr 3, 2026

Summary

  • Detect known model-capacity errors ("Selected model is at capacity", "model is currently overloaded") in OMX stderr
  • Automatically retry with bounded backoff (delays of 1m → 3m → 10m, max 4 attempts)
  • Skip the retry when a GitHub side effect has already been posted, to avoid duplicate comments/PRs
  • Track retry_attempts, retry_history, and retry_exhausted in job metadata for observability
  • Add a new retrying job status, visible via dani show-state
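
The behavior summarized above could be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `RunResult`, `JobMetadata`, `is_capacity_error`, `run_with_retry`, the `side_effect_posted` flag, the "succeeded"/"failed" status names, and the case-insensitive matching are all hypothetical. Only the two error strings, the 1m/3m/10m delays, the 4-attempt cap, the retrying status, and the three metadata field names come from the description.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field
from typing import Callable

# Error strings quoted in the PR summary. Lower-cased substring matching
# is an assumption of this sketch.
CAPACITY_PATTERNS = (
    "selected model is at capacity",
    "model is currently overloaded",
)

# Delays before retries 1, 2 and 3. Together with the initial run this
# gives at most 4 attempts.
BACKOFF_SECONDS = (60, 180, 600)


@dataclass
class RunResult:
    """Hypothetical outcome of one OMX invocation."""
    returncode: int
    stderr: str = ""
    side_effect_posted: bool = False  # e.g. a comment or PR already created


@dataclass
class JobMetadata:
    """Hypothetical job record; the retry_* field names are from the PR."""
    status: str = "running"
    retry_attempts: int = 0
    retry_history: list[dict] = field(default_factory=list)
    retry_exhausted: bool = False


def is_capacity_error(stderr: str) -> bool:
    """Return True if stderr contains one of the known capacity messages."""
    lowered = stderr.lower()
    return any(pattern in lowered for pattern in CAPACITY_PATTERNS)


def run_with_retry(
    run: Callable[[], RunResult],
    meta: JobMetadata,
    sleep: Callable[[float], None] = time.sleep,
) -> RunResult:
    """Run the job, retrying only on capacity errors with bounded backoff."""
    # The trailing None marks the final attempt, after which no retry follows.
    for delay in (*BACKOFF_SECONDS, None):
        result = run()
        if result.returncode == 0:
            meta.status = "succeeded"
            return result
        # Do not retry unrelated failures, and do not retry once a GitHub
        # side effect exists, since a rerun could post a duplicate.
        if not is_capacity_error(result.stderr) or result.side_effect_posted:
            meta.status = "failed"
            return result
        if delay is None:
            meta.retry_exhausted = True
            meta.status = "failed"
            return result
        meta.retry_attempts += 1
        meta.retry_history.append(
            {
                "attempt": meta.retry_attempts,
                "delay_seconds": delay,
                "error": result.stderr.strip(),
            }
        )
        meta.status = "retrying"
        sleep(delay)
    raise AssertionError("unreachable: the final iteration always returns")
```

Passing `sleep` as a parameter lets a test substitute a recorder for `time.sleep`, so the backoff schedule can be checked without waiting for the real delays.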

Test plan

  • 69 tests passing (11 new retry-specific tests)
  • ruff lint passing
  • Verify against a real capacity error the next time one occurs

Closes #14

🤖 Generated with Claude Code

vkehfdl1 and others added 2 commits April 3, 2026 14:49
Detect known capacity errors (e.g. "Selected model is at capacity") in
OMX stderr and automatically retry with backoff (1m, 3m, 10m). Skip
retry when the GitHub side effect was already posted to avoid duplicates.
Track retry history in job metadata for observability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@vkehfdl1 merged commit 89496ff into main on Apr 3, 2026
6 checks passed
