
feat: automatic retry and failover for rate-limited LLM requests #733

Open
raheelshahzad wants to merge 2 commits into katanemo:main from raheelshahzad:feat/retry-on-ratelimit

Conversation


@raheelshahzad raheelshahzad commented Feb 10, 2026

Summary

Adds a retry-on-ratelimit system to the Plano gateway that automatically retries failed LLM requests (429, 503, timeouts) across alternative providers with intelligent selection.

Structure (2 commits)

Commit 1 — Production code (~4k lines)
Core retry engine in crates/common/src/retry/:

  • orchestrator: retry loop with budget tracking
  • provider_selector: weighted selection excluding blocked providers
  • error_detector: classifies responses into retryable categories
  • backoff: exponential backoff with jitter + Retry-After support
  • retry_after_state: per-provider rate-limit cooldown tracking
  • latency_block_state: high-latency provider temporary exclusion
  • latency_trigger: consecutive slow-response counter
  • validation: config validation with cross-field checks
  • error_response: structured error responses when retries exhausted

Three phases: P0 (core retry + backoff), P1 (Retry-After + fallback models + timeout), P2 (proactive high-latency failover).
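As a rough illustration of the backoff behavior described above (a minimal sketch, not the actual implementation: `backoff_delay` is a hypothetical name, and jitter is omitted so the arithmetic stays checkable):

```rust
use std::time::Duration;

/// Exponential backoff: base * 2^attempt, capped at max.
/// A Retry-After hint from the server, when present, overrides the
/// computed delay but still respects the cap.
fn backoff_delay(
    base: Duration,
    max: Duration,
    attempt: u32,
    retry_after: Option<Duration>,
) -> Duration {
    if let Some(ra) = retry_after {
        return ra.min(max); // honor the server hint, within bounds
    }
    // Clamp the shift so the multiplier cannot overflow.
    let exp = base.saturating_mul(1u32 << attempt.min(16));
    exp.min(max)
}

fn main() {
    let base = Duration::from_millis(25);
    let max = Duration::from_millis(250);
    assert_eq!(backoff_delay(base, max, 0, None), Duration::from_millis(25));
    assert_eq!(backoff_delay(base, max, 2, None), Duration::from_millis(100));
    assert_eq!(backoff_delay(base, max, 5, None), Duration::from_millis(250)); // capped
    assert_eq!(
        backoff_delay(base, max, 0, Some(Duration::from_millis(80))),
        Duration::from_millis(80)
    );
    println!("ok");
}
```

A production version would add randomized jitter on top of the exponential term to avoid thundering-herd retries.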

Commit 2 — Tests (~10.9k lines)

  • 302 property-based unit tests (proptest, 100+ iterations each)
  • 13 integration test scenarios (IT-1 through IT-13)
  • Covers all retry behaviors: 429/503, exhaustion, backoff, fallback priority, Retry-After, timeout, high-latency failover, streaming, body preservation
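One kind of invariant such property-based backoff tests typically assert, sketched here with a plain loop rather than proptest (the `delay` helper is a hypothetical stand-in for the real backoff function):

```rust
use std::time::Duration;

// Hypothetical backoff formula: base * 2^attempt, capped at max.
fn delay(base: Duration, max: Duration, attempt: u32) -> Duration {
    base.saturating_mul(1u32 << attempt.min(16)).min(max)
}

fn main() {
    // Property: delays are non-decreasing in the attempt number and
    // never exceed the configured cap, for any attempt count.
    let base = Duration::from_millis(25);
    let max = Duration::from_millis(250);
    let mut prev = Duration::ZERO;
    for attempt in 0..100 {
        let d = delay(base, max, attempt);
        assert!(d >= prev, "backoff must be monotone");
        assert!(d <= max, "backoff must respect the cap");
        prev = d;
    }
    println!("property holds for 100 attempts");
}
```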

@adilhafeez adilhafeez left a comment

Thanks a lot for putting this change together @raheelshahzad. Please join our Discord channel too. Overall looks good!

I left some comments in the PR and have some additional suggestions/comments on the overall change:

  • we should do exponential backoff on retries
  • how do we ensure that we have not exceeded the request timeout?
  • max_retries should be defined somewhere in config.yaml (probably not in this PR, but we should let developers define that variable)
  • this code change needs an update to the docs
  • I think we should allow retrying the same provider, or at least let developers define whether they want to retry a different provider. Consider the following example:
model_providers:
  - model: openai/gpt-4o
    base_url: https://dsna-oai.openai.azure.com
    access_key: $OPENAI_API_KEY
    retry_on_ratelimit: true # new feature
    retry_to_same_provider: true # when true, retry only the same provider; otherwise retry randomly across all models

  - model: openai/gpt-5
    base_url: https://dsna-oai.openai.azure.com
    access_key: $OPENAI_API_KEY

Comment on lines +95 to +104
self.providers.iter().find_map(|(key, provider)| {
if provider.internal != Some(true)
&& provider.name != current_name
&& key == &provider.name
{
Some(Arc::clone(provider))
} else {
None
}
})
should pick random model
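A minimal sketch of what random selection could look like here, assuming a hypothetical `Provider` type and using the clock as a cheap stand-in for a proper RNG (a real implementation would use the `rand` crate):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Hypothetical, simplified provider record for illustration.
#[derive(Debug, Clone)]
struct Provider {
    name: String,
    internal: bool,
}

/// Pick a random eligible alternative instead of the first match.
fn pick_alternative<'a>(providers: &'a [Provider], current: &str) -> Option<&'a Provider> {
    let eligible: Vec<&Provider> = providers
        .iter()
        .filter(|p| !p.internal && p.name != current)
        .collect();
    if eligible.is_empty() {
        return None;
    }
    // Cheap stand-in for a proper RNG: nanosecond clock modulo candidate count.
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .subsec_nanos();
    Some(eligible[nanos as usize % eligible.len()])
}

fn main() {
    let providers = vec![
        Provider { name: "openai/gpt-4o".into(), internal: false },
        Provider { name: "anthropic/claude-sonnet-4-0".into(), internal: false },
        Provider { name: "internal/router".into(), internal: true },
    ];
    let alt = pick_alternative(&providers, "openai/gpt-4o").unwrap();
    assert_ne!(alt.name, "openai/gpt-4o"); // never the failing provider
    assert_ne!(alt.name, "internal/router"); // never internal providers
    println!("picked {}", alt.name);
}
```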

Comment on lines 403 to 419
if res.status() == StatusCode::TOO_MANY_REQUESTS && attempts < max_attempts {
let providers = llm_providers.read().await;
if let Some(provider) = providers.get(&current_resolved_model) {
if provider.retry_on_ratelimit == Some(true) {
if let Some(alt_provider) = providers.get_alternative(&current_resolved_model) {
info!(
request_id = %request_id,
current_model = %current_resolved_model,
alt_model = %alt_provider.name,
"429 received, retrying with alternative model"
);
current_resolved_model = alt_provider.name.clone();
continue;
}
}
}
}

we need to add exponential backoff

let mut current_resolved_model = resolved_model.clone();
let mut current_client_request = client_request;
let mut attempts = 0;
let max_attempts = 2; // Original + 1 retry

this should be configurable

);
// Capture start time right before sending request to upstream
let request_start_time = std::time::Instant::now();
let _request_start_system_time = std::time::SystemTime::now();

dead code?


adilhafeez commented Feb 10, 2026

I looked through the Envoy retry semantics: https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/route/v3/route_components.proto#envoy-v3-api-field-config-route-v3-routeaction-retry-policy

I think we should lean toward this design for retries. We don't have to implement it completely, but we should implement a bare minimum that follows similar semantics and config. Thoughts?

@raheelshahzad raheelshahzad force-pushed the feat/retry-on-ratelimit branch from d1aa3ac to ca903d2 on February 12, 2026 04:08
@raheelshahzad raheelshahzad left a comment

Addressed in the latest push:

  1. Exponential backoff with configurable base and max intervals.
  2. Configurable max_retries.
  3. retry_to_same_provider option.
  4. Random alternative selection when failing over to a different model.
  5. Documentation updates in the reference configuration.
  6. Comprehensive unit tests for all the above.


adilhafeez commented Feb 12, 2026

Thanks a lot Raheel for continuing to make plano better. We are getting there.

This may be a slightly better way to specify retries:

  model_providers:
    - model: openai/gpt-4o
      access_key: $OPENAI_API_KEY
      default: true
      retry_policy:
        num_retries: 2
        # retry_on: [429]             # default
        # back_off:
        #   base_interval: 25ms       # default
        #   max_interval: 250ms       # default (10x base)
        # failover:
        #   strategy: same_provider   # default

    # Need more control
    - model: anthropic/claude-sonnet-4-0
      access_key: $ANTHROPIC_API_KEY
      retry_policy:
        num_retries: 3
        failover:
          strategy: any

    # Full control
    - model: openai/gpt-4o-mini
      access_key: $OPENAI_API_KEY
      retry_policy:
        num_retries: 2
        retry_on: [429, 503]
        back_off:
          base_interval: 100ms
          max_interval: 2000ms
        failover:
          providers:
            - anthropic/claude-sonnet-4-0

    # No retries (default, just omit retry_policy)
    - model: mistral/ministral-3b-latest
      access_key: $MISTRAL_API_KEY
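A sketch of how this proposed `retry_policy` block might map onto Rust config types with the commented defaults (field names mirror the YAML; the types and the `num_retries` default of 1 are assumptions, since the example does not state them):

```rust
use std::time::Duration;

// Hypothetical config types mirroring the proposed YAML schema.
#[derive(Debug, Clone)]
struct BackOff {
    base_interval: Duration,
    max_interval: Duration,
}

#[derive(Debug, Clone)]
struct RetryPolicy {
    num_retries: u32,
    retry_on: Vec<u16>,
    back_off: BackOff,
}

impl Default for RetryPolicy {
    fn default() -> Self {
        let base = Duration::from_millis(25); // default base_interval
        RetryPolicy {
            num_retries: 1, // assumed default: original request + 1 retry
            retry_on: vec![429],
            back_off: BackOff {
                base_interval: base,
                max_interval: base * 10, // default: 10x base = 250ms
            },
        }
    }
}

fn main() {
    let p = RetryPolicy::default();
    assert_eq!(p.retry_on, vec![429]);
    assert_eq!(p.back_off.max_interval, Duration::from_millis(250));
    println!("{:?}", p);
}
```

With serde, each field could carry `#[serde(default)]` so omitting `retry_policy` entirely disables retries, matching the last example above.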


salmanap commented Mar 3, 2026


I like this developer experience, and would love to see an updated PR about it. This would help with free-tier GPU traffic shaping and a very useful feature for coding agents.

@raheelshahzad raheelshahzad force-pushed the feat/retry-on-ratelimit branch from ca903d2 to 1384982 on March 9, 2026 00:43
@raheelshahzad raheelshahzad changed the title from "feat: add support for retrying LLM requests on 429 ratelimits (#697)" to "feat: automatic retry and failover for rate-limited LLM requests" on Mar 9, 2026
@raheelshahzad raheelshahzad force-pushed the feat/retry-on-ratelimit branch from 1384982 to d569d4f on March 9, 2026 00:45
Implement a retry-on-ratelimit system for the Plano gateway that
automatically retries failed LLM requests (429, 503, timeouts) across
alternative providers with intelligent provider selection.

Core modules (crates/common/src/retry/):
- orchestrator: retry loop with budget tracking and attempt management
- provider_selector: weighted selection excluding blocked providers
- error_detector: classifies responses into retryable error categories
- backoff: exponential backoff with jitter and Retry-After support
- retry_after_state: per-provider rate-limit cooldown tracking
- latency_block_state: high-latency provider temporary exclusion
- latency_trigger: consecutive slow-response counter
- validation: configuration validation with cross-field checks
- error_response: structured error responses when retries exhausted

Three phases: P0 (core retry + backoff), P1 (Retry-After + fallback
models + timeout), P2 (proactive high-latency failover).

Tests follow in a separate PR.

Add 302 property-based unit tests (proptest, 100+ iterations each) and
13 integration test scenarios covering all retry behaviors.

Unit tests cover:
- Configuration round-trip parsing, defaults, and validation
- Status code range expansion and error classification
- Exponential backoff formula, bounds, and scope filtering
- Provider selection strategy correctness and fallback ordering
- Retry-After state scope behavior and max expiration updates
- Cooldown exclusion invariants and initial selection cooldown
- Bounded retry (max_attempts + budget enforcement)
- Request preservation across retries
- Latency trigger sliding window and block state management
- Timeout vs high-latency precedence
- Error response detail completeness

Integration tests (tests/e2e/):
- IT-1 through IT-13 covering 429/503 retry, exhaustion, backoff,
  fallback priority, Retry-After honoring, timeout retry, high-latency
  failover, streaming preservation, and body preservation
@raheelshahzad raheelshahzad force-pushed the feat/retry-on-ratelimit branch from d569d4f to 98bf024 on March 9, 2026 01:45