
feat: add model fallback chain with error classification and cooldown #92

Closed · Leeaandrob wants to merge 1 commit into sipeed:main from Leeaandrob:feat/model-fallback-chain

Conversation

@Leeaandrob (Collaborator)

Summary

  • Implement a 2-layer model fallback system (text + image/multimodal) that automatically retries failed LLM requests across multiple providers/models
  • Error classification engine with ~40 patterns matching 7 failure categories (auth, rate_limit, billing, timeout, format, overloaded, unknown); a sketch of this classification follows this list
  • Per-provider cooldown tracker with standard exponential backoff (1m→5m→25m→1h) and billing-specific backoff (5h→10h→20h→24h)
  • 24h failure window reset: error counts reset to 0 after 24h of no failures
  • Model reference parsing supporting provider/model format with provider normalization
  • Candidate deduplication prevents trying the same provider/model twice
  • Fully backward compatible: without model_fallbacks configured, behavior is unchanged
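
A minimal sketch of what this substring-based classification could look like; the pattern set shown here and the classifyError / Reason* names are illustrative stand-ins for the PR's real ~40-pattern table in error_classifier.go:

package fallback

import (
	"context"
	"errors"
	"strings"
)

// FailoverReason is an illustrative stand-in; the PR defines the real enum in types.go.
type FailoverReason string

const (
	ReasonAuth       FailoverReason = "auth"
	ReasonRateLimit  FailoverReason = "rate_limit"
	ReasonBilling    FailoverReason = "billing"
	ReasonTimeout    FailoverReason = "timeout"
	ReasonFormat     FailoverReason = "format"
	ReasonOverloaded FailoverReason = "overloaded"
	ReasonUnknown    FailoverReason = "unknown"
)

// classifyError maps an error onto a failure category by substring matching.
// Only a handful of example patterns are shown; the real classifier has ~40.
func classifyError(err error) FailoverReason {
	if err == nil {
		return ReasonUnknown
	}
	msg := strings.ToLower(err.Error())
	switch {
	case errors.Is(err, context.DeadlineExceeded), strings.Contains(msg, "timeout"):
		return ReasonTimeout
	case strings.Contains(msg, "invalid api key"), strings.Contains(msg, "unauthorized"):
		return ReasonAuth
	case strings.Contains(msg, "rate limit"), strings.Contains(msg, "429"):
		return ReasonRateLimit
	case strings.Contains(msg, "insufficient credit"), strings.Contains(msg, "quota exceeded"):
		return ReasonBilling
	case strings.Contains(msg, "overloaded"), strings.Contains(msg, "503"):
		return ReasonOverloaded
	case strings.Contains(msg, "invalid image"), strings.Contains(msg, "unsupported format"):
		return ReasonFormat
	default:
		return ReasonUnknown
	}
}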

New Files (8)

File                      Lines  Purpose
model_ref.go              64     Parse provider/model references with normalization
error_classifier.go       253    Classify ~40 error patterns into FailoverReason
cooldown.go               207    Per-provider cooldown with standard + billing formulas
fallback.go               283    FallbackChain orchestrator: Execute() + ExecuteImage()
model_ref_test.go         125    10 test cases
error_classifier_test.go  337    20 test cases
cooldown_test.go          269    13 test cases
fallback_test.go          473    21 test cases
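
For reference, a rough sketch of the parsing model_ref.go is described as doing; the ModelRef shape, parseModelRef, and key() are assumptions used for illustration, not the PR's exact code:

package fallback

import "strings"

// ModelRef is an illustrative parsed "provider/model" reference.
type ModelRef struct {
	Provider string
	Model    string
}

// parseModelRef splits "provider/model" and lower-cases the provider.
// A bare "model" (no slash) keeps an empty Provider so a default can be applied.
func parseModelRef(s string) ModelRef {
	provider, model, found := strings.Cut(strings.TrimSpace(s), "/")
	if !found {
		return ModelRef{Model: provider} // no slash: Cut left the whole string in "provider"
	}
	return ModelRef{
		Provider: strings.ToLower(strings.TrimSpace(provider)),
		Model:    strings.TrimSpace(model),
	}
}

// key gives a dedup key so the same provider/model pair is only tried once.
func (r ModelRef) key() string {
	return r.Provider + "/" + r.Model
}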

Modified Files (3)

  • types.go: Add FailoverError, FailoverReason enum, ModelConfig
  • config.go: Add model_fallbacks, image_model, image_model_fallbacks + helper methods
  • loop.go: Integrate FallbackChain into runLLMIteration() when candidates > 1 (a sketch of the candidate loop follows this list)
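
A hedged sketch of the kind of candidate loop Execute() might run. The callLLM callback, the function-field stand-ins for the cooldown tracker, and this exact FallbackChain shape are assumptions (the real struct lives in fallback.go); it reuses ModelRef, classifyError, and the Reason* constants from the sketches above:

package fallback

import (
	"context"
	"errors"
	"fmt"
)

// FallbackChain here is a stand-in for the real struct in fallback.go: only the
// pieces the candidate loop needs are represented, as function fields.
type FallbackChain struct {
	coolingDown   func(provider string) bool                   // stands in for the cooldown tracker query
	recordFailure func(provider string, reason FailoverReason) // stands in for the tracker's bookkeeping
}

// Execute tries candidates in order, deduplicating provider/model pairs, skipping
// providers in cooldown, and falling through only on retriable failures.
func (c *FallbackChain) Execute(
	ctx context.Context,
	candidates []ModelRef,
	callLLM func(ctx context.Context, ref ModelRef) (string, error),
) (string, error) {
	seen := map[string]bool{}
	var lastErr error
	for _, ref := range candidates {
		if seen[ref.key()] {
			continue // never try the same provider/model twice
		}
		seen[ref.key()] = true
		if c.coolingDown(ref.Provider) {
			continue // provider is inside a cooldown window
		}
		out, err := callLLM(ctx, ref)
		if err == nil {
			return out, nil
		}
		if errors.Is(err, context.Canceled) {
			return "", err // user abort: never triggers fallback
		}
		reason := classifyError(err)
		c.recordFailure(ref.Provider, reason)
		if reason == ReasonFormat {
			return "", err // non-retriable: abort immediately
		}
		lastErr = err
	}
	return "", fmt.Errorf("all fallback candidates failed: %w", lastErr)
}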

Config Example

{
  "agents": {
    "defaults": {
      "model": "gpt-4",
      "model_fallbacks": ["anthropic/claude-opus", "groq/llama-3"],
      "image_model": "openai/gpt-4o",
      "image_model_fallbacks": ["anthropic/claude-sonnet"]
    }
  }
}

Cooldown Behavior

Type      Errors           Cooldown
Standard  1 / 2 / 3 / 4+   1m / 5m / 25m / 1h
Billing   1 / 2 / 3 / 4+   5h / 10h / 20h / 24h
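
A minimal sketch of a backoff formula consistent with the table: multiply by 5 capped at 1h for standard errors, multiply by 2 capped at 24h for billing errors. The function name and cap handling are assumptions, not cooldown.go's actual code:

package fallback

import "time"

// cooldownDuration reproduces the table above: failure counts 1..4+ map to
// 1m/5m/25m/1h for standard errors and 5h/10h/20h/24h for billing errors.
// (The 24h failure-window reset would zero the failure count before this is called.)
func cooldownDuration(failures int, billing bool) time.Duration {
	if failures < 1 {
		return 0
	}
	base, factor, limit := time.Minute, time.Duration(5), time.Hour
	if billing {
		base, factor, limit = 5*time.Hour, 2, 24*time.Hour
	}
	d := base
	for i := 1; i < failures; i++ {
		d *= factor
		if d >= limit {
			return limit
		}
	}
	return d
}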

Test plan

  • go build ./... passes
  • go vet ./... passes
  • All 128 provider tests pass (64 new + 64 existing)
  • 95%+ coverage on new code
  • Backward compatible (no fallbacks = unchanged behavior)
  • context.Canceled never triggers fallback (user abort)
  • Non-retriable errors (format, image dimension/size) abort immediately
  • Concurrent access safe (sync.RWMutex in cooldown tracker; see the sketch below)
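
Sketched below, under the same caveats as the earlier blocks, is one way a sync.RWMutex-guarded tracker could expose cooldown state to the fallback loop; it reuses FailoverReason, ReasonBilling, and cooldownDuration from the sketches above, and all field and method names are assumptions:

package fallback

import (
	"sync"
	"time"
)

// CooldownTracker is one possible shape for cooldown.go's per-provider state.
// The sync.RWMutex lets the fallback loop read cooldown status concurrently
// while failures are being recorded.
type CooldownTracker struct {
	mu    sync.RWMutex
	state map[string]providerState
}

type providerState struct {
	failures    int
	lastFailure time.Time
	until       time.Time
}

func NewCooldownTracker() *CooldownTracker {
	return &CooldownTracker{state: make(map[string]providerState)}
}

// InCooldown reports whether a provider should be skipped right now.
func (t *CooldownTracker) InCooldown(provider string) bool {
	t.mu.RLock()
	defer t.mu.RUnlock()
	return time.Now().Before(t.state[provider].until)
}

// RecordFailure bumps the failure count (resetting it after 24h without failures)
// and schedules the next cooldown window using cooldownDuration from the sketch above.
func (t *CooldownTracker) RecordFailure(provider string, reason FailoverReason) {
	t.mu.Lock()
	defer t.mu.Unlock()
	s := t.state[provider]
	if !s.lastFailure.IsZero() && time.Since(s.lastFailure) > 24*time.Hour {
		s.failures = 0
	}
	s.failures++
	s.lastFailure = time.Now()
	s.until = time.Now().Add(cooldownDuration(s.failures, reason == ReasonBilling))
	t.state[provider] = s
}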

Implement a 2-layer model fallback system that automatically retries
failed LLM requests across multiple providers/models:

- model_ref.go: Parse "provider/model" references with normalization
- error_classifier.go: Classify ~40 error patterns into 7 categories
  (auth, rate_limit, billing, timeout, format, overloaded, unknown)
- cooldown.go: Per-provider cooldown with standard backoff (1m→5m→25m→1h)
  and billing-specific backoff (5h→10h→20h→24h), 24h failure window reset
- fallback.go: FallbackChain orchestrator with Execute (text) and
  ExecuteImage (multimodal), candidate deduplication, context cancellation
- types.go: FailoverError, FailoverReason enum, ModelConfig
- config.go: model_fallbacks, image_model, image_model_fallbacks fields
- loop.go: Integration with AgentLoop when fallback candidates configured

Config example:
  "model": "gpt-4",
  "model_fallbacks": ["anthropic/claude-opus", "groq/llama-3"],
  "image_model": "openai/gpt-4o",
  "image_model_fallbacks": ["anthropic/claude-sonnet"]

Backward compatible: without fallbacks configured, behavior is unchanged.
@Leeaandrob (Collaborator, Author)

Superseded by PR #131 which includes the model-fallback-chain as part of the multi-agent-routing feature branch.
