
# Dynamic Model Routing

*Introduced in v2.19.0. Capability scoring introduced in v2.52.0.*

Dynamic model routing automatically selects cheaper models for simple work and reserves expensive models for complex tasks. This reduces token consumption by 20-50% on capped plans without sacrificing quality where it matters.

Starting in v2.52.0, the router uses capability-aware scoring to select the best-fit model for each task, not just the cheapest one in the tier.

## How It Works

Each unit dispatched by auto-mode passes through a two-stage pipeline:

1. **Complexity classification** — classifies the work into a tier (light/standard/heavy).
2. **Capability scoring** — within the eligible tier, ranks available models by how well their capabilities match the task's requirements.

The key rule is **downgrade-only semantics**: the user's configured model is always the ceiling — routing never upgrades beyond what you've configured.

| Tier | Typical Work | Default Model Level |
| --- | --- | --- |
| Light | Slice completion, UAT, hooks | Haiku-class |
| Standard | Research, planning, execution, milestone completion | Sonnet-class |
| Heavy | Replanning, roadmap reassessment, complex execution | Opus-class |
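
To make the pipeline concrete, here is a minimal sketch of the two-stage flow with the downgrade-only clamp. The function and parameter names are illustrative, not the actual implementation:

```ts
type Tier = "light" | "standard" | "heavy";
const ORDER: Tier[] = ["light", "standard", "heavy"];

// Stage 1 stub: the real router uses the heuristics described under
// "Complexity Classification" below.
function classifyComplexity(unitType: string): Tier {
  if (unitType === "complete-slice" || unitType === "run-uat") return "light";
  if (unitType === "replan-slice" || unitType === "reassess-roadmap") return "heavy";
  return "standard";
}

// Hypothetical routing entry point showing the downgrade-only clamp.
function routeUnit(
  unitType: string,
  configuredCeiling: Tier,              // tier of the user's configured model
  modelsForTier: (tier: Tier) => string[],
  score: (modelId: string) => number,   // stage 2 capability score
): string {
  // Downgrade-only: the classified tier never exceeds the configured ceiling.
  const tier = ORDER[Math.min(
    ORDER.indexOf(classifyComplexity(unitType)),
    ORDER.indexOf(configuredCeiling),
  )];
  // Stage 2: rank the tier's eligible models by capability score.
  return [...modelsForTier(tier)].sort((a, b) => score(b) - score(a))[0];
}
```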

## Enabling

Dynamic routing is off by default. Enable it in preferences:

```yaml
---
version: 1
dynamic_routing:
  enabled: true
---
```

## Configuration

```yaml
dynamic_routing:
  enabled: true
  tier_models:                    # explicit model per tier (optional)
    light: claude-haiku-4-5
    standard: claude-sonnet-4-6
    heavy: claude-opus-4-6
  escalate_on_failure: true       # bump tier on task failure (default: true)
  budget_pressure: true           # auto-downgrade when approaching budget ceiling (default: true)
  cross_provider: true            # consider models from other providers (default: true)
  hooks: true                     # apply routing to post-unit hooks (default: true)
  capability_routing: true        # enable capability scoring within tier (default: true)
```

### tier_models

Override which model is used for each tier. When omitted, the router uses a built-in capability mapping that knows common model families:

- Light: claude-haiku-4-5, gpt-4o-mini, gemini-2.0-flash
- Standard: claude-sonnet-4-6, gpt-4o, gemini-2.5-pro
- Heavy: claude-opus-4-6, gpt-4.5-preview, gemini-2.5-pro

### escalate_on_failure

When a task fails at a given tier, the router escalates to the next tier on retry. Light → Standard → Heavy. This prevents cheap models from burning retries on work that needs more reasoning.
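
A sketch of the escalation step, assuming the retry loop tracks the tier of the failed attempt (helper name hypothetical):

```ts
type Tier = "light" | "standard" | "heavy";
const ORDER: Tier[] = ["light", "standard", "heavy"];

// On failure, retry one tier up; heavy has nowhere left to escalate.
function nextTierOnFailure(failedAt: Tier): Tier | undefined {
  const i = ORDER.indexOf(failedAt);
  return i < ORDER.length - 1 ? ORDER[i + 1] : undefined;
}

// nextTierOnFailure("light")    -> "standard"
// nextTierOnFailure("standard") -> "heavy"
// nextTierOnFailure("heavy")    -> undefined (surface the failure instead)
```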

### budget_pressure

When approaching the budget ceiling, the router progressively downgrades:

| Budget Used | Effect |
| --- | --- |
| < 50% | No adjustment |
| 50–75% | Standard → Light |
| 75–90% | More aggressive downgrading |
| > 90% | Nearly everything → Light; only Heavy stays at Standard |
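
Expressed as code, the schedule might look like the sketch below. The table only calls the 75–90% band "more aggressive downgrading", so that branch is an assumption:

```ts
type Tier = "light" | "standard" | "heavy";

// budgetUsed is the fraction of the budget ceiling consumed (0–1).
function applyBudgetPressure(tier: Tier, budgetUsed: number): Tier {
  if (budgetUsed < 0.5) return tier;                              // no adjustment
  if (budgetUsed < 0.75) return tier === "standard" ? "light" : tier;
  if (budgetUsed < 0.9) return tier === "heavy" ? tier : "light"; // assumption
  return tier === "heavy" ? "standard" : "light";                 // > 90%
}
```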

### cross_provider

When enabled, the router may select models from providers other than your primary. This uses the built-in cost table to find the cheapest model at each tier. Requires the target provider to be configured.

### capability_routing

When enabled (default: true), the router uses capability scoring to pick the best model in a tier rather than always defaulting to the cheapest. Set to false to revert to cheapest-in-tier behavior:

```yaml
dynamic_routing:
  enabled: true
  capability_routing: false   # disable scoring, use cheapest-in-tier
```

## Capability Profiles

Each model has a built-in capability profile — a set of scores (0–100) across seven dimensions representing how well it handles different task types:

| Dimension | What It Represents |
| --- | --- |
| `coding` | Code generation and implementation accuracy |
| `debugging` | Diagnosing and fixing errors |
| `research` | Synthesizing information and exploring topics |
| `reasoning` | Multi-step logical reasoning |
| `speed` | Latency and throughput (inverse of capability depth) |
| `longContext` | Handling large codebases and long documents |
| `instruction` | Following structured instructions precisely |

Built-in profiles exist for 9 models: claude-opus-4-6, claude-sonnet-4-6, claude-haiku-4-5, gpt-4o, gpt-4o-mini, gemini-2.5-pro, gemini-2.0-flash, deepseek-chat, o3.

Models without a built-in profile receive a uniform score of 50 across all dimensions. This is a cold-start policy — unknown models can compete but get no advantage. For those models, routing behaves the same as it did before capability scoring was introduced.

Profiles are heuristic rankings, not benchmarks: they represent approximate relative strengths rather than verified results. Use the user overrides described below to correct them for models you know well.

## How Scoring Works

The routing pipeline within a tier:

```text
classify complexity tier
    ↓
filter eligible models for tier
    ↓
fire before_model_select hook (optional override)
    ↓
capability score eligible models
    ↓
select winner (or first eligible if scoring is disabled)
```

The score is a weighted average of capability dimensions:

```text
score = Σ(weight × capability) / Σ(weights)
```

Task requirements are dynamic — different task types weight dimensions differently:

| Unit Type | Key Dimensions |
| --- | --- |
| `execute-task` | coding (0.9), instruction (0.7), speed (0.3) |
| `research-*` | research (0.9), longContext (0.7), reasoning (0.5) |
| `plan-*` | reasoning (0.9), coding (0.5) |
| `replan-slice` | reasoning (0.9), debugging (0.6), coding (0.5) |
| `complete-slice`, `run-uat` | instruction (0.8), speed (0.7) |
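
As a concrete sketch of the formula, here is a weighted-average scorer using the `execute-task` weights from the table; the profile values are invented for illustration:

```ts
type Scores = Record<string, number>;

// score = Σ(weight × capability) / Σ(weights); unknown dimensions fall
// back to the cold-start value of 50.
function capabilityScore(profile: Scores, weights: Scores): number {
  let num = 0;
  let den = 0;
  for (const [dim, w] of Object.entries(weights)) {
    num += w * (profile[dim] ?? 50);
    den += w;
  }
  return den > 0 ? num / den : 0;
}

const weights = { coding: 0.9, instruction: 0.7, speed: 0.3 }; // execute-task
const profile = { coding: 85, instruction: 80, speed: 60 };    // illustrative
// (0.9·85 + 0.7·80 + 0.3·60) / (0.9 + 0.7 + 0.3) = 150.5 / 1.9 ≈ 79.2
console.log(capabilityScore(profile, weights).toFixed(1));     // "79.2"
```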

For `execute-task`, requirements are further refined by task metadata signals:

- Tags like `docs`, `config`, `readme` → boost instruction weight
- Keywords like `concurrency`, `compatibility` → boost debugging and reasoning
- Keywords like `migration`, `architecture` → boost reasoning and coding
- Large file counts (≥ 6) or large estimated line counts (≥ 500) → boost coding and reasoning

**Tie-breaking:** When two models score within 2 points of each other, the cheaper model wins. If costs are equal, the lexicographically smaller model ID breaks the tie (deterministic).
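
A sketch of that tie-break, assuming each candidate carries its score and a cost figure from the cost table:

```ts
interface Candidate { id: string; score: number; cost: number }

function pickWinner(candidates: Candidate[]): string {
  const byScore = [...candidates].sort((a, b) => b.score - a.score);
  // Everything within 2 points of the top score counts as tied.
  const band = byScore.filter((c) => byScore[0].score - c.score <= 2);
  // Cheaper wins inside the band; equal cost falls back to model ID.
  band.sort((a, b) =>
    a.cost - b.cost || (a.id < b.id ? -1 : a.id > b.id ? 1 : 0),
  );
  return band[0].id;
}

pickWinner([
  { id: "claude-sonnet-4-6", score: 82.3, cost: 3.0 },
  { id: "gpt-4o", score: 81.1, cost: 2.5 }, // within 2 points, cheaper
]); // -> "gpt-4o"
```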

## User Overrides

Correct built-in capability profiles for models you know well using `modelOverrides` in your models configuration:

```json
{
  "providers": {
    "anthropic": {
      "modelOverrides": {
        "claude-sonnet-4-6": {
          "capabilities": {
            "debugging": 90,
            "research": 85
          }
        }
      }
    }
  }
}
```

Overrides are deep-merged with built-in defaults — only the specified dimensions are overridden; others retain their built-in values.
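
For instance, assuming a hypothetical built-in profile for claude-sonnet-4-6, the override above would produce:

```ts
// Built-in profile (values invented for illustration):
const builtIn = { coding: 88, debugging: 80, research: 78, reasoning: 86 };
// From the modelOverrides example above:
const override = { debugging: 90, research: 85 };
// Deep merge: specified dimensions win, the rest keep their built-in values.
const effective = { ...builtIn, ...override };
// -> { coding: 88, debugging: 90, research: 85, reasoning: 86 }
```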

Use case: You've found that a model consistently outperforms its built-in profile on specific task types. Override the relevant dimensions to steer the router toward that model for those tasks.

## Verbose Output

When verbose mode is active, the router logs each routing decision. If capability scoring was used, the log includes the full scoring breakdown:

```text
Dynamic routing [S]: claude-sonnet-4-6 (capability-scored) — claude-sonnet-4-6: 82.3, gpt-4o: 78.1, deepseek-chat: 72.0
```

When tier-only routing was used (scoring disabled, single eligible model, or routing guards applied):

```text
Dynamic routing [S]: claude-sonnet-4-6 (standard complexity, multiple steps)
```

The `selectionMethod` field in the routing decision indicates which path was taken:

  • "capability-scored" — capability scoring selected the winner
  • "tier-only" — cheapest in tier (or explicit pin) was used

## Extension Hook

Extensions can intercept and override model selection using the `before_model_select` hook.

The hook fires after tier filtering (eligible models are known) and before capability scoring (scores have not been computed yet). A hook can override selection entirely or return undefined to let scoring proceed normally.

Registering a handler:

pi.on("before_model_select", async (event) => {
  const { unitType, unitId, classification, taskMetadata, eligibleModels, phaseConfig } = event;

  // Custom routing strategy: always use gemini for research tasks
  if (unitType.startsWith("research-")) {
    const gemini = eligibleModels.find(id => id.includes("gemini"));
    if (gemini) return { modelId: gemini };
  }

  // Return undefined to let capability scoring proceed
  return undefined;
});

Event payload:

| Field | Type | Description |
| --- | --- | --- |
| `unitType` | `string` | The unit type being dispatched (e.g., `"execute-task"`) |
| `unitId` | `string` | Unique identifier for this unit dispatch |
| `classification` | `{ tier, reason, downgraded }` | The complexity classification result |
| `taskMetadata` | `Record<string, unknown> \| undefined` | Task metadata extracted from the unit plan |
| `eligibleModels` | `string[]` | Models eligible for the classified tier |
| `phaseConfig` | `{ primary, fallbacks } \| undefined` | The user's configured model for this phase |

**Return value:** `{ modelId: string }` to override selection, or `undefined` to defer to capability scoring.

**First-override-wins:** If multiple extensions register handlers, the first one to return a non-`undefined` result wins. Subsequent handlers are not called.
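
For reference, the payload and handler contract can be written out as the following types. This is a sketch inferred from the table above, not the extension API's published type names, and the field types inside `classification` are assumptions:

```ts
type Tier = "light" | "standard" | "heavy";

interface BeforeModelSelectEvent {
  unitType: string;                      // e.g. "execute-task"
  unitId: string;
  classification: { tier: Tier; reason: string; downgraded: boolean };
  taskMetadata?: Record<string, unknown>;
  eligibleModels: string[];
  phaseConfig?: { primary: string; fallbacks: string[] };
}

// Return { modelId } to override, or undefined to defer to scoring.
type BeforeModelSelectHandler = (
  event: BeforeModelSelectEvent,
) => Promise<{ modelId: string } | undefined>;
```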

## Complexity Classification

Units are classified using pure heuristics — no LLM calls, sub-millisecond latency:

### Unit Type Defaults

| Unit Type | Default Tier |
| --- | --- |
| `complete-slice`, `run-uat` | Light |
| `research-*`, `plan-*`, `complete-milestone` | Standard |
| `execute-task` | Standard (upgraded by task analysis) |
| `replan-slice`, `reassess-roadmap` | Heavy |
| `hook/*` | Light |

### Task Plan Analysis

For `execute-task` units, the classifier analyzes the task plan:

| Signal | Simple → Light | Complex → Heavy |
| --- | --- | --- |
| Step count | ≤ 3 | ≥ 8 |
| File count | ≤ 3 | ≥ 8 |
| Description length | < 500 chars | > 2000 chars |
| Code blocks | – | ≥ 5 |
| Complexity keywords | None | Present |

Complexity keywords: `research`, `investigate`, `refactor`, `migrate`, `integrate`, `complex`, `architect`, `redesign`, `security`, `performance`, `concurrent`, `parallel`, `distributed`, `backward compat`
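
Putting the signals together, a sketch of the `execute-task` analysis might read like this; the thresholds come from the table, while the arbitration between conflicting signals is an assumption:

```ts
type Tier = "light" | "standard" | "heavy";

const KEYWORDS = [
  "research", "investigate", "refactor", "migrate", "integrate", "complex",
  "architect", "redesign", "security", "performance", "concurrent",
  "parallel", "distributed", "backward compat",
];

interface TaskPlan {
  steps: unknown[];
  files: string[];
  description: string;
  codeBlocks: number;
}

function classifyTaskPlan(plan: TaskPlan): Tier {
  const text = plan.description.toLowerCase();
  const hasKeyword = KEYWORDS.some((k) => text.includes(k));
  // Any heavy signal upgrades the task.
  if (
    plan.steps.length >= 8 ||
    plan.files.length >= 8 ||
    plan.description.length > 2000 ||
    plan.codeBlocks >= 5 ||
    hasKeyword
  ) {
    return "heavy";
  }
  // All simple signals must hold to downgrade.
  if (
    plan.steps.length <= 3 &&
    plan.files.length <= 3 &&
    plan.description.length < 500
  ) {
    return "light";
  }
  return "standard";
}
```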

### Adaptive Learning

The routing history (`.gsd/routing-history.json`) tracks success/failure per tier per unit type. If a tier's failure rate exceeds 20% for a given pattern, future classifications are bumped up a tier. User feedback (over/under/ok) is weighted 2× relative to automatic outcomes.
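
A sketch of the bump rule, assuming per-pattern counters and the 2× feedback weighting; how over/under/ok feedback maps onto good and bad outcomes is an assumption:

```ts
interface PatternStats {
  successes: number;      // automatic outcomes, weight 1
  failures: number;       // automatic outcomes, weight 1
  feedbackUnder: number;  // user marked the tier under-powered, weight 2
  feedbackOk: number;     // user confirmed the tier, weight 2
}

// Bump future classifications up a tier when the weighted failure rate
// for this (unit type, tier) pattern exceeds 20%.
function shouldBumpTier(s: PatternStats): boolean {
  const bad = s.failures + 2 * s.feedbackUnder;
  const good = s.successes + 2 * s.feedbackOk;
  return bad + good > 0 && bad / (bad + good) > 0.2;
}
```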

## Interaction with Token Profiles

Dynamic routing and token profiles are complementary:

- Token profiles (budget/balanced/quality) control phase skipping and context compression
- Dynamic routing controls per-unit model selection within the configured phase model

When both are active, token profiles set the baseline models and dynamic routing further optimizes within those baselines. Combining the budget token profile with dynamic routing yields the maximum cost savings.

## Cost Table

The router includes a built-in cost table for common models, used for cross-provider cost comparison. Costs are per million tokens (input/output):

| Model | Input | Output |
| --- | --- | --- |
| claude-haiku-4-5 | $0.80 | $4.00 |
| claude-sonnet-4-6 | $3.00 | $15.00 |
| claude-opus-4-6 | $15.00 | $75.00 |
| gpt-4o-mini | $0.15 | $0.60 |
| gpt-4o | $2.50 | $10.00 |
| gemini-2.0-flash | $0.10 | $0.40 |

The cost table is used for comparison only — actual billing comes from your provider.
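
To illustrate how the table can drive the cross-provider comparison, here is a blended per-call estimate; the token counts and the comparison itself are invented for the example:

```ts
// Per-million-token prices from the table above (input, output).
const COSTS: Record<string, { input: number; output: number }> = {
  "claude-haiku-4-5": { input: 0.8, output: 4.0 },
  "gpt-4o-mini": { input: 0.15, output: 0.6 },
  "gemini-2.0-flash": { input: 0.1, output: 0.4 },
};

// Blended estimate for a single call with the given token counts.
function estimatedCost(modelId: string, inTokens: number, outTokens: number): number {
  const c = COSTS[modelId];
  return (c.input * inTokens + c.output * outTokens) / 1_000_000;
}

// Light-tier comparison for a 30k-input / 10k-output call:
// claude-haiku-4-5: $0.064, gpt-4o-mini: $0.0105, gemini-2.0-flash: $0.007
```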