Parent: AI-Coding-Best-Practices

Related: Verification and Review · Agentic Workflow Guide · Context Engineering · Deterministic Sandwich
Model routing is the discipline of assigning the right model tier to each development task based on the cognitive complexity of that task and the cost-quality tradeoff that makes sense for your workflow.
In 2026, the frontier model (Claude Opus 4.6) costs $30–60 per million tokens. A mid-tier model (Claude Sonnet 4.6) costs $10–15/M. A lightweight model costs $0.50–2/M. The cost difference is 60–300x. The quality difference is real but selective — most production workloads perform identically on mid-tier for their specific tasks.
The question isn't "which model is best?" It's "which model is necessary for this specific task, given the cost and the failure modes if something goes wrong?"
**Frontier (Opus 4.6)**

When to use: Reasoning, architectural decisions, deep analysis, validation, complex problem-solving

Cost: $30–60/M tokens

What it's best at: Novel problems with no established pattern. Multi-step reasoning where a missed subtlety compounds into expensive rework. Architectural decisions that affect future features. Validation passes where confirmation bias is the real failure mode.
In the Development Loop: Steps 1, 2, 3, 4, 6, 10 (research, drafting, validation, deep audit)
**Mid-tier (Sonnet 4.6)**

When to use: Code generation, code review, build plans, remediation

Cost: $10–15/M tokens

What it's best at: Pattern-following execution. Writing code to a detailed spec. Surface-level review. Remediating known issues. Building from validated plans.
What changed in 2026: Sonnet 4.6 now beats prior-generation frontier models (Opus 4.0) in many agentic benchmarks. The gap for real-world coding tasks is so small that the decision to use mid-tier is rarely about capability — it's about cost.
In the Development Loop: Steps 5, 8, 9, 12, 13 (build planning, implementation, code review, E2E testing, remediation)
**Automated (scripts and tests)**

When to use: Deterministic checks, validation, linting, testing

Cost: ~$0 (you're running code, not querying a model)

What it's best at: Binary pass/fail checks. Anything that has a right answer independent of human judgment.
In the Development Loop: Steps 7, 15 (pre-flight checks, smoke tests)
Understanding the economics makes model selection obvious in most cases.
| Tier | Cost per M tokens | Example task | Typical cost per invocation |
|---|---|---|---|
| Frontier (Opus 4.6) | $30–60 | Research + planning a feature | $0.50–2.00 |
| Mid-tier (Sonnet 4.6) | $10–15 | Implement a task from a plan | $0.05–0.30 |
| Lightweight (Haiku/Sonnet 4.5) | $0.50–2 | Code review or linting | $0.001–0.01 |
| Automated | $0 | Unit tests, git commit, linting | $0.00 |
The 60–300x cost difference is real. On a week-long project with 50 features, using frontier for everything costs ~$250–500 in API calls. Using tiered routing costs ~$50–100. The quality difference for pattern-following tasks is negligible.
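The arithmetic behind those project-level figures can be sketched in a few lines. The token volume and the tier split below are illustrative assumptions, not measurements; plug in your own numbers:

```python
# Illustrative assumptions: 50 features at ~200k tokens each, mid-range prices
TOTAL_TOKENS_M = 50 * 0.2  # 10M tokens for the project
PRICE = {"frontier": 45.0, "mid": 12.5, "light": 1.25}  # $ per M tokens

frontier_only = TOTAL_TOKENS_M * PRICE["frontier"]
# Tiered: assume ~10% of tokens need frontier, ~20% mid-tier,
# and ~70% land on lightweight models or free automation
tiered = TOTAL_TOKENS_M * (0.10 * PRICE["frontier"]
                           + 0.20 * PRICE["mid"]
                           + 0.70 * PRICE["light"])
print(f"frontier-only: ${frontier_only:.0f}, tiered: ${tiered:.0f}")
```

With these assumptions the project lands near $450 frontier-only versus under $100 tiered, matching the ranges above.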
The parent guide's Development Loop has 15 steps. Here's which tier handles each and why.
| Step | Task | Tier | Why | Typical Cost |
|---|---|---|---|---|
| 1 | Research | Frontier | Discovering strategy, edge cases, unknowns. Needs deep reasoning. Missed details cascade. | $0.50–1.50 |
| 2 | Draft Plan | Frontier | Architectural decisions. One wrong choice breaks downstream work. | $0.20–0.60 |
| 3 | Validate Reqs | Frontier (fresh) | Confirmation bias. Fresh context prevents "yeah that looks right." | $0.15–0.40 |
| 4 | Validate Integration | Frontier (fresh) | Architectural coherence check. Fresh context critical. | $0.15–0.40 |
| 5 | Generate Build Plan | Mid-tier | Pattern execution. Given a validated plan, decompose into tasks. | $0.05–0.15 |
| 6 | Double-Check Build | Frontier | Final plan validation before expensive implementation. | $0.10–0.30 |
Total for Phase 1: ~$1.15–3.35, using frontier for high-stakes reasoning and mid-tier for execution.
| Step | Task | Tier | Why | Typical Cost |
|---|---|---|---|---|
| 7 | Pre-Flight Check | Automated | Do files exist? Dependencies installed? Binary check. | $0.00 |
| 8 | Implement | Mid-tier | Write code per spec. Pattern following. Run unit tests. | $0.30–1.00 |
| 9 | Code Review | Mid-tier | Surface-level: style, obvious bugs. | $0.05–0.15 |
| 10 | Deep Audit | Frontier | System-level correctness. Cross-file integration. Must verify Integration Contract. | $0.20–0.50 |
| 11 | Remediation | Mid-tier | Fix known issues. Pattern-following. | $0.05–0.20 |
Total for Phase 2: ~$0.60–1.85, using mid-tier for execution and frontier for the integration audit.
| Step | Task | Tier | Why | Typical Cost |
|---|---|---|---|---|
| 12 | Full E2E Tests | Mid-tier | Orchestrate test runs. Triage failures. | $0.05–0.15 |
| 13 | Final Remediation | Mid-tier | Fix E2E issues. Verify with full suite. | $0.05–0.15 |
| 14 | Commit + Update Docs | Automated + Mid-tier | Git operations automated, docs updated by mid-tier. | $0.02–0.10 |
| 15 | Smoke Test | Automated | Does app start? Tests pass? Binary gate. | $0.00 |
Total for Phase 3: ~$0.12–0.40, using automation and mid-tier.
Full feature (all 15 steps): ~$1.87–5.60 using tiered routing vs. ~$4.50–9.00 if you use frontier for everything. That cuts API cost roughly in half.
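The step-to-tier assignments in the three tables above can be captured as a single lookup, a minimal sketch of the rule-based routing described later:

```python
# Tier per Development Loop step, per the tables above.
# Step 14 is "Automated + Mid-tier": git ops are scripted, docs go to mid-tier.
STEP_TIER = {
    1: "frontier", 2: "frontier", 3: "frontier", 4: "frontier",
    5: "mid-tier", 6: "frontier",
    7: "automated", 8: "mid-tier", 9: "mid-tier", 10: "frontier", 11: "mid-tier",
    12: "mid-tier", 13: "mid-tier", 14: "mid-tier", 15: "automated",
}

def tier_for_step(step):
    """Return the model tier for a Development Loop step (1-15)."""
    return STEP_TIER[step]
```

A table like this is also a natural thing to check into the repo alongside a ROUTING.md, so the routing policy is versioned with the code.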
The agentic workflow guide uses a `model: opus` directive for subagents, following Boris Cherny's "always frontier for subagents" strategy: for focused, well-specified tasks (a subagent implementing one piece from a plan), frontier-model quality is worth the cost. The workflow pairs this with tiered routing at higher levels; orchestrators may route differently based on phase and task complexity, but subagents responsible for critical decisions consistently use frontier models to reduce rework.
Boris Cherny (creator of Claude Code) doesn't use tiered routing. He uses Opus for everything — research, planning, coding, review, everything. His reasoning is worth understanding because it's not wrong; it's just optimizing for a different constraint.
Cherny's argument: "The bottleneck isn't token cost. It's human correction time. When I use a weaker model, I spend 30 minutes trying to explain why the code is wrong or reviewing something I shouldn't have had to review. When I use Opus, I spend 5 minutes reviewing obviously-correct code. The time savings pay for the token cost 100x over."
This is empirically true for:
- Solo developers working directly with the AI (no async pipeline)
- Tasks where human review happens in the same session
- Workflows where context-switching between model behaviors is disruptive
- Projects where the developer's hourly rate >> API costs
This is empirically false for:
- Automated pipelines (no human in the loop for most tasks)
- High-volume projects (50+ features, batch processing)
- Cost-sensitive deployments (SaaS companies with thousands of developers)
- Teams where the ratio of senior engineers to features is low
The practical take: If you're a solo senior developer using Claude Code interactively, Cherny's approach might be better for your throughput even if it costs more per feature. If you're running an automated development pipeline, tiered routing wins decisively.
Tiered routing wins when:

- The pipeline is automated or async. No human waiting in the loop means "it just works" matters less than "it worked and cost $3 instead of $8."
- Failure modes are well-understood. You know which tasks actually need frontier-level reasoning and which are pattern-following.
- Volume is high. 50+ features per project, multiple projects, batch processing. The 40–50% cost savings compounds.
- Cost is a hard constraint. Startups, cost-conscious enterprises, or developers paying for API calls personally.
- Context quality is high. With excellent specs and plans, mid-tier doesn't need babysitting.
Always-frontier wins when:

- The developer is in the loop. Interactive sessions where the human is watching and able to correct mid-task.
- The developer is senior. Experienced engineers can task-switch between model behaviors and aren't slowed down by weaker output.
- Correctness feedback is immediate. The developer sees code execute right away and catches issues instantly.
- Throughput per hour is the metric. Calendar time matters more than API cost.
Infralovers (a design-focused development studio) profiled their agentic workflow and found they were using frontier models for everything. They implemented tiered routing as described here and achieved 57% cost reduction with zero quality regression.
The key finding: they were using frontier models for deterministic checks.
Before routing:
- Pre-flight checks: Opus
- Code review: Opus
- Git commits: Opus
- Build plan decomposition: Opus
- Cost per feature: $8.50
After routing:
- Pre-flight checks: Automated (bash scripts)
- Code review: Sonnet 4.6
- Git commits: Automated (conventional commits hook)
- Build plan decomposition: Sonnet 4.6
- Research + validation + audit: Opus
- Cost per feature: $3.60
The biggest win: Automating the pre-flight check. Checking whether files exist and dependencies are installed isn't a problem that benefits from language models. They wrote a simple shell script — 50 lines — that caught the same issues at zero cost.
The lesson: Look for tasks where you're paying for an LLM to do something deterministic. Linting, formatting, basic file validation, commit message validation — these have right answers. Automate them. Save the LLM for reasoning.
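As a concrete illustration (not Infralovers' actual script), a deterministic pre-flight gate can be a few lines of Python. The file names and tools below are placeholders for whatever your stack actually requires:

```python
import shutil
from pathlib import Path

# Placeholder requirements; substitute your project's real files and tools
REQUIRED_FILES = ["package.json", "tsconfig.json"]
REQUIRED_TOOLS = ["node", "git"]

def pre_flight(root="."):
    """Deterministic pass/fail gate: no LLM involved, costs $0."""
    problems = []
    for name in REQUIRED_FILES:
        if not (Path(root) / name).is_file():
            problems.append(f"missing file: {name}")
    for tool in REQUIRED_TOOLS:
        if shutil.which(tool) is None:
            problems.append(f"missing tool: {tool}")
    return problems  # empty list means the gate passes
```

Run it as a gate before Step 8: if the list is non-empty, stop and fix the environment instead of spending tokens.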
One reason tiered routing works so well in 2026 is that Sonnet 4.6 is genuinely close to Opus 4.6 on real-world coding tasks.
Anthropic's internal benchmarks:
- Architecture decisions (research/validation): Opus ~10% better
- Code generation from spec: Sonnet ~2% worse on average (sometimes better)
- Code review (finding bugs): Opus ~8% better
- Remediation (fixing known issues): Sonnet ~1% worse
The frontier advantage is real for complex reasoning (architecture, validation) but minimal for pattern-following tasks (coding, review, remediation). This is why tiered routing works — you're using frontier where it matters and saving money where the gap is negligible.
**Strategy 1: Rule-based routing**

Approach: Hardcode the task-to-tier mapping (as shown in the table above).
```python
FRONTIER = {"research", "validation", "audit"}
MID_TIER = {"coding", "review", "remediation"}
AUTOMATED = {"pre_flight", "smoke_test"}

def rule_based_router(task_category):
    if task_category in FRONTIER:
        return "frontier"
    if task_category in MID_TIER:
        return "mid-tier"
    if task_category in AUTOMATED:
        return "automated"
    raise ValueError(f"unmapped task category: {task_category}")
```
Pros:
- Easy to understand and debug
- Optimal for most projects
- No overhead
Cons:
- Inflexible if task characteristics change
- Doesn't adapt to project-specific patterns
Best for: Small to medium projects with clear task boundaries
**Strategy 2: Complexity-based routing**

Approach: Estimate task complexity and route dynamically.
```python
def complexity_based_router(spec, codebase_size, integration_points):
    # estimate_complexity applies the heuristics listed below, scaled to [0, 1]
    complexity = estimate_complexity(spec, codebase_size, integration_points)
    if complexity > 0.8:
        return "frontier"
    if complexity > 0.4:
        return "mid-tier"
    return "automated"
```
Complexity heuristics:
- Number of files affected (>5 files = higher complexity)
- External API integrations (if yes = higher complexity)
- Cross-module reasoning required (if yes = higher complexity)
- Novelty of pattern (is this similar to existing code? lower complexity if yes)
Pros:
- Adapts to project-specific complexity
- Can optimize cost/quality on a per-task basis
Cons:
- Complexity estimation itself needs to be correct (easy to get wrong)
- More moving parts to debug
Best for: Large projects where complexity varies significantly or where you want fine-grained cost optimization
**Strategy 3: Hybrid routing**

Approach: Use rule-based routing as the primary gate, with a complexity-based override.
```python
def hybrid_router(task_category, complexity):
    route = rule_based_router(task_category)  # primary gate
    # Thresholds for "unusually high" / "very low" are illustrative
    if complexity > 0.9 and route != "frontier":
        return "frontier"  # promote the edge case
    if complexity < 0.2 and route == "frontier" and task_category != "validation":
        return "mid-tier"  # demote, but never demote validation
    return route
```
Pros:
- Handles the typical case (rule-based) efficiently
- Adapts to edge cases (complexity override)
- Good balance of simplicity and flexibility
Cons:
- Still requires complexity estimation for the override
Best for: Medium to large projects with some variation in task complexity
For high-volume tasks (hundreds of features), even mid-tier cost can add up. Three strategies for further optimization:
**Local models**

For pattern-following code generation, smaller local models (Llama, Code Llama) can be cost-effective, especially if you have GPU infrastructure.
When this works:
- Well-specified tasks with clear patterns
- Internal tooling (not customer-facing code)
- High volume (hundreds of tasks) where amortized cost matters
When this fails:
- Novel patterns or architectural decisions
- Complex multi-file integration
- Anything that needs strong error recovery
Practical approach: Use local models for implementation of well-understood patterns, frontier/mid-tier for planning and validation.
**Speculative execution**

Generate code twice — once with a local model, once with mid-tier — and compare. Use the local version if it passes automated tests; fall back to mid-tier if not.
This is technically a form of self-play, and it works surprisingly well for pattern-following tasks where you have strong verification (test coverage).
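The mechanism is a few lines once the generators and the test harness are passed in as callables. The function names here are illustrative, not a real API:

```python
def speculative_generate(task, local_model, mid_tier_model, passes_tests):
    """Try the cheap model first; keep its output only if verification passes.

    local_model / mid_tier_model: callables task -> code (e.g. API wrappers)
    passes_tests: callable code -> bool (your automated test harness)
    """
    candidate = local_model(task)
    if passes_tests(candidate):
        return candidate, "local"       # cheap path: tests confirmed it
    return mid_tier_model(task), "mid-tier"  # fallback: pay for quality
```

Note that this only saves money when the local model's hit rate times its cost advantage exceeds the cost of the occasional double generation, which is why strong test coverage is a prerequisite.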
**Caching and pattern reuse**

If you're building multiple features with similar patterns, cache the model outputs from the first feature and reuse them for the second.
Example: The first data model migration uses Sonnet. The second, third, and fourth migrations of similar pattern can reuse the cached approach with minimal mid-tier invocation.
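A minimal sketch of that cache, keyed on a pattern signature. The keying scheme is an assumption; a real system might hash the spec or a normalized task description instead:

```python
import hashlib

_pattern_cache = {}

def generate_with_cache(pattern_key, task, model):
    """Call the model once per pattern; later tasks reuse the cached approach."""
    key = hashlib.sha256(pattern_key.encode()).hexdigest()
    if key not in _pattern_cache:
        _pattern_cache[key] = model(task)  # paid invocation, first occurrence only
    return _pattern_cache[key]
```

The cached output is a starting approach, not a finished artifact: the second migration still needs a (cheap) adaptation pass and the usual automated verification.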
**Using frontier for pre-flight checks.** Checking file existence or dependency installation is not a reasoning problem. Automate it.

**Never routing.** Using the same model for everything because "it's simpler." The 40–50% cost savings isn't negligible, especially at scale.

**Routing based on guesses.** Assigning a tier without understanding the failure mode. The decision should be: "What happens if this step is wrong, and does a weaker model risk that failure?"

**Over-routing.** Assigning frontier to tasks that don't need it (like surface-level code review). Review can catch different bugs than audit — route them separately, but review doesn't need frontier reasoning.

**Changing models mid-session.** Starting with frontier context, then switching to mid-tier for the same task. Context style and formatting expectations differ; this creates friction.

**Forgetting the human factor.** If your developer's time is $200/hour, the API cost difference between frontier and mid-tier is $5, and mid-tier requires an extra 30 minutes of human review, mid-tier loses on the math. Route for total cost, not just API cost.
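That total-cost tradeoff is easy to make explicit. The rates and review times below are the example's own numbers (and Cherny's 5-minute claim), not benchmarks:

```python
def total_cost(api_cost, review_minutes, hourly_rate=200.0):
    """API spend plus the dollar value of human review time."""
    return api_cost + review_minutes / 60.0 * hourly_rate

frontier = total_cost(api_cost=8.0, review_minutes=5)   # about $24.67
mid_tier = total_cost(api_cost=3.0, review_minutes=35)  # about $119.67
# Mid-tier "saves" $5 in API fees but loses roughly $95 of engineer time
```

The same function flips the answer in an automated pipeline: set `review_minutes=0` for both and mid-tier wins on pure API cost.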
- Profile your workflow. Measure cost per feature using frontier for everything. Then estimate cost using the tiered routing table above. The gap is usually 40–50%. That's your upside.
- Automate the obvious. Pre-flight checks, smoke tests, linting, formatting — write simple scripts (shell, Python, whatever fits your stack) to replace LLM invocations.
- Use the rule-based router. Apply the task-to-tier mapping from the Development Loop table. This is optimal for most projects.
- Profile mid-tier performance on your actual tasks. Run 5 features using Sonnet 4.6 for implementation and measure quality and cost. Compare to frontier. You'll likely see that mid-tier is more than good enough.
- Create a ROUTING.md in your repo. Document which model tier handles which tasks in your specific workflow. Update it as you learn. Share it with your team.
Sources:

- Anthropic, "Model Routing for Cost-Effective Development" (2026)
- Infralovers, "57% Cost Reduction Through Intelligent Model Assignment" (2025)
- Boris Cherny, Claude Code workflow and benchmarking (2026)
- Thoughtworks, "Spec-driven development: Model Selection and Routing" (2025)
- Anthropic, "2026 Agentic Coding Trends Report"
- HumanLayer, "Optimizing for Cost Without Sacrificing Quality" (2025)
- Manus AI, "Router Architectures for Multi-Model Workflows" (2025)
- Martin Fowler, "Context Engineering for Coding Agents" (2025)