Building a Smart LLM Router: How We Benchmarked 46 Models and Built a 14-Dimension Classifier #124
1bcMax · announced in Announcements
March 20, 2026 | BlockRun Engineering
When you route AI requests across 55+ models from 8 providers, you can't just pick the cheapest one. You can't just pick the fastest one either. We learned this the hard way.
This is the technical story of how we benchmarked every model on our platform, discovered that speed and intelligence are poorly correlated, and built a production routing system that classifies requests in under 1ms using 14 weighted dimensions with sigmoid confidence calibration.
The Problem: One Gateway, 46 Models, Infinite Wrong Choices
BlockRun is an x402 micropayment gateway. Every LLM request flows through our proxy, gets authenticated via on-chain USDC payment, and is forwarded to the appropriate provider. The payment overhead adds 50-100ms to every request.
Our users set `model: "auto"` and expect us to pick the right model. But "right" means different things for different requests. We needed a system that could classify any request and route it to the optimal model in real time.
Step 1: Benchmarking the Fleet
Before building the router, we needed ground truth. We benchmarked all 55+ models through our production payment pipeline.
Methodology
This is not a synthetic benchmark. Every measurement includes the full payment-verification round trip that real users experience.
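A minimal sketch of the kind of measurement loop involved (the function names and structure here are hypothetical; the real harness is not shown in this post):

```typescript
// Sketch: time N requests per model and report p50/p95 latency.
// All names are illustrative, not BlockRun's actual benchmark code.
function percentile(sorted: number[], p: number): number {
  const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[idx];
}

async function benchmarkModel(
  send: () => Promise<void>, // one full request, incl. payment round trip
  runs: number,
): Promise<{ p50: number; p95: number }> {
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = Date.now();
    await send(); // measured end-to-end, as real users experience it
    samples.push(Date.now() - start);
  }
  samples.sort((a, b) => a - b);
  return { p50: percentile(samples, 50), p95: percentile(samples, 95) };
}
```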
The Latency Landscape
Results revealed a 7x spread between the fastest and slowest models:
Two clear patterns:
Step 2: Adding the Quality Dimension
Speed alone tells you nothing about whether a model can actually handle your request. We cross-referenced our latency data with Artificial Analysis Intelligence Index v4.0 scores (composite of GPQA, MMLU, MATH, HumanEval, and other benchmarks):
The Efficiency Frontier
Plotting IQ against latency reveals a clear efficiency frontier:
The frontier runs from Gemini 2.5 Flash (IQ 20, 1.2s) up to Gemini 3.1 Pro (IQ 57, 1.6s). Everything below and to the right of this line is dominated: you can get equal or better quality at equal or lower latency from a different model.
Key insight: Gemini 3.1 Pro matches GPT-5.4's IQ at 1/4 the latency and lower cost. Claude Sonnet 4.6 nearly matches Opus 4.6 quality at 60% of the price. These dominated pairings directly informed our routing fallback chains.
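The dominance check behind the frontier is simple to state in code. A sketch, using only the three data points quoted above:

```typescript
interface ModelPoint { name: string; iq: number; latency: number }

// A model is dominated if some other model has >= IQ at <= latency,
// and is strictly better on at least one of the two axes.
function isDominated(m: ModelPoint, all: ModelPoint[]): boolean {
  return all.some(
    (o) =>
      o !== m &&
      o.iq >= m.iq &&
      o.latency <= m.latency &&
      (o.iq > m.iq || o.latency < m.latency),
  );
}

// Points taken from the figures quoted in this post.
const models: ModelPoint[] = [
  { name: "gemini-2.5-flash", iq: 20, latency: 1.2 },
  { name: "gemini-3.1-pro", iq: 57, latency: 1.6 },
  { name: "gpt-5.4", iq: 57, latency: 6.2 },
];

const frontier = models.filter((m) => !isDominated(m, models));
// gpt-5.4 is dominated: gemini-3.1-pro matches its IQ at lower latency.
```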
Step 3: The Failed Experiment (Latency-First Routing)
Armed with benchmark data, we initially optimized for speed. The routing config promoted fast models:
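The original config isn't reproduced in this page; illustratively, a latency-first table might have looked like this (the model ordering and tier names are a hypothetical sketch):

```typescript
// Hypothetical sketch of the abandoned latency-first routing table:
// models ordered purely by measured speed, ignoring IQ. This is the
// single-metric mistake described below.
const latencyFirstRouting: Record<string, string[]> = {
  SIMPLE: ["grok-4-fast", "gemini-2.5-flash"],
  MEDIUM: ["grok-4-fast", "gemini-2.5-flash"],
  COMPLEX: ["grok-4-fast", "gemini-2.5-flash"], // fast, but too low-IQ for complex work
};
```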
Users complained within 24 hours. The fast models were refusing complex tasks and giving shallow responses. A model with IQ 41 can't reliably handle architecture design or multi-step code generation, no matter how fast it is.
Lesson: optimizing for a single metric in a multi-objective system creates failure modes. We needed to optimize across speed, quality, and cost simultaneously.
Step 4: The 14-Dimension Scoring System
The router needs to determine what kind of request it's looking at before selecting a model. We built a rule-based classifier that scores requests across 14 weighted dimensions:
Architecture
The 14 Dimensions
Weights sum to 1.0. The weighted score maps to a continuous axis where tier boundaries partition the space.
Multilingual Support
Every keyword list includes translations in 9 languages (EN, ZH, JA, RU, DE, ES, PT, KO, AR). A Chinese user asking "证明这个定理" triggers the same reasoning classification as "prove this theorem."
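Putting the pieces together, the scoring path can be sketched roughly as below. The dimension names, keywords, and weights are invented examples (the real table lives in src/router/rules.ts); the sigmoid function matches the calibration described in the next section:

```typescript
// Illustrative sketch of the rule-based classifier: weighted keyword
// dimensions (weights summing to 1.0) plus sigmoid confidence calibration.
interface Dimension { weight: number; keywords: string[] }

const dimensions: Record<string, Dimension> = {
  reasoning: { weight: 0.4,  keywords: ["prove", "theorem", "证明", "定理"] },
  code:      { weight: 0.35, keywords: ["function", "refactor", "debug"] },
  casual:    { weight: 0.25, keywords: ["thanks", "joke", "hello"] },
};

// Weighted score in [0, 1]: each dimension fires at most once.
function score(prompt: string): number {
  const text = prompt.toLowerCase();
  let total = 0;
  for (const dim of Object.values(dimensions)) {
    if (dim.keywords.some((k) => text.includes(k))) total += dim.weight;
  }
  return total;
}

// Sigmoid calibration: distance from the nearest tier boundary is
// non-negative, so confidence lands in [0.5, 1.0).
function confidence(distanceFromBoundary: number, steepness = 12): number {
  return 1 / (1 + Math.exp(-steepness * distanceFromBoundary));
}
```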
Confidence Calibration
Raw tier assignments can be ambiguous when a score falls near a boundary. We use sigmoid calibration:

confidence = 1 / (1 + exp(-steepness * distance_from_boundary))

where `steepness = 12` and `distance_from_boundary` is the score's distance to the nearest tier boundary. Because the distance is non-negative, this maps to a [0.5, 1.0] confidence range. Below `threshold = 0.7`, the request is classified as ambiguous and defaults to MEDIUM.

Agentic Detection
A separate scoring pathway detects agentic tasks (multi-step, tool-using, iterative). When `agenticScore >= 0.5`, the router switches to agentic-optimized tier configs that prefer models with strong instruction following (Claude Sonnet for complex tasks, GPT-4o-mini for simple tool calls).

Step 5: Tier-to-Model Mapping
Once a request is classified into a tier, the router selects from 4 routing profiles:
Auto Profile (Default)
Tuned from our benchmark data + user retention metrics:
Eco Profile
Ultra cost-optimized. Uses free/near-free models:
Premium Profile
Best quality regardless of cost:
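The profile tables themselves aren't reproduced in this page; as a shape, they might look like this (the tier-to-model choices below are hypothetical, loosely based on models discussed in this post; the real tables live in src/router/config.ts):

```typescript
// Hypothetical shape of the routing profiles. Only three of the four
// profiles are sketched; contents are illustrative.
type Tier = "SIMPLE" | "MEDIUM" | "COMPLEX";

const profiles: Record<string, Record<Tier, string>> = {
  auto: {
    SIMPLE: "gemini-2.5-flash", // best SIMPLE-task retention, per this post
    MEDIUM: "claude-sonnet-4.6",
    COMPLEX: "gemini-3.1-pro",
  },
  eco: {
    SIMPLE: "grok-4-fast",
    MEDIUM: "gemini-2.5-flash",
    COMPLEX: "gemini-2.5-flash",
  },
  premium: {
    SIMPLE: "claude-sonnet-4.6",
    MEDIUM: "claude-opus-4.6",
    COMPLEX: "claude-opus-4.6",
  },
};
```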
Fallback Chains
Each tier config includes an ordered fallback list. When the primary model returns a 402 (payment failed), 429 (rate limited), or 5xx, the proxy walks the fallback chain. Fallback ordering is benchmark-informed:
The chain descends by quality first (IQ 48 → 46 → 41), then trades quality for speed. GPT-5.4 is last despite having IQ 57, because its 6.2s latency is a worst-case user experience.
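The chain-walking logic itself is a small loop. A sketch (status-code handling from this post; the function and class names are hypothetical):

```typescript
// Walk an ordered fallback chain, retrying on payment failures (402),
// rate limits (429), and server errors (5xx). Names are illustrative.
class HttpError extends Error {
  constructor(public status: number) { super(`HTTP ${status}`); }
}

const isRetryable = (status: number) =>
  status === 402 || status === 429 || status >= 500;

async function completeWithFallback(
  chain: string[],
  call: (model: string) => Promise<string>,
): Promise<string> {
  let lastError: unknown;
  for (const model of chain) {
    try {
      return await call(model);
    } catch (err) {
      lastError = err;
      if (err instanceof HttpError && isRetryable(err.status)) continue;
      throw err; // non-retryable errors surface immediately
    }
  }
  throw lastError; // entire chain exhausted
}
```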
Step 6: Context-Aware Filtering
The fallback chain is filtered at runtime based on request properties:
If filtering eliminates all candidates, the full chain is used as a fallback (better to let the API error than return nothing).
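Conceptually, the filter plus its escape hatch look like this (the post doesn't enumerate the actual request properties checked; `minContext` and `needsVision` below are invented examples):

```typescript
interface ModelMeta { name: string; contextWindow: number; vision: boolean }
interface RequestProps { minContext: number; needsVision: boolean }

// Filter the fallback chain by request properties. If nothing survives,
// return the full chain rather than leaving zero candidates.
function filterChain(chain: ModelMeta[], req: RequestProps): ModelMeta[] {
  const filtered = chain.filter(
    (m) => m.contextWindow >= req.minContext && (!req.needsVision || m.vision),
  );
  return filtered.length > 0 ? filtered : chain;
}
```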
Cost Calculation and Savings
Every routing decision includes a cost estimate and savings percentage against a baseline (Claude Opus 4.6 pricing):
For a typical SIMPLE request (500 input tokens, 256 output tokens):
Across our user base, the median savings rate is 85% compared to routing everything to a premium model.
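The savings calculation itself is simple arithmetic. A sketch (the per-million-token prices below are placeholders, not actual provider pricing):

```typescript
// Estimate request cost and savings vs. a premium baseline.
// Prices are illustrative placeholders in USD per million tokens.
interface Pricing { inputPerM: number; outputPerM: number }

function cost(p: Pricing, inputTokens: number, outputTokens: number): number {
  return (inputTokens * p.inputPerM + outputTokens * p.outputPerM) / 1_000_000;
}

function savingsPercent(
  chosen: Pricing, base: Pricing, inT: number, outT: number,
): number {
  return 100 * (1 - cost(chosen, inT, outT) / cost(base, inT, outT));
}

// Example: a SIMPLE request (500 input, 256 output tokens) routed to a
// cheap model vs. an Opus-class baseline (placeholder prices).
const cheap: Pricing = { inputPerM: 0.3, outputPerM: 2.5 };
const baseline: Pricing = { inputPerM: 15, outputPerM: 75 };
```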
Performance
The entire classification pipeline (14 dimensions + tier mapping + model selection) runs in under 1ms. No external API calls. No LLM inference. Pure keyword matching and arithmetic.
We originally designed a two-stage system where low-confidence rules-based classifications would fall back to an LLM classifier (Gemini 2.5 Flash). In practice, the rules handle 70-80% of requests with high confidence, and the remaining ambiguous cases default to MEDIUM — which is the correct conservative choice.
What We Learned
Speed and intelligence are weakly correlated. The fastest model (Grok 4 Fast, IQ 23) is at the bottom of the quality scale. The smartest model at low latency (Gemini 3.1 Pro, IQ 57, 1.6s) is a Google model, not OpenAI.
Optimizing for one metric fails. Latency-first routing breaks quality. Quality-first routing breaks latency budgets. You need multi-objective optimization.
User retention is the real metric. Our best-performing model for SIMPLE tasks isn't the cheapest or the fastest — it's Gemini 2.5 Flash (60% retention rate), which balances speed, cost, and just-enough quality.
Fallback ordering matters more than primary selection. The primary model handles the happy path. The fallback chain handles reality — rate limits, outages, payment failures. A well-ordered fallback chain is more important than picking the perfect primary.
Rule-based classification is underrated. 14 keyword dimensions with sigmoid confidence calibration handles 70-80% of requests correctly in <1ms. The remaining 20-30% default to a safe middle tier. For a routing system where every millisecond of overhead compounds across millions of requests, avoiding LLM inference in the classification step is worth the reduced accuracy.
Appendix: Full Benchmark Data
Raw data (55+ models, latency, throughput, IQ scores, pricing): benchmark-merged.json
Routing configuration: src/router/config.ts
Scoring implementation: src/router/rules.ts

BlockRun is the x402 micropayment gateway for AI. One wallet, 55+ models, pay-per-request with USDC. blockrun.ai