This document traces the project's journey from cost-based model routing to a RAG evaluation tool.
When GPT-3.5 launched, there was a significant cost difference between context sizes:
| Model | Context | Cost (1K input) |
|---|---|---|
| GPT-3.5-turbo (4K) | 4,096 | $0.0015 |
| GPT-3.5-turbo-16K | 16,384 | $0.003 |
For workloads with variable document sizes, this created an optimization opportunity: route short documents to the cheaper 4K model, and only use the 16K model when necessary.
The system classified documents into buckets:
- Short (< 500 tokens): Use GPT-3.5-turbo (cheapest)
- Medium (500-3500 tokens): Use GPT-3.5-turbo (still fits)
- Long (> 3500 tokens): Use GPT-3.5-turbo-16K (required)
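The bucket logic above reduces to a small routing function; this sketch is illustrative (the function name and the 3,500-token threshold's headroom rationale are assumptions, not the project's actual code):

```python
def route_model(token_count: int) -> str:
    """Pick the cheapest GPT-3.5 variant that fits the document.

    Thresholds mirror the buckets described above: anything up to
    3,500 tokens fits the 4K context (leaving headroom for the
    prompt and completion), longer documents need the 16K model.
    """
    if token_count <= 3500:
        return "gpt-3.5-turbo"      # short and medium buckets
    return "gpt-3.5-turbo-16k"      # long bucket
```

With the pricing table above, every document routed to the 4K model costs half as much per input token, which is where the savings on mixed workloads came from.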
This achieved measurable cost savings on mixed workloads.
Through late 2023 and early 2024, context windows grew rapidly:
| Model | Context | Released |
|---|---|---|
| GPT-4-turbo | 128K | Nov 2023 |
| Claude 3 | 200K | Mar 2024 |
| Gemini 1.5 | 1M+ | Feb 2024 |
The original cost-routing motivation became less compelling. With 128K+ contexts, even "long" documents fit easily.
The research question shifted:
If we're chunking for retrieval anyway, does adaptive chunk sizing improve retrieval quality?
The hypothesis was that document length should inform chunk size:
- Short documents (<1K tokens): Keep whole
- Medium documents (1K-4K): Standard chunks (2K)
- Long documents (>4K): Smaller chunks (1K)
This pivot required proper evaluation infrastructure:
- YAML-driven experiment configs
- Reproducible benchmarking
- Comprehensive metrics (P@k, R@k, efficiency, cost)
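An experiment config in this style might look like the following; the field names are made up to show the shape of a YAML-driven run, not ContextRAG's actual schema:

```yaml
# Illustrative config: uniform vs. adaptive chunking on one dataset.
experiment: adaptive-vs-uniform
dataset: mixed-corpus
seed: 42
strategies:
  - name: uniform
    chunk_tokens: 2000
  - name: adaptive
    buckets:
      - {max_tokens: 1000, chunk_tokens: null}   # keep whole
      - {max_tokens: 4000, chunk_tokens: 2000}
      - {max_tokens: .inf, chunk_tokens: 1000}
metrics: [precision@5, recall@5, efficiency, cost]
```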
We tested adaptive chunking with reproducible CLI configs and local matrix comparisons across multiple datasets.
Routing did not win. Precision@5 and Recall@5 were identical on the hosted mixed-corpus slice, and later expanded local-matrix runs showed uniform chunking outperforming the router.
This held across:
- Mixed-corpus runs with the OpenAI-family embedding model `text-embedding-3-small`
- RFC-only runs with OpenRouter `qwen/qwen3-embedding-8b`
- Multiple repeated runs on the mixed corpus
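Precision@k and recall@k are standard retrieval metrics; a minimal reference implementation of the definitions used above:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

# Example: 2 of the top 5 results are relevant, out of 3 relevant total.
retrieved = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "x"}
print(precision_at_k(retrieved, relevant, 5))  # 0.4
```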
In this project, routing by document length did not produce better retrieval than uniform chunking. The embedding models we tested were indifferent to chunk size differences, so the routing logic just added complexity for nothing.
The embedding layer turned out to be useful on its own, so we extracted it into chromaroute:
- Provider-agnostic ChromaDB embedding functions
- OpenRouter -> Local fallback chain
- Production-ready error handling
ContextRAG now depends on chromaroute for embeddings, keeping only the evaluation infrastructure.
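The fallback-chain idea can be sketched in a few lines. This is a simplified illustration of the OpenRouter -> local pattern, not chromaroute's real API, which wraps ChromaDB embedding functions and differs in detail:

```python
from typing import Callable, Sequence

# A provider is anything that maps texts to embedding vectors.
Embedder = Callable[[Sequence[str]], list[list[float]]]

class FallbackEmbedder:
    """Try providers in order, falling back to the next on failure."""

    def __init__(self, providers: list[Embedder]):
        self.providers = providers

    def __call__(self, texts: Sequence[str]) -> list[list[float]]:
        last_error: Exception | None = None
        for provider in self.providers:
            try:
                return provider(texts)
            except Exception as exc:  # real code would catch narrower errors
                last_error = exc
        raise RuntimeError("all embedding providers failed") from last_error
```

Usage: construct `FallbackEmbedder([openrouter_embed, local_embed])` and call it like a single provider; a remote outage then degrades to local embeddings instead of failing the run.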
- Negative results are valuable. The null finding simplifies production RAG: just use uniform chunking.
- Build evaluation early. Without the benchmarking framework, this result would have stayed invisible.
- Extract reusable components. chromaroute emerged from ContextRAG's embedding abstraction and now stands on its own.
ContextRAG is a focused evaluation CLI:
```
# Primary commands
contextrag eval      # Full evaluation with configurable strategies
contextrag demo      # Offline evaluation with local embeddings
contextrag doctor    # Check configuration health
contextrag matrix    # Run baseline-by-k experiment matrix
contextrag compare   # Compare two evaluation runs

# Database operations
contextrag db index  # Build vector index
contextrag db query  # Query index
```

The routing and ingest commands were removed -- they represented historical complexity, not current value. Embedding functionality is now delegated to chromaroute.