A deterministic evaluation and QA utility for retrieval-augmented generation (RAG) systems that requires no model calls.
It focuses on enterprise-friendly metrics you can run in CI:
- Retrieval metrics: recall@k, precision@k, MRR, hit-rate@k (see the sketch after this list)
- Corpus QA: near-exact duplicate detection and overlap between corpora (leakage risk)
- Reproducible reports: JSON outputs suitable for gating
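
As a quick reference, these metrics can be sketched in a few lines of Python (a minimal sketch; the function names are illustrative, not this package's API):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant) if relevant else 0.0

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are relevant."""
    return len(set(retrieved[:k]) & set(relevant)) / k

def reciprocal_rank(retrieved, relevant):
    """Reciprocal rank of the first relevant id (0.0 if none is retrieved).
    MRR is the mean of this value over all queries."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def hit_rate_at_k(retrieved, relevant, k):
    """1.0 if any relevant id appears in the top k, else 0.0."""
    return 1.0 if set(retrieved[:k]) & set(relevant) else 0.0
```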
This is intentionally lightweight and safe to open source. A Pro version can add dashboards, policy enforcement, multi-tenant governance, and hosted storage for corpora and eval runs.
Install (editable, with dev extras):
```bash
pip install -e ".[dev]"
```
Run the tests:
```bash
pytest -q
```
Score retrieval results:
```bash
toolkit-rag score --queries queries.jsonl --retrieved retrieved.jsonl --k 5 --out report.json
```
Check overlap/leakage between two corpora:
```bash
toolkit-rag overlap --a corpus_a.jsonl --b corpus_b.jsonl --out overlap.json
```
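
A common way to implement this kind of check is shingle signatures plus Jaccard similarity; here is a minimal sketch of that idea (the 0.8 threshold and all names are illustrative assumptions, not necessarily what `overlap` does internally):

```python
import json

def shingles(text, n=5):
    """Lowercased word n-gram shingles, a standard near-duplicate signature."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a, b):
    """Set overlap in [0, 1]; 1.0 means identical shingle sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def load_corpus(path):
    """Read a corpora JSONL file into an {id: text} dict."""
    with open(path) as f:
        return {rec["id"]: rec["text"] for rec in map(json.loads, f)}

# Flag cross-corpus pairs whose shingle overlap suggests leakage.
corpus_a = load_corpus("corpus_a.jsonl")
signatures_b = {doc_id: shingles(text)
                for doc_id, text in load_corpus("corpus_b.jsonl").items()}
for id_a, text_a in corpus_a.items():
    sig_a = shingles(text_a)
    for id_b, sig_b in signatures_b.items():
        if jaccard(sig_a, sig_b) >= 0.8:  # threshold is illustrative
            print(f"possible leak: {id_a} ~ {id_b}")
```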
Compare a candidate report to a baseline (CI gating):
```bash
toolkit-rag compare --baseline baseline.json --candidate report.json --max-recall-regression-pct 2.0
```
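
The gate boils down to a percentage-regression check against the budget; a minimal sketch of that logic (the `recall_at_k` report key is an assumption, not the tool's documented schema):

```python
import json
import sys

# Sketch of the gating logic implied by `compare`; "recall_at_k" is an
# assumed report key, and the real schema may differ.
with open("baseline.json") as f:
    baseline = json.load(f)
with open("report.json") as f:
    candidate = json.load(f)

MAX_RECALL_REGRESSION_PCT = 2.0  # mirrors --max-recall-regression-pct

base = baseline["recall_at_k"]
cand = candidate["recall_at_k"]
regression_pct = (base - cand) / base * 100.0 if base else 0.0

if regression_pct > MAX_RECALL_REGRESSION_PCT:
    print(f"FAIL: recall@k regressed {regression_pct:.2f}% (budget {MAX_RECALL_REGRESSION_PCT}%)")
    sys.exit(4)  # matches the documented failure exit code
print("PASS: within regression budget")
```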
{"id":"q1","query":"...","relevant_ids":["doc-1","doc-9"]}Retrieved JSONL (one per line):
{"id":"q1","retrieved_ids":["doc-9","doc-2","doc-1"]}Corpora JSONL (one per line):
{"id":"doc-1","text":"..."}score- Score retrieval results against queriesoverlap- Check overlap/leakage between corporacompare- Compare reports for CI gating
Commands:
- `score` - Score retrieval results against queries
- `overlap` - Check overlap/leakage between corpora
- `compare` - Compare reports for CI gating

Exit codes for `compare`: `0` = passed, `4` = failed budgets.
MIT License - see LICENSE file for details.