AKIVA-AI/toolkit-rag-quality

Toolkit RAG Quality

A deterministic evaluation and QA utility for retrieval-augmented generation (RAG) systems that runs without any model calls.

It focuses on enterprise-friendly metrics you can run in CI:

  • Retrieval metrics: recall@k, precision@k, MRR, hit-rate@k
  • Corpus QA: near-exact duplicate detection and overlap between corpora (leakage risk)
  • Reproducible reports: JSON outputs suitable for gating
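The retrieval metrics above can be sketched in a few lines of plain Python. This is an illustration of the definitions only; the function names here are not part of the toolkit's API:

```python
def recall_at_k(relevant, retrieved, k):
    """Fraction of relevant ids found in the top-k retrieved."""
    hits = len(set(relevant) & set(retrieved[:k]))
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(relevant, retrieved, k):
    """Fraction of the top-k retrieved ids that are relevant."""
    hits = len(set(relevant) & set(retrieved[:k]))
    return hits / k if k else 0.0

def mrr(relevant, retrieved):
    """Reciprocal rank of the first relevant id (0.0 if none retrieved)."""
    rel = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in rel:
            return 1.0 / rank
    return 0.0

def hit_rate_at_k(relevant, retrieved, k):
    """1.0 if any relevant id appears in the top-k, else 0.0."""
    return 1.0 if set(relevant) & set(retrieved[:k]) else 0.0
```

Because the inputs are plain id lists, every metric is deterministic and cheap enough to recompute on every CI run.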

This is intentionally lightweight and safe to open source. A Pro version can add dashboards, policy enforcement, multi-tenant governance, and hosted storage for corpora and eval runs.

Install (dev)

pip install -e ".[dev]"
pytest -q

Quickstart

Score retrieval results:

toolkit-rag score --queries queries.jsonl --retrieved retrieved.jsonl --k 5 --out report.json

Check overlap/leakage between two corpora:

toolkit-rag overlap --a corpus_a.jsonl --b corpus_b.jsonl --out overlap.json
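A minimal way to flag near-exact duplicates between two corpora is to hash whitespace-collapsed, lowercased text and look for collisions. This is a simplification for illustration, not the tool's actual overlap algorithm:

```python
import hashlib

def norm_hash(text):
    """Hash of lowercased, whitespace-collapsed text (near-exact matching)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def overlap_ids(corpus_a, corpus_b):
    """Return ids of docs in corpus_b whose text collides with corpus_a."""
    seen = {norm_hash(doc["text"]) for doc in corpus_a}
    return [doc["id"] for doc in corpus_b if norm_hash(doc["text"]) in seen]
```

Any ids this returns indicate leakage risk: evaluation documents that also appear (modulo casing and whitespace) in the other corpus.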

Compare candidate report to a baseline (CI gating):

toolkit-rag compare --baseline baseline.json --candidate report.json --max-recall-regression-pct 2.0
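The gating behind a budget flag like --max-recall-regression-pct can be sketched as follows (an illustrative reimplementation, not the tool's actual code):

```python
def recall_regression_pct(baseline_recall, candidate_recall):
    """Percentage drop from baseline to candidate (negative = improvement)."""
    if baseline_recall == 0:
        return 0.0
    return (baseline_recall - candidate_recall) / baseline_recall * 100.0

def gate(baseline_recall, candidate_recall, max_regression_pct):
    """Return a CI exit code: 0 = within budget, 4 = failed budgets."""
    drop = recall_regression_pct(baseline_recall, candidate_recall)
    return 0 if drop <= max_regression_pct else 4
```

For example, a recall drop from 0.80 to 0.79 is a 1.25% regression, inside a 2.0% budget, so the gate passes; a drop to 0.70 is a 12.5% regression and fails.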

Data formats

Queries JSONL (one per line):

{"id":"q1","query":"...","relevant_ids":["doc-1","doc-9"]}

Retrieved JSONL (one per line):

{"id":"q1","retrieved_ids":["doc-9","doc-2","doc-1"]}

Corpora JSONL (one per line):

{"id":"doc-1","text":"..."}
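All three formats are line-delimited JSON, so they can be parsed and joined with nothing but the standard library. The helper below is hypothetical glue code to show how queries pair with retrieved results by id, not part of the CLI:

```python
import json

def parse_jsonl(lines):
    """Parse an iterable of JSONL lines, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

queries = parse_jsonl([
    '{"id":"q1","query":"...","relevant_ids":["doc-1","doc-9"]}',
])
retrieved = parse_jsonl([
    '{"id":"q1","retrieved_ids":["doc-9","doc-2","doc-1"]}',
])

# Join the two files on "id" so each query can be scored against its results.
by_id = {r["id"]: r["retrieved_ids"] for r in retrieved}
pairs = [(q["relevant_ids"], by_id.get(q["id"], [])) for q in queries]
```

A query id missing from the retrieved file simply yields an empty result list, which scores as zero on every metric.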

CLI Commands

  • score - Score retrieval results against queries
  • overlap - Check overlap/leakage between corpora
  • compare - Compare reports for CI gating

CI exit codes

  • compare: 0 = passed, 4 = failed budgets

License

MIT License - see LICENSE file for details.
