
feat: smriti bench — search quality benchmarking #53

Draft

ashu17706 wants to merge 1 commit into main from claude/beautiful-sinoussi

Conversation

@ashu17706
Contributor

Summary

  • Adds `smriti bench` command: a ground-truth quality benchmark inspired by Karpathy's autoresearch pattern
  • Produces one combined score (0.0–1.0) to track search quality over time and gate upstream QMD upgrades
  • 30 tests, all passing

What's in it

Corpus — 12 realistic sessions across 3 projects (webapp, ml-pipeline, infra), 2 agents, 6 categories. Session s04 has sidecar content (thinking, artifact, attachment) to exercise sidecar search paths.
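
For illustration, a corpus session fixture might look like the sketch below. The field names (`sidecars`, `CorpusSession`, etc.) are hypothetical, not the actual bench schema; only the counts and the s04 sidecar kinds come from the description above.

```typescript
// Hypothetical fixture shape; field names are illustrative.
interface SidecarContent {
  kind: "thinking" | "artifact" | "attachment";
  text: string;
}

interface CorpusSession {
  id: string;                    // e.g. "s04"
  project: "webapp" | "ml-pipeline" | "infra";
  agent: string;                 // one of the 2 agents
  category: string;              // one of the 6 categories
  messages: string[];
  sidecars?: SidecarContent[];   // only s04 carries these
}

const s04: CorpusSession = {
  id: "s04",
  project: "webapp",
  agent: "agent-a",
  category: "debugging",
  messages: ["placeholder message"],
  sidecars: [
    { kind: "thinking", text: "placeholder" },
    { kind: "artifact", text: "placeholder" },
    { kind: "attachment", text: "placeholder" },
  ],
};
```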

Query suite — 36 ground-truth queries across 3 tiers:

  • Easy (12): exact keywords → tests FTS path
  • Medium (12): paraphrased → tests recall path
  • Hard (12): conceptual + filters → tests filtered search + cross-session recall
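
A ground-truth query record for the three tiers could be shaped roughly as follows; the interface and field names (`relevantSessionIds`, `filters`) are assumptions for illustration, not the real suite format.

```typescript
// Hypothetical query-suite record; field names are illustrative.
type Tier = "easy" | "medium" | "hard";

interface BenchQuery {
  id: string;
  tier: Tier;
  query: string;
  relevantSessionIds: string[];                    // ground truth
  filters?: { project?: string; agent?: string };  // hard tier only
}

const suite: BenchQuery[] = [
  { id: "e01", tier: "easy", query: "connection pool timeout",
    relevantSessionIds: ["s01"] },
  { id: "m01", tier: "medium", query: "why did the db keep dropping requests",
    relevantSessionIds: ["s01"] },
  { id: "h01", tier: "hard", query: "rollout instability in infrastructure work",
    relevantSessionIds: ["s07"], filters: { project: "infra" } },
];
```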

Metrics — MRR, Hit@3, Hit@5 per tier. Combined score formula:

combined = 0.15 * easy.mrr + 0.35 * medium.mrr + 0.35 * hard.mrr + 0.15 * filterBonus
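
The metrics above can be sketched as follows; `ranks` holds, per query, the 1-based rank of the first relevant result (`null` if nothing relevant was retrieved). The combined-score weights mirror the formula above; everything else is an illustrative sketch, not the bench implementation.

```typescript
// Fraction of queries whose first relevant result appears in the top k.
function hitAtK(ranks: (number | null)[], k: number): number {
  return ranks.filter((r) => r !== null && r <= k).length / ranks.length;
}

// Mean reciprocal rank: 1/rank for hits, 0 for misses.
function mrr(ranks: (number | null)[]): number {
  return ranks.reduce((sum: number, r) => sum + (r === null ? 0 : 1 / r), 0) / ranks.length;
}

// Weighted combined score from the formula above.
function combinedScore(
  easyMrr: number, mediumMrr: number, hardMrr: number, filterBonus: number,
): number {
  return 0.15 * easyMrr + 0.35 * mediumMrr + 0.35 * hardMrr + 0.15 * filterBonus;
}
```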

CI thresholds — Easy MRR ≥ 0.70, Medium ≥ 0.25, Hard ≥ 0.12, Combined ≥ 0.30
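
The CI gate implied by these thresholds amounts to a simple all-of check; the score-object shape below is an assumption for illustration.

```typescript
// Thresholds from the PR description; the gate fails if any metric dips below.
const THRESHOLDS = { easyMrr: 0.70, mediumMrr: 0.25, hardMrr: 0.12, combined: 0.30 } as const;

type Scores = { [K in keyof typeof THRESHOLDS]: number };

function passesGate(scores: Scores): boolean {
  return (Object.keys(THRESHOLDS) as (keyof typeof THRESHOLDS)[])
    .every((k) => scores[k] >= THRESHOLDS[k]);
}
```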

CLI

smriti bench                                    # full run
smriti bench --profile ci-small --no-perf       # quality only
smriti bench --json --out bench/results/ci-small.json
smriti bench --compare bench/results/ci-small.json

Current scores (ci-small, FTS-only — no embeddings in bench)

| Tier     | Hit@3 | MRR   |
|----------|-------|-------|
| Easy     | 91.7% | 0.917 |
| Medium   | 8.3%  | 0.083 |
| Hard     | 41.7% | 0.417 |
| Combined |       | 0.362 |

Medium is low by design — recall queries need vector embeddings which aren't available in the in-memory bench. As QMD's embedding/reranker models improve (autoresearch loop), this score will rise.

Test plan

  • bun test test/bench.test.ts — 30/30 pass
  • bun src/index.ts bench --profile ci-small — runs end-to-end
  • bun src/index.ts bench --json --out /tmp/bench.json — JSON output works
  • bun src/index.ts bench --compare /tmp/bench.json — comparison works

Adds `smriti bench` command: a ground-truth quality benchmark inspired by
Karpathy's autoresearch pattern. Produces one combined score (0.0–1.0) to
track search quality over time and gate upstream QMD upgrades.

- 12 realistic corpus sessions across 3 projects (webapp, ml-pipeline, infra)
- 36 ground-truth queries: 12 easy (FTS), 12 medium (recall), 12 hard (filtered)
- MRR, Hit@3, Hit@5 per tier; weighted combined score (easy 15%, medium/hard 35% each)
- CI thresholds: easy MRR ≥ 0.70, medium ≥ 0.25, hard ≥ 0.12, combined ≥ 0.30
- CLI flags: --profile, --json, --out, --compare, --no-perf, --save, --history
- smriti_bench_runs table for historical tracking
- 30 tests (all passing)
@github-actions
Contributor

Benchmark Scorecard (ci-small)


threshold: 20.00%

| metric                         | baseline | current (median) | delta    | status |
|--------------------------------|----------|------------------|----------|--------|
| ingest_throughput_msgs_per_sec | 1735.800 | 447.820          | -74.20%  | WARN   |
| ingest_p95_ms_per_session      | 6.960    | 25.482           | +266.12% | WARN   |
| fts_p95_ms                     | 0.410    | 0.639            | +55.85%  | WARN   |
| recall_p95_ms                  | 0.436    | 0.777            | +78.21%  | WARN   |

Summary: WARN (4 metrics)
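
The scorecard's delta/WARN logic appears to flag any metric whose median moved past the 20% threshold relative to baseline. Treating the comparison as magnitude-only (both improvements and regressions trip WARN) is an assumption for this sketch.

```typescript
// Assumed scorecard logic: |delta| beyond the threshold (20% here) → WARN.
// A direction-aware variant (regressions only) would be a refinement.
function scorecardRow(
  baseline: number, current: number, thresholdPct = 20,
): { deltaPct: number; status: "OK" | "WARN" } {
  const deltaPct = ((current - baseline) / baseline) * 100;
  const status = Math.abs(deltaPct) > thresholdPct ? "WARN" : "OK";
  return { deltaPct, status };
}
```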
