feat: smriti bench — search quality benchmarking #53
Draft
Adds `smriti bench` command: a ground-truth quality benchmark inspired by Karpathy's autoresearch pattern. Produces one combined score (0.0–1.0) to track search quality over time and gate upstream QMD upgrades.

- 12 realistic corpus sessions across 3 projects (webapp, ml-pipeline, infra)
- 36 ground-truth queries: 12 easy (FTS), 12 medium (recall), 12 hard (filtered)
- MRR, Hit@3, Hit@5 per tier; weighted combined score (easy 15%, medium/hard 35% each)
- CI thresholds: easy MRR ≥ 0.70, medium ≥ 0.25, hard ≥ 0.12, combined ≥ 0.30
- CLI flags: --profile, --json, --out, --compare, --no-perf, --save, --history
- smriti_bench_runs table for historical tracking
- 30 tests (all passing)
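For reference, a minimal sketch of the scoring math under the weights listed above. Helper names are hypothetical and the combining step in the actual bench code may differ (for example in how any remaining weight is used); this only illustrates the shape of the calculation.

```ts
// Per-tier metrics and weighted combined score (illustrative only).
// A rank is the 1-based position of the expected session in the results,
// or null if it was never returned.
type Ranks = (number | null)[];

function mrr(ranks: Ranks): number {
  const sum = ranks.reduce<number>((acc, r) => acc + (r !== null ? 1 / r : 0), 0);
  return ranks.length > 0 ? sum / ranks.length : 0;
}

function hitAt(k: number, ranks: Ranks): number {
  const hits = ranks.filter((r) => r !== null && r <= k).length;
  return ranks.length > 0 ? hits / ranks.length : 0;
}

// Weights from the summary above: easy 15%, medium 35%, hard 35%.
function combinedScore(tiers: { easy: Ranks; medium: Ranks; hard: Ranks }): number {
  return 0.15 * mrr(tiers.easy) + 0.35 * mrr(tiers.medium) + 0.35 * mrr(tiers.hard);
}

// Example: one easy query answered at rank 1, one medium at rank 4, one hard miss.
console.log(combinedScore({ easy: [1], medium: [4], hard: [null] }));
console.log(hitAt(3, [1, 4, null])); // 1 of 3 queries had the expected session in the top 3
```

With this shape, a query whose expected session never appears contributes zero to its tier's MRR, which is what lets a weak tier pull the combined score down.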
Benchmark Scorecard (ci-small) — threshold: 20.00%
Summary: WARN (4 metrics)
Summary
`smriti bench` command: a ground-truth quality benchmark inspired by Karpathy's autoresearch pattern.
What's in it
Corpus — 12 realistic sessions across 3 projects (webapp, ml-pipeline, infra), 2 agents, 6 categories. Session s04 has sidecar content (thinking, artifact, attachment) to exercise sidecar search paths.
Query suite — 36 ground-truth queries across 3 tiers: 12 easy (FTS), 12 medium (recall), 12 hard (filtered). The shape of a query entry is sketched after this list.
Metrics — MRR, Hit@3, Hit@5 per tier. Combined score formula: weighted by tier (easy 15%, medium 35%, hard 35%), as in the scoring sketch above.
CI thresholds — Easy MRR ≥ 0.70, Medium ≥ 0.25, Hard ≥ 0.12, Combined ≥ 0.30
CLI — flags: --profile, --json, --out, --compare, --no-perf, --save, --history. Runs are recorded in the smriti_bench_runs table for historical tracking.
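To make the query tiers and the CI gate above concrete, here is an illustrative sketch. The interface fields and constant names are assumptions rather than the actual fixture or config format; only the threshold values are taken from the summary.

```ts
// Illustrative shape of a ground-truth query and the CI gate.
interface BenchQuery {
  tier: "easy" | "medium" | "hard"; // easy = FTS keyword, medium = recall, hard = filtered
  query: string;                    // the search text
  expected: string;                 // session id that should rank near the top
  filters?: { project?: string; agent?: string; category?: string }; // hard tier only
}

// MRR floors from the summary above.
const THRESHOLDS = { easy: 0.70, medium: 0.25, hard: 0.12, combined: 0.30 } as const;

function gate(scores: Record<keyof typeof THRESHOLDS, number>): boolean {
  // CI fails if any tier MRR or the combined score drops below its floor.
  return (Object.keys(THRESHOLDS) as (keyof typeof THRESHOLDS)[]).every(
    (k) => scores[k] >= THRESHOLDS[k],
  );
}
```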
Current scores (ci-small, FTS-only — no embeddings in bench)
Medium is low by design — recall queries need vector embeddings, which aren't available in the in-memory bench. As QMD's embedding/reranker models improve (autoresearch loop), this score will rise.
Test plan
- `bun test test/bench.test.ts` — 30/30 pass
- `bun src/index.ts bench --profile ci-small` — runs end-to-end
- `bun src/index.ts bench --json --out /tmp/bench.json` — JSON output works
- `bun src/index.ts bench --compare /tmp/bench.json` — comparison works
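As a purely illustrative follow-up, this is one way a saved --json result could be diffed against a baseline outside the built-in --compare flag. The JSON field names here are assumptions, not the actual output schema of `smriti bench --json`.

```ts
// Diff a current bench run against a saved baseline (field names assumed).
import { readFileSync } from "node:fs";

interface BenchJson {
  combined: number;
  tiers: Record<"easy" | "medium" | "hard", { mrr: number; hit3: number; hit5: number }>;
}

function diffCombined(baselinePath: string, currentPath: string): number {
  const baseline = JSON.parse(readFileSync(baselinePath, "utf8")) as BenchJson;
  const current = JSON.parse(readFileSync(currentPath, "utf8")) as BenchJson;
  return current.combined - baseline.combined; // positive means search quality improved
}

console.log(diffCombined("/tmp/bench.baseline.json", "/tmp/bench.json"));
```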