feat: harness effectiveness benchmark suite (#49) by rlaope · Pull Request #52 · rlaope/bestwork-agent

rlaope · 2026-04-09T02:04:25Z

Closes #49

Summary

10-scenario benchmark (hallucination, platform, security, deprecated)
Harness ON: 100% catch rate (9/9), 0 false positives
Harness OFF: 0% catch rate (0/9)
Run with npm run benchmark; results are saved automatically to benchmarks/results/

Test plan

All 11 benchmark tests pass
All 682 existing tests pass
npm run benchmark runs correctly

Benchmark framework that measures bestwork harness gates against known-bad code scenarios. 10 scenarios across 4 categories (hallucination, platform, security, deprecated). Harness ON catches 100% (9/9), vanilla catches 0%. Zero false positives.

- benchmarks/harness-benchmark.test.ts with 10 scenarios
- npm run benchmark command
- Auto-saves JSON results to benchmarks/results/

Signed-off-by: rlaope <rlaope@users.noreply.github.com>
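To make the ON/OFF comparison concrete, here is a minimal sketch of how a catch rate and a false-positive count could be computed over a scenario list. It is not the code in benchmarks/harness-benchmark.test.ts: the `Scenario` shape, the `shouldCatch` flag, `runBenchmark`, and the regex-based toy gate are all assumptions made for illustration.

```typescript
// Hypothetical types; the PR does not show the real scenario shape.
type Category = "hallucination" | "platform" | "security" | "deprecated";

interface Scenario {
  name: string;
  category: Category;
  code: string;
  // false marks a clean snippet that must NOT be flagged (false-positive guard)
  shouldCatch: boolean;
}

// A gate inspects code and returns true when it flags a problem.
type Gate = (code: string) => boolean;

interface Summary {
  caught: number;
  catchable: number;
  falsePositives: number;
  catchRate: number;
}

function runBenchmark(scenarios: Scenario[], gate: Gate): Summary {
  let caught = 0;
  let catchable = 0;
  let falsePositives = 0;
  for (const s of scenarios) {
    const flagged = gate(s.code);
    if (s.shouldCatch) {
      catchable += 1;
      if (flagged) caught += 1;
    } else if (flagged) {
      falsePositives += 1;
    }
  }
  // Catch rate only counts scenarios that are supposed to be caught.
  const catchRate = catchable === 0 ? 0 : caught / catchable;
  return { caught, catchable, falsePositives, catchRate };
}

// Toy stand-ins for "harness ON" and "harness OFF" (assumptions, not the real gates).
const harnessOn: Gate = (code) => /\beval\s*\(|new Buffer\(/.test(code);
const harnessOff: Gate = () => false;

const scenarios: Scenario[] = [
  { name: "eval on input", category: "security", code: "eval(userInput);", shouldCatch: true },
  { name: "Buffer constructor", category: "deprecated", code: "const buf = new Buffer(10);", shouldCatch: true },
  { name: "clean parse", category: "security", code: 'const n = Number.parseInt("42", 10);', shouldCatch: false },
];

console.log("ON ", runBenchmark(scenarios, harnessOn));
console.log("OFF", runBenchmark(scenarios, harnessOff));
```

In this toy setup the ON gate flags both bad snippets and leaves the clean one alone, while the OFF gate flags nothing, which mirrors the shape of the numbers reported in the summary.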

…cenarios

- Add benchmarks/results/ to .gitignore, remove committed result file
- Remove unused imports (execSync, rmSync, readFileSync, tmpdir)
- Fix byCategory report: separate catchable vs negative denominator
- Add relative import detection to simulateReviewHook
- Add 3 new scenarios: missing relative import, 2 false-positive guards
- 13 scenarios, 100% accuracy, 0 false positives

Signed-off-by: rlaope <rlaope@users.noreply.github.com>
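The byCategory fix can be illustrated with a small sketch. If negative (clean-code) scenarios share a denominator with catchable ones, a category that behaves perfectly still reports less than 100%. The version below keeps two denominators per category. `ScenarioResult`, `CategoryReport`, and `reportByCategory` are hypothetical names, not the identifiers used in the PR.

```typescript
// Hypothetical result record for one scenario run.
interface ScenarioResult {
  category: string;
  shouldCatch: boolean; // true: known-bad code; false: clean code (negative)
  flagged: boolean;     // what the gate actually did
}

interface CategoryReport {
  caught: number;
  catchable: number;      // denominator for catch rate
  falsePositives: number;
  negatives: number;      // denominator for false-positive rate
}

function reportByCategory(results: ScenarioResult[]): Map<string, CategoryReport> {
  const report = new Map<string, CategoryReport>();
  for (const r of results) {
    const entry =
      report.get(r.category) ??
      { caught: 0, catchable: 0, falsePositives: 0, negatives: 0 };
    if (r.shouldCatch) {
      entry.catchable += 1;
      if (r.flagged) entry.caught += 1;
    } else {
      entry.negatives += 1;
      if (r.flagged) entry.falsePositives += 1;
    }
    report.set(r.category, entry);
  }
  return report;
}

// Two bad snippets caught, one clean snippet correctly ignored.
const sampleResults: ScenarioResult[] = [
  { category: "hallucination", shouldCatch: true, flagged: true },
  { category: "hallucination", shouldCatch: true, flagged: true },
  { category: "hallucination", shouldCatch: false, flagged: false },
];

const byCategory = reportByCategory(sampleResults);
console.log(byCategory.get("hallucination"));
```

With a single shared denominator this category would read 2 of 3; with separate denominators it reads 2 of 2 caught and 0 of 1 false positives.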

rlaope added 2 commits April 9, 2026 11:04

rlaope merged commit 7e4c02d into main Apr 9, 2026
1 of 2 checks passed

rlaope deleted the feat/issue-49 branch April 9, 2026 02:17


