Problem
tests/datasets/golden-samples/deployment-skill/metadata.yaml contains expected_scores (overall_score: 0.87) and a Validation History row (2026-02-14 | 1.0.0 | 0.87 | PASS) that were hand-authored, not produced by any actual evaluation run. The deployment skill (generated output) does not even exist in the repo.
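For context, a sketch of the offending portion of metadata.yaml, using only the values cited above; the exact key names and layout are assumed, not copied from the repo:

```yaml
# tests/datasets/golden-samples/deployment-skill/metadata.yaml (assumed layout)
expected_scores:
  overall_score: 0.87   # hand-authored; no evaluation run produced this

validation_history:
  - date: 2026-02-14
    version: 1.0.0
    score: 0.87
    result: PASS        # implies a run that never happened
```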
Risk
The Validation History format implies empirical provenance that does not exist. Any contributor comparing benchmark results against these expected_scores is comparing against a made-up number.
Fix
Add a prominent disclaimer to metadata.yaml: expected_scores are aspirational targets, not recorded measurements. Annotate or remove the Validation History row.
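One possible shape for the disclaimer, keeping the cited value; the comment wording and key names are illustrative, not a repo convention:

```yaml
# Sketch of the fix described above (key names assumed)
expected_scores:
  # DISCLAIMER: aspirational targets, NOT recorded measurements.
  # No evaluation run has produced these numbers, and the deployment
  # skill output they describe does not exist in the repo.
  overall_score: 0.87
```

If the Validation History row is kept rather than removed, it should carry an equivalent annotation so its PASS entry cannot be mistaken for a real run.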
Related
Tracked in audit plan. Part of the tests/ audit series alongside #23–#29.