fix: deployment-skill golden sample contains fabricated evaluation scores #32

@williamhallatt

Description

Problem

tests/datasets/golden-samples/deployment-skill/metadata.yaml contains expected_scores (overall_score: 0.87) and a Validation History row (2026-02-14 | 1.0.0 | 0.87 | PASS) that were hand-authored, not produced by any actual evaluation run. The deployment skill (generated output) does not even exist in the repo.
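For reference, the flagged fields look roughly like this (reconstructed from the values cited above; the exact key layout and surrounding fields in the real file may differ):

```yaml
# tests/datasets/golden-samples/deployment-skill/metadata.yaml (excerpt, reconstructed)
expected_scores:
  overall_score: 0.87   # hand-authored; no evaluation run produced this number

# Validation History (as currently formatted):
# | Date       | Version | Score | Result |
# | 2026-02-14 | 1.0.0   | 0.87  | PASS   |   <- no corresponding eval run or output exists
```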

Risk

The Validation History format implies empirical provenance. Any contributor comparing benchmark results against these expected_scores is comparing against a made-up number.

Fix

Add a prominent disclaimer to metadata.yaml: expected_scores are aspirational targets, not recorded measurements. Annotate or remove the Validation History row.
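One way the disclaimer could read, assuming the existing `expected_scores` key structure (a sketch of intent, not final wording):

```yaml
# DISCLAIMER: the expected_scores below are hand-authored aspirational
# targets. They are NOT measurements from any recorded evaluation run
# and must not be used as a benchmark baseline.
expected_scores:
  overall_score: 0.87   # target, not measured
```

The Validation History row should either be deleted outright or annotated the same way, so its tabular format no longer implies empirical provenance.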

Related

Tracked in the audit plan. Part of the tests/ audit series alongside #23–#29.
