
Add Audacity eval harness and regression reporting #96

Open
sehawq wants to merge 1 commit into HKUDS:main from sehawq:codex/audacity-eval-harness
sehawq (Collaborator) commented Mar 17, 2026

Summary

  • Add a task-based evaluation harness for Audacity CLI with discovery, reporting, and baseline regression checks.
  • Introduce 4 stdlib-only eval tasks:
    • Project roundtrip (create → save → open → info)
    • Track + clip flow (add track, add clip, split)
    • Effects registry validation (normalize/fade_in)
    • WAV export (render_mix to file)
  • Add an eval command to the CLI with the --out, --baseline, --update-baseline, and --fail-on-regression flags.
  • Document eval usage and outputs in the Audacity README.
  • Add pytest coverage for task discovery, report generation, and baseline regression detection.
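
For reviewers who want to picture the flag surface, here is a minimal argparse sketch. Only the four flag names come from this PR; the parser name, defaults, help strings, and the assumption that --out and --baseline take paths while the other two are boolean switches are illustrative guesses, not the actual implementation.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: only the flag names are taken from the PR description.
    parser = argparse.ArgumentParser(prog="eval")
    # Assumed: directory that receives the reports and artifacts.
    parser.add_argument("--out", default="eval_out")
    # Assumed: path to a previously saved report used for comparison.
    parser.add_argument("--baseline", default=None)
    # Assumed: boolean switch that overwrites the baseline with this run.
    parser.add_argument("--update-baseline", action="store_true")
    # Assumed: boolean switch that turns a detected regression into a failure.
    parser.add_argument("--fail-on-regression", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--out", "reports", "--baseline", "baseline.json", "--fail-on-regression"]
)
print(args.out, args.baseline, args.update_baseline, args.fail_on_regression)
```

argparse maps the dashed flag names to attributes with underscores, so --fail-on-regression is read back as args.fail_on_regression.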

Why

  • Provides an objective, repeatable quality signal for release readiness.
  • Detects regressions between versions using a baseline comparison.
  • Keeps the MVP self-contained and portable (stdlib only, no external dependencies).

Implementation Details

  • New eval runner under audacity/agent-harness/cli_anything/audacity/eval/.
  • Task modules live in audacity/agent-harness/cli_anything/audacity/eval/tasks/.
  • Reports:
    • eval_report.json (machine-readable)
    • eval_report.md (human-readable)
    • artifacts/ for generated outputs
  • Baseline regression rules:
    • Any task that changes from pass to fail counts as a regression
    • Any decrease in the overall success_rate counts as a regression
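
The two baseline rules above can be sketched roughly as follows. The report layout used here (a "tasks" mapping of task name to "pass"/"fail" plus a top-level "success_rate") and the function name find_regressions are assumptions made for illustration; the real eval_report.json schema may differ.

```python
from typing import Any


def find_regressions(baseline: dict[str, Any], current: dict[str, Any]) -> list[str]:
    """Return regression messages; an empty list means no regression."""
    regressions: list[str] = []
    base_tasks = baseline.get("tasks", {})
    cur_tasks = current.get("tasks", {})

    # Rule 1: a task that passed in the baseline but fails now.
    # Tasks missing from the current run are ignored in this sketch.
    for name, base_status in base_tasks.items():
        if base_status == "pass" and cur_tasks.get(name) == "fail":
            regressions.append(f"task {name}: pass -> fail")

    # Rule 2: the overall success_rate went down.
    base_rate = baseline.get("success_rate", 0.0)
    cur_rate = current.get("success_rate", 0.0)
    if cur_rate < base_rate:
        regressions.append(f"success_rate: {base_rate:.2f} -> {cur_rate:.2f}")

    return regressions


# Hypothetical reports; the task names are made up for this example.
baseline_report = {
    "success_rate": 1.0,
    "tasks": {"project_roundtrip": "pass", "wav_export": "pass"},
}
current_report = {
    "success_rate": 0.5,
    "tasks": {"project_roundtrip": "pass", "wav_export": "fail"},
}

found = find_regressions(baseline_report, current_report)
print(found)
```

With --fail-on-regression, a non-empty result would presumably translate into a non-zero exit code, though that wiring is not shown here.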

Known Limitations

  • Eval tasks are Audacity-only (other harnesses not yet wired).
  • Baseline comparison is MVP-level (status + success_rate only, no metric thresholds).
  • No CI integration in this PR; the eval must be run manually.
  • SoX-backed tests require a local SoX binary on PATH for full E2E coverage.

Future Work

  • Add metric thresholds per task (e.g., duration, RMS/peak bounds).
  • Add a CI job to run eval and gate merges/releases.
  • Expand eval harness to other CLIs (Blender, GIMP, etc.).
  • Add YAML/JSON task definitions for non-code task authoring.
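
As a rough idea of what per-task metric thresholds could look like, the sketch below measures duration, peak, and RMS of a 16-bit PCM WAV file with the standard library only and checks them against bounds. Nothing here exists in this PR: the helper names (wav_metrics, check_thresholds), the bounds format, and the test tone are all hypothetical.

```python
import array
import math
import os
import sys
import tempfile
import wave


def wav_metrics(path: str) -> dict[str, float]:
    """Duration in seconds plus peak and RMS normalized to full scale (16-bit PCM only)."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        n_frames = wf.getnframes()
        rate = wf.getframerate()
        samples = array.array("h", wf.readframes(n_frames))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    peak = max((abs(s) for s in samples), default=0) / 32768
    rms = math.sqrt(sum(s * s for s in samples) / max(len(samples), 1)) / 32768
    return {"duration": n_frames / rate, "peak": peak, "rms": rms}


def check_thresholds(metrics: dict[str, float],
                     bounds: dict[str, tuple[float, float]]) -> list[str]:
    """Return one message per metric that falls outside its (low, high) bounds."""
    return [
        f"{name}={metrics[name]:.3f} outside [{low}, {high}]"
        for name, (low, high) in bounds.items()
        if not low <= metrics[name] <= high
    ]


# Write a 1-second, 440 Hz, half-scale mono test tone and measure it.
rate = 8000
tone = array.array(
    "h", (int(0.5 * 32767 * math.sin(2 * math.pi * 440 * i / rate)) for i in range(rate))
)
if sys.byteorder == "big":
    tone.byteswap()

with tempfile.TemporaryDirectory() as tmp:
    wav_path = os.path.join(tmp, "tone.wav")
    with wave.open(wav_path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(tone.tobytes())
    metrics = wav_metrics(wav_path)

# Hypothetical bounds for an export task.
violations = check_thresholds(
    metrics, {"duration": (0.9, 1.1), "peak": (0.1, 1.0), "rms": (0.05, 1.0)}
)
print(metrics, violations)
```

A threshold check like this could sit alongside the pass/fail status in the baseline comparison, so that a task which still "passes" but exports silence or a truncated file would be flagged.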

Tests

  • python -m pytest -v

sehawq force-pushed the codex/audacity-eval-harness branch from e9a5fae to 1cdc4e2 on March 17, 2026 16:25
sehawq requested a review from yuh-yang on March 17, 2026 16:53