Idea: Task Difficulty Tiers
Context
Some Terminal Bench tasks are solved on every run (log-summary-date-ranges, merge-diff-arc-agi-task), some are never solved (gpt2-codegolf, feal-linear-cryptanalysis), and some flip between solved and unsolved from run to run. The flippers are where the improvement ROI lives, but /evolve currently treats all 89 tasks equally.
Proposal
Build a difficulty matrix from bench_history.json cross-run data:
- Always-solve (solved in 80% or more of attempts): skip, or use as regression canaries
- Flippers (solved in 20% to just under 80% of attempts): highest improvement ROI, focus prompt/plan tuning here
- Never-solve (solved in under 20% of attempts): likely at the model ceiling, don't waste budget
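The tier rule above can be sketched in Python. The layout of bench_history.json is not specified here, so this sketch assumes a hypothetical shape: a list of runs, each a dict mapping task name to a boolean solved flag. The function names are illustrative, and treating exactly 80% as always-solve and exactly 20% as a flipper is an assumption about the boundaries.

```python
from collections import defaultdict

# Thresholds from the tier list. Boundary handling is an assumption:
# exactly 80% counts as always-solve, exactly 20% counts as a flipper.
ALWAYS_SOLVE_MIN = 0.80
NEVER_SOLVE_MAX = 0.20


def solve_rates(runs):
    """Return {task: fraction of attempts solved}.

    `runs` is assumed to be a list of {task_name: bool} dicts, one per run.
    A task missing from a run is not counted as an attempt.
    """
    attempts = defaultdict(int)
    solved = defaultdict(int)
    for run in runs:
        for task, ok in run.items():
            attempts[task] += 1
            if ok:
                solved[task] += 1
    return {task: solved[task] / attempts[task] for task in attempts}


def assign_tier(rate):
    """Map a solve rate in [0, 1] to one of the three tier names."""
    if rate >= ALWAYS_SOLVE_MIN:
        return "always-solve"
    if rate < NEVER_SOLVE_MAX:
        return "never-solve"
    return "flipper"


def build_tiers(runs):
    """Return {task: tier name} for every task seen in the history."""
    return {task: assign_tier(rate) for task, rate in solve_rates(runs).items()}
```

If the real file has a different structure, only solve_rates would need to change; assign_tier depends on nothing but the rate.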
Feed this into /evolve targeting so the improve phase focuses on flipping the flippers instead of trying to improve everything. The tiers are also useful for task ordering: attempting flippers first maximizes signal per dollar.
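One way to turn tiers into an attempt order and an improve-phase target list is sketched below. It assumes tiers are available as a plain {task: tier} dict using the tier names from this proposal. order_tasks and improve_targets are hypothetical helpers, not existing /evolve functions, and the relative order of the two non-flipper tiers is an arbitrary choice.

```python
# Lower number = attempted earlier. Flippers go first, as proposed; placing
# always-solve before never-solve is an assumption made for illustration.
TIER_PRIORITY = {"flipper": 0, "always-solve": 1, "never-solve": 2}


def order_tasks(tiers):
    """Return task names sorted flippers-first, then by name within a tier."""
    return sorted(tiers, key=lambda task: (TIER_PRIORITY[tiers[task]], task))


def improve_targets(tiers):
    """Return only the flipper tasks, sorted by name."""
    return sorted(task for task, tier in tiers.items() if tier == "flipper")
```

Sorting by name within a tier keeps the order deterministic across runs, which makes run-to-run comparisons easier to read.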
Why Now
With 12+ runs of history data, there should be enough signal to build usable tiers. Run 13 at 46% suggests the flipper tier is large and actionable.
Owner
Repo: sentinel + cata/bench/evolve
Priority signal: 🟡 important — focuses improvement effort
Open Questions
- How many runs are needed for a stable tier assignment?
- Should tiers be recomputed every run or cached?
- Does task ordering (attempt flippers first) require Harbor adapter changes?
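One possible way to approach the stability question is to compute a confidence interval on each task's solve rate and call the tier stable only when both ends of the interval fall in the same tier. The sketch below uses the Wilson score interval with z = 1.96 (roughly 95% confidence). The criterion itself, the z value, and the function names are assumptions for illustration, not something /evolve does today.

```python
import math


def wilson_interval(solved, attempts, z=1.96):
    """Wilson score interval for a binomial proportion, clamped to [0, 1]."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    p = solved / attempts
    z2 = z * z
    denom = 1 + z2 / attempts
    center = (p + z2 / (2 * attempts)) / denom
    half = z * math.sqrt(p * (1 - p) / attempts + z2 / (4 * attempts ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def assign_tier(rate):
    """Same thresholds as the tier list: >= 80%, 20% to under 80%, < 20%."""
    if rate >= 0.80:
        return "always-solve"
    if rate < 0.20:
        return "never-solve"
    return "flipper"


def tier_is_stable(solved, attempts, z=1.96):
    """True when the whole interval maps to a single tier (assumed criterion)."""
    low, high = wilson_interval(solved, attempts, z)
    return assign_tier(low) == assign_tier(high)
```

Running tier_is_stable over each task's solved and attempt counts from the history would show which assignments are still ambiguous at the current number of runs.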
Filed from /brainstorm session