Sentinel: task difficulty tiers from cross-run data

## Idea: Task Difficulty Tiers

### Context
Some Terminal Bench tasks are solved on every run (`log-summary-date-ranges`, `merge-diff-arc-agi-task`), some are never solved (`gpt2-codegolf`, `feal-linear-cryptanalysis`), and some flip between solved and unsolved from run to run. The flippers are where the improvement ROI lives, but `/evolve` currently treats all 89 tasks equally.

### Proposal
Build a difficulty matrix from the cross-run data in `bench_history.json`, assigning each task to one of three tiers by solve rate:
- **Always-solve** (solved in 80% or more of attempts): skip, or use as regression canaries
- **Flippers** (solved in at least 20% but under 80% of attempts): highest improvement ROI, focus prompt/plan tuning here
- **Never-solve** (solved in under 20% of attempts): likely at the model ceiling, don't spend budget here
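
The tiering above could be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the shape of `bench_history.json` is not specified in this issue, so the code assumes a list of runs, each with a `results` mapping from task name to a solved boolean. The function names, the boundary handling at exactly 20% and 80%, and the toy task names are all hypothetical.

```python
from collections import defaultdict

# Thresholds from the proposal. Boundary handling is an assumption:
# exactly 80% counts as always-solve, exactly 20% counts as flipper.
ALWAYS_SOLVE_MIN = 0.8
NEVER_SOLVE_MAX = 0.2


def solve_rates(history):
    """Return {task: fraction of attempts solved}.

    Assumes `history` is a list of runs shaped like
    {"run": 1, "results": {"task-name": True}}; the real schema may differ.
    """
    solved = defaultdict(int)
    attempts = defaultdict(int)
    for run in history:
        for task, ok in run["results"].items():
            attempts[task] += 1
            if ok:
                solved[task] += 1
    return {task: solved[task] / attempts[task] for task in attempts}


def tier(rate):
    """Map a solve rate in [0, 1] to a tier name."""
    if rate >= ALWAYS_SOLVE_MIN:
        return "always-solve"
    if rate < NEVER_SOLVE_MAX:
        return "never-solve"
    return "flipper"


def build_tiers(history):
    """Return {task: tier name} for every task seen in the history."""
    return {task: tier(rate) for task, rate in solve_rates(history).items()}


# Toy history with made-up task names, not real bench data.
example_history = [
    {"run": 1, "results": {"task-a": True, "task-b": False, "task-c": True}},
    {"run": 2, "results": {"task-a": True, "task-b": False, "task-c": False}},
]
example_tiers = build_tiers(example_history)
```

In practice `history` would come from something like `json.load(open("bench_history.json"))`, adapted to whatever structure the file really has. Counting attempts per task, rather than dividing by the number of runs, keeps the rate meaningful for tasks that were skipped in some runs.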

Feed this into `/evolve` targeting so the `improve` phase focuses on flipping the flippers instead of trying to improve everything. The tiers are also useful for task ordering: attempting flippers first maximizes signal per dollar.
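
As a rough sketch of the targeting and ordering idea, assuming a task-to-tier mapping is already available: the names `TIER_PRIORITY`, `order_tasks`, and `improve_targets` are hypothetical, and the relative order of the two non-flipper tiers is an assumption, since the proposal only says flippers go first. How this would plug into `/evolve` or the Harbor adapter is not covered here.

```python
# Lower number = attempted earlier. Flippers first, per the proposal;
# placing always-solve before never-solve is an assumption.
TIER_PRIORITY = {"flipper": 0, "always-solve": 1, "never-solve": 2}


def order_tasks(tiers):
    """Sort task names by tier priority, then by name for a stable order."""
    return sorted(tiers, key=lambda task: (TIER_PRIORITY[tiers[task]], task))


def improve_targets(tiers):
    """Return only the flipper tasks, i.e. what an improve phase would target."""
    return [task for task in order_tasks(tiers) if tiers[task] == "flipper"]


# Toy tier mapping with made-up task names.
sample_tiers = {
    "task-a": "always-solve",
    "task-b": "never-solve",
    "task-c": "flipper",
    "task-d": "flipper",
}
ordered = order_tasks(sample_tiers)
targets = improve_targets(sample_tiers)
```

Sorting by name within a tier keeps the order deterministic across runs, which makes run-to-run comparisons easier to read.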

### Why Now
With 12+ runs of history data, there should be enough signal to build reliable tiers. Run 13 scored 46%, which suggests the flipper tier is large and actionable.

### Owner
**Repo:** sentinel + cata/bench/evolve
**Priority signal:** 🟡 important — focuses improvement effort

### Open Questions
- How many runs are needed for a stable tier assignment?
- Should tiers be recomputed every run or cached?
- Does task ordering (attempt flippers first) require Harbor adapter changes?

---
*Filed from /brainstorm session*

Sentinel: task difficulty tiers from cross-run data #1518
