This repository was archived by the owner on Apr 12, 2026. It is now read-only.

Sentinel: task difficulty tiers from cross-run data #1518

@jpleva91

Description


Idea: Task Difficulty Tiers

Context

Some Terminal Bench tasks are solved on every run (log-summary-date-ranges, merge-diff-arc-agi-task), some are never solved (gpt2-codegolf, feal-linear-cryptanalysis), and some flip between runs. The flippers are where all the improvement ROI lives, but /evolve currently treats all 89 tasks equally.

Proposal

Build a difficulty matrix from bench_history.json cross-run data:

  • Always-solve (solved in 80%+ of attempts): skip, or use as regression canaries
  • Flippers (solved in 20-79% of attempts): highest improvement ROI; focus prompt/plan tuning here
  • Never-solve (solved in under 20% of attempts): model ceiling; don't waste budget
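A minimal sketch of the tiering, assuming bench_history.json maps each task name to a list of per-run pass/fail booleans (the schema here is an assumption, not confirmed from the repo; the 80% boundary is resolved in favor of always-solve):

```python
import json

def tier(solve_rate: float) -> str:
    # Thresholds from the proposal: 80%+ always-solve, under 20% never-solve.
    if solve_rate >= 0.8:
        return "always"
    if solve_rate < 0.2:
        return "never"
    return "flipper"

def build_tiers(history: dict[str, list[bool]]) -> dict[str, str]:
    # Skip tasks with no recorded attempts to avoid division by zero.
    return {task: tier(sum(runs) / len(runs))
            for task, runs in history.items() if runs}

# Usage (assumed file layout):
# with open("bench_history.json") as f:
#     tiers = build_tiers(json.load(f))
```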

Feed this into /evolve targeting so the improve phase focuses on flipping the flippers instead of "improve everything." Also useful for task ordering: attempt flippers first to maximize signal per dollar.
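The flippers-first ordering could be as simple as a stable sort over the tier map from above (names and priorities here are illustrative, not an existing /evolve API):

```python
# Lower number = attempted earlier; flippers carry the most signal per dollar.
TIER_PRIORITY = {"flipper": 0, "always": 1, "never": 2}

def order_tasks(tiers: dict[str, str]) -> list[str]:
    # Stable sort: within a tier, the original task order is preserved.
    return sorted(tiers, key=lambda task: TIER_PRIORITY[tiers[task]])
```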

Why Now

With 12+ runs of history data, there's enough signal to build reliable tiers. Run 13 at 46% means the flipper tier is large and actionable.

Owner

Repo: sentinel + cata/bench/evolve
Priority signal: 🟡 important — focuses improvement effort

Open Questions

  • How many runs needed for a stable tier assignment?
  • Should tiers be recomputed every run or cached?
  • Does task ordering (attempt flippers first) require Harbor adapter changes?
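On the stability question, one cheap concrete test (a sketch, not a proposal for the final criterion): call a tier assignment stable when one additional run, pass or fail, cannot move the task across a threshold.

```python
def tier(solve_rate: float) -> str:
    if solve_rate >= 0.8:
        return "always"
    if solve_rate < 0.2:
        return "never"
    return "flipper"

def is_stable(solved: int, runs: int) -> bool:
    # Stable iff the tier is unchanged whether the next run passes or fails.
    current = tier(solved / runs)
    return (tier((solved + 1) / (runs + 1)) == current
            and tier(solved / (runs + 1)) == current)
```

For example, 6/12 solved stays a flipper either way, while 10/12 sits close enough to the 80% boundary that a single failed run demotes it.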

Filed from /brainstorm session

Metadata



    Labels

      • agent:claimed: Agent dispatched; do not re-dispatch
      • enhancement: New feature or request
