Perception benchmark: fix wallclock-time bug + ablation infrastructure#21

Open
trisha-ant wants to merge 4 commits into pskeshu:main from trisha-ant:tb/perception-accuracy

Conversation

@trisha-ant trisha-ant commented Apr 9, 2026

Problem

The perception benchmark (benchmarks/perception/runner.py) was unrunnable
on main until #fix-perception-engine-loop, and once runnable it locks the
engine onto a single early stage for the entire run:

  • Opus 4.5, embryo_2 T33–84: predicts comma for 35 consecutive frames →
    25.0% exact / 84.6% adjacent
  • Sonnet 4.5, same window: predicts bean for all 52 frames →
    11.5% exact / 23.1% adjacent

The model isn't ignoring the image — it's obeying a quantitative signal
that says "don't advance yet."

Root cause

PerceptionEngine._build_prompt() includes a TEMPORAL CONTEXT block:
"Time at this stage: <N> minutes / Expected: <M> / Overtime: <N/M>×".
The minutes come from PerceptionSession.compute_temporal_analysis(),
which diffs observation timestamps.

runner.py called session.add_observation() without timestamp=,
so it defaulted to datetime.now() — i.e., the moment the benchmark code
processed that frame (~19s/frame), not when the embryo was at that stage
(~4min/frame). 22 frames of "comma" → 7 wallclock minutes vs comma's 30-min
typical duration → overtime 0.2×. The model is told it's 20% through comma,
so it stays. Direct quote from a reasoning trace at T55 (GT=1.5fold,
pred=comma):

"The embryo has been at comma for 7 minutes out of an expected 30
minutes — well within normal duration."

In a real timelapse those 22 frames are 88 minutes → 2.9× overtime →
"unusual, re-examine."
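The arithmetic above can be reproduced in a few lines. This is a minimal sketch, not the engine's actual code: `overtime_ratio` and the frame counts are illustrative, with the ~19 s/frame processing rate and ~4 min/frame acquisition interval taken from the numbers in this PR.

```python
from datetime import datetime, timedelta

def overtime_ratio(timestamps, expected_min=30.0):
    """Minutes elapsed across the stage's observations / typical duration."""
    elapsed = (timestamps[-1] - timestamps[0]).total_seconds() / 60.0
    return elapsed / expected_min

t0 = datetime(2025, 6, 1, 12, 0, 0)

# Buggy path: timestamps default to datetime.now() while the benchmark
# processes frames (~19 s apart).
wallclock = [t0 + timedelta(seconds=19 * i) for i in range(22)]

# Fixed path: real acquisition times parsed from the TIFF filenames
# (~4 min apart).
acquired = [t0 + timedelta(minutes=4 * i) for i in range(22)]

print(f"wallclock: {overtime_ratio(wallclock):.1f}x")  # 0.2x -> "stay at comma"
print(f"real:      {overtime_ratio(acquired):.1f}x")   # 2.8x -> "re-examine"
```

Same 22 frames, same images, opposite temporal signal: a 0.2× overtime tells the model it is early in the stage; ~2.8–2.9× tells it something is off.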

Fix

  • runner.py: pass timestamp=test_case.acquired_at (parsed from the
    TIFF filename YYYYmmdd_HHMMSS) to session.add_observation().
  • testset.py: expose TestCase.acquired_at (timestamp was already
    parsed for sorting, then discarded).
  • session.py: compute_temporal_analysis uses
    observations[-1].timestamp instead of datetime.now() so historical
    timestamps work.
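For the testset.py piece, a sketch of how the acquisition time can be recovered from the filename. The `YYYYmmdd_HHMMSS` format is from this PR; the helper name, regex, and example filename are illustrative, not the actual code:

```python
import re
from datetime import datetime

def parse_acquired_at(filename: str) -> datetime:
    """Recover the acquisition time embedded in a TIFF filename."""
    m = re.search(r"(\d{8}_\d{6})", filename)
    if m is None:
        raise ValueError(f"no YYYYmmdd_HHMMSS stamp in {filename!r}")
    return datetime.strptime(m.group(1), "%Y%m%d_%H%M%S")

ts = parse_acquired_at("embryo_2_20250601_120400.tif")
print(ts)  # 2025-06-01 12:04:00
```

The runner-side change is then just `session.add_observation(..., timestamp=test_case.acquired_at)` instead of letting the timestamp default to `datetime.now()`.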

Also in this PR (enabling the testing below):

  • runner.py now calls load_organism("celegans") — previously crashed
    with RuntimeError: No organism loaded when run standalone.
  • New --start-timepoint N flag to skip the ~33–50 easy early frames
    and target hard stages directly.
  • --no-temporal-context / --no-previous-observations ablation toggles
    on PerceptionEngine (default-on, live behavior unchanged).
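The flag names above are from this PR; the argparse wiring below is an illustrative sketch of how they might be exposed, not the actual runner.py code:

```python
import argparse

parser = argparse.ArgumentParser(prog="runner.py")
parser.add_argument("--start-timepoint", type=int, default=0,
                    help="skip frames before T<N> (avoids the easy 'early' frames)")
parser.add_argument("--no-temporal-context", action="store_true",
                    help="omit the TEMPORAL CONTEXT block from the prompt")
parser.add_argument("--no-previous-observations", action="store_true",
                    help="omit the PREVIOUS OBSERVATIONS block")

args = parser.parse_args(["--start-timepoint", "33", "--no-temporal-context"])
print(args.start_timepoint, args.no_temporal_context)  # 33 True
```

Both ablation toggles are store-true with default off, so the default run is byte-for-byte the live prompt.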

Testing — ablation matrix

embryo_2 T33–84 (52 frames: bean 6, comma 6, 1.5fold 15, 2fold 20,
pretzel 5), GENTLY_MODEL_PERCEPTION=claude-sonnet-4-5-20250929, 5 runs:

| run | TEMPORAL block | PREVIOUS OBS | exact | adj | pattern |
|---|---|---|---|---|---|
| baseline | wallclock (bug) | on | 11.5% | 23.1% | bean[33–84] — frozen |
| A | off | on | 23.1% | 57.7% | races ahead — hatched by T77 |
| B | wallclock (bug) | off | 23.1% | 59.6% | bean[33–60], then clean advance |
| C | off | off | 15.4% | 46.2% | chaotic frame-to-frame noise |
| D (this PR) | real timestamps | on | 23.1% | 59.6% | bean[33–62], then clean advance — pretzel 100% |

Per-stage exact accuracy:

| stage (n) | baseline | A | B | C | D |
|---|---|---|---|---|---|
| bean (6) | 100% | 100% | 100% | 100% | 100% |
| comma (6) | 0% | 83% | 0% | 0% | 0% |
| 1.5fold (15) | 0% | 7% | 0% | 0% | 0% |
| 2fold (20) | 0% | 0% | 15% | 5% | 5% |
| pretzel (5) | 0% | 0% | 60% | 20% | 100% |

Conclusions:

  • Buggy temporal + previous-obs together cause total paralysis (baseline
    worst by 2×).
  • Removing either anchor doubles accuracy; removing both (C) drops back
    to 15% with no frame-to-frame coherence — the anchors are load-bearing
    when correct.
  • D (the fix) matches B's accuracy without removing the feature, and
    gets the best late-stage tracking (pretzel 100%).

Behavior change to flag for review

session.py's now = observations[-1].timestamp is not strictly
neutral for the live agent: in manager.py, perceive() runs before
add_observation(), so observations[-1] is the previous frame. This
shifts reported time-in-stage down by ~one acquisition interval (~4 min)
and the 225-min arrest threshold fires one frame later. Arguably more
correct ("time observed in stage" rather than "time since first observation
including the current gap"), but it is a small live-path change.

Cosmetic: total_session_duration_min goes negative under benchmark mode
(created_at is wallclock 2026, now is TIFF 2025). That field never
reaches the prompt or arrest logic.

Why the middle stages are still 0% — and the path forward

Run D's per-stage result is bimodal: bean 100%, comma 0%, 1.5fold 0%,
2fold 5%, pretzel 100%. Prediction streak: bean[33–62] → comma[63–65] →
1.5fold[66–67] → 2fold[68] → pretzel[69–84]. The fix changes the failure
from "never moves" (baseline) to "moves correctly but ~25 frames late."
By the time D breaks out of bean at T63, the real embryo is already at
2fold, so comma and 1.5fold are entirely missed. Once it catches up, it
tracks pretzel perfectly.

The lag persists because the visual discriminator can't tell bean ↔
comma ↔ 1.5fold apart on these projections. Even when the (now-correct)
temporal block says "you're at 2× overtime, re-examine," the model looks
again, applies the prompt's only fold rule, and concludes "still looks
like bean":

  • organisms/celegans/perception_prompt.py:45-62 reduces the entire
    comma↔fold decision to "In XZ, are masses side-by-side or stacked?"
    which doesn't survive max-intensity projection. Trace at e2 T62
    (GT=2fold, pred=comma, conf 0.88): "XZ shows lobes at same vertical
    level, not stacked."
    The model applies the rule correctly; the rule is wrong for this
    rendering. The 82%-accurate gently-perception/perception/scientific.py:67-71
    variant uses eggshell-fill-fraction instead (sparse → ≤1.5fold,
    moderate → 2fold, dense → pretzel), which is visible in projections.
  • organisms/celegans/stages.py:100-247 has detailed STAGE_CRITERIA
    with NOT_if exclusions and a ready format_stage_criteria_for_prompt()
    (line 363) — but the engine never calls it. The system prompt has ~24
    lines on early/bean/comma and 1 line each on fold stages, so the model
    has rich criteria for entering an early stage and almost nothing for
    leaving it.
  • engine.py:240 gates request_verification on conf < 0.7, but the
    errors are confidently wrong (41/52 had conf ≥ 0.79), so verification
    never fires. Triggering it on stagnation (frames-in-stage > expected)
    rather than low confidence would let the subagent break the tie.
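A sketch of the stagnation-based trigger suggested in the last bullet. This is a hypothetical replacement for the engine.py:240 gate, not code from this PR; the function and parameter names are illustrative:

```python
def should_request_verification(confidence: float,
                                frames_in_stage: int,
                                expected_frames: int,
                                conf_threshold: float = 0.7) -> bool:
    """Fire verification on low confidence OR on stagnation."""
    if confidence < conf_threshold:           # existing engine.py gate
        return True
    return frames_in_stage > expected_frames  # proposed stagnation gate

# Confidently wrong and stuck: the conf gate never fires (41/52 errors
# had conf >= 0.79), but the stagnation gate does.
print(should_request_verification(0.88, frames_in_stage=12, expected_frames=8))
```

Keeping the low-confidence branch and adding the stagnation branch means the subagent still fires in the cases it covers today, plus the confidently-wrong stage-lock case it currently misses.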

This PR fixes the brake; a follow-up PR fixes the eyes. The harness
changes here (--start-timepoint, ablation toggles, real timestamps) are
what make those prompt-level improvements measurable on the hard stages
without burning ~45 API calls on early first.

Stacked on

fix-perception-engine-loop — the engine was unrunnable on main
(tuple-unpack crash in perceive()). Retarget to main after that merges.

_run_reasoning_loop returned 2-tuples but perceive() unpacked 3, and
_check_interval_rules passed an unsupported timepoint= kwarg to
IntervalRule.matches(). Both errors were swallowed by broad except
handlers, silently degrading every prediction to "early" and preventing
interval rules from ever firing. Also adds initial_stage/initial_confidence
to PerceptionResult and completes the messages history at early returns so
the multishot scaffold is type-clean and continuable.

The runner previously crashed with 'No organism loaded' because it never
called load_organism() (launch_gently.py does that, but the benchmark is
a separate entrypoint). Hardcode celegans for now.

Also add --start-timepoint so hard stages (comma/1.5fold/2fold/pretzel,
which start at T39-T90) can be targeted without burning API calls on the
~33-50 easy 'early' frames that precede them. iter_embryo() already
supported the parameter; this just threads it through the CLI/config.

Adds three runner CLI flags for isolating the comma-lock cause:
  --no-temporal-context        omit TEMPORAL CONTEXT block from prompt
  --no-previous-observations   omit PREVIOUS OBSERVATIONS block
  --real-timestamps            pass TIFF acquisition time to session

Threaded through BenchmarkConfig -> to_dict() -> PerceptionEngine ctor /
session.add_observation(timestamp=). All default to current behavior.

testset.py: expose TestCase.acquired_at parsed from TIFF YYYYmmdd_HHMMSS
filenames (was already parsed for sorting, then discarded).

session.py: compute_temporal_analysis now uses observations[-1].timestamp
instead of datetime.now() so real (historical) timestamps work. NOTE: in
the live-agent path (manager.py), perceive() runs before add_observation,
so this shifts the reported time-in-stage down by one acquisition interval
(~4 min) and the 225-min arrest threshold fires one frame later. Arguably
more correct (time *observed* in stage), but it is a small behavior change,
not a no-op.

total_session_duration_min goes negative under --real-timestamps (created_at
is wallclock vs now is TIFF time); cosmetic only, never reaches the prompt.

Ablation confirmed the wallclock-time bug is the dominant cause of the
perception engine's stage-lock in benchmarks. Make the fix unconditional:
pass test_case.acquired_at to session.add_observation() instead of letting
it default to datetime.now(). Removes the --real-timestamps flag (was for
ablation only).
@trisha-ant trisha-ant marked this pull request as ready for review April 9, 2026 18:52
