Perception benchmark: fix wallclock-time bug + ablation infrastructure#21

Open
trisha-ant wants to merge 4 commits into pskeshu:main from trisha-ant:tb/perception-accuracy

Conversation

@trisha-ant trisha-ant commented Apr 9, 2026

Problem

The perception benchmark (benchmarks/perception/runner.py) was unrunnable
on main until #fix-perception-engine-loop, and once runnable it locks the
engine onto a single early stage for the entire run:

  • Opus 4.5, embryo_2 T33–84: predicts comma for 35 consecutive frames →
    25.0% exact / 84.6% adjacent
  • Sonnet 4.5, same window: predicts bean for all 52 frames →
    11.5% exact / 23.1% adjacent

The model isn't ignoring the image — it's obeying a quantitative signal
that says "don't advance yet."

Root cause

PerceptionEngine._build_prompt() includes a TEMPORAL CONTEXT block:
"Time at this stage: <N> minutes / Expected: <M> / Overtime: <N/M>×".
The minutes come from PerceptionSession.compute_temporal_analysis(),
which diffs observation timestamps.

runner.py called session.add_observation() without timestamp=,
so it defaulted to datetime.now() — i.e., the moment the benchmark code
processed that frame (~19s/frame), not when the embryo was at that stage
(~4min/frame). 22 frames of "comma" → 7 wallclock minutes vs comma's 30-min
typical duration → overtime 0.2×. The model is told it's 20% through comma,
so it stays. Direct quote from a reasoning trace at T55 (GT=1.5fold,
pred=comma):

"The embryo has been at comma for 7 minutes out of an expected 30
minutes — well within normal duration."

In a real timelapse those 22 frames are 88 minutes → 2.9× overtime →
"unusual, re-examine."
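The arithmetic above can be reproduced in a few lines. This is a minimal sketch, not the engine's actual code: `overtime_ratio` and the frame counts are illustrative, with the ~19 s/frame processing rate and ~4 min/frame acquisition interval taken from the numbers in this PR.

```python
from datetime import datetime, timedelta

def overtime_ratio(timestamps, expected_min=30.0):
    """Minutes elapsed across the stage's observations / typical duration."""
    elapsed = (timestamps[-1] - timestamps[0]).total_seconds() / 60.0
    return elapsed / expected_min

t0 = datetime(2025, 6, 1, 12, 0, 0)

# Buggy path: timestamps default to datetime.now() while the benchmark
# processes frames (~19 s apart).
wallclock = [t0 + timedelta(seconds=19 * i) for i in range(22)]

# Fixed path: real acquisition times parsed from the TIFF filenames
# (~4 min apart).
acquired = [t0 + timedelta(minutes=4 * i) for i in range(22)]

print(f"wallclock: {overtime_ratio(wallclock):.1f}x")  # 0.2x -> "stay at comma"
print(f"real:      {overtime_ratio(acquired):.1f}x")   # 2.8x -> "re-examine"
```

Same 22 frames, same images, opposite temporal signal: a 0.2× overtime tells the model it is early in the stage; ~2.8–2.9× tells it something is off.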

Fix

  • runner.py: pass timestamp=test_case.acquired_at (parsed from the
    TIFF filename YYYYmmdd_HHMMSS) to session.add_observation().
  • testset.py: expose TestCase.acquired_at (timestamp was already
    parsed for sorting, then discarded).
  • session.py: compute_temporal_analysis uses
    observations[-1].timestamp instead of datetime.now() so historical
    timestamps work.
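For the testset.py piece, a sketch of how the acquisition time can be recovered from the filename. The `YYYYmmdd_HHMMSS` format is from this PR; the helper name, regex, and example filename are illustrative, not the actual code:

```python
import re
from datetime import datetime

def parse_acquired_at(filename: str) -> datetime:
    """Recover the acquisition time embedded in a TIFF filename."""
    m = re.search(r"(\d{8}_\d{6})", filename)
    if m is None:
        raise ValueError(f"no YYYYmmdd_HHMMSS stamp in {filename!r}")
    return datetime.strptime(m.group(1), "%Y%m%d_%H%M%S")

ts = parse_acquired_at("embryo_2_20250601_120400.tif")
print(ts)  # 2025-06-01 12:04:00
```

The runner-side change is then just `session.add_observation(..., timestamp=test_case.acquired_at)` instead of letting the timestamp default to `datetime.now()`.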

Also in this PR (enabling the testing below):

  • runner.py now calls load_organism("celegans") — previously crashed
    with RuntimeError: No organism loaded when run standalone.
  • New --start-timepoint N flag to skip the ~33–50 easy early frames
    and target hard stages directly.
  • --no-temporal-context / --no-previous-observations ablation toggles
    on PerceptionEngine (default-on, live behavior unchanged).
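The flag names above are from this PR; the argparse wiring below is an illustrative sketch of how they might be exposed, not the actual runner.py code:

```python
import argparse

parser = argparse.ArgumentParser(prog="runner.py")
parser.add_argument("--start-timepoint", type=int, default=0,
                    help="skip frames before T<N> (avoids the easy 'early' frames)")
parser.add_argument("--no-temporal-context", action="store_true",
                    help="omit the TEMPORAL CONTEXT block from the prompt")
parser.add_argument("--no-previous-observations", action="store_true",
                    help="omit the PREVIOUS OBSERVATIONS block")

args = parser.parse_args(["--start-timepoint", "33", "--no-temporal-context"])
print(args.start_timepoint, args.no_temporal_context)  # 33 True
```

Both ablation toggles are store-true with default off, so the default run is byte-for-byte the live prompt.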

Testing — ablation matrix

embryo_2 T33–84 (52 frames: bean 6, comma 6, 1.5fold 15, 2fold 20,
pretzel 5), GENTLY_MODEL_PERCEPTION=claude-sonnet-4-5-20250929, 5 runs:

| run | TEMPORAL block | PREVIOUS OBS | exact | adj | pattern |
|---|---|---|---|---|---|
| baseline | wallclock (bug) | on | 11.5% | 23.1% | bean[33–84] — frozen |
| A | off | on | 23.1% | 57.7% | races ahead — hatched by T77 |
| B | wallclock (bug) | off | 23.1% | 59.6% | bean[33–60], then clean advance |
| C | off | off | 15.4% | 46.2% | chaotic frame-to-frame noise |
| D (this PR) | real timestamps | on | 23.1% | 59.6% | bean[33–62], then clean advance — pretzel 100% |

Per-stage exact accuracy:

| stage (n) | baseline | A | B | C | D |
|---|---|---|---|---|---|
| bean (6) | 100% | 100% | 100% | 100% | 100% |
| comma (6) | 0% | 83% | 0% | 0% | 0% |
| 1.5fold (15) | 0% | 7% | 0% | 0% | 0% |
| 2fold (20) | 0% | 0% | 15% | 5% | 5% |
| pretzel (5) | 0% | 0% | 60% | 20% | 100% |

Conclusions:

  • Buggy temporal + previous-obs together cause total paralysis (baseline
    worst by 2×).
  • Removing either anchor doubles accuracy; removing both (C) drops back
    to 15% with no frame-to-frame coherence — the anchors are load-bearing
    when correct.
  • D (the fix) matches B's accuracy without removing the feature, and
    gets the best late-stage tracking (pretzel 100%).

Behavior change to flag for review

session.py's now = observations[-1].timestamp is not strictly
neutral for the live agent: in manager.py, perceive() runs before
add_observation(), so observations[-1] is the previous frame. This
shifts reported time-in-stage down by ~one acquisition interval (~4 min)
and the 225-min arrest threshold fires one frame later. Arguably more
correct ("time observed in stage" rather than "time since first observation
including the current gap"), but it is a small live-path change.

Cosmetic: total_session_duration_min goes negative under benchmark mode
(created_at is wallclock 2026, now is TIFF 2025). That field never
reaches the prompt or arrest logic.

Why the middle stages are still 0% — and the path forward

Run D's per-stage result is bimodal: bean 100%, comma 0%, 1.5fold 0%,
2fold 5%, pretzel 100%. Prediction streak: bean[33–62] → comma[63–65] →
1.5fold[66–67] → 2fold[68] → pretzel[69–84]. The fix changes the failure
from "never moves" (baseline) to "moves correctly but ~25 frames late."
By the time D breaks out of bean at T63, the real embryo is already at
2fold, so comma and 1.5fold are entirely missed. Once it catches up, it
tracks pretzel perfectly.

The lag persists because the visual discriminator can't tell bean ↔
comma ↔ 1.5fold apart on these projections. Even when the (now-correct)
temporal block says "you're at 2× overtime, re-examine," the model looks
again, applies the prompt's only fold rule, and concludes "still looks
like bean":

  • organisms/celegans/perception_prompt.py:45-62 reduces the entire
    comma↔fold decision to "In XZ, are masses side-by-side or stacked?"
    which doesn't survive max-intensity projection. Trace at e2 T62
    (GT=2fold, pred=comma, conf 0.88): "XZ shows lobes at same vertical
    level, not stacked."
    The model applies the rule correctly; the rule is wrong for this
    rendering. The 82%-accurate gently-perception/perception/scientific.py:67-71
    variant uses eggshell-fill-fraction instead (sparse → ≤1.5fold,
    moderate → 2fold, dense → pretzel), which is visible in projections.
  • organisms/celegans/stages.py:100-247 has detailed STAGE_CRITERIA
    with NOT_if exclusions and a ready format_stage_criteria_for_prompt()
    (line 363) — but the engine never calls it. The system prompt has ~24
    lines on early/bean/comma and 1 line each on fold stages, so the model
    has rich criteria for entering an early stage and almost nothing for
    leaving it.
  • engine.py:240 gates request_verification on conf < 0.7, but the
    errors are confidently wrong (41/52 had conf ≥ 0.79), so verification
    never fires. Triggering it on stagnation (frames-in-stage > expected)
    rather than low confidence would let the subagent break the tie.
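A sketch of the stagnation-based trigger suggested in the last bullet. This is a hypothetical replacement for the engine.py:240 gate, not code from this PR; the function and parameter names are illustrative:

```python
def should_request_verification(confidence: float,
                                frames_in_stage: int,
                                expected_frames: int,
                                conf_threshold: float = 0.7) -> bool:
    """Fire verification on low confidence OR on stagnation."""
    if confidence < conf_threshold:           # existing engine.py gate
        return True
    return frames_in_stage > expected_frames  # proposed stagnation gate

# Confidently wrong and stuck: the conf gate never fires (41/52 errors
# had conf >= 0.79), but the stagnation gate does.
print(should_request_verification(0.88, frames_in_stage=12, expected_frames=8))
```

Keeping the low-confidence branch and adding the stagnation branch means the subagent still fires in the cases it covers today, plus the confidently-wrong stage-lock case it currently misses.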

This PR fixes the brake; a follow-up PR fixes the eyes. The harness
changes here (--start-timepoint, ablation toggles, real timestamps) are
what make those prompt-level improvements measurable on the hard stages
without burning ~45 API calls on early first.

Stacked on

fix-perception-engine-loop — the engine was unrunnable on main
(tuple-unpack crash in perceive()). Retarget to main after that merges.

_run_reasoning_loop returned 2-tuples but perceive() unpacked 3, and
_check_interval_rules passed an unsupported timepoint= kwarg to
IntervalRule.matches(). Both errors were swallowed by broad except
handlers, silently degrading every prediction to "early" and preventing
interval rules from ever firing. Also adds initial_stage/initial_confidence
to PerceptionResult and completes the messages history at early returns so
the multishot scaffold is type-clean and continuable.

The runner previously crashed with 'No organism loaded' because it never
called load_organism() (launch_gently.py does that, but the benchmark is
a separate entrypoint). Hardcode celegans for now.

Also add --start-timepoint so hard stages (comma/1.5fold/2fold/pretzel,
which start at T39-T90) can be targeted without burning API calls on the
~33-50 easy 'early' frames that precede them. iter_embryo() already
supported the parameter; this just threads it through the CLI/config.

Adds three runner CLI flags for isolating the comma-lock cause:
  --no-temporal-context        omit TEMPORAL CONTEXT block from prompt
  --no-previous-observations   omit PREVIOUS OBSERVATIONS block
  --real-timestamps            pass TIFF acquisition time to session

Threaded through BenchmarkConfig -> to_dict() -> PerceptionEngine ctor /
session.add_observation(timestamp=). All default to current behavior.

testset.py: expose TestCase.acquired_at parsed from TIFF YYYYmmdd_HHMMSS
filenames (was already parsed for sorting, then discarded).

session.py: compute_temporal_analysis now uses observations[-1].timestamp
instead of datetime.now() so real (historical) timestamps work. NOTE: in
the live-agent path (manager.py), perceive() runs before add_observation,
so this shifts the reported time-in-stage down by one acquisition interval
(~4 min) and the 225-min arrest threshold fires one frame later. Arguably
more correct (time *observed* in stage), but it is a small behavior change,
not a no-op.

total_session_duration_min goes negative under --real-timestamps (created_at
is wallclock vs now is TIFF time); cosmetic only, never reaches the prompt.

Ablation confirmed the wallclock-time bug is the dominant cause of the
perception engine's stage-lock in benchmarks. Make the fix unconditional:
pass test_case.acquired_at to session.add_observation() instead of letting
it default to datetime.now(). Removes the --real-timestamps flag (was for
ablation only).
@trisha-ant trisha-ant marked this pull request as ready for review April 9, 2026 18:52
