Perception benchmark: fix wallclock-time bug + ablation infrastructure#21
Open
trisha-ant wants to merge 4 commits into pskeshu:main from
Conversation
`_run_reasoning_loop` returned 2-tuples but `perceive()` unpacked 3, and `_check_interval_rules` passed an unsupported `timepoint=` kwarg to `IntervalRule.matches()`. Both errors were swallowed by broad `except` handlers, silently degrading every prediction to "early" and preventing interval rules from ever firing. Also adds `initial_stage`/`initial_confidence` to `PerceptionResult` and completes the `messages` history at early returns so the multishot scaffold is type-clean and continuable.
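The failure mode is easy to reproduce in isolation. This is a minimal sketch with hypothetical names mirroring the description above, not the actual engine code:

```python
def _run_reasoning_loop():
    # Returned 2-tuples (stage, confidence)...
    return ("comma", 0.9)

def perceive():
    try:
        # ...but the caller unpacked 3 values, so every call raised
        # ValueError before the prediction could be used.
        stage, confidence, messages = _run_reasoning_loop()
    except Exception:
        # The broad handler swallowed the ValueError and fell back to
        # the default, silently degrading every prediction.
        stage, confidence = "early", 0.0
    return stage, confidence

print(perceive())  # -> ("early", 0.0): the fallback, not a real prediction
```

The fix is to make the return shape and the unpacking agree (and to narrow the `except` so shape mismatches fail loudly instead of degrading silently).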
The runner previously crashed with 'No organism loaded' because it never called `load_organism()` (`launch_gently.py` does that, but the benchmark is a separate entrypoint). Hardcode `celegans` for now. Also add `--start-timepoint` so hard stages (comma/1.5fold/2fold/pretzel, which start at T39-T90) can be targeted without burning API calls on the ~33-50 easy 'early' frames that precede them. `iter_embryo()` already supported the parameter; this just threads it through the CLI/config.
Adds three runner CLI flags for isolating the comma-lock cause:

- `--no-temporal-context`: omit the TEMPORAL CONTEXT block from the prompt
- `--no-previous-observations`: omit the PREVIOUS OBSERVATIONS block
- `--real-timestamps`: pass the TIFF acquisition time to the session

Threaded through `BenchmarkConfig` -> `to_dict()` -> `PerceptionEngine` ctor / `session.add_observation(timestamp=)`. All default to current behavior.

`testset.py`: expose `TestCase.acquired_at`, parsed from TIFF `YYYYmmdd_HHMMSS` filenames (it was already parsed for sorting, then discarded).

`session.py`: `compute_temporal_analysis` now uses `observations[-1].timestamp` instead of `datetime.now()` so real (historical) timestamps work. NOTE: in the live-agent path (`manager.py`), `perceive()` runs before `add_observation()`, so this shifts the reported time-in-stage down by one acquisition interval (~4 min) and the 225-min arrest threshold fires one frame later. Arguably more correct (time *observed* in stage), but it is a small behavior change, not a no-op. `total_session_duration_min` goes negative under `--real-timestamps` (`created_at` is wallclock vs `now` is TIFF time); cosmetic only, never reaches the prompt.
Ablation confirmed the wallclock-time bug is the dominant cause of the perception engine's stage-lock in benchmarks. Make the fix unconditional: pass `test_case.acquired_at` to `session.add_observation()` instead of letting it default to `datetime.now()`. Removes the `--real-timestamps` flag (was for ablation only).
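The timestamp threading amounts to parsing the acquisition time out of the filename and passing it explicitly. A sketch, with an illustrative helper name (the `YYYYmmdd_HHMMSS` filename shape is from this PR; the function is not the repo's API):

```python
from datetime import datetime
from pathlib import Path

def acquired_at_from_filename(path: str) -> datetime:
    """Parse the YYYYmmdd_HHMMSS acquisition timestamp from a TIFF
    filename, e.g. '20250314_101500.tif'."""
    return datetime.strptime(Path(path).stem, "%Y%m%d_%H%M%S")

ts = acquired_at_from_filename("20250314_101500.tif")
# then: session.add_observation(..., timestamp=ts)
# instead of letting timestamp default to datetime.now()
```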
Problem
The perception benchmark (`benchmarks/perception/runner.py`) was unrunnable on `main` until #fix-perception-engine-loop, and once runnable it locks the engine onto a single early stage for the entire run:

- `comma` for 35 consecutive frames → 25.0% exact / 84.6% adjacent
- `bean` for all 52 frames → 11.5% exact / 23.1% adjacent
The model isn't ignoring the image — it's obeying a quantitative signal
that says "don't advance yet."
Root cause
`PerceptionEngine._build_prompt()` includes a TEMPORAL CONTEXT block: "Time at this stage: <N> minutes / Expected: <M> / Overtime: <N/M>×". The minutes come from `PerceptionSession.compute_temporal_analysis()`, which diffs observation timestamps.

`runner.py` called `session.add_observation()` without `timestamp=`, so it defaulted to `datetime.now()`, i.e. the moment the benchmark code processed that frame (~19 s/frame), not when the embryo was at that stage (~4 min/frame). 22 frames of "comma" → 7 wallclock minutes vs comma's 30-min typical duration → overtime 0.2×. The model is told it's 20% through comma, so it stays. Direct quote from a reasoning trace at T55 (GT=1.5fold, pred=comma):
In a real timelapse those 22 frames are 88 minutes → 2.9× overtime →
"unusual, re-examine."
Fix
- `runner.py`: pass `timestamp=test_case.acquired_at` (parsed from the TIFF filename `YYYYmmdd_HHMMSS`) to `session.add_observation()`.
- `testset.py`: expose `TestCase.acquired_at` (the timestamp was already parsed for sorting, then discarded).
- `session.py`: `compute_temporal_analysis` uses `observations[-1].timestamp` instead of `datetime.now()` so historical timestamps work.
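A minimal sketch of the corrected `session.py` computation, assuming a simple observation record (the names below are illustrative, not the real API):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Observation:
    stage: str
    timestamp: datetime

def minutes_in_current_stage(observations):
    """Anchor 'now' to the latest observation's timestamp rather than
    datetime.now(), so historical (TIFF) timestamps yield correct
    stage durations in benchmark replays."""
    now = observations[-1].timestamp   # was: datetime.now()
    current = observations[-1].stage
    entered = now
    # walk back to the first consecutive observation of the current stage
    for obs in reversed(observations):
        if obs.stage != current:
            break
        entered = obs.timestamp
    return (now - entered).total_seconds() / 60
```

With `datetime.now()` as the anchor, replaying a historical run would compare 2025 acquisition times against the 2026 wallclock; anchoring to the last observation keeps the diff internal to the data.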
Also in this PR (enabling the testing below):
- `runner.py` now calls `load_organism("celegans")`; it previously crashed with `RuntimeError: No organism loaded` when run standalone.
- `--start-timepoint N` flag to skip the ~33–50 easy `early` frames and target hard stages directly.
- `--no-temporal-context` / `--no-previous-observations` ablation toggles on `PerceptionEngine` (default-on, live behavior unchanged).

Testing: ablation matrix
embryo_2 T33–84 (52 frames: bean 6, comma 6, 1.5fold 15, 2fold 20, pretzel 5), `GENTLY_MODEL_PERCEPTION=claude-sonnet-4-5-20250929`, 5 runs:

- bean[33–84]: frozen `hatched` by T77
- bean[33–60], then clean advance
- bean[33–62], then clean advance: pretzel 100%

Per-stage exact accuracy:
Conclusions:
- …worst by 2×).
- …to 15% with no frame-to-frame coherence; the anchors are load-bearing when correct.
- …gets the best late-stage tracking (pretzel 100%).
Behavior change to flag for review
`session.py`'s `now = observations[-1].timestamp` is not strictly neutral for the live agent: in `manager.py`, `perceive()` runs before `add_observation()`, so `observations[-1]` is the previous frame. This shifts reported time-in-stage down by ~one acquisition interval (~4 min) and the 225-min arrest threshold fires one frame later. Arguably more correct ("time observed in stage" rather than "time since first observation including the current gap"), but it is a small live-path change.
Cosmetic: `total_session_duration_min` goes negative under benchmark mode (`created_at` is wallclock 2026, `now` is TIFF 2025). That field never reaches the prompt or arrest logic.
Why the middle stages are still 0% — and the path forward
Run D's per-stage result is bimodal: bean 100%, comma 0%, 1.5fold 0%, 2fold 5%, pretzel 100%. Prediction streak: bean[33–62] → comma[63–65] → 1.5fold[66–67] → 2fold[68] → pretzel[69–84]. The fix changes the failure from "never moves" (baseline) to "moves correctly but ~25 frames late." By the time D breaks out of bean at T63, the real embryo is already at 2fold, so comma and 1.5fold are entirely missed. Once it catches up, it tracks pretzel perfectly.
The lag persists because the visual discriminator can't tell bean ↔
comma ↔ 1.5fold apart on these projections. Even when the (now-correct)
temporal block says "you're at 2× overtime, re-examine," the model looks
again, applies the prompt's only fold rule, and concludes "still looks
like bean":
- `organisms/celegans/perception_prompt.py:45-62` reduces the entire comma↔fold decision to "In XZ, are masses side-by-side or stacked?", which doesn't survive max-intensity projection. Trace at e2 T62 (GT=2fold, pred=comma, conf 0.88): "XZ shows lobes at same vertical level, not stacked." The model applies the rule correctly; the rule is wrong for this rendering. The 82%-accurate `gently-perception/perception/scientific.py:67-71` variant uses eggshell-fill-fraction instead (sparse → ≤1.5fold, moderate → 2fold, dense → pretzel), which is visible in projections.
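A hedged sketch of what a fill-fraction discriminator could look like. The thresholds below are invented for illustration; the actual values in `scientific.py:67-71` may differ:

```python
def stage_from_fill_fraction(fill: float) -> str:
    """Map the fraction of the eggshell area occupied by the embryo mass
    (0.0-1.0 in a max-intensity projection) to a coarse late-stage bin.
    Thresholds are assumed, not taken from the repo."""
    if fill < 0.5:        # sparse: embryo occupies little of the shell
        return "<=1.5fold"
    elif fill < 0.75:     # moderate: folded once over itself
        return "2fold"
    return "pretzel"      # dense: shell nearly filled by coiled embryo
```

Unlike the side-by-side/stacked XZ rule, fill fraction is a property the projection preserves, which is presumably why that variant tracks the fold stages better.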
- `organisms/celegans/stages.py:100-247` has detailed `STAGE_CRITERIA` with `NOT_if` exclusions and a ready `format_stage_criteria_for_prompt()` (line 363), but the engine never calls it. The system prompt has ~24 lines on early/bean/comma and 1 line each on fold stages, so the model has rich criteria for entering an early stage and almost nothing for leaving it.
- `engine.py:240` gates `request_verification` on conf < 0.7, but the errors are confidently wrong (41/52 had conf ≥ 0.79), so verification never fires. Triggering it on stagnation (frames-in-stage > expected) rather than low confidence would let the subagent break the tie.
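The proposed stagnation trigger could look like this. A sketch only; the real `engine.py` gate and its parameter names may differ:

```python
def should_request_verification(confidence: float,
                                frames_in_stage: int,
                                expected_frames: int,
                                conf_threshold: float = 0.7) -> bool:
    """Current gate fires only on low confidence. Adding a stagnation
    condition (stuck in one stage longer than its expected duration)
    catches confidently-wrong streaks the confidence gate misses."""
    return confidence < conf_threshold or frames_in_stage > expected_frames

# e2 T62-style case: confidently wrong (0.88) but far past the stage's
# expected duration -- the old gate stays silent, the new one fires.
print(should_request_verification(0.88, 22, 8))  # True
```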
This PR fixes the brake; a follow-up PR fixes the eyes. The harness changes here (`--start-timepoint`, ablation toggles, real timestamps) are what make those prompt-level improvements measurable on the hard stages without burning ~45 API calls on `early` first.

Stacked on `fix-perception-engine-loop`: the engine was unrunnable on `main` (tuple-unpack crash in `perceive()`). Retarget to `main` after that merges.