
Replay a user's routine with OpenBrowser#54

Merged
softpudding merged 23 commits into main from codex/recording-mode-foundation
Apr 9, 2026
Conversation

@softpudding
Owner

No description provided.

softpudding and others added 23 commits April 4, 2026 01:22
Introduce a first end-to-end recording workflow across the server, extension, and frontend. This adds a recording manager plus REST routes for creating, listing, stopping, and appending recording sessions, and wires a new recording_control command through the processor so the server can drive recorder state in the browser extension.

Add extension-side recording support with a background recorder module, content-script event capture for trusted user interactions, and tab lifecycle tracking. Exclude the OpenBrowser UI itself from recording so the recorder only captures the target workflow instead of localhost:8765 control interactions.

Add a dedicated recording panel in the frontend with a header-level Record entry point, live event polling, recording summaries, and event detail inspection. Fix the recording event list layout so long traces keep full-height cards inside an internally scrollable panel instead of collapsing into thin horizontal rows.

Verification: pytest server/tests/unit/test_recording_routes.py server/tests/unit/test_api_uuid.py; npm run build (extension); node --check on the extracted frontend script.
Add a dedicated recording workflow that is separate from task chat and can launch recordings in an isolated browser window. The backend now supports recording launch modes, recording start/stop control, and test coverage for the new API behavior.

Extend the extension recorder to track scoped tabs, browser-level navigation events, semantic container context for recorded elements, and keyframe screenshots for actionable events. Recording review UI now lives in a standalone panel, shows captured events and keyframe previews, and excludes the OpenBrowser app itself from recorded activity.

Document the screenshot finding in AGENTS.md and codify the final recording rule: page_view is a lifecycle signal only and must not capture keyframes. Capturing screenshots on startup or refresh page_view events reproducibly shrank the live Chrome page into the top-left corner, while tab_ready remained safe for startup snapshots.
Rebuild the recording panel timeline around page-side event timestamps instead of raw persistence order so click events with keyframes no longer appear artificially late.

Fold near-adjacent focus and ambient scroll events into the surrounding click card when they refer to the same element and tab, while still exposing the supporting events in the details panel.

Also add the missing normalizeWhitespace helper used by the new timeline grouping logic.
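
The ordering and folding rules above can be sketched roughly as follows. This is a Python rendering for illustration only (the panel itself is frontend code); the field names `page_ts`, `tab_id`, and `element`, and the 500 ms window for "near-adjacent", are assumptions and do not come from the commit.

```python
from dataclasses import dataclass, field

FOLD_WINDOW_MS = 500                 # assumed threshold for "near-adjacent"
FOLDABLE_TYPES = {"focus", "scroll"}


@dataclass
class Event:
    type: str
    page_ts: int   # page-side timestamp in ms (assumed field name)
    tab_id: int
    element: str   # some stable element identifier (assumed)


@dataclass
class Card:
    primary: Event
    supporting: list = field(default_factory=list)


def build_timeline(events):
    """Order by page-side time, then fold focus/scroll into a nearby click card."""
    ordered = sorted(events, key=lambda e: e.page_ts)
    cards = [Card(e) for e in ordered if e.type not in FOLDABLE_TYPES]
    clicks = [c for c in cards if c.primary.type == "click"]
    for e in ordered:
        if e.type not in FOLDABLE_TYPES:
            continue
        # A host is a click on the same element and tab within the window.
        host = next(
            (c for c in clicks
             if c.primary.tab_id == e.tab_id
             and c.primary.element == e.element
             and abs(c.primary.page_ts - e.page_ts) <= FOLD_WINDOW_MS),
            None,
        )
        if host is not None:
            host.supporting.append(e)   # still visible in the details panel
        else:
            cards.append(Card(e))       # no matching click: keep its own card
    cards.sort(key=lambda c: c.primary.page_ts)
    return cards
```

Sorting on the page-side timestamp rather than arrival order is what keeps a click whose keyframe was persisted late from drifting down the list.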
…ompilation pipeline

Restructure the recording panel into three distinct phases:
1. Record — live capture with event list and controls
2. Review — inspect trace, add intent note, continue to compile
3. Compile — interactive compiler agent session with streaming log, Q&A, and SOP output

Backend: implement compiler agent with SSE streaming via background thread + queue pattern,
add trace_viewer/file/submit_workflow tools, compile and compile/answer endpoints,
recording metadata update support, and intent note persistence.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Frontend: Compress recording panel chrome (smaller header, stepper, toolbar)
and event cards so 5-6 events fit in the list. Move intent note out of the
right detail pane into a footer row so the EVENT DETAIL pane has full
vertical space, and open the raw JSON details by default. Fix wheel
scrolling on the EVENT DETAIL pane by switching the nested flex chain
(.recording-phase-content → .recording-view → .recording-split-layout →
.recording-detail-content) from flex: 1 1 auto to flex: 1 1 0 with explicit
overflow: hidden, so the inner scroller is hard-bounded by its parent's
track height.

Compiler agent: Persist conversation traces to
~/.openbrowser/compiler_traces/{recording_id}_{timestamp}.json on
completion, error, or clarification, with long base64 strings truncated,
and surface the trace path through SSE results so failures can be replayed.
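
A minimal sketch of the trace persistence described above, assuming a 200-character cut-off and a `%Y%m%dT%H%M%S` timestamp; neither value is stated in the commit, and `truncate_base64` and `persist_trace` are hypothetical names.

```python
import json
import re
import time
from pathlib import Path

MAX_B64_CHARS = 200  # assumed cut-off; the real limit is not given in the source
_B64_RUN = re.compile(r"[A-Za-z0-9+/=]{%d,}" % (MAX_B64_CHARS + 1))


def _shorten(match):
    run = match.group(0)
    dropped = len(run) - MAX_B64_CHARS
    return run[:MAX_B64_CHARS] + f"...[truncated {dropped} chars]"


def truncate_base64(value):
    """Recursively shorten long base64-looking runs inside a JSON-like value."""
    if isinstance(value, str):
        return _B64_RUN.sub(_shorten, value)
    if isinstance(value, list):
        return [truncate_base64(v) for v in value]
    if isinstance(value, dict):
        return {k: truncate_base64(v) for k, v in value.items()}
    return value


def persist_trace(recording_id, events, root=None):
    """Write the truncated trace and return its path so it can be surfaced."""
    root = Path(root) if root else Path.home() / ".openbrowser" / "compiler_traces"
    root.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")  # timestamp format is an assumption
    path = root / f"{recording_id}_{stamp}.json"
    path.write_text(json.dumps(truncate_base64(events), indent=2))
    return path
```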

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The new skill/claude/open-browser/ tree mirrors the codex flavor but is
tuned for Claude Code's execution model:

- send_task.py no longer truncates [message:assistant], [thought], or
  [observation] lines, so the final agent answer always lands in the
  conversation log intact.
- The 17 KB SystemPromptEvent is collapsed to a single
  "[system_prompt] suppressed (N chars)" line by default; pass
  --show-system-prompt to opt back in.
- New --conversation-id flag lets follow-up turns reuse an existing
  browser session instead of always creating a fresh conversation.
- SKILL.md drops the "background + sleep + tail" guidance and points at
  Claude Code's native run_in_background Bash option, with foreground
  SSE streaming as the default.
- references/ refresh the script paths and add a "final assistant
  message looks cut off" troubleshooting entry plus the
  NO_PROXY="127.0.0.1,localhost" tip for proxied environments.

Verified end-to-end against a real example.com task: full assistant
message arrives untruncated, system prompt is suppressed, and the log
shrinks from hundreds of lines to ~8 for a one-step task.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e dumps in AGENTS.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
submit_workflow no longer flips the conversation to FINISHED. After it
validates, the agent gets one more turn to send a plain-language wrap-up
message and then the loop ends naturally. _collect_result detects this
state via _detect_review_state (walks events for the latest successful
submit + the most recent agent text after it) and returns
status:"review", keeping the session alive so the user can either
finalize or send revision feedback that triggers another submit cycle.
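
The review-state detection can be pictured with the sketch below. The event shape (dicts with `kind`, `tool`, `ok`, and `text` keys) is hypothetical, and `collect_result` here is a heavily simplified stand-in for `_collect_result`, not its actual implementation.

```python
def detect_review_state(events):
    """Return the agent's wrap-up text if a review is pending, else None."""
    submit_index = None
    for i, ev in enumerate(events):
        if (ev.get("kind") == "tool_result"
                and ev.get("tool") == "submit_workflow"
                and ev.get("ok")):
            submit_index = i  # remember the latest successful submit
    if submit_index is None:
        return None
    # Most recent agent text that comes after that submit.
    for ev in reversed(events[submit_index + 1:]):
        if ev.get("kind") == "agent_text" and ev.get("text"):
            return ev["text"]
    return None


def collect_result(events):
    summary = detect_review_state(events)
    if summary is not None:
        # Session stays alive so the user can finalize or request revisions.
        return {"status": "review", "summary": summary, "keep_session": True}
    return {"status": "not_in_review", "keep_session": False}
```

Anchoring on the latest successful submit means a revision round resets the state: after a second `submit_workflow`, the earlier wrap-up message no longer counts.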

Adds POST /recordings/{id}/compile/finalize wired to a new
finalize_compiler_session helper for the approval path. The frontend
gains a green review block (wrap-up summary, finalize button, revision
textbox), auto-scrolls and flashes it when the SOP is drafted, and
persists the draft on review events so navigating away doesn't lose
work. max_iteration_per_run bumped to 80 to accommodate multi-round
revisions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a routines layer on top of the compile pipeline so finalized SOPs are
named, persisted, and replayable from the Execute panel without re-recording.

- routine_manager + /routines CRUD API (SQLite-backed, validated via the
  newly extracted validate_sop_markdown helper)
- compile/finalize now requires a name and atomically creates a routine
- Compiler review block prompts for the routine name (suggested from the
  SOP goal) before the finalize button enables
- Execute panel gains a Saved Routines section, a routine card that can
  be staged in the input area, slash-command autocomplete in the task
  textarea, and a management modal for edit/rename/delete

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Folds the inline routine chip list behind a single launcher button that
opens a modal dialog, paginating ten routines at a time so long lists
stay scannable instead of overflowing the toolbar.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Compiler agent was assuming OpenBrowser's select action matched options
by visible label, so SOPs would name the human-readable text and the
runtime would fail to find a match. Sharpens the compiler tool prompt
and the select command description so SOPs always quote the literal
option.value (with the visible label as a parenthetical cue), records
both value and selectedText on <select> change events, and adds a
value -> exact-text -> case-insensitive-contains fallback in the
extension that returns the available option inventory on miss.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tions

The OpenBrowserAction base class exposed conversation_id as a regular pydantic
field, so it leaked into every tool's JSON schema. The LLM occasionally filled
it (e.g. mistaking it for tab_id and passing 1737540392), and the executor then
overwrote its real conversation_id from the action, sending bogus routing data
to the Chrome extension server and getting back HTTP 400. Mark the field as
SkipJsonSchema/exclude=True so it never appears in the tool schema, and remove
the executor override that read action.conversation_id — the real id is set in
__call__ from conversation._state.id.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e-recording

- Rename "SOP" → "Browser Routine" across docs, frontend, API,
  compiler agent, validators, and the routines table (with an
  on-startup column migration).
- Put `select_element` behind the YELLOW 2PC preview alongside click
  and keyboard_input. New `confirm_select` action; pending state now
  echoes the chosen `value` so the LLM can verify it against the
  rendered `<option>` list.
- Introduce a `mode="routine_replay"` conversation tag that flows from
  the API through session metadata into the system prompt and tool
  schemas. In replay mode, small models get a restricted highlight
  action that exposes `keywords` only for tokens copied verbatim from
  the active Routine step's `**Keywords:**` line. The compiler agent
  learns to emit those optional Keywords lines for stable
  testid-style identifiers, and the validator enforces the single
  bare-token rule. The highlight detector now recognises
  data-testid / data-test / data-cy / data-qa so those tokens can be
  surfaced.
- Frontend always opens a fresh routine-replay conversation when the
  user runs a saved routine, so the replay system prompt is in force.
- Add DELETE /recordings/{id} (refuses active recordings, closes any
  bound compiler session, drops events + session in one transaction)
  and a hover-revealed delete button on the recording history cards.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The submit_workflow tool description used to call **Keywords:** an
optional line "only for 100% fixed elements", which contradicted the
system prompt's "include whenever there's a clean candidate" stance and
trained the compiler agent to write empty `**Keywords:**` boilerplate
that the validator then rejected. Realign the tool description with the
system prompt and explicitly instruct the agent to OMIT the line when
no clean token exists.

Repoint openhands-sdk/openhands-tools at the published agent-sdk commit
316612396c25e3c4396ce3282829b07399a5d30c (which adds visible-text words
as a last-resort keyword candidate, matching the runtime matcher).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ef on TS

- Apply black/prettier reformatting across server and extension so the branch
  satisfies pre-commit (no behavioral change).
- Update test_base_classes::TestOpenBrowserAction to assert that
  conversation_id is internal-only: still settable from Python, but excluded
  from model_dump() and from the JSON schema exposed to the LLM. Matches the
  intent of 98cf819 where exclude=True was added alongside SkipJsonSchema.
- Disable core `no-undef` for TS files in extension/eslint.config.mjs so DOM
  type references like `RequestInit` (used as type-only casts in the recorder
  tests) don't get flagged. TypeScript already validates these.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
All four were flagged in a Codex review of the branch diff against main.
None of the fixes change public schemas; all touched code paths are covered
by new or updated tests.

1. [P1] Stop-on-disconnect no longer strands recordings as ACTIVE.
   Previously, if the browser websocket dropped during a recording,
   POST /recordings/{id}/stop returned 409 without transitioning state.
   Because DELETE and create_recording both refuse ACTIVE rows, the user
   was locked out of that browser until the DB was fixed manually. The
   stop handler now transitions the row to STOPPED locally with a
   stop_reason=browser_disconnected note, and does NOT dispatch a stop
   command the extension can't receive.

2. [P2] finalize_compiler_session now guards on review state AND
   re-validates the draft. _collect_result keeps sessions alive in both
   "asking" and "review" states, so a client could previously finalize
   while the agent was still asking a clarifying question and persist a
   half-formed routine. The new guard:
     - refuses to finalize unless _detect_review_state() returns true
       (i.e. a successful submit_workflow observation exists)
     - runs validate_routine_markdown() on the draft before teardown
     - leaves the session alive on validation failure so the user can
       send revision feedback via /compile/answer instead of being
       stranded
   This mirrors the validation the /routines create/update paths already
   run via _validate_or_raise.

3. [P2] /recordings/{id}/events now rejects writes once the row leaves
   ACTIVE. Keyframe capture in the extension runs async, so an /events
   POST started before /stop could land after /stop finished — letting a
   trace the user had already reviewed or compiled change underneath
   them. The handler now returns 409 with the current status when the
   session is no longer ACTIVE.

4. [P2] <select> resolution drops the substring fallback. The previous
   resolveOption() used a case-insensitive .includes() fallback as a
   third-choice match, which silently picked the first option whose
   label contained the requested token. On filters/screeners with
   overlapping labels (e.g. several "Market cap over ..." choices), this
   caused select_element to mutate page state against an arbitrary
   option without surfacing the ambiguity. Matching is now exact-only on
   option.value and trimmed option.text, restoring the intent of b18824c
   ("Teach compiler and runtime that <select> matches by option value").
   The error path already reports the full inventory so callers can
   retry with the correct value.
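
To make item 4 concrete, here is a Python sketch of exact-only resolution. The real `resolveOption()` lives in the extension; the `(value, text)` pair representation and the `OptionNotFound` error type are assumptions made for this example.

```python
class OptionNotFound(Exception):
    """Raised on a miss; carries the full option inventory for a retry."""

    def __init__(self, requested, inventory):
        super().__init__(f"no option matches {requested!r}")
        self.inventory = inventory


def resolve_option(options, requested):
    """options: list of (value, text) pairs read from a <select>."""
    # First choice: exact match on option.value.
    for value, _text in options:
        if value == requested:
            return value
    # Second choice: exact match on trimmed option.text.
    for value, text in options:
        if text.strip() == requested:
            return value
    # No substring fallback: report the inventory instead of guessing.
    raise OptionNotFound(
        requested, [{"value": v, "text": t.strip()} for v, t in options]
    )
```

With options labelled "Market cap over $1B" and "Market cap over $10B", a request for "Market cap over" now fails loudly, where a contains-style match would have silently picked whichever option came first.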

Tests:
- test_recording_routes.py:
  - test_stop_recording_handles_disconnected_browser_by_stopping_locally
    replaces the old "rejects disconnected browser" test and verifies
    the new local-stop path, metadata note, and that no extension
    command is dispatched.
  - test_append_recording_event_rejects_non_active_recording verifies
    the new 409 on writes to a stopped session and that the trace is
    untouched.
- test_compiler_agent_finalize.py (new):
  - asking-state rejection (session stays alive)
  - invalid-markdown rejection (session stays alive)
  - happy-path finalize tears down the session and returns a completed
    routine doc populated from validate_routine_markdown's summary
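
The two recording-route behaviours exercised by these tests can be modelled as below. This is an in-memory sketch with assumed names (`Recording`, `Conflict`, `dispatch_stop`), not the actual route handlers; `Conflict` stands in for the HTTP 409 response.

```python
from enum import Enum


class Status(str, Enum):
    ACTIVE = "active"
    STOPPED = "stopped"


class Conflict(Exception):
    """Stand-in for an HTTP 409 that reports the current status."""

    def __init__(self, status):
        super().__init__(f"recording is {status.value}")
        self.status = status


class Recording:
    def __init__(self):
        self.status = Status.ACTIVE
        self.events = []
        self.metadata = {}


def stop_recording(rec, browser_connected, dispatch_stop):
    if browser_connected:
        dispatch_stop()  # normal path: tell the extension to stop
    else:
        # Disconnected: stop locally, and do not send a command the
        # extension cannot receive.
        rec.metadata["stop_reason"] = "browser_disconnected"
    rec.status = Status.STOPPED


def append_event(rec, event):
    if rec.status is not Status.ACTIVE:
        raise Conflict(rec.status)  # late keyframe POSTs land here
    rec.events.append(event)
```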

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three small fixes uncovered while running the qwen3.5 eval:

- Emit usage_metrics before complete in both SSE worker paths
  (server/agent/api.py, server/core/browser_executor_bundle.py).
  The streamer drains and breaks right after yielding complete, so
  anything queued after it can race and be dropped — that left the
  eval logging "no usage_metrics event received" and the frontend
  showing all-zero usage stats.
- Always render cost in RMB on the frontend; this project accounts
  in RMB across the board, so the USD/¥ branching was unnecessary.
- Reset .main-terminal scrollTop on advanced-mode toggle. The shell
  is position:absolute inside main-terminal, so a leftover scrollTop
  (carried over when overflow flips from auto to hidden) pushed the
  panel above the visible viewport.
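
The ordering fix in the first bullet follows from how a thread-plus-queue SSE streamer terminates. A minimal sketch, with hypothetical event payloads and function names:

```python
import json
import queue
import threading


def worker(out):
    """Background worker: pushes (event, data) pairs onto the queue."""
    out.put(("message", {"text": "step done"}))
    # usage_metrics is queued BEFORE complete; anything queued after
    # complete may never be read, because the streamer stops there.
    out.put(("usage_metrics", {"total_tokens": 123}))
    out.put(("complete", {}))


def stream(out):
    """Generator that turns queue items into SSE frames."""
    while True:
        event, data = out.get()  # blocks until the worker produces an item
        yield f"event: {event}\ndata: {json.dumps(data)}\n\n"
        if event == "complete":
            break  # terminal event: stop reading the queue


def run():
    out = queue.Queue()
    threading.Thread(target=worker, args=(out,), daemon=True).start()
    return list(stream(out))
```

Because the queue is FIFO and the generator breaks on `complete`, the only reliable way to deliver `usage_metrics` is to enqueue it first.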

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@softpudding softpudding merged commit 6f02ded into main Apr 9, 2026
4 checks passed

1 participant