Merged
23 commits
5d1e848
Add recording mode foundation and trace review flow
softpudding Apr 3, 2026
02d7d91
Expand recording mode into a dedicated workflow foundation
softpudding Apr 5, 2026
e1f80ca
Improve recording trace review ordering and event folding
softpudding Apr 5, 2026
f3e82b1
Improve recording review UI and cleanup flow
softpudding Apr 6, 2026
5cbf909
Adjust recording keyframe capture policy
softpudding Apr 6, 2026
5a60eaa
Refine recording review UX and annotate keyframes
softpudding Apr 6, 2026
617aaa9
Use full-size screenshot capture and agent-only workflow drafts
softpudding Apr 6, 2026
ebf0cbd
Capture pre-action recording keyframes and compile semantic scrolls
softpudding Apr 6, 2026
757f525
Refine recording semantics and simplify review flow
softpudding Apr 6, 2026
75dad95
Add three-phase recording UI, compiler agent SSE streaming, and SOP c…
softpudding Apr 6, 2026
4df3248
Refine review pane layout, scroll behavior, and compiler trace dumps
softpudding Apr 7, 2026
8cadd47
Add Claude Code flavor of the open-browser skill
softpudding Apr 7, 2026
ad5f984
Document server test commands, vendored SDK layout, and compiler trac…
softpudding Apr 7, 2026
ede62fc
Turn compiler submit into a review checkpoint instead of a hard finish
softpudding Apr 7, 2026
53bcd26
Save finalized SOPs as named, replayable routines
softpudding Apr 7, 2026
9b16a3a
Replace routines toolbar with launcher modal and pagination
softpudding Apr 7, 2026
b18824c
Teach compiler and runtime that <select> matches by option value
softpudding Apr 7, 2026
98cf819
Hide conversation_id from LLM tool schemas and stop trusting it on ac…
softpudding Apr 7, 2026
368fba2
Rename SOP to Routine, add select 2PC, routine-replay mode, and delet…
softpudding Apr 8, 2026
97d62a1
Tighten compiler keyword guidance and pin agent-sdk
softpudding Apr 8, 2026
4e2bc01
Fix CI: lint/format, assert conversation_id is hidden, silence no-und…
softpudding Apr 8, 2026
22715cc
Fix four correctness issues in recording + routine flows
softpudding Apr 8, 2026
159a983
Fix usage metrics race, RMB display, and advanced-mode scroll bug
softpudding Apr 8, 2026
35 changes: 29 additions & 6 deletions AGENTS.md
@@ -253,7 +253,7 @@ The visual interaction workflow is implemented across 4 focused tools:
|------|----------|---------|
| `tab` | `tab init`, `tab open`, `tab close`, `tab switch`, `tab list`, `tab refresh`, `tab view`, `tab back`, `tab forward` | Session and tab management |
| `highlight` | `highlight_elements` | Element discovery with blue overlays |
-| `element_interaction` | `click_element`, `confirm_click_element`, `hover_element`, `scroll_element`, `keyboard_input`, `confirm_keyboard_input`, `select_element` | Element interaction with 2PC only for click and keyboard input |
+| `element_interaction` | `click_element`, `confirm_click_element`, `hover_element`, `scroll_element`, `keyboard_input`, `confirm_keyboard_input`, `select_element`, `confirm_select` | Element interaction with 2PC for click, keyboard input, and select |
| `dialog` | `handle_dialog` | Dialog handling (accept/dismiss) |

## UNIQUE PATTERNS
@@ -272,9 +272,10 @@ If operation fails twice:
## PERFORMANCE OPTIMIZATIONS

### Selective 2PC
-- `click_element` and `keyboard_input` require an ORANGE confirmation preview followed by `confirm_click_element` or `confirm_keyboard_input`
-- `hover_element`, `scroll_element`, `swipe_element`, and `select_element` execute immediately and return the post-action screenshot
-- Starting a different action clears any pending confirmation from a previous `click_element` or `keyboard_input`
+- `click_element`, `keyboard_input`, and `select_element` require a YELLOW confirmation preview followed by `confirm_click_element`, `confirm_keyboard_input`, or `confirm_select`
+- `select` confirmation messages also echo the chosen `value` so the agent can verify option text against the rendered `<option>` list
+- `hover_element`, `scroll_element`, and `swipe_element` execute immediately and return the post-action screenshot
+- Starting a different action clears any pending confirmation from a previous `click_element`, `keyboard_input`, or `select_element`
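
The selective 2PC rules above can be sketched as a small state machine. This is only an illustration under stated assumptions: the `TwoPhaseGate` class, the `PendingAction` record, and the returned message strings are hypothetical and are not taken from the OpenBrowser codebase; only the command names come from the tool table.

```python
from dataclasses import dataclass
from typing import Optional

# Command names come from the tool table; the mapping itself is an assumption.
CONFIRM_FOR = {
    "click_element": "confirm_click_element",
    "keyboard_input": "confirm_keyboard_input",
    "select_element": "confirm_select",
}
IMMEDIATE = {"hover_element", "scroll_element", "swipe_element"}


@dataclass
class PendingAction:
    """Hypothetical record of an action that is waiting for confirmation."""
    action: str
    element_id: str
    value: Optional[str] = None


class TwoPhaseGate:
    """Hypothetical gate that holds at most one pending confirmation."""

    def __init__(self) -> None:
        self.pending: Optional[PendingAction] = None

    def start(self, action: str, element_id: str, value: Optional[str] = None) -> str:
        # Starting any action clears a pending confirmation from an earlier one.
        self.pending = None
        if action in IMMEDIATE:
            return f"executed {action}"
        if action not in CONFIRM_FOR:
            raise ValueError(f"unknown action: {action}")
        self.pending = PendingAction(action, element_id, value)
        message = f"preview {action} on {element_id}; confirm with {CONFIRM_FOR[action]}"
        if action == "select_element":
            # Echo the chosen value so the caller can check it against the options.
            message += f" (value={value!r})"
        return message

    def confirm(self, confirm_command: str) -> str:
        pending = self.pending
        if pending is None or CONFIRM_FOR[pending.action] != confirm_command:
            raise RuntimeError("no matching pending action to confirm")
        self.pending = None
        return f"executed {pending.action} on {pending.element_id}"
```

In this sketch, calling `start("hover_element", ...)` after `start("click_element", ...)` drops the pending click, so a later `confirm("confirm_click_element")` fails instead of clicking a stale target.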

## SISYPHUS MODE

@@ -313,18 +314,39 @@ Configuration is saved to `localStorage` (key: `openbrowser_sisyphus_config`).
## COMMANDS

```bash
# Start server
# Start server (HTTP 8765, WebSocket 8766)
uv run local-chrome-server serve
uv run local-chrome-server serve --multi-process # one worker process per conversation

# Build extension
cd extension && npm run build

# Server tests (pytest, async mode auto, paths under server/tests)
uv run pytest # all
uv run pytest server/tests/unit/test_recording_routes.py # one file
uv run pytest server/tests/unit/test_recording_routes.py::TestName::test_x # one test
uv run pytest -m integration # needs running server + extension
```

## SCREENSHOT BEHAVIOR

OpenBrowser has explicit screenshot control for maximum flexibility:

- Screenshots also serve as a practical page warmup mechanism for background tabs. They can unblock page paint and media decode work that passive DOM/readiness inspection does not reliably trigger on its own.
- Screenshot output sizing must not rely on `Page.captureScreenshot` with `clip.scale < 1` on a live tab.
- Reason: scaled CDP captures reproducibly left the visible page shrunk into the top-left corner, including during recording.
- Preferred strategy: capture at the tab's natural device-pixel size first, then downscale/compress offline inside the extension.
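
As a rough illustration of the "capture full-size, then downscale offline" strategy, the target dimensions can be computed without ever asking CDP to scale. This is a sketch, not the extension's code: `fit_within` is a hypothetical helper, and the `960x540` default is only an example limit.

```python
from typing import Tuple


def fit_within(width: int, height: int,
               max_width: int = 960, max_height: int = 540) -> Tuple[int, int]:
    """Return output dimensions that fit the limits while keeping aspect ratio.

    The capture itself stays at the tab's natural size; only the offline
    resize step (not shown) would use the dimensions returned here.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Never upscale: cap the scale factor at 1.0.
    scale = min(1.0, max_width / width, max_height / height)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(2560, 1440))  # (960, 540)
print(fit_within(640, 360))    # (640, 360), already within the limit
```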

### Recording Keyframes

- Recording keyframes must **not** be attached to `page_view` events.
- Reason: `page_view` is emitted during content-script `resume` / `start-recording` after refresh or reload, which is earlier than a stable post-load milestone.
- Capturing a screenshot in that early `page_view` phase reproducibly shrank the live Chrome page into the top-left corner during recording sessions.
- `tab_ready` should stay as a lifecycle event and must not be the sole source of recording screenshots.
- Reason: on slow pages, users often start interacting while the tab still reports loading; waiting for `tab_ready` can miss the meaningful pre-load-complete actions entirely.
- Prefer action-timed keyframes on `click` / `submit`, but discard them when the captured screenshot has already drifted to a different URL than the source event page. This preserves useful action context without trusting navigation-transition screenshots.
- Recording output size limits such as `960x540` should be enforced only by offline downscale/compression after a full-size capture, never by CDP `clip.scale`.
- Action-timed recording keyframes may include an in-image bbox/banner annotation for the acted-on element (or submitted form) so review UI can show exactly what the user just clicked or typed into.
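
The keyframe policy above could be expressed as a single predicate. The sketch below is an assumption-laden illustration: the function names are hypothetical, and treating two URLs as "the same page" when they differ only by fragment is a choice made for this example, not a rule stated in this document.

```python
from urllib.parse import urlsplit


def same_page(url_a: str, url_b: str) -> bool:
    """Compare two URLs while ignoring the fragment (an assumption of this sketch)."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    return (a.scheme, a.netloc, a.path, a.query) == (b.scheme, b.netloc, b.path, b.query)


def should_attach_keyframe(event_type: str, event_url: str, captured_url: str) -> bool:
    """Decide whether a captured screenshot may be attached to a recorded event."""
    # Keyframes must never be attached to page_view events.
    if event_type == "page_view":
        return False
    # Discard screenshots that already drifted to a different URL than the event page.
    return same_page(event_url, captured_url)
```

For example, a `click` on `https://example.com/list` whose screenshot was captured after navigation to `https://example.com/detail/1` would be discarded under this predicate.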

### Commands That Return Screenshots

@@ -606,7 +628,8 @@ Criteria match tracked events using flexible pattern matching:

## NOTES

-- **Git dependencies:** `openhands-sdk` and `openhands-tools` from git subdirectories
+- **Vendored SDK:** `openhands-sdk` and `openhands-tools` are editable installs from `../agent-sdk/openhands-sdk` and `../agent-sdk/openhands-tools` (see `[tool.uv.sources]` in `pyproject.toml`). Modify those paths directly when adding agents or tools — there is no separate package to publish.
- **CDP required:** Extension uses Chrome DevTools Protocol for screenshots/JS execution
- **Preset coordinates:** Screenshots at 1280x720, mouse in 0-1280/0-720 coordinate system
- **Config storage:** LLM config in `~/.openbrowser/llm_config.json`
+- **Compiler agent traces:** dumped on completion / asking / error to `~/.openbrowser/compiler_traces/{recording_id}_{timestamp}.json`. The path is included in the SSE `complete` / `error` payload from `POST /recordings/{id}/compile`. The compiler agent's `QueueVisualizer` intentionally does not pass a `conversation_id`, so debugging relies on these dump files rather than the sessions DB.
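
When debugging from these dump files, a small helper can locate the most recent one for a recording. This is a minimal sketch that assumes only the `{recording_id}_{timestamp}.json` naming pattern described above; the timestamp format is not specified here, so the helper orders by file modification time instead of parsing the name. `latest_compiler_trace` is a hypothetical name.

```python
from pathlib import Path
from typing import Optional

DEFAULT_TRACE_DIR = Path.home() / ".openbrowser" / "compiler_traces"


def latest_compiler_trace(recording_id: str,
                          trace_dir: Path = DEFAULT_TRACE_DIR) -> Optional[Path]:
    """Return the most recently modified trace dump for a recording, if any."""
    # Note: an id that is a prefix of another id plus "_" would also match here.
    candidates = list(trace_dir.glob(f"{recording_id}_*.json"))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)
```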
296 changes: 296 additions & 0 deletions WORKFLOW_COMPILATION.md
@@ -0,0 +1,296 @@
# Recording To Workflow Compilation

## Status

This document records the current design direction for turning a recorded user
trace into an executable OpenBrowser workflow.

Current product status:

- Recording infrastructure exists in
[server/api/routes/recordings.py](/Users/yangxiao/git/OpenBrowser/server/api/routes/recordings.py)
and
[server/core/recording_manager.py](/Users/yangxiao/git/OpenBrowser/server/core/recording_manager.py).
- Browser-side recording exists in
[extension/src/recording/recorder.ts](/Users/yangxiao/git/OpenBrowser/extension/src/recording/recorder.ts)
and
[extension/src/content/index.ts](/Users/yangxiao/git/OpenBrowser/extension/src/content/index.ts).
- Recording review UI exists in
[frontend/index.html](/Users/yangxiao/git/OpenBrowser/frontend/index.html),
and it now focuses on reviewing captured events and keyframes only.
- A first-pass rule-based compiler still exists in
[server/core/workflow_compiler.py](/Users/yangxiao/git/OpenBrowser/server/core/workflow_compiler.py),
but it is no longer the intended product path for user-facing workflow draft
generation.

Recent progress on the recording layer:

- pre-action keyframes now exist for `pointerdown -> click` and `keydown Enter`,
so keyframes can capture the pre-navigation or pre-submit state instead of the
post-action page.
- noisy root-level clicks are filtered out.
- text input is merged into a final meaningful input result instead of one event
per character.
- `page_view` is now restricted to the top-level frame only, avoiding iframe
noise.
- browser history navigation is now distinguished as `tab_back` and
`tab_forward`, instead of collapsing everything into `tab_navigated`.
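
To make the text-input merging concrete, here is a minimal sketch of collapsing per-character input events into one final result. The event shape (`type`, `target`, `value` keys) is hypothetical and is not the recorder's actual schema.

```python
from typing import Any, Dict, List

Event = Dict[str, Any]


def merge_text_input(events: List[Event]) -> List[Event]:
    """Collapse consecutive input events on the same target into the last one."""
    merged: List[Event] = []
    for event in events:
        previous = merged[-1] if merged else None
        if (
            previous is not None
            and event.get("type") == "input"
            and previous.get("type") == "input"
            and previous.get("target") == event.get("target")
        ):
            # Keep only the latest value typed into this field.
            merged[-1] = event
        else:
            merged.append(event)
    return merged
```

Typing "ai" into one field and then clicking a button would produce two entries under this sketch: one input event with value `"ai"` and the click.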

The current system should be treated as:

`recording trace capture and review`

The remaining gap is:

`recording trace -> compiler-agent-authored workflow draft -> approved workflow -> execution`

## Core Decision

OpenBrowser should not rely on rule code to directly turn low-level events into
a user-facing workflow draft.

Reason:

- low-level events capture what physically happened,
- but workflow compilation must express what the user meant.

Examples:

- `click` is a physical browser event,
- `select market cap = largeover` is the user intent,
- `change` is a DOM event,
- `search for housing in Xixi Wetland` is the workflow meaning.

That semantic jump is too brittle to encode as a fixed event-to-step mapping.

Therefore:

- raw recording trace remains the source of truth,
- lightweight normalization may still exist as an internal cleanup layer,
- but the user-facing workflow draft should be generated by a Compiler Agent,
not by deterministic trace rules.

## Product Flow

The intended product flow now has four explicit stages.

### Stage 1: Recording

The user records the browser workflow manually.

Requirements for this stage:

- capture the key browser events,
- capture meaningful keyframes,
- keep the trace reviewable,
- keep the trace faithful to what actually happened.

This stage is the current implementation focus and should be considered the
main completed foundation.

Status update:

- mostly complete as a product foundation,
- still open for continued event-quality improvements where trace semantics are
obviously under-specified or noisy.

### Stage 2: User Intent Note

After stopping recording, the user adds a short note that explains the overall
intent.

Examples:

- "Search Zhihu for posts about AI rent discussions."
- "Filter Finviz to large-cap stocks and inspect the results."
- "Collect useful findings and prepare a summary post."

This note is required because the trace alone usually does not encode the
business rule behind the actions.

This stage is part of the intended design, but it is not yet integrated into
the current product flow.

This is now the immediate next product task.

### Stage 3: Compiler Agent Draft

A dedicated Compiler Agent reads:

- the raw recording trace,
- the keyframes,
- the user intent note,
- and optionally an internal normalized trace.

It then produces a workflow draft.

This draft should describe:

- the intended steps,
- the reasoning behind the steps,
- the unresolved ambiguities,
- and the questions that must be answered before execution.

Important constraint:

The Compiler Agent is not executing the browser at this stage. It is only
interpreting and compiling the demonstrated workflow.

This is now the next major implementation milestone after the intent-note step.

### Stage 4: Draft Iteration And Finalization

The Compiler Agent and the user iterate on the workflow draft.

The user should be able to:

- correct the draft,
- answer clarification questions,
- refine ambiguous rules,
- and approve the final version.

The result of this stage is the final workflow that later execution will use.

## What Recording Must Guarantee

The current recording system does not need to solve workflow semantics.

It only needs to be strong on the following:

- accurate event capture,
- accurate keyframe capture,
- noise reduction where clearly justified,
- stable review of the trace after recording.

That means the recording layer is responsible for facts, not interpretation.

Examples of facts:

- which page was open,
- which element was clicked,
- what text was entered,
- what keyframe was visible before or during the interaction,
- what scroll or form change happened.

Interpretation belongs to the Compiler Agent, not the recorder.

## Role Of Normalization

Normalization may still exist, but only as an internal helper.

Its purpose is to make the trace easier for the Compiler Agent to consume.

Examples:

- remove obvious iframe or ad noise,
- group closely related low-level events,
- dedupe focus and click when they represent the same interaction,
- collapse scroll bursts,
- keep supporting events attached to a primary interaction.
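
One of these normalization helpers, collapsing scroll bursts, might look like the sketch below. The event fields (`type`, `ts` in milliseconds, `delta_y`) and the 500 ms gap are assumptions made for illustration; they are not the recorder's schema or a tuned threshold.

```python
from typing import Any, Dict, List

Event = Dict[str, Any]


def collapse_scroll_bursts(events: List[Event], max_gap_ms: int = 500) -> List[Event]:
    """Merge consecutive scroll events separated by at most max_gap_ms."""
    collapsed: List[Event] = []
    for event in events:
        previous = collapsed[-1] if collapsed else None
        if (
            previous is not None
            and event["type"] == "scroll"
            and previous["type"] == "scroll"
            and event["ts"] - previous["end_ts"] <= max_gap_ms
        ):
            # Extend the current burst instead of emitting another event.
            previous["delta_y"] += event["delta_y"]
            previous["end_ts"] = event["ts"]
            previous["count"] += 1
        else:
            item = dict(event)  # copy so the raw trace stays untouched
            if item["type"] == "scroll":
                item["end_ts"] = item["ts"]
                item["count"] = 1
            collapsed.append(item)
    return collapsed
```

Because the function copies each event before modifying it, the raw trace remains the source of truth and the collapsed list stays an internal view.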

Normalization should not pretend to be the final workflow.

That is the key design boundary.

## Compiler Agent Responsibilities

The Compiler Agent should:

- inspect the recorded trace,
- inspect keyframes,
- read the user intent note,
- infer the likely workflow,
- decide where the trace is ambiguous,
- ask clarification questions,
- produce a reviewable draft,
- update that draft after user feedback.

The Compiler Agent should not:

- replay the trace mechanically,
- assume every click is a workflow step,
- assume every DOM event directly maps to intent,
- skip clarification when the trace is ambiguous.

## Draft Shape

The exact final schema is still open, but the draft should eventually include:

- workflow goal,
- ordered steps,
- reasoning per step,
- evidence references back to trace events,
- clarification questions,
- approved user answers,
- final execution-ready version.

The exact JSON format is not the important point yet.
What matters is that the draft is agent-authored and user-reviewable.
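
Purely as an illustration of the fields listed above, a draft could be modeled as follows. Since the schema is still open, every class and field name in this sketch is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DraftStep:
    instruction: str
    reasoning: str
    # References back to trace events that support this step.
    evidence_event_ids: List[str] = field(default_factory=list)


@dataclass
class Clarification:
    question: str
    answer: Optional[str] = None  # filled in once the user responds


@dataclass
class WorkflowDraft:
    goal: str
    steps: List[DraftStep] = field(default_factory=list)
    clarifications: List[Clarification] = field(default_factory=list)
    approved: bool = False

    def open_questions(self) -> List[str]:
        return [c.question for c in self.clarifications if c.answer is None]

    def is_execution_ready(self) -> bool:
        # Ready only when the user approved it and nothing is left unanswered.
        return self.approved and bool(self.steps) and not self.open_questions()
```

The `is_execution_ready` check mirrors the intent that a draft becomes the final workflow only after clarification questions are answered and the user approves it.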

## UI Direction

The current recording UI should focus on trace review only.

That means:

- captured events,
- keyframes,
- history of saved recordings,
- raw event detail.

It should not present the current rule-generated workflow draft as if it were a
reliable semantic interpretation.

Workflow generation should appear later as a dedicated Compiler Agent step.

## Near-Term Implementation Plan

Next product work should proceed in this order:

1. Keep improving recording quality until trace and keyframes are trustworthy.
2. Add a post-recording user intent note step.
3. Implement a Compiler Agent that consumes trace plus intent note.
4. Build a review loop where the Compiler Agent and the user refine the draft.
5. Produce the final workflow artifact for later execution.

Recent progress on the compilation layer:

- post-recording intent note is now implemented: the user can add a short text
note after stopping a recording, saved to recording session metadata via
`POST /recordings/{id}/intent-note`.
- the Compiler Agent is implemented in `server/core/compiler_agent.py` using
the openhands-sdk `Agent` + `Conversation` pattern with three tools:
- `trace_viewer` — lets the agent navigate events incrementally (summary,
paginated event list, single event detail, keyframe screenshots,
normalized steps) instead of receiving the entire trace in one message.
- `file` (FileEditorTool) — lets the agent write the Routine file.
- `submit_workflow` — validates the Routine file structure and ends the
conversation.
- the compiler agent system prompt teaches OpenBrowser's tool vocabulary
(highlight, click, keyboard_input, scroll, etc.) and the Browser Routine
format, so the output is an executable Browser Routine.
- the Routine is pure text (no embedded images). Keyframes are only used by
the compiler agent to understand the recorded trace.
- the compile endpoint is `POST /recordings/{id}/compile`. The previous
iteration endpoint has been removed — clarification happens during
compilation as part of the agent conversation loop.
- the frontend has a "Compile Routine" button and displays the resulting
Routine markdown.
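
Because the compile endpoint streams its progress over SSE, a client has to split the stream into events before it can react to `complete` or `error`. The parser below is a minimal sketch that follows the standard `event:` / `data:` line framing of server-sent events; the `trace_path` key in the usage example is a hypothetical payload field, not a confirmed part of the API.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from an iterable of SSE text lines."""
    event_name = "message"  # default event type when no "event:" field is sent
    data_lines: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if data_lines:
                yield event_name, "\n".join(data_lines)
            event_name, data_lines = "message", []
            continue
        if line.startswith(":"):
            continue  # comment / keep-alive line
        name, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]
        if name == "event":
            event_name = value
        elif name == "data":
            data_lines.append(value)


# Usage with a canned stream; "trace_path" is a hypothetical payload key.
stream = ["event: complete\n", 'data: {"trace_path": "/tmp/rec1_1.json"}\n', "\n"]
for name, data in iter_sse_events(stream):
    print(name, json.loads(data))
```

In a real client the lines would come from the streaming response body of `POST /recordings/{id}/compile` rather than from a list.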

The next concrete work items are:

1. test the full end-to-end flow with a real recording,
2. tune the compiler agent system prompt and tool descriptions based on
real-world trace quality,
3. integrate the approved Routine with the execution layer.

## Summary

OpenBrowser should separate facts from interpretation.

- Recording captures facts.
- The user provides intent.
- The Compiler Agent produces the executable Browser Routine.
- The Routine instructs OpenBrowser what to do step by step.

That is the intended foundation for workflow execution.
3 changes: 3 additions & 0 deletions extension/eslint.config.mjs
@@ -34,6 +34,9 @@ export default [
'no-case-declarations': 'off',
'no-empty': 'off',
'no-useless-escape': 'off',
// TypeScript already resolves type references (e.g. `RequestInit`)
// that the core `no-undef` rule doesn't understand. Let tsc own this.
'no-undef': 'off',
'@typescript-eslint/no-explicit-any': 'off',
'@typescript-eslint/no-unused-vars': 'off',
},
1 change: 1 addition & 0 deletions extension/manifest.json
@@ -27,6 +27,7 @@
"host_permissions": ["<all_urls>"],
"permissions": [
"tabs",
"windows",
"tabGroups",
"activeTab",
"scripting",