softpudding · softpudding · Apr 9, 2026 · Apr 3, 2026 · Apr 5, 2026 · Apr 5, 2026
diff --git a/AGENTS.md b/AGENTS.md
@@ -253,7 +253,7 @@ The visual interaction workflow is implemented across 4 focused tools:
 |------|----------|---------|
 | `tab` | `tab init`, `tab open`, `tab close`, `tab switch`, `tab list`, `tab refresh`, `tab view`, `tab back`, `tab forward` | Session and tab management |
 | `highlight` | `highlight_elements` | Element discovery with blue overlays |
-| `element_interaction` | `click_element`, `confirm_click_element`, `hover_element`, `scroll_element`, `keyboard_input`, `confirm_keyboard_input`, `select_element` | Element interaction with 2PC only for click and keyboard input |
+| `element_interaction` | `click_element`, `confirm_click_element`, `hover_element`, `scroll_element`, `keyboard_input`, `confirm_keyboard_input`, `select_element`, `confirm_select` | Element interaction with 2PC for click, keyboard input, and select |
 | `dialog` | `handle_dialog` | Dialog handling (accept/dismiss) |
 
 ## UNIQUE PATTERNS
@@ -272,9 +272,10 @@ If operation fails twice:
 ## PERFORMANCE OPTIMIZATIONS
 
 ### Selective 2PC
-- `click_element` and `keyboard_input` require an ORANGE confirmation preview followed by `confirm_click_element` or `confirm_keyboard_input`
-- `hover_element`, `scroll_element`, `swipe_element`, and `select_element` execute immediately and return the post-action screenshot
-- Starting a different action clears any pending confirmation from a previous `click_element` or `keyboard_input`
+- `click_element`, `keyboard_input`, and `select_element` require a YELLOW confirmation preview followed by `confirm_click_element`, `confirm_keyboard_input`, or `confirm_select`
+- `select` confirmation messages also echo the chosen `value` so the agent can verify option text against the rendered `<option>` list
+- `hover_element`, `scroll_element`, and `swipe_element` execute immediately and return the post-action screenshot
+- Starting a different action clears any pending confirmation from a previous `click_element`, `keyboard_input`, or `select_element`
 
 ## SISYPHUS MODE
 
@@ -313,18 +314,39 @@ Configuration is saved to `localStorage` (key: `openbrowser_sisyphus_config`).
 ## COMMANDS
 
 ```bash
-# Start server
+# Start server (HTTP 8765, WebSocket 8766)
 uv run local-chrome-server serve
+uv run local-chrome-server serve --multi-process     # one worker process per conversation
 
 # Build extension
 cd extension && npm run build
+
+# Server tests (pytest, async mode auto, paths under server/tests)
+uv run pytest                                                                # all
+uv run pytest server/tests/unit/test_recording_routes.py                     # one file
+uv run pytest server/tests/unit/test_recording_routes.py::TestName::test_x   # one test
+uv run pytest -m integration                                                 # needs running server + extension
 ```
 
 ## SCREENSHOT BEHAVIOR
 
 OpenBrowser has explicit screenshot control for maximum flexibility:
 
 - Screenshots also serve as a practical page warmup mechanism for background tabs. They can unblock page paint and media decode work that passive DOM/readiness inspection does not reliably trigger on its own.
+- Screenshot output sizing must not rely on `Page.captureScreenshot` with `clip.scale < 1` on a live tab.
+- Reason: scaled CDP captures were reproduced to leave the visible page shrunk into the top-left corner, including during recording.
+- Preferred strategy: capture at the tab's natural device-pixel size first, then downscale/compress offline inside the extension.
+
+### Recording Keyframes
+
+- Recording keyframes must **not** be attached to `page_view` events.
+- Reason: `page_view` is emitted during content-script `resume` / `start-recording` after refresh or reload, which is earlier than a stable post-load milestone.
+- Capturing a screenshot in that early `page_view` phase was reproduced to shrink the live Chrome page into the top-left corner during recording sessions.
+- `tab_ready` should stay as a lifecycle event and must not be the sole source of recording screenshots.
+- Reason: on slow pages, users often start interacting while the tab still reports loading; waiting for `tab_ready` can miss the meaningful pre-load-complete actions entirely.
+- Prefer action-timed keyframes on `click` / `submit`, but discard them when the captured screenshot has already drifted to a different URL than the source event page. This preserves useful action context without trusting navigation-transition screenshots.
+- Recording output size limits such as `960x540` should be enforced only by offline downscale/compression after a full-size capture, never by CDP `clip.scale`.
+- Action-timed recording keyframes may include an in-image bbox/banner annotation for the acted-on element (or submitted form) so review UI can show exactly what the user just clicked or typed into.
 
 ### Commands That Return Screenshots
 
@@ -606,7 +628,8 @@ Criteria match tracked events using flexible pattern matching:
 
 ## NOTES
 
-- **Git dependencies:** `openhands-sdk` and `openhands-tools` from git subdirectories
+- **Vendored SDK:** `openhands-sdk` and `openhands-tools` are editable installs from `../agent-sdk/openhands-sdk` and `../agent-sdk/openhands-tools` (see `[tool.uv.sources]` in `pyproject.toml`). Modify those paths directly when adding agents or tools — there is no separate package to publish.
 - **CDP required:** Extension uses Chrome DevTools Protocol for screenshots/JS execution
 - **Preset coordinates:** Screenshots at 1280x720, mouse in 0-1280/0-720 coordinate system
 - **Config storage:** LLM config in `~/.openbrowser/llm_config.json`
+- **Compiler agent traces:** dumped on completion / asking / error to `~/.openbrowser/compiler_traces/{recording_id}_{timestamp}.json`. The path is included in the SSE `complete` / `error` payload from `POST /recordings/{id}/compile`. The compiler agent's `QueueVisualizer` intentionally does not pass a `conversation_id`, so debugging relies on these dump files rather than the sessions DB.
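Since the note above says compiler-agent debugging relies on dump files rather than the sessions DB, a small helper for locating the newest dump of a recording may be handy. The following is a minimal sketch, not code from this PR: the function name `latest_trace_dump` and the `base_dir` parameter are hypothetical, and it picks the newest file by modification time so that it does not have to assume any particular `{timestamp}` format.

```python
from pathlib import Path
from typing import Optional

# Directory named in the note above; everything else here is illustrative.
TRACE_DIR = Path.home() / ".openbrowser" / "compiler_traces"


def latest_trace_dump(recording_id: str, base_dir: Path = TRACE_DIR) -> Optional[Path]:
    """Return the most recently modified dump for a recording, or None.

    Matches files named `{recording_id}_{timestamp}.json`. The timestamp
    format is not documented, so files are ordered by mtime instead of
    by parsing the name.
    """
    if not base_dir.is_dir():
        return None
    candidates = list(base_dir.glob(f"{recording_id}_*.json"))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)
```

One caveat of this sketch: a recording id that is a prefix of another id followed by `_` would match both sets of files, so a stricter match may be needed depending on how ids are generated.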
diff --git a/WORKFLOW_COMPILATION.md b/WORKFLOW_COMPILATION.md
@@ -0,0 +1,296 @@
+# Recording To Workflow Compilation
+
+## Status
+
+This document records the current design direction for turning a recorded user
+trace into an executable OpenBrowser workflow.
+
+Current product status:
+
+- Recording infrastructure exists in
+  [server/api/routes/recordings.py](server/api/routes/recordings.py)
+  and
+  [server/core/recording_manager.py](server/core/recording_manager.py).
+- Browser-side recording exists in
+  [extension/src/recording/recorder.ts](extension/src/recording/recorder.ts)
+  and
+  [extension/src/content/index.ts](extension/src/content/index.ts).
+- Recording review UI exists in
+  [frontend/index.html](frontend/index.html),
+  and it now focuses on reviewing captured events and keyframes only.
+- A first-pass rule-based compiler still exists in
+  [server/core/workflow_compiler.py](server/core/workflow_compiler.py),
+  but it is no longer the intended product path for user-facing workflow draft
+  generation.
+
+Recent progress on the recording layer:
+
+- pre-action keyframes now exist for `pointerdown -> click` and `keydown Enter`,
+  so keyframes can capture the pre-navigation or pre-submit state instead of the
+  post-action page.
+- noisy root-level clicks are filtered out.
+- text input is merged into a final meaningful input result instead of one event
+  per character.
+- `page_view` is now restricted to the top-level frame only, avoiding iframe
+  noise.
+- browser history navigation is now distinguished as `tab_back` and
+  `tab_forward`, instead of collapsing everything into `tab_navigated`.
+
+The current system should be treated as:
+
+`recording trace capture and review`
+
+The remaining gap is:
+
+`recording trace -> compiler-agent-authored workflow draft -> approved workflow -> execution`
+
+## Core Decision
+
+OpenBrowser should not rely on rule code to directly turn low-level events into
+a user-facing workflow draft.
+
+Reason:
+
+- low-level events capture what physically happened,
+- but workflow compilation must express what the user meant.
+
+Examples:
+
+- `click` is a physical browser event,
+- `select market cap = largeover` is the user intent,
+- `change` is a DOM event,
+- `search for housing in Xixi Wetland` is the workflow meaning.
+
+That semantic jump is too brittle to encode as a fixed event-to-step mapping.
+
+Therefore:
+
+- raw recording trace remains the source of truth,
+- lightweight normalization may still exist as an internal cleanup layer,
+- but the user-facing workflow draft should be generated by a Compiler Agent,
+  not by deterministic trace rules.
+
+## Product Flow
+
+The intended product flow is now explicitly four stages.
+
+### Stage 1: Recording
+
+The user records the browser workflow manually.
+
+Requirements for this stage:
+
+- capture the key browser events,
+- capture meaningful keyframes,
+- keep the trace reviewable,
+- keep the trace faithful to what actually happened.
+
+This stage is the current implementation focus and should be considered the
+main completed foundation.
+
+Status update:
+
+- mostly complete as a product foundation,
+- still open for continued event-quality improvements where trace semantics are
+  obviously under-specified or noisy.
+
+### Stage 2: User Intent Note
+
+After stopping recording, the user adds a short note that explains the overall
+intent.
+
+Examples:
+
+- "Search Zhihu for posts about AI rent discussions."
+- "Filter Finviz to large-cap stocks and inspect the results."
+- "Collect useful findings and prepare a summary post."
+
+This note is required because trace alone usually does not encode the business
+rule behind the actions.
+
+This stage is part of the intended design. A first version is implemented: the
+note is saved to the recording session metadata via
+`POST /recordings/{id}/intent-note`.
+
+### Stage 3: Compiler Agent Draft
+
+A dedicated Compiler Agent reads:
+
+- the raw recording trace,
+- the keyframes,
+- the user intent note,
+- and optionally an internal normalized trace.
+
+It then produces a workflow draft.
+
+This draft should describe:
+
+- the intended steps,
+- the reasoning behind the steps,
+- the unresolved ambiguities,
+- and the questions that must be answered before execution.
+
+Important constraint:
+
+The Compiler Agent is not executing the browser at this stage. It is only
+interpreting and compiling the demonstrated workflow.
+
+A first implementation of this stage exists in `server/core/compiler_agent.py`.
+
+### Stage 4: Draft Iteration And Finalization
+
+The Compiler Agent and the user iterate on the workflow draft.
+
+The user should be able to:
+
+- correct the draft,
+- answer clarification questions,
+- refine ambiguous rules,
+- and approve the final version.
+
+The result of this stage is the final workflow that later execution will use.
+
+## What Recording Must Guarantee
+
+The current recording system does not need to solve workflow semantics.
+
+It only needs to be strong on the following:
+
+- accurate event capture,
+- accurate keyframe capture,
+- noise reduction where clearly justified,
+- stable review of the trace after recording.
+
+That means the recording layer is responsible for facts, not interpretation.
+
+Examples of facts:
+
+- which page was open,
+- which element was clicked,
+- what text was entered,
+- what keyframe was visible before or during the interaction,
+- what scroll or form change happened.
+
+Interpretation belongs to the Compiler Agent, not the recorder.
+
+## Role Of Normalization
+
+Normalization may still exist, but only as an internal helper.
+
+Its purpose is to make the trace easier for the Compiler Agent to consume.
+
+Examples:
+
+- remove obvious iframe or ad noise,
+- group closely related low-level events,
+- dedupe focus and click when they represent the same interaction,
+- collapse scroll bursts,
+- keep supporting events attached to a primary interaction.
+
+Normalization should not pretend to be the final workflow.
+
+That is the key design boundary.
+
+## Compiler Agent Responsibilities
+
+The Compiler Agent should:
+
+- inspect the recorded trace,
+- inspect keyframes,
+- read the user intent note,
+- infer the likely workflow,
+- decide where the trace is ambiguous,
+- ask clarification questions,
+- produce a reviewable draft,
+- update that draft after user feedback.
+
+The Compiler Agent should not:
+
+- replay the trace mechanically,
+- assume every click is a workflow step,
+- assume every DOM event directly maps to intent,
+- skip clarification when the trace is ambiguous.
+
+## Draft Shape
+
+The exact final schema is still open, but the draft should eventually include:
+
+- workflow goal,
+- ordered steps,
+- reasoning per step,
+- evidence references back to trace events,
+- clarification questions,
+- approved user answers,
+- final execution-ready version.
+
+The important point is not the exact JSON format yet.
+The important point is that the draft is agent-authored and user-reviewable.
+
+## UI Direction
+
+The current recording UI should focus on trace review only.
+
+That means:
+
+- captured events,
+- keyframes,
+- history of saved recordings,
+- raw event detail.
+
+It should not present the current rule-generated workflow draft as if it were a
+reliable semantic interpretation.
+
+Workflow generation should appear later as a dedicated Compiler Agent step.
+
+## Near-Term Implementation Plan
+
+Next product work should proceed in this order:
+
+1. Keep improving recording quality until trace and keyframes are trustworthy.
+2. Add a post-recording user intent note step.
+3. Implement a Compiler Agent that consumes trace plus intent note.
+4. Build a review loop where the Compiler Agent and the user refine the draft.
+5. Produce the final workflow artifact for later execution.
+
+Recent progress on the compilation layer:
+
+- post-recording intent note is now implemented: the user can add a short text
+  note after stopping a recording, saved to recording session metadata via
+  `POST /recordings/{id}/intent-note`.
+- the Compiler Agent is implemented in `server/core/compiler_agent.py` using
+  the openhands-sdk `Agent` + `Conversation` pattern with three tools:
+  - `trace_viewer` — lets the agent navigate events incrementally (summary,
+    paginated event list, single event detail, keyframe screenshots,
+    normalized steps) instead of receiving the entire trace in one message.
+  - `file` (FileEditorTool) — lets the agent write the Routine file.
+  - `submit_workflow` — validates the Routine file structure and ends the
+    conversation.
+- the compiler agent system prompt teaches OpenBrowser's tool vocabulary
+  (highlight, click, keyboard_input, scroll, etc.) and the Browser Routine
+  format, so the output is an executable Browser Routine.
+- the Routine is pure text (no embedded images). Keyframes are only used by
+  the compiler agent to understand the recorded trace.
+- the compile endpoint is `POST /recordings/{id}/compile`. The previous
+  iteration endpoint has been removed — clarification happens during
+  compilation as part of the agent conversation loop.
+- the frontend has a "Compile Routine" button and displays the resulting
+  Routine markdown.
+
+The next concrete work items are:
+
+1. test the full end-to-end flow with a real recording,
+2. tune the compiler agent system prompt and tool descriptions based on
+   real-world trace quality,
+3. integrate the approved Routine with the execution layer.
+
+## Summary
+
+OpenBrowser should separate facts from interpretation.
+
+- Recording captures facts.
+- The user provides intent.
+- The Compiler Agent produces the executable Browser Routine.
+- The Routine instructs OpenBrowser what to do step by step.
+
+That is the intended foundation for workflow execution.
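The "Role Of Normalization" section of the new document lists "collapse scroll bursts" as one internal cleanup step. As an illustration only, here is a minimal sketch of what such a step could look like. The event shape (dicts with `type` and a millisecond `timestamp`), the function name `collapse_scroll_bursts`, the `max_gap_ms` threshold, and the `burst_started_at` / `burst_count` fields are all assumptions made for this example and are not taken from the recorder or from `workflow_compiler.py`.

```python
from typing import Any, Dict, List

Event = Dict[str, Any]  # hypothetical event shape: {"type": ..., "timestamp": ms, ...}


def collapse_scroll_bursts(events: List[Event], max_gap_ms: int = 500) -> List[Event]:
    """Merge runs of consecutive scroll events into one event per burst.

    Two scroll events belong to the same burst when they are adjacent in
    the trace and at most `max_gap_ms` apart. The merged event keeps the
    fields of the last scroll in the burst (the final position) and
    records when the burst started and how many raw events it covers.
    Input events are copied, never mutated.
    """
    collapsed: List[Event] = []
    for event in events:
        prev = collapsed[-1] if collapsed else None
        same_burst = (
            prev is not None
            and event.get("type") == "scroll"
            and prev.get("type") == "scroll"
            and event["timestamp"] - prev["timestamp"] <= max_gap_ms
        )
        if same_burst:
            merged = dict(event)
            merged["burst_started_at"] = prev.get("burst_started_at", prev["timestamp"])
            merged["burst_count"] = prev.get("burst_count", 1) + 1
            collapsed[-1] = merged
        else:
            collapsed.append(dict(event))
    return collapsed
```

A step like this stays on the "facts" side of the boundary the document draws: it reduces noise for the Compiler Agent without deciding what the scroll meant.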
diff --git a/extension/eslint.config.mjs b/extension/eslint.config.mjs
@@ -34,6 +34,9 @@ export default [
       'no-case-declarations': 'off',
       'no-empty': 'off',
       'no-useless-escape': 'off',
+      // TypeScript already resolves type references (e.g. `RequestInit`)
+      // that the core `no-undef` rule doesn't understand. Let tsc own this.
+      'no-undef': 'off',
       '@typescript-eslint/no-explicit-any': 'off',
       '@typescript-eslint/no-unused-vars': 'off',
     },

diff --git a/extension/manifest.json b/extension/manifest.json
@@ -27,6 +27,7 @@
   "host_permissions": ["<all_urls>"],
   "permissions": [
     "tabs",
+    "windows",
     "tabGroups",
     "activeTab",
     "scripting",