softpudding
diff --git a/AGENTS.md b/AGENTS.md
Lines changed: 62 additions & 4 deletions
diff --git a/README.md b/README.md
Lines changed: 30 additions & 0 deletions
diff --git a/eval/AGENTS.md b/eval/AGENTS.md
Lines changed: 73 additions & 0 deletions
@@ -1,12 +1,12 @@
 # OpenBrowser Project Knowledge Base
 
-**Generated:** 2026-03-16
-**Commit:** 8836b0b (main)
+**Generated:** 2026-04-10
+**Commit:** 25b3a2e (main)
 **Stack:** Python 3.12+ (FastAPI) + TypeScript (Chrome Extension MV3)
 
 ## OVERVIEW
 
-Visual AI assistant for browser automation powered by Qwen3.5-Plus (primary) with Qwen3.5-Flash support as a cost-effective alternative. Provides AI-powered visual understanding and interaction for web automation, data extraction, and interactive workflows. Single-model automation loop: visual perception → decision making → browser interaction → verification.
+Visual AI assistant for browser automation powered by Qwen3.5-Plus (primary) with Qwen3.5-Flash support as a cost-effective alternative. Provides AI-powered visual understanding and interaction for web automation, data extraction, interactive workflows, and a record -> compile -> replay pipeline for reusable Browser Routines. Single-model automation loop: visual perception → decision making → browser interaction → verification.
 
 ## STRUCTURE
 
@@ -19,6 +19,7 @@ OpenBrowser/
 │   └── websocket/    # WebSocket server
 ├── extension/        # Chrome extension (MV3) for browser control
 ├── frontend/         # Static web UI (HTML)
+├── eval/             # Mock sites + routine compile/replay evaluation
 └── reference/        # External SDK references (read-only)
 ```
 
@@ -30,23 +31,33 @@ OpenBrowser/
 | Browser commands | `server/core/processor.py` | Command routing, multi-session |
 | Dialog handling | `server/models/commands.py` | HandleDialogCommand, DialogAction |
 | REST API routes | `server/api/routes/` | FastAPI endpoints |
+| Recording routes | `server/api/routes/recordings.py` | Recording lifecycle, workflow draft, compiler, finalize |
+| Routine routes | `server/api/routes/routines.py` | Saved Browser Routine CRUD for replay |
 | Browser UUID routing | `server/api/routes/browsers.py` | Browser UUID registration and validation |
 | WebSocket handling | `server/websocket/manager.py` | Extension communication |
 | Browser UUID registry | `server/core/uuid_manager.py` | `uuid -> websocket` capability mapping |
 | Command models | `server/models/commands.py` | Pydantic command/response types |
+| Recording persistence | `server/core/recording_manager.py` | SQLite recording sessions/events, immutability boundaries |
+| Workflow draft compiler | `server/core/workflow_compiler.py` | Normalize raw recording traces into high-level draft steps/IR |
+| Compiler Agent | `server/core/compiler_agent.py` | TraceViewer, clarify-with-user loop, Routine validation |
+| Routine persistence | `server/core/routine_manager.py` | Saved routines linked back to source recordings |
 | **Prompt templates** | `server/agent/prompts/` | **Jinja2 templates for agent prompts** |
 | Tab tool | `server/agent/tools/tab_tool.py` | TabTool for tab management |
 | Highlight tool | `server/agent/tools/highlight_tool.py` | HighlightTool for element discovery |
 | Element interaction | `server/agent/tools/element_interaction_tool.py` | ElementInteractionTool with 2PC flow |
 | Dialog tool | `server/agent/tools/dialog_tool.py` | DialogTool for dialog handling |
 | ToolSet aggregator | `server/agent/tools/toolset.py` | OpenBrowserToolSet aggregates all 4 tools |
 | Extension entry | `extension/src/background/index.ts` | Command handler, dialog processing |
+| Extension recorder | `extension/src/recording/recorder.ts` | Recording scope, event capture, keyframe upload |
+| Recording keyframe policy | `extension/src/recording/keyframe-policy.ts` | Which events get screenshots and when drift is discarded |
 | Dialog manager | `extension/src/commands/dialog.ts` | CDP dialog events, cascading |
 | JavaScript execution | `extension/src/commands/javascript.ts` | CDP Runtime.evaluate, dialog race |
 | Screenshot capture | `extension/src/commands/screenshot.ts` | CDP Page.captureScreenshot |
 | Tab management | `extension/src/commands/tab-manager.ts` | Session isolation, tab groups |
 | UUID page | `extension/src/uuid/uuidPage.ts` | Browser UUID display and registration status |
-| Frontend chat UI | `frontend/index.html` | Browser UUID input, conversation UI, Sisyphus |
+| Frontend recording/replay UI | `frontend/index.html` | Browser UUID input, recording panel, compile flow, saved routines, slash-menu replay |
+| Routine evaluation | `eval/routine_eval/` | Compile-track + replay-track eval harness for record/replay |
+
 ## ARCHITECTURE
 
 ```
@@ -96,6 +107,53 @@ OpenBrowser now uses the browser UUID as a capability token, not just an interna
 - Frontend flow lives in `frontend/index.html`
 - UUID registration and validation live in `server/api/routes/browsers.py`, `server/core/uuid_manager.py`, and `server/websocket/manager.py`
 
+## RECORD & REPLAY DESIGN
+
+OpenBrowser's record/replay system is deliberately not a raw event replayer. The recording trace is evidence used to understand what the human did, compile a reusable Browser Routine, and debug failures later. Replay runs that compiled Routine as a fresh agent session.
+
+### Pipeline
+```
+1. POST /recordings
+2. Extension recorder starts in `dedicated_window` (default) or `current_window`
+3. Recorder captures scoped browser events + selective keyframes
+4. POST /recordings/{id}/events persists rows while the session is ACTIVE
+5. POST /recordings/{id}/stop freezes the trace
+6. GET /recordings/{id}/workflow-draft builds normalized steps / workflow IR
+7. POST /recordings/{id}/compile runs the Compiler Agent over raw events, keyframes, normalized steps, and `intent_note`
+8. Compiler may ask clarification questions, then emits validated Routine markdown
+9. POST /recordings/{id}/compile/finalize saves a named Routine in `routines`
+10. Frontend replay starts a fresh conversation with `mode="routine_replay"` and sends the Routine markdown as the first message
+```
+
+### Core Design Rules
+- Replay is **NOT** low-level click/scroll/input playback
+- Raw recording events are a source artifact for review, compilation, and debugging
+- `workflow-draft` is intermediate IR for review/compiler context, not the final replay format
+- The executable replay artifact is the finalized Routine markdown saved in `routines`
+- Saved routines keep a back-reference to `source_recording_id`
+
+### Recording Invariants
+- Only one ACTIVE recording may exist per browser UUID
+- Default launch mode is `dedicated_window`; `current_window` is opt-in
+- The recorder owns a recording scope (window/group/tab set) and automatically absorbs new in-scope tabs
+- `recording_started` and `recording_stopped` are ambient lifecycle events; they should not compile into replay steps
+- Once a recording leaves ACTIVE, `/recordings/{id}/events` must reject late async uploads so the reviewed trace stays immutable
+- If the browser websocket is gone at stop time, the server marks the row STOPPED locally with `stop_reason=browser_disconnected` instead of leaving it stranded ACTIVE
+- `page_view` intentionally does **not** carry a keyframe; early lifecycle captures were observed to distort the live Chrome page
+- Keyframes are selective: mainly `click`, `change`, `submit`, and input-like `focus`. Some click/enter flows capture before the action, and a post-action screenshot is discarded if the page has already drifted to a different URL by capture time
+- Input/focus noise is merged before review/compiler consumption so the trace reflects intent instead of every transient keystroke
+
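The recording invariants above can be illustrated with a minimal in-memory sketch. This is not the code in `server/core/recording_manager.py`: the `RecordingStore` class, its method names, and the exception types are hypothetical, and the real implementation persists to SQLite. Only the `ACTIVE`/`STOPPED` states and `stop_reason=browser_disconnected` come from the text.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class RecordingState(Enum):
    ACTIVE = "ACTIVE"
    STOPPED = "STOPPED"


@dataclass
class Recording:
    recording_id: str
    browser_uuid: str
    state: RecordingState = RecordingState.ACTIVE
    stop_reason: Optional[str] = None
    events: List[dict] = field(default_factory=list)


class RecordingStore:
    """Hypothetical in-memory stand-in for the SQLite-backed manager."""

    def __init__(self) -> None:
        self._by_id: Dict[str, Recording] = {}

    def start(self, recording_id: str, browser_uuid: str) -> Recording:
        # Invariant: only one ACTIVE recording per browser UUID.
        for rec in self._by_id.values():
            if rec.browser_uuid == browser_uuid and rec.state is RecordingState.ACTIVE:
                raise ValueError("browser already has an ACTIVE recording")
        rec = Recording(recording_id, browser_uuid)
        self._by_id[recording_id] = rec
        return rec

    def append_event(self, recording_id: str, event: dict) -> None:
        rec = self._by_id[recording_id]
        # Invariant: late uploads are rejected once the trace is frozen.
        if rec.state is not RecordingState.ACTIVE:
            raise PermissionError("recording is no longer ACTIVE")
        rec.events.append(event)

    def stop(self, recording_id: str, browser_connected: bool = True) -> Recording:
        rec = self._by_id[recording_id]
        rec.state = RecordingState.STOPPED
        # If the websocket is gone, stop locally instead of stranding the row ACTIVE.
        if not browser_connected:
            rec.stop_reason = "browser_disconnected"
        return rec
```

In this sketch, stopping a recording both frees the browser UUID for a new recording and makes every later `append_event` call fail, which is the property that keeps the reviewed trace immutable.
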
+### Replay Invariants
+- Replay always starts a **fresh conversation** with metadata `mode="routine_replay"`
+- `routine_replay_mode` is a server-side flag propagated from session metadata into system prompt rendering; the model never infers replay mode from free-form text
+- The frontend replay entry points are the saved-routine launcher and the `/` slash-menu routine picker in `frontend/index.html`
+- Small-model `highlight_elements(keywords=...)` is only allowed in routine replay, and the keyword token must be copied verbatim from the active Routine step's `**Keywords:**` line
+
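The verbatim-keyword rule can be sketched as a small check. The Routine markdown layout is not shown in this diff, so the sketch assumes the `**Keywords:**` line holds a comma-separated list; `step_keywords` and `is_allowed_keyword` are hypothetical helper names, not functions from the repo.

```python
import re
from typing import List

# Assumption: a step contains one line of the form "**Keywords:** a, b, c".
KEYWORDS_LINE = re.compile(r"^\s*(?:[-*]\s*)?\*\*Keywords:\*\*\s*(.*)$", re.MULTILINE)


def step_keywords(step_markdown: str) -> List[str]:
    """Return the keyword tokens listed on the step's Keywords line."""
    match = KEYWORDS_LINE.search(step_markdown)
    if match is None:
        return []
    return [part.strip() for part in match.group(1).split(",") if part.strip()]


def is_allowed_keyword(step_markdown: str, token: str, replay_mode: bool) -> bool:
    """Allow a token only in replay mode and only if it is copied verbatim."""
    if not replay_mode:
        return False
    # Exact, case-sensitive comparison: no normalization or fuzzy matching.
    return token in step_keywords(step_markdown)
```

The comparison is deliberately exact, so a token that differs only in case is rejected, and any token is rejected outside replay mode.
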
+### Evaluation Hooks
+- `eval/routine_eval/` has a compile track and a replay track
+- The compile track can ingest fixture traces through `POST /recordings/ingest`, gated by `OPENBROWSER_ENABLE_TEST_ROUTES=1`
+- The replay track executes golden routines in `routine_replay` mode on the mock sites
+
 ## DIALOG HANDLING
 
 When JavaScript triggers a dialog (alert/confirm/prompt), the browser pauses.
 
@@ -11,6 +11,7 @@ It treats browser automation as a visual and interactive systems problem, not ju
 OpenBrowser is built around that view:
 
 - Operate pages visually through screenshots and direct browser actions
+- Turn manual browser demonstrations into reusable routines through record -> compile -> replay
 - Keep browser execution isolated from the control window
 - Evaluate continuously on mocked sites and real workflows
 - Treat model cost as a first-class engineering constraint
@@ -68,6 +69,17 @@ OpenBrowser is not iterated by vibe alone. The repo includes mocked websites wit
 
 Model capability matters, but so does price. We do not assume token costs stay cheap forever. OpenBrowser is developed with that constraint in mind, including separate handling for stronger and cheaper models.
 
+## Record, Compile, Replay
+
+OpenBrowser is not only a free-form browser agent. It can also turn a human demonstration into a reusable Browser Routine.
+
+1. Record. The frontend calls `/recordings` to start the extension recorder. The recorder scopes itself to a dedicated recording window by default, captures browser events, and attaches selective keyframes for meaningful actions.
+2. Review. After stopping, the UI shows the trace, folded supporting events, and captured keyframes. You can also save a short intent note that explains what the workflow was trying to accomplish.
+3. Compile. `/recordings/{id}/compile` runs a Compiler Agent over the raw trace, normalized high-level steps, and keyframes. If the trace is ambiguous, it asks clarification questions before producing validated Routine markdown.
+4. Replay. Finalizing the compile stores a named Browser Routine under `/routines`. Running that routine starts a fresh conversation in `routine_replay` mode and executes the high-level Routine, not the raw event stream.
+
+Important design rule: replay is not literal event playback. The recording trace is evidence used to compile and debug the workflow; the saved Routine is the executable artifact.
+
 ## Evaluation
 
 The primary evaluation signal in this repo is the latest checked-in report:
@@ -101,6 +113,11 @@ Older side-by-side comparisons with OpenClaw are kept only as archived context:
 
 Those archived results are still useful for historical tradeoff discussion, but they are not the main metric we optimize against now.
 
+For the record/replay pipeline, the repo also includes a dedicated routine evaluation harness under [`eval/routine_eval/`](eval/routine_eval/README.md):
+
+- Compile track: does a recording become the right Routine, with good clarification behavior?
+- Replay track: does a saved Routine execute end-to-end in `routine_replay` mode?
+
 ### Run Your Own Evaluation
 
 ```bash
@@ -235,6 +252,18 @@ The permission flow is:
 
 This means browser control is authorized by possession of the UUID capability token.
 
+#### 8. Record and Replay a Workflow
+
+Once the frontend and extension are connected:
+
+1. Click `Record` -> `Start recording`
+2. Perform the workflow manually in the recording browser window, then stop the recording
+3. Review the captured trace and keyframes, and add an intent note if the goal needs extra context
+4. Click `Compile Routine`, answer any clarification questions, and finalize the result with a name
+5. Run the saved routine from the routine launcher or by typing `/` in the command box to insert it
+
+Routine runs always start a fresh conversation in `routine_replay` mode so replay stays separate from free-form chat sessions.
+
 ---
 
 ### Try OpenBrowser with SKILL - install to your local agents
@@ -291,6 +320,7 @@ Browser agents are only useful if they remain practical to run. OpenBrowser ther
 
 - **Visual AI Automation**: See and interact with web pages using AI-powered visual recognition
 - **Browser Control**: Click, type, scroll, and navigate through visual understanding and JavaScript execution
+- **Record -> Compile -> Replay**: Capture a manual browser workflow, compile it into validated Routine markdown, and rerun it as a reusable task
 - **Tab Management**: Open, close, switch, and manage browser tabs with session isolation
 - **Data Extraction**: Scrape and collect data from websites with AI understanding of page structure
 - **Form Filling & Submission**: Automatically fill forms, submit data, and handle multi-step workflows
 
@@ -0,0 +1,73 @@
+# eval/ — Agent Notes
+
+Implementation knowledge for the mock sites under `eval/`. Repo-wide conventions live in `../AGENTS.md`. Read **both** before editing sites or test cases.
+
+## Source of truth
+
+- **`SPEC_NEW_SITES.md`** is the design brief that generated the four post-2026-03 sites: `mapquest/`, `staybnb/`, `taskflow/`, `vidhub/`. When regenerating or extending a site, re-read the relevant section there instead of reverse-engineering from the rendered HTML. The spec pins the **intended behaviours**; the HTML only reflects what actually shipped.
+- Test-case YAMLs in `dataset/` are the second source of truth — a site change that breaks a criterion's expected event shape is a site bug, not a criterion bug, unless the spec disagrees.
+
+## Tracker contract
+
+- Every site uses the shared `eval/js/tracker.js` — `window.tracker = new AgentTracker('<site>', '<difficulty>')`. Do not create site-local trackers.
+- Emitted event values must match YAML criteria **exactly**, including case. Normalize at the emit site, not in the criterion. Concrete case: StayBnB amenity checkboxes render "Wifi"/"Kitchen" but `staybnb_book.yaml` expects `amenity: "wifi"`. The fix is `.toLowerCase()` in the tracker call, not capitalizing the YAML.
+- When instruction text references a button label, the label must be the literal DOM text. StayBnB's filter apply button reads "Show N stays" (live count), so the instruction and criterion description both say "Show N stays" — never a generic "Apply".
+
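A minimal sketch of why normalization belongs at the emit site: if the scorer compares field values with plain equality, a label-cased value never matches. The scorer's real matching code is not part of this diff, so `criterion_matches`, `emit_amenity_toggle`, and the `amenity_toggle` event name are assumptions for illustration; only `amenity: "wifi"` and the "Wifi" label come from the text.

```python
def criterion_matches(event: dict, expected: dict) -> bool:
    """True when every expected field is present with an exactly equal value."""
    return all(key in event and event[key] == value for key, value in expected.items())


def emit_amenity_toggle(label: str) -> dict:
    # Normalize where the event is emitted, mirroring `.toLowerCase()` in the tracker call.
    return {"type": "amenity_toggle", "amenity": label.lower()}


expected = {"amenity": "wifi"}
raw_event = {"type": "amenity_toggle", "amenity": "Wifi"}  # value taken straight from the label
normalized_event = emit_amenity_toggle("Wifi")
```

Here `raw_event` fails the criterion while `normalized_event` passes, with no change to the YAML side.
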
+## Default-state events — do not auto-credit
+
+Tempting anti-pattern: if a criterion expects `route_select` / `transport_mode_select` but the UI **pre-selects** the default option (shortest route, drive mode, etc.), a user who agrees with the default never clicks, so no event fires and the criterion would fail. The tempting "fix" is to emit a synthetic `*_select` on state entry with `defaultSelected: true`.
+
+**Don't do this.** The synthetic event also fires when the agent does nothing, so the criterion passes for a no-op run — and in tests like `mapquest_nearby_pins` where "Choose driving mode" is scored, the agent gets credit without ever identifying the icon. It silently dilutes the test signal.
+
+**Rules:**
+
+1. Criteria must match only on explicit user interaction events. No state-entry auto-emits for `*_select` style events.
+2. If a test asks the agent to click a pre-selected default, it is the test author's job to make the target unambiguous. Either:
+   - Change the task to select a **non-default** option (e.g., `mode: "walk"` instead of `"drive"`), or
+   - Pin the criterion to specific field values (e.g., `routeIndex: 0`) so explicit clicks still match, and accept that re-clicking the default is part of the task.
+3. When naming the criterion "Select the shortest route", pin `routeIndex: 0` in the YAML so the scorer distinguishes shortest from non-shortest.
+
+History: `mapquest.js` briefly emitted both events as state-entry defaults; the pattern was removed after a review showed `mapquest_nearby_pins` was auto-crediting `transport_mode_select: drive` on directions-panel entry.
+
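The auto-credit problem can be reproduced with a toy scorer. This is a sketch under assumptions: `score` and `enter_directions_panel` are hypothetical, and the event shape (`type`, `mode`) is illustrative; `transport_mode_select`, `drive`, and `defaultSelected` are taken from the text above.

```python
from typing import List


def score(events: List[dict], criteria: List[dict]) -> int:
    """Count criteria satisfied by at least one event (exact field match)."""
    def matches(event: dict, expected: dict) -> bool:
        return all(k in event and event[k] == v for k, v in expected.items())

    return sum(1 for c in criteria if any(matches(e, c) for e in events))


def enter_directions_panel(auto_emit_default: bool) -> List[dict]:
    """Events produced by merely opening the panel, with no agent clicks."""
    if auto_emit_default:
        # Anti-pattern: synthetic select event fired on state entry.
        return [{"type": "transport_mode_select", "mode": "drive", "defaultSelected": True}]
    return []


criteria = [{"type": "transport_mode_select", "mode": "drive"}]
noop_with_auto_emit = score(enter_directions_panel(True), criteria)
noop_without_auto_emit = score(enter_directions_panel(False), criteria)
```

With the synthetic emit, a run in which the agent does nothing still scores the criterion; without it, only an explicit click event can satisfy it.
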
+## Panel-state machines vs page routes
+
+MapQuest and StayBnB both have panels whose views swap in place rather than navigating. Keep state transitions **stateful** (class toggles on panel containers, `switchPanelState(...)`), not URL-based — the spec intentionally tests panel-state navigation.
+
+Consequence: if two tests need the same control in different panel states, duplicate the DOM rather than routing between states. Concrete case: `mapquest_navigate` wants the Directions flow inside `place-detail`, but `mapquest_nearby_pins` wants the category chip bar visible from inside `place-detail` too. The chip bar is duplicated inside the place-detail state; active-class sync is done across both copies via `document.querySelectorAll('.chip[data-category="..."]')`.
+
+## Deep-link entry points
+
+Some tests start pre-loaded into a non-home view so the agent doesn't have to traverse navigation it isn't being scored on. StayBnB supports `#results` in `staybnb/js/staybnb.js#init()` to jump straight into results with Tokyo listings rendered. The test YAML's `start_url` ends with `/staybnb/#results`. Add a similar hash handler whenever a new test needs a non-home starting view.
+
+## Stacking contexts — popovers and headers
+
+Popovers anchored inside a header are confined to the header's effective z-order because the header's own `z-index` creates a stacking context. If a full-screen dismiss backdrop sits above the header's z-index, clicks meant for the popover input land on the backdrop instead and close the popover immediately.
+
+Concrete case: StayBnB's search pill popovers were unclickable until `.topbar` was raised from `z-index: 100` to `300`, above the `.popover-backdrop` at `150`. When adding any overlay that uses a backdrop pattern, verify the anchoring element's stacking context dominates the backdrop.
+
+## Real images
+
+Agents occasionally need real visual content (thumbnails, destination cards, gallery photos). Use **Lorem Picsum seeded URLs** so images are deterministic across runs:
+
+```
+https://picsum.photos/seed/<stable-seed>/<W>/<H>
+```
+
+Seeds used so far live in `staybnb/index.html` (home cards) and `staybnb/js/staybnb.js` (results + detail + gallery — seeds of form `staybnb-<listingId>-<k>`). Prefer `background-image: url(...)` with `background-size: cover` so the layout tolerates any aspect ratio.
+
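A small sketch of building seeded image URLs in the pattern above. The helper names and the `tokyo-1` listing ID are hypothetical; the URL template and the `staybnb-<listingId>-<k>` seed form are the ones stated in the text.

```python
from urllib.parse import quote


def staybnb_seed(listing_id: str, k: int) -> str:
    """Seed in the documented form staybnb-<listingId>-<k>."""
    return f"staybnb-{listing_id}-{k}"


def picsum_url(seed: str, width: int, height: int) -> str:
    """Lorem Picsum URL that returns the same image for the same seed."""
    # Percent-encode the seed so unusual characters cannot break the path.
    return f"https://picsum.photos/seed/{quote(seed, safe='')}/{width}/{height}"


url = picsum_url(staybnb_seed("tokyo-1", 0), 400, 300)
```

Because the seed is derived only from stable identifiers, regenerating a page yields the same image URLs across runs.
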
+## Drag-and-drop primitives
+
+- **StayBnB price slider** is dual-handle. Each handle drag emits `price_slider_change` with `handle: "min"` or `handle: "max"`. The two criteria are independent, so the handles must be draggable independently — don't couple them.
+- **TaskFlow cards** use HTML5 drag-and-drop, not click-to-move. Criteria expect `card_drop` with source and destination column IDs.
+
+## Manual-test harness workflow
+
+When validating a new or changed test via `evaluate_browser_agent.py --manual`:
+
+1. `mkfifo /tmp/eval_in` and hold the write end open with a background `sleep` so the harness's `stdin` stays open across multiple tests.
+2. The harness clears events at instruction display time — timing starts then, not at page load.
+3. Sending `ok\n` to the FIFO completes a test and prints the score breakdown. A criterion that reports 0 despite visibly correct behaviour is almost always an event-shape mismatch (case, missing field, or default-state pattern above).
+
+## Scope discipline
+
+Keep site code minimal and aligned with the spec's stated challenges. Do **not** add extra animations, fallbacks, or abstractions beyond what a criterion exercises. The sites exist to probe specific agent weaknesses; incidental complexity dilutes the signal and creates spurious event noise.