Visual AI assistant for browser automation powered by Qwen3.5-Plus (primary) with Qwen3.5-Flash support as a cost-effective alternative. Provides AI-powered visual understanding and interaction for web automation, data extraction, interactive workflows, and a record -> compile -> replay pipeline for reusable Browser Routines. Single-model automation loop: visual perception → decision making → browser interaction → verification.
## STRUCTURE
```
OpenBrowser/
├── server/
│   └── websocket/       # WebSocket server
├── extension/           # Chrome extension (MV3) for browser control
```
OpenBrowser now uses the browser UUID as a capability token, not just an internal identifier:
- Frontend flow lives in `frontend/index.html`
- UUID registration and validation live in `server/api/routes/browsers.py`, `server/core/uuid_manager.py`, and `server/websocket/manager.py`
## RECORD & REPLAY DESIGN
OpenBrowser's record/replay system is deliberately not a raw event replayer. The recording trace is evidence used to understand what the human did, compile a reusable Browser Routine, and debug failures later. Replay runs that compiled Routine as a fresh agent session.
### Pipeline
```
1. POST /recordings
2. Extension recorder starts in `dedicated_window` (default) or `current_window`
4. POST /recordings/{id}/events persists rows while the session is ACTIVE
5. POST /recordings/{id}/stop freezes the trace
6. GET /recordings/{id}/workflow-draft builds normalized steps / workflow IR
7. POST /recordings/{id}/compile runs the Compiler Agent over raw events, keyframes, normalized steps, and `intent_note`
8. Compiler may ask clarification questions, then emits validated Routine markdown
9. POST /recordings/{id}/compile/finalize saves a named Routine in `routines`
10. Frontend replay starts a fresh conversation with `mode="routine_replay"` and sends the Routine markdown as the first message
```
### Core Design Rules
- Replay is **NOT** low-level click/scroll/input playback
- Raw recording events are a source artifact for review, compilation, and debugging
- `workflow-draft` is intermediate IR for review/compiler context, not the final replay format
- The executable replay artifact is the finalized Routine markdown saved in `routines`
- Saved routines keep a back-reference to `source_recording_id`
### Recording Invariants
- Only one ACTIVE recording may exist per browser UUID
- Default launch mode is `dedicated_window`; `current_window` is opt-in
- The recorder owns a recording scope (window/group/tab set) and automatically absorbs new in-scope tabs
- `recording_started` and `recording_stopped` are ambient lifecycle events; they should not compile into replay steps
- Once a recording leaves ACTIVE, `/recordings/{id}/events` must reject late async uploads so the reviewed trace stays immutable
- If the browser websocket is gone at stop time, the server marks the row STOPPED locally with `stop_reason=browser_disconnected` instead of leaving it stranded ACTIVE
- `page_view` intentionally does **not** carry a keyframe; early lifecycle captures were observed to distort the live Chrome page
- Keyframes are selective: mainly `click`, `change`, `submit`, and input-like `focus`; some click/enter flows use pre-action captures, and post-action screenshots are discarded if the capture already drifted to a different URL
- Input/focus noise is merged before review/compiler consumption so the trace reflects intent instead of every transient keystroke
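The input/focus merging rule can be sketched as a small pure function: consecutive input events on the same target collapse into the last one. The event shape and field names here are assumptions for illustration, not the recorder's actual schema:

```python
def merge_input_noise(events):
    """Collapse consecutive input events on the same target into one,
    keeping only the final value, so the trace reflects intent rather
    than every transient keystroke. Event shape is hypothetical."""
    merged = []
    for ev in events:
        if (merged
                and ev.get("type") == "input"
                and merged[-1].get("type") == "input"
                and merged[-1].get("target") == ev.get("target")):
            merged[-1] = ev  # a later keystroke supersedes the earlier one
        else:
            merged.append(ev)
    return merged
```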
### Replay Invariants
- Replay always starts a **fresh conversation** with metadata `mode="routine_replay"`
- `routine_replay_mode` is a server-side flag propagated from session metadata into system prompt rendering; the model never infers replay mode from free-form text
- The frontend replay entry points are the saved-routine launcher and the `/` slash-menu routine picker in `frontend/index.html`
- Small-model `highlight_elements(keywords=...)` is only allowed in routine replay, and the token must be copied verbatim from the active Routine step's `**Keywords:**` line
### Evaluation Hooks
- `eval/routine_eval/` has a compile track and a replay track
- The compile track can ingest fixture traces through `POST /recordings/ingest`, gated by `OPENBROWSER_ENABLE_TEST_ROUTES=1`
- The replay track executes golden routines in `routine_replay` mode on the mock sites
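The environment-variable gate can be sketched as a single predicate; the real route wiring lives server-side and may differ, so treat this as an assumption about the check, not its actual location:

```python
import os


def ingest_routes_enabled(environ=None) -> bool:
    """Fixture-ingest routes such as POST /recordings/ingest should only
    be mounted when the flag is set to exactly "1" (sketch)."""
    env = os.environ if environ is None else environ
    return env.get("OPENBROWSER_ENABLE_TEST_ROUTES") == "1"
```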
## DIALOG HANDLING
When JavaScript triggers a dialog (alert/confirm/prompt), the browser pauses.
---
OpenBrowser treats browser automation as a visual and interactive systems problem, and it is built around that view:
- Operate pages visually through screenshots and direct browser actions
- Turn manual browser demonstrations into reusable routines through record -> compile -> replay
- Keep browser execution isolated from the control window
- Evaluate continuously on mocked sites and real workflows
- Treat model cost as a first-class engineering constraint
Model capability matters, but so does price. We do not assume token costs stay cheap forever. OpenBrowser is developed with that constraint in mind, including separate handling for stronger and cheaper models.
## Record, Compile, Replay
OpenBrowser is not only a free-form browser agent. It can also turn a human demonstration into a reusable Browser Routine.
1. Record. The frontend calls `/recordings` to start the extension recorder. The recorder scopes itself to a dedicated recording window by default, captures browser events, and attaches selective keyframes for meaningful actions.
2. Review. After stopping, the UI shows the trace, folded supporting events, and captured keyframes. You can also save a short intent note that explains what the workflow was trying to accomplish.
3. Compile. `/recordings/{id}/compile` runs a Compiler Agent over the raw trace, normalized high-level steps, and keyframes. If the trace is ambiguous, it asks clarification questions before producing validated Routine markdown.
4. Replay. Finalizing the compile stores a named Browser Routine under `/routines`. Running that routine starts a fresh conversation in `routine_replay` mode and executes the high-level Routine, not the raw event stream.
Important design rule: replay is not literal event playback. The recording trace is evidence used to compile and debug the workflow; the saved Routine is the executable artifact.
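The exact shape of a compiled Routine is defined by the Compiler Agent's output contract; the fragment below is a purely hypothetical illustration of the idea (numbered high-level steps rather than raw events), with the `**Keywords:**` line being the one element these docs reference explicitly:

```markdown
# Routine: Book a Tokyo stay (hypothetical example)

1. Open the search results for Tokyo
   **Keywords:** search results tokyo
2. Apply the price filter and confirm with the "Show N stays" button
   **Keywords:** price filter show stays
```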
## Evaluation
The primary evaluation signal in this repo is the latest checked-in report.
Older side-by-side comparisons with OpenClaw are kept only as archived context.
Those archived results are still useful for historical tradeoff discussion, but they are not the main metric we optimize against now.
For the record/replay pipeline, the repo also includes a dedicated routine evaluation harness under [`eval/routine_eval/`](eval/routine_eval/README.md):
- Compile track: does a recording become the right Routine, with good clarification behavior?
- Replay track: does a saved Routine execute end-to-end in `routine_replay` mode?
### Run Your Own Evaluation
This means browser control is authorized by possession of the UUID capability token.
#### 8. Record and Replay a Workflow
Once the frontend and extension are connected:
1. Click `Record` -> `Start recording`
2. Perform the workflow manually in the recording browser window, then stop the recording
3. Review the captured trace and keyframes, and add an intent note if the goal needs extra context
4. Click `Compile Routine`, answer any clarification questions, and finalize the result with a name
5. Run the saved routine from the routine launcher or by typing `/` in the command box to insert it
Routine runs always start a fresh conversation in `routine_replay` mode so replay stays separate from free-form chat sessions.
---
### Try OpenBrowser with SKILL - install to your local agents
- **Visual AI Automation**: See and interact with web pages using AI-powered visual recognition
- **Browser Control**: Click, type, scroll, and navigate through visual understanding and JavaScript execution
- **Record -> Compile -> Replay**: Capture a manual browser workflow, compile it into validated Routine markdown, and rerun it as a reusable task
- **Tab Management**: Open, close, switch, and manage browser tabs with session isolation
- **Data Extraction**: Scrape and collect data from websites with AI understanding of page structure
- **Form Filling & Submission**: Automatically fill forms, submit data, and handle multi-step workflows
---

Implementation knowledge for the mock sites under `eval/`. Repo-wide conventions live in `../AGENTS.md`. Read **both** before editing sites or test cases.
## Source of truth
- **`SPEC_NEW_SITES.md`** is the design brief that generated the four post-2026-03 sites: `mapquest/`, `staybnb/`, `taskflow/`, `vidhub/`. When regenerating or extending a site, re-read the relevant section there instead of reverse-engineering from the rendered HTML. The spec pins the **intended behaviours**; the HTML only reflects what actually shipped.
- Test-case YAMLs in `dataset/` are the second source of truth — a site change that breaks a criterion's expected event shape is a site bug, not a criterion bug, unless the spec disagrees.
## Tracker contract
- Every site uses the shared `eval/js/tracker.js` — `window.tracker = new AgentTracker('<site>', '<difficulty>')`. Do not create site-local trackers.
- Emitted event values must match YAML criteria **exactly**, including case. Normalize at the emit site, not in the criterion. Concrete case: StayBnB amenity checkboxes render "Wifi"/"Kitchen" but `staybnb_book.yaml` expects `amenity: "wifi"`. The fix is `.toLowerCase()` in the tracker call, not capitalizing the YAML.
- When instruction text references a button label, the label must be the literal DOM text. StayBnB's filter apply button reads "Show N stays" (live count), so the instruction and criterion description both say "Show N stays" — never a generic "Apply".
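The exact-match contract can be sketched as a scorer-side predicate. The real harness lives in `evaluate_browser_agent.py`; this simplified function is an assumption about its matching shape, shown only to make the case-sensitivity rule concrete:

```python
def event_matches(event: dict, criterion: dict) -> bool:
    """A criterion field matches only if the emitted event carries an
    exactly equal value, including case. Normalization (e.g. lowercasing
    "Wifi") belongs at the emit site in tracker.js, never here. Sketch,
    not the harness's actual matcher."""
    return all(event.get(key) == expected
               for key, expected in criterion.items())
```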
## Default-state events — do not auto-credit
Tempting anti-pattern: if a criterion expects `route_select` / `transport_mode_select` but the UI **pre-selects** the default option (shortest route, drive mode, etc.), a user who agrees with the default never clicks, so no event fires and the criterion would fail. The tempting "fix" is to emit a synthetic `*_select` on state entry with `defaultSelected: true`.
**Don't do this.** The synthetic event also fires when the agent does nothing, so the criterion passes for a no-op run — and in tests like `mapquest_nearby_pins` where "Choose driving mode" is scored, the agent gets credit without ever identifying the icon. It silently dilutes the test signal.
**Rules:**
1. Criteria must match only on explicit user interaction events. No state-entry auto-emits for `*_select` style events.
2. If a test asks the agent to click a pre-selected default, it is the test author's job to make the target unambiguous. Either:
   - Change the task to select a **non-default** option (e.g., `mode: "walk"` instead of `"drive"`), or
   - Pin the criterion to specific field values (e.g., `routeIndex: 0`) so explicit clicks still match, and accept that re-clicking the default is part of the task.
3. When naming the criterion "Select the shortest route", pin `routeIndex: 0` in the YAML so the scorer distinguishes shortest from non-shortest.
History: `mapquest.js` briefly emitted both events as state-entry defaults; the pattern was removed after a review showed `mapquest_nearby_pins` was auto-crediting `transport_mode_select: drive` on directions-panel entry.
## Panel-state machines vs page routes
MapQuest and StayBnB both have panels whose views swap in place rather than navigating. Keep state transitions **stateful** (class toggles on panel containers, `switchPanelState(...)`), not URL-based — the spec intentionally tests panel-state navigation.
Consequence: if two tests need the same control in different panel states, duplicate the DOM rather than routing between states. Concrete case: `mapquest_navigate` wants the Directions flow inside `place-detail`, but `mapquest_nearby_pins` wants the category chip bar visible from inside `place-detail` too. The chip bar is duplicated inside the place-detail state; active-class sync is done across both copies via `document.querySelectorAll('.chip[data-category="..."]')`.
## Deep-link entry points
Some tests start pre-loaded into a non-home view so the agent doesn't have to traverse navigation it isn't being scored on. StayBnB supports `#results` in `staybnb/js/staybnb.js#init()` to jump straight into results with Tokyo listings rendered. The test YAML's `start_url` ends with `/staybnb/#results`. Add a similar hash handler whenever a new test needs a non-home starting view.
## Stacking contexts — popovers and headers
Popovers anchored inside a header will be clipped to the header's effective z-order because the header's own `z-index` creates a stacking context. If a full-screen dismiss backdrop sits above that header z-index, clicks on the popover input land on the backdrop instead and close the popover immediately.
Concrete case: StayBnB's search pill popovers were unclickable until `.topbar` was raised from `z-index: 100` to `300`, above the `.popover-backdrop` at `150`. When adding any overlay that uses a backdrop pattern, verify the anchoring element's stacking context dominates the backdrop.
## Real images
Agents occasionally need real visual content (thumbnails, destination cards, gallery photos). Use **Lorem Picsum seeded URLs** so images are deterministic across runs:
```
https://picsum.photos/seed/<stable-seed>/<W>/<H>
```
Seeds used so far live in `staybnb/index.html` (home cards) and `staybnb/js/staybnb.js` (results + detail + gallery — seeds of form `staybnb-<listingId>-<k>`). Prefer `background-image: url(...)` with `background-size: cover` so the layout tolerates any aspect ratio.
## Drag-and-drop primitives
- **StayBnB price slider** is dual-handle. Each handle drag emits `price_slider_change` with `handle: "min"` or `handle: "max"`. The two criteria are independent, so the handles must be draggable independently — don't couple them.
- **TaskFlow cards** use HTML5 drag-and-drop, not click-to-move. Criteria expect `card_drop` with source and destination column IDs.
## Manual-test harness workflow
When validating a new or changed test via `evaluate_browser_agent.py --manual`:
1. `mkfifo /tmp/eval_in` and hold the write end open with a background `sleep` so the harness's `stdin` stays open across multiple tests.
2. The harness clears events at instruction display time — timing starts then, not at page load.
3. Sending `ok\n` to the FIFO completes a test and prints the score breakdown. A criterion that reports 0 despite visibly correct behaviour is almost always an event-shape mismatch (case, missing field, or default-state pattern above).
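The FIFO setup above looks roughly like this in a shell. The FIFO path and the `ok` completion token come from the text; the harness invocation is left as a comment because its exact flags beyond `--manual` are not documented here:

```shell
#!/bin/sh
# Sketch of the --manual FIFO workflow (illustrative).
FIFO=/tmp/eval_in
[ -p "$FIFO" ] || mkfifo "$FIFO"

# Hold the write end open with a background sleep so the harness's
# stdin stays open across multiple tests.
sleep 100000 > "$FIFO" &
HOLD=$!

# In another terminal you would then run (illustrative invocation):
#   python evaluate_browser_agent.py --manual < /tmp/eval_in
# and complete each test with:
#   echo ok > /tmp/eval_in

kill "$HOLD"
```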
## Scope discipline
Keep site code minimal and aligned with the spec's stated challenges. Do **not** add extra animations, fallbacks, or abstractions beyond what a criterion exercises. The sites exist to probe specific agent weaknesses; incidental complexity dilutes the signal and creates spurious event noise.