Commit 3d251b5

Merge pull request #56 from softpudding/eval/full-20260411-benchmark
Eval: 10 mock sites + routine-compile eval + full-eval benchmark report
2 parents 6f02ded + 13f1069 commit 3d251b5

100 files changed

+36445
-4167
lines changed

AGENTS.md

Lines changed: 62 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -1,12 +1,12 @@
11
# OpenBrowser Project Knowledge Base
22

3-
**Generated:** 2026-03-16
4-
**Commit:** 8836b0b (main)
3+
**Generated:** 2026-04-10
4+
**Commit:** 25b3a2e (main)
55
**Stack:** Python 3.12+ (FastAPI) + TypeScript (Chrome Extension MV3)
66

77
## OVERVIEW
88

9-
Visual AI assistant for browser automation powered by Qwen3.5-Plus (primary) with Qwen3.5-Flash support as a cost-effective alternative. Provides AI-powered visual understanding and interaction for web automation, data extraction, and interactive workflows. Single-model automation loop: visual perception → decision making → browser interaction → verification.
9+
Visual AI assistant for browser automation powered by Qwen3.5-Plus (primary) with Qwen3.5-Flash support as a cost-effective alternative. Provides AI-powered visual understanding and interaction for web automation, data extraction, interactive workflows, and a record -> compile -> replay pipeline for reusable Browser Routines. Single-model automation loop: visual perception → decision making → browser interaction → verification.
1010

1111
## STRUCTURE
1212

@@ -19,6 +19,7 @@ OpenBrowser/
1919
│ └── websocket/ # WebSocket server
2020
├── extension/ # Chrome extension (MV3) for browser control
2121
├── frontend/ # Static web UI (HTML)
22+
├── eval/ # Mock sites + routine compile/replay evaluation
2223
└── reference/ # External SDK references (read-only)
2324
```
2425

@@ -30,23 +31,33 @@ OpenBrowser/
3031
| Browser commands | `server/core/processor.py` | Command routing, multi-session |
3132
| Dialog handling | `server/models/commands.py` | HandleDialogCommand, DialogAction |
3233
| REST API routes | `server/api/routes/` | FastAPI endpoints |
34+
| Recording routes | `server/api/routes/recordings.py` | Recording lifecycle, workflow draft, compiler, finalize |
35+
| Routine routes | `server/api/routes/routines.py` | Saved Browser Routine CRUD for replay |
3336
| Browser UUID routing | `server/api/routes/browsers.py` | Browser UUID registration and validation |
3437
| WebSocket handling | `server/websocket/manager.py` | Extension communication |
3538
| Browser UUID registry | `server/core/uuid_manager.py` | `uuid -> websocket` capability mapping |
3639
| Command models | `server/models/commands.py` | Pydantic command/response types |
40+
| Recording persistence | `server/core/recording_manager.py` | SQLite recording sessions/events, immutability boundaries |
41+
| Workflow draft compiler | `server/core/workflow_compiler.py` | Normalize raw recording traces into high-level draft steps/IR |
42+
| Compiler Agent | `server/core/compiler_agent.py` | TraceViewer, clarify-with-user loop, Routine validation |
43+
| Routine persistence | `server/core/routine_manager.py` | Saved routines linked back to source recordings |
3744
| **Prompt templates** | `server/agent/prompts/` | **Jinja2 templates for agent prompts** |
3845
| Tab tool | `server/agent/tools/tab_tool.py` | TabTool for tab management |
3946
| Highlight tool | `server/agent/tools/highlight_tool.py` | HighlightTool for element discovery |
4047
| Element interaction | `server/agent/tools/element_interaction_tool.py` | ElementInteractionTool with 2PC flow |
4148
| Dialog tool | `server/agent/tools/dialog_tool.py` | DialogTool for dialog handling |
4249
| ToolSet aggregator | `server/agent/tools/toolset.py` | OpenBrowserToolSet aggregates all 4 tools |
4350
| Extension entry | `extension/src/background/index.ts` | Command handler, dialog processing |
51+
| Extension recorder | `extension/src/recording/recorder.ts` | Recording scope, event capture, keyframe upload |
52+
| Recording keyframe policy | `extension/src/recording/keyframe-policy.ts` | Which events get screenshots and when drift is discarded |
4453
| Dialog manager | `extension/src/commands/dialog.ts` | CDP dialog events, cascading |
4554
| JavaScript execution | `extension/src/commands/javascript.ts` | CDP Runtime.evaluate, dialog race |
4655
| Screenshot capture | `extension/src/commands/screenshot.ts` | CDP Page.captureScreenshot |
4756
| Tab management | `extension/src/commands/tab-manager.ts` | Session isolation, tab groups |
4857
| UUID page | `extension/src/uuid/uuidPage.ts` | Browser UUID display and registration status |
49-
| Frontend chat UI | `frontend/index.html` | Browser UUID input, conversation UI, Sisyphus |
58+
| Frontend recording/replay UI | `frontend/index.html` | Browser UUID input, recording panel, compile flow, saved routines, slash-menu replay |
59+
| Routine evaluation | `eval/routine_eval/` | Compile-track + replay-track eval harness for record/replay |
60+
5061
## ARCHITECTURE
5162

5263
```
@@ -96,6 +107,53 @@ OpenBrowser now uses the browser UUID as a capability token, not just an interna
96107
- Frontend flow lives in `frontend/index.html`
97108
- UUID registration and validation live in `server/api/routes/browsers.py`, `server/core/uuid_manager.py`, and `server/websocket/manager.py`
98109

110+
## RECORD & REPLAY DESIGN
111+
112+
OpenBrowser's record/replay system is deliberately not a raw event replayer. The recording trace is evidence used to understand what the human did, compile a reusable Browser Routine, and debug failures later. Replay runs that compiled Routine as a fresh agent session.
113+
114+
### Pipeline
115+
```
116+
1. POST /recordings
117+
2. Extension recorder starts in `dedicated_window` (default) or `current_window`
118+
3. Recorder captures scoped browser events + selective keyframes
119+
4. POST /recordings/{id}/events persists rows while the session is ACTIVE
120+
5. POST /recordings/{id}/stop freezes the trace
121+
6. GET /recordings/{id}/workflow-draft builds normalized steps / workflow IR
122+
7. POST /recordings/{id}/compile runs the Compiler Agent over raw events, keyframes, normalized steps, and `intent_note`
123+
8. Compiler may ask clarification questions, then emits validated Routine markdown
124+
9. POST /recordings/{id}/compile/finalize saves a named Routine in `routines`
125+
10. Frontend replay starts a fresh conversation with `mode="routine_replay"` and sends the Routine markdown as the first message
126+
```
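The endpoint sequence above can be sketched as an ordered request plan. This is a minimal illustration that uses only the paths listed in the pipeline; the helper name `recording_pipeline` and the tuple layout are hypothetical, and request bodies are omitted because they are not specified here.

```python
from typing import List, Tuple


def recording_pipeline(recording_id: str) -> List[Tuple[str, str]]:
    """Return the server calls of the record -> compile pipeline, in order.

    Paths come from the pipeline listing; this helper is illustrative only.
    """
    base = f"/recordings/{recording_id}"
    return [
        ("POST", "/recordings"),               # step 1: create the recording
        ("POST", f"{base}/events"),            # step 4: persist events while ACTIVE
        ("POST", f"{base}/stop"),              # step 5: freeze the trace
        ("GET", f"{base}/workflow-draft"),     # step 6: normalized steps / workflow IR
        ("POST", f"{base}/compile"),           # steps 7-8: Compiler Agent run
        ("POST", f"{base}/compile/finalize"),  # step 9: save the named Routine
    ]


for method, path in recording_pipeline("rec-123"):
    print(method, path)
```

Steps 2, 3, and 10 do not appear in the plan because they run in the extension and the frontend rather than as recording endpoints.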
127+
128+
### Core Design Rules
129+
- Replay is **NOT** low-level click/scroll/input playback
130+
- Raw recording events are a source artifact for review, compilation, and debugging
131+
- `workflow-draft` is intermediate IR for review/compiler context, not the final replay format
132+
- The executable replay artifact is the finalized Routine markdown saved in `routines`
133+
- Saved routines keep a back-reference to `source_recording_id`
134+
135+
### Recording Invariants
136+
- Only one ACTIVE recording may exist per browser UUID
137+
- Default launch mode is `dedicated_window`; `current_window` is opt-in
138+
- The recorder owns a recording scope (window/group/tab set) and automatically absorbs new in-scope tabs
139+
- `recording_started` and `recording_stopped` are ambient lifecycle events; they should not compile into replay steps
140+
- Once a recording leaves ACTIVE, `/recordings/{id}/events` must reject late async uploads so the reviewed trace stays immutable
141+
- If the browser websocket is gone at stop time, the server marks the row STOPPED locally with `stop_reason=browser_disconnected` instead of leaving it stranded ACTIVE
142+
- `page_view` intentionally does **not** carry a keyframe; early lifecycle captures were observed to distort the live Chrome page
143+
- Keyframes are selective: mainly `click`, `change`, `submit`, and input-like `focus`; some click/enter flows use pre-action captures, and post-capture screenshots are discarded if the capture already drifted to a different URL
144+
- Input/focus noise is merged before review/compiler consumption so the trace reflects intent instead of every transient keystroke
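Three of the invariants above (one ACTIVE recording per browser UUID, rejection of late uploads, and the local stop on disconnect) can be sketched as a small state holder. This is a hedged in-memory illustration, not the SQLite-backed code in `server/core/recording_manager.py`; `RecordingStore`, `Recording`, and `RecordingError` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


class RecordingError(Exception):
    """Raised when an operation would violate a recording invariant."""


@dataclass
class Recording:
    recording_id: str
    browser_uuid: str
    status: str = "ACTIVE"
    stop_reason: Optional[str] = None
    events: List[dict] = field(default_factory=list)


class RecordingStore:
    """Hypothetical in-memory stand-in for the persistent recording manager."""

    def __init__(self) -> None:
        self._by_id: Dict[str, Recording] = {}

    def start(self, recording_id: str, browser_uuid: str) -> Recording:
        # Invariant: only one ACTIVE recording per browser UUID.
        for rec in self._by_id.values():
            if rec.browser_uuid == browser_uuid and rec.status == "ACTIVE":
                raise RecordingError("browser already has an ACTIVE recording")
        rec = Recording(recording_id, browser_uuid)
        self._by_id[recording_id] = rec
        return rec

    def append_event(self, recording_id: str, event: dict) -> None:
        rec = self._by_id[recording_id]
        # Invariant: late uploads are rejected once the trace is frozen.
        if rec.status != "ACTIVE":
            raise RecordingError("recording is no longer ACTIVE")
        rec.events.append(event)

    def stop(self, recording_id: str, browser_connected: bool = True) -> Recording:
        rec = self._by_id[recording_id]
        rec.status = "STOPPED"
        # Invariant: a missing websocket still ends the recording locally.
        if not browser_connected:
            rec.stop_reason = "browser_disconnected"
        return rec
```

Rejecting events by status rather than by timestamp keeps the reviewed trace immutable even when an upload was queued before the stop request arrived.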
145+
146+
### Replay Invariants
147+
- Replay always starts a **fresh conversation** with metadata `mode="routine_replay"`
148+
- `routine_replay_mode` is a server-side flag propagated from session metadata into system prompt rendering; the model never infers replay mode from free-form text
149+
- The frontend replay entry points are the saved-routine launcher and the `/` slash-menu routine picker in `frontend/index.html`
150+
- Small-model `highlight_elements(keywords=...)` is only allowed in routine replay, and the token must be copied verbatim from the active Routine step's `**Keywords:**` line
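The verbatim-keyword rule can be illustrated with a small check. This is a sketch under stated assumptions: the `**Keywords:**` line is assumed to hold comma-separated tokens, and `step_keywords` and `keyword_allowed` are hypothetical helpers rather than functions from the server code.

```python
import re
from typing import List

# Matches a step's `**Keywords:**` line; the comma-separated layout is an assumption.
KEYWORDS_LINE = re.compile(r"^[ \t]*\*\*Keywords:\*\*[ \t]*(.*)$", re.MULTILINE)


def step_keywords(step_markdown: str) -> List[str]:
    """Extract the tokens listed on the step's `**Keywords:**` line."""
    match = KEYWORDS_LINE.search(step_markdown)
    if match is None:
        return []
    return [token.strip() for token in match.group(1).split(",") if token.strip()]


def keyword_allowed(mode: str, step_markdown: str, keyword: str) -> bool:
    """Allow a keyword only in routine replay, and only if copied verbatim."""
    if mode != "routine_replay":
        return False
    return keyword in step_keywords(step_markdown)


step = "### Step 2\nClick the checkout button.\n**Keywords:** Checkout, Place order\n"
print(keyword_allowed("routine_replay", step, "Checkout"))
print(keyword_allowed("routine_replay", step, "checkout"))
```

The comparison is case-sensitive on purpose, so a paraphrased or re-cased token fails the check.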
151+
152+
### Evaluation Hooks
153+
- `eval/routine_eval/` has a compile track and a replay track
154+
- The compile track can ingest fixture traces through `POST /recordings/ingest`, gated by `OPENBROWSER_ENABLE_TEST_ROUTES=1`
155+
- The replay track executes golden routines in `routine_replay` mode on the mock sites
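The environment gate on the ingest route could look like the following. This is a minimal sketch; the helper name `ingest_route_enabled` is hypothetical, and treating only the exact value `1` as enabled is an assumption about how the flag is read.

```python
import os
from typing import Mapping, Optional


def ingest_route_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when OPENBROWSER_ENABLE_TEST_ROUTES is exactly "1"."""
    source = os.environ if env is None else env
    return source.get("OPENBROWSER_ENABLE_TEST_ROUTES") == "1"


print(ingest_route_enabled({"OPENBROWSER_ENABLE_TEST_ROUTES": "1"}))
print(ingest_route_enabled({}))
```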
156+
99157
## DIALOG HANDLING
100158

101159
When JavaScript triggers a dialog (alert/confirm/prompt), the browser pauses.

README.md

Lines changed: 30 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,7 @@ It treats browser automation as a visual and interactive systems problem, not ju
1111
OpenBrowser is built around that view:
1212

1313
- Operate pages visually through screenshots and direct browser actions
14+
- Turn manual browser demonstrations into reusable routines through record -> compile -> replay
1415
- Keep browser execution isolated from the control window
1516
- Evaluate continuously on mocked sites and real workflows
1617
- Treat model cost as a first-class engineering constraint
@@ -68,6 +69,17 @@ OpenBrowser is not iterated by vibe alone. The repo includes mocked websites wit
6869

6970
Model capability matters, but so does price. We do not assume token costs stay cheap forever. OpenBrowser is developed with that constraint in mind, including separate handling for stronger and cheaper models.
7071

72+
## Record, Compile, Replay
73+
74+
OpenBrowser is not only a free-form browser agent. It can also turn a human demonstration into a reusable Browser Routine.
75+
76+
1. Record. The frontend calls `/recordings` to start the extension recorder. The recorder scopes itself to a dedicated recording window by default, captures browser events, and attaches selective keyframes for meaningful actions.
77+
2. Review. After stopping, the UI shows the trace, folded supporting events, and captured keyframes. You can also save a short intent note that explains what the workflow was trying to accomplish.
78+
3. Compile. `/recordings/{id}/compile` runs a Compiler Agent over the raw trace, normalized high-level steps, and keyframes. If the trace is ambiguous, it asks clarification questions before producing validated Routine markdown.
79+
4. Replay. Finalizing the compile stores a named Browser Routine under `/routines`. Running that routine starts a fresh conversation in `routine_replay` mode and executes the high-level Routine, not the raw event stream.
80+
81+
Important design rule: replay is not literal event playback. The recording trace is evidence used to compile and debug the workflow; the saved Routine is the executable artifact.
82+
7183
## Evaluation
7284

7385
The primary evaluation signal in this repo is the latest checked-in report:
@@ -101,6 +113,11 @@ Older side-by-side comparisons with OpenClaw are kept only as archived context:
101113

102114
Those archived results are still useful for historical tradeoff discussion, but they are not the main metric we optimize against now.
103115

116+
For the record/replay pipeline, the repo also includes a dedicated routine evaluation harness under [`eval/routine_eval/`](eval/routine_eval/README.md):
117+
118+
- Compile track: does a recording become the right Routine, with good clarification behavior?
119+
- Replay track: does a saved Routine execute end-to-end in `routine_replay` mode?
120+
104121
### Run Your Own Evaluation
105122

106123
```bash
@@ -235,6 +252,18 @@ The permission flow is:
235252

236253
This means browser control is authorized by possession of the UUID capability token.
237254

255+
#### 8. Record and Replay a Workflow
256+
257+
Once the frontend and extension are connected:
258+
259+
1. Click `Record` -> `Start recording`
260+
2. Perform the workflow manually in the recording browser window, then stop the recording
261+
3. Review the captured trace and keyframes, and add an intent note if the goal needs extra context
262+
4. Click `Compile Routine`, answer any clarification questions, and finalize the result with a name
263+
5. Run the saved routine from the routine launcher or by typing `/` in the command box to insert it
264+
265+
Routine runs always start a fresh conversation in `routine_replay` mode so replay stays separate from free-form chat sessions.
266+
238267
---
239268

240269
### Try OpenBrowser with SKILL - install to your local agents
@@ -291,6 +320,7 @@ Browser agents are only useful if they remain practical to run. OpenBrowser ther
291320

292321
- **Visual AI Automation**: See and interact with web pages using AI-powered visual recognition
293322
- **Browser Control**: Click, type, scroll, and navigate through visual understanding and JavaScript execution
323+
- **Record -> Compile -> Replay**: Capture a manual browser workflow, compile it into validated Routine markdown, and rerun it as a reusable task
294324
- **Tab Management**: Open, close, switch, and manage browser tabs with session isolation
295325
- **Data Extraction**: Scrape and collect data from websites with AI understanding of page structure
296326
- **Form Filling & Submission**: Automatically fill forms, submit data, and handle multi-step workflows

eval/AGENTS.md

Lines changed: 73 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,73 @@
1+
# eval/ — Agent Notes
2+
3+
Implementation knowledge for the mock sites under `eval/`. Repo-wide conventions live in `../AGENTS.md`. Read **both** before editing sites or test cases.
4+
5+
## Source of truth
6+
7+
- **`SPEC_NEW_SITES.md`** is the design brief that generated the four post-2026-03 sites: `mapquest/`, `staybnb/`, `taskflow/`, `vidhub/`. When regenerating or extending a site, re-read the relevant section there instead of reverse-engineering from the rendered HTML. The spec pins the **intended behaviours**; the HTML only reflects what actually shipped.
8+
- Test-case YAMLs in `dataset/` are the second source of truth — a site change that breaks a criterion's expected event shape is a site bug, not a criterion bug, unless the spec disagrees.
9+
10+
## Tracker contract
11+
12+
- Every site uses the shared `eval/js/tracker.js`: `window.tracker = new AgentTracker('<site>', '<difficulty>')`. Do not create site-local trackers.
13+
- Emitted event values must match YAML criteria **exactly**, including case. Normalize at the emit site, not in the criterion. Concrete case: StayBnB amenity checkboxes render "Wifi"/"Kitchen" but `staybnb_book.yaml` expects `amenity: "wifi"`. The fix is `.toLowerCase()` in the tracker call, not capitalizing the YAML.
14+
- When instruction text references a button label, the label must be the literal DOM text. StayBnB's filter apply button reads "Show N stays" (live count), so the instruction and criterion description both say "Show N stays" — never a generic "Apply".
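The exact-match rule can be modeled in a few lines. This is a sketch that assumes a criterion is a dictionary of expected field values; `criterion_matches`, `emit_amenity`, and the `amenity_toggle` event name are hypothetical, and the real scorer may compare events differently.

```python
from typing import Any, Dict


def criterion_matches(event: Dict[str, Any], expected: Dict[str, Any]) -> bool:
    """True when every expected field is present and equal, including case."""
    return all(key in event and event[key] == value for key, value in expected.items())


def emit_amenity(label: str) -> Dict[str, Any]:
    """Normalize at the emit site, mirroring `.toLowerCase()` in the tracker call."""
    return {"type": "amenity_toggle", "amenity": label.lower()}


expected = {"amenity": "wifi"}
print(criterion_matches({"type": "amenity_toggle", "amenity": "Wifi"}, expected))
print(criterion_matches(emit_amenity("Wifi"), expected))
```

Normalizing in the emitter keeps the YAML criteria stable while the rendered labels stay free to change.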
15+
16+
## Default-state events — do not auto-credit
17+
18+
Tempting anti-pattern: if a criterion expects `route_select` / `transport_mode_select` but the UI **pre-selects** the default option (shortest route, drive mode, etc.), a user who agrees with the default never clicks, so no event fires and the criterion would fail. The tempting "fix" is to emit a synthetic `*_select` on state entry with `defaultSelected: true`.
19+
20+
**Don't do this.** The synthetic event also fires when the agent does nothing, so the criterion passes for a no-op run — and in tests like `mapquest_nearby_pins` where "Choose driving mode" is scored, the agent gets credit without ever identifying the icon. It silently dilutes the test signal.
21+
22+
**Rules:**
23+
24+
1. Criteria must match only on explicit user interaction events. No state-entry auto-emits for `*_select` style events.
25+
2. If a test asks the agent to click a pre-selected default, it is the test author's job to make the target unambiguous. Either:
26+
- Change the task to select a **non-default** option (e.g., `mode: "walk"` instead of `"drive"`), or
27+
- Pin the criterion to specific field values (e.g., `routeIndex: 0`) so explicit clicks still match, and accept that re-clicking the default is part of the task.
28+
3. When naming the criterion "Select the shortest route", pin `routeIndex: 0` in the YAML so the scorer distinguishes shortest from non-shortest.
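A short model shows why synthetic default events dilute the signal and how a pinned `routeIndex` separates routes. It is a sketch only: `criterion_passes` and the use of a `type` field for the event name are assumptions, not the scorer's actual shape.

```python
from typing import Any, Dict, List


def criterion_passes(events: List[Dict[str, Any]], expected: Dict[str, Any]) -> bool:
    """Pass when at least one recorded event carries all expected field values."""
    return any(
        all(key in event and event[key] == value for key, value in expected.items())
        for event in events
    )


# Criterion "Select the shortest route", pinned to routeIndex 0.
shortest_route = {"type": "route_select", "routeIndex": 0}

no_op_run: List[Dict[str, Any]] = []
explicit_click = [{"type": "route_select", "routeIndex": 0}]
wrong_route = [{"type": "route_select", "routeIndex": 2}]
# Anti-pattern: an event emitted on state entry, before any click happens.
synthetic_default = [{"type": "route_select", "routeIndex": 0, "defaultSelected": True}]

for name, run in [
    ("no-op", no_op_run),
    ("explicit click", explicit_click),
    ("wrong route", wrong_route),
    ("synthetic default", synthetic_default),
]:
    print(name, criterion_passes(run, shortest_route))
```

The synthetic run passes even though the agent did nothing, which is the failure mode the rules above exclude.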
29+
30+
History: `mapquest.js` briefly emitted both events as state-entry defaults; the pattern was removed after a review showed `mapquest_nearby_pins` was auto-crediting `transport_mode_select: drive` on directions-panel entry.
31+
32+
## Panel-state machines vs page routes
33+
34+
MapQuest and StayBnB both have panels whose views swap in place rather than navigating. Keep state transitions **stateful** (class toggles on panel containers, `switchPanelState(...)`), not URL-based — the spec intentionally tests panel-state navigation.
35+
36+
Consequence: if two tests need the same control in different panel states, duplicate the DOM rather than routing between states. Concrete case: `mapquest_navigate` wants the Directions flow inside `place-detail`, but `mapquest_nearby_pins` wants the category chip bar visible from inside `place-detail` too. The chip bar is duplicated inside the place-detail state; active-class sync is done across both copies via `document.querySelectorAll('.chip[data-category="..."]')`.
37+
38+
## Deep-link entry points
39+
40+
Some tests start pre-loaded into a non-home view so the agent doesn't have to traverse navigation it isn't being scored on. StayBnB supports `#results` in `staybnb/js/staybnb.js#init()` to jump straight into results with Tokyo listings rendered. The test YAML's `start_url` ends with `/staybnb/#results`. Add a similar hash handler whenever a new test needs a non-home starting view.
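On the harness side, the starting view implied by a test's `start_url` can be derived from the URL fragment. This is a hedged sketch: `HASH_VIEWS`, `initial_view`, and the host and port in the example URLs are hypothetical, and the actual handler lives in the site's JavaScript.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from URL fragment to the view rendered on load.
HASH_VIEWS = {"results": "results"}


def initial_view(start_url: str, default: str = "home") -> str:
    """Return the view a site should open in, based on the URL fragment."""
    fragment = urlsplit(start_url).fragment
    return HASH_VIEWS.get(fragment, default)


print(initial_view("http://localhost:8000/staybnb/#results"))
print(initial_view("http://localhost:8000/staybnb/"))
```

Unknown fragments fall back to the home view, so a mistyped `start_url` shows up as a visibly wrong starting state instead of an error.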
41+
42+
## Stacking contexts — popovers and headers
43+
44+
Popovers anchored inside a header cannot stack above the header's effective z-order, because the header's own `z-index` creates a stacking context. If a full-screen dismiss backdrop sits above that header z-index, clicks on the popover input land on the backdrop instead and close the popover immediately.
45+
46+
Concrete case: StayBnB's search pill popovers were unclickable until `.topbar` was raised from `z-index: 100` to `300`, above the `.popover-backdrop` at `150`. When adding any overlay that uses a backdrop pattern, verify the anchoring element's stacking context dominates the backdrop.
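The rule reduces to a single comparison once the popover is treated as bounded by its anchor's stacking context. The function below is a simplified model, assuming the backdrop sits in the root stacking context outside the anchor; `popover_receives_clicks` is a hypothetical name.

```python
def popover_receives_clicks(anchor_z_index: int, backdrop_z_index: int) -> bool:
    """Simplified model of the backdrop pattern.

    The popover's own z-index only orders it inside the anchor's stacking
    context, so at the root level it competes with the backdrop using the
    anchor's z-index.
    """
    return anchor_z_index > backdrop_z_index


# Values from the StayBnB case: `.topbar` at 100, then 300; backdrop at 150.
print(popover_receives_clicks(100, 150))
print(popover_receives_clicks(300, 150))
```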
47+
48+
## Real images
49+
50+
Agents occasionally need real visual content (thumbnails, destination cards, gallery photos). Use **Lorem Picsum seeded URLs** so images are deterministic across runs:
51+
52+
```
53+
https://picsum.photos/seed/<stable-seed>/<W>/<H>
54+
```
55+
56+
Seeds used so far live in `staybnb/index.html` (home cards) and `staybnb/js/staybnb.js` (results + detail + gallery — seeds of form `staybnb-<listingId>-<k>`). Prefer `background-image: url(...)` with `background-size: cover` so the layout tolerates any aspect ratio.
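A seeded URL can be built from the listing ID and image index. This is a minimal sketch that follows the URL pattern and seed form given above; the helper names `listing_seed` and `picsum_url` and the sample listing ID are hypothetical.

```python
from urllib.parse import quote


def listing_seed(listing_id: str, k: int) -> str:
    """Build a seed of the form `staybnb-<listingId>-<k>`."""
    return f"staybnb-{listing_id}-{k}"


def picsum_url(seed: str, width: int, height: int) -> str:
    """Return a deterministic Lorem Picsum URL for the given seed and size."""
    return f"https://picsum.photos/seed/{quote(seed, safe='')}/{width}/{height}"


print(picsum_url(listing_seed("tokyo-01", 2), 640, 480))
```

Because the seed is derived from stable identifiers, the same listing yields the same URL on every run.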
57+
58+
## Drag-and-drop primitives
59+
60+
- **StayBnB price slider** is dual-handle. Each handle drag emits `price_slider_change` with `handle: "min"` or `handle: "max"`. The two criteria are independent, so the handles must be draggable independently — don't couple them.
61+
- **TaskFlow cards** use HTML5 drag-and-drop, not click-to-move. Criteria expect `card_drop` with source and destination column IDs.
62+
63+
## Manual-test harness workflow
64+
65+
When validating a new or changed test via `evaluate_browser_agent.py --manual`:
66+
67+
1. `mkfifo /tmp/eval_in` and hold the write end open with a background `sleep` so the harness's `stdin` stays open across multiple tests.
68+
2. The harness clears events at instruction display time — timing starts then, not at page load.
69+
3. Sending `ok\n` to the FIFO completes a test and prints the score breakdown. A criterion that reports 0 despite visibly correct behaviour is almost always an event-shape mismatch (case, missing field, or default-state pattern above).
70+
71+
## Scope discipline
72+
73+
Keep site code minimal and aligned with the spec's stated challenges. Do **not** add extra animations, fallbacks, or abstractions beyond what a criterion exercises. The sites exist to probe specific agent weaknesses; incidental complexity dilutes the signal and creates spurious event noise.
