Commit 5fbd1cd

Add per-conversation multiprocess execution and browser interaction/eval improvements
## Summary

This branch started as a per-conversation multi-process isolation change, but it also includes a broader set of browser automation, prompt, and evaluation improvements that landed along the way.

At a high level, this PR:

- adds worker-process execution for agent messages and browser command handling
- improves screenshot and highlight stability for background-tab automation
- extends element interaction support with swipe/carousel handling
- adds a new BlueBook evaluation site and datasets
- refreshes prompt guidance, model/tool selection behavior, and dependency pins
- fixes evaluation cost extraction and updates the checked-in evaluation report

## Main Changes

### 1. Per-conversation multi-process execution

- Added optional multi-process mode in the agent manager, with one worker process per conversation.
- Introduced `ProcessManager` and `BrowserExecutorBundle` to encapsulate process lifecycle, worker queues, agent manager initialization, and command execution in workers.
- Updated agent message processing so SSE can stream from worker queues instead of only in-process threads.
- Added worker-side handling for:
  - agent message execution
  - browser command execution
  - pause control
  - clean shutdown
- Updated session/status handling around worker execution and disconnect paths.
- Added unit coverage for multi-process API flow, process lifecycle, bundle behavior, and related command/model cases.

Key files:

- `server/agent/manager.py`
- `server/agent/api.py`
- `server/api/routes/agent.py`
- `server/core/process_manager.py`
- `server/core/browser_executor_bundle.py`

### 2. Highlight readiness and screenshot stability

- Reworked highlight readiness to use a snapshot-first approach instead of relying on page-side polling loops that can be throttled in background tabs.
- Added consistency/stability handling around highlight capture and screenshot timing.
- Improved screenshot capture by waiting for page settle before capture, including fonts, viewport mutations, media readiness, and quiet windows.
- Adjusted collision/highlight behavior so `element_type="any"` better surfaces scrollable regions and remains more stable visually.
- Documented the new highlight-readiness design in project docs.

Key files:

- `extension/src/commands/highlight-detection.injected.js`
- `extension/src/commands/highlight-detection.ts`
- `extension/src/commands/screenshot.ts`
- `extension/src/utils/layout-stability.ts`
- `extension/src/utils/collision-detection.ts`
- `AGENTS.md`

### 3. Element interaction improvements, including swipe support

- Added swipe/carousel interaction support for swipable regions.
- Updated element-interaction prompts and tool guidance so the agent can distinguish scrollable containers from swipable carousel/slider regions.
- Fixed async JavaScript result handling for click flows where dialogs can open during JS-driven interactions, avoiding false failures.
- Expanded command/type support so swipe is treated as a first-class interaction.

Key files:

- `extension/src/commands/element-actions.ts`
- `extension/src/background/index.ts`
- `server/models/commands.py`
- `server/agent/prompts/element_interaction_tool.j2`
- `server/agent/prompts/highlight_tool.j2`
- `server/agent/prompts/tab_tool.j2`

### 4. Prompt and model/tooling refinements

- Refined highlight prompt guidance, especially around when to use keywords vs pagination.
- Tightened JavaScript prompt guidance so JS is positioned as a narrow system-specific fallback instead of a primary interaction path.
- Updated prompt context/tool profile behavior to better match model tier and browser tool availability.
- Bumped `openhands-sdk` / `openhands-tools` dependency revisions and refreshed the lockfile.

Key files:

- `server/agent/prompts/highlight_tool.j2`
- `server/agent/prompts/javascript_tool.j2`
- `server/agent/tools/base.py`
- `server/agent/tools/browser_executor.py`
- `server/agent/tools/prompt_context.py`
- `pyproject.toml`
- `uv.lock`

### 5. Evaluation expansion and maintenance

- Added the new BlueBook evaluation site, assets, and two datasets:
  - `bluebook_simple`
  - `bluebook_complex`
- Updated the eval server/docs to expose the new scenario.
- Refreshed the checked-in evaluation report.
- Fixed evaluation cost extraction to use the final `usage_metrics` SSE snapshot rather than the first one, which was underreporting costs.
- Recomputed `eval/evaluation_report.json` from the saved SSE output.

Key files:

- `eval/bluebook/index.html`
- `eval/bluebook/js/bluebook.js`
- `eval/bluebook/css/bluebook.css`
- `eval/dataset/bluebook_simple.yaml`
- `eval/dataset/bluebook_complex.yaml`
- `eval/server.py`
- `eval/README.md`
- `eval/evaluate_browser_agent.py`
- `eval/evaluation_report.json`

## Testing

Added or updated tests for:

- multi-process agent API behavior
- process manager / worker lifecycle
- browser executor bundle behavior
- command model coverage
- prompt/profile behavior
- highlight detection and layout stability
- screenshot capture behavior
- element-action regression coverage
- eval client cost extraction behavior

Representative test files:

- `server/tests/unit/test_agent_api_multiprocess.py`
- `server/tests/unit/test_agent_manager_process.py`
- `server/tests/unit/test_browser_executor_bundle.py`
- `server/tests/unit/test_eval_client.py`
- `extension/src/__tests__/highlight-detection.test.ts`
- `extension/src/__tests__/highlight-layout-stability.test.ts`
- `extension/src/__tests__/element-actions-regression.test.ts`

## Notes

This branch also contains a few non-core artifacts alongside product changes, including:

- checked-in eval output/report updates
- lock files under `eval/.locks/`
- `bug_report_highlight_any_scrollable.md`
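
The cost-extraction fix listed under section 5 can be illustrated with a small sketch. It is not the code in `eval/evaluate_browser_agent.py`: the `data:` line framing is standard SSE, but the JSON shape (a `type` field equal to `usage_metrics` and a `total_cost` field) and the function name `extract_final_cost` are assumptions made for illustration. The point is only that usage snapshots are cumulative, so the last one seen should win.

```python
import json
from typing import Iterable, Optional


def extract_final_cost(sse_lines: Iterable[str]) -> Optional[float]:
    """Return the cost from the LAST usage_metrics event, or None if absent.

    Assumed (hypothetical) payload shape:
        data: {"type": "usage_metrics", "total_cost": 0.05}
    """
    final_cost: Optional[float] = None
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # skip "event:", "id:", comments, and blank lines
        try:
            event = json.loads(line[len("data:"):].strip())
        except json.JSONDecodeError:
            continue  # ignore payloads that are not JSON
        if isinstance(event, dict) and event.get("type") == "usage_metrics":
            cost = event.get("total_cost")
            if cost is not None:
                # Keep overwriting: later snapshots supersede earlier ones.
                final_cost = float(cost)
    return final_cost
```

Returning on the first match instead of overwriting would reproduce the underreporting described above, since an early snapshot only covers the first part of the run.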
2 parents 85b5ca6 + 6c4ab2f commit 5fbd1cd

58 files changed: +7969 -1165 lines changed

AGENTS.md

Lines changed: 10 additions & 0 deletions
@@ -205,6 +205,16 @@ Elements are paginated to ensure **no visual overlap** in each screenshot:
 - AI calls `page=1, page=2, page=3...` to see all elements of that type
 - No offset/limit - pages are determined by collision geometry
 
+### Highlight Readiness Behavior
+
+- `highlight_elements` now uses a **snapshot-first** readiness check instead of page-side polling loops.
+- Reason: OpenBrowser intentionally keeps automated tabs in the browser background, and Chrome may heavily throttle hidden-tab timers. A page-side `setTimeout` stability loop can therefore take far longer than its nominal budget and become the main cause of highlight timeouts.
+- The extension samples viewport readiness signals once per attempt: document readiness, viewport text/media density, pending images, and loading placeholders such as skeleton/shimmer/spinner indicators.
+- Readiness is graded as `ready`, `provisionally_ready`, or `not_ready`.
+- If readiness is `not_ready`, the extension performs only a couple of short **background-side** retries before proceeding or returning the latest result.
+- After screenshot capture, highlight still runs a **consistency check**. This is a drift detector, not a loading detector: it verifies whether sampled highlighted elements moved or disappeared between detection and screenshot.
+- Design rule: prefer snapshot classification plus bounded retries; avoid depending on repeated timers inside the target page for highlight stability.
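
The snapshot-first rule above can be sketched roughly as follows. This is an illustration in Python, not the extension's TypeScript: the `ViewportSnapshot` fields, the grading thresholds, the retry count of 2, and the 150 ms delay are all assumptions chosen to make the idea concrete. What it shows is the shape of the design: one cheap sample per attempt, a three-way grade, and a small bounded number of retries driven by the caller rather than by a timer loop inside the page.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ViewportSnapshot:
    """One-shot sample of readiness signals (field names are hypothetical)."""
    document_complete: bool     # e.g. document.readyState == "complete"
    visible_content_count: int  # text/media nodes found in the viewport
    pending_images: int         # images still loading
    loading_placeholders: int   # skeleton/shimmer/spinner indicators


def classify(s: ViewportSnapshot) -> str:
    """Grade a snapshot; the exact rules here are illustrative assumptions."""
    if s.loading_placeholders > 0 or s.visible_content_count == 0:
        return "not_ready"
    if not s.document_complete or s.pending_images > 0:
        return "provisionally_ready"
    return "ready"


def grade_with_bounded_retries(
    sample: Callable[[], ViewportSnapshot],
    max_retries: int = 2,
    delay_s: float = 0.15,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Sample once, then retry at most max_retries times while not_ready.

    The sleep happens on the caller's side (standing in for the extension's
    background context), so a throttled page timer cannot stretch the budget.
    """
    grade = classify(sample())
    retries = 0
    while grade == "not_ready" and retries < max_retries:
        sleep(delay_s)
        grade = classify(sample())
        retries += 1
    return grade  # may still be "not_ready"; the caller proceeds with the latest result
```

The worst case is bounded at `max_retries + 1` samples, which is the property the design rule asks for.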
+
 ```
 # Highlight clickable elements (default)
 highlight_elements() → Page 1 of clickable elements

bug_report_highlight_any_scrollable.md

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
+## Bug Report: `highlight_elements` with `type='any'` Never Returns Scrollable Elements
+
+**Date:** 2026-03-22
+**Component:** `extension/src/commands/highlight-detection.injected.js`
+**Severity:** Medium
+**Type:** Logic Error / Dead Code
+
+---
+
+### Summary
+
+When calling `highlight_elements(element_type='any')`, scrollable elements are effectively **never returned** due to short-circuit logic in `resolveElementCandidate()`. The scrollable check (lines 948-957) is dead code for `type='any'`.
+
+---
+
+### Root Cause
+
+In `resolveElementCandidate(el, 'any')` (lines 837-973):
+
+```javascript
+const clickableCandidate = resolveClickableCandidate(el); // Traverses UP DOM tree
+
+if (clickableCandidate) {
+  return { type: 'clickable', ... }; // ← EARLY RETURN
+}
+
+// Lines 924-969 below are NEVER reached when clickableCandidate exists:
+if (isInputableCandidate(el)) // dead code
+if (isSelectableCandidate(el)) // dead code
+if (isScrollableCandidate(el)) // dead code ← SCROLLABLE NEVER RETURNED
+if (isHoverableCandidate(el)) // dead code
+```
+
+The function short-circuits on the first clickable ancestor found (by traversing UP the DOM tree), making all subsequent type checks dead code for `type='any'`.
+
+---
+
+### Expected Behavior
+
+Per collision-aware pagination design, `type='any'` should return elements across **all types** (clickable, inputable, selectable, scrollable, hoverable), prioritized by `HIGHLIGHT_TYPE_PRIORITY`.
+
+---
+
+### Actual Behavior
+
+`type='any'` effectively becomes `type='clickable'`. Elements are only returned if:
+- They themselves are clickable, OR
+- Any **ancestor** has clickable characteristics (pointer cursor + text content)
+
+This causes:
+- Scrollable divs to be misclassified as "clickable" (their ancestor)
+- Hoverable elements to be skipped entirely
+- False positives where a non-interactive ancestor shadows the actual target element
+
+---
+
+### Affected Code Paths
+
+- `resolveElementCandidate()`: lines 912-922 (early return on `clickableCandidate`)
+- `resolveClickableCandidate()`: lines 632-689 (DOM tree traversal with `isTightClickableWrapper` checks)
+
+---
+
+### Suggested Fix
+
+For `type='any'`, collect candidates across ALL types and use `compareCandidates()` with `HIGHLIGHT_TYPE_PRIORITY` for selection, matching the behavior of the specific-type paths (lines 840-910) rather than short-circuiting.
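
One way to read the suggested fix, sketched in Python for brevity (the real code is JavaScript in `highlight-detection.injected.js`). Everything here is hypothetical: the `Candidate` shape, the resolver signature, and the priority order assigned to `HIGHLIGHT_TYPE_PRIORITY` are assumptions, and the selection loop only stands in for whatever `compareCandidates()` actually does. The idea is that each type resolver names the element it would highlight, every resolver runs for every element, and priority only breaks ties when two types point at the same target.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional

# Assumed order (lower wins); the real HIGHLIGHT_TYPE_PRIORITY may differ.
HIGHLIGHT_TYPE_PRIORITY: Dict[str, int] = {
    "clickable": 0, "inputable": 1, "selectable": 2, "scrollable": 3, "hoverable": 4,
}


@dataclass(frozen=True)
class Candidate:
    target: str  # id of the element that would be highlighted
    type: str    # one of the keys in HIGHLIGHT_TYPE_PRIORITY


# A resolver maps an element id to the id it would highlight, or None.
Resolver = Callable[[str], Optional[str]]


def collect_candidates(el: str, resolvers: Dict[str, Resolver]) -> List[Candidate]:
    """Run every type resolver; never return early on the first match."""
    found = []
    for type_name, resolve in resolvers.items():
        target = resolve(el)
        if target is not None:
            found.append(Candidate(target, type_name))
    return found


def select_any(elements: Iterable[str], resolvers: Dict[str, Resolver]) -> List[Candidate]:
    """Keep the best-priority type per target, across all elements."""
    best: Dict[str, Candidate] = {}
    for el in elements:
        for cand in collect_candidates(el, resolvers):
            current = best.get(cand.target)
            if (current is None
                    or HIGHLIGHT_TYPE_PRIORITY[cand.type] < HIGHLIGHT_TYPE_PRIORITY[current.type]):
                best[cand.target] = cand
    return sorted(best.values(), key=lambda c: (HIGHLIGHT_TYPE_PRIORITY[c.type], c.target))
```

With a scrollable `list` nested inside a clickable `card`, the clickable resolver walks up to `card` while the scrollable resolver returns `list` itself, so both targets survive instead of the scrollable one being shadowed by its ancestor.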

eval/.locks/evaluation_7a77a63b-ab2a-4e1b-9734-66a4dfe1d6fe.lock

Whitespace-only changes.
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+{"pid": 48460, "browser_uuid": "ff2b4397-b8f6-4346-9f66-bf1b4d9a9804", "started_at": "2026-03-22T11:28:57.972109"}

eval/README.md

Lines changed: 13 additions & 1 deletion
@@ -1,6 +1,6 @@
 # Mock Websites for AI Agent Evaluation
 
-This directory contains 5 mocked frontend websites designed to test browser-operating AI agents' capabilities.
+This directory contains mocked frontend websites designed to test browser-operating AI agents' capabilities.
 
 ## Websites
 
@@ -61,6 +61,18 @@ This directory contains 5 mocked frontend websites designed to test browser-oper
   - Multiple view modes (Overview, Valuation, Financial, etc.)
   - Dark theme matching original finviz.com
 
+### 6. BlueBook Feed (Hard)
+- **URL**: `/bluebook/`
+- **Difficulty**: Hard
+- **Purpose**: Test Xiaohongshu-style visual browsing, search, dense card layouts, modal note reading, and comment interactions
+- **Features**:
+  - Dense masonry feed with 70+ mocked posts
+  - Search bar with separate clear/search icon buttons
+  - Floating "graphic only" and "reload" buttons
+  - Note detail modal with left media area and right comment panel
+  - Comment like / reply interactions with author-specific tracking
+  - Shared tracker integration plus site-specific events
+
 ## Event Tracking
 
 All websites include comprehensive event tracking that records:
