Commit 5fbd1cd
Add per-conversation multiprocess execution and browser interaction/eval improvements
## Summary
This branch started as a per-conversation multi-process isolation change, but it also includes a broader set of browser automation, prompt, and evaluation improvements that landed along the way.
At a high level, this PR:
- adds worker-process execution for agent messages and browser command handling
- improves screenshot and highlight stability for background-tab automation
- extends element interaction support with swipe/carousel handling
- adds a new BlueBook evaluation site and datasets
- refreshes prompt guidance, model/tool selection behavior, and dependency pins
- fixes evaluation cost extraction and updates the checked-in evaluation report
## Main Changes
### 1. Per-conversation multi-process execution
- Added optional multi-process mode in the agent manager, with one worker process per conversation.
- Introduced `ProcessManager` and `BrowserExecutorBundle` to encapsulate process lifecycle, worker queues, agent manager initialization, and command execution in workers.
- Updated agent message processing so SSE can stream from worker queues instead of only in-process threads.
- Added worker-side handling for:
  - agent message execution
  - browser command execution
  - pause control
  - clean shutdown
- Updated session/status handling around worker execution and disconnect paths.
- Added unit coverage for multi-process API flow, process lifecycle, bundle behavior, and related command/model cases.
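The one-worker-per-conversation model can be illustrated with a minimal sketch. This is not the code in `server/core/process_manager.py`; the names `worker_loop` and `ConversationWorkers`, the request kinds, and the tuple message shape are all assumptions made for illustration. It only shows the general pattern of a request queue and an event queue per worker, with pause and shutdown handled inside the worker.

```python
import multiprocessing as mp


def worker_loop(inbox, outbox):
    """Hypothetical worker body: serve requests until told to shut down."""
    paused = False
    while True:
        kind, payload = inbox.get()
        if kind == "shutdown":
            outbox.put(("stopped", None))
            break
        if kind == "pause":
            paused = bool(payload)
            outbox.put(("paused", paused))
        elif kind == "agent_message":
            if paused:
                outbox.put(("rejected", "paused"))
            else:
                # Stand-in for running the agent on the message.
                outbox.put(("event", f"echo: {payload}"))
        elif kind == "browser_command":
            # Stand-in for executing a browser command in the worker.
            outbox.put(("command_result", {"command": payload, "ok": True}))
        else:
            outbox.put(("error", f"unknown request: {kind}"))


class ConversationWorkers:
    """Hypothetical registry that keeps one worker process per conversation."""

    def __init__(self):
        self._workers = {}

    def get_or_start(self, conversation_id):
        if conversation_id not in self._workers:
            inbox, outbox = mp.Queue(), mp.Queue()
            proc = mp.Process(target=worker_loop, args=(inbox, outbox), daemon=True)
            proc.start()
            self._workers[conversation_id] = (proc, inbox, outbox)
        return self._workers[conversation_id]

    def shutdown(self, conversation_id, timeout=5.0):
        proc, inbox, _outbox = self._workers.pop(conversation_id)
        inbox.put(("shutdown", None))
        proc.join(timeout)
        if proc.is_alive():
            # Fall back to a hard stop if the worker did not exit in time.
            proc.terminate()
```

Because `worker_loop` only needs objects with `get` and `put`, it can be exercised in-process with `queue.Queue`, which is one way unit tests could cover the request handling without spawning processes.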
Key files:
`server/agent/manager.py`
`server/agent/api.py`
`server/api/routes/agent.py`
`server/core/process_manager.py`
`server/core/browser_executor_bundle.py`
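Streaming SSE from a worker queue, as described in this section, might look roughly like the generator below. It is a sketch under assumptions: the function name `sse_events`, the `(kind, payload)` tuple shape, and the terminal `done` event are hypothetical and not taken from `server/api/routes/agent.py`. The frame layout (`event:` and `data:` lines, blank-line terminator, `:`-prefixed comment lines) follows the standard server-sent events format.

```python
import json
import queue


def sse_events(events, poll_timeout=0.5, max_idle_polls=None):
    """Yield SSE frames for (kind, payload) tuples read from a queue.

    `events` may be a queue.Queue or a multiprocessing.Queue; both raise
    queue.Empty when get() times out.
    """
    idle = 0
    while True:
        try:
            kind, payload = events.get(timeout=poll_timeout)
        except queue.Empty:
            idle += 1
            if max_idle_polls is not None and idle >= max_idle_polls:
                return
            # SSE comment line, ignored by clients, keeps the connection open.
            yield ": keep-alive\n\n"
            continue
        idle = 0
        yield f"event: {kind}\ndata: {json.dumps(payload)}\n\n"
        if kind == "done":  # hypothetical terminal event
            return
```

The generator can be passed to whatever streaming response type the web framework provides; the keep-alive comment is one common way to avoid idle-connection timeouts while a worker is busy.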
### 2. Highlight readiness and screenshot stability
- Reworked highlight readiness to use a snapshot-first approach instead of relying on page-side polling loops that can be throttled in background tabs.
- Added consistency/stability handling around highlight capture and screenshot timing.
- Improved screenshot capture by waiting for page settle before capture, including fonts, viewport mutations, media readiness, and quiet windows.
- Adjusted collision/highlight behavior so `element_type="any"` better surfaces scrollable regions and remains more stable visually.
- Documented the new highlight-readiness design in project docs.
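The "quiet window" idea behind page settle can be sketched in a few lines. The actual logic lives in the extension's TypeScript and is not reproduced here; this Python version, including the name `wait_for_quiet` and all of its parameters, is a hypothetical illustration of the concept only: capture proceeds once no activity has been observed for a fixed interval, or gives up at a deadline.

```python
import time


def wait_for_quiet(last_activity, quiet_s=0.2, timeout_s=3.0,
                   clock=time.monotonic, sleep=time.sleep, poll_s=0.05):
    """Return True once nothing has happened for `quiet_s` seconds.

    `last_activity` is a callable returning the timestamp of the most recent
    observed activity (for example a DOM mutation). Returns False if the
    quiet window is not reached before `timeout_s` elapses.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        if clock() - last_activity() >= quiet_s:
            return True
        sleep(poll_s)
    return False
```

Injecting `clock` and `sleep` keeps the helper testable with a fake clock, and the hard deadline means a page that never stops mutating cannot block capture indefinitely.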
Key files:
`extension/src/commands/highlight-detection.injected.js`
`extension/src/commands/highlight-detection.ts`
`extension/src/commands/screenshot.ts`
`extension/src/utils/layout-stability.ts`
`extension/src/utils/collision-detection.ts`
`AGENTS.md`
### 3. Element interaction improvements, including swipe support
- Added swipe/carousel interaction support for swipable regions.
- Updated element-interaction prompts and tool guidance so the agent can distinguish scrollable containers from swipable carousel/slider regions.
- Fixed async JavaScript result handling for click flows where dialogs can open during JS-driven interactions, avoiding false failures.
- Expanded command/type support so swipe is treated as a first-class interaction.
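Treating swipe as a first-class interaction could be modeled roughly as follows. This is a sketch, not the contents of `server/models/commands.py`: the `Action`, `Direction`, and `ElementCommand` names, the field names, and the rule that swipe and scroll require a direction are all assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(str, Enum):
    CLICK = "click"
    SCROLL = "scroll"
    SWIPE = "swipe"  # carousel/slider regions, distinct from scrollable containers


class Direction(str, Enum):
    LEFT = "left"
    RIGHT = "right"
    UP = "up"
    DOWN = "down"


@dataclass(frozen=True)
class ElementCommand:
    """Hypothetical command model validated at construction time."""
    action: Action
    element_id: str
    direction: Optional[Direction] = None

    def __post_init__(self):
        needs_direction = self.action in (Action.SCROLL, Action.SWIPE)
        if needs_direction and self.direction is None:
            raise ValueError(f"{self.action.value} requires a direction")
        if not needs_direction and self.direction is not None:
            raise ValueError(f"{self.action.value} does not take a direction")

    def to_payload(self) -> dict:
        payload = {"action": self.action.value, "element_id": self.element_id}
        if self.direction is not None:
            payload["direction"] = self.direction.value
        return payload
```

Keeping swipe and scroll as separate enum members mirrors the prompt-level distinction between scrollable containers and swipable regions, so the extension can dispatch on the action without guessing intent from the target element.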
Key files:
`extension/src/commands/element-actions.ts`
`extension/src/background/index.ts`
`server/models/commands.py`
`server/agent/prompts/element_interaction_tool.j2`
`server/agent/prompts/highlight_tool.j2`
`server/agent/prompts/tab_tool.j2`
### 4. Prompt and model/tooling refinements
- Refined highlight prompt guidance, especially around when to use keywords vs pagination.
- Tightened JavaScript prompt guidance so JS is positioned as a narrow system-specific fallback instead of a primary interaction path.
- Updated prompt context/tool profile behavior to better match model tier and browser tool availability.
- Bumped `openhands-sdk` / `openhands-tools` dependency revisions and refreshed the lockfile.
Key files:
`server/agent/prompts/highlight_tool.j2`
`server/agent/prompts/javascript_tool.j2`
`server/agent/tools/base.py`
`server/agent/tools/browser_executor.py`
`server/agent/tools/prompt_context.py`
`pyproject.toml`
`uv.lock`
### 5. Evaluation expansion and maintenance
- Added the new BlueBook evaluation site, assets, and two datasets:
  - `bluebook_simple`
  - `bluebook_complex`
- Updated the eval server/docs to expose the new scenario.
- Refreshed the checked-in evaluation report.
- Fixed evaluation cost extraction to use the final `usage_metrics` SSE snapshot instead of the first one; reading the first snapshot underreported costs.
- Recomputed `eval/evaluation_report.json` from the saved SSE output.
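The cost-extraction fix can be illustrated with a small parser that keeps the last `usage_metrics` snapshot from saved SSE output. This is a sketch, not the code in `eval/evaluate_browser_agent.py`: it assumes `usage_metrics` is the SSE event name, and the function name `final_usage_metrics` and the `accumulated_cost` field in the usage example are hypothetical.

```python
import json


def final_usage_metrics(sse_text):
    """Return the payload of the last `usage_metrics` event, or None.

    Snapshots are assumed to be cumulative, so the final one carries the
    total; stopping at the first would report only the initial usage.
    """
    last = None
    for frame in sse_text.split("\n\n"):
        event = None
        data_lines = []
        for line in frame.splitlines():
            if line.startswith("event:"):
                event = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        if event == "usage_metrics" and data_lines:
            last = json.loads("\n".join(data_lines))
    return last
```

For example, given two snapshots with `accumulated_cost` values of 0.01 and 0.05, the function returns the second payload, which is the figure the report should record.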
Key files:
`eval/bluebook/index.html`
`eval/bluebook/js/bluebook.js`
`eval/bluebook/css/bluebook.css`
`eval/dataset/bluebook_simple.yaml`
`eval/dataset/bluebook_complex.yaml`
`eval/server.py`
`eval/README.md`
`eval/evaluate_browser_agent.py`
`eval/evaluation_report.json`
## Testing
Added or updated tests for:
- multi-process agent API behavior
- process manager / worker lifecycle
- browser executor bundle behavior
- command model coverage
- prompt/profile behavior
- highlight detection and layout stability
- screenshot capture behavior
- element-action regression coverage
- eval client cost extraction behavior
Representative test files:
`server/tests/unit/test_agent_api_multiprocess.py`
`server/tests/unit/test_agent_manager_process.py`
`server/tests/unit/test_browser_executor_bundle.py`
`server/tests/unit/test_eval_client.py`
`extension/src/__tests__/highlight-detection.test.ts`
`extension/src/__tests__/highlight-layout-stability.test.ts`
`extension/src/__tests__/element-actions-regression.test.ts`
## Notes
This branch also contains a few non-core artifacts alongside product changes, including:
- checked-in eval output/report updates
- lock files under `eval/.locks/`
- `bug_report_highlight_any_scrollable.md`
## File tree
58 files changed (+7969, -1165 lines)
- eval
  - .locks
  - bluebook
    - css
    - js
  - dataset
- extension/src
  - __tests__
  - background
  - commands
  - utils
  - websocket
- server
  - agent
    - prompts
    - tools
  - api/routes
  - core
  - models
  - tests/unit