Commit dafb26c
Refresh evaluation and skill documentation
1 parent a358244

3 files changed: +89 −50 lines

README.md

Lines changed: 41 additions & 22 deletions
````diff
@@ -70,32 +70,36 @@ Model capability matters, but so does price. We do not assume token costs stay c
 
 ## Evaluation
 
-OpenBrowser is evaluated in two complementary ways:
+The primary evaluation signal in this repo is the latest checked-in report:
 
-- Real browser workflows and side-by-side comparisons against existing approaches
-- A custom regression suite of mocked websites with event tracking in [`eval/`](eval/)
+- [`eval/evaluation_report.json`](eval/evaluation_report.json)
 
-The main archived comparison in this repo keeps the same control setup and compares `OpenClaw Browser Relay` with `OpenClaw + OpenBrowser skill`:
+The test set is a series of local mock websites in [`eval/`](eval/) that simulate realistic browser tasks and record structured interaction events.
 
-- [`eval/archived/2026-03-16/browser_agent_evaluation_2026-03-16_openclaw_vs_openbrowser.md`](eval/archived/2026-03-16/browser_agent_evaluation_2026-03-16_openclaw_vs_openbrowser.md)
-- [`eval/evaluation_report.json`](eval/evaluation_report.json)
+That snapshot was generated on `2026-03-30 11:17:06` and evaluates OpenBrowser on `12` tracked browser tasks across two models. We care about three things first:
+
+- Correctness: pass/fail plus task-score coverage
+- Efficiency: average execution time
+- Cost: average RMB cost per task
 
-What we track:
+Current snapshot:
 
-- Pass rate
-- Execution time
-- Cost
-- Remaining context headroom in the control window
+- Overall: `24/24` runs passed, `100%` pass rate
+- `dashscope/qwen3.5-flash`: `12/12` passed, `68.5/68.5` task score, `114.89s` average duration, `0.075442 RMB` average cost
+- `dashscope/qwen3.5-plus`: `12/12` passed, `67.5/68.5` task score, `149.63s` average duration, `0.291952 RMB` average cost
 
-Representative archived results from `2026-03-16`:
+| Model | Correctness | Avg. Time | Avg. Cost (RMB) | Composite Score |
+|-------|-------------|-----------|-----------------|-----------------|
+| `dashscope/qwen3.5-flash` | `12/12` passed, `68.5/68.5` | `114.89s` | `0.075442` | `0.9358` |
+| `dashscope/qwen3.5-plus` | `12/12` passed, `67.5/68.5` | `149.63s` | `0.291952` | `0.8774` |
 
-| Setup | Pass Rate | Avg. Time | Control Window Context |
-|--------|-----------|-----------|------------------------|
-| OpenClaw Browser Relay | 6/7 | 211s | 640% |
-| OpenClaw + OpenBrowser (`qwen3.5-plus`) | 7/7 | 274s | 21% |
-| OpenClaw + OpenBrowser (`qwen3.5-flash`) | 5/7 first pass, 7/7 with retry | 317s | 12% |
+On the current suite, `qwen3.5-flash` is the better efficiency-cost point: it keeps the same `100%` pass rate while being about `23.2%` faster and `74.2%` cheaper than `qwen3.5-plus`. `qwen3.5-plus` remains useful as a stronger fallback profile for harder visual workflows, but the repo's current default evaluation story is no longer "benchmark comparison against OpenClaw"; it is "how well our latest stack scores on correctness, speed, and cost."
+
+Older side-by-side comparisons with OpenClaw are kept only as archived context:
+
+- [`eval/archived/2026-03-16/browser_agent_evaluation_2026-03-16_openclaw_vs_openbrowser.md`](eval/archived/2026-03-16/browser_agent_evaluation_2026-03-16_openclaw_vs_openbrowser.md)
 
-That comparison is not meant to claim OpenBrowser wins every metric on every task. It is meant to make the tradeoff explicit: DOM-heavy relay systems can be strong today, while OpenBrowser is designed to preserve control-window headroom, support a multimodal execution path, and improve through repeatable evaluation.
+Those archived results are still useful for historical tradeoff discussion, but they are not the main metric we optimize against now.
 
 ### Run Your Own Evaluation
 
@@ -106,13 +110,18 @@ python eval/evaluate_browser_agent.py --list
 # Set the browser capability token once
 export OPENBROWSER_CHROME_UUID=YOUR_BROWSER_UUID
 
-# Run all tests with both models
-python eval/evaluate_browser_agent.py --model dashscope/qwen3.5-plus --model dashscope/qwen3.5-flash
+# Run one test with a configured LLM alias
+python eval/evaluate_browser_agent.py --test techforum --model-alias default
+
+# Run all tests with multiple configured aliases
+python eval/evaluate_browser_agent.py --model-alias plus --model-alias flash
 
 # Or pass the browser UUID explicitly per run
-python eval/evaluate_browser_agent.py --test techforum --chrome-uuid YOUR_BROWSER_UUID
+python eval/evaluate_browser_agent.py --test techforum --chrome-uuid YOUR_BROWSER_UUID --model-alias default
 ```
 
+`--model-alias` must match an LLM alias configured in the OpenBrowser web UI, such as `default`, `plus`, or `flash`.
+
 See [AGENTS.md](AGENTS.md#evaluation-system) for evaluation framework documentation.
 
 ## Quick Start
@@ -230,7 +239,17 @@ This means browser control is authorized by possession of the UUID capability to
 
 ### Try OpenBrowser with SKILL - install to your local agents
 
-Simply tell your agent to install `skill/codex/open-browser`
+OpenBrowser ships with skills for both `Codex` and `OpenClaw`:
+
+- `skill/codex/open-browser`
+- `skill/openclaw/open-browser`
+
+They are similar in purpose, but slightly different in workflow:
+
+- The `Codex` skill is tuned for Codex-style repo workflows and supports either foreground or background task execution.
+- The `OpenClaw` skill is tuned for OpenClaw usage, emphasizes background execution, and frames OpenBrowser as the stronger option for rendered-page and multi-step browser tasks.
+
+Install the one that matches your local agent environment.
 
 ## Why Qwen3.5 Family Right Now?
````
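As a cross-check, the relative-improvement figures quoted in the new README text ("about `23.2%` faster and `74.2%` cheaper") follow directly from the snapshot table. A minimal sketch of that arithmetic; the dicts below simply restate the table values, and only the computation is ours:

```python
# Values restated from the evaluation snapshot table; only the arithmetic is new.
flash = {"avg_time_s": 114.89, "avg_cost_rmb": 0.075442}
plus = {"avg_time_s": 149.63, "avg_cost_rmb": 0.291952}

# Relative improvement of flash over plus, as a fraction of the plus baseline.
speedup = (plus["avg_time_s"] - flash["avg_time_s"]) / plus["avg_time_s"]
cost_saving = (plus["avg_cost_rmb"] - flash["avg_cost_rmb"]) / plus["avg_cost_rmb"]

print(f"qwen3.5-flash is {speedup:.1%} faster and {cost_saving:.1%} cheaper")
# -> qwen3.5-flash is 23.2% faster and 74.2% cheaper
```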

README.zh-CN.md

Lines changed: 41 additions & 22 deletions
````diff
@@ -62,32 +62,36 @@ OpenBrowser does not iterate on "it feels good". The repo includes event-
 
 ## Evaluation
 
-OpenBrowser is currently evaluated in two complementary ways:
+The most important evaluation baseline in this repo is the latest checked-in result file:
 
-- Real browser workflows, plus side-by-side comparisons with existing approaches
-- A custom regression suite of event-tracked mock websites under [`eval/`](eval/)
+- [`eval/evaluation_report.json`](eval/evaluation_report.json)
 
-The main archived comparison in this repo keeps the same control setup and compares `OpenClaw Browser Relay` with `OpenClaw + OpenBrowser skill`:
+The test set itself is a series of local mock websites under [`eval/`](eval/) that simulate real browser tasks and record structured interaction events.
 
-- [`eval/archived/2026-03-16/browser_agent_evaluation_2026-03-16_openclaw_vs_openbrowser.md`](eval/archived/2026-03-16/browser_agent_evaluation_2026-03-16_openclaw_vs_openbrowser.md)
-- [`eval/evaluation_report.json`](eval/evaluation_report.json)
+That snapshot was generated at `2026-03-30 11:17:06` and evaluates two models on its `12` event-tracked browser tasks. We now look at three things first:
+
+- Correctness: whether runs pass, plus task-score coverage
+- Efficiency: average execution time
+- Cost: average RMB cost per task
 
-What we track:
+Current snapshot results:
 
-- Pass rate
-- Execution time
-- Cost
-- Remaining control-window context headroom
+- Overall: `24/24` runs passed, `100%` overall pass rate
+- `dashscope/qwen3.5-flash`: `12/12` passed, task score `68.5/68.5`, `114.89s` average time, `0.075442 RMB` average cost
+- `dashscope/qwen3.5-plus`: `12/12` passed, task score `67.5/68.5`, `149.63s` average time, `0.291952 RMB` average cost
 
-Representative archived results from `2026-03-16`:
+| Model | Correctness | Avg. Time | Avg. Cost (RMB) | Composite Score |
+|------|--------|----------|------------------|--------|
+| `dashscope/qwen3.5-flash` | `12/12` passed, `68.5/68.5` | `114.89s` | `0.075442` | `0.9358` |
+| `dashscope/qwen3.5-plus` | `12/12` passed, `67.5/68.5` | `149.63s` | `0.291952` | `0.8774` |
 
-| Setup | Pass Rate | Avg. Time | Control Window Context |
-|------|--------|----------|----------------|
-| OpenClaw Browser Relay | 6/7 | 211s | 640% |
-| OpenClaw + OpenBrowser (`qwen3.5-plus`) | 7/7 | 274s | 21% |
-| OpenClaw + OpenBrowser (`qwen3.5-flash`) | 5/7 first pass, 7/7 after retry | 317s | 12% |
+On the current suite, `qwen3.5-flash` is the better efficiency/cost operating point: while keeping the same `100%` pass rate, it is about `23.2%` faster than `qwen3.5-plus` at about `74.2%` lower average cost. `qwen3.5-plus` is still the stronger fallback tier, suited to harder visual reasoning or more complex workflows; but the repo's main narrative is no longer "benchmark comparison against OpenClaw", it is "the latest results of our current stack on correctness, speed, and cost".
+
+Earlier side-by-side comparisons with OpenClaw are now kept as archived material:
+
+- [`eval/archived/2026-03-16/browser_agent_evaluation_2026-03-16_openclaw_vs_openbrowser.md`](eval/archived/2026-03-16/browser_agent_evaluation_2026-03-16_openclaw_vs_openbrowser.md)
 
-This comparison is not meant to claim OpenBrowser is better on every task and every metric; it is meant to make the real tradeoff clear: DOM-heavy relay systems can still be strong today, while OpenBrowser's design goal is to preserve control-window context headroom, support a multimodal execution path, and keep improving through repeatable evaluation.
+These archived results are still valuable for understanding historical tradeoffs, but they are no longer the main metrics we optimize against.
 
 ### Run Your Own Evaluation
 
@@ -98,13 +102,18 @@ python eval/evaluate_browser_agent.py --list
 # Set the browser capability token once
 export OPENBROWSER_CHROME_UUID=YOUR_BROWSER_UUID
 
-# Run all tests with both models
-python eval/evaluate_browser_agent.py --model dashscope/qwen3.5-plus --model dashscope/qwen3.5-flash
+# Run one test with a configured LLM alias
+python eval/evaluate_browser_agent.py --test techforum --model-alias default
+
+# Run all tests with multiple configured aliases
+python eval/evaluate_browser_agent.py --model-alias plus --model-alias flash
 
 # Or pass the browser UUID explicitly for a single run
-python eval/evaluate_browser_agent.py --test techforum --chrome-uuid YOUR_BROWSER_UUID
+python eval/evaluate_browser_agent.py --test techforum --chrome-uuid YOUR_BROWSER_UUID --model-alias default
 ```
 
+`--model-alias` must correspond to an LLM alias you configured in the OpenBrowser Web UI, such as `default`, `plus`, or `flash`.
+
 See [AGENTS.md](AGENTS.md#evaluation-system) for evaluation framework documentation.
 
 ## Quick Start
@@ -224,7 +233,17 @@ http://localhost:8765
 
 ### You can also use OpenBrowser via SKILL
 
-Simply tell your agent to install `skill/codex/open-browser`
+OpenBrowser ships with skills for both `Codex` and `OpenClaw`:
+
+- `skill/codex/open-browser`
+- `skill/openclaw/open-browser`
+
+They share the same goal but differ slightly in workflow:
+
+- The `Codex` version fits Codex repo-collaboration workflows and can run in either the foreground or the background.
+- The `OpenClaw` version fits OpenClaw usage, leans more on background execution, and positions OpenBrowser as the better fit for rendered pages and multi-step browser tasks.
+
+Install the one that matches your local agent environment.
 
 ## Why Mainly the Qwen3.5 Family Right Now?
````

eval/README.md

Lines changed: 7 additions & 6 deletions
````diff
@@ -137,25 +137,26 @@ Automated evaluation now requires a browser UUID capability token copied from th
 Quick start:
 
 ```bash
-python eval/evaluate_browser_agent.py --test techforum --chrome-uuid YOUR_BROWSER_UUID
+python eval/evaluate_browser_agent.py --test techforum --chrome-uuid YOUR_BROWSER_UUID --model-alias default
 ```
 
 Recommended options:
 
 ```bash
 export OPENBROWSER_CHROME_UUID=YOUR_BROWSER_UUID
-python eval/evaluate_browser_agent.py --test techforum
-python eval/evaluate_browser_agent.py --model dashscope/qwen3.5-plus --test techforum
-python eval/evaluate_browser_agent.py --model dashscope/qwen3.5-plus --model dashscope/qwen3.5-flash
+python eval/evaluate_browser_agent.py --test techforum --model-alias default
+python eval/evaluate_browser_agent.py --test techforum --model-alias plus
+python eval/evaluate_browser_agent.py --model-alias plus --model-alias flash
 python eval/evaluate_browser_agent.py --list
 python eval/evaluate_browser_agent.py --manual --test techforum
 ```
 
 Notes:
 
 1. `--chrome-uuid` is required for automated runs that call the OpenBrowser browser-control APIs.
-2. `--manual` and `--list` do not require a browser UUID.
-3. `OPENBROWSER_CHROME_UUID` is the equivalent environment variable for scripting and CI-style usage.
+2. Automated evaluation also requires at least one `--model-alias`, which must match a configured LLM alias in the OpenBrowser web UI.
+3. `--manual` and `--list` do not require a browser UUID.
+4. `OPENBROWSER_CHROME_UUID` is the equivalent environment variable for scripting and CI-style usage.
 
 ## Evaluating AI Agent Behavior
````
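Since automated runs persist their results to `eval/evaluation_report.json`, CI-style scripting can aggregate the report after a run. A minimal sketch, assuming a hypothetical flat list of per-run records; the field names below are illustrative, and the real report schema may differ:

```python
from collections import defaultdict

# Hypothetical per-run records, for illustration only; not the real
# evaluation_report.json layout.
runs = [
    {"model": "dashscope/qwen3.5-flash", "passed": True, "duration_s": 110.0, "cost_rmb": 0.07},
    {"model": "dashscope/qwen3.5-flash", "passed": True, "duration_s": 119.8, "cost_rmb": 0.08},
    {"model": "dashscope/qwen3.5-plus", "passed": True, "duration_s": 149.6, "cost_rmb": 0.29},
]

def summarize(runs):
    """Group runs by model and compute pass rate plus average time/cost."""
    by_model = defaultdict(list)
    for run in runs:
        by_model[run["model"]].append(run)
    return {
        model: {
            "pass_rate": sum(r["passed"] for r in rs) / len(rs),
            "avg_duration_s": sum(r["duration_s"] for r in rs) / len(rs),
            "avg_cost_rmb": sum(r["cost_rmb"] for r in rs) / len(rs),
        }
        for model, rs in by_model.items()
    }

for model, stats in summarize(runs).items():
    print(f"{model}: {stats['pass_rate']:.0%} pass, "
          f"{stats['avg_duration_s']:.2f}s avg, {stats['avg_cost_rmb']:.4f} RMB avg")
```

The same grouping would give the per-model rows shown in the README snapshot table once pointed at the real report data.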
