That snapshot was generated on `2026-03-30 11:17:06` and evaluates OpenBrowser on `12` tracked browser tasks across two models. We care about three things first:
+
+- Correctness: pass/fail plus task-score coverage
+- Efficiency: average execution time
+- Cost: average RMB cost per task

-What we track:
+Current snapshot:

-- Pass rate
-- Execution time
-- Cost
-- Remaining context headroom in the control window
+- Overall: `24/24` runs passed, `100%` pass rate
+- `dashscope/qwen3.5-flash`: `12/12` passed, `68.5/68.5` task score, `114.89s` average duration, `0.075442 RMB` average cost
+- `dashscope/qwen3.5-plus`: `12/12` passed, `67.5/68.5` task score, `149.63s` average duration, `0.291952 RMB` average cost

-Representative archived results from `2026-03-16`:
+| Model | Correctness | Avg. Time | Avg. Cost (RMB) | Composite Score |
| OpenClaw + OpenBrowser (`qwen3.5-flash`) | 5/7 first pass, 7/7 with retry | 317s | 12% |
+On the current suite, `qwen3.5-flash` is the better efficiency-cost point: it keeps the same `100%` pass rate, while being about `23.2%` faster and `74.2%` cheaper than `qwen3.5-plus`. `qwen3.5-plus` still remains useful as a stronger fallback profile for harder visual workflows, but the repo's current default evaluation story is no longer "benchmark comparison against OpenClaw"; it is "how well our latest stack scores on correctness, speed, and cost."
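The `23.2%` and `74.2%` figures can be reproduced from the per-model averages in the snapshot. The sketch below assumes both percentages are relative reductions measured against `qwen3.5-plus` as the baseline; the helper name `relative_saving` and the dictionary keys are illustrative and not part of the repo.

```python
# Averages copied from the snapshot bullets for the two models.
flash = {"avg_seconds": 114.89, "avg_cost_rmb": 0.075442}
plus = {"avg_seconds": 149.63, "avg_cost_rmb": 0.291952}


def relative_saving(baseline: float, candidate: float) -> float:
    """Return how much lower `candidate` is than `baseline`, in percent.

    Assumption: "faster" and "cheaper" are both measured relative to the
    baseline value, i.e. (baseline - candidate) / baseline.
    """
    return (baseline - candidate) / baseline * 100


faster_pct = relative_saving(plus["avg_seconds"], flash["avg_seconds"])
cheaper_pct = relative_saving(plus["avg_cost_rmb"], flash["avg_cost_rmb"])

print(f"{faster_pct:.1f}% faster, {cheaper_pct:.1f}% cheaper")
# prints "23.2% faster, 74.2% cheaper"
```

Under a different convention (for example, speedup expressed as a ratio of durations) the numbers would differ, so the baseline choice is worth stating alongside any such comparison.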
+
+Older side-by-side comparisons with OpenClaw are kept only as archived context:
That comparison is not meant to claim OpenBrowser wins every metric on every task. It is meant to make the tradeoff explicit: DOM-heavy relay systems can be strong today, while OpenBrowser is designed to preserve control-window headroom, support a multimodal execution path, and improve through repeatable evaluation.
+Those archived results are still useful for historical tradeoff discussion, but they are not the main metric we optimize against now.
`--model-alias` must match an LLM alias configured in the OpenBrowser web UI, such as `default`, `plus`, or `flash`.
+
See [AGENTS.md](AGENTS.md#evaluation-system) for evaluation framework documentation.

## Quick Start
@@ -230,7 +239,17 @@ This means browser control is authorized by possession of the UUID capability to
### Try OpenBrowser with SKILL - install to your local agents

-Simply tell your agent to install `skill/codex/open-browser`
+OpenBrowser ships with skills for both `Codex` and `OpenClaw`:
+
+- `skill/codex/open-browser`
+- `skill/openclaw/open-browser`
+
+They are similar in purpose, but slightly different in workflow:
+
+- The `Codex` skill is tuned for Codex-style repo workflows and supports either foreground or background task execution.
+- The `OpenClaw` skill is tuned for OpenClaw usage, emphasizes background execution, and frames OpenBrowser as the stronger option for rendered-page and multi-step browser tasks.
+
+Install the one that matches your local agent environment.