Commit dafb26c
Refresh evaluation and skill documentation
1 parent a358244

3 files changed: +89 −50 lines

README.md

Lines changed: 41 additions & 22 deletions
````diff
@@ -70,32 +70,36 @@ Model capability matters, but so does price. We do not assume token costs stay c
 
 ## Evaluation
 
-OpenBrowser is evaluated in two complementary ways:
+The primary evaluation signal in this repo is the latest checked-in report:
 
-- Real browser workflows and side-by-side comparisons against existing approaches
-- A custom regression suite of mocked websites with event tracking in [`eval/`](eval/)
+- [`eval/evaluation_report.json`](eval/evaluation_report.json)
 
-The main archived comparison in this repo keeps the same control setup and compares `OpenClaw Browser Relay` with `OpenClaw + OpenBrowser skill`:
+The test set is a series of local mock websites in [`eval/`](eval/) that simulate realistic browser tasks and record structured interaction events.
 
-- [`eval/archived/2026-03-16/browser_agent_evaluation_2026-03-16_openclaw_vs_openbrowser.md`](eval/archived/2026-03-16/browser_agent_evaluation_2026-03-16_openclaw_vs_openbrowser.md)
-- [`eval/evaluation_report.json`](eval/evaluation_report.json)
+That snapshot was generated on `2026-03-30 11:17:06` and evaluates OpenBrowser on `12` tracked browser tasks across two models. We care about three things first:
+
+- Correctness: pass/fail plus task-score coverage
+- Efficiency: average execution time
+- Cost: average RMB cost per task
 
-What we track:
+Current snapshot:
 
-- Pass rate
-- Execution time
-- Cost
-- Remaining context headroom in the control window
+- Overall: `24/24` runs passed, `100%` pass rate
+- `dashscope/qwen3.5-flash`: `12/12` passed, `68.5/68.5` task score, `114.89s` average duration, `0.075442 RMB` average cost
+- `dashscope/qwen3.5-plus`: `12/12` passed, `67.5/68.5` task score, `149.63s` average duration, `0.291952 RMB` average cost
 
-Representative archived results from `2026-03-16`:
+| Model | Correctness | Avg. Time | Avg. Cost (RMB) | Composite Score |
+|-------|-------------|-----------|-----------------|-----------------|
+| `dashscope/qwen3.5-flash` | `12/12` passed, `68.5/68.5` | `114.89s` | `0.075442` | `0.9358` |
+| `dashscope/qwen3.5-plus` | `12/12` passed, `67.5/68.5` | `149.63s` | `0.291952` | `0.8774` |
 
-| Setup | Pass Rate | Avg. Time | Control Window Context |
-|--------|-----------|-----------|------------------------|
-| OpenClaw Browser Relay | 6/7 | 211s | 640% |
-| OpenClaw + OpenBrowser (`qwen3.5-plus`) | 7/7 | 274s | 21% |
-| OpenClaw + OpenBrowser (`qwen3.5-flash`) | 5/7 first pass, 7/7 with retry | 317s | 12% |
+On the current suite, `qwen3.5-flash` is the better efficiency-cost point: it keeps the same `100%` pass rate while being about `23.2%` faster and `74.2%` cheaper than `qwen3.5-plus`. `qwen3.5-plus` remains useful as a stronger fallback profile for harder visual workflows, but the repo's current default evaluation story is no longer "benchmark comparison against OpenClaw"; it is "how well our latest stack scores on correctness, speed, and cost."
+
+Older side-by-side comparisons with OpenClaw are kept only as archived context:
+
+- [`eval/archived/2026-03-16/browser_agent_evaluation_2026-03-16_openclaw_vs_openbrowser.md`](eval/archived/2026-03-16/browser_agent_evaluation_2026-03-16_openclaw_vs_openbrowser.md)
 
-That comparison is not meant to claim OpenBrowser wins every metric on every task. It is meant to make the tradeoff explicit: DOM-heavy relay systems can be strong today, while OpenBrowser is designed to preserve control-window headroom, support a multimodal execution path, and improve through repeatable evaluation.
+Those archived results are still useful for historical tradeoff discussion, but they are not the main metric we optimize against now.
 
 ### Run Your Own Evaluation
 
@@ -106,13 +110,18 @@ python eval/evaluate_browser_agent.py --list
 # Set the browser capability token once
 export OPENBROWSER_CHROME_UUID=YOUR_BROWSER_UUID
 
-# Run all tests with both models
-python eval/evaluate_browser_agent.py --model dashscope/qwen3.5-plus --model dashscope/qwen3.5-flash
+# Run one test with a configured LLM alias
+python eval/evaluate_browser_agent.py --test techforum --model-alias default
+
+# Run all tests with multiple configured aliases
+python eval/evaluate_browser_agent.py --model-alias plus --model-alias flash
 
 # Or pass the browser UUID explicitly per run
-python eval/evaluate_browser_agent.py --test techforum --chrome-uuid YOUR_BROWSER_UUID
+python eval/evaluate_browser_agent.py --test techforum --chrome-uuid YOUR_BROWSER_UUID --model-alias default
 ```
 
+`--model-alias` must match an LLM alias configured in the OpenBrowser web UI, such as `default`, `plus`, or `flash`.
+
 See [AGENTS.md](AGENTS.md#evaluation-system) for evaluation framework documentation.
 
 ## Quick Start
@@ -230,7 +239,17 @@ This means browser control is authorized by possession of the UUID capability to
 
 ### Try OpenBrowser with SKILL - install to your local agents
 
-Simply tell your agent to install `skill/codex/open-browser`
+OpenBrowser ships with skills for both `Codex` and `OpenClaw`:
+
+- `skill/codex/open-browser`
+- `skill/openclaw/open-browser`
+
+They are similar in purpose, but slightly different in workflow:
+
+- The `Codex` skill is tuned for Codex-style repo workflows and supports either foreground or background task execution.
+- The `OpenClaw` skill is tuned for OpenClaw usage, emphasizes background execution, and frames OpenBrowser as the stronger option for rendered-page and multi-step browser tasks.
+
+Install the one that matches your local agent environment.
 
 ## Why Qwen3.5 Family Right Now?
````
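As a cross-check, the relative-improvement figures quoted in the new README text ("about `23.2%` faster and `74.2%` cheaper") follow directly from the snapshot table. A minimal sketch of that arithmetic; the dicts below simply restate the table values, and only the computation is ours:

```python
# Values restated from the evaluation snapshot table; only the arithmetic is new.
flash = {"avg_time_s": 114.89, "avg_cost_rmb": 0.075442}
plus = {"avg_time_s": 149.63, "avg_cost_rmb": 0.291952}

# Relative improvement of flash over plus, as a fraction of the plus baseline.
speedup = (plus["avg_time_s"] - flash["avg_time_s"]) / plus["avg_time_s"]
cost_saving = (plus["avg_cost_rmb"] - flash["avg_cost_rmb"]) / plus["avg_cost_rmb"]

print(f"qwen3.5-flash is {speedup:.1%} faster and {cost_saving:.1%} cheaper")
# -> qwen3.5-flash is 23.2% faster and 74.2% cheaper
```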

README.zh-CN.md

Lines changed: 41 additions & 22 deletions
````diff
@@ -62,32 +62,36 @@ OpenBrowser does not iterate on "it feels good". The repo includes event-
 
 ## Evaluation
 
-OpenBrowser is currently evaluated in two complementary ways:
+The most important evaluation baseline in this repo is the latest checked-in result file:
 
-- Real browser workflows, plus side-by-side comparisons with existing approaches
-- A custom regression suite of event-tracked mock websites under [`eval/`](eval/)
+- [`eval/evaluation_report.json`](eval/evaluation_report.json)
 
-The main archived comparison in this repo keeps the same control setup and compares `OpenClaw Browser Relay` with `OpenClaw + OpenBrowser skill`:
+The test set itself is a series of local mock websites under [`eval/`](eval/) that simulate real browser tasks and record structured interaction events.
 
-- [`eval/archived/2026-03-16/browser_agent_evaluation_2026-03-16_openclaw_vs_openbrowser.md`](eval/archived/2026-03-16/browser_agent_evaluation_2026-03-16_openclaw_vs_openbrowser.md)
-- [`eval/evaluation_report.json`](eval/evaluation_report.json)
+That snapshot was generated at `2026-03-30 11:17:06` and evaluates two models on its `12` event-tracked browser tasks. We now look at three things first:
+
+- Correctness: whether runs pass, plus task-score coverage
+- Efficiency: average execution time
+- Cost: average RMB cost per task
 
-What we track:
+Current snapshot results:
 
-- Pass rate
-- Execution time
-- Cost
-- Remaining control-window context headroom
+- Overall: `24/24` runs passed, `100%` overall pass rate
+- `dashscope/qwen3.5-flash`: `12/12` passed, task score `68.5/68.5`, `114.89s` average time, `0.075442 RMB` average cost
+- `dashscope/qwen3.5-plus`: `12/12` passed, task score `67.5/68.5`, `149.63s` average time, `0.291952 RMB` average cost
 
-Representative archived results from `2026-03-16`:
+| Model | Correctness | Avg. Time | Avg. Cost (RMB) | Composite Score |
+|------|--------|----------|------------------|--------|
+| `dashscope/qwen3.5-flash` | `12/12` passed, `68.5/68.5` | `114.89s` | `0.075442` | `0.9358` |
+| `dashscope/qwen3.5-plus` | `12/12` passed, `67.5/68.5` | `149.63s` | `0.291952` | `0.8774` |
 
-| Setup | Pass Rate | Avg. Time | Control Window Context |
-|------|--------|----------|----------------|
-| OpenClaw Browser Relay | 6/7 | 211s | 640% |
-| OpenClaw + OpenBrowser (`qwen3.5-plus`) | 7/7 | 274s | 21% |
-| OpenClaw + OpenBrowser (`qwen3.5-flash`) | 5/7 first pass, 7/7 after retry | 317s | 12% |
+On the current suite, `qwen3.5-flash` is the better efficiency/cost operating point: while keeping the same `100%` pass rate, it is about `23.2%` faster than `qwen3.5-plus` at about `74.2%` lower average cost. `qwen3.5-plus` is still the stronger fallback tier, suited to harder visual reasoning or more complex workflows; but the repo's main narrative is no longer "benchmark comparison against OpenClaw", it is "the latest results of our current stack on correctness, speed, and cost".
+
+Earlier side-by-side comparisons with OpenClaw are now kept as archived material:
+
+- [`eval/archived/2026-03-16/browser_agent_evaluation_2026-03-16_openclaw_vs_openbrowser.md`](eval/archived/2026-03-16/browser_agent_evaluation_2026-03-16_openclaw_vs_openbrowser.md)
 
-This comparison is not meant to claim OpenBrowser is better on every task and every metric; it is meant to make the real tradeoff clear: DOM-heavy relay systems can still be strong today, while OpenBrowser's design goal is to preserve control-window context headroom, support a multimodal execution path, and keep improving through repeatable evaluation.
+These archived results are still valuable for understanding historical tradeoffs, but they are no longer the main metrics we optimize against.
 
 ### Run Your Own Evaluation
 
@@ -98,13 +102,18 @@ python eval/evaluate_browser_agent.py --list
 # Set the browser capability token once
 export OPENBROWSER_CHROME_UUID=YOUR_BROWSER_UUID
 
-# Run all tests with both models
-python eval/evaluate_browser_agent.py --model dashscope/qwen3.5-plus --model dashscope/qwen3.5-flash
+# Run one test with a configured LLM alias
+python eval/evaluate_browser_agent.py --test techforum --model-alias default
+
+# Run all tests with multiple configured aliases
+python eval/evaluate_browser_agent.py --model-alias plus --model-alias flash
 
 # Or pass the browser UUID explicitly for a single run
-python eval/evaluate_browser_agent.py --test techforum --chrome-uuid YOUR_BROWSER_UUID
+python eval/evaluate_browser_agent.py --test techforum --chrome-uuid YOUR_BROWSER_UUID --model-alias default
 ```
 
+`--model-alias` must correspond to an LLM alias you configured in the OpenBrowser Web UI, such as `default`, `plus`, or `flash`.
+
 See [AGENTS.md](AGENTS.md#evaluation-system) for evaluation framework documentation.
 
 ## Quick Start
@@ -224,7 +233,17 @@ http://localhost:8765
 
 ### You can also use OpenBrowser via SKILL
 
-Simply tell your agent to install `skill/codex/open-browser`
+OpenBrowser ships with skills for both `Codex` and `OpenClaw`:
+
+- `skill/codex/open-browser`
+- `skill/openclaw/open-browser`
+
+They share the same goal but differ slightly in workflow:
+
+- The `Codex` version fits Codex repo-collaboration workflows and can run in either the foreground or the background.
+- The `OpenClaw` version fits OpenClaw usage, leans more on background execution, and positions OpenBrowser as the better fit for rendered pages and multi-step browser tasks.
+
+Install the one that matches your local agent environment.
 
 ## Why Mainly the Qwen3.5 Family Right Now?
````

eval/README.md

Lines changed: 7 additions & 6 deletions
````diff
@@ -137,25 +137,26 @@ Automated evaluation now requires a browser UUID capability token copied from th
 Quick start:
 
 ```bash
-python eval/evaluate_browser_agent.py --test techforum --chrome-uuid YOUR_BROWSER_UUID
+python eval/evaluate_browser_agent.py --test techforum --chrome-uuid YOUR_BROWSER_UUID --model-alias default
 ```
 
 Recommended options:
 
 ```bash
 export OPENBROWSER_CHROME_UUID=YOUR_BROWSER_UUID
-python eval/evaluate_browser_agent.py --test techforum
-python eval/evaluate_browser_agent.py --model dashscope/qwen3.5-plus --test techforum
-python eval/evaluate_browser_agent.py --model dashscope/qwen3.5-plus --model dashscope/qwen3.5-flash
+python eval/evaluate_browser_agent.py --test techforum --model-alias default
+python eval/evaluate_browser_agent.py --test techforum --model-alias plus
+python eval/evaluate_browser_agent.py --model-alias plus --model-alias flash
 python eval/evaluate_browser_agent.py --list
 python eval/evaluate_browser_agent.py --manual --test techforum
 ```
 
 Notes:
 
 1. `--chrome-uuid` is required for automated runs that call the OpenBrowser browser-control APIs.
-2. `--manual` and `--list` do not require a browser UUID.
-3. `OPENBROWSER_CHROME_UUID` is the equivalent environment variable for scripting and CI-style usage.
+2. Automated evaluation also requires at least one `--model-alias`, which must match a configured LLM alias in the OpenBrowser web UI.
+3. `--manual` and `--list` do not require a browser UUID.
+4. `OPENBROWSER_CHROME_UUID` is the equivalent environment variable for scripting and CI-style usage.
 
 ## Evaluating AI Agent Behavior
````
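Since automated runs persist their results to `eval/evaluation_report.json`, CI-style scripting can aggregate the report after a run. A minimal sketch, assuming a hypothetical flat list of per-run records; the field names below are illustrative, and the real report schema may differ:

```python
from collections import defaultdict

# Hypothetical per-run records, for illustration only; not the real
# evaluation_report.json layout.
runs = [
    {"model": "dashscope/qwen3.5-flash", "passed": True, "duration_s": 110.0, "cost_rmb": 0.07},
    {"model": "dashscope/qwen3.5-flash", "passed": True, "duration_s": 119.8, "cost_rmb": 0.08},
    {"model": "dashscope/qwen3.5-plus", "passed": True, "duration_s": 149.6, "cost_rmb": 0.29},
]

def summarize(runs):
    """Group runs by model and compute pass rate plus average time/cost."""
    by_model = defaultdict(list)
    for run in runs:
        by_model[run["model"]].append(run)
    return {
        model: {
            "pass_rate": sum(r["passed"] for r in rs) / len(rs),
            "avg_duration_s": sum(r["duration_s"] for r in rs) / len(rs),
            "avg_cost_rmb": sum(r["cost_rmb"] for r in rs) / len(rs),
        }
        for model, rs in by_model.items()
    }

for model, stats in summarize(runs).items():
    print(f"{model}: {stats['pass_rate']:.0%} pass, "
          f"{stats['avg_duration_s']:.2f}s avg, {stats['avg_cost_rmb']:.4f} RMB avg")
```

The same grouping would give the per-model rows shown in the README snapshot table once pointed at the real report data.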
