An end-to-end, transparent benchmark for AI agents acting in the real world.
139 tasks, 15 services, Docker sandboxes, and robust grading.
Browse the full leaderboard and individual task cases at claw-eval.github.io.
Evaluation Logic (Updated March 2026):
- Primary Metric: Pass^3. To eliminate "lucky runs," a model must now pass a task consistently across three independent trials ($N=3$) to earn a success credit.
- Strict Pass Criterion: Under the Pass^3 methodology, a task is marked as passed only if the model meets the success criteria in all three runs.
- Reproducibility: We are committed to end-to-end reproducibility. Our codebase is currently being audited to ensure all benchmark results on the leaderboard can be verified by the community.
- Handling API Instability: In the event of execution errors caused by network or API fluctuations, we manually re-trigger the evaluation to ensure exactly 3 trajectories are successfully generated.
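The Pass^3 rule above can be sketched in a few lines of Python. This is a minimal illustration of the scoring logic only; the function name, task IDs, and result layout are hypothetical, not the benchmark's actual API:

```python
def pass_at_3(trial_results):
    """Pass^3: a task earns credit only if all three
    independent trials succeed (no credit for lucky runs)."""
    assert len(trial_results) == 3, "exactly 3 trajectories required"
    return all(trial_results)

# Hypothetical per-task outcomes, 3 trials each.
tasks = {
    "web_search_01": [True, True, True],   # passes under Pass^3
    "video_edit_07": [True, True, False],  # one failure -> no credit
}
score = sum(pass_at_3(r) for r in tasks.values()) / len(tasks)
print(f"Pass^3 score: {score:.2f}")  # -> Pass^3 score: 0.50
```

Note how Pass^3 is strictly harder than a single-trial pass rate: one flaky failure out of three zeroes out the task.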
- v1.1.0 is now live! 35 more challenging multimodal agentic tasks — agents perceive, reason, create, and deliver.
- v1.0.0 is now live! Built on reproducible real-world complexity.
- v0.0.0 released: from chatbot to real world. (2026.3)
We recommend using uv for fast, reliable dependency management:
pip install uv
uv venv --python 3.11
source .venv/bin/activate

Prepare your keys and set up the environments with one command:
export OPENROUTER_API_KEY=sk-or-...
export SERP_DEV_KEY=... # required for tasks that need real web search
bash scripts/test_sandbox.sh

Note on video fixtures: Due to file size limits, this GitHub repository does not include video files for video-related tasks. The complete fixtures (including all videos) are available on Hugging Face: claw-eval/Claw-Eval.
Go rock 🚀
claw-eval batch --config model_configs/claude_opus_46.yaml --sandbox --trials 3 --parallel 16

- More real-world, multimodal tasks in complex productivity environments
- Comprehensive, fine-grained scoring logic with deep state verification
- Enhanced sandbox isolation and full-trace tracking for transparent, scalable evaluation
We welcome any kind of contribution. Let us know if you have any suggestions!
Our test cases are built on the work of the community. We draw from and adapt tasks contributed by OpenClaw, PinBench, OfficeQA, OneMillion-Bench, Finance Agent, and Terminal-Bench 2.0.
Bowen Ye* (PKU), Rang Li* (PKU), Qibin Yang* (PKU), Zhihui Xie (HKU), Yuanxin Liu (PKU), Linli Yao (PKU), Hanglong Lyu (PKU), Lei Li
Advisors: Tong Yang (PKU), Zhifang Sui (PKU), Lingpeng Kong (HKU), Qi Liu (HKU)
