Benchmarking and reproducibility suite for EcoClaw.
This repository is the official place to:
- Install and run EcoClaw end-to-end
- Reproduce PinchBench baseline vs EcoClaw runs
- Store raw benchmark outputs and comparison reports
- Extend evaluation to additional datasets over time
- Runtime under test: EcoClaw
- Primary benchmark (phase 1): PinchBench-compatible `skill` tasks (recommended fork: Xubqpanda/skill)
- Evaluation goal: improve token efficiency while maintaining or improving task quality
```
EcoClaw-Bench/
├── docs/
├── experiments/
│   ├── configs/
│   │   └── pinchbench/
│   └── scripts/
├── results/
│   ├── raw/
│   └── reports/
└── assets/
```
- Copy `.env.example` to `.env`
- Fill in your API key and base URL

```bash
cp .env.example .env
```

Detailed variable reference: docs/env.md
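The setup step above can be sanity-checked with a small helper before launching any experiments. The variable names used in the example (`ECOCLAW_API_KEY`, `ECOCLAW_BASE_URL`) are assumptions for illustration; use the actual keys documented in docs/env.md.

```shell
# Sketch: fail fast if required variables are missing from the environment.
# Load the .env file first with: set -a; . ./.env; set +a
check_env() {
  local var missing=0
  for var in "$@"; do
    # ${!var} is bash indirect expansion: the value of the named variable.
    if [ -z "${!var:-}" ]; then
      echo "Missing $var in .env" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example (variable names are hypothetical, see docs/env.md):
# check_env ECOCLAW_API_KEY ECOCLAW_BASE_URL || exit 1
```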
Large dataset assets are not stored in git. After cloning, download the archives from Google Drive and extract them to:
- experiments/dataset/claw_eval/assets/
- experiments/dataset/pinchbench/assets/
Recommended release structure on Drive:
- claw_eval_assets_YYYYMMDD.zip
- pinchbench_assets_YYYYMMDD.zip
Maintain links in this section:

- Claw Eval assets: https://drive.google.com/drive/folders/1JXKLgfQ4Q3qSXEeOP5a3XjjmYS9t9pyc?usp=sharing
- PinchBench assets: https://drive.google.com/drive/folders/1JXKLgfQ4Q3qSXEeOP5a3XjjmYS9t9pyc?usp=sharing
After extraction, verify:

```bash
ls experiments/dataset/claw_eval/assets
ls experiments/dataset/pinchbench/assets
```

This repo uses a patched benchmark flow compared with upstream PinchBench scripts:
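Beyond eyeballing the `ls` output, the extraction check can be scripted. This is a sketch, not part of the repo's scripts; the paths come from the asset locations above.

```shell
# Sketch: verify each asset directory exists and is non-empty after extraction.
verify_assets() {
  local dir
  for dir in "$@"; do
    # ls -A lists all entries except . and ..; empty output means an empty dir.
    if [ ! -d "$dir" ] || [ -z "$(ls -A "$dir" 2>/dev/null)" ]; then
      echo "Missing or empty: $dir" >&2
      return 1
    fi
  done
  echo "All asset directories present."
}

# Example:
# verify_assets experiments/dataset/claw_eval/assets experiments/dataset/pinchbench/assets
```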
- Use the local/forked `skill` repo (set `ECOCLAW_SKILL_DIR` if needed).
- Baseline scripts support isolated parallel execution via `--parallel` / `ECOCLAW_PARALLEL`.
- Model aliases in experiment scripts are mapped to `dica/*` provider ids by default.
If your OpenClaw default model is not `dica/*`, prefer explicit full model ids in `.env`:

```bash
ECOCLAW_MODEL=dica/gpt-5-mini
ECOCLAW_JUDGE=dica/gpt-5-nano
```

This avoids a silent fallback to other providers/models in mixed-provider OpenClaw configs.
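A quick way to catch a bare model name before it triggers that fallback is to check for an explicit provider prefix. This helper is a hypothetical sketch, not part of the repo's scripts:

```shell
# Sketch: succeed only if the model id carries an explicit provider prefix
# (anything before a "/", e.g. "dica/gpt-5-mini").
has_provider_prefix() {
  case "${1:-}" in
    ?*/?*) return 0 ;;  # provider/model form
    *)     return 1 ;;  # bare model name: risks silent provider fallback
  esac
}

# Example:
# has_provider_prefix "$ECOCLAW_MODEL" || echo "Warning: set a full id like dica/gpt-5-mini" >&2
```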
- Read docs/install.md
- Fill the `.env.example` fields into your local `.env`
- Run the baseline
- Run the EcoClaw-enabled variant
- Compare the outputs
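The workflow above can be sketched as one function. The script names match this repo's layout; the flags passed to the EcoClaw run are assumptions mirrored from the baseline example, so adjust them to the script's actual options.

```shell
# Sketch: run the baseline pass, then the EcoClaw-enabled pass, then compare.
# Subshell body so `set -e` does not leak into the calling shell.
run_pinchbench_pipeline() (
  set -e
  ./experiments/scripts/run_pinchbench_baseline.sh --suite all --parallel 4
  # Assumed to accept the same flags as the baseline script.
  ./experiments/scripts/run_pinchbench_ecoclaw.sh --suite all --parallel 4
  ./experiments/scripts/compare_pinchbench_results.sh
)
```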
Linux (bash):

- `experiments/scripts/run_pinchbench_baseline.sh`
- `experiments/scripts/run_pinchbench_ecoclaw.sh`
- `experiments/scripts/compare_pinchbench_results.sh`
- `src/cost/calculate_llm_cost.py`
Example:

```bash
./experiments/scripts/run_pinchbench_baseline.sh --suite all --parallel 4
```

Linux quick guide: docs/linux.md