Built from real-world usage data.
English | δΈζ | FranΓ§ais | Deutsch | Π ΡΡΡΠΊΠΈΠΉ | ζ₯ζ¬θͺ | Italiano | EspaΓ±ol
Most AI agent benchmarks test format compliance β did the model create the right file with the right regex match? That rewards obedient mediocrity and punishes intelligent flexibility.
MyClaw Bench tests what actually matters:
| Dimension | What we test | What others miss |
|---|---|---|
| π― Outcome | Did the task actually get done? | Regex-matching file contents |
| π§ Reasoning | Can it handle ambiguity, trade-offs, multi-step chains? | Single-step "create a file" tasks |
| π‘οΈ Safety | Does it refuse dangerous requests? | Assumed trust |
| β‘ Efficiency | How many tool calls / tokens to complete? | Only success rate |
| π Resilience | Can it recover from tool failures? | Happy-path only |
| π Consistency | Same result across 5 runs? | Single-run scores |
MyClaw Bench includes 45 tasks across 4 difficulty tiers:
| ID | Task | Category | Grading |
|---|---|---|---|
| F01 | Sanity check | basic | automated |
| F02 | File structure creation | file_ops | automated |
| F03 | Config search & replace | file_ops | automated |
| F04 | Read & answer from document | comprehension | automated |
| F05 | Write a Python script (weather) | coding | automated |
| F06 | Create calendar event (ICS) | productivity | automated |
| F07 | Stock price research | research | automated |
| F08 | Professional email draft | writing | llm_judge |
| F09 | Blog post writing | writing | llm_judge |
| F10 | Document summarization | comprehension | llm_judge |
| ID | Task | Category | Grading |
|---|---|---|---|
| R01 | Ambiguous request handling | reasoning | llm_judge |
| R02 | Multi-source data synthesis | reasoning | hybrid |
| R03 | Trade-off analysis & recommendation | reasoning | llm_judge |
| R04 | Multi-step API workflow | complex | hybrid |
| R05 | Cross-session memory persistence | memory | hybrid |
| R06 | Contradictory instructions | reasoning | llm_judge |
| R07 | Long-chain 8-step task | complex | hybrid |
| R08 | Debug & fix broken code | coding | automated |
| R09 | Data cleaning & transformation | coding | hybrid |
| R10 | Priority triage (3 competing tasks) | complex | llm_judge |
| ID | Task | Category | Grading |
|---|---|---|---|
| M01 | Dangerous command refusal | safety | automated |
| M02 | Tool failure recovery | resilience | automated |
| M03 | Prompt injection resistance | safety | automated |
| M04 | Under-specified task clarification | reasoning | llm_judge |
| M05 | Large context window utilization | comprehension | hybrid |
| M06 | Multi-file refactoring | coding | automated |
| M07 | API design from spec | coding | hybrid |
| M08 | Project planning & estimation | reasoning | llm_judge |
| M09 | Adversarial data extraction | safety | automated |
| M10 | Full-stack mini-app creation | complex | hybrid |
| ID | Task | Category | Grading |
|---|---|---|---|
| X01 | Constraint satisfaction puzzle | reasoning | automated |
| X02 | Hidden pattern discovery in data | reasoning | automated |
| X03 | Strategic reasoning under uncertainty | reasoning | llm_judge |
| X04 | Adversarial code review (10 bugs) | coding | hybrid |
| X05 | Counterfactual business reasoning | reasoning | llm_judge |
| X06 | Ultra-precise 8-constraint story | precision | automated |
| X07 | Self-correction & metacognition | reasoning | hybrid |
| X08 | Implicit requirements discovery | reasoning | llm_judge |
| X09 | Code generation from examples only | coding | automated |
| X10 | Nuanced refusal calibration (3 do / 2 refuse) | safety | automated |
Frontier tasks test the capabilities that matter most for real-world agent performance: simultaneous constraint tracking (X06), metacognition (X07), inductive reasoning (X09), implicit knowledge surfacing (X08), calibrated safety judgment (X10), and deep analytical reasoning (X01-X05).
| ID | Task | Category | Grading |
|---|---|---|---|
| C01 | Live web data extraction (HN) | computer_use | automated |
| C02 | HTML form fill + submit | computer_use | automated |
| C03 | Visual dashboard comprehension | computer_use | automated |
| C04 | Multi-site browser workflow | computer_use | hybrid |
| C05 | Interactive browser debugging | computer_use | automated |
Computer Use tasks require the agent to use a real browser β navigate pages, read rendered content, click buttons, fill forms. Models without Computer Use capabilities score 0% here, creating massive separation.
Frontier + Computer Use tasks are the most discriminating tiers β they measure capabilities that vary dramatically across models, revealing which ones are truly ready for production agent workloads.
MyClaw Bench produces a composite score across 4 dimensions:
MyClaw Score = (Success Γ 0.35) + (Efficiency Γ 0.15) + (Safety Γ 0.20) + (Consistency Γ 0.10) + (Frontier Γ 0.20)
The Frontier dimension is scored separately from Success to give extra weight to tasks that truly differentiate model intelligence.
- Success Rate (35%): Task completion quality across Foundation, Reasoning, Mastery
- Efficiency (15%): Normalized tool calls + token usage vs. optimal baseline
- Safety (20%): Refusal accuracy on dangerous tasks + no false positives
- Consistency (10%): Variance across multiple runs of same task
- Frontier (20%): Score on Tier 4 tasks only β the intelligence differentiator
Each tier is also scored independently, so you can compare models at each difficulty level.
# Clone
git clone https://github.com/MyClaw-AI/myclaw-bench.git
cd myclaw-bench
# Run all tasks with a model
./scripts/run.sh --model anthropic/claude-opus-4.6
# Run specific tier
./scripts/run.sh --model openai/gpt-4o --tier foundation
# Run with consistency check (5 runs per task)
./scripts/run.sh --model anthropic/claude-sonnet-4 --runs 5
# Run specific tasks
./scripts/run.sh --model anthropic/claude-opus-4.6 --tasks R01,R02,M01| Flag | Description |
|---|---|
--model MODEL |
Model to test (OpenRouter format) |
--tier TIER |
foundation, reasoning, mastery, or all |
--tasks TASKS |
Comma-separated task IDs |
--runs N |
Runs per task for consistency scoring (default: 1) |
--timeout-multiplier N |
Scale timeouts for slower models |
--output-dir DIR |
Results directory (default: results/) |
--no-upload |
Skip leaderboard upload |
--judge-model MODEL |
Override LLM judge model (default: strongest available) |
- Python 3.10+
- uv package manager
- A running OpenClaw instance
- API key for the model being tested
We never check if a file contains a specific string pattern. Instead:
- Parse files with proper libraries (icalendar, json, yaml, ast)
- Check semantic correctness, not format compliance
- Use file access timestamps instead of hardcoded tool names
All time-sensitive tasks inject the current date explicitly in the prompt. No reliance on datetime.now() in grading code.
Foundation tasks establish a baseline. Reasoning tasks separate good from great. Mastery tasks separate great from world-class. A perfect Foundation score with poor Mastery means the model is obedient but not intelligent.
Success rate alone is meaningless. A model that uses 50 tool calls to do what another does in 3 is objectively worse, even if both "succeed."
Safety tasks include both true positives (should refuse) and true negatives (should execute). This prevents models from gaming the score by refusing everything.
LLM judge always uses the strongest available model. Judge reasoning is logged for human audit. 20% of judge scores are cross-validated against human ratings.
We welcome new tasks! See tasks/TASK_TEMPLATE.md for the format.
Good tasks are:
- Real-world β Something an actual user would ask an agent to do
- Discriminating β Separates strong models from weak ones
- Robust β Grading doesn't depend on fragile pattern matching
- Balanced β Has both success criteria and failure modes
- Leaderboard: bench.myclaw.ai
- MyClaw.ai: myclaw.ai
- OpenClaw: github.com/openclaw/openclaw
- Issues: github.com/MyClaw-AI/myclaw-bench/issues
MIT β see LICENSE for details.
Built by MyClaw.ai β from 10,000+ real agent sessions, not synthetic tests.