---
name: pinchbench
description: Run PinchBench benchmarks to evaluate OpenClaw agent performance across real-world tasks. Use when testing model capabilities, comparing models, submitting benchmark results to the leaderboard, or checking how well your OpenClaw setup handles calendar, email, research, coding, and multi-step workflows.
metadata:
  author: pinchbench
  version: 1.0.0
  homepage:
  repository:
---

# PinchBench Benchmark Skill

PinchBench measures how well LLMs perform as the brain of an OpenClaw agent. Results are collected on a public leaderboard at pinchbench.com.

## Prerequisites

- Python 3.10+
- uv package manager
- OpenClaw instance (this agent)
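
If uv is not already installed, the standalone installer from the uv documentation is one way to get it (check the current uv docs before piping to a shell):

```bash
# Install uv via the official installer script (see docs.astral.sh/uv)
curl -LsSf https://astral.sh/uv/install.sh | sh
```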

## Quick Start

```bash
cd <skill_directory>

# Run benchmark with a specific model
uv run benchmark.py --model anthropic/claude-sonnet-4

# Run only automated tasks (faster)
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite automated-only

# Run specific tasks
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite task_01_calendar,task_02_stock

# Skip uploading results
uv run benchmark.py --model anthropic/claude-sonnet-4 --no-upload
```

## Available Tasks (23)

| Task | Category | Description |
|------|----------|-------------|
| task_00_sanity | Basic | Verify agent works |
| task_01_calendar | Productivity | Calendar event creation |
| task_02_stock | Research | Stock price lookup |
| task_03_blog | Writing | Blog post creation |
| task_04_weather | Coding | Weather script |
| task_05_summary | Analysis | Document summarization |
| task_06_events | Research | Conference research |
| task_07_email | Writing | Email drafting |
| task_08_memory | Memory | Context retrieval |
| task_09_files | Files | File structure creation |
| task_10_workflow | Integration | Multi-step API workflow |
| task_11_clawdhub | Skills | ClawHub interaction |
| task_12_skill_search | Skills | Skill discovery |
| task_13_image_gen | Creative | Image generation |
| task_14_humanizer | Writing | Text humanization |
| task_15_daily_summary | Productivity | Daily digest |
| task_16_email_triage | Email | Inbox triage |
| task_17_email_search | Email | Email search |
| task_18_market_research | Research | Market analysis |
| task_19_spreadsheet_summary | Analysis | Spreadsheet analysis |
| task_20_eli5_pdf_summary | Analysis | PDF simplification |
| task_21_openclaw_comprehension | Knowledge | OpenClaw docs comprehension |
| task_22_second_brain | Memory | Knowledge management |

## Command Line Options

| Option | Description |
|--------|-------------|
| `--model` | Model identifier (e.g., `anthropic/claude-sonnet-4`) |
| `--suite` | `all`, `automated-only`, or comma-separated task IDs |
| `--output-dir` | Results directory (default: `results/`) |
| `--timeout-multiplier` | Scale task timeouts for slower models |
| `--runs` | Number of runs per task for averaging |
| `--no-upload` | Skip uploading to leaderboard |
| `--register` | Request new API token for submissions |
| `--upload FILE` | Upload previous results JSON |
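
These flags compose. For example, a slower model might be benchmarked with averaging and relaxed timeouts, keeping results local (the values below are illustrative):

```bash
# Three runs per task, doubled timeouts, no leaderboard upload
uv run benchmark.py --model anthropic/claude-sonnet-4 \
  --suite automated-only \
  --runs 3 \
  --timeout-multiplier 2 \
  --no-upload
```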

## Token Registration

To submit results to the leaderboard:

```bash
# Register for an API token (one-time)
uv run benchmark.py --register

# Run benchmark (auto-uploads with token)
uv run benchmark.py --model anthropic/claude-sonnet-4
```
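
If a run was completed with `--no-upload`, the saved results file can be submitted afterwards with `--upload` (the filename below is illustrative):

```bash
# Upload a previously saved results JSON
uv run benchmark.py --upload results/0001_anthropic-claude-sonnet-4.json
```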

## Results

Results are saved as JSON in the output directory:

```bash
# View task scores
jq '.tasks[] | {task_id, score: .grading.mean}' results/0001_anthropic-claude-sonnet-4.json

# Show failed tasks
jq '.tasks[] | select(.grading.mean < 0.5)' results/*.json

# Calculate overall score
jq '{average: ([.tasks[].grading.mean] | add / length)}' results/*.json
```
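
The jq queries above rely on a `tasks` array whose entries carry a `task_id` and a `grading.mean` score, so each results file looks roughly like this minimal sketch (other fields omitted; the values shown are illustrative):

```json
{
  "tasks": [
    {
      "task_id": "task_01_calendar",
      "grading": { "mean": 0.85 }
    }
  ]
}
```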

## Adding Custom Tasks

Create a markdown file in `tasks/` following `TASK_TEMPLATE.md` (a sketch follows this list). Each task needs:

- YAML frontmatter (id, name, category, grading_type, timeout)
- Prompt section
- Expected behavior
- Grading criteria
- Automated checks (Python grading function)
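
As an illustrative sketch only: the frontmatter keys and section names come from the list above, but the field values, heading wording, and grading function signature are assumptions, not the actual `TASK_TEMPLATE.md` contract.

````markdown
---
id: task_23_example
name: Example summarization task
category: Analysis
grading_type: automated
timeout: 120
---

## Prompt

Summarize the attached document in exactly three bullet points.

## Expected Behavior

The agent reads the document and replies with three concise bullets.

## Grading Criteria

Full credit for three accurate bullets; zero otherwise.

## Automated Checks

```python
def grade(transcript: str) -> float:
    """Hypothetical grader: full credit if the reply has three bullets."""
    bullets = [line for line in transcript.splitlines()
               if line.strip().startswith("-")]
    return 1.0 if len(bullets) == 3 else 0.0
```
````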

## Leaderboard

View results at pinchbench.com. The leaderboard shows:

- Model rankings by overall score
- Per-task breakdowns
- Historical performance trends