22 changes: 22 additions & 0 deletions README.md
@@ -97,12 +97,34 @@ export PINCHBENCH_OFFICIAL_KEY=your_official_key
| `--suite SUITE` | `all`, `automated-only`, or comma-separated task IDs |
| `--runs N` | Number of runs per task for averaging |
| `--timeout-multiplier N` | Scale timeouts for slower models |
| `--thinking LEVELS` | Comma-separated thinking levels (e.g., `low,medium,high`) |
| `--output-dir DIR` | Where to save results (default: `results/`) |
| `--no-upload` | Skip uploading to leaderboard |
| `--register` | Request an API token for submissions |
| `--upload FILE` | Upload a previous results JSON |
| `--official-key KEY` | Mark submission as official (or use `PINCHBENCH_OFFICIAL_KEY` env var) |

## Thinking Levels

Many models support configurable thinking/reasoning levels. Test how different reasoning depths affect task performance:

```bash
# Test multiple thinking levels
./scripts/run.sh --model anthropic/claude-sonnet-4 --thinking low,medium,high

# Compare a single level against default
./scripts/run.sh --model anthropic/claude-sonnet-4 --thinking high
```

Valid levels: `off`, `minimal`, `low`, `medium`, `high`, `xhigh`, `adaptive`

**Model-specific notes:**
- `xhigh` is available only on GPT-5.x models (gpt-5.4, gpt-5.2, and the codex variants)
- `adaptive` uses a provider-managed reasoning budget (Anthropic Claude 4.6 family)
- Levels your model does not support trigger a warning and are skipped
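The warn-and-skip behavior can be sketched client-side. In this illustration the mapping from model IDs to supported levels (the `gpt-5` and `claude-4.6` substring checks) is an assumption based on the notes above, not the benchmark's actual validation logic:

```python
# Sketch of the warn-and-skip level validation described above. The model-ID
# checks below are illustrative assumptions, not the benchmark's real logic.
import warnings

ALL_LEVELS = ["off", "minimal", "low", "medium", "high", "xhigh", "adaptive"]

def supported_levels(model: str) -> set[str]:
    levels = set(ALL_LEVELS) - {"xhigh", "adaptive"}
    if "gpt-5" in model:       # GPT-5.x models also accept xhigh (assumed check)
        levels.add("xhigh")
    if "claude-4.6" in model:  # Claude 4.6 family also accepts adaptive (assumed check)
        levels.add("adaptive")
    return levels

def filter_levels(model: str, requested: str) -> list[str]:
    """Keep supported levels in the order given; warn about and skip the rest."""
    ok = supported_levels(model)
    kept = []
    for level in (s.strip() for s in requested.split(",")):
        if level in ok:
            kept.append(level)
        else:
            warnings.warn(f"{level!r} is not supported by {model}; skipping")
    return kept
```

For example, `filter_levels("anthropic/claude-sonnet-4", "high,xhigh")` keeps `["high"]` and warns about `xhigh`.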

Results include a `thinking_aggregates` section with per-level statistics, and each task result records the `thinking_level` used.

## Contributing Tasks

We welcome new tasks! Check out [`tasks/TASK_TEMPLATE.md`](tasks/TASK_TEMPLATE.md) for the format. Good tasks are:
24 changes: 24 additions & 0 deletions SKILL.md
@@ -73,6 +73,7 @@ uv run benchmark.py --model anthropic/claude-sonnet-4 --no-upload
| `--output-dir` | Results directory (default: `results/`) |
| `--timeout-multiplier` | Scale task timeouts for slower models |
| `--runs` | Number of runs per task for averaging |
| `--thinking` | Comma-separated thinking levels (e.g., `low,medium,high`) |
| `--no-upload` | Skip uploading to leaderboard |
| `--register` | Request new API token for submissions |
| `--upload FILE` | Upload previous results JSON |
@@ -89,6 +90,29 @@ uv run benchmark.py --register
uv run benchmark.py --model anthropic/claude-sonnet-4
```

## Thinking Levels

Many models support different thinking/reasoning levels (e.g., Claude's extended thinking). PinchBench can run tasks across multiple thinking levels to measure how reasoning depth affects performance:

```bash
# Run with multiple thinking levels
uv run benchmark.py --model anthropic/claude-sonnet-4 --thinking low,medium,high

# Run with a single thinking level
uv run benchmark.py --model anthropic/claude-sonnet-4 --thinking high
```

Valid thinking levels: `off`, `minimal`, `low`, `medium`, `high`, `xhigh`, `adaptive`

**Model-specific notes:**
- `xhigh` is supported only by GPT-5.x models (gpt-5.4, gpt-5.2, and the codex variants)
- `adaptive` uses a provider-managed reasoning budget (Anthropic Claude 4.6 family)
- Levels your model does not support trigger a warning and are skipped

Results include per-level aggregates:
- `thinking_aggregates`: Summary statistics for each thinking level
- Per-task results include `thinking_level` field
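Assuming the field names above (`thinking_aggregates`, `thinking_level`), a loaded results file can be summarized like this. The nested `pass_rate` key and the top-level `tasks` list are assumed names for illustration, since the full schema is not shown here:

```python
# Sketch of reading per-level results. `thinking_aggregates` and
# `thinking_level` come from the docs; `pass_rate` and `tasks` are
# assumed key names -- adjust to the actual results schema.

def per_level_pass_rates(results: dict) -> dict[str, float]:
    """Map each thinking level to its aggregate pass rate."""
    return {
        level: agg["pass_rate"]
        for level, agg in results.get("thinking_aggregates", {}).items()
    }

def tasks_at_level(results: dict, level: str) -> list[dict]:
    """Select per-task results that ran at a given thinking level."""
    return [
        task for task in results.get("tasks", [])
        if task.get("thinking_level") == level
    ]
```

Load the results JSON with `json.load()` and pass the dict in; keeping these helpers schema-only makes them easy to adapt once the real field names are known.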

## Results

Results are saved as JSON in the output directory: