22 changes: 22 additions & 0 deletions README.md
@@ -97,12 +97,34 @@ export PINCHBENCH_OFFICIAL_KEY=your_official_key
| `--suite SUITE` | `all`, `automated-only`, or comma-separated task IDs |
| `--runs N` | Number of runs per task for averaging |
| `--timeout-multiplier N` | Scale timeouts for slower models |
| `--thinking LEVELS` | Comma-separated thinking levels (e.g., `low,medium,high`) |
| `--output-dir DIR` | Where to save results (default: `results/`) |
| `--no-upload` | Skip uploading to leaderboard |
| `--register` | Request an API token for submissions |
| `--upload FILE` | Upload a previous results JSON |
| `--official-key KEY` | Mark submission as official (or use `PINCHBENCH_OFFICIAL_KEY` env var) |

## Thinking Levels

Many models support configurable thinking/reasoning levels. Test how different reasoning depths affect task performance:

```bash
# Test multiple thinking levels
./scripts/run.sh --model anthropic/claude-sonnet-4 --thinking low,medium,high

# Compare a single level against default
./scripts/run.sh --model anthropic/claude-sonnet-4 --thinking high
```

Valid levels: `off`, `minimal`, `low`, `medium`, `high`, `xhigh`, `adaptive`

**Model-specific notes:**
- `xhigh` is available only on GPT-5.x models (gpt-5.4, gpt-5.2, and the codex variants)
- `adaptive` uses a provider-managed reasoning budget (Anthropic Claude 4.6 family)
- Levels your model does not support trigger a warning and are skipped
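The warn-and-skip behavior can be sketched client-side. In this illustration the mapping from model IDs to supported levels (the `gpt-5` and `claude-4.6` substring checks) is an assumption based on the notes above, not the benchmark's actual validation logic:

```python
# Sketch of the warn-and-skip level validation described above. The model-ID
# checks below are illustrative assumptions, not the benchmark's real logic.
import warnings

ALL_LEVELS = ["off", "minimal", "low", "medium", "high", "xhigh", "adaptive"]

def supported_levels(model: str) -> set[str]:
    levels = set(ALL_LEVELS) - {"xhigh", "adaptive"}
    if "gpt-5" in model:       # GPT-5.x models also accept xhigh (assumed check)
        levels.add("xhigh")
    if "claude-4.6" in model:  # Claude 4.6 family also accepts adaptive (assumed check)
        levels.add("adaptive")
    return levels

def filter_levels(model: str, requested: str) -> list[str]:
    """Keep supported levels in the order given; warn about and skip the rest."""
    ok = supported_levels(model)
    kept = []
    for level in (s.strip() for s in requested.split(",")):
        if level in ok:
            kept.append(level)
        else:
            warnings.warn(f"{level!r} is not supported by {model}; skipping")
    return kept
```

For example, `filter_levels("anthropic/claude-sonnet-4", "high,xhigh")` keeps `["high"]` and warns about `xhigh`.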

Results include a `thinking_aggregates` section with per-level statistics, and each task result records the `thinking_level` used.

## Contributing Tasks

We welcome new tasks! Check out [`tasks/TASK_TEMPLATE.md`](tasks/TASK_TEMPLATE.md) for the format. Good tasks are:
24 changes: 24 additions & 0 deletions SKILL.md
@@ -73,6 +73,7 @@ uv run benchmark.py --model anthropic/claude-sonnet-4 --no-upload
| `--output-dir` | Results directory (default: `results/`) |
| `--timeout-multiplier` | Scale task timeouts for slower models |
| `--runs` | Number of runs per task for averaging |
| `--thinking` | Comma-separated thinking levels (e.g., `low,medium,high`) |
| `--no-upload` | Skip uploading to leaderboard |
| `--register` | Request new API token for submissions |
| `--upload FILE` | Upload previous results JSON |
@@ -89,6 +90,29 @@ uv run benchmark.py --register
uv run benchmark.py --model anthropic/claude-sonnet-4
```

## Thinking Levels

Many models support different thinking/reasoning levels (e.g., Claude's extended thinking). PinchBench can run tasks across multiple thinking levels to measure how reasoning depth affects performance:

```bash
# Run with multiple thinking levels
uv run benchmark.py --model anthropic/claude-sonnet-4 --thinking low,medium,high

# Run with a single thinking level
uv run benchmark.py --model anthropic/claude-sonnet-4 --thinking high
```

Valid thinking levels: `off`, `minimal`, `low`, `medium`, `high`, `xhigh`, `adaptive`

**Model-specific notes:**
- `xhigh` is supported only by GPT-5.x models (gpt-5.4, gpt-5.2, and the codex variants)
- `adaptive` uses a provider-managed reasoning budget (Anthropic Claude 4.6 family)
- Levels your model does not support trigger a warning and are skipped

Results include per-level aggregates:
- `thinking_aggregates`: Summary statistics for each thinking level
- Per-task results include `thinking_level` field
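Assuming the field names above (`thinking_aggregates`, `thinking_level`), a loaded results file can be summarized like this. The nested `pass_rate` key and the top-level `tasks` list are assumed names for illustration, since the full schema is not shown here:

```python
# Sketch of reading per-level results. `thinking_aggregates` and
# `thinking_level` come from the docs; `pass_rate` and `tasks` are
# assumed key names -- adjust to the actual results schema.

def per_level_pass_rates(results: dict) -> dict[str, float]:
    """Map each thinking level to its aggregate pass rate."""
    return {
        level: agg["pass_rate"]
        for level, agg in results.get("thinking_aggregates", {}).items()
    }

def tasks_at_level(results: dict, level: str) -> list[dict]:
    """Select per-task results that ran at a given thinking level."""
    return [
        task for task in results.get("tasks", [])
        if task.get("thinking_level") == level
    ]
```

Load the results JSON with `json.load()` and pass the dict in; keeping these helpers schema-only makes them easy to adapt once the real field names are known.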

## Results

Results are saved as JSON in the output directory: