Releases: pinchbench/skill

v1.1.0 - Reliability & Multi-Session Support

19 Mar 14:39
e8e833b

What's New

Major Features

  • Multi-session support - Benchmark tasks now run across multiple sessions for better isolation and reliability
  • Fail-fast sanity checks - Invalid OpenRouter model names and config issues are caught immediately instead of failing mid-run
  • Score summary logging - Final results include submission ID and summary stats for easier tracking

Bug Fixes

  • Task 10 grader compatibility - Now supports both read and read_file tool names (OpenClaw/Claude Code compatibility)
  • Agent ID normalization - Fixed issues with special characters in agent IDs causing path problems
  • Model slug normalization - Model names are now lowercased for consistent agent/session paths
  • Bootstrap file handling - Removed conflicting workspace files that were causing NO_REPLY issues
  • Skills copying - Fixed skills not being properly copied to benchmark workspace
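The normalization fixes above can be sketched as a few small helpers. These are illustrative only — the function names, the exact character rules, and the alias set are assumptions, not the skill's actual code:

```python
import re

# Tool names accepted by the Task 10 grader: OpenClaw uses "read",
# Claude Code uses "read_file" (alias set is an assumption).
READ_TOOL_ALIASES = {"read", "read_file"}

def normalize_model_slug(model: str) -> str:
    """Lowercase the model name so agent/session paths stay consistent."""
    return model.lower()

def normalize_agent_id(agent_id: str) -> str:
    """Replace characters that are unsafe in filesystem paths."""
    return re.sub(r"[^a-z0-9._-]", "-", agent_id.lower())

def used_read_tool(tool_name: str) -> bool:
    """Grader check that accepts either tool name."""
    return tool_name in READ_TOOL_ALIASES
```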

Code Quality

  • Added Ruff linting and compilation checks to CI
  • Fixed syntax errors in sanity-check failure handling
  • Code cleanups addressing review feedback

Full Changelog: v1.0.0...v1.1.0

PinchBench 1.0.0

17 Mar 16:31
262c418

PinchBench 1.0.0 Release Notes

Overview

PinchBench 1.0.0 marks our first stable release—a fully automated, open-source LLM benchmarking platform that measures how well AI coding agents handle real-world development tasks. This release brings together four months of development across our skill framework, API backend, leaderboard frontend, and orchestration infrastructure.


What's New

🏆 Official Benchmark Submissions

Benchmark runs can now be tagged as "official" using an authenticated API key. Official submissions display a verified badge on the leaderboard and are prioritized in rankings. This enables trusted, reproducible benchmark results from verified infrastructure.

👤 GitHub OAuth Integration

Users can now claim their benchmark submissions using GitHub OAuth. The claim flow automatically links submissions to your GitHub profile and redirects to a personalized success page with your public profile link.

🖥️ Hardware Metadata Display

Submission detail pages now show the underlying hardware information for each benchmark run, including CPU, memory, and instance specifications—essential context for interpreting performance differences.

🎲 Randomized Model Assignment

The orchestration layer now randomly distributes models across Vultr instances, preventing bias from instance-specific performance variations and ensuring fairer benchmark comparisons.
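One way to picture this is a seeded shuffle followed by round-robin striping across instances — a minimal sketch, where the function name and exact distribution policy are assumptions:

```python
import random

def assign_models(models, instances, seed=None):
    """Shuffle the model list, then stripe it round-robin across instances
    so no instance consistently benchmarks the same models."""
    rng = random.Random(seed)
    shuffled = list(models)
    rng.shuffle(shuffled)
    return {inst: shuffled[i::len(instances)] for i, inst in enumerate(instances)}
```

With a fixed seed the assignment is reproducible; without one, each run redistributes models.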

🔧 Reaper: Automated Cleanup

New automated cleanup script (reaper.sh) that identifies and terminates stale Vultr instances left behind from interrupted benchmark runs, reducing infrastructure costs and preventing resource leaks.


Improvements

API & Backend

  • Model metadata endpoint (/api/models) now provides richer model information with provider fallbacks
  • Free suffix normalization—model names are normalized to strip :free suffixes for consistent identification
  • Clickable submission IDs in the admin panel for faster navigation
  • Admin users tab for managing user claims and submissions
  • Zero-score cleanup—admin ability to delete failed/zeroed submissions directly from the panel
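The free-suffix normalization amounts to stripping a trailing `:free` so free-tier and paid variants resolve to the same model id — a sketch under that assumption; the API's actual rule may differ:

```python
def strip_free_suffix(model: str) -> str:
    """Map ':free' variants onto the base model id for consistent lookups."""
    return model.removesuffix(":free")
```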

Leaderboard Frontend

  • Default score display changed from "best" to "average" for more representative rankings
  • SEO improvements including sitemap generation, meta tags, robots.txt, and UTM parameter tracking
  • Wider model name column in bar charts for better readability
  • NVIDIA provider color added to the visual theme

Benchmark Skill

  • Category-level score summaries displayed at the end of each benchmark run
  • Verbose logging mode (--verbose) for detailed debugging output
  • Immediate task score logging after each grading for better progress visibility
  • Expanded model list including all leaderboard models plus new additions (Amazon Nova, NVIDIA Nemotron, Z-AI GLM-5, etc.)

Orchestration Scripts

  • Increased default workers from 10 to 25 for faster parallel benchmark execution
  • OpenRouter prefix handling—all default models now include openrouter/ prefix for proper API routing
  • Default models file (default-models.yml) for easier model list management
  • Slack score summaries posted after benchmark runs complete
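The prefix handling above can be sketched as an idempotent helper (name and behavior are assumptions) that adds `openrouter/` only when it is missing, so already-prefixed entries in the models file are left alone:

```python
def with_openrouter_prefix(model: str) -> str:
    """Add the 'openrouter/' routing prefix if absent, without doubling it."""
    return model if model.startswith("openrouter/") else f"openrouter/{model}"
```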

Bug Fixes

  • Qwen model typo—fixed incorrect model name qwen3.5-122b-a19b (now qwen3.5-122b-a10b)
  • Judge response parsing—normalized judge responses to handle varied JSON formats and whitespace
  • Artifact name sanitization—colons in artifact names are now properly sanitized for filesystem compatibility
  • OpenClaw config format—fixed workspace override configuration to prevent NO_REPLY responses
  • Date reference bug—removed "2024" references from tasks that confused time-aware agents
  • Mobile tooltips—disabled hoverable tooltip content on mobile for proper touch support
  • Race condition in delete-zeros—fixed using meta.changes instead of separate COUNT query
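The judge-parsing and artifact-name fixes could look roughly like the following. This is a sketch of the described behavior, assuming the judge's common failure modes are markdown code fences and surrounding prose; names and regexes are illustrative:

```python
import json
import re

def parse_judge_response(raw: str) -> dict:
    """Extract a JSON object from a judge reply that may be wrapped in
    markdown fences or surrounded by extra prose and whitespace."""
    text = raw.strip()
    # Strip opening/closing markdown code fences if present.
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)
    # Fall back to the first {...} span if prose surrounds the JSON.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    return json.loads(match.group(0) if match else text)

def sanitize_artifact_name(name: str) -> str:
    """Replace colons, which are invalid in artifact names on some filesystems."""
    return name.replace(":", "-")
```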

Breaking Changes

None—this is our first stable release. Future versions will follow semantic versioning for any breaking changes.


Contributors

Thank you to everyone who contributed to this release:

  • ScuttleBot — Official key support, GitHub OAuth, UI improvements, model metadata, reaper script
  • pkuYmiracle — Default models updates
  • lilei-xiaomi — Model additions and updates
  • arpitg1991 — Pull request reviews and fixes
  • DJRHails — Date reference fixes
  • justiniggy — Token efficiency metrics
  • iJaack — Judge JSON parsing improvements
  • aeromomo — Value score CPST implementation
  • olearycrew — Agents.md and documentation

Special thanks to the OpenClaw community for testing, feedback, and issue reports that shaped this release.


What's Next

See open issues in the pinchbench/skill, pinchbench/api, and pinchbench/leaderboard repositories for upcoming features including redesigned filtering, new benchmark tasks, and Cloudflare deployment options.


Happy benchmarking! 🦀