Releases: pinchbench/skill

v1.1.0 - Reliability & Multi-Session Support

19 Mar 14:39
e8e833b

What's New

Major Features

  • Multi-session support - Benchmark tasks now run across multiple sessions for better isolation and reliability
  • Fail-fast sanity checks - Invalid OpenRouter model names and config issues are caught immediately instead of failing mid-run
  • Score summary logging - Final results include submission ID and summary stats for easier tracking

Bug Fixes

  • Task 10 grader compatibility - Now supports both read and read_file tool names (OpenClaw/Claude Code compatibility)
  • Agent ID normalization - Fixed issues with special characters in agent IDs causing path problems
  • Model slug normalization - Model names are now lowercased for consistent agent/session paths
  • Bootstrap file handling - Removed conflicting workspace files that were causing NO_REPLY issues
  • Skills copying - Fixed skills not being properly copied to benchmark workspace
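The normalization fixes above can be sketched as a few small helpers. These are illustrative only — the function names, the exact character rules, and the alias set are assumptions, not the skill's actual code:

```python
import re

# Tool names accepted by the Task 10 grader: OpenClaw uses "read",
# Claude Code uses "read_file" (alias set is an assumption).
READ_TOOL_ALIASES = {"read", "read_file"}

def normalize_model_slug(model: str) -> str:
    """Lowercase the model name so agent/session paths stay consistent."""
    return model.lower()

def normalize_agent_id(agent_id: str) -> str:
    """Replace characters that are unsafe in filesystem paths."""
    return re.sub(r"[^a-z0-9._-]", "-", agent_id.lower())

def used_read_tool(tool_name: str) -> bool:
    """Grader check that accepts either tool name."""
    return tool_name in READ_TOOL_ALIASES
```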

Code Quality

  • Added Ruff linting and compilation checks to CI
  • Fixed syntax errors in sanity-check failure handling
  • Code cleanups addressing review feedback

Full Changelog: v1.0.0...v1.1.0

PinchBench 1.0.0

17 Mar 16:31
262c418

PinchBench 1.0.0 Release Notes

Overview

PinchBench 1.0.0 marks our first stable release—a fully automated, open-source LLM benchmarking platform that measures how well AI coding agents handle real-world development tasks. This release brings together four months of development across our skill framework, API backend, leaderboard frontend, and orchestration infrastructure.


What's New

🏆 Official Benchmark Submissions

Benchmark runs can now be tagged as "official" using an authenticated API key. Official submissions display a verified badge on the leaderboard and are prioritized in rankings. This enables trusted, reproducible benchmark results from verified infrastructure.

👤 GitHub OAuth Integration

Users can now claim their benchmark submissions using GitHub OAuth. The claim flow automatically links submissions to your GitHub profile and redirects to a personalized success page with your public profile link.

🖥️ Hardware Metadata Display

Submission detail pages now show the underlying hardware information for each benchmark run, including CPU, memory, and instance specifications—essential context for interpreting performance differences.

🎲 Randomized Model Assignment

The orchestration layer now randomly distributes models across Vultr instances, preventing bias from instance-specific performance variations and ensuring fairer benchmark comparisons.
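One way to picture this is a seeded shuffle followed by round-robin striping across instances — a minimal sketch, where the function name and exact distribution policy are assumptions:

```python
import random

def assign_models(models, instances, seed=None):
    """Shuffle the model list, then stripe it round-robin across instances
    so no instance consistently benchmarks the same models."""
    rng = random.Random(seed)
    shuffled = list(models)
    rng.shuffle(shuffled)
    return {inst: shuffled[i::len(instances)] for i, inst in enumerate(instances)}
```

With a fixed seed the assignment is reproducible; without one, each run redistributes models.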

🔧 Reaper: Automated Cleanup

New automated cleanup script (reaper.sh) that identifies and terminates stale Vultr instances left behind from interrupted benchmark runs, reducing infrastructure costs and preventing resource leaks.


Improvements

API & Backend

  • Model metadata endpoint (/api/models) now provides richer model information with provider fallbacks
  • Free suffix normalization—model names are normalized to strip :free suffixes for consistent identification
  • Clickable submission IDs in the admin panel for faster navigation
  • Admin users tab for managing user claims and submissions
  • Zero-score cleanup—admin ability to delete failed/zeroed submissions directly from the panel
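The free-suffix normalization amounts to stripping a trailing `:free` so free-tier and paid variants resolve to the same model id — a sketch under that assumption; the API's actual rule may differ:

```python
def strip_free_suffix(model: str) -> str:
    """Map ':free' variants onto the base model id for consistent lookups."""
    return model.removesuffix(":free")
```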

Leaderboard Frontend

  • Default score display changed from "best" to "average" for more representative rankings
  • SEO improvements including sitemap generation, meta tags, robots.txt, and UTM parameter tracking
  • Wider model name column in bar charts for better readability
  • NVIDIA provider color added to the visual theme

Benchmark Skill

  • Category-level score summaries displayed at the end of each benchmark run
  • Verbose logging mode (--verbose) for detailed debugging output
  • Immediate task score logging after each grading for better progress visibility
  • Expanded model list including all leaderboard models plus new additions (Amazon Nova, NVIDIA Nemotron, Z-AI GLM-5, etc.)

Orchestration Scripts

  • Increased default workers from 10 to 25 for faster parallel benchmark execution
  • OpenRouter prefix handling—all default models now include openrouter/ prefix for proper API routing
  • Default models file (default-models.yml) for easier model list management
  • Slack score summaries posted after benchmark runs complete
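The prefix handling above can be sketched as an idempotent helper (name and behavior are assumptions) that adds `openrouter/` only when it is missing, so already-prefixed entries in the models file are left alone:

```python
def with_openrouter_prefix(model: str) -> str:
    """Add the 'openrouter/' routing prefix if absent, without doubling it."""
    return model if model.startswith("openrouter/") else f"openrouter/{model}"
```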

Bug Fixes

  • Qwen model typo—fixed incorrect model name qwen3.5-122b-a19b (now qwen3.5-122b-a10b)
  • Judge response parsing—normalized judge responses to handle varied JSON formats and whitespace
  • Artifact name sanitization—colons in artifact names are now properly sanitized for filesystem compatibility
  • OpenClaw config format—fixed workspace override configuration to prevent NO_REPLY responses
  • Date reference bug—removed "2024" references from tasks that confused time-aware agents
  • Mobile tooltips—disabled hoverable tooltip content on mobile for proper touch support
  • Race condition in delete-zeros—fixed using meta.changes instead of separate COUNT query
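The judge-parsing and artifact-name fixes could look roughly like the following. This is a sketch of the described behavior, assuming the judge's common failure modes are markdown code fences and surrounding prose; names and regexes are illustrative:

```python
import json
import re

def parse_judge_response(raw: str) -> dict:
    """Extract a JSON object from a judge reply that may be wrapped in
    markdown fences or surrounded by extra prose and whitespace."""
    text = raw.strip()
    # Strip opening/closing markdown code fences if present.
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)
    # Fall back to the first {...} span if prose surrounds the JSON.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    return json.loads(match.group(0) if match else text)

def sanitize_artifact_name(name: str) -> str:
    """Replace colons, which are invalid in artifact names on some filesystems."""
    return name.replace(":", "-")
```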

Breaking Changes

None—this is our first stable release. Future versions will follow semantic versioning for any breaking changes.


Contributors

Thank you to everyone who contributed to this release:

  • ScuttleBot — Official key support, GitHub OAuth, UI improvements, model metadata, reaper script
  • pkuYmiracle — Default models updates
  • lilei-xiaomi — Model additions and updates
  • arpitg1991 — Pull request reviews and fixes
  • DJRHails — Date reference fixes
  • justiniggy — Token efficiency metrics
  • iJaack — Judge JSON parsing improvements
  • aeromomo — Value score CPST implementation
  • olearycrew — Agents.md and documentation

Special thanks to the OpenClaw community for testing, feedback, and issue reports that shaped this release.


What's Next

See open issues in the pinchbench/skill, pinchbench/api, and pinchbench/leaderboard repositories for upcoming features including redesigned filtering, new benchmark tasks, and Cloudflare deployment options.


Happy benchmarking! 🦀