v1.1.0 - Reliability & Multi-Session Support

Latest

ScuttleBot released this 19 Mar 14:39

e8e833b

What's New

Major Features

Multi-session support - Benchmark tasks now run across multiple sessions for better isolation and reliability
Fail-fast sanity checks - Invalid OpenRouter model names and config issues are caught immediately instead of failing mid-run
Score summary logging - Final results include submission ID and summary stats for easier tracking
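As an illustration of the fail-fast idea above, here is a minimal sketch of a pre-run check. It is not the project's actual implementation: the `check_config` function, the `SanityCheckError` class, the config keys (`model`, `api_key`), and the `provider/model` name pattern are all assumptions made for this example.

```python
import re

# Hypothetical pattern: "provider/model", optionally followed by ":variant".
# The real validation rules are not described in these notes.
MODEL_NAME_RE = re.compile(r"^[A-Za-z0-9._-]+/[A-Za-z0-9._-]+(:[A-Za-z0-9._-]+)?$")


class SanityCheckError(ValueError):
    """Raised before any task runs when the configuration is unusable."""


def check_config(config: dict) -> None:
    """Collect every configuration problem and raise once, before the run starts."""
    problems = []

    model = config.get("model", "")
    if not MODEL_NAME_RE.match(model):
        problems.append(f"invalid model name: {model!r}")

    if not config.get("api_key"):
        problems.append("missing api_key")

    if problems:
        raise SanityCheckError("; ".join(problems))
```

Calling a check like this once at startup means a mistyped model name stops the run immediately rather than after several tasks have already executed.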

Bug Fixes

Task 10 grader compatibility - Now supports both read and read_file tool names (OpenClaw/Claude Code compatibility)
Agent ID normalization - Fixed issues with special characters in agent IDs causing path problems
Model slug normalization - Model names are now lowercased for consistent agent/session paths
Bootstrap file handling - Removed conflicting workspace files that were causing NO_REPLY issues
Skills copying - Fixed skills not being properly copied to benchmark workspace
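The normalization and tool-name fixes above can be sketched as follows. This is an assumed approach, not the code shipped in this release: the function names, the `READ_TOOL_NAMES` set, and the choice of replacing special characters with hyphens are hypothetical.

```python
import re

# Tool names treated as equivalent by the (hypothetical) grader check.
READ_TOOL_NAMES = frozenset({"read", "read_file"})


def normalize_model_slug(model: str) -> str:
    """Lowercase the model name so agent/session paths are consistent."""
    return model.strip().lower()


def normalize_agent_id(raw: str) -> str:
    """Replace runs of path-unsafe characters with a single hyphen."""
    slug = re.sub(r"[^a-z0-9]+", "-", normalize_model_slug(raw))
    return slug.strip("-")


def is_read_call(tool_name: str) -> bool:
    """Accept either spelling of the file-read tool."""
    return tool_name in READ_TOOL_NAMES
```

With this sketch, `normalize_agent_id("OpenRouter/Anthropic/Claude-3.5:Beta")` yields `openrouter-anthropic-claude-3-5-beta`, which contains no slashes or colons that could break a filesystem path.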

Code Quality

Added Ruff linting and compilation checks to CI
Fixed syntax errors in the sanity check failure handling
Applied optimizations from review feedback

Full Changelog: v1.0.0...v1.1.0
