v1.1.0 - Reliability & Multi-Session Support

ScuttleBot released this on 19 Mar at 14:39 (commit e8e833b)

What's New

Major Features

  • Multi-session support - Benchmark tasks now run across multiple sessions for better isolation and reliability
  • Fail-fast sanity checks - Invalid OpenRouter model names and config issues are caught immediately instead of failing mid-run
  • Score summary logging - Final results include submission ID and summary stats for easier tracking
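The fail-fast idea can be sketched roughly as follows. This is an illustration only, not the project's actual check: the `provider/model` ID pattern, the config keys `model` and `api_key`, and the names `run_sanity_checks` and `SanityCheckError` are all assumptions made for the example.

```python
import re

# Assumed shape of a model ID: "provider/model-name", optionally followed by
# a ":variant" suffix. This pattern is hypothetical, not taken from the project.
MODEL_ID_PATTERN = re.compile(
    r"^[A-Za-z0-9._-]+/[A-Za-z0-9._-]+(:[A-Za-z0-9._-]+)?$"
)


class SanityCheckError(ValueError):
    """Raised before any task runs when the configuration is unusable."""


def run_sanity_checks(config: dict) -> None:
    """Collect every config problem up front and raise once, before the run starts."""
    problems = []

    # Hypothetical config key: "model"
    model = config.get("model", "")
    if not MODEL_ID_PATTERN.match(model):
        problems.append(f"invalid model name: {model!r}")

    # Hypothetical config key: "api_key"
    if not config.get("api_key"):
        problems.append("missing api_key")

    if problems:
        raise SanityCheckError("; ".join(problems))
```

Calling a check like this once at startup means a mistyped model name stops the run immediately rather than surfacing after several tasks have already executed.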

Bug Fixes

  • Task 10 grader compatibility - Now supports both read and read_file tool names (OpenClaw/Claude Code compatibility)
  • Agent ID normalization - Fixed issues with special characters in agent IDs causing path problems
  • Model slug normalization - Model names are now lowercased for consistent agent/session paths
  • Bootstrap file handling - Removed conflicting workspace files that were causing NO_REPLY issues
  • Skills copying - Fixed skills not being properly copied to benchmark workspace
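The normalization and tool-name fixes above might look something like the sketch below. The helper names `normalize_slug` and `is_read_call` are hypothetical, and the exact replacement rules (lowercase, then collapse anything outside `a-z0-9` into a hyphen) are an assumption about what "path-safe" means here, not the project's confirmed behavior.

```python
import re

# Both tool names are accepted, per the Task 10 grader fix.
READ_TOOL_NAMES = frozenset({"read", "read_file"})


def normalize_slug(name: str) -> str:
    """Lowercase a model name or agent ID and make it safe to use in a path."""
    lowered = name.strip().lower()
    # Assumed rule: collapse every run of characters outside a-z0-9 into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", lowered)
    return slug.strip("-")


def is_read_call(tool_name: str) -> bool:
    """Return True if the tool name is one of the accepted file-read tool names."""
    return tool_name in READ_TOOL_NAMES
```

Under these assumptions, `normalize_slug("OpenAI/GPT-4o:free")` yields `openai-gpt-4o-free`, so the same model always maps to the same agent and session directory regardless of how its name was capitalized.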

Code Quality

  • Added Ruff linting and compilation checks to CI
  • Fixed syntax errors in the sanity check failure handling
  • Optimizations based on review feedback

Full Changelog: v1.0.0...v1.1.0