What's New
Major Features
- Multi-session support - Benchmark tasks now run across multiple sessions for better isolation and reliability
- Fail-fast sanity checks - Invalid OpenRouter model names and config issues are caught immediately instead of failing mid-run
- Score summary logging - Final results include submission ID and summary stats for easier tracking
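
For readers curious what a fail-fast check can look like, here is a minimal sketch and not the project's actual code. The config keys (`model`, `api_key`), the function names, and the `provider/model` name pattern are all assumptions made for illustration. The idea is to collect every problem up front and raise before any benchmark task starts.

```python
import re

# Assumption: model names follow a "provider/model" shape. The real check
# may instead query the provider's model list; this pattern is illustrative.
MODEL_NAME_PATTERN = re.compile(r"^[A-Za-z0-9._-]+/[A-Za-z0-9._:-]+$")


def check_config(config: dict) -> list[str]:
    """Return a list of problems found in a (hypothetical) benchmark config."""
    problems = []
    model = config.get("model", "")
    if not MODEL_NAME_PATTERN.match(model):
        problems.append(f"invalid model name: {model!r}")
    if not config.get("api_key"):
        problems.append("missing api_key")
    return problems


def run_sanity_checks(config: dict) -> None:
    """Raise immediately if the config is unusable, before any task runs."""
    problems = check_config(config)
    if problems:
        raise ValueError("sanity check failed: " + "; ".join(problems))
```

Calling `run_sanity_checks(config)` as the first step of a run means a typo in the model name surfaces in seconds rather than partway through the task list.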
Bug Fixes
- Task 10 grader compatibility - Now supports both `read` and `read_file` tool names (OpenClaw/Claude Code compatibility)
- Agent ID normalization - Fixed issues with special characters in agent IDs causing path problems
- Model slug normalization - Model names are now lowercased for consistent agent/session paths
- Bootstrap file handling - Removed conflicting workspace files that caused NO_REPLY issues
- Skills copying - Fixed skills not being properly copied to benchmark workspace
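
The normalization and tool-name fixes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the shipped implementation: the function names are hypothetical, and the exact replacement rule for special characters (collapsing them to a hyphen) is a guess. Only the lowercasing of model names and the two accepted tool names come from these notes.

```python
import re

# Tool names the Task 10 grader accepts, per the notes above.
ACCEPTED_READ_TOOLS = {"read", "read_file"}


def normalize_model_slug(model: str) -> str:
    """Lowercase the model name so agent/session paths are consistent."""
    return model.strip().lower()


def normalize_agent_id(raw: str) -> str:
    """Make an agent ID safe to use as a path component.

    Assumption: any run of characters outside [a-z0-9_-] becomes one hyphen.
    """
    lowered = raw.strip().lower()
    cleaned = re.sub(r"[^a-z0-9_-]+", "-", lowered)
    return cleaned.strip("-")


def is_read_call(tool_name: str) -> bool:
    """True if the tool call counts as a file read for grading purposes."""
    return tool_name in ACCEPTED_READ_TOOLS
```

Under these assumptions, `normalize_agent_id("OpenRouter/Anthropic/Claude 3.5:beta")` yields `openrouter-anthropic-claude-3-5-beta`, which contains no path separators.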
Code Quality
- Added Ruff linting and compilation checks to CI
- Fixed syntax errors in sanity check failure handling
- Review feedback optimizations
Full Changelog: v1.0.0...v1.1.0