
Commit 8bfa8ee

feat(executor): add SWE-bench batch evaluation with hotkey auth and WebSocket streaming (#2)
* feat(executor): SWE-bench batch evaluation with hotkey auth and WebSocket streaming

  Redesign the term-executor from a single-evaluation model to a batch-oriented SWE-bench evaluation system with real-time streaming and hardcoded hotkey authentication.

  Core architecture changes:
  - Replace bearer token auth with SS58 hotkey validation via X-Hotkey header, restricted to a single authorized hotkey (5GziQCcRpN...Dag2At)
  - Replace single-eval model with batch processing: upload a multipart archive containing tasks/ and agent_code/ directories, execute all tasks with configurable concurrency (--concurrent-tasks, default 8, via semaphore)
  - Binary reward system: 1.0 if all tests pass, 0.0 otherwise; aggregate reward is the mean across all tasks in a batch
  - Agent code is never exposed in any API response

  New modules and API surface:
  - src/ws.rs: WebSocket handler at /ws?batch_id=<id> providing real-time events (snapshot on connect, then task_started, task_complete, batch_complete) via broadcast channels
  - POST /submit: multipart archive upload replacing POST /evaluate
  - GET /batch/{id}, /batch/{id}/tasks, /batch/{id}/task/{task_id}: batch and task status polling endpoints replacing GET /evaluate/{id}
  - GET /batches: list all batches

  Session and executor refactoring:
  - src/session.rs: Replace Session/EvalResult/EvalStatus with Batch/BatchResult/BatchStatus/TaskResult/TaskStatus types; add WsEvent broadcast channel per batch; add has_active_batch() to enforce single-batch-at-a-time constraint
  - src/executor.rs: spawn_batch() runs tasks concurrently via tokio semaphore, each task goes through clone→install→agent→test pipeline independently; emits WsEvent on task start/complete and batch complete
  - src/task.rs: Add extract_uploaded_archive() to parse zip/tar.gz archives with tasks/ and agent_code/ structure; add SweForgeTask and ExtractedArchive types

  Supporting changes:
  - src/auth.rs: Simplified to verify_hotkey() + extract_hotkey() + SS58 validation via bs58 crate
  - src/config.rs: Remove auth_token, add AUTHORIZED_HOTKEY constant, rename max_concurrent_evals→max_concurrent_tasks, increase defaults (TTL 7200s, clone timeout 180s, archive limit 500MB)
  - src/metrics.rs: Rename eval counters to batch/task counters (batches_total, batches_active, tasks_passed, tasks_failed, etc.)
  - src/main.rs: Remove semaphore creation, wire up new AppState
  - Cargo.toml: Bump to v0.2.0, add axum ws/multipart features, add bs58, futures, tokio-stream dependencies
  - Dockerfile: Add python3/pip/venv/build-essential for SWE-bench task execution, change CMD to ENTRYPOINT
  - README.md: Complete rewrite documenting new batch API, archive format, WebSocket protocol, reward model, and configuration

* ci: trigger CI run

* style: fix CI - apply rustfmt formatting

* fix: resolve clippy warnings - identical if/else, collapsible if, useless into()

* style: fix remaining rustfmt formatting issues

* ci: trigger CI after merge with main
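
The reward model described in the commit message (1.0 if all tests pass, 0.0 otherwise, with the batch reward being the mean across tasks) could be sketched roughly as follows. This is an illustrative sketch only, not the code from this commit: the function names `task_reward` and `batch_reward`, their signatures, and the handling of empty inputs are assumptions.

```rust
/// Binary reward for a single task: 1.0 only if every test passed.
/// Hypothetical helper; the real signature in src/executor.rs is not shown here.
fn task_reward(tests_passed: usize, tests_total: usize) -> f64 {
    // Assumption: a task with zero tests counts as failed.
    if tests_total > 0 && tests_passed == tests_total {
        1.0
    } else {
        0.0
    }
}

/// Aggregate reward for a batch: the mean of the per-task rewards.
/// Hypothetical helper, named for illustration.
fn batch_reward(task_rewards: &[f64]) -> f64 {
    if task_rewards.is_empty() {
        // Assumption: an empty batch scores 0.0 rather than NaN.
        return 0.0;
    }
    task_rewards.iter().sum::<f64>() / task_rewards.len() as f64
}
```

Under these assumptions, a batch of four tasks in which three pass all their tests would score 0.75, and a task that passes 4 of 5 tests contributes 0.0 rather than partial credit.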
1 parent 0c403e8 commit 8bfa8ee

13 files changed: +1411 -729 lines changed

Cargo.lock

Lines changed: 207 additions & 4 deletions