Commit 8bfa8ee
feat(executor): add SWE-bench batch evaluation with hotkey auth and WebSocket streaming (#2)
* feat(executor): SWE-bench batch evaluation with hotkey auth and WebSocket streaming
Redesign the term-executor from a single-evaluation model to a batch-oriented
SWE-bench evaluation system with real-time streaming and hardcoded hotkey
authentication.
Core architecture changes:
- Replace bearer token auth with SS58 hotkey validation via X-Hotkey header,
restricted to a single authorized hotkey (5GziQCcRpN...Dag2At)
- Replace single-eval model with batch processing: upload a multipart archive
containing tasks/ and agent_code/ directories, execute all tasks with
configurable concurrency (--concurrent-tasks, default 8, via semaphore)
- Binary reward system: 1.0 if all tests pass, 0.0 otherwise; aggregate
reward is the mean across all tasks in a batch
- Agent code is never exposed in any API response
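
The reward rule above can be sketched in a few lines of std-only Rust. This is an illustration rather than the code in the commit: `TestOutcome`, `task_reward`, and `batch_reward` are hypothetical names, and the handling of a task with zero tests or an empty batch (both scored 0.0 here) is an assumption the commit message does not specify.

```rust
/// Outcome of one task's test run (hypothetical type, not taken from the repository).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct TestOutcome {
    pub passed: usize,
    pub total: usize,
}

/// Binary reward: 1.0 only if every test passed, 0.0 otherwise.
/// Assumption: a task that ran zero tests counts as a failure.
pub fn task_reward(outcome: TestOutcome) -> f64 {
    if outcome.total > 0 && outcome.passed == outcome.total {
        1.0
    } else {
        0.0
    }
}

/// Aggregate reward: the mean of the per-task binary rewards.
/// Assumption: an empty batch scores 0.0 instead of dividing by zero.
pub fn batch_reward(outcomes: &[TestOutcome]) -> f64 {
    if outcomes.is_empty() {
        return 0.0;
    }
    let sum: f64 = outcomes.iter().map(|o| task_reward(*o)).sum();
    sum / outcomes.len() as f64
}
```

With four tasks of which two pass every test, `batch_reward` returns 0.5; a task that passes 2 of 3 tests contributes 0.0, not a partial score.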
New modules and API surface:
- src/ws.rs: WebSocket handler at /ws?batch_id=<id> providing real-time
events (snapshot on connect, then task_started, task_complete,
batch_complete) via broadcast channels
- POST /submit: multipart archive upload replacing POST /evaluate
- GET /batch/{id}, /batch/{id}/tasks, /batch/{id}/task/{task_id}: batch and
task status polling endpoints replacing GET /evaluate/{id}
- GET /batches: list all batches
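
To make the event stream concrete, here is a rough sketch of how a client might fold over the four event kinds named above. The event names come from the commit message; the payload fields (`batch_id`, `completed`, `total`, `task_id`, `reward`, `aggregate_reward`) and the helper `drain_until_complete` are assumptions made for illustration, and the real `WsEvent` in src/session.rs may look different.

```rust
/// Event kinds named in the commit message; all payload fields are assumed.
#[derive(Debug, Clone, PartialEq)]
pub enum WsEvent {
    Snapshot { batch_id: String, completed: usize, total: usize },
    TaskStarted { task_id: String },
    TaskComplete { task_id: String, reward: f64 },
    BatchComplete { aggregate_reward: f64 },
}

impl WsEvent {
    /// Wire name of the event, matching the names listed in the commit message.
    pub fn event_type(&self) -> &'static str {
        match self {
            WsEvent::Snapshot { .. } => "snapshot",
            WsEvent::TaskStarted { .. } => "task_started",
            WsEvent::TaskComplete { .. } => "task_complete",
            WsEvent::BatchComplete { .. } => "batch_complete",
        }
    }
}

/// Consumes events until batch_complete arrives.
/// Returns the number of finished tasks seen and the final reward, if any.
pub fn drain_until_complete<I>(events: I) -> (usize, Option<f64>)
where
    I: IntoIterator<Item = WsEvent>,
{
    let mut finished = 0;
    for event in events {
        match event {
            // The snapshot on connect resets the count to the server's view.
            WsEvent::Snapshot { completed, .. } => finished = completed,
            WsEvent::TaskStarted { .. } => {}
            WsEvent::TaskComplete { .. } => finished += 1,
            WsEvent::BatchComplete { aggregate_reward } => {
                return (finished, Some(aggregate_reward));
            }
        }
    }
    // The stream ended before the batch completed.
    (finished, None)
}
```

Treating the snapshot as authoritative lets a client that connects mid-batch catch up without replaying earlier events.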
Session and executor refactoring:
- src/session.rs: Replace Session/EvalResult/EvalStatus with Batch/BatchResult/
BatchStatus/TaskResult/TaskStatus types; add WsEvent broadcast channel per
batch; add has_active_batch() to enforce single-batch-at-a-time constraint
- src/executor.rs: spawn_batch() runs tasks concurrently via tokio semaphore,
each task goes through clone→install→agent→test pipeline independently;
emits WsEvent on task start/complete and batch complete
- src/task.rs: Add extract_uploaded_archive() to parse zip/tar.gz archives
with tasks/ and agent_code/ structure; add SweForgeTask and ExtractedArchive
types
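
The semaphore-bounded concurrency described for spawn_batch() can be illustrated without any dependencies. The commit uses a tokio semaphore with async tasks; the sketch below swaps in a small Mutex/Condvar counting semaphore and scoped OS threads so that it compiles with std alone. `Semaphore`, `run_batch`, and the `run_task` callback are hypothetical names, and the callback stands in for the whole clone→install→agent→test pipeline.

```rust
use std::sync::{Condvar, Mutex};
use std::thread;

/// Minimal counting semaphore (std stand-in for tokio::sync::Semaphore).
pub struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    pub fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Blocks until a permit is free, then takes it.
    pub fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Returns a permit and wakes one waiter.
    pub fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Runs `run_task` for every task id with at most `limit` running at once.
/// Returns per-task results in input order plus the peak observed concurrency.
/// Note: not panic-safe; a panicking task would never release its permit.
pub fn run_batch<F>(task_ids: &[&str], limit: usize, run_task: F) -> (Vec<bool>, usize)
where
    F: Fn(&str) -> bool + Sync,
{
    let sem = Semaphore::new(limit);
    let running = Mutex::new((0usize, 0usize)); // (current, peak)
    let results = thread::scope(|s| {
        let handles: Vec<_> = task_ids
            .iter()
            .map(|&id| {
                let (sem, running, run_task) = (&sem, &running, &run_task);
                s.spawn(move || {
                    sem.acquire();
                    {
                        let mut r = running.lock().unwrap();
                        r.0 += 1;
                        r.1 = r.1.max(r.0);
                    }
                    let ok = run_task(id);
                    running.lock().unwrap().0 -= 1;
                    sem.release();
                    ok
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect::<Vec<bool>>()
    });
    let peak = running.into_inner().unwrap().1;
    (results, peak)
}
```

Every task gets its own worker up front and only the permit gates execution, which mirrors the usual spawn-then-acquire pattern for a tokio semaphore. The returned peak makes it easy to check that the limit is never exceeded.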
Supporting changes:
- src/auth.rs: Simplified to verify_hotkey() + extract_hotkey() + SS58
validation via bs58 crate
- src/config.rs: Remove auth_token, add AUTHORIZED_HOTKEY constant, rename
max_concurrent_evals→max_concurrent_tasks, increase defaults (TTL 7200s,
clone timeout 180s, archive limit 500MB)
- src/metrics.rs: Rename eval counters to batch/task counters
(batches_total, batches_active, tasks_passed, tasks_failed, etc.)
- src/main.rs: Remove semaphore creation, wire up new AppState
- Cargo.toml: Bump to v0.2.0, add axum ws/multipart features, add bs58,
futures, tokio-stream dependencies
- Dockerfile: Add python3/pip/venv/build-essential for SWE-bench task
execution, change CMD to ENTRYPOINT
- README.md: Complete rewrite documenting new batch API, archive format,
WebSocket protocol, reward model, and configuration
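
For orientation, the new defaults mentioned for src/config.rs could be captured in a struct along these lines. Only `max_concurrent_tasks` and the numeric values (8 tasks, 7200 s TTL, 180 s clone timeout, 500 MB archive limit) come from the commit message; the other field names are hypothetical, and reading "500MB" as 500 × 1024 × 1024 bytes is an assumption.

```rust
use std::time::Duration;

/// Sketch of the executor configuration; field names other than
/// `max_concurrent_tasks` are assumed for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    /// Upper bound on tasks running at once (--concurrent-tasks).
    pub max_concurrent_tasks: usize,
    /// How long results are retained (assumed meaning of "TTL").
    pub session_ttl: Duration,
    /// Maximum time allowed for cloning a task repository.
    pub clone_timeout: Duration,
    /// Maximum accepted size of an uploaded archive, in bytes.
    pub max_archive_bytes: u64,
}

impl Default for Config {
    fn default() -> Self {
        Self {
            max_concurrent_tasks: 8,
            session_ttl: Duration::from_secs(7200),
            clone_timeout: Duration::from_secs(180),
            // Assumption: "500MB" interpreted as 500 MiB.
            max_archive_bytes: 500 * 1024 * 1024,
        }
    }
}
```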
* ci: trigger CI run
* style: fix CI — apply rustfmt formatting
* fix: resolve clippy warnings — identical if/else, collapsible if, useless into()
* style: fix remaining rustfmt formatting issues
* ci: trigger CI after merge with main
1 parent 0c403e8 commit 8bfa8ee
13 files changed (+1411, -729)