
Commit ab0d470

docs: refresh AGENTS.md
1 parent 4cebf27 commit ab0d470

2 files changed: +94 -67 lines changed

AGENTS.md

Lines changed: 49 additions & 39 deletions
@@ -2,7 +2,7 @@
22

33
## Project Purpose
44

5-
**term-executor** is a remote evaluation executor for the [term-challenge](https://github.com/PlatformNetwork/term-challenge) platform. It runs as a containerized Rust service on [Basilica](https://basilica.ai) that receives agent code submissions, executes them against a cloned task repository, runs validation test scripts, and reports pass/fail results. It is the core compute backend that evaluates AI agent coding challenges.
5+
**term-executor** is a remote evaluation executor for the [term-challenge](https://github.com/PlatformNetwork/term-challenge) platform. It runs as a containerized Rust service on [Basilica](https://basilica.ai) that receives batch task archives via multipart upload, executes agent code against cloned task repositories, runs validation test scripts, and reports pass/fail results with aggregate rewards. It is the core compute backend that evaluates AI agent coding challenges.
66

77
## Architecture Overview
88

@@ -11,73 +11,81 @@ This is a **single-crate Rust binary** (`term-executor`) built with Axum. There
1111
### Data Flow
1212

1313
```
14-
Platform Server → POST /evaluate → term-executor
15-
1. Download task archive (.tar.gz / .zip) from task_url
16-
2. Parse workspace.yaml, prompt.md, tests/
17-
3. git clone the target repository at base_commit
18-
4. Run install commands (pip install, etc.)
19-
5. Write & execute agent code in the repo
20-
6. Write test source files into the repo
21-
7. Run test scripts (bash), collect exit codes
22-
8. Return results via GET /evaluate/{id}
14+
Client → POST /submit (multipart archive) → term-executor
15+
1. Authenticate via X-Hotkey header (SS58 hotkey)
16+
2. Extract uploaded archive (zip/tar.gz) containing tasks/ and agent_code/
17+
3. Parse each task: workspace.yaml, prompt.md, tests/
18+
4. For each task (concurrently, up to limit):
19+
a. git clone the target repository at base_commit
20+
b. Run install commands (pip install, etc.)
21+
c. Write & execute agent code in the repo
22+
d. Write test source files into the repo
23+
e. Run test scripts (bash), collect exit codes
24+
5. Aggregate results (reward per task, aggregate reward)
25+
6. Stream progress via WebSocket (GET /ws?batch_id=...)
26+
7. Return results via GET /batch/{id}
2327
```
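The aggregation step in the new data flow (step 5) can be sketched as follows. The diff does not show how rewards are computed, so everything here is an assumption made for illustration: the `TaskOutcome` type, the rule that a task earns 1.0 only when every test script exits with code 0, and the choice of the mean as the aggregate.

```rust
/// Hypothetical per-task outcome: the exit codes collected from the task's test scripts.
struct TaskOutcome {
    task_id: String,
    exit_codes: Vec<i32>,
}

/// Assumed rule: reward is 1.0 when at least one script ran and all exited 0, else 0.0.
fn task_reward(outcome: &TaskOutcome) -> f64 {
    let passed =
        !outcome.exit_codes.is_empty() && outcome.exit_codes.iter().all(|&code| code == 0);
    if passed {
        1.0
    } else {
        0.0
    }
}

/// Pair each task id with its reward, preserving input order.
fn rewards_by_task(outcomes: &[TaskOutcome]) -> Vec<(&str, f64)> {
    outcomes
        .iter()
        .map(|o| (o.task_id.as_str(), task_reward(o)))
        .collect()
}

/// Assumed aggregate: the mean of the per-task rewards, or 0.0 for an empty batch.
fn aggregate_reward(outcomes: &[TaskOutcome]) -> f64 {
    if outcomes.is_empty() {
        return 0.0;
    }
    let total: f64 = outcomes.iter().map(task_reward).sum();
    total / outcomes.len() as f64
}
```

Under these assumptions, a batch in which one of two tasks passes has an aggregate reward of 0.5. The actual `BatchResult` and `TaskResult` types in `src/session.rs` may define rewards differently.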
2428

2529
### Module Map
2630

2731
| File | Responsibility |
2832
|---|---|
2933
| `src/main.rs` | Entry point — bootstraps config, session manager, executor, Axum server, reaper tasks |
30-
| `src/config.rs` | `Config` struct loaded from environment variables with defaults |
31-
| `src/handlers.rs` | Axum route handlers: `/health`, `/status`, `/metrics`, `/evaluate`, `/evaluate/{id}`, `/evaluations` |
32-
| `src/auth.rs` | Bearer token authentication middleware and `check_token()` helper |
33-
| `src/executor.rs` | Core evaluation engine — spawns async tasks that clone repos, run agents, run tests |
34-
| `src/session.rs` | `SessionManager` with `DashMap`, `Session`, `EvalResult`, `EvalStatus`, `EvalStep` types |
35-
| `src/task.rs` | Task archive download/extraction (zip/tar.gz), `workspace.yaml` parsing, test file loading |
36-
| `src/metrics.rs` | Atomic counter-based Prometheus metrics (total, passed, failed, active, duration) |
34+
| `src/config.rs` | `Config` struct loaded from environment variables with defaults; `AUTHORIZED_HOTKEY` constant |
35+
| `src/handlers.rs` | Axum route handlers: `/health`, `/status`, `/metrics`, `/submit`, `/batch/{id}`, `/batch/{id}/tasks`, `/batch/{id}/task/{task_id}`, `/batches` |
36+
| `src/auth.rs` | Hotkey authentication: `extract_hotkey()`, `verify_hotkey()`, `validate_ss58()` |
37+
| `src/executor.rs` | Core evaluation engine — spawns batch tasks that clone repos, run agents, run tests concurrently |
38+
| `src/session.rs` | `SessionManager` with `DashMap`, `Batch`, `BatchResult`, `TaskResult`, `BatchStatus`, `TaskStatus`, `WsEvent` types |
39+
| `src/task.rs` | Archive extraction (zip/tar.gz), task directory parsing, agent code loading, language detection |
40+
| `src/metrics.rs` | Atomic counter-based Prometheus metrics (batches total/active/completed, tasks passed/failed, duration) |
3741
| `src/cleanup.rs` | Work directory removal, stale session reaping, process group killing |
42+
| `src/ws.rs` | WebSocket handler for real-time batch progress streaming |
3843

3944
### Key Shared State (via `Arc`)
4045

41-
- `AppState` (in `handlers.rs`) holds `Config`, `SessionManager`, `Metrics`, `Executor`, `Semaphore`
42-
- `SessionManager` uses `DashMap<String, Arc<Session>>` for lock-free concurrent access
43-
- `Semaphore` controls max concurrent evaluations (default: 4)
46+
- `AppState` (in `handlers.rs`) holds `Config`, `SessionManager`, `Metrics`, `Executor`, `started_at`
47+
- `SessionManager` uses `DashMap<String, Arc<Batch>>` for lock-free concurrent access
48+
- Per-batch `Semaphore` in `executor.rs` controls concurrent tasks within a batch (configurable, default: 8)
49+
- `broadcast::Sender<WsEvent>` per batch for WebSocket event streaming
4450

4551
## Tech Stack
4652

4753
- **Language**: Rust (edition 2021, nightly toolchain for fmt/clippy)
48-
- **Async Runtime**: Tokio (full features + process)
49-
- **Web Framework**: Axum 0.7 with Tower middleware
50-
- **HTTP Client**: reqwest 0.12 (for downloading task archives)
54+
- **Async Runtime**: Tokio (full features + process), `tokio-stream`, `futures`
55+
- **Web Framework**: Axum 0.7 (json, ws, multipart) with Tower middleware, `tower-http` (cors, trace)
56+
- **HTTP Client**: reqwest 0.12 (json, stream) for downloading task archives
5157
- **Serialization**: serde + serde_json + serde_yaml
52-
- **Concurrency**: `DashMap` 6, `parking_lot` 0.12, `tokio::sync::Semaphore`
58+
- **Concurrency**: `DashMap` 6, `parking_lot` 0.12, `tokio::sync::Semaphore`, `tokio::sync::broadcast`
5359
- **Archive Handling**: `flate2` + `tar` (tar.gz), `zip` 2 (zip)
5460
- **Error Handling**: `anyhow` 1 + `thiserror` 2
5561
- **Logging**: `tracing` + `tracing-subscriber` with env-filter
62+
- **Crypto/Identity**: `sha2`, `hex`, `base64`, `bs58` (SS58 address validation), `uuid` v4
63+
- **Time**: `chrono` with serde support
5664
- **Build Tooling**: `mold` linker via `.cargo/config.toml`, `clang` as linker driver
57-
- **Container**: Multi-stage Dockerfile — `rust:1.93-slim-bookworm` builder → `debian:bookworm-slim` runtime
58-
- **CI**: GitHub Actions on Blacksmith runners (4/32 vCPU), nightly Rust
65+
- **Container**: Multi-stage Dockerfile — `rust:1.93-slim-bookworm` builder → `debian:bookworm-slim` runtime (includes python3, pip, venv, build-essential, git, curl)
66+
- **CI**: GitHub Actions on `blacksmith-32vcpu-ubuntu-2404` runners, nightly Rust
5967

6068
## CRITICAL RULES
6169

6270
1. **Always use `cargo +nightly fmt --all` before committing.** The CI enforces `--check` and will reject unformatted code. The project uses the nightly formatter exclusively.
6371

6472
2. **All clippy warnings are errors.** Run `cargo +nightly clippy --all-targets -- -D warnings` locally. CI runs the same command and will fail on any warning.
6573

66-
3. **Never expose secrets in logs or responses.** The `AUTH_TOKEN` environment variable is sensitive. Auth failures log only the `x-forwarded-for` header, never the token value. Follow this pattern for any new secrets.
74+
3. **Never expose secrets in logs or responses.** The `AUTHORIZED_HOTKEY` in `src/config.rs` is the only authorized SS58 hotkey. Auth failures log only the rejection, never the submitted hotkey value. Follow this pattern for any new secrets.
6775

6876
4. **All process execution MUST have timeouts.** Every call to `run_cmd`/`run_shell` in `src/executor.rs` takes a `Duration` timeout. Never spawn a child process without a timeout — agent code is untrusted and may hang forever.
6977

7078
5. **Output MUST be truncated.** The `truncate_output()` function in `src/executor.rs` caps output at `MAX_OUTPUT` (1MB). Any new command output capture must use this function to prevent memory exhaustion from malicious agent output.
7179

7280
6. **Shared state must use `Arc` + lock-free structures.** `SessionManager` uses `DashMap` (not `Mutex<HashMap>`). Metrics use `AtomicU64`. New shared state should follow these patterns — never use `std::sync::Mutex` for hot-path data.
7381

74-
7. **Semaphore must gate evaluation capacity.** The `Semaphore` in `AppState` limits concurrent evaluations to `MAX_CONCURRENT_EVALS`. Any new evaluation path must acquire a permit before spawning work.
82+
7. **Semaphore must gate task concurrency.** The per-batch `Semaphore` in `executor.rs` limits concurrent tasks within a batch. The `SessionManager::has_active_batch()` check prevents multiple batches from running simultaneously.
7583

76-
8. **Session cleanup is mandatory.** Every evaluation must clean up its work directory in `src/executor.rs` (the `Cleanup` step). The stale session reaper in `src/cleanup.rs` is a safety net, not a primary mechanism.
84+
8. **Session cleanup is mandatory.** Every task must clean up its work directory in `src/executor.rs`. The stale session reaper in `src/cleanup.rs` is a safety net, not a primary mechanism.
7785

78-
9. **Error handling: use `anyhow::Result` for internal logic, `(StatusCode, String)` for HTTP responses.** Handler functions in `src/handlers.rs` return `Result<impl IntoResponse, (StatusCode, String)>`. Internal executor/task functions return `anyhow::Result<T>`.
86+
9. **Error handling: use `anyhow::Result` for internal logic, `(StatusCode, Json<Value>)` for HTTP responses.** Handler functions in `src/handlers.rs` return `Result<impl IntoResponse, (StatusCode, Json<Value>)>`. Internal executor/task functions return `anyhow::Result<T>`.
7987

80-
10. **All new fields on serialized structs must use `#[serde(default)]` or `Option<T>`.** The `EvalRequest`, `EvalResult`, and `WorkspaceConfig` structs are deserialized from external input. Missing fields must not break deserialization.
88+
10. **All new fields on serialized structs must use `#[serde(default)]` or `Option<T>`.** The `WorkspaceConfig`, `BatchResult`, and `TaskResult` structs are deserialized from external input or stored results. Missing fields must not break deserialization.
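
Rule 5 above can be illustrated with a minimal, std-only sketch of output truncation. The real `truncate_output()` in `src/executor.rs` is not shown in this diff, so the signature, the `[truncated]` marker, and the `truncate_default` helper below are assumptions. Only the 1MB cap comes from the document.

```rust
/// Cap taken from the document (1MB); the constant's real definition is not shown in the diff.
const MAX_OUTPUT: usize = 1024 * 1024;

/// Truncate `raw` to at most `max` bytes without splitting a UTF-8 character.
/// The signature and the "[truncated]" marker are assumptions, not the project's API.
fn truncate_output(raw: &str, max: usize) -> String {
    if raw.len() <= max {
        return raw.to_string();
    }
    // Walk back to the nearest character boundary at or below `max`.
    let mut end = max;
    while end > 0 && !raw.is_char_boundary(end) {
        end -= 1;
    }
    format!("{}\n[truncated]", &raw[..end])
}

/// Hypothetical convenience wrapper that applies the 1MB cap.
fn truncate_default(raw: &str) -> String {
    truncate_output(raw, MAX_OUTPUT)
}
```

Backing up to a character boundary avoids a panic when the cap falls in the middle of a multi-byte character, which matters for untrusted agent output.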
8189

8290
## DO / DO NOT
8391

@@ -86,12 +94,11 @@ Platform Server → POST /evaluate → term-executor
8694
- Use `tracing::info!`/`warn!`/`error!` for logging (not `println!`)
8795
- Add new routes in `src/handlers.rs` via the `router()` function
8896
- Use `tokio::fs` for async file I/O in the executor pipeline
89-
- Keep the Dockerfile minimal — runtime image has no compilers or language runtimes
9097
- Use conventional commits (`feat:`, `fix:`, `perf:`, `chore:`, etc.)
9198

9299
### DO NOT
93100
- Do NOT add `unsafe` code — there is none in this project and it should stay that way
94-
- Do NOT add synchronous blocking I/O in async functions — use `tokio::task::spawn_blocking` for CPU-heavy work (see `extract_archive` in `src/task.rs`)
101+
- Do NOT add synchronous blocking I/O in async functions — use `tokio::task::spawn_blocking` for CPU-heavy work (see `extract_archive_bytes` in `src/task.rs`)
95102
- Do NOT store large data (agent output, test output) in memory without truncation
96103
- Do NOT add new dependencies without justification — the binary must stay small for container deployment
97104
- Do NOT use `unwrap()` in production code paths — use `?` or `context()` from anyhow. `unwrap()` is only acceptable in tests and infallible cases (like parsing a known-good string)
@@ -122,7 +129,7 @@ cargo +nightly fmt --all -- --check
122129
cargo +nightly clippy --all-targets -- -D warnings
123130

124131
# Run locally
125-
AUTH_TOKEN=test PORT=8080 cargo run
132+
PORT=8080 cargo run
126133

127134
# Docker build
128135
docker build -t term-executor .
@@ -149,12 +156,15 @@ Both hooks are activated via `git config core.hooksPath .githooks`.
149156
| Variable | Default | Description |
150157
|---|---|---|
151158
| `PORT` | `8080` | HTTP listen port |
152-
| `AUTH_TOKEN` | *(none)* | Bearer token for `/evaluate`. If unset, auth is disabled |
153-
| `SESSION_TTL_SECS` | `1800` | Max session lifetime before reaping |
154-
| `MAX_CONCURRENT_EVALS` | `4` | Maximum parallel evaluations |
155-
| `CLONE_TIMEOUT_SECS` | `120` | Git clone timeout |
159+
| `SESSION_TTL_SECS` | `7200` | Max batch lifetime before reaping |
160+
| `MAX_CONCURRENT_TASKS` | `8` | Maximum parallel tasks per batch |
161+
| `CLONE_TIMEOUT_SECS` | `180` | Git clone timeout |
156162
| `AGENT_TIMEOUT_SECS` | `600` | Agent execution timeout |
157163
| `TEST_TIMEOUT_SECS` | `300` | Test suite timeout |
158-
| `MAX_AGENT_CODE_BYTES` | `5242880` | Max agent code payload (5MB) |
164+
| `MAX_ARCHIVE_BYTES` | `524288000` | Max uploaded archive size (500MB) |
159165
| `MAX_OUTPUT_BYTES` | `1048576` | Max captured output per command (1MB) |
160166
| `WORKSPACE_BASE` | `/tmp/sessions` | Base directory for session workspaces |
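
A configuration loader matching the updated table could look roughly like this. The variable names and defaults are taken from the table. The field names, the `env_or` helper, and the fall-back-on-parse-error behavior are assumptions, since `src/config.rs` itself is not part of this diff.

```rust
use std::env;
use std::str::FromStr;

/// Read `key` from the environment, falling back to `default` when it is unset or unparsable.
/// Hypothetical helper; the real loader may handle invalid values differently.
fn env_or<T: FromStr>(key: &str, default: T) -> T {
    env::var(key)
        .ok()
        .and_then(|raw| raw.parse::<T>().ok())
        .unwrap_or(default)
}

/// Assumed shape of the config struct; field names are illustrative.
#[allow(dead_code)]
#[derive(Debug)]
struct Config {
    port: u16,
    session_ttl_secs: u64,
    max_concurrent_tasks: usize,
    clone_timeout_secs: u64,
    agent_timeout_secs: u64,
    test_timeout_secs: u64,
    max_archive_bytes: usize,
    max_output_bytes: usize,
    workspace_base: String,
}

impl Config {
    /// Build the config from environment variables, using the defaults from the table.
    fn from_env() -> Self {
        Config {
            port: env_or("PORT", 8080),
            session_ttl_secs: env_or("SESSION_TTL_SECS", 7200),
            max_concurrent_tasks: env_or("MAX_CONCURRENT_TASKS", 8),
            clone_timeout_secs: env_or("CLONE_TIMEOUT_SECS", 180),
            agent_timeout_secs: env_or("AGENT_TIMEOUT_SECS", 600),
            test_timeout_secs: env_or("TEST_TIMEOUT_SECS", 300),
            max_archive_bytes: env_or("MAX_ARCHIVE_BYTES", 524_288_000),
            max_output_bytes: env_or("MAX_OUTPUT_BYTES", 1_048_576),
            workspace_base: env_or("WORKSPACE_BASE", "/tmp/sessions".to_string()),
        }
    }
}
```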
167+
168+
## Authentication
169+
170+
Authentication uses SS58 hotkey validation via the `X-Hotkey` HTTP header. The authorized hotkey is hardcoded as `AUTHORIZED_HOTKEY` in `src/config.rs`. Only requests with a matching hotkey can submit batches via `POST /submit`. All other endpoints are open.
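
The check described above might be sketched as below, using only the standard library. This is not the project's `extract_hotkey()`, `verify_hotkey()`, or `validate_ss58()`. The `check_hotkey` function, the `AuthError` type, and the alphabet-only plausibility test are assumptions, and the sketch does not decode the address or verify the SS58 checksum (the document lists `bs58` for that).

```rust
/// Hypothetical error type. It deliberately carries no hotkey value,
/// so logging the error cannot leak what the client submitted.
#[derive(Debug, PartialEq)]
enum AuthError {
    Missing,
    Malformed,
    Unauthorized,
}

/// Base58 alphabet (no 0, O, I, or l), which SS58 addresses are encoded in.
const BASE58: &str = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";

/// Rough plausibility check only: non-empty and made of Base58 characters.
/// Real SS58 validation also decodes the address and verifies its checksum.
fn looks_like_base58(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| BASE58.contains(c))
}

/// Compare the `X-Hotkey` header value (if any) with the single authorized hotkey.
fn check_hotkey(header: Option<&str>, authorized: &str) -> Result<(), AuthError> {
    let hotkey = header
        .map(str::trim)
        .filter(|h| !h.is_empty())
        .ok_or(AuthError::Missing)?;
    if !looks_like_base58(hotkey) {
        return Err(AuthError::Malformed);
    }
    if hotkey != authorized {
        return Err(AuthError::Unauthorized);
    }
    Ok(())
}
```

A handler for `POST /submit` would call something like `check_hotkey` before accepting the upload and log only the error variant on failure, in line with critical rule 3.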
