AGENTS.md: 49 additions & 39 deletions
@@ -2,7 +2,7 @@

 ## Project Purpose

-**term-executor** is a remote evaluation executor for the [term-challenge](https://github.com/PlatformNetwork/term-challenge) platform. It runs as a containerized Rust service on [Basilica](https://basilica.ai) that receives agent code submissions, executes them against a cloned task repository, runs validation test scripts, and reports pass/fail results. It is the core compute backend that evaluates AI agent coding challenges.
+**term-executor** is a remote evaluation executor for the [term-challenge](https://github.com/PlatformNetwork/term-challenge) platform. It runs as a containerized Rust service on [Basilica](https://basilica.ai) that receives batch task archives via multipart upload, executes agent code against cloned task repositories, runs validation test scripts, and reports pass/fail results with aggregate rewards. It is the core compute backend that evaluates AI agent coding challenges.

 ## Architecture Overview

@@ -11,73 +11,81 @@ This is a **single-crate Rust binary** (`term-executor`) built with Axum. There
 ### Data Flow

 ```
-Platform Server → POST /evaluate → term-executor
-    1. Download task archive (.tar.gz / .zip) from task_url
-    2. Parse workspace.yaml, prompt.md, tests/
-    3. git clone the target repository at base_commit
-    4. Run install commands (pip install, etc.)
-    5. Write & execute agent code in the repo
-    6. Write test source files into the repo
-    7. Run test scripts (bash), collect exit codes
-    8. Return results via GET /evaluate/{id}
+Client → POST /submit (multipart archive) → term-executor
+    1. Authenticate via X-Hotkey header (SS58 hotkey)
+    2. Extract uploaded archive (zip/tar.gz) containing tasks/ and agent_code/
+    3. Parse each task: workspace.yaml, prompt.md, tests/
+    4. For each task (concurrently, up to limit):
+       a. git clone the target repository at base_commit
+       b. Run install commands (pip install, etc.)
+       c. Write & execute agent code in the repo
+       d. Write test source files into the repo
+       e. Run test scripts (bash), collect exit codes
+    5. Aggregate results (reward per task, aggregate reward)
+    6. Stream progress via WebSocket (GET /ws?batch_id=...)
 - **CI**: GitHub Actions on `blacksmith-32vcpu-ubuntu-2404` runners, nightly Rust

 ## CRITICAL RULES

 1. **Always use `cargo +nightly fmt --all` before committing.** The CI enforces `--check` and will reject unformatted code. The project uses the nightly formatter exclusively.

 2. **All clippy warnings are errors.** Run `cargo +nightly clippy --all-targets -- -D warnings` locally. CI runs the same command and will fail on any warning.

-3. **Never expose secrets in logs or responses.** The `AUTH_TOKEN` environment variable is sensitive. Auth failures log only the `x-forwarded-for` header, never the token value. Follow this pattern for any new secrets.
+3. **Never expose secrets in logs or responses.** The `AUTHORIZED_HOTKEY` in `src/config.rs` is the only authorized SS58 hotkey. Auth failures log only the rejection, never the submitted hotkey value. Follow this pattern for any new secrets.

 4. **All process execution MUST have timeouts.** Every call to `run_cmd`/`run_shell` in `src/executor.rs` takes a `Duration` timeout. Never spawn a child process without a timeout — agent code is untrusted and may hang forever.

 5. **Output MUST be truncated.** The `truncate_output()` function in `src/executor.rs` caps output at `MAX_OUTPUT` (1MB). Any new command output capture must use this function to prevent memory exhaustion from malicious agent output.

 6. **Shared state must use `Arc` + lock-free structures.** `SessionManager` uses `DashMap` (not `Mutex<HashMap>`). Metrics use `AtomicU64`. New shared state should follow these patterns — never use `std::sync::Mutex` for hot-path data.

-7. **Semaphore must gate evaluation capacity.** The `Semaphore` in `AppState` limits concurrent evaluations to `MAX_CONCURRENT_EVALS`. Any new evaluation path must acquire a permit before spawning work.
+7. **Semaphore must gate task concurrency.** The per-batch `Semaphore` in `executor.rs` limits concurrent tasks within a batch. The `SessionManager::has_active_batch()` check prevents multiple batches from running simultaneously.

-8. **Session cleanup is mandatory.** Every evaluation must clean up its work directory in `src/executor.rs` (the `Cleanup` step). The stale session reaper in `src/cleanup.rs` is a safety net, not a primary mechanism.
+8. **Session cleanup is mandatory.** Every task must clean up its work directory in `src/executor.rs`. The stale session reaper in `src/cleanup.rs` is a safety net, not a primary mechanism.

-9. **Error handling: use `anyhow::Result` for internal logic, `(StatusCode, String)` for HTTP responses.** Handler functions in `src/handlers.rs` return `Result<impl IntoResponse, (StatusCode, String)>`. Internal executor/task functions return `anyhow::Result<T>`.
+9. **Error handling: use `anyhow::Result` for internal logic, `(StatusCode, Json<Value>)` for HTTP responses.** Handler functions in `src/handlers.rs` return `Result<impl IntoResponse, (StatusCode, Json<Value>)>`. Internal executor/task functions return `anyhow::Result<T>`.

-10. **All new fields on serialized structs must use `#[serde(default)]` or `Option<T>`.** The `EvalRequest`, `EvalResult`, and `WorkspaceConfig` structs are deserialized from external input. Missing fields must not break deserialization.
+10. **All new fields on serialized structs must use `#[serde(default)]` or `Option<T>`.** The `WorkspaceConfig`, `BatchResult`, and `TaskResult` structs are deserialized from external input or stored results. Missing fields must not break deserialization.
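
Rules 4 and 5 (timeouts and output truncation) can be illustrated with a small sketch. It is not the code in `src/executor.rs`, which this diff does not show. It uses only the standard library so that it runs standalone, whereas the real executor is async and would more likely wrap the wait in a Tokio timeout. The helper `wait_with_timeout`, the truncation marker text, and the exact signature of `truncate_output` are assumptions made for illustration.

```rust
use std::io;
use std::process::{Child, ExitStatus};
use std::thread;
use std::time::{Duration, Instant};

/// Output cap in bytes. Rule 5 states 1MB; the project's real constant is not shown here.
const MAX_OUTPUT: usize = 1024 * 1024;

/// Cap `output` at `max` bytes without splitting a UTF-8 character.
/// Hypothetical signature: the project's `truncate_output()` may differ.
fn truncate_output(output: &str, max: usize) -> String {
    if output.len() <= max {
        return output.to_string();
    }
    // Walk back to the nearest character boundary so the slice stays valid UTF-8.
    let mut end = max;
    while !output.is_char_boundary(end) {
        end -= 1;
    }
    format!("{}\n[output truncated]", &output[..end])
}

/// Wait for `child` up to `timeout`; kill and reap it if the deadline passes.
/// Returns `Ok(None)` on timeout. Hypothetical helper, not the project's `run_cmd`.
fn wait_with_timeout(child: &mut Child, timeout: Duration) -> io::Result<Option<ExitStatus>> {
    let deadline = Instant::now() + timeout;
    loop {
        // Non-blocking poll: `Some(status)` means the process has exited.
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status));
        }
        if Instant::now() >= deadline {
            // The process may exit between the poll and the kill; ignore that race.
            let _ = child.kill();
            child.wait()?; // reap the process so no zombie is left behind
            return Ok(None);
        }
        thread::sleep(Duration::from_millis(10));
    }
}
```

A caller would spawn the command with `std::process::Command`, pass the resulting `Child` to `wait_with_timeout`, and run any captured stdout or stderr through `truncate_output(&text, MAX_OUTPUT)` before storing it.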

 ## DO / DO NOT

@@ -86,12 +94,11 @@ Platform Server → POST /evaluate → term-executor
 - Use `tracing::info!`/`warn!`/`error!` for logging (not `println!`)
 - Add new routes in `src/handlers.rs` via the `router()` function
 - Use `tokio::fs` for async file I/O in the executor pipeline
-- Keep the Dockerfile minimal — runtime image has no compilers or language runtimes
 - Use conventional commits (`feat:`, `fix:`, `perf:`, `chore:`, etc.)

 ### DO NOT
 - Do NOT add `unsafe` code — there is none in this project and it should stay that way
-- Do NOT add synchronous blocking I/O in async functions — use `tokio::task::spawn_blocking` for CPU-heavy work (see `extract_archive` in `src/task.rs`)
+- Do NOT add synchronous blocking I/O in async functions — use `tokio::task::spawn_blocking` for CPU-heavy work (see `extract_archive_bytes` in `src/task.rs`)
 - Do NOT store large data (agent output, test output) in memory without truncation
 - Do NOT add new dependencies without justification — the binary must stay small for container deployment
 - Do NOT use `unwrap()` in production code paths — use `?` or `context()` from anyhow. `unwrap()` is only acceptable in tests and infallible cases (like parsing a known-good string)
 | `MAX_AGENT_CODE_BYTES` | `5242880` | Max agent code payload (5MB) |
+| `MAX_ARCHIVE_BYTES` | `524288000` | Max uploaded archive size (500MB) |
 | `MAX_OUTPUT_BYTES` | `1048576` | Max captured output per command (1MB) |
 | `WORKSPACE_BASE` | `/tmp/sessions` | Base directory for session workspaces |
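
One way to read the settings in these rows is sketched below, assuming they are supplied as environment variables and fall back to the listed defaults when unset or malformed. The `Limits` struct and the helpers `parse_or`, `env_or`, and `load_limits` are hypothetical names for illustration; the actual loading code in `src/config.rs` is not part of this diff.

```rust
use std::env;
use std::str::FromStr;

/// Parse `raw` into `T`, falling back to `default` when the value is absent or malformed.
fn parse_or<T: FromStr>(raw: Option<String>, default: T) -> T {
    raw.and_then(|v| v.trim().parse::<T>().ok()).unwrap_or(default)
}

/// Read an environment variable with a typed default.
fn env_or<T: FromStr>(key: &str, default: T) -> T {
    parse_or(env::var(key).ok(), default)
}

/// Hypothetical grouping of the limits listed in the table.
#[derive(Debug)]
struct Limits {
    max_agent_code_bytes: u64,
    max_archive_bytes: u64,
    max_output_bytes: u64,
    workspace_base: String,
}

/// Build `Limits` from the environment, using the table's defaults as fallbacks.
fn load_limits() -> Limits {
    Limits {
        max_agent_code_bytes: env_or("MAX_AGENT_CODE_BYTES", 5 * 1024 * 1024),
        max_archive_bytes: env_or("MAX_ARCHIVE_BYTES", 500 * 1024 * 1024),
        max_output_bytes: env_or("MAX_OUTPUT_BYTES", 1024 * 1024),
        workspace_base: env_or("WORKSPACE_BASE", "/tmp/sessions".to_string()),
    }
}
```

Keeping the parsing in `parse_or` separates it from the process environment, so the fallback behavior can be tested without setting or clearing real variables.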
+
+## Authentication
+
+Authentication uses SS58 hotkey validation via the `X-Hotkey` HTTP header. The authorized hotkey is hardcoded as `AUTHORIZED_HOTKEY` in `src/config.rs`. Only requests with a matching hotkey can submit batches via `POST /submit`. All other endpoints are open.
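
The check described here can be sketched as a plain comparison against the configured constant. This is an illustration under stated assumptions, not the handler code: the placeholder value of `AUTHORIZED_HOTKEY`, the `AuthError` enum, and the functions `check_hotkey` and `rejection_message` are hypothetical, and the real extraction of the `X-Hotkey` header from an Axum request is omitted so the example needs only the standard library.

```rust
/// Placeholder value. The real constant lives in `src/config.rs` and is not shown in this diff.
const AUTHORIZED_HOTKEY: &str = "5ExampleHotkeyPlaceholder";

/// Reasons a submission can be rejected. Hypothetical type for this sketch.
#[derive(Debug, PartialEq, Eq)]
enum AuthError {
    MissingHeader,
    Unauthorized,
}

/// Validate the value of the `X-Hotkey` header, if one was sent.
fn check_hotkey(header: Option<&str>) -> Result<(), AuthError> {
    match header.map(str::trim) {
        None | Some("") => Err(AuthError::MissingHeader),
        Some(h) if h == AUTHORIZED_HOTKEY => Ok(()),
        Some(_) => Err(AuthError::Unauthorized),
    }
}

/// Message suitable for logging: it records the rejection only,
/// never the submitted hotkey value.
fn rejection_message(err: &AuthError) -> &'static str {
    match err {
        AuthError::MissingHeader => "submit rejected: X-Hotkey header missing",
        AuthError::Unauthorized => "submit rejected: hotkey not authorized",
    }
}
```

Because `rejection_message` returns a fixed string per error variant, the submitted value cannot leak into logs through this path, which matches the logging rule for auth failures.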