Skip to content

Latest commit

 

History

History
234 lines (196 loc) · 12.9 KB

File metadata and controls

234 lines (196 loc) · 12.9 KB

AGENTS.md — Term Challenge

Project Purpose

Term Challenge is a WASM evaluation module for AI agents on the Bittensor network via platform-v2. Miners submit Python agent packages (as zip files) that solve SWE-bench tasks. The WASM module runs inside platform-v2 validators to validate submissions, evaluate task results, and compute scores. A companion native CLI (term-cli) provides a TUI for monitoring leaderboards, evaluation progress, and network health. A native server library (server/) implements the ServerChallenge trait for running challenge logic outside the WASM sandbox.

Architecture Overview

term-challenge/
├── Cargo.toml          # workspace with members = [".", "wasm", "cli", "lib", "server", "storage"]
├── src/
│   ├── lib.rs                  # Root library crate entry point
│   └── dataset/
│       ├── mod.rs              # Dataset module re-exports
│       ├── types.rs            # DatasetEntry struct (SWE-forge schema)
│       └── huggingface.rs      # HuggingFaceDataset: download, list, cache
├── wasm/
│   ├── Cargo.toml      # cdylib, depends on platform-challenge-sdk-wasm
│   └── src/
│       ├── lib.rs              # Challenge impl + register_challenge!
│       ├── types.rs            # Submission, TaskDefinition, AgentLogs, etc.
│       ├── scoring.rs          # Aggregate scoring, decay, weight calculation
│       ├── tasks.rs            # Active dataset storage (SWE-bench tasks)
│       ├── dataset.rs          # Dataset selection, consensus, and random index generation
│       ├── routes.rs           # Challenge route definitions and handlers for RPC
│       ├── agent_storage.rs    # Agent code, log, and evaluation status storage
│       ├── ast_validation.rs   # Python AST whitelist validation (imports, builtins, patterns)
│       ├── llm_review.rs       # LLM-based code review, reviewer selection, aggregation
│       ├── submission.rs       # Named submission registry and version tracking
│       └── timeout_handler.rs  # Review assignment timeout tracking and replacement
├── lib/
│   ├── Cargo.toml      # native library, depends on platform-core & platform-challenge-sdk
│   └── src/
│       ├── lib.rs              # Module declarations, re-exports dataset/ChallengeId/Hotkey
│       ├── dataset.rs          # Validator-side dataset management types
│       ├── synthetic/mod.rs    # Synthetic task generation
│       ├── validation/mod.rs   # Validation result types
│       ├── admin/mod.rs        # Admin action types
│       ├── util/mod.rs         # Utility module
│       ├── util/hotkey.rs      # Hotkey parsing helpers (wraps platform_core::Hotkey)
│       ├── cache/mod.rs        # Score caching layer
│       ├── chain/mod.rs        # Chain submission types
│       ├── worker/mod.rs       # Worker job processing types
│       └── bin/term-sudo.rs    # Admin CLI binary
├── server/
│   ├── Cargo.toml      # lib + bin, depends on platform-challenge-sdk (server mode)
│   └── src/
│       ├── lib.rs              # TerminalBenchChallenge implementing ServerChallenge trait
│       ├── main.rs             # Binary entry point: CLI args, ChallengeServer::builder().run()
│       ├── server.rs           # ChallengeServerState axum HTTP wrapper
│       ├── types.rs            # Shared types (std port of wasm/src/types.rs)
│       ├── scoring.rs          # Aggregate scoring, decay (uses ChallengeDatabase)
│       ├── tasks.rs            # Active dataset storage (uses ChallengeDatabase)
│       ├── dataset.rs          # Dataset selection, consensus, random indices
│       ├── routes.rs           # Route definitions + handlers (SDK ChallengeRoute/RouteRequest)
│       ├── agent_storage.rs    # Agent code, log, status storage (uses ChallengeDatabase)
│       ├── ast_validation.rs   # Python AST whitelist validation
│       ├── llm_review.rs       # Async LLM review (reqwest HTTP client)
│       ├── submission.rs       # Named submission registry and version tracking
│       └── timeout_handler.rs  # Review assignment timeout tracking
├── storage/
│   ├── Cargo.toml      # native library, depends on platform-core + platform-challenge-sdk
│   └── src/
│       ├── lib.rs              # Root module, re-exports, From impls for StorageError
│       ├── traits.rs           # ChallengeStorage trait, StorageError, Result alias
│       ├── chain.rs            # Chain storage (sled)
│       └── local.rs            # Local storage (SQLite)
├── cli/
│   ├── Cargo.toml      # native binary, ratatui TUI
│   └── src/
│       ├── main.rs     # Entry point, event loop
│       ├── app.rs      # Application state
│       ├── ui.rs       # Ratatui UI rendering
│       └── rpc.rs      # JSON-RPC 2.0 client
├── docs/
│   ├── architecture.md
│   ├── miner/
│   │   ├── how-to-mine.md
│   │   └── submission.md
│   └── validator/
│       └── setup.md
├── .github/
│   └── workflows/
│       ├── ci.yml          # Build, clippy, test, WASM build, release on tags
│       └── release.yml     # release-please + artifact publishing
├── AGENTS.md
├── README.md
├── LICENSE
├── CHANGELOG.md
└── .githooks/

Data Flow

  1. Miner submits a zip package with agent code and task results
  2. RPC receives submission, verifies signature, relays to validators
  3. Validators run WASM validate() — checks signature, epoch rate limit, Basilica metadata, package size
  4. 50% validator approval → submission stored in blockchain
  5. Validators run WASM evaluate(): a. AST validation — checks Python code against import whitelist, forbidden builtins, and dangerous patterns b. LLM review — optional LLM-based security review via host_http_post() (if enabled) c. Task scoring — scores task results, optionally applies LLM judge per task d. Aggregate & decay — computes pass rate, applies epoch-based decay
  6. Agent code & logs stored on-chain for auditability (code ≤ 1MB, logs ≤ 256KB)
  7. Log consensus — validators propose logs, >50% hash agreement required
  8. Consensus aggregates scores, applies decay, submits weights to Bittensor

Key Concepts

  • Dual mode: Challenge logic is available as both a WASM module (wasm/) and a native server library (server/) implementing ServerChallenge
  • WASM mode: The wasm32-unknown-unknown module is loaded by platform-v2 validators
  • Server mode: The server/ crate implements ServerChallenge using ChallengeDatabase (sled KV store) for storage and reqwest for HTTP
  • Host functions (WASM): WASM interacts with the outside world via host_http_post(), host_storage_get(), host_storage_set(), host_consensus_get_epoch(), host_consensus_get_submission_count(), host_random_seed(), host_get_timestamp()
  • Server storage: The server crate uses ChallengeDatabase (sled-backed KV store) via db.kv_get::<T>(key) / db.kv_set(key, &value) and reqwest::Client for HTTP
  • SWE-bench datasets: Tasks are selected from HuggingFace CortexLM/swe-bench via P2P consensus
  • Epoch rate limiting: 1 submission per 3 epochs per miner
  • Top agent decay: 60-epoch grace period, then exponential decay with 20-epoch half-life

Agent Code Storage

Agent submissions are stored on-chain for auditability and retrieval. The agent_storage module manages three storage categories:

Storage Key Format Content Max Size
agent_code:<hotkey>:<epoch> Raw zip package bytes 1 MB (1,048,576 bytes)
agent_hash:<hotkey>:<epoch> Hash of the agent package
agent_logs:<hotkey>:<epoch> Serialized AgentLogs struct 256 KB (262,144 bytes)
  • Package size limit: Submissions with package_zip exceeding 1 MB are rejected at the storage layer.
  • Log size limit: Serialized logs exceeding 256 KB are rejected. Individual task output previews are truncated to 4 KB (4,096 bytes) before storage.
  • Key format: Keys are constructed as <prefix><hotkey_bytes>:<epoch_le_bytes> using little-endian encoding for the epoch.

CLI

The term-cli crate is a native binary (NOT no_std) that provides a terminal user interface for monitoring the term-challenge network.

Design

  • Framework: Built with ratatui for TUI rendering
  • Transport: Connects to validators via JSON-RPC 2.0 over HTTP
  • Target: Standard x86_64 / aarch64 native targets (not WASM)

Available Tabs

Tab Description
Leaderboard Current scores, ranks, and miner hotkeys
Evaluation Live evaluation progress for pending submissions
Submission Recent submission history and status
Network Validator count, epoch info, system health

Keyboard Shortcuts

Key Action
Tab / Shift+Tab Switch between tabs
/ Navigate rows
r Refresh data
q Quit

RPC Methods Used (platform-v2 protocol)

  • epoch_current — Current epoch info (epochNumber, currentBlock, phase, blocksPerEpoch, blockInEpoch, progress)
  • system_health — Node health status
  • validator_count — Number of active validators
  • challenge_list — Auto-detect challenge ID (UUID) when only one exists; response format: { challenges: [{ id, name, ... }] }
  • challenge_call with challengeId (UUID) param and paths:
    • /leaderboard — Leaderboard data (includes f64 weight per miner)
    • /stats — Total submissions and active miners
    • /decay — Top agent decay status
    • /agent/:hotkey/journey — Evaluation status journey (hotkey is SS58 address)
    • /agent/:hotkey/logs — Evaluation logs for a miner
  • evaluation_getProgress — Evaluation progress for a submission

Build Commands

# Build library (native)
cargo build --release -p term-challenge-lib

# Build CLI (native)
cargo build --release -p term-cli

# Build storage library (native)
cargo build --release -p term-challenge-storage

# Build WASM module
cargo build --release --target wasm32-unknown-unknown -p term-challenge-wasm

# Build server library
cargo build --release -p term-challenge-server

# Check (no target needed for workspace check)
cargo check -p term-challenge-wasm
cargo check -p term-challenge-server

Git Hooks

Git hooks live in .githooks/ and are activated with git config core.hooksPath .githooks.

Hook What it does
pre-commit Runs cargo fmt --all, stages formatted files. Skippable with SKIP_GIT_HOOKS=1.
pre-push Full quality gate: format check → cargo checkcargo clippycargo test. Skippable with SKIP_GIT_HOOKS=1 or git push --no-verify.

CRITICAL RULES

  1. No std in WASM code. The wasm/ module compiles with #![no_std]. Use alloc:: equivalents. The server/ crate uses standard std.
  2. Cryptographic signatures use sr25519. SS58 prefix 42. Do NOT switch schemes.
  3. Conventional commits required. The project uses release-please.
  4. No .unwrap() or .expect() in library paths. Use pattern matching or unwrap_or_default().
  5. Host functions are the ONLY external interface. No direct HTTP, no filesystem, no std::net.
  6. Do NOT add #[allow(dead_code)] broadly. Fix unused code or remove it.

Note: The cli/, core/, lib/, server/, and storage/ crates are exempt from the no_std rule (rule 1) and the host-functions-only rule (rule 5) since they are native code that runs outside the WASM sandbox. Rules 2, 3, 4, and 6 still apply to all crates.

DO / DO NOT

DO

  • Use alloc::string::String, alloc::vec::Vec, alloc::collections::BTreeMap (WASM code)
  • Use serde with default-features = false, features = ["derive", "alloc"] (WASM code)
  • Use bincode with default-features = false for serialization (WASM code)
  • Use host functions for all I/O: host_storage_get/set, host_http_post, host_consensus_get_epoch, host_consensus_get_submission_count, host_random_seed, host_get_timestamp (WASM code)
  • Keep the register_challenge! macro ABI contract intact
  • Use standard std library features in the cli/ and server/ crates
  • Use ChallengeDatabase KV store (kv_get/kv_set) for all storage in the server/ crate

DO NOT

  • Do NOT use std::, println!, std::collections::HashMap in WASM code
  • Do NOT add heavy dependencies — the WASM module must stay minimal
  • Do NOT break the WASM ABI (evaluate, validate, get_name, get_version, get_tasks, configure, alloc, get_routes, handle_route)
  • Do NOT store sensitive data in plain text in blockchain storage