Skip to content

Latest commit

Β 

History

History
313 lines (250 loc) Β· 10.8 KB

File metadata and controls

313 lines (250 loc) Β· 10.8 KB

τΡrm chΞ±llΡηgΞ΅

Terminal Benchmark Challenge β€” WASM Evaluation Module for Platform-v2

Coverage License Rust SWE-Forge

Term Challenge Banner

Term Challenge is a WASM evaluation module for AI agents on the Bittensor network. It runs inside platform-v2 validators to evaluate miner submissions against SWE-bench tasks. Miners submit Python agent packages that autonomously solve software engineering issues, and the network scores them through a multi-stage review pipeline including LLM-based code review and AST structural validation.


System Architecture

flowchart LR
    Miner[Miner] -->|Submit Agent ZIP| RPC[Validator RPC]
    RPC --> Validators[Validator Network]
    Validators --> WASM[term-challenge WASM]
    WASM --> Storage[(Blockchain Storage)]
    Validators --> Executor[term-executor]
    Executor -->|Task Results| Validators
    Validators -->|Scores + Weights| BT[Bittensor Chain]
    CLI[term-cli TUI] -->|JSON-RPC| RPC
    CLI -->|Display| Monitor[Leaderboard / Progress / Logs]
Loading

Evaluation Pipeline

sequenceDiagram
    participant M as Miner
    participant V as Validators
    participant LLM as LLM Reviewers (Γ—3)
    participant AST as AST Reviewers (Γ—3)
    participant W as WASM Module
    participant E as term-executor
    participant BT as Bittensor

    M->>V: Submit agent zip + metadata
    V->>W: validate(submission)
    W-->>V: Approved (>50% consensus)
    V->>LLM: Assign LLM code review
    V->>AST: Assign AST structural review
    LLM-->>V: LLM review scores
    AST-->>V: AST review scores
    V->>E: Execute agent on SWE-bench tasks
    E-->>V: Task results + scores
    V->>W: evaluate(results)
    W-->>V: Aggregate score + weight
    V->>V: Store agent code & logs
    V->>V: Log consensus (>50% hash agreement)
    V->>BT: Submit weights at epoch boundary
Loading

Validator Assignment

flowchart TB
    Sub[New Submission] --> Seed[Deterministic Seed from submission_id]
    Seed --> Select[Select 6 Validators]
    Select --> LLM[3 LLM Reviewers]
    Select --> AST[3 AST Reviewers]
    LLM --> LR1[LLM Reviewer 1]
    LLM --> LR2[LLM Reviewer 2]
    LLM --> LR3[LLM Reviewer 3]
    AST --> AR1[AST Reviewer 1]
    AST --> AR2[AST Reviewer 2]
    AST --> AR3[AST Reviewer 3]
    LR1 & LR2 & LR3 -->|Timeout?| TD1{Responded?}
    AR1 & AR2 & AR3 -->|Timeout?| TD2{Responded?}
    TD1 -->|No| Rep1[Replacement Validator]
    TD1 -->|Yes| Agg[Result Aggregation]
    TD2 -->|No| Rep2[Replacement Validator]
    TD2 -->|Yes| Agg
    Rep1 --> Agg
    Rep2 --> Agg
    Agg --> Score[Final Score]
Loading

Submission Flow

flowchart LR
    Register[Register Name] -->|First-register-owns| Name[Submission Name]
    Name --> Version[Auto-increment Version]
    Version --> Pack[Package Agent ZIP ≀ 1MB]
    Pack --> Sign[Sign with sr25519]
    Sign --> Submit[Submit via RPC]
    Submit --> RateCheck{Epoch Rate Limit OK?}
    RateCheck -->|No: < 3 epochs since last| Reject[Rejected]
    RateCheck -->|Yes| Validate[WASM validate]
    Validate --> Consensus{>50% Validator Approval?}
    Consensus -->|No| Reject
    Consensus -->|Yes| Evaluate[Evaluation Pipeline]
    Evaluate --> Store[Store Code + Hash + Logs]
Loading

Decay Mechanism

flowchart LR
    Top[Top Score Achieved] --> Grace["21,600 blocks Grace Period β‰ˆ 72h"]
    Grace -->|Within grace| Full[100% Weight Retained]
    Grace -->|After grace| Decay[Exponential Decay Begins]
    Decay --> Half["50% per 7,200 blocks half-life β‰ˆ 24h"]
    Half --> Min[Decay to 0.0 min multiplier]
    Min --> Burn[Weight Burns to UID 0]
Loading

Block timing: 1 block β‰ˆ 12s, 5 blocks/min, 7,200 blocks/day.


CLI Data Flow

flowchart TB
    CLI[term-cli] -->|epoch_current| RPC[Validator RPC]
    CLI -->|challenge_call /leaderboard| RPC
    CLI -->|evaluation_getProgress| RPC
    CLI -->|agent_getLogs| RPC
    CLI -->|system_health| RPC
    CLI -->|validator_count| RPC
    RPC --> State[Chain State]
    State --> LB[Leaderboard Data]
    State --> Eval[Evaluation Progress]
    State --> Logs[Validated Logs]
Loading

Agent Log Consensus

flowchart LR
    V1[Validator 1] -->|Log Proposal| P2P[(P2P Network)]
    V2[Validator 2] -->|Log Proposal| P2P
    V3[Validator 3] -->|Log Proposal| P2P
    P2P --> Consensus{Hash Match >50%?}
    Consensus -->|Yes| Store[Validated Logs]
    Consensus -->|No| Reject[Rejected]
Loading

Agent Code Storage

flowchart TB
    Submit[Agent Submission] --> Validate{package_zip ≀ 1MB?}
    Validate -->|Yes| Store[Blockchain Storage]
    Validate -->|No| Reject[Rejected]
    Store --> Code[agent_code:hotkey:epoch]
    Store --> Hash[agent_hash:hotkey:epoch]
    Store --> Logs[agent_logs:hotkey:epoch ≀ 256KB]
Loading

Route Architecture

flowchart LR
    Client[Client] -->|JSON-RPC| RPC[RPC Server]
    RPC -->|challenge_call| WE[WASM Executor]
    WE -->|handle_route request| WM[WASM Module]
    WM --> Router{Route Match}
    Router --> LB["/leaderboard"]
    Router --> Subs["/submissions"]
    Router --> DS["/dataset"]
    Router --> Stats["/stats"]
    Router --> Agent["/agent/:hotkey/code"]
    LB & Subs & DS & Stats & Agent --> Storage[(Storage)]
    Storage --> Response[Serialized Response]
    Response --> WE
    WE --> RPC
    RPC --> Client
Loading

Features

  • WASM Module: Compiles to wasm32-unknown-unknown, loaded by platform-v2 validators
  • SWE-bench Evaluation: Tasks selected from SWE-Forge datasets
  • LLM Code Review: 3 validators perform LLM-based code review via host functions
  • AST Structural Validation: 3 validators perform AST-based structural analysis
  • Submission Versioning: Auto-incrementing versions with full history tracking
  • Timeout Handling: Unresponsive reviewers are replaced with alternate validators
  • Route Handlers: WASM-native route handling for leaderboard, submissions, dataset, and agent data
  • Epoch Rate Limiting: 1 submission per 3 epochs per miner
  • Top Agent Decay: 21,600 blocks grace period (~72h), 50% per 7,200 blocks half-life (~24h) decay to 0 weight
  • P2P Dataset Consensus: Validators collectively select 50 evaluation tasks from SWE-Forge
  • Zip Package Submissions: Agents submitted as zip packages (no compilation step)
  • Agent Code Storage: Submitted agent packages (≀ 1MB) stored on-chain with hash verification
  • Log Consensus: Evaluation logs validated across validators with >50% hash agreement
  • CLI (term-cli): Native TUI for monitoring leaderboards, evaluation progress, submissions, and network health

Building

# Build WASM module
cargo build --release --target wasm32-unknown-unknown -p term-challenge-wasm

# The output .wasm file is at:
# target/wasm32-unknown-unknown/release/term_challenge_wasm.wasm

# Build CLI (native)
cargo build --release -p term-cli

Architecture

This repository contains the WASM evaluation module and a native CLI for monitoring. All infrastructure (P2P networking, RPC server, blockchain storage, validator coordination) is provided by platform-v2.

term-challenge/
β”œβ”€β”€ wasm/                   # WASM evaluation module
β”‚   └── src/
β”‚       β”œβ”€β”€ lib.rs               # Challenge trait implementation (validate + evaluate)
β”‚       β”œβ”€β”€ types.rs             # Submission, task, config, route, and log types
β”‚       β”œβ”€β”€ scoring.rs           # Score aggregation, decay, and weight calculation
β”‚       β”œβ”€β”€ tasks.rs             # Active dataset management and history
β”‚       β”œβ”€β”€ dataset.rs           # Dataset selection and P2P consensus logic
β”‚       β”œβ”€β”€ routes.rs            # WASM route definitions for RPC (handle_route)
β”‚       └── agent_storage.rs     # Agent code, hash, and log storage functions
β”œβ”€β”€ cli/                    # Native TUI monitoring tool
β”‚   └── src/
β”‚       β”œβ”€β”€ main.rs         # Entry point, event loop
β”‚       β”œβ”€β”€ app.rs          # Application state
β”‚       β”œβ”€β”€ ui.rs           # Ratatui UI rendering
β”‚       └── rpc.rs          # JSON-RPC 2.0 client
β”œβ”€β”€ docs/
β”‚   β”œβ”€β”€ architecture.md     # System architecture and internals
β”‚   β”œβ”€β”€ miner/
β”‚   β”‚   β”œβ”€β”€ how-to-mine.md  # Complete miner guide
β”‚   β”‚   └── submission.md   # Submission format and review process
β”‚   └── validator/
β”‚       └── setup.md        # Validator setup and operations
β”œβ”€β”€ AGENTS.md               # Development guide
└── README.md

How It Works

  1. Miners submit zip packages with agent code and SWE-bench task results
  2. Platform-v2 validators load this WASM module
  3. validate() checks signatures, epoch rate limits, package size, and Basilica metadata
  4. 6 review validators are deterministically selected (3 LLM + 3 AST) to review the submission
  5. LLM reviewers score code quality; AST reviewers validate structural integrity
  6. Timed-out reviewers are automatically replaced with alternate validators
  7. evaluate() scores task results, applies LLM judge scoring, and computes aggregate weights
  8. Agent code and hash are stored on-chain for auditability (≀ 1MB per package)
  9. Evaluation logs are proposed and validated via P2P consensus (>50% hash agreement)
  10. Scores are aggregated via P2P consensus and submitted to Bittensor at epoch boundaries
  11. Top agents enter a decay cycle: 21,600 blocks grace (~72h) β†’ 50% per 7,200 blocks (~24h) decay β†’ weight burns to UID 0

CLI Usage

# Install via platform CLI
platform download term-challenge

# Or build from source
cargo build --release -p term-cli

# Run the TUI
term-cli --rpc-url http://chain.platform.network

# With miner hotkey filter
term-cli --hotkey 5GrwvaEF... --tab leaderboard

# Available tabs: leaderboard, evaluation, submission, network

Documentation


License

Apache-2.0