Core Paper: ReCode: Unify Plan and Action for Universal Granularity Control (arXiv:2510.23564v2)
- Core Paper: 2510.23564v2.md - MUST READ!!!
- Specification: dev-spec/
- Dev Worklogs: .worklogs/
- Paper Demo Code: .dev-docs/ - Python academic prototype
- Codex CLI Docs: .knowledge/codex-cli/docs/
- Codex TypeScript SDK: .knowledge/codex-cli/sdk/typescript/
- Harbor Environment Guide: .claude/skills/ReCodeAgent-TB2-Evaluate/
Project Status: Production-ready Rust implementation with Harbor integration for Terminal-Bench 2.0 evaluation
Updated Date: 2025-11-22
Project Objective: Productionize the ReCode research paradigm as a high-performance recursive code generation system that combines a Rust core with the Codex CLI. The project now includes the Harbor Container-Unified Architecture for Terminal-Bench 2.0 benchmark evaluation.
ReCodeAgent is a production implementation of the academic paper ReCode: Unify Plan and Action for Universal Granularity Control. It uses a hybrid architecture (Rust orchestrator + Codex CLI executor) to provide dynamic granularity control, moving from fixed-granularity decision-making toward universal programming agents.
- Recursive Code Generation: Placeholder functions auto-expand into executable code
- High-Performance Rust Core: DFS tree traversal, AST parsing, checkpoint mechanism
- Codex CLI Integration: LLM calls via `codex exec --json`, authenticated with `~/.codex/auth.json`
- Harbor Container-Unified Architecture: Seamless Terminal-Bench 2.0 evaluation
- Tool Ecosystem Integration: File editing, command execution, environment interaction
- Production-Grade Reliability: Type safety, memory efficiency, JSONL event streaming
# Navigate to Harbor workspace
cd ~/harbor-workspace
# Run a single task
harbor run -d terminal-bench@2.0 -t regex-log -a recode-agent
# Specify prompt template
harbor run -d terminal-bench@2.0 -t regex-log -a recode-agent \
--agent-kwarg template=recode_tb2_prompt.jinja2
# Limit max steps + debug mode
harbor run -d terminal-bench@2.0 -t password-recovery -a recode-agent \
--agent-kwarg max_steps=50 --debug
# Batch run all tasks (4 concurrent)
harbor run -d terminal-bench@2.0 -a recode-agent -n 4

# Quick smoke test (10 steps)
cargo run --example terminal_bench_smoke --release
# CLI subcommand test
cargo run --release --manifest-path recode-core/Cargo.toml -- \
execute --task-name test --instruction "test task" --working-dir /tmp --max-steps 5

# Latest task results
ls -lt ~/harbor-workspace/jobs/ | head -3
# Execution logs
cat ~/harbor-workspace/jobs/<job-id>/<task-id>/agent/command-2/stdout.txt | tail -50
# Verification result (0.0 = fail, 1.0 = success)
cat ~/harbor-workspace/jobs/<job-id>/<task-id>/verifier/reward.txt

┌─────────────────────────────────────────────────────────────────┐
│  macOS Development Environment (Host)                           │
├─────────────────────────────────────────────────────────────────┤
│  ReCodeAgent Repository                                         │
│  ~/dev-space/ReCodeAgent/                                       │
│  ├── recode-core/src/          # Rust source code               │
│  └── recode-core/templates/    # Jinja2 Prompt templates        │
├─────────────────────────────────────────────────────────────────┤
│  Harbor Installation                                            │
│  ~/.local/share/uv/tools/harbor/.../agents/installed/           │
│  ├── recode_agent.py           # Harbor Agent definition        │
│  └── recode-assets/            # Deployment assets              │
│      ├── recode-agent          # Linux x86_64 binary            │
│      ├── templates/            # Jinja2 templates               │
│      └── scripts/              # Python bridge scripts          │
└─────────────────────────────────────────────────────────────────┘
                                │ Docker Volume
                                │ ${HOME}/.codex:/tmp/host-codex:ro
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│  Docker Container (Terminal-Bench 2.0)                          │
├─────────────────────────────────────────────────────────────────┤
│  /app/                         # Working directory              │
│  ├── recode-agent              # ReCodeAgent binary             │
│  ├── AGENTS.md                 # Rendered system prompt         │
│  │                             # (auto-loaded by Codex)         │
│  ├── instruction.md            # Task instruction               │
│  └── .codex/                   # Codex CLI config               │
│      ├── auth.json             # Auth (from host)               │
│      └── config.toml           # Model configuration            │
└─────────────────────────────────────────────────────────────────┘

harbor run -d terminal-bench@2.0 -t <task> -a recode-agent
│
├──▶ Step 1: Setup Codex auth (copy auth.json, config.toml)
├──▶ Step 2: Render AGENTS.md (recode-agent render-template)
├──▶ Step 3: Execute task (recode-agent execute + codex exec)
│    └── DFS tree traversal + checkpoint self-verification
└──▶ Step 4: Cleanup

// CLI subcommands (field types omitted for brevity)
enum Command {
    /// Legacy: Run with explicit bridge configuration
    Run { env_kind, python, bridge, bridge_args, instruction },
    /// Render AGENTS.md template (Harbor Step 2)
    RenderTemplate { template, output, task_name, instruction_path },
    /// Execute task (Harbor Step 3) - Core execution engine
    Execute { task_name, instruction, working_dir, max_steps, codex_home },
}

recode-core/
├── Cargo.toml
├── Dockerfile                      # Container image config
│
├── examples/
│   └── terminal_bench_smoke.rs     # Quick local test (10 steps)
│
├── harbor-assets/                  # Harbor deployment assets
│   ├── recode-agent                # Linux x86_64 binary
│   └── templates/                  # Deployment templates
│
├── templates/                      # Jinja2 Prompt templates (source)
│   ├── recode_tb2_agents_md.jinja2             # Default AGENTS.md template
│   ├── recode_tb2_prompt.jinja2                # TB2 task prompt
│   ├── recode_microtexecute_tb2_prompt.jinja2  # Codex expansion
│   └── recode_tb2_checkpoint_minimal.jinja2    # Checkpoint
│
├── src/
│   ├── main.rs                     # CLI entry (run, render-template, execute)
│   ├── codex/
│   │   └── thread_manager.rs       # Codex CLI integration
│   ├── execution/
│   │   └── python_executor.rs      # Python code execution
│   ├── orchestrator/
│   │   ├── engine.rs               # Codex prompt assembly
│   │   └── runtime.rs              # DFS tree + checkpoint mechanism
│   └── tree/
│       ├── context.rs
│       └── node.rs
│
└── tests/
    └── fixtures/

We model LLM-based agent interaction with the environment as a simplified decision process:

$$\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{O}, T, R \rangle$$

Where:
- $\mathcal{S}$: State space
- $\mathcal{A}$: Primitive action space (executable operations like `run('crack egg')`)
- $\mathcal{O}$: Observation space
- $T: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$: Transition function
- $R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$: Reward function
Beyond primitive actions, we introduce a plan space $\mathcal{P}$ of higher-level intentions (e.g., `prepare_breakfast()`).
Decision space:

$$\mathcal{D} = \mathcal{A} \cup \mathcal{P}$$

Key Insight: Plans and actions, though seemingly different, can be unified into a single executable code representation.
- Actions (primitive): Executable operations like `run('click the submit button')`
- Plans (abstract): Unimplemented placeholder functions like `prepare_breakfast()`, `get_ingredients()`

This unified representation enables seamless transitions between planning and execution.
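
To make the action/plan distinction concrete, here is a minimal sketch of how a line of generated code could be classified. It assumes that primitive actions are exactly the calls to `run(...)` and that any other call is a placeholder; the names `Step`, `callee`, and `classify` are hypothetical, and the string matching is only a stand-in for the AST-based parsing the Rust core performs.

```rust
/// One line of generated code, classified for the orchestrator.
/// (Hypothetical type, not taken from the ReCodeAgent source.)
#[derive(Debug, PartialEq)]
pub enum Step {
    /// Directly executable operation, e.g. `run('crack egg')`.
    Primitive(String),
    /// Unimplemented placeholder function that still needs expansion.
    Placeholder(String),
}

/// Returns the name of the function being called on this line, if any.
/// Handles plain calls (`foo()`) and assignments (`x = foo()`).
fn callee(line: &str) -> Option<&str> {
    let open = line.find('(')?;
    let head = line[..open].trim_end();
    let name = head
        .rsplit(|c: char| c == '=' || c.is_whitespace())
        .next()?;
    if name.is_empty() {
        None
    } else {
        Some(name)
    }
}

/// Classifies a line as a primitive action or a placeholder.
/// Assumption: only calls to `run` count as primitive actions.
pub fn classify(line: &str) -> Option<Step> {
    let line = line.trim();
    // Blank lines and comments carry no decision.
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    let name = callee(line)?;
    Some(if name == "run" {
        Step::Primitive(line.to_string())
    } else {
        Step::Placeholder(line.to_string())
    })
}
```

Under these assumptions, `classify("run('crack egg')")` yields a `Primitive` step, while `classify("eggs = get_ingredients()")` yields a `Placeholder` that the orchestrator would hand back to the LLM for expansion.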
Algorithm 1: The ReCode Algorithm
Procedure ReCode(T, π, E, c):
    if c is None:                    // Initialize
        o_0 ← Reset(E)               // Reset environment
        c ← Text2Code(T, o_0)        // Convert task to root placeholder
    end if
    code_block ← π(c)                // LLM generates child code
    for each child u in code_block:
        if IsPrimitive(u):           // Primitive action
            Execute(u, E)
        else:                        // Placeholder function
            ReCode(T, π, E, u)       // Recursive expansion
        end if
    end for
end procedure
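
The recursive core of Algorithm 1 can be sketched in a few lines of Rust. This is an illustration under stated assumptions rather than the project's engine: the `Policy` trait stands in for the LLM policy π, `TraceEnv` stands in for the environment E and only records what was executed, `TablePolicy` replaces the LLM with a lookup table, and primitives are assumed to be calls to `run(...)`. The initialization step (`Reset`, `Text2Code`) is omitted, and a depth limit is added as a recursion guard.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the LLM policy π: expands a placeholder into child code lines.
pub trait Policy {
    fn expand(&self, placeholder: &str) -> Vec<String>;
}

/// Hypothetical stand-in for the environment E: records executed primitive actions.
#[derive(Default)]
pub struct TraceEnv {
    pub executed: Vec<String>,
}

impl TraceEnv {
    fn execute(&mut self, action: &str) {
        self.executed.push(action.to_string());
    }
}

/// Assumption: primitive actions are exactly the calls to `run(...)`.
fn is_primitive(line: &str) -> bool {
    line.trim_start().starts_with("run(")
}

/// Depth-first expansion: execute primitives, recurse into placeholders.
pub fn recode<P: Policy>(policy: &P, env: &mut TraceEnv, node: &str, depth: usize, max_depth: usize) {
    if depth >= max_depth {
        return; // recursion guard
    }
    for child in policy.expand(node) {
        if is_primitive(&child) {
            env.execute(&child);
        } else {
            recode(policy, env, &child, depth + 1, max_depth);
        }
    }
}

/// Table-driven policy used only for demonstration (no LLM involved).
pub struct TablePolicy(pub HashMap<String, Vec<String>>);

impl Policy for TablePolicy {
    fn expand(&self, placeholder: &str) -> Vec<String> {
        self.0.get(placeholder).cloned().unwrap_or_default()
    }
}

/// Expands a toy breakfast task and returns the primitives in execution order.
pub fn demo() -> Vec<String> {
    let mut table = HashMap::new();
    table.insert(
        "solve(instruction, observation)".to_string(),
        vec!["prepare_breakfast()".to_string()],
    );
    table.insert(
        "prepare_breakfast()".to_string(),
        vec!["run('crack egg')".to_string(), "cook_egg()".to_string()],
    );
    table.insert(
        "cook_egg()".to_string(),
        vec!["run('turn on stove')".to_string()],
    );
    let mut env = TraceEnv::default();
    recode(&TablePolicy(table), &mut env, "solve(instruction, observation)", 0, 10);
    env.executed
}
```

Because traversal is depth-first, `demo()` executes `run('crack egg')` before descending into `cook_egg()`, so primitives run in the order they appear in the fully expanded program.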
- Task Initialization: Task instruction → root placeholder function `solve(instruction, observation)`
- Context Management: Unified variable namespace, persisted across recursion levels
- Error Handling: Self-correction loop (max_rewrite=5)
- Recursion Control: Maximum recursion depth 10
- Checkpoint Mechanism: On DFS tree completion, a checkpoint is injected so the agent can self-verify that the task is solved
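
The error-handling bullet above describes a bounded self-correction loop. A minimal sketch of that control flow follows, assuming the executor reports failures as strings and the rewrite step is a callback (in the real system this would be an LLM call). The names `Outcome` and `self_correct` are hypothetical; only the `max_rewrite` budget of 5 comes from the list above.

```rust
/// Result of a bounded self-correction loop (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// The code ran successfully after `rewrites` rewrite attempts.
    Ok { code: String, rewrites: usize },
    /// The rewrite budget was exhausted; carries the last error seen.
    GaveUp { last_error: String },
}

/// Runs `execute` on the code; on failure asks `rewrite` for a new version,
/// up to `max_rewrite` times.
pub fn self_correct<E, R>(initial: &str, mut execute: E, mut rewrite: R, max_rewrite: usize) -> Outcome
where
    E: FnMut(&str) -> Result<(), String>,
    R: FnMut(&str, &str) -> String,
{
    let mut code = initial.to_string();
    let mut rewrites = 0;
    loop {
        match execute(&code) {
            Ok(()) => return Outcome::Ok { code, rewrites },
            // Budget exhausted: stop and surface the last error.
            Err(err) if rewrites >= max_rewrite => {
                return Outcome::GaveUp { last_error: err };
            }
            // Otherwise feed the failing code and error back to the rewriter.
            Err(err) => {
                code = rewrite(&code, &err);
                rewrites += 1;
            }
        }
    }
}
```

With `max_rewrite = 5`, the code is executed at most six times (the initial attempt plus five rewrites) before the loop gives up, which keeps a persistently failing node from stalling the whole traversal.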
- Rust 1.83+ (`curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`)
- Docker (for cross-compilation and Harbor)
- Codex CLI (`~/.codex/auth.json` configured)
- Harbor Framework (`uv tool install harbor`)

# Local macOS build (development)
cargo build --release --manifest-path recode-core/Cargo.toml
# Linux x86_64 build (Harbor container)
docker build --platform linux/amd64 -f Dockerfile.build-x86 -t recode-builder .
docker create --name tmp recode-builder
docker cp tmp:/build/recode-core/target/release/recode-core ./recode-agent-linux-x86_64
docker rm tmp
# Verify binary
file ./recode-agent-linux-x86_64
# Should output: ELF 64-bit LSB pie executable, x86-64...

HARBOR_ASSETS=~/.local/share/uv/tools/harbor/lib/python3.13/site-packages/harbor/agents/installed/recode-assets
# Sync binary
cp ./recode-agent-linux-x86_64 $HARBOR_ASSETS/recode-agent
chmod +x $HARBOR_ASSETS/recode-agent
# Sync templates
cp -r recode-core/templates/*.jinja2 $HARBOR_ASSETS/templates/
# Sync scripts
cp scripts/terminal_bench_bridge.py $HARBOR_ASSETS/scripts/

cargo test # All tests
cargo test --test codex_turn_tests # Codex integration tests
cargo clippy # Code linting

| Template | Purpose | Default |
|---|---|---|
| `recode_tb2_agents_md.jinja2` | AGENTS.md system prompt | Yes |
| `recode_tb2_prompt.jinja2` | TB2 task prompt | |
| `recode_microtexecute_tb2_prompt.jinja2` | Codex expansion | |
| `recode_tb2_checkpoint_minimal.jinja2` | Checkpoint verification | |

| Parameter | Type | Default | Description |
|---|---|---|---|
| `template` | string | `recode_tb2_agents_md.jinja2` | Jinja2 template filename |
| `max_steps` | int | 99999 | DFS tree maximum steps |

Usage: `--agent-kwarg template=xxx --agent-kwarg max_steps=100`
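
To illustrate how the two parameters and their defaults combine, the sketch below parses a list of `key=value` pairs into a typed struct. It is only a model of the semantics in the table above: Harbor itself handles `--agent-kwarg` on the Python side, and the `AgentKwargs` struct and `parse_kwargs` function are hypothetical names introduced for this example.

```rust
/// Typed view of the agent kwargs from the table above (hypothetical struct).
#[derive(Debug, PartialEq)]
pub struct AgentKwargs {
    pub template: String,
    pub max_steps: u32,
}

impl Default for AgentKwargs {
    fn default() -> Self {
        Self {
            template: "recode_tb2_agents_md.jinja2".to_string(),
            max_steps: 99_999,
        }
    }
}

/// Parses `key=value` pairs, starting from the documented defaults.
/// Unknown keys and malformed values are reported as errors.
pub fn parse_kwargs(pairs: &[&str]) -> Result<AgentKwargs, String> {
    let mut kwargs = AgentKwargs::default();
    for pair in pairs {
        let (key, value) = pair
            .split_once('=')
            .ok_or_else(|| format!("expected key=value, got `{pair}`"))?;
        match key {
            "template" => kwargs.template = value.to_string(),
            "max_steps" => {
                kwargs.max_steps = value
                    .parse::<u32>()
                    .map_err(|e| format!("invalid max_steps `{value}`: {e}"))?;
            }
            other => return Err(format!("unknown agent kwarg `{other}`")),
        }
    }
    Ok(kwargs)
}
```

Passing no pairs returns the defaults, so omitting `--agent-kwarg` entirely corresponds to the default AGENTS.md template with an effectively unbounded step budget.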
- Authentication: `~/.codex/auth.json` (copied to container `/app/.codex/`)
- Model config: `~/.codex/config.toml` (default: gpt-5.1-codex-max)
- AGENTS.md: Auto-discovered and loaded by Codex (95%+ token savings)
- Command: `codex exec --json` for JSONL event streaming

| Document | Description |
|---|---|
| WARP.md | Quick reference guide for WARP/Claude Code |
| CLAUDE.md | Claude Code instructions |
| Harbor ENV Guide | Detailed Harbor operation manual |
| Architecture | Technical architecture specification |
| Roadmap | Implementation roadmap |
ReCodeAgent is evaluated on Terminal-Bench 2.0, a benchmark for terminal-based task automation.
# Run evaluation
harbor run -d terminal-bench@2.0 -a recode-agent -n 4
# View results
cat ~/harbor-workspace/jobs/<job-id>/result.json | jq '.stats'

- tokio - Async runtime
- clap - CLI argument parsing
- serde / serde_json - Serialization / JSONL parsing
- minijinja - Jinja2 template rendering
- tree-sitter - High-performance AST parsing
- tracing - Structured logging
- Codex CLI - LLM calls and tool execution
- Harbor Framework - Benchmark evaluation platform
- Docker - Container runtime
Apache License 2.0 - See LICENSE file for details.
- ReCode Paper - Yu et al., 2025
- Terminal-Bench 2.0 - Benchmark dataset
- Harbor Framework - Evaluation platform
- OpenAI Codex CLI
- Architecture Design: RECODE_ARCHITECTURE_V0.1.0.md
- Specification: dev-spec/
- Harbor Docs: https://harborframework.com/docs/running-tbench
Last Updated: 2025-11-22
Version: v0.2.0
Status: Production-ready with Harbor Terminal-Bench 2.0 integration