██████╗ ██████╗ ██████╗ ███████╗████████╗██████╗ █████╗ ██████╗███████╗██████╗ ██╔════╝██╔═══██╗██╔══██╗██╔════╝╚══██╔══╝██╔══██╗██╔══██╗██╔════╝██╔════╝██╔══██╗ ██║ ██║ ██║██║ ██║█████╗ ██║ ██████╔╝███████║██║ █████╗ ██████╔╝ ██║ ██║ ██║██║ ██║██╔══╝ ██║ ██╔══██╗██╔══██║██║ ██╔══╝ ██╔══██╗ ╚██████╗╚██████╔╝██████╔╝███████╗ ██║ ██║ ██║██║ ██║╚██████╗███████╗██║ ██║ ╚═════╝ ╚═════╝ ╚═════╝ ╚══════╝ ╚═╝ ╚═╝ ╚═╝╚═╝ ╚═╝ ╚═════╝╚══════╝╚═╝ ╚═╝
NJU-LINK × Kwaipilot
Quick Start • CodeTraceBench • Supported Agents • Configuration • Citation
CodeTracer analyzes agent execution trajectories step-by-step, identifies incorrect and unuseful actions, and produces structured diagnostic labels. It operates as an autonomous diagnosis agent — navigating trajectory environments, inspecting evidence, and building root-cause chains — with cross-trajectory memory that accumulates experience over time.
Tip
New to CodeTracer? Start with a single trajectory: codetracer analyze /path/to/trajectory/ --model gpt-4o --profile detailed
|
Autonomous Diagnosis Agent Iteratively explores trajectory data, inspects steps, gathers evidence, and produces structured error labels with full reasoning chains. Deep Recursive Discovery Three-phase trajectory discovery (marker scan → child preference → LLM-guided analysis) handles arbitrarily nested and messy data archives. Auto-Skill Generation Unknown trajectory formats are automatically parsed via LLM-generated skills; no manual parser authoring required. |
Cross-Trajectory Memory Online memory extraction during analysis accumulates agent-specific failure patterns and investigation strategies across runs. Resilient Context Management Two-tier compaction (LLM summarization → sliding window fallback) ensures analysis never stalls from context overflow. Replay Engine Resume failed trajectories from diagnosed breakpoints with corrective strategies injected. |
git clone https://github.com/NJU-LINK/CodeTracer.git
cd CodeTracer
pip install -e .export CODETRACER_API_BASE="https://api.openai.com/v1"
export CODETRACER_API_KEY="your-api-key"# Single trajectory
codetracer analyze /path/to/trajectory/ \
--model gpt-4o \
--profile detailed
# Batch analysis
codetracer-batch \
--manifest bench_manifest.verified.jsonl \
--parallel 4 \
--model gpt-4ocodetracer interactive /path/to/trajectory/| Module | Description |
|---|---|
discovery.explorer |
Three-phase recursive trajectory discovery with LLM-guided fallback |
agents.trace_agent |
Autonomous diagnosis loop with bash execution in trajectory environments |
agents.compact |
Two-tier context management (LLM summarization + sliding window) |
services.memory |
Cross-trajectory memory with online mid-analysis extraction |
skills.* |
Pluggable format parsers with auto-generation for unknown formats |
agents.replay |
Replay engine for resuming trajectories from diagnosed breakpoints |
query.normalizer |
Unified trajectory normalization across all supported formats |
query.tree_builder |
Step classification tree construction (change/explore labeling) |
CodeTraceBench is a benchmark of 4,316 agent trajectories with human-verified step-level annotations for trajectory diagnosis evaluation.
| Split | Trajectories | Description |
|---|---|---|
verified |
1,000 | Curated subset (489 SWE-bench + 511 TerminalBench) |
full |
3,316 | All trajectories across agents and models |
Note
CodeTraceBench covers 4 agents (mini-SWE-agent, OpenHands, Terminus2, SWE-agent) × 5 models (Claude Sonnet 4, DeepSeek-V3.2, Kimi-K2, GPT-5, Qwen3-Coder) across 26 task categories.
# Download and extract
huggingface-cli download NJU-LINK/CodeTraceBench \
--repo-type dataset \
--local-dir ./tracebench_data
# Run on verified split
codetracer-batch \
--manifest tracebench_data/bench_manifest.verified.jsonl \
--model gpt-4o \
--parallel 4 \
--output outputs/Dataset Fields
| Field | Description |
|---|---|
traj_id |
Unique trajectory identifier |
agent |
Agent name (mini-SWE-agent, OpenHands, Terminus2, SWE-agent) |
model |
Model identifier |
stages |
Ground-truth stage ranges |
incorrect_stages |
Per-stage incorrect/unuseful step annotations |
solved |
Whether the agent solved the task |
step_count |
Total number of steps |
difficulty |
easy / medium / hard |
Each analysis produces structured diagnostic labels:
[
{
"stage_id": 3,
"incorrect_step_ids": [21, 22],
"unuseful_step_ids": [],
"reasoning": "Steps 21-22 edit the wrong file based on an incorrect localization hypothesis..."
}
]Detailed profile output (root-cause analysis)
{
"root_cause_chain": ["Final test failure", "Wrong file edited", "Incorrect grep interpretation"],
"critical_decision_points": [
{"step_id": 15, "decision": "Searched for pattern X", "should_have": "Searched for pattern Y"}
],
"correct_strategy": "Should have verified file structure before editing",
"stage_labels": [...],
"summary": "Agent mislocalized the bug due to ambiguous grep results..."
}from datasets import load_dataset
import json
ds = load_dataset("NJU-LINK/CodeTraceBench", split="verified")
for entry in ds:
pred = json.load(open(f"outputs/{entry['traj_id']}/codetracer_labels.json"))
gt = entry["incorrect_stages"]
pred_steps = {s for stage in pred for s in stage.get("incorrect_step_ids", [])}
gt_steps = {s for stage in gt for s in stage.get("incorrect_step_ids", [])}
tp = len(pred_steps & gt_steps)
precision = tp / len(pred_steps) if pred_steps else 0
recall = tp / len(gt_steps) if gt_steps else 0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0CodeTracer ships with built-in parsers for major agent frameworks:
| Agent | Format | Auto-Detected |
|---|---|---|
| mini-SWE-agent | agent-logs/mini.traj.json + sessions/ |
Yes |
| OpenHands | OpenHands event stream | Yes |
| Terminus2 | Session recordings + results | Yes |
| Custom | Any format | Via auto-generated skill |
For unknown trajectory formats, CodeTracer automatically generates a parser:
# Auto-detect and generate parser for unknown format
codetracer analyze /path/to/unknown/trajectory/ --model gpt-4o
# Generated skill is cached for future use
ls ~/.config/codetracer/skills/Full configuration reference in src/codetracer/config/default.yaml.
| Section | Key Settings |
|---|---|
llm |
api_base, model_name, model_kwargs |
trace |
cost_limit, step_limit, timeout, prompt templates |
discovery |
max_depth, skip_dirs, auto-generation settings |
memory |
enabled, online_step_interval, online_token_threshold |
output.profiles |
tracebench, detailed, rl_feedback |
replay |
max_replay_steps, timeout |
Override via custom YAML:
codetracer analyze <run_dir> --config my_config.yaml --model gpt-4o| Profile | Output File | Description |
|---|---|---|
tracebench |
codetracer_labels.json |
Stage-level labels (benchmark evaluation) |
detailed |
codetracer_analysis.json |
Root cause chains + critical decision points |
rl_feedback |
codetracer_rl_feedback.json |
Per-step deviation analysis for RL training |
# Use specific profile
codetracer analyze <run_dir> --profile rl_feedback --model gpt-4oCodeTracer/
├── src/codetracer/
│ ├── agents/ # Core agent loops (trace, replay, compact)
│ ├── cli/ # CLI commands and interactive REPL
│ ├── config/ # Default configuration
│ ├── discovery/ # Deep recursive trajectory discovery
│ ├── llm/ # LLM client with Azure AD support
│ ├── models/ # Data models (trajectory, analysis, replay)
│ ├── plugins/ # Hook system and adapter interface
│ ├── query/ # Normalizer, tree builder, config loader
│ ├── scripts/ # Batch runner, dev tools, analysis scripts
│ ├── services/ # Memory, cost tracking, validation, complexity
│ ├── skills/ # Format parsers (built-in + generated)
│ ├── state/ # Output profiles and session persistence
│ └── utils/ # Template rendering, report generation
├── data/ # Trajectory datasets
├── scripts/ # Shell scripts for batch operations
└── tests/ # Test suite
CLI Reference
Usage: codetracer [COMMAND] [OPTIONS]
Commands:
analyze Run trajectory diagnosis on a single run directory
run Full pipeline: normalize → tree → analyze → replay
replay Resume a trajectory from a diagnosed breakpoint
interactive Interactive trajectory exploration shell
inspect Inspect a specific step or range from a trajectory
Global Options:
--model TEXT LLM model name
--api-base URL API endpoint
--api-key TEXT API key
--config PATH Custom configuration file
--profile TEXT Output profile (tracebench/detailed/rl_feedback)
--cost-limit $ Max LLM spend per trajectory (default: 3.0)
--dry-run Normalize + tree only, skip LLM analysis
If you use CodeTracer or CodeTraceBench in your research, please cite:
@article{li2026codetracer,
title={CodeTracer: Towards Traceable Agent States},
author={Li, Han and Yao, Yifan and Zhu, Letian and Feng, Rili and Ye, Hongyi and Wang, Jiaming and He, Yancheng and Zou, Pengyu and Zhang, Lehan and Lei, Xinping and Huang, Haoyang and Deng, Ken and Sun, Ming and Zhang, Zhaoxiang and Ye, He and Liu, Jiaheng},
journal={arXiv preprint arXiv:2604.11641},
year={2026}
}This project is licensed under the MIT License. See LICENSE for details.
Built at Nanjing University & Kwaipilot

