Official implementation of AstroReason-Bench: Evaluating Unified Agentic Planning across Heterogeneous Space Planning Problems.
AstroReason-Bench is a comprehensive benchmark for evaluating agentic planning in astronautics mission design. It integrates multiple scheduling regimes under a unified agent-oriented interface with strict physical constraints.
Five distinct planning challenges, each enforcing constraints on orbital mechanics, power budgets, data storage, and slew kinematics:
- SatNet - Deep Space Network resource allocation
- Revisit Optimization - Minimize time gaps for continuous target monitoring
- Regional Coverage - Maximize area coverage using strip-imaging satellites
- Stereo Imaging - Schedule synchronized observation pairs for 3D reconstruction
- Latency Optimization - Manage LEO constellation for integrated sensing and communications
This benchmark suite is under active development in the dev branch. The current implementation in the main branch is a work-in-progress snapshot that we are continuously improving:
- Backend Transition: Currently relying on the astrox web API for orbital computations. We plan to migrate to local computation using established libraries for better reliability.
- Interface Exploration: Evaluating whether predefined MCP tools and Python APIs are optimal, or whether agents should interact with computational libraries directly through code.
- Benchmark Expansion: Actively designing better organizational structures for benchmarks and expanding to cover more diverse space missions.
- Baseline Performance: Current baselines are initial implementations for verification purposes. We plan to add carefully tuned baseline algorithms for each problem in the future.
- Python 3.12+
- Claude Code (required - agentic LLM interface)
- uv (required - manages environments and builds sandboxes)
- bubblewrap (optional, enables filesystem isolation):
# Debian/Ubuntu
sudo apt install bubblewrap

# Arch Linux
sudo pacman -S bubblewrap

# Fedora
sudo dnf install bubblewrap
# Clone the repository with submodules
git clone --recurse-submodules https://github.com/your-org/astro-reason.git
cd astro-reason
# If you already cloned without submodules, initialize them:
# git submodule update --init --recursive
# Create virtual environment and install dependencies
uv sync --all-groups
# Activate the environment (required for all subsequent commands)
source .venv/bin/activate # bash/zsh
# or: source .venv/bin/activate.fish # fish
# Build sandbox environments (required before running benchmarks)
bash src/benchmark/build_sandbox.sh
bash src/satnet_agent/build_sandbox.sh

Note: The build scripts use uv pip install --python to install dependencies with shebangs pointing to .venv/bin/python3. Always activate the virtual environment before building or running benchmarks.
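To confirm the right interpreter is active before building, a quick check (plain Python, nothing repo-specific; the expected path is an assumption based on the layout above):

```python
# Sanity check: the active interpreter should be the project venv's Python.
import sys

print(sys.executable)  # expect a path ending in .venv/bin/python3
```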
export ANTHROPIC_API_KEY="..." # Claude
export DEEPSEEK_API_KEY="..." # DeepSeek
export DASHSCOPE_API_KEY="..." # Qwen

Evaluate agentic LLM systems on the benchmarks:
# Single case evaluation
python src/benchmark/run_benchmark.py \
--benchmark revisit-optimization \
--case case_0001 \
--model anthropic::claude-sonnet-4-5-20250929
# All cases in benchmark
python src/benchmark/run_benchmark.py \
--benchmark stereo-imaging \
--all \
--model anthropic::claude-sonnet-4-5-20250929
# Interactive mode (for close inspection)
python src/benchmark/run_benchmark.py \
--benchmark regional-coverage \
--case case_0001 \
--model anthropic::claude-sonnet-4-5-20250929 \
--interactive
# File system isolation and resource limits
python src/benchmark/run_benchmark.py \
--benchmark latency-optimization \
--case case_0001 \
--bwrap \
--cpu-quota 800% \
--memory-limit 16G \
--model deepseek::deepseek-chat

Available benchmarks: revisit-optimization, stereo-imaging, latency-optimization, regional-coverage
SatNet uses a separate runner:
# Run SatNet Week 40
python src/satnet_agent/run_benchmark.py \
--week 40 \
--model anthropic::claude-sonnet-4-5-20250929
# Run all weeks
python src/satnet_agent/run_benchmark.py \
--all \
--model anthropic::claude-sonnet-4-5-20250929
# Interactive mode
python src/satnet_agent/run_benchmark.py \
--week 40 \
--model anthropic::claude-sonnet-4-5-20250929 \
--interactive
# File system isolation and resource limits
python src/satnet_agent/run_benchmark.py \
--week 40 \
--model anthropic::claude-sonnet-4-5-20250929 \
--bwrap \
--memory-limit 16G \
--cpu-quota 800%

Available weeks: 10, 20, 30, 40, 50
Run the test suite to verify installation and environment setup:
# Run all tests
pytest
# Run specific test file
pytest tests/test_mcp_server.py
# Run with verbose output
pytest -v
# Run specific benchmark tests
pytest tests/test_scenario_satnet.py

# Run all benchmarks with Claude Sonnet 4.5
for benchmark in revisit-optimization stereo-imaging latency-optimization regional-coverage; do
python src/benchmark/run_benchmark.py \
--benchmark $benchmark \
--bwrap --memory-limit 16G --cpu-quota 800% \
--all \
--model anthropic::claude-sonnet-4-5-20250929 \
--timeout 7200
done
# Run SatNet weeks
python src/satnet_agent/run_benchmark.py \
--bwrap --memory-limit 16G --cpu-quota 800% \
--all \
--model anthropic::claude-sonnet-4-5-20250929 \
--timeout 7200

Each benchmark case includes:
src/dataset/<benchmark>/cases/<case_id>/
├── mission_brief.md # Natural language task description
├── manifest.json # Case metadata and configuration
├── requirements.yaml # Mission-specific requirements
├── satellites.yaml # Satellite constellation definition
├── stations.yaml # Ground station locations
├── targets.yaml # Observation targets
└── initial_plan.json # Empty/template plan
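For orientation, here is a minimal sketch of loading one case from this layout. The `load_case` helper is hypothetical (not part of the repo's API), and it assumes PyYAML is available in the environment:

```python
# Hypothetical helper: read the standard case files into one dict.
import json
from pathlib import Path

import yaml  # assumes PyYAML is installed in the environment


def load_case(benchmark: str, case_id: str, root: str = "src/dataset") -> dict:
    case_dir = Path(root) / benchmark / "cases" / case_id
    case = {
        "brief": (case_dir / "mission_brief.md").read_text(),
        "manifest": json.loads((case_dir / "manifest.json").read_text()),
        "initial_plan": json.loads((case_dir / "initial_plan.json").read_text()),
    }
    for name in ("requirements", "satellites", "stations", "targets"):
        case[name] = yaml.safe_load((case_dir / f"{name}.yaml").read_text())
    return case


# Example: case = load_case("revisit-optimization", "case_0001")
```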
Four-layer design:
- Physics Layer - SGP4 propagation, slew kinematics, resource modeling (stateless)
- Scenario Layer - State management, action registry, persistence (stateful)
- Interface Layer - MCP tools + Python API
- Cognitive Layer - LLM agent (ReAct loop via Claude Code)
Agents use MCP tools for exploration and Python scripts for bulk optimization.
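As a purely illustrative picture of the layering, the sketch below separates a stateless physics function from a stateful scenario object; all names and signatures are hypothetical, not the repo's actual API:

```python
from dataclasses import dataclass, field

# Physics layer: pure, stateless computation.
def slew_time_s(angle_deg: float, max_rate_deg_s: float) -> float:
    """Time to slew through angle_deg at a constant maximum rate."""
    return angle_deg / max_rate_deg_s

# Scenario layer: owns mutable state and validates actions against physics.
@dataclass
class Scenario:
    slew_rate_deg_s: float
    schedule: list[tuple[float, float]] = field(default_factory=list)

    def add_observation(self, start_s: float, duration_s: float, slew_deg: float) -> bool:
        """Accept an observation only if the slew from the previous one fits."""
        if self.schedule:
            prev_end_s = self.schedule[-1][1]
            if start_s < prev_end_s + slew_time_s(slew_deg, self.slew_rate_deg_s):
                return False  # violates slew kinematics
        self.schedule.append((start_s, start_s + duration_s))
        return True

# The interface layer (MCP tools / Python API) would expose Scenario methods;
# the cognitive layer (the LLM agent) calls them inside a ReAct loop.
```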
If you use AstroReason-Bench in your research, please cite:
@article{wang2026astroreason,
title={AstroReason-Bench: Evaluating Unified Agentic Planning across Heterogeneous Space Planning Problems},
author={Weiyi Wang and Xinchi Chen and Jingjing Gong and Xuanjing Huang and Xipeng Qiu},
year={2026},
eprint={2601.11354},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2601.11354},
}

This benchmark integrates the SatNet scheduling problem:
@inproceedings{goh2021satnet,
title={SatNet: A benchmark for satellite scheduling optimization},
author={Goh, Edwin and Venkataram, Hamsa Shwetha and Balaji, Bharathan and Wilson, Brian D and Johnston, Mark D},
booktitle={AAAI-22 Workshop on Machine Learning for Operations Research (ML4OR)},
year={2022}
}

Benchmark datasets are derived from the following sources:
- TLE orbital data: CelesTrak
- City locations: World cities database (CC BY 4.0)
- Ground stations: Ground Station Dataset (MIT License)
Note: Satellite parameters other than orbital elements (e.g., power budgets, data storage, slew rates) are fictional or represent typical values for benchmark purposes.
Datasets are also available on Hugging Face.
This project is licensed under the MIT License - see the LICENSE file for details.