Official implementation of AstroReason-Bench: Evaluating Unified Agentic Planning across Heterogeneous Space Planning Problems.
AstroReason-Bench is a comprehensive benchmark for evaluating agentic planning in astronautics mission design. It integrates multiple scheduling regimes under a unified agent-oriented interface with strict physical constraints.
Five distinct planning challenges, each enforcing constraints on orbital mechanics, power budgets, data storage, and slew kinematics:
- SatNet - Deep Space Network resource allocation
- Revisit Optimization - Minimize time gaps for continuous target monitoring
- Regional Coverage - Maximize area coverage using strip-imaging satellites
- Stereo Imaging - Schedule synchronized observation pairs for 3D reconstruction
- Latency Optimization - Manage LEO constellation for integrated sensing and communications
This benchmark suite is under active development in the dev branch. The current implementation in the main branch is a work-in-progress snapshot that we are continuously improving:
- Backend Transition: Currently relying on the astrox web API for orbital computations. We plan to migrate to local computation using established libraries for better reliability.
- Interface Exploration: Evaluating whether predefined MCP tools and Python APIs are optimal, or whether agents should interact with computational libraries directly through code.
- Benchmark Expansion: Actively designing better organizational structures for benchmarks and expanding to cover more diverse space missions.
- Baseline Performance: Current baselines are initial implementations for verification purposes. We plan to add carefully tuned baseline algorithms for each problem in the future.
- Python 3.12+
- Claude Code (required - agentic LLM interface)
- uv (required - manages environments and builds sandboxes)
- bubblewrap (optional, enables filesystem isolation):
# Debian/Ubuntu
sudo apt install bubblewrap

# Arch Linux
sudo pacman -S bubblewrap

# Fedora
sudo dnf install bubblewrap
# Clone the repository with submodules
git clone --recurse-submodules https://github.com/your-org/astro-reason.git
cd astro-reason
# If you already cloned without submodules, initialize them:
# git submodule update --init --recursive
# Create virtual environment and install dependencies
uv sync --all-groups
# Activate the environment (required for all subsequent commands)
source .venv/bin/activate # bash/zsh
# or: source .venv/bin/activate.fish # fish
# Build sandbox environments (required before running benchmarks)
bash src/benchmark/build_sandbox.sh
bash src/satnet_agent/build_sandbox.sh

Note: The build scripts use uv pip install --python to install dependencies with shebangs pointing to .venv/bin/python3. Always activate the virtual environment before building or running benchmarks.
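To confirm the right interpreter is active before building, a quick check (plain Python, nothing repo-specific; the expected path is an assumption based on the layout above):

```python
# Sanity check: the active interpreter should be the project venv's Python.
import sys

print(sys.executable)  # expect a path ending in .venv/bin/python3
```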
export ANTHROPIC_API_KEY="..." # Claude
export DEEPSEEK_API_KEY="..." # DeepSeek
export DASHSCOPE_API_KEY="..." # Qwen

Evaluate agentic LLM systems on the benchmarks:
# Single case evaluation
python src/benchmark/run_benchmark.py \
--benchmark revisit-optimization \
--case case_0001 \
--model anthropic::claude-sonnet-4-5-20250929
# All cases in benchmark
python src/benchmark/run_benchmark.py \
--benchmark stereo-imaging \
--all \
--model anthropic::claude-sonnet-4-5-20250929
# Interactive mode (for close inspection)
python src/benchmark/run_benchmark.py \
--benchmark regional-coverage \
--case case_0001 \
--model anthropic::claude-sonnet-4-5-20250929 \
--interactive
# File system isolation and resource limits
python src/benchmark/run_benchmark.py \
--benchmark latency-optimization \
--case case_0001 \
--bwrap \
--cpu-quota 800% \
--memory-limit 16G \
--model deepseek::deepseek-chat

Available benchmarks: revisit-optimization, stereo-imaging, latency-optimization, regional-coverage
SatNet uses a separate runner:
# Run SatNet Week 40
python src/satnet_agent/run_benchmark.py \
--week 40 \
--model anthropic::claude-sonnet-4-5-20250929
# Run all weeks
python src/satnet_agent/run_benchmark.py \
--all \
--model anthropic::claude-sonnet-4-5-20250929
# Interactive mode
python src/satnet_agent/run_benchmark.py \
--week 40 \
--model anthropic::claude-sonnet-4-5-20250929 \
--interactive
# File system isolation and resource limits
python src/satnet_agent/run_benchmark.py \
--week 40 \
--model anthropic::claude-sonnet-4-5-20250929 \
--bwrap \
--memory-limit 16G \
--cpu-quota 800%

Available weeks: 10, 20, 30, 40, 50
Run the test suite to verify installation and environment setup:
# Run all tests
pytest
# Run specific test file
pytest tests/test_mcp_server.py
# Run with verbose output
pytest -v
# Run specific benchmark tests
pytest tests/test_scenario_satnet.py

# Run all benchmarks with Claude Sonnet 4.5
for benchmark in revisit-optimization stereo-imaging latency-optimization regional-coverage; do
python src/benchmark/run_benchmark.py \
--benchmark $benchmark \
--bwrap --memory-limit 16G --cpu-quota 800% \
--all \
--model anthropic::claude-sonnet-4-5-20250929 \
--timeout 7200
done
# Run SatNet weeks
python src/satnet_agent/run_benchmark.py \
--bwrap --memory-limit 16G --cpu-quota 800% \
--all \
--model anthropic::claude-sonnet-4-5-20250929 \
--timeout 7200

Each benchmark case includes:
src/dataset/<benchmark>/cases/<case_id>/
├── mission_brief.md # Natural language task description
├── manifest.json # Case metadata and configuration
├── requirements.yaml # Mission-specific requirements
├── satellites.yaml # Satellite constellation definition
├── stations.yaml # Ground station locations
├── targets.yaml # Observation targets
└── initial_plan.json # Empty/template plan
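For orientation, here is a minimal sketch of loading one case from this layout. The `load_case` helper is hypothetical (not part of the repo's API), and it assumes PyYAML is available in the environment:

```python
# Hypothetical helper: read the standard case files into one dict.
import json
from pathlib import Path

import yaml  # assumes PyYAML is installed in the environment


def load_case(benchmark: str, case_id: str, root: str = "src/dataset") -> dict:
    case_dir = Path(root) / benchmark / "cases" / case_id
    case = {
        "brief": (case_dir / "mission_brief.md").read_text(),
        "manifest": json.loads((case_dir / "manifest.json").read_text()),
        "initial_plan": json.loads((case_dir / "initial_plan.json").read_text()),
    }
    for name in ("requirements", "satellites", "stations", "targets"):
        case[name] = yaml.safe_load((case_dir / f"{name}.yaml").read_text())
    return case


# Example: case = load_case("revisit-optimization", "case_0001")
```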
Four-layer design:
- Physics Layer - SGP4 propagation, slew kinematics, resource modeling (stateless)
- Scenario Layer - State management, action registry, persistence (stateful)
- Interface Layer - MCP tools + Python API
- Cognitive Layer - LLM agent (ReAct loop via Claude Code)
Agents use MCP tools for exploration and Python scripts for bulk optimization.
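As a purely illustrative picture of the layering, the sketch below separates a stateless physics function from a stateful scenario object; all names and signatures are hypothetical, not the repo's actual API:

```python
from dataclasses import dataclass, field

# Physics layer: pure, stateless computation.
def slew_time_s(angle_deg: float, max_rate_deg_s: float) -> float:
    """Time to slew through angle_deg at a constant maximum rate."""
    return angle_deg / max_rate_deg_s

# Scenario layer: owns mutable state and validates actions against physics.
@dataclass
class Scenario:
    slew_rate_deg_s: float
    schedule: list[tuple[float, float]] = field(default_factory=list)

    def add_observation(self, start_s: float, duration_s: float, slew_deg: float) -> bool:
        """Accept an observation only if the slew from the previous one fits."""
        if self.schedule:
            prev_end_s = self.schedule[-1][1]
            if start_s < prev_end_s + slew_time_s(slew_deg, self.slew_rate_deg_s):
                return False  # violates slew kinematics
        self.schedule.append((start_s, start_s + duration_s))
        return True

# The interface layer (MCP tools / Python API) would expose Scenario methods;
# the cognitive layer (the LLM agent) calls them inside a ReAct loop.
```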
If you use AstroReason-Bench in your research, please cite:
@article{wang2026astroreason,
title={AstroReason-Bench: Evaluating Unified Agentic Planning across Heterogeneous Space Planning Problems},
author={Weiyi Wang and Xinchi Chen and Jingjing Gong and Xuanjing Huang and Xipeng Qiu},
year={2026},
eprint={2601.11354},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2601.11354},
}

This benchmark integrates the SatNet scheduling problem:
@inproceedings{goh2021satnet,
title={SatNet: A benchmark for satellite scheduling optimization},
author={Goh, Edwin and Venkataram, Hamsa Shwetha and Balaji, Bharathan and Wilson, Brian D and Johnston, Mark D},
booktitle={AAAI-22 Workshop on Machine Learning for Operations Research (ML4OR)},
year={2022}
}

Benchmark datasets are derived from the following sources:
- TLE orbital data: CelesTrak
- City locations: World cities database (CC BY 4.0)
- Ground stations: Ground Station Dataset (MIT License)
Note: Satellite parameters other than orbital elements (e.g., power budgets, data storage, slew rates) are fictional or represent typical values for benchmark purposes.
Datasets are also available on Hugging Face.
This project is licensed under the MIT License - see the LICENSE file for details.