Autonomous lunar landing system using reinforcement learning and high-fidelity spacecraft dynamics
Train AI agents to perform precise Moon landings using Basilisk astrodynamics simulation and Stable Baselines3 reinforcement learning.
```bash
# 1. Test your setup (2 minutes)
python unified_training.py --mode test

# 2. Train with curriculum learning (2-4 hours with GPU) - recommended
python unified_training.py --mode curriculum

# 3. Monitor training
tensorboard --logdir=./logs

# 4. Evaluate the trained model
python unified_training.py --mode eval --model-path ./models/best_model/best_model
```

This project trains AI agents to autonomously land spacecraft on the Moon, handling:
- Realistic physics: 6-DOF dynamics via Basilisk astrodynamics framework
- Complex terrain: Procedurally generated lunar craters and slopes
- Multiple sensors: IMU, LIDAR, altimeter, fuel gauges
- Challenging conditions: Variable altitude, velocity, terrain difficulty
Training approach: Progressive curriculum learning, from simple hovering → precision landings on extreme terrain
```bash
# Python 3.8+ required
pip install stable-baselines3[extra] gymnasium numpy matplotlib tensorboard
```

Important: Basilisk is not included in this repository and must be installed separately:

```bash
# Option 1: Install from PyPI (recommended)
pip install Basilisk

# Option 2: Build from source
# See: https://hanspeterschaub.info/basilisk/
```

Note: The code expects Basilisk to be importable as a Python module. If you build from source, ensure the built dist3 directory is in your Python path.
| Mode | Duration | Purpose |
|---|---|---|
| `test` | 2 min | Verify environment setup |
| `demo` | 15 min | Quick demonstration of curriculum learning |
| `standard` | 1-2 hrs | Direct RL training without curriculum (2M timesteps default) |
| `curriculum` | 2-4 hrs | Progressive difficulty training with GPU (best results) |
| `eval` | 1-2 min | Evaluate trained models |
- Hover Training → Learn altitude/attitude control
- Simple Descent → Controlled descent from moderate altitude
- Precision Landing → Land softly near the target position
- Challenging Terrain → Handle complex lunar terrain
- Extreme Conditions → Master worst-case scenarios
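Conceptually, each stage pairs an initial-condition range with advancement criteria (a mean reward threshold plus the 60%+ success rate noted below). A hypothetical sketch; none of these field names or numbers are the actual values in `unified_training.py`:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class CurriculumStage:
    """Illustrative container for one curriculum stage (hypothetical fields)."""
    name: str
    altitude_range: Tuple[float, float]   # initial altitude range (m)
    speed_range: Tuple[float, float]      # initial speed range (m/s)
    reward_threshold: float               # mean reward required to advance
    min_success_rate: float = 0.6         # 60%+ landings also required

# Purely illustrative numbers; the real stages are defined in unified_training.py.
STAGES = [
    CurriculumStage("hover",               (50, 100),   (0, 1),   200.0),
    CurriculumStage("simple_descent",      (200, 400),  (0, 5),   400.0),
    CurriculumStage("precision_landing",   (400, 800),  (0, 10),  700.0),
    CurriculumStage("challenging_terrain", (600, 1000), (5, 15),  800.0),
    CurriculumStage("extreme_conditions",  (800, 1500), (10, 25), 900.0),
]
```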
- PPO (Proximal Policy Optimization) - Default, stable, general-purpose
- SAC (Soft Actor-Critic) - Sample efficient, good exploration
- TD3 (Twin Delayed DDPG) - Continuous control, deterministic
```bash
# Try different algorithms
python unified_training.py --mode curriculum --algorithm ppo
python unified_training.py --mode curriculum --algorithm sac
```

```bash
# Launch TensorBoard
tensorboard --logdir=./logs

# Open a browser to http://localhost:6006
```

Key metrics:
- `rollout/ep_rew_mean` - Average episode reward (primary metric)
- `curriculum/current_stage` - Current training stage
- `rollout/ep_len_mean` - Episode length
```
.
├── unified_training.py            # Main training script (all modes)
├── lunar_lander_env.py            # Gymnasium environment
├── ScenarioLunarLanderStarter.py  # Basilisk simulation setup
├── generate_terrain.py            # Terrain generation utilities
├── terrain_simulation.py          # Lunar regolith physics model
├── common_utils.py                # Shared utility functions
├── starship_constants.py          # Starship HLS physical constants
│
├── README.md                      # This file - quick start guide
├── REWARD_SYSTEM_GUIDE.md         # Comprehensive reward system documentation
├── PRODUCTION_CHECKLIST.md        # Production deployment guide
│
├── basilisk/                      # Astrodynamics simulation framework
├── generated_terrain/             # Generated terrain heightmaps
├── models/                        # Saved trained models
└── logs/                          # TensorBoard logs
```
```bash
# Quick test
python unified_training.py --mode test

# Demo training
python unified_training.py --mode demo

# Full curriculum training (recommended)
python unified_training.py --mode curriculum --n-envs 4

# Standard training (no curriculum) - default 2M timesteps
python unified_training.py --mode standard --timesteps 2000000

# Resume training from checkpoint
python unified_training.py --mode standard --resume ./models/checkpoints/ppo_lunar_lander_500000_steps --timesteps 500000

# Resume curriculum training (automatic state restoration)
python unified_training.py --mode curriculum --resume-curriculum

# Evaluate model
python unified_training.py --mode eval --model-path ./models/best_model/best_model --eval-episodes 20

# Evaluate with visualization
python unified_training.py --mode eval --model-path ./models/best_model/best_model --render
```

The training system includes comprehensive save/resume functionality:
- Every 100,000 timesteps: Model + VecNormalize stats saved
- Curriculum state: Stage progress, attempts, and performance tracked
- Best model: Automatically saved based on evaluation performance (every 10,000 timesteps)
```bash
# Standard training - resume from checkpoint
python unified_training.py --mode standard \
    --resume ./models/checkpoints/ppo_lunar_lander_500000_steps \
    --timesteps 500000  # Additional timesteps to train

# Curriculum training - automatic state restoration
python unified_training.py --mode curriculum --resume-curriculum
```

```
models/
├── training_state.json          # Training progress (human-readable)
├── curriculum_state.pkl         # Complete state (binary)
├── checkpoints/                 # Regular checkpoints (every 100k steps)
│   ├── ppo_lunar_lander_100000_steps.zip
│   ├── ppo_lunar_lander_200000_steps.zip
│   └── vecnormalize.pkl         # Normalization statistics
├── stage*_checkpoints/          # Per-stage checkpoints (curriculum)
├── stage*_vecnormalize.pkl      # Per-stage normalization
└── best_model/                  # Best performing model
    └── best_model.zip
```
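Because each checkpoint pairs a model zip with `vecnormalize.pkl`, both must be restored together when loading a model outside the training script. A minimal sketch using the standard Stable Baselines3 API (the environment class name `LunarLanderEnv` is an assumption about `lunar_lander_env.py`):

```python
# Sketch: load a checkpoint together with its VecNormalize statistics.
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

from lunar_lander_env import LunarLanderEnv  # assumed env class name

env = DummyVecEnv([lambda: LunarLanderEnv()])
# Restore the normalization stats saved alongside the checkpoint.
env = VecNormalize.load("./models/checkpoints/vecnormalize.pkl", env)
env.training = False     # freeze running statistics for evaluation
env.norm_reward = False  # report raw (unnormalized) rewards

model = PPO.load("./models/checkpoints/ppo_lunar_lander_200000_steps", env=env)

obs = env.reset()
done = [False]
while not done[0]:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
```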
```bash
# Gracefully stop training (saves state)
# Press Ctrl+C during training

# Resume automatically (curriculum mode)
python unified_training.py --mode curriculum --resume-curriculum

# Or manually specify a checkpoint (standard mode)
python unified_training.py --mode standard --resume ./models/checkpoints/ppo_lunar_lander_450000_steps
```

After full curriculum training:
- Mean reward: 800-1200 on extreme conditions (terminal 1000 + bonuses up to 400)
- Success rate: 60%+ successful landings (curriculum requires this for advancement)
- Landing criteria: Altitude < 5 m, vertical velocity < 3 m/s, horizontal speed < 2 m/s, attitude < 15° from upright
- Fuel efficiency: Bonus up to +150 points for high fuel remaining (only awarded on successful landing)
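The landing criteria translate directly into a success predicate. A sketch with thresholds copied from the list above (the helper function itself is hypothetical, not from the codebase):

```python
def is_successful_landing(altitude: float, v_vertical: float,
                          v_horizontal: float, tilt_deg: float) -> bool:
    """Success predicate built from the landing criteria above."""
    return (altitude < 5.0                 # meters above terrain
            and abs(v_vertical) < 3.0      # m/s vertical at touchdown
            and v_horizontal < 2.0         # m/s horizontal speed
            and tilt_deg < 15.0)           # degrees from upright
```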
The reward system uses a comprehensive multi-component architecture designed to guide the agent from initial random actions to optimal landing performance.
```
Total Reward = Terminal Rewards + Progress Tracking + Safety/Efficiency + Control Quality
               (±1000 scale)      (0-5 scale)          (±2 scale)          (±1 scale)
```
Successful Landing (+1000 to +1600 points):
- Base success: +1000 (largest single component)
- Precision bonus: +0 to +200 (scales with distance from target)
- Softness bonus: +0 to +100 (scales with touchdown velocity)
- Attitude bonus: +0 to +100 (scales with upright orientation)
- Fuel efficiency: +0 to +150 (quadratic curve, ONLY on success)
- Control smoothness: +0 to +50 (rewards a stable final approach)

Hard Landing (-300 to -450 points):
- Base penalty scaled by violation severity (velocity, position, attitude errors)

Crash (-400 to -800 points):
- Penalty scales with impact energy (velocity squared)

High-Altitude Failure (-200 to -400 points):
- Penalty scales with altitude (the higher the failure, the larger the penalty)
- Descent Profile (0-1): Encourages a descent rate proportional to altitude (-2 to -10 m/s)
- Approach Angle (0-0.5): Rewards vertical descent near the ground (low horizontal-to-vertical velocity ratio)
- Proximity to Target (0-1): Progressive reward for getting closer to the landing site (active below 200 m altitude)
- Attitude Stability (0-0.5): Rewards upright orientation near the ground (below 100 m altitude)
- Final Approach Quality (0-2): High reward in the last 50 m for being slow, upright, and on target
Danger Zone Warnings (altitude < 50 m):
- Speed danger: 0 to -1 (excessive vertical velocity)
- Tilt danger: 0 to -0.5 (tilted orientation)
- Lateral danger: 0 to -0.5 (high horizontal velocity)

Fuel Management: 0 to -1 (progressive warning as fuel depletes below 10%)
High-Altitude Loitering: -0.5 (discourages hovering above 500 m)
Control Effort: -0.001 × effort (encourages efficient control)
Control Jitter: -0.1 × change (penalizes rapid control changes)
Spin Rate: 0 to -0.5 (discourages uncontrolled rotation)
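To make the scales concrete, a hedged sketch of how two of these terms might be computed: the quadratic fuel bonus and the per-step control penalties. The coefficients match the ranges above, but the exact formulas live in `lunar_lander_env.py`:

```python
import numpy as np

def fuel_efficiency_bonus(fuel_fraction: float) -> float:
    """Quadratic bonus up to +150, awarded only on a successful landing.

    fuel_fraction: remaining fuel in [0, 1]. Illustrative curve only.
    """
    return 150.0 * fuel_fraction ** 2

def control_penalties(action: np.ndarray, prev_action: np.ndarray) -> float:
    """Per-step control-quality penalties from the ranges above."""
    effort_penalty = -0.001 * float(np.sum(np.abs(action)))               # effort
    jitter_penalty = -0.1 * float(np.sum(np.abs(action - prev_action)))   # jitter
    return effort_penalty + jitter_penalty

# Example: landing with 60% fuel left earns 150 * 0.36 = +54 bonus points.
print(fuel_efficiency_bonus(0.6))  # 54.0
```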
- Terminal rewards dominate (10x larger than shaping) - clearly signaling success/failure
- Progressive difficulty - rewards increase as agent approaches landing
- Fuel efficiency rewarded ONLY on success - prevents hoarding during flight
- Multi-dimensional success criteria - velocity, position, attitude, fuel all matter
- Penalties scale with severity - worse violations get worse penalties
- Action smoothing - exponential moving average (80% old, 20% new) for stable control
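The action-smoothing filter mentioned above is a plain exponential moving average. A minimal sketch, using the 80%/20% weights from the list (the class itself is illustrative, not the project's implementation):

```python
import numpy as np

class ActionSmoother:
    """Exponential moving average over actions: 80% old, 20% new."""

    def __init__(self, action_dim: int, alpha: float = 0.2):
        self.alpha = alpha                   # weight of the new action
        self.smoothed = np.zeros(action_dim)

    def __call__(self, raw_action: np.ndarray) -> np.ndarray:
        self.smoothed = (1 - self.alpha) * self.smoothed + self.alpha * raw_action
        return self.smoothed

smoother = ActionSmoother(action_dim=15)  # 15-D action space (see Technical Details)
```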
| Outcome | Cumulative Reward | Description |
|---|---|---|
| Perfect Landing | 1200-1600 | All bonuses achieved (precision, fuel, smoothness) |
| Good Landing | 900-1200 | Most bonuses achieved |
| Basic Landing | 600-900 | Minimal bonuses, but successful |
| Poor Landing | 400-600 | Barely meets success criteria |
| Hard Landing | -300 to -450 | Touches down too hard, but within the landing zone |
| Crash | -400 to -800 | Impact below surface |
| Timeout/Abort | -200 to -400 | Failure at high altitude |
The system tracks individual reward components for debugging and analysis:

```bash
# Launch TensorBoard to view the reward component breakdown
tensorboard --logdir=./logs

# Navigate to the "reward_components" section to see:
# - terminal_success, precision_bonus, fuel_efficiency, etc.
# - descent_profile, approach_angle, proximity, etc.
# - danger penalties, control quality metrics
```

Each episode's `info` dict includes a `reward_components` entry with a detailed breakdown of all reward sources.
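The breakdown can also be read programmatically after a rollout. A sketch using the standard Gymnasium loop (the env class name `LunarLanderEnv` is an assumption; `reward_components` is the key documented above):

```python
from lunar_lander_env import LunarLanderEnv  # assumed env class name

env = LunarLanderEnv()
obs, info = env.reset()
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()  # random policy, for illustration
    obs, reward, terminated, truncated, info = env.step(action)

# Per-source breakdown of the final step's reward
for name, value in info["reward_components"].items():
    print(f"{name:>24s}: {value:+.3f}")
```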
If the agent is not landing:
- Increase `success_threshold` in the curriculum stages
- Increase `min_episodes` for better mastery
- Check TensorBoard for the `episode/success_rate_100` metric
If the agent is too cautious (hovers):
- Increase loitering penalty
- Reduce fuel efficiency bonus coefficient
- Add progressive time penalty
If the agent crashes frequently:
- Increase danger zone penalties
- Reduce initial velocity range in curriculum
- Add more stages to curriculum
If the agent uses too much fuel:
- Increase fuel efficiency bonus coefficient
- Add fuel consumption penalty during flight
- Reward descent profile adherence more
- README.md - This file: Quick start guide and overview
- REWARD_SYSTEM_GUIDE.md - Comprehensive reward system documentation with tuning guide
- PRODUCTION_CHECKLIST.md - Production deployment checklist and validation guide
- unified_training.py - Main training script with curriculum learning
- lunar_lander_env.py - Gymnasium environment implementation
- ScenarioLunarLanderStarter.py - Basilisk simulation setup
- generate_terrain.py - Procedural lunar terrain generation
- terrain_simulation.py - Lunar regolith physics (Bekker-Wong model)
- starship_constants.py - Starship HLS physical constants
```bash
# 1. First-time setup verification
python unified_training.py --mode test

# 2. Quick demo to understand the system
python unified_training.py --mode demo

# 3. Start full curriculum training
python unified_training.py --mode curriculum --n-envs 4 --algorithm ppo

# 4. Monitor progress (in a separate terminal)
tensorboard --logdir=./logs

# 5. After training completes, evaluate
python unified_training.py --mode eval \
    --model-path ./models/curriculum_final \
    --eval-episodes 20
```

```bash
# Generate custom terrain
python generate_terrain.py \
    --output generated_terrain/custom_terrain.npy \
    --size 2000 \
    --craters 25 \
    --seed 42 \
    --visualize
```

Edit unified_training.py to customize:
- Curriculum stages and difficulty progression
- Model hyperparameters (learning rate, network architecture)
- Environment configuration (sensors, terrain, initial conditions)
- Success thresholds and advancement criteria
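The heightmaps produced by the `generate_terrain.py` command above are plain NumPy arrays, so they can be inspected directly. A sketch (the path matches the generation command):

```python
import numpy as np
import matplotlib.pyplot as plt

terrain = np.load("generated_terrain/custom_terrain.npy")
print(f"Heightmap shape: {terrain.shape}, elevation range: "
      f"{terrain.min():.1f} to {terrain.max():.1f} m")

# Visualize craters and slopes as a grayscale elevation map.
plt.imshow(terrain, cmap="gray")
plt.colorbar(label="elevation (m)")
plt.title("Generated lunar terrain")
plt.show()
```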
```bash
# Use more parallel environments
python unified_training.py --mode curriculum --n-envs 8

# Or try a faster algorithm
python unified_training.py --mode curriculum --algorithm sac
```

```bash
# Use curriculum learning (automatic difficulty progression)
python unified_training.py --mode curriculum

# Or train longer
python unified_training.py --mode standard --timesteps 2000000
```

```bash
# Run diagnostic test
python unified_training.py --mode test

# If the test fails, check dependencies
pip install --upgrade stable-baselines3[extra] gymnasium
```

- Basilisk: High-fidelity spacecraft dynamics with 6-DOF rigid-body simulation
- Gravity: Lunar gravitational field (μ = 4.9028×10¹² m³/s²)
- Propulsion: 3 Raptor Vacuum engines (2.5 MN thrust each, 40-100% throttle)
- Sensors: IMU (noisy), LIDAR (64-ray cone scan), altimeter, fuel gauges, attitude sensors
- Terrain: Analytical Bekker-Wong model with procedural crater generation and realistic regolith mechanics
- Framework: Stable Baselines3 (PyTorch-based)
- Observation: 32-dimensional state vector (position, velocity, Euler angles, fuel flow rate, time-to-impact, LIDAR azimuthal bins, IMU)
- Observation normalization: VecNormalize for zero-mean, unit-variance observations (improves stability)
- Action: 15-dimensional continuous control (see the space sketch after this list):
  - Primary engine throttles (3): individual throttle [0.4-1.0]
  - Primary engine gimbals (6): pitch/yaw per engine [-8°, +8°]
  - Mid-body thruster groups (3): rotation control [0, 1]
  - RCS thruster groups (3): pitch/yaw/roll [0, 1]
- Action smoothing: Exponential moving average filter (80% old, 20% new) for stable control
- Reward: Comprehensive multi-component system with:
  - Terminal rewards (±1000): Dominant signals for success/failure
  - Progress tracking (0-5): Continuous guidance toward landing
  - Safety penalties (±2): Danger-zone warnings and efficiency
  - Control quality (±1): Smooth control and technique optimization
- Curriculum Learning: 5 progressive stages with advancement requiring both mean reward threshold AND 60%+ success rate
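Put together, the interface is a standard Gymnasium continuous-control environment. A sketch of the space definitions implied by the list above (observation bounds and the normalized gimbal range are illustrative assumptions, not the actual definitions in `lunar_lander_env.py`):

```python
import numpy as np
from gymnasium import spaces

# 32-D observation: position, velocity, Euler angles, fuel flow rate,
# time-to-impact, LIDAR azimuthal bins, IMU readings.
observation_space = spaces.Box(low=-np.inf, high=np.inf,
                               shape=(32,), dtype=np.float32)

# 15-D action: 3 throttles [0.4, 1.0], 6 gimbal channels (normalized to
# [-1, 1], mapped to ±8°), 3 mid-body thruster groups [0, 1], 3 RCS groups [0, 1].
low  = np.array([0.4] * 3 + [-1.0] * 6 + [0.0] * 3 + [0.0] * 3, dtype=np.float32)
high = np.array([1.0] * 3 + [ 1.0] * 6 + [1.0] * 3 + [1.0] * 3, dtype=np.float32)
action_space = spaces.Box(low=low, high=high, dtype=np.float32)
```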
See LICENSE file for details.
- ✅ Curriculum learning for robust policy development
- ✅ Multiple RL algorithms (PPO, SAC, TD3)
- ✅ High-fidelity physics via Basilisk
- ✅ Procedural terrain generation
- ✅ Real-time monitoring with TensorBoard
- ✅ Checkpoint system for resuming training
- ✅ Comprehensive evaluation tools
Ready to train an AI to land on the Moon?

Start here:

```bash
python unified_training.py --mode test
```

For more details, see the Documentation section above.