Autonomous lunar landing system using reinforcement learning and high-fidelity spacecraft dynamics
Train AI agents to perform precise Moon landings using Basilisk astrodynamics simulation and Stable Baselines3 reinforcement learning.
```bash
# 1. Test your setup (2 minutes)
python unified_training.py --mode test

# 2. Train with curriculum learning (2-4 hours with GPU) - recommended
python unified_training.py --mode curriculum

# 3. Monitor training
tensorboard --logdir=./logs

# 4. Evaluate the trained model
python unified_training.py --mode eval --model-path ./models/best_model/best_model
```

This project trains AI agents to autonomously land spacecraft on the Moon, handling:
- Realistic physics: 6-DOF dynamics via Basilisk astrodynamics framework
- Complex terrain: Procedurally generated lunar craters and slopes
- Multiple sensors: IMU, LIDAR, altimeter, fuel gauges
- Challenging conditions: Variable altitude, velocity, terrain difficulty
Training approach: Progressive curriculum learning, from simple hovering → precision landings on extreme terrain
```bash
# Python 3.8+ required
pip install stable-baselines3[extra] gymnasium numpy matplotlib tensorboard
```

Important: Basilisk is not included in this repository and must be installed separately:

```bash
# Option 1: Install from PyPI (recommended)
pip install Basilisk

# Option 2: Build from source
# See: https://hanspeterschaub.info/basilisk/
```

Note: The code expects Basilisk to be importable as a Python module. If you build from source, ensure the built dist3 directory is in your Python path.
| Mode | Duration | Purpose |
|---|---|---|
| `test` | 2 min | Verify environment setup |
| `demo` | 15 min | Quick demonstration of curriculum learning |
| `standard` | 1-2 hrs | Direct RL training without curriculum (2M timesteps default) |
| `curriculum` | 2-4 hrs | Progressive difficulty training with GPU (best results) |
| `eval` | 1-2 min | Evaluate trained models |
- Hover Training → Learn altitude/attitude control
- Simple Descent → Controlled descent from moderate altitude
- Precision Landing → Land softly near the target position
- Challenging Terrain → Handle complex lunar terrain
- Extreme Conditions → Master worst-case scenarios
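Conceptually, each stage pairs an initial-condition range with advancement criteria (a mean reward threshold plus the 60%+ success rate noted below). A hypothetical sketch; none of these field names or numbers are the actual values in `unified_training.py`:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class CurriculumStage:
    """Illustrative container for one curriculum stage (hypothetical fields)."""
    name: str
    altitude_range: Tuple[float, float]   # initial altitude range (m)
    speed_range: Tuple[float, float]      # initial speed range (m/s)
    reward_threshold: float               # mean reward required to advance
    min_success_rate: float = 0.6         # 60%+ landings also required

# Purely illustrative numbers; the real stages are defined in unified_training.py.
STAGES = [
    CurriculumStage("hover",               (50, 100),   (0, 1),   200.0),
    CurriculumStage("simple_descent",      (200, 400),  (0, 5),   400.0),
    CurriculumStage("precision_landing",   (400, 800),  (0, 10),  700.0),
    CurriculumStage("challenging_terrain", (600, 1000), (5, 15),  800.0),
    CurriculumStage("extreme_conditions",  (800, 1500), (10, 25), 900.0),
]
```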
- PPO (Proximal Policy Optimization) - Default, stable, general-purpose
- SAC (Soft Actor-Critic) - Sample efficient, good exploration
- TD3 (Twin Delayed DDPG) - Continuous control, deterministic
```bash
# Try different algorithms
python unified_training.py --mode curriculum --algorithm ppo
python unified_training.py --mode curriculum --algorithm sac
```

```bash
# Launch TensorBoard
tensorboard --logdir=./logs

# Open a browser to http://localhost:6006
```

Key metrics:
- `rollout/ep_rew_mean` - Average episode reward (primary metric)
- `curriculum/current_stage` - Current training stage
- `rollout/ep_len_mean` - Episode length
```
.
├── unified_training.py            # Main training script (all modes)
├── lunar_lander_env.py            # Gymnasium environment
├── ScenarioLunarLanderStarter.py  # Basilisk simulation setup
├── generate_terrain.py            # Terrain generation utilities
├── terrain_simulation.py          # Lunar regolith physics model
├── common_utils.py                # Shared utility functions
├── starship_constants.py          # Starship HLS physical constants
│
├── README.md                      # This file - quick start guide
├── REWARD_SYSTEM_GUIDE.md         # Comprehensive reward system documentation
├── PRODUCTION_CHECKLIST.md        # Production deployment guide
│
├── basilisk/                      # Astrodynamics simulation framework
├── generated_terrain/             # Generated terrain heightmaps
├── models/                        # Saved trained models
└── logs/                          # TensorBoard logs
```
```bash
# Quick test
python unified_training.py --mode test

# Demo training
python unified_training.py --mode demo

# Full curriculum training (recommended)
python unified_training.py --mode curriculum --n-envs 4

# Standard training (no curriculum) - default 2M timesteps
python unified_training.py --mode standard --timesteps 2000000

# Resume training from checkpoint
python unified_training.py --mode standard --resume ./models/checkpoints/ppo_lunar_lander_500000_steps --timesteps 500000

# Resume curriculum training (automatic state restoration)
python unified_training.py --mode curriculum --resume-curriculum

# Evaluate model
python unified_training.py --mode eval --model-path ./models/best_model/best_model --eval-episodes 20

# Evaluate with visualization
python unified_training.py --mode eval --model-path ./models/best_model/best_model --render
```

The training system includes comprehensive save/resume functionality:
- Every 100,000 timesteps: Model + VecNormalize stats saved
- Curriculum state: Stage progress, attempts, and performance tracked
- Best model: Automatically saved based on evaluation performance (every 10,000 timesteps)
```bash
# Standard training - resume from checkpoint
python unified_training.py --mode standard \
    --resume ./models/checkpoints/ppo_lunar_lander_500000_steps \
    --timesteps 500000  # Additional timesteps to train

# Curriculum training - automatic state restoration
python unified_training.py --mode curriculum --resume-curriculum
```

```
models/
├── training_state.json          # Training progress (human-readable)
├── curriculum_state.pkl         # Complete state (binary)
├── checkpoints/                 # Regular checkpoints (every 100k steps)
│   ├── ppo_lunar_lander_100000_steps.zip
│   ├── ppo_lunar_lander_200000_steps.zip
│   └── vecnormalize.pkl         # Normalization statistics
├── stage*_checkpoints/          # Per-stage checkpoints (curriculum)
├── stage*_vecnormalize.pkl      # Per-stage normalization
└── best_model/                  # Best performing model
    └── best_model.zip
```
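Because each checkpoint pairs a model zip with `vecnormalize.pkl`, both must be restored together when loading a model outside the training script. A minimal sketch using the standard Stable Baselines3 API (the environment class name `LunarLanderEnv` is an assumption about `lunar_lander_env.py`):

```python
# Sketch: load a checkpoint together with its VecNormalize statistics.
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

from lunar_lander_env import LunarLanderEnv  # assumed env class name

env = DummyVecEnv([lambda: LunarLanderEnv()])
# Restore the normalization stats saved alongside the checkpoint.
env = VecNormalize.load("./models/checkpoints/vecnormalize.pkl", env)
env.training = False     # freeze running statistics for evaluation
env.norm_reward = False  # report raw (unnormalized) rewards

model = PPO.load("./models/checkpoints/ppo_lunar_lander_200000_steps", env=env)

obs = env.reset()
done = [False]
while not done[0]:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
```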
```bash
# Gracefully stop training (saves state)
# Press Ctrl+C during training

# Resume automatically (curriculum mode)
python unified_training.py --mode curriculum --resume-curriculum

# Or manually specify a checkpoint (standard mode)
python unified_training.py --mode standard --resume ./models/checkpoints/ppo_lunar_lander_450000_steps
```

After full curriculum training:
- Mean reward: 800-1200 on extreme conditions (terminal 1000 + bonuses up to 400)
- Success rate: 60%+ successful landings (curriculum requires this for advancement)
- Landing criteria: Altitude < 5 m, vertical velocity < 3 m/s, horizontal speed < 2 m/s, attitude < 15° from upright
- Fuel efficiency: Bonus up to +150 points for high fuel remaining (only awarded on successful landing)
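The landing criteria translate directly into a success predicate. A sketch with thresholds copied from the list above (the helper function itself is hypothetical, not from the codebase):

```python
def is_successful_landing(altitude: float, v_vertical: float,
                          v_horizontal: float, tilt_deg: float) -> bool:
    """Success predicate built from the landing criteria above."""
    return (altitude < 5.0                 # meters above terrain
            and abs(v_vertical) < 3.0      # m/s vertical at touchdown
            and v_horizontal < 2.0         # m/s horizontal speed
            and tilt_deg < 15.0)           # degrees from upright
```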
The reward system uses a comprehensive multi-component architecture designed to guide the agent from initial random actions to optimal landing performance.
```
Total Reward = Terminal Rewards + Progress Tracking + Safety/Efficiency + Control Quality
               (±1000 scale)      (0-5 scale)          (±2 scale)          (±1 scale)
```
Successful Landing (+1000 to +1600 points):
- Base success: +1000 (largest single component)
- Precision bonus: +0 to +200 (scales with distance from target)
- Softness bonus: +0 to +100 (scales with touchdown velocity)
- Attitude bonus: +0 to +100 (scales with upright orientation)
- Fuel efficiency: +0 to +150 (quadratic curve, ONLY on success)
- Control smoothness: +0 to +50 (rewards a stable final approach)

Hard Landing (-300 to -450 points):
- Base penalty scaled by violation severity (velocity, position, attitude errors)

Crash (-400 to -800 points):
- Penalty scales with impact energy (velocity squared)

High-Altitude Failure (-200 to -400 points):
- Penalty scales with altitude (the higher the failure, the larger the penalty)
- Descent Profile (0-1): Encourages a descent rate proportional to altitude (-2 to -10 m/s)
- Approach Angle (0-0.5): Rewards vertical descent near the ground (low horizontal-to-vertical velocity ratio)
- Proximity to Target (0-1): Progressive reward for getting closer to the landing site (active below 200 m altitude)
- Attitude Stability (0-0.5): Rewards upright orientation near the ground (below 100 m altitude)
- Final Approach Quality (0-2): High reward in the last 50 m for being slow, upright, and on target
Danger Zone Warnings (altitude < 50 m):
- Speed danger: 0 to -1 (excessive vertical velocity)
- Tilt danger: 0 to -0.5 (tilted orientation)
- Lateral danger: 0 to -0.5 (high horizontal velocity)

Fuel Management: 0 to -1 (progressive warning as fuel depletes below 10%)
High-Altitude Loitering: -0.5 (discourages hovering above 500 m)
Control Effort: -0.001 × effort (encourages efficient control)
Control Jitter: -0.1 × change (penalizes rapid control changes)
Spin Rate: 0 to -0.5 (discourages uncontrolled rotation)
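To make the scales concrete, a hedged sketch of how two of these terms might be computed: the quadratic fuel bonus and the per-step control penalties. The coefficients match the ranges above, but the exact formulas live in `lunar_lander_env.py`:

```python
import numpy as np

def fuel_efficiency_bonus(fuel_fraction: float) -> float:
    """Quadratic bonus up to +150, awarded only on a successful landing.

    fuel_fraction: remaining fuel in [0, 1]. Illustrative curve only.
    """
    return 150.0 * fuel_fraction ** 2

def control_penalties(action: np.ndarray, prev_action: np.ndarray) -> float:
    """Per-step control-quality penalties from the ranges above."""
    effort_penalty = -0.001 * float(np.sum(np.abs(action)))               # effort
    jitter_penalty = -0.1 * float(np.sum(np.abs(action - prev_action)))   # jitter
    return effort_penalty + jitter_penalty

# Example: landing with 60% fuel left earns 150 * 0.36 = +54 bonus points.
print(fuel_efficiency_bonus(0.6))  # 54.0
```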
- Terminal rewards dominate (10x larger than shaping) - clearly signaling success/failure
- Progressive difficulty - rewards increase as agent approaches landing
- Fuel efficiency rewarded ONLY on success - prevents hoarding during flight
- Multi-dimensional success criteria - velocity, position, attitude, fuel all matter
- Penalties scale with severity - worse violations get worse penalties
- Action smoothing - exponential moving average (80% old, 20% new) for stable control
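The action-smoothing filter mentioned above is a plain exponential moving average. A minimal sketch, using the 80%/20% weights from the list (the class itself is illustrative, not the project's implementation):

```python
import numpy as np

class ActionSmoother:
    """Exponential moving average over actions: 80% old, 20% new."""

    def __init__(self, action_dim: int, alpha: float = 0.2):
        self.alpha = alpha                   # weight of the new action
        self.smoothed = np.zeros(action_dim)

    def __call__(self, raw_action: np.ndarray) -> np.ndarray:
        self.smoothed = (1 - self.alpha) * self.smoothed + self.alpha * raw_action
        return self.smoothed

smoother = ActionSmoother(action_dim=15)  # 15-D action space (see Technical Details)
```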
| Outcome | Cumulative Reward | Description |
|---|---|---|
| Perfect Landing | 1200-1600 | All bonuses achieved (precision, fuel, smoothness) |
| Good Landing | 900-1200 | Most bonuses achieved |
| Basic Landing | 600-900 | Minimal bonuses, but successful |
| Poor Landing | 400-600 | Barely meets success criteria |
| Hard Landing | -300 to -450 | Touches down too hard, but within the landing zone |
| Crash | -400 to -800 | Impact below surface |
| Timeout/Abort | -200 to -400 | Failure at high altitude |
The system tracks individual reward components for debugging and analysis:

```bash
# Launch TensorBoard to view the reward component breakdown
tensorboard --logdir=./logs

# Navigate to the "reward_components" section to see:
# - terminal_success, precision_bonus, fuel_efficiency, etc.
# - descent_profile, approach_angle, proximity, etc.
# - danger penalties, control quality metrics
```

Each episode's `info` dict includes a `reward_components` entry with a detailed breakdown of all reward sources.
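The breakdown can also be read programmatically after a rollout. A sketch using the standard Gymnasium loop (the env class name `LunarLanderEnv` is an assumption; `reward_components` is the key documented above):

```python
from lunar_lander_env import LunarLanderEnv  # assumed env class name

env = LunarLanderEnv()
obs, info = env.reset()
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()  # random policy, for illustration
    obs, reward, terminated, truncated, info = env.step(action)

# Per-source breakdown of the final step's reward
for name, value in info["reward_components"].items():
    print(f"{name:>24s}: {value:+.3f}")
```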
If the agent is not landing:
- Increase `success_threshold` in the curriculum stages
- Increase `min_episodes` for better mastery
- Check TensorBoard for the `episode/success_rate_100` metric
If the agent is too cautious (hovers):
- Increase loitering penalty
- Reduce fuel efficiency bonus coefficient
- Add progressive time penalty
If the agent crashes frequently:
- Increase danger zone penalties
- Reduce initial velocity range in curriculum
- Add more stages to curriculum
If the agent uses too much fuel:
- Increase fuel efficiency bonus coefficient
- Add fuel consumption penalty during flight
- Reward descent profile adherence more
- README.md - This file: Quick start guide and overview
- REWARD_SYSTEM_GUIDE.md - Comprehensive reward system documentation with tuning guide
- PRODUCTION_CHECKLIST.md - Production deployment checklist and validation guide
- unified_training.py - Main training script with curriculum learning
- lunar_lander_env.py - Gymnasium environment implementation
- ScenarioLunarLanderStarter.py - Basilisk simulation setup
- generate_terrain.py - Procedural lunar terrain generation
- terrain_simulation.py - Lunar regolith physics (Bekker-Wong model)
- starship_constants.py - Starship HLS physical constants
```bash
# 1. First-time setup verification
python unified_training.py --mode test

# 2. Quick demo to understand the system
python unified_training.py --mode demo

# 3. Start full curriculum training
python unified_training.py --mode curriculum --n-envs 4 --algorithm ppo

# 4. Monitor progress (in a separate terminal)
tensorboard --logdir=./logs

# 5. After training completes, evaluate
python unified_training.py --mode eval \
    --model-path ./models/curriculum_final \
    --eval-episodes 20
```

```bash
# Generate custom terrain
python generate_terrain.py \
    --output generated_terrain/custom_terrain.npy \
    --size 2000 \
    --craters 25 \
    --seed 42 \
    --visualize
```

Edit unified_training.py to customize:
- Curriculum stages and difficulty progression
- Model hyperparameters (learning rate, network architecture)
- Environment configuration (sensors, terrain, initial conditions)
- Success thresholds and advancement criteria
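The heightmaps produced by the `generate_terrain.py` command above are plain NumPy arrays, so they can be inspected directly. A sketch (the path matches the generation command):

```python
import numpy as np
import matplotlib.pyplot as plt

terrain = np.load("generated_terrain/custom_terrain.npy")
print(f"Heightmap shape: {terrain.shape}, elevation range: "
      f"{terrain.min():.1f} to {terrain.max():.1f} m")

# Visualize craters and slopes as a grayscale elevation map.
plt.imshow(terrain, cmap="gray")
plt.colorbar(label="elevation (m)")
plt.title("Generated lunar terrain")
plt.show()
```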
```bash
# Use more parallel environments
python unified_training.py --mode curriculum --n-envs 8

# Or try a faster algorithm
python unified_training.py --mode curriculum --algorithm sac
```

```bash
# Use curriculum learning (automatic difficulty progression)
python unified_training.py --mode curriculum

# Or train longer
python unified_training.py --mode standard --timesteps 2000000
```

```bash
# Run diagnostic test
python unified_training.py --mode test

# If the test fails, check dependencies
pip install --upgrade stable-baselines3[extra] gymnasium
```

- Basilisk: High-fidelity spacecraft dynamics with 6-DOF rigid-body simulation
- Gravity: Lunar gravitational field (μ = 4.9028×10¹² m³/s²)
- Propulsion: 3 Raptor Vacuum engines (2.5 MN thrust each, 40-100% throttle)
- Sensors: IMU (noisy), LIDAR (64-ray cone scan), altimeter, fuel gauges, attitude sensors
- Terrain: Analytical Bekker-Wong model with procedural crater generation and realistic regolith mechanics
- Framework: Stable Baselines3 (PyTorch-based)
- Observation: 32-dimensional state vector (position, velocity, Euler angles, fuel flow rate, time-to-impact, LIDAR azimuthal bins, IMU)
- Observation normalization: VecNormalize for zero-mean, unit-variance observations (improves stability)
- Action: 15-dimensional continuous control (see the space sketch after this list):
  - Primary engine throttles (3): individual throttle [0.4-1.0]
  - Primary engine gimbals (6): pitch/yaw per engine [-8°, +8°]
  - Mid-body thruster groups (3): rotation control [0, 1]
  - RCS thruster groups (3): pitch/yaw/roll [0, 1]
- Action smoothing: Exponential moving average filter (80% old, 20% new) for stable control
- Reward: Comprehensive multi-component system with:
  - Terminal rewards (±1000): Dominant signals for success/failure
  - Progress tracking (0-5): Continuous guidance toward landing
  - Safety penalties (±2): Danger-zone warnings and efficiency
  - Control quality (±1): Smooth control and technique optimization
- Curriculum Learning: 5 progressive stages with advancement requiring both mean reward threshold AND 60%+ success rate
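Put together, the interface is a standard Gymnasium continuous-control environment. A sketch of the space definitions implied by the list above (observation bounds and the normalized gimbal range are illustrative assumptions, not the actual definitions in `lunar_lander_env.py`):

```python
import numpy as np
from gymnasium import spaces

# 32-D observation: position, velocity, Euler angles, fuel flow rate,
# time-to-impact, LIDAR azimuthal bins, IMU readings.
observation_space = spaces.Box(low=-np.inf, high=np.inf,
                               shape=(32,), dtype=np.float32)

# 15-D action: 3 throttles [0.4, 1.0], 6 gimbal channels (normalized to
# [-1, 1], mapped to ±8°), 3 mid-body thruster groups [0, 1], 3 RCS groups [0, 1].
low  = np.array([0.4] * 3 + [-1.0] * 6 + [0.0] * 3 + [0.0] * 3, dtype=np.float32)
high = np.array([1.0] * 3 + [ 1.0] * 6 + [1.0] * 3 + [1.0] * 3, dtype=np.float32)
action_space = spaces.Box(low=low, high=high, dtype=np.float32)
```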
See LICENSE file for details.
- ✅ Curriculum learning for robust policy development
- ✅ Multiple RL algorithms (PPO, SAC, TD3)
- ✅ High-fidelity physics via Basilisk
- ✅ Procedural terrain generation
- ✅ Real-time monitoring with TensorBoard
- ✅ Checkpoint system for resuming training
- ✅ Comprehensive evaluation tools
Ready to train an AI to land on the Moon?

Start here:

```bash
python unified_training.py --mode test
```

For more details, see the Documentation section above.