A simulation pipeline for training a residual reinforcement learning policy on a MuJoCo humanoid that can recover from pushes. The system uses a classical stabilization controller as the base policy and SAC (Soft Actor-Critic) for learning residual actions.
- Pre-built MuJoCo Models: Uses gymnasium's built-in Humanoid-v5 model (no custom XML files needed)
- Base Control: Classical PD (Proportional-Derivative) stabilization controller
- Residual RL: SAC algorithm learns residual actions on top of base control
- Push Disturbances: Configurable random push forces applied to the humanoid torso
- Training Pipeline: Complete training script with evaluation and checkpointing
Install dependencies:

```bash
pip install -r requirements.txt
```

Train a residual RL agent:
```bash
python train.py --total_timesteps 1000000 --base_controller pd --push_probability 0.1
```

Key arguments:
- `--total_timesteps`: Total training timesteps (default: 1,000,000)
- `--base_controller`: Base controller type: "pd" or "lqr" (default: "pd")
- `--push_probability`: Probability of a push per step (default: 0.1)
- `--push_force_min`: Minimum push force (default: 50.0)
- `--push_force_max`: Maximum push force (default: 200.0)
- `--log_dir`: Directory for logs and TensorBoard output (default: "./logs")
- `--save_dir`: Directory for saved models (default: "./models")
- `--eval_freq`: Evaluation frequency in timesteps (default: 10000)
- `--render_eval`: Render during evaluation episodes (slows training)
- `--vis_freq`: Visualize every N timesteps during training (default: 50000; set to 0 to disable)
- `--vis_episodes`: Number of episodes to visualize each time (default: 1)
Visualize with base controller only (no trained agent):
```bash
python visualize.py --base_controller pd --push_probability 0.1
```

Visualize with a trained agent:

```bash
python visualize.py --model_path ./models/best_model.zip --speed 1.0
```

Key arguments:
- `--model_path`: Path to trained model (optional; if not provided, shows the base controller only)
- `--base_controller`: Base controller type: "pd" or "lqr" (default: "pd")
- `--push_probability`: Probability of a push per step (default: 0.1)
- `--push_force_min`: Minimum push force (default: 50.0)
- `--push_force_max`: Maximum push force (default: 200.0)
- `--speed`: Simulation speed multiplier (default: 1.0; use 2.0 for 2x speed)
- `--deterministic`: Use deterministic policy for trained models (default: True)
Quick test with visualization:
```bash
python test_env.py --render
```

Evaluate a trained model:

```bash
python evaluate.py --model_path ./models/best_model.zip --n_episodes 10 --render
```

Key arguments:
- `--model_path`: Path to trained model
- `--vec_normalize_path`: Path to normalization stats (optional)
- `--n_episodes`: Number of evaluation episodes (default: 10)
- `--render`: Render episodes during evaluation
- `--deterministic`: Use deterministic policy (default: True)
HumanoidPushEnv wraps gymnasium's Humanoid-v5 environment and adds:
- Push disturbance mechanism (random forces applied to torso)
- Integration with base controller
- Residual action space for RL agent
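The push mechanism can be sketched as follows. This is a minimal illustration, not the repo's actual code: `sample_push_force` and its signature are assumptions.

```python
import numpy as np

def sample_push_force(rng, push_probability=0.1, force_min=50.0, force_max=200.0):
    """Sketch of the push-disturbance sampling: with some probability per
    step, draw a random horizontal force of random magnitude and direction.
    Returns an [fx, fy, fz] vector (zeros when no push triggers this step)."""
    if rng.random() >= push_probability:
        return np.zeros(3)
    magnitude = rng.uniform(force_min, force_max)
    angle = rng.uniform(0.0, 2.0 * np.pi)
    # Pushes are horizontal: the z-component stays zero.
    return np.array([magnitude * np.cos(angle),
                     magnitude * np.sin(angle),
                     0.0])
```

In MuJoCo, such a force would typically be applied by writing it into `data.xfrc_applied` for the torso body before stepping the simulation, and clearing it afterwards.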
PDStabilizationController provides classical control:
- Maintains upright posture
- Stabilizes base position
- Provides baseline policy for residual RL to improve upon
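The idea behind the controller can be sketched in a few lines; the class name, interface, and gains below are illustrative, and the real `PDStabilizationController` may differ.

```python
import numpy as np

class SimplePDController:
    """Hedged sketch of a PD stabilization controller: drive the joints
    toward a target (upright) posture while damping joint velocities."""

    def __init__(self, target_qpos, kp=5.0, kd=0.5):
        self.target_qpos = np.asarray(target_qpos, dtype=float)
        self.kp = kp  # proportional gain (illustrative value)
        self.kd = kd  # derivative gain (illustrative value)

    def compute_action(self, qpos, qvel):
        # Proportional term pulls joints toward the target posture;
        # derivative term damps velocities to reduce oscillation.
        error = self.target_qpos - np.asarray(qpos, dtype=float)
        return self.kp * error - self.kd * np.asarray(qvel, dtype=float)
```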
train.py runs the training pipeline:
- Creates the base controller and environment
- Trains SAC agent with residual actions
- Evaluates periodically and saves checkpoints
- Uses VecNormalize for observation/reward normalization
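To make the normalization step concrete, here is a minimal sketch of what stable-baselines3's `VecNormalize` does for observations: track running statistics and whiten each input. (The real class also normalizes rewards and clips its outputs.)

```python
import numpy as np

class RunningNormalizer:
    """Running mean/variance tracker for observation normalization."""

    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps
        self.eps = eps

    def update(self, x):
        # Incremental update of the running mean and population variance.
        x = np.asarray(x, dtype=float)
        delta = x - self.mean
        self.count += 1.0
        self.mean += delta / self.count
        self.var += (delta * (x - self.mean) - self.var) / self.count

    def normalize(self, x):
        return (np.asarray(x, dtype=float) - self.mean) / np.sqrt(self.var + self.eps)
```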
evaluate.py measures a trained policy's performance:
- Loads the trained model
- Evaluates performance across multiple episodes
- Computes statistics (mean reward, success rate, etc.)
- Optional rendering for visualization
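The statistics step can be sketched as below. The "success" criterion used here (the humanoid survives the full episode; Humanoid-v5 truncates at 1000 steps) is an assumption for illustration, and the repo may define success differently.

```python
import numpy as np

def summarize_episodes(episode_rewards, episode_lengths, success_length=1000):
    """Aggregate per-episode results into summary statistics."""
    rewards = np.asarray(episode_rewards, dtype=float)
    lengths = np.asarray(episode_lengths)
    return {
        "mean_reward": float(rewards.mean()),
        "std_reward": float(rewards.std()),
        "mean_length": float(lengths.mean()),
        # Fraction of episodes that reached the full episode length.
        "success_rate": float((lengths >= success_length).mean()),
    }
```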
- Base Control: The PD controller computes base actions to maintain stability
- Residual Action: The SAC agent learns to output residual actions (additions to base control)
- Combined Action: Total action = base_action + residual_action
- Learning: SAC optimizes residual actions to improve recovery from pushes
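The combination step above reduces to an addition followed by clipping to the actuator limits. A minimal sketch, assuming Humanoid-v5's action bounds of ±0.4 (treat the exact bounds and clipping as assumptions about the implementation):

```python
import numpy as np

def combine_actions(base_action, residual_action, low=-0.4, high=0.4):
    """Residual-RL action composition: total = base + residual, clipped
    to the action-space bounds so the sum remains a valid action."""
    total = np.asarray(base_action, dtype=float) + np.asarray(residual_action, dtype=float)
    return np.clip(total, low, high)
```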
```
.
├── logs/                  # TensorBoard logs and evaluation results
├── models/                # Saved models and checkpoints
│   ├── best_model.zip
│   ├── final_model.zip
│   ├── checkpoints/
│   └── vec_normalize.pkl
└── ...
```
View training progress with TensorBoard:
```bash
tensorboard --logdir ./logs
```

Modify push parameters in training:

```bash
python train.py --push_probability 0.2 --push_force_min 100 --push_force_max 300
```

Use the LQR controller instead of PD:

```bash
python train.py --base_controller lqr
```

Modify `_compute_reward()` in `residual_rl/env.py` to adjust the reward function.
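For illustration, a shaping term of the kind one might add there could look like the following. Every name, weight, and target value below is hypothetical, not the repo's actual reward code.

```python
def shaped_recovery_reward(base_reward, torso_height, torso_upright,
                           height_target=1.2, upright_weight=0.5):
    """Hypothetical reward shaping for push recovery: keep the stock
    environment reward, then add posture terms.

    torso_upright is assumed to be the alignment of the torso z-axis
    with world-up (1.0 = perfectly upright, 0.0 = horizontal)."""
    # Penalize deviation from a nominal standing torso height...
    height_bonus = -abs(torso_height - height_target)
    # ...and reward keeping the torso upright.
    return base_reward + height_bonus + upright_weight * torso_upright
```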
- Python 3.8+
- MuJoCo 3.0+
- PyTorch
- stable-baselines3
- gymnasium
- The environment uses gymnasium's built-in Humanoid-v5 model
- Base controller uses PD control with configurable gains
- SAC hyperparameters can be adjusted in `train.py`
- Push disturbances are applied horizontally to the torso
MIT License