This repository was archived by the owner on Feb 24, 2026. It is now read-only.
Improving PPO #9
Open
Labels: enhancement (New feature or request)
Description
Reproducing the official PPO implementation
Please check the link above for details.
- 13 core implementation details
- 9 Atari specific implementation details
- 9 implementation details for robotics tasks (with continuous action spaces)
- 5 LSTM implementation details
- 1 MultiDiscrete implementation detail
Core implementation details
- Vectorized architecture
- Orthogonal Initialization of Weights and Constant Initialization of biases
- The policy output layer weights are initialized with a scale of 0.01; the value output layer weights are initialized with a scale of 1
- torch.nn.init.orthogonal_
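A minimal PyTorch sketch of this initialization scheme (the layer sizes here are illustrative, not prescribed by the source):

```python
import torch
import torch.nn as nn

def layer_init(layer, std=2 ** 0.5, bias_const=0.0):
    # Orthogonal weight initialization with a given gain,
    # constant initialization for biases
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer

# Hidden layers typically use gain sqrt(2); the policy head uses
# scale 0.01 and the value head uses scale 1, per the detail above.
policy_head = layer_init(nn.Linear(64, 4), std=0.01)
value_head = layer_init(nn.Linear(64, 1), std=1.0)
```

The small 0.01 scale on the policy head keeps initial action logits near uniform, so early exploration is not biased toward arbitrary actions.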
- The Adam Optimizer’s Epsilon Parameter
- PPO sets the epsilon parameter to 1e-5
- Adam Learning Rate Annealing
- In MuJoCo, the learning rate linearly decays from 3e-4 to 0
- Atari games set the learning rate to linearly decay from 2.5e-4 to 0
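Both details can be sketched together in PyTorch; the model and update count are placeholders:

```python
import torch

model = torch.nn.Linear(8, 2)  # stand-in for the policy/value network
total_updates = 100

# eps=1e-5 instead of PyTorch's Adam default of 1e-8
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, eps=1e-5)

for update in range(total_updates):
    # Linearly anneal the learning rate from 3e-4 to 0 (MuJoCo schedule)
    frac = 1.0 - update / total_updates
    optimizer.param_groups[0]["lr"] = frac * 3e-4
    # ... rollout collection and PPO update would go here ...
```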
- Generalized Advantage Estimation
- Terminations caused by environment time limits must be counted as non-terminal in the target value calculation, i.e. the return should still bootstrap from the next value estimate instead of treating the timeout as a true terminal state
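A NumPy sketch of the GAE backward recursion (argument names are illustrative; timeout truncations should arrive with done=0 so the target still bootstraps):

```python
import numpy as np

def compute_gae(rewards, values, dones, next_value, next_done,
                gamma=0.99, lam=0.95):
    # dones[t] marks a true terminal; a time-limit truncation should be
    # recorded as done=0 so the value target keeps bootstrapping.
    rewards, values, dones = map(np.asarray, (rewards, values, dones))
    T = len(rewards)
    advantages = np.zeros(T)
    lastgaelam = 0.0
    for t in reversed(range(T)):
        if t == T - 1:
            nextnonterminal, nextvalue = 1.0 - next_done, next_value
        else:
            nextnonterminal, nextvalue = 1.0 - dones[t + 1], values[t + 1]
        delta = rewards[t] + gamma * nextvalue * nextnonterminal - values[t]
        lastgaelam = delta + gamma * lam * nextnonterminal * lastgaelam
        advantages[t] = lastgaelam
    return advantages, advantages + values  # advantages, value targets
```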
- Mini-batch Updates
- Normalization of Advantages
- After calculating the advantages based on GAE, PPO normalizes the advantages by subtracting their mean and dividing them by their standard deviation. In particular, this normalization happens at the minibatch level instead of the whole batch level!
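A sketch of minibatch iteration with per-minibatch advantage normalization (batch sizes are the common defaults, not mandated):

```python
import numpy as np

batch_size, minibatch_size = 2048, 64
advantages = np.random.randn(batch_size)  # stand-in for GAE output

b_inds = np.arange(batch_size)
np.random.shuffle(b_inds)
for start in range(0, batch_size, minibatch_size):
    mb_inds = b_inds[start:start + minibatch_size]
    mb_adv = advantages[mb_inds]
    # Normalize at the minibatch level, not over the whole batch
    mb_adv = (mb_adv - mb_adv.mean()) / (mb_adv.std() + 1e-8)
    # ... compute losses on this minibatch ...
```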
- Value Function Loss Clipping
- Overall Loss and Entropy Bonus
- Global Gradient Clipping
- Debug variables
- policy_loss
- value_loss
- entropy_loss
- clipfrac
- approxkl
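The loss-related details and debug variables above can be combined into one sketch; the function name and signature are illustrative, and the coefficients follow common PPO defaults:

```python
import torch

def ppo_losses(newlogprob, oldlogprob, mb_advantages, newvalue, oldvalue,
               mb_returns, entropy, clip_coef=0.2, ent_coef=0.0, vf_coef=0.5):
    logratio = newlogprob - oldlogprob
    ratio = logratio.exp()

    # Clipped surrogate policy objective
    pg_loss1 = -mb_advantages * ratio
    pg_loss2 = -mb_advantages * torch.clamp(ratio, 1 - clip_coef, 1 + clip_coef)
    pg_loss = torch.max(pg_loss1, pg_loss2).mean()

    # Value function loss clipping, mirroring the policy clip range
    v_unclipped = (newvalue - mb_returns) ** 2
    v_clipped = oldvalue + torch.clamp(newvalue - oldvalue, -clip_coef, clip_coef)
    v_loss = 0.5 * torch.max(v_unclipped, (v_clipped - mb_returns) ** 2).mean()

    # Overall loss with entropy bonus
    entropy_loss = entropy.mean()
    loss = pg_loss - ent_coef * entropy_loss + vf_coef * v_loss

    # Debug variables
    with torch.no_grad():
        approxkl = ((ratio - 1) - logratio).mean()
        clipfrac = ((ratio - 1.0).abs() > clip_coef).float().mean()
    return loss, pg_loss, v_loss, entropy_loss, approxkl, clipfrac
```

After `loss.backward()`, global gradient clipping caps the combined gradient norm of all parameters, e.g. `torch.nn.utils.clip_grad_norm_(params, 0.5)`.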
- Shared and separate MLP networks for policy and value functions
Details for continuous action domains (e.g. Mujoco)
- Continuous actions via normal distributions
- State-independent log standard deviation
- Independent action components
- Separate MLP networks for policy and value functions
- Handling of action clipping to valid range and storage
- The original unclipped action is stored as part of the episodic data
- Applying a squashing function (tanh) to the Gaussian samples to satisfy the action constraints works better
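A PyTorch sketch of a diagonal-Gaussian policy covering these details (class name and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean = nn.Linear(obs_dim, act_dim)
        # State-independent log standard deviation, initialized to 0
        self.logstd = nn.Parameter(torch.zeros(act_dim))

    def get_action(self, obs):
        dist = torch.distributions.Normal(self.mean(obs), self.logstd.exp())
        action = dist.sample()
        # Independent action components: sum log-probs over the action dim
        logprob = dist.log_prob(action).sum(-1)
        return action, logprob

policy = GaussianPolicy(4, 2)
obs = torch.randn(4)
action, logprob = policy.get_action(obs)
# Store the unclipped action in the rollout buffer;
# clip only the copy sent to the environment
clipped = torch.clamp(action, -1.0, 1.0)
```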
- Normalization of Observation
- VecNormalize: the raw observation is normalized by subtracting its running mean and dividing by the square root of its running variance
- Observation Clipping
- Reward Scaling
- Reward Clipping
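A simplified NumPy sketch in the spirit of VecNormalize's observation path (the class and the batch-update scheme are illustrative; baselines also applies an analogous running scaling and clipping to rewards):

```python
import numpy as np

class RunningNorm:
    # Tracks a running mean/variance and returns normalized, clipped values
    def __init__(self, shape, clip=10.0, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps
        self.clip = clip

    def __call__(self, x):
        # Batch update of the running statistics
        batch_mean, batch_var, n = x.mean(0), x.var(0), x.shape[0]
        delta = batch_mean - self.mean
        tot = self.count + n
        self.mean = self.mean + delta * n / tot
        m_a = self.var * self.count
        m_b = batch_var * n
        self.var = (m_a + m_b + delta ** 2 * self.count * n / tot) / tot
        self.count = tot
        # Normalize by the running std and clip to [-clip, clip]
        y = (x - self.mean) / np.sqrt(self.var + 1e-8)
        return np.clip(y, -self.clip, self.clip)
```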
MuJoCo defaults (openai/baselines ppo2): nsteps=2048, nminibatches=32, lam=0.95, gamma=0.99, noptepochs=10, log_interval=1, ent_coef=0.0, lr=lambda f: 3e-4 * f, cliprange=0.2, value_network='copy'
LSTM implementation details
- Layer initialization for LSTM layer
- Initialize the LSTM states to be zeros
- Reset LSTM states at the end of the episode
- Prepare sequential rollouts in mini-batches
- Reconstruct LSTM states during training, replaying the stored sequences (with their done flags) through the network so the hidden states match the current weights
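A PyTorch sketch of the initialization and per-step state reset (sizes and the `done` mask are illustrative):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(64, 128)
# Layer initialization: orthogonal weights, zero biases
for name, param in lstm.named_parameters():
    if "weight" in name:
        nn.init.orthogonal_(param, 1.0)
    elif "bias" in name:
        nn.init.constant_(param, 0.0)

# Initialize the LSTM states to zeros
num_envs = 4
h = torch.zeros(1, num_envs, 128)
c = torch.zeros(1, num_envs, 128)
x = torch.randn(1, num_envs, 64)
done = torch.tensor([0.0, 1.0, 0.0, 0.0])  # env 1 just finished an episode

# Reset the state of any environment whose episode ended
mask = (1.0 - done).view(1, -1, 1)
out, (h, c) = lstm(x, (h * mask, c * mask))
```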
Auxiliary implementation details
- Clip Range Annealing