This repository was archived by the owner on Feb 24, 2026. It is now read-only.

Improving PPO #9

@TolgaOk

Description


Reproducing the official PPO implementation

Please check the link above for details.

  • 13 core implementation details
  • 9 Atari specific implementation details
  • 9 implementation details for robotics tasks (with continuous action spaces)
  • 5 LSTM implementation details
  • 1 MultiDiscrete implementation detail

Core implementation details

  • Vectorized architecture
  • Orthogonal Initialization of Weights and Constant Initialization of biases
    • The policy output layer weights are initialized with a scale (gain) of 0.01; the value output layer weights are initialized with a scale of 1.0
    • torch.nn.init.orthogonal_
  • The Adam Optimizer’s Epsilon Parameter
    • PPO sets the epsilon parameter to 1e-5
  • Adam Learning Rate Annealing
    • In MuJoCo, the learning rate linearly decays from 3e-4 to 0
    • Atari games set the learning rate to linearly decay from 2.5e-4 to 0
  • Generalized Advantage Estimation
    • Terminations caused by environment length limits must be counted as non-terminal in the target value calculation
  • Mini-batch Updates
  • Normalization of Advantages
    • After calculating the advantages based on GAE, PPO normalizes the advantages by subtracting their mean and dividing them by their standard deviation. In particular, this normalization happens at the minibatch level instead of the whole batch level!
  • Value Function Loss Clipping
  • Overall Loss and Entropy Bonus
  • Global Gradient Clipping
  • Debug variables
    • policy_loss
    • value_loss
    • entropy_loss
    • clipfrac
    • approxkl
  • Shared and separate MLP networks for policy and value functions
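The initialization and optimizer details above can be sketched as follows. This is a minimal PyTorch sketch, not the official implementation; the layer sizes (8 inputs, 64 hidden units, 4 actions) and the `layer_init` helper name are assumptions for illustration.

```python
import torch
import torch.nn as nn

def layer_init(layer, std=2**0.5, bias_const=0.0):
    # Orthogonal weight init with gain `std`, constant bias init
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer

# Hidden layers typically use gain sqrt(2); the policy output layer
# uses 0.01 and the value output layer uses 1.0, as noted above.
backbone = nn.Sequential(
    layer_init(nn.Linear(8, 64)), nn.Tanh(),
    layer_init(nn.Linear(64, 64)), nn.Tanh(),
)
policy_head = layer_init(nn.Linear(64, 4), std=0.01)
value_head = layer_init(nn.Linear(64, 1), std=1.0)

params = (list(backbone.parameters())
          + list(policy_head.parameters())
          + list(value_head.parameters()))

# Adam with epsilon set to 1e-5 instead of the PyTorch default 1e-8
optimizer = torch.optim.Adam(params, lr=2.5e-4, eps=1e-5)

# Global gradient clipping would be applied before each optimizer step:
#   torch.nn.utils.clip_grad_norm_(params, max_norm=0.5)
```

With `torch.nn.init.orthogonal_`, the rows of the policy head weight are orthogonal and scaled by the gain, so the initial policy is nearly uniform over actions while the value head starts at a reasonable scale.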

Details for continuous action domains (e.g. Mujoco)

  • Continuous actions via normal distributions
  • State-independent log standard deviation
  • Independent action components
  • Separate MLP networks for policy and value functions
  • Handling of action clipping to valid range and storage
    • The original unclipped action is stored as part of the episodic data
    • Applying a squashing function (tanh) to the Gaussian samples to satisfy the action constraints works better
  • Normalization of Observation
    • VecNormalize: the raw observation is normalized by subtracting its running mean and dividing by the square root of its running variance.
  • Observation Clipping
  • Reward Scaling
  • Reward Clipping
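The continuous-action details above (normal distribution, state-independent log standard deviation, independent action components, storing the unclipped action) can be sketched roughly as below. This is an illustrative PyTorch fragment, not the official code; the feature size (64), action dimensionality (6), and batch size (32) are assumptions.

```python
import torch
import torch.nn as nn

act_dim = 6  # assumed action dimensionality

mean_net = nn.Linear(64, act_dim)             # state-dependent mean
log_std = nn.Parameter(torch.zeros(act_dim))  # state-INdependent log std

features = torch.randn(32, 64)  # placeholder policy features
mean = mean_net(features)
std = log_std.exp().expand_as(mean)

# Independent action components: a diagonal Gaussian whose log-prob
# sums over the action dimensions.
dist = torch.distributions.Independent(
    torch.distributions.Normal(mean, std), 1)

action = dist.sample()                # the unclipped action is stored
logprob = dist.log_prob(action)       # shape: (batch,)
env_action = action.clamp(-1.0, 1.0)  # a clipped copy is sent to the env
```

Note that the log-probability is computed from the original unclipped sample, which is why that sample (not the clipped one) must be stored in the rollout buffer.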

nsteps=2048, nminibatches=32, lam=0.95, gamma=0.99, noptepochs=10, log_interval=1, ent_coef=0.0, lr=lambda f: 3e-4 * f, cliprange=0.2, value_network='copy'
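Using the `lam=0.95, gamma=0.99` values above, the GAE computation with the time-limit detail from the core list (length-limit terminations count as non-terminal, so bootstrapping continues) can be sketched in NumPy. The function signature and the convention that `values` has length T+1 are assumptions for illustration:

```python
import numpy as np

def compute_gae(rewards, values, terminals, gamma=0.99, lam=0.95):
    # `values` has length T+1: values[t+1] is V of the state reached at
    # step t, and values[T] bootstraps the final step. `terminals[t]` is
    # 1 only for true environment terminations, NOT for time-limit
    # truncations, which therefore keep bootstrapping through values[t+1].
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - terminals[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        adv[t] = gae
    return adv, adv + values[:T]  # advantages and value targets

# Advantage normalization then happens per MINIBATCH inside the update
# loop, e.g.: mb_adv = (mb_adv - mb_adv.mean()) / (mb_adv.std() + 1e-8)
```

With `gamma=1, lam=1` and zero values this reduces to reward-to-go sums, which is a quick sanity check for the recursion.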

LSTM implementation details

  • Layer initialization for LSTM layer
  • Initialize the LSTM states to be zeros
  • Reset LSTM states at the end of the episode
  • Prepare sequential rollouts in mini-batches
  • Reconstruct LSTM states during training !!!
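The zero-initialization and end-of-episode reset of LSTM states can be sketched as below, assuming a vectorized setup with per-environment done flags; the sizes (4 envs, 16 features, 32 hidden units) are illustrative:

```python
import torch
import torch.nn as nn

num_envs, hidden = 4, 32
lstm = nn.LSTM(16, hidden)

# States are initialized to zeros and persist across rollout steps.
h = torch.zeros(1, num_envs, hidden)
c = torch.zeros(1, num_envs, hidden)

obs_feat = torch.randn(1, num_envs, 16)  # one step of policy features
done = torch.tensor([0.0, 1.0, 0.0, 0.0])  # env 1 just finished

# Zero out the state of any environment whose episode ended, so the
# next episode starts from a fresh recurrent state.
mask = (1.0 - done).view(1, num_envs, 1)
out, (h, c) = lstm(obs_feat, (h * mask, c * mask))
```

During training the same masking is replayed step by step over the stored rollout, which is what "reconstruct LSTM states during training" refers to: the minibatch gradients must see the same recurrent states the policy saw when acting.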

Auxiliary implementation details

  • Clip Range Annealing
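Clip range annealing follows the same linear schedule as the learning rate (`lr=lambda f: 3e-4 * f` above, where `f` decays from 1 to 0). A minimal sketch, with the function name and `update` counter as assumed conventions:

```python
def linear_anneal(initial, update, total_updates):
    # Decay a coefficient linearly from `initial` to 0 over training
    frac = 1.0 - update / total_updates
    return initial * frac

# Halfway through training both coefficients are at half strength:
lr = linear_anneal(3e-4, update=50, total_updates=100)
cliprange = linear_anneal(0.2, update=50, total_updates=100)
```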
