This repository was archived by the owner on Feb 24, 2026. It is now read-only.

Improving PPO #9

@TolgaOk

Description


Reproducing the official PPO implementation

Please check the link above for details.

  • 13 core implementation details
  • 9 Atari specific implementation details
  • 9 implementation details for robotics tasks (with continuous action spaces)
  • 5 LSTM implementation details
  • 1 MultiDiscrete implementation detail

Core implementation details

  • Vectorized architecture
  • Orthogonal Initialization of Weights and Constant Initialization of biases
    • The policy output layer weights are initialized with a scale (gain) of 0.01; the value output layer weights are initialized with a scale of 1.0
    • torch.nn.init.orthogonal_
  • The Adam Optimizer’s Epsilon Parameter
    • PPO sets the epsilon parameter to 1e-5
  • Adam Learning Rate Annealing
    • In MuJoCo, the learning rate linearly decays from 3e-4 to 0
    • Atari games set the learning rate to linearly decay from 2.5e-4 to 0
  • Generalized Advantage Estimation
    • Terminations caused by environment length limits must be counted as non-terminal in the target value calculation
  • Mini-batch Updates
  • Normalization of Advantages
    • After calculating the advantages based on GAE, PPO normalizes the advantages by subtracting their mean and dividing them by their standard deviation. In particular, this normalization happens at the minibatch level instead of the whole batch level!
  • Value Function Loss Clipping
  • Overall Loss and Entropy Bonus
  • Global Gradient Clipping
  • Debug variables
    • policy_loss
    • value_loss
    • entropy_loss
    • clipfrac
    • approxkl
  • Shared and separate MLP networks for policy and value functions
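The initialization and optimizer details above can be sketched as follows. This is a minimal PyTorch sketch, not the official implementation; the layer sizes (8 inputs, 64 hidden units, 4 actions) and the `layer_init` helper name are assumptions for illustration.

```python
import torch
import torch.nn as nn

def layer_init(layer, std=2**0.5, bias_const=0.0):
    # Orthogonal weight init with gain `std`, constant bias init
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer

# Hidden layers typically use gain sqrt(2); the policy output layer
# uses 0.01 and the value output layer uses 1.0, as noted above.
backbone = nn.Sequential(
    layer_init(nn.Linear(8, 64)), nn.Tanh(),
    layer_init(nn.Linear(64, 64)), nn.Tanh(),
)
policy_head = layer_init(nn.Linear(64, 4), std=0.01)
value_head = layer_init(nn.Linear(64, 1), std=1.0)

params = (list(backbone.parameters())
          + list(policy_head.parameters())
          + list(value_head.parameters()))

# Adam with epsilon set to 1e-5 instead of the PyTorch default 1e-8
optimizer = torch.optim.Adam(params, lr=2.5e-4, eps=1e-5)

# Global gradient clipping would be applied before each optimizer step:
#   torch.nn.utils.clip_grad_norm_(params, max_norm=0.5)
```

With `torch.nn.init.orthogonal_`, the rows of the policy head weight are orthogonal and scaled by the gain, so the initial policy is nearly uniform over actions while the value head starts at a reasonable scale.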

Details for continuous action domains (e.g. Mujoco)

  • Continuous actions via normal distributions
  • State-independent log standard deviation
  • Independent action components
  • Separate MLP networks for policy and value functions
  • Handling of action clipping to valid range and storage
    • The original unclipped action is stored as part of the episodic data
    • Applying a squashing function (tanh) to the Gaussian samples to satisfy the action constraints works better
  • Normalization of Observation
    • VecNormalize: the raw observation is normalized by subtracting its running mean and dividing by the square root of its running variance.
  • Observation Clipping
  • Reward Scaling
  • Reward Clipping
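The continuous-action details above (normal distribution, state-independent log standard deviation, independent action components, storing the unclipped action) can be sketched roughly as below. This is an illustrative PyTorch fragment, not the official code; the feature size (64), action dimensionality (6), and batch size (32) are assumptions.

```python
import torch
import torch.nn as nn

act_dim = 6  # assumed action dimensionality

mean_net = nn.Linear(64, act_dim)             # state-dependent mean
log_std = nn.Parameter(torch.zeros(act_dim))  # state-INdependent log std

features = torch.randn(32, 64)  # placeholder policy features
mean = mean_net(features)
std = log_std.exp().expand_as(mean)

# Independent action components: a diagonal Gaussian whose log-prob
# sums over the action dimensions.
dist = torch.distributions.Independent(
    torch.distributions.Normal(mean, std), 1)

action = dist.sample()                # the unclipped action is stored
logprob = dist.log_prob(action)       # shape: (batch,)
env_action = action.clamp(-1.0, 1.0)  # a clipped copy is sent to the env
```

Note that the log-probability is computed from the original unclipped sample, which is why that sample (not the clipped one) must be stored in the rollout buffer.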

nsteps=2048, nminibatches=32, lam=0.95, gamma=0.99, noptepochs=10, log_interval=1, ent_coef=0.0, lr=lambda f: 3e-4 * f, cliprange=0.2, value_network='copy'
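Using the `lam=0.95, gamma=0.99` values above, the GAE computation with the time-limit detail from the core list (length-limit terminations count as non-terminal, so bootstrapping continues) can be sketched in NumPy. The function signature and the convention that `values` has length T+1 are assumptions for illustration:

```python
import numpy as np

def compute_gae(rewards, values, terminals, gamma=0.99, lam=0.95):
    # `values` has length T+1: values[t+1] is V of the state reached at
    # step t, and values[T] bootstraps the final step. `terminals[t]` is
    # 1 only for true environment terminations, NOT for time-limit
    # truncations, which therefore keep bootstrapping through values[t+1].
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - terminals[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        adv[t] = gae
    return adv, adv + values[:T]  # advantages and value targets

# Advantage normalization then happens per MINIBATCH inside the update
# loop, e.g.: mb_adv = (mb_adv - mb_adv.mean()) / (mb_adv.std() + 1e-8)
```

With `gamma=1, lam=1` and zero values this reduces to reward-to-go sums, which is a quick sanity check for the recursion.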

LSTM implementation details

  • Layer initialization for LSTM layer
  • Initialize the LSTM states to be zeros
  • Reset LSTM states at the end of the episode
  • Prepare sequential rollouts in mini-batches
  • Reconstruct LSTM states during training !!!
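The zero-initialization and end-of-episode reset of LSTM states can be sketched as below, assuming a vectorized setup with per-environment done flags; the sizes (4 envs, 16 features, 32 hidden units) are illustrative:

```python
import torch
import torch.nn as nn

num_envs, hidden = 4, 32
lstm = nn.LSTM(16, hidden)

# States are initialized to zeros and persist across rollout steps.
h = torch.zeros(1, num_envs, hidden)
c = torch.zeros(1, num_envs, hidden)

obs_feat = torch.randn(1, num_envs, 16)  # one step of policy features
done = torch.tensor([0.0, 1.0, 0.0, 0.0])  # env 1 just finished

# Zero out the state of any environment whose episode ended, so the
# next episode starts from a fresh recurrent state.
mask = (1.0 - done).view(1, num_envs, 1)
out, (h, c) = lstm(obs_feat, (h * mask, c * mask))
```

During training the same masking is replayed step by step over the stored rollout, which is what "reconstruct LSTM states during training" refers to: the minibatch gradients must see the same recurrent states the policy saw when acting.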

Auxiliary implementation details

  • Clip Range Annealing
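Clip range annealing follows the same linear schedule as the learning rate (`lr=lambda f: 3e-4 * f` above, where `f` decays from 1 to 0). A minimal sketch, with the function name and `update` counter as assumed conventions:

```python
def linear_anneal(initial, update, total_updates):
    # Decay a coefficient linearly from `initial` to 0 over training
    frac = 1.0 - update / total_updates
    return initial * frac

# Halfway through training both coefficients are at half strength:
lr = linear_anneal(3e-4, update=50, total_updates=100)
cliprange = linear_anneal(0.2, update=50, total_updates=100)
```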
