This project reproduces the Proximal Policy Optimization (PPO) algorithm from the original paper "Proximal Policy Optimization Algorithms" by Schulman et al. (2017) using PyTorch. The implementation features four specialized versions for different types of environments:
- `ppo_descrete.py`: Optimized for classic control tasks (CartPole, LunarLander) with MLP networks
- `ppo_atari.py`: Optimized for Atari games (Breakout, Pong) with CNN networks
- `ppo_continous.py`: Optimized for continuous control tasks (MuJoCo environments like HalfCheetah, Ant, Humanoid) with MLP networks
- `ppo_racing.py`: Optimized for CarRacing environments with CNN networks for visual input
The code supports logging to TensorBoard and Weights & Biases (wandb) for experiment tracking and visualization.
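All four scripts optimize the clipped surrogate objective from the paper. As a quick reference, here is a minimal PyTorch sketch of that loss (tensor names are assumed for illustration, not lifted from these scripts):

```python
import torch

def ppo_policy_loss(new_logprobs, old_logprobs, advantages, clip_coef=0.2):
    """Clipped surrogate objective L^CLIP from Schulman et al. (2017).

    new_logprobs / old_logprobs: log pi(a|s) under the current policy and
    the rollout policy; advantages: (optionally normalized) GAE estimates.
    """
    ratio = (new_logprobs - old_logprobs).exp()  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_coef, 1.0 + clip_coef) * advantages
    # PPO maximizes the elementwise minimum; negate it for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```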
- Python 3.8 or higher
- Conda (Miniconda or Anaconda)
- Dependencies are listed in `requirements.txt`.
1. Create a conda environment with Python 3.8:

   ```bash
   conda create -n ppo-algorithm python=3.8
   conda activate ppo-algorithm
   ```
2. Install box2d-py from conda-forge (avoids compilation issues):

   ```bash
   conda install -c conda-forge box2d-py
   ```
3. Install the remaining dependencies:

   ```bash
   pip install -r requirements.txt
   ```
If you prefer to use pip only, ensure you have the necessary build tools installed.

For Ubuntu/Debian:

```bash
sudo apt update
sudo apt install -y build-essential python3-dev swig
pip install -r requirements.txt
```

For macOS:

```bash
brew install swig
pip install -r requirements.txt
```

Note: The conda approach is recommended, as it provides pre-compiled binaries and avoids compilation issues with box2d-py.
1. Activate the conda environment:

   ```bash
   conda activate ppo-algorithm
   ```
2. Choose your implementation:

   For classic control tasks (CartPole, LunarLander):

   ```bash
   python ppo_descrete.py --gym-id CartPole-v1 --track --wandb-project-name ppo-reproduction
   ```

   For Atari games (Breakout, Pong):

   ```bash
   python ppo_atari.py --gym-id BreakoutNoFrameskip-v4 --track --wandb-project-name ppo-reproduction
   ```
1. Activate the conda environment:

   ```bash
   conda activate ppo-algorithm
   ```
2. Choose the appropriate script and environment:
**`ppo_descrete.py`**: Best for classic control tasks with discrete action spaces (CartPole, LunarLander, Acrobot)

CartPole-v1:

```bash
python ppo_descrete.py --gym-id CartPole-v1 --total-timesteps 25000 --track --wandb-project-name ppo-reproduction
```

LunarLander-v2:

```bash
python ppo_descrete.py --gym-id LunarLander-v2 --total-timesteps 100000 --track --wandb-project-name ppo-reproduction
```

Acrobot-v1:

```bash
python ppo_descrete.py --gym-id Acrobot-v1 --total-timesteps 50000 --track --wandb-project-name ppo-reproduction
```

**`ppo_atari.py`**: Best for Atari games with visual input requiring CNN processing
Breakout:

```bash
python ppo_atari.py --gym-id BreakoutNoFrameskip-v4 --total-timesteps 10000000 --track --wandb-project-name ppo-reproduction
```

Pong:

```bash
python ppo_atari.py --gym-id PongNoFrameskip-v4 --total-timesteps 10000000 --track --wandb-project-name ppo-reproduction
```

SpaceInvaders:

```bash
python ppo_atari.py --gym-id SpaceInvadersNoFrameskip-v4 --total-timesteps 10000000 --track --wandb-project-name ppo-reproduction
```

**`ppo_continous.py`**: Best for MuJoCo and PyBullet environments with continuous action spaces
HalfCheetah:

```bash
python ppo_continous.py --gym-id HalfCheetahBulletEnv-v0 --total-timesteps 2000000 --track --wandb-project-name ppo-reproduction
```

Ant:

```bash
python ppo_continous.py --gym-id AntBulletEnv-v0 --total-timesteps 2000000 --track --wandb-project-name ppo-reproduction
```

Humanoid:

```bash
python ppo_continous.py --gym-id HumanoidBulletEnv-v0 --total-timesteps 2000000 --track --wandb-project-name ppo-reproduction
```

Walker2D:

```bash
python ppo_continous.py --gym-id Walker2DBulletEnv-v0 --total-timesteps 2000000 --track --wandb-project-name ppo-reproduction
```

**`ppo_racing.py`**: Best for CarRacing environments with visual input and continuous steering/acceleration
CarRacing-v2:

```bash
python ppo_racing.py --gym-id CarRacing-v2 --total-timesteps 1000000 --track --wandb-project-name ppo-reproduction
```

CarRacing-v1:

```bash
python ppo_racing.py --gym-id CarRacing-v1 --total-timesteps 1000000 --track --wandb-project-name ppo-reproduction
```

CarRacing-v0:

```bash
python ppo_racing.py --gym-id CarRacing-v0 --total-timesteps 1000000 --track --wandb-project-name ppo-reproduction
```

Note: Logging to Weights & Biases requires you to set `--track`, `--wandb-project-name`, and optionally `--wandb-entity` for organizational logging.

**`test_random_action.py`**: Purpose is to test environment setup and visualize random agent behavior

Test the Ant environment:

```bash
python test_random_action.py
```
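For reference, a random-agent check of this kind typically follows the pattern below (a hypothetical illustration assuming the classic `gym` step API and PyBullet's registered environments, not the exact contents of `test_random_action.py`):

```python
import gym
import pybullet_envs  # noqa: F401  (registers AntBulletEnv-v0 and friends)

# Roll out a uniformly random policy to confirm the environment works.
env = gym.make("AntBulletEnv-v0")
obs = env.reset()
episode_return = 0.0
for _ in range(1000):
    action = env.action_space.sample()          # random action
    obs, reward, done, info = env.step(action)  # classic 4-tuple API
    episode_return += reward
    if done:
        print(f"episode return: {episode_return:.1f}")
        obs, episode_return = env.reset(), 0.0
env.close()
```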
| Feature | ppo_descrete.py | ppo_atari.py | ppo_continous.py | ppo_racing.py |
|---|---|---|---|---|
| Target Environments | Classic Control (CartPole, LunarLander) | Atari Games (Breakout, Pong) | MuJoCo/PyBullet (HalfCheetah, Ant) | CarRacing (v0, v1, v2) |
| Action Space | Discrete | Discrete | Continuous | Continuous |
| Neural Network | MLP (Multi-Layer Perceptron) | CNN (Convolutional Neural Network) | MLP (Multi-Layer Perceptron) | CNN (Convolutional Neural Network) |
| Input Processing | Raw observations | Preprocessed frames (84x84, grayscale, stacked) | Raw observations | RGB frames (96x96) |
| Default Timesteps | 25,000 | 10,000,000 | 2,000,000 | 1,000,000 |
| Default Environments | 4 | 8 | 1 | 4 |
| Default Clip Coef | 0.2 | 0.1 | 0.2 | 0.2 |
| Default Learning Rate | 2.5e-4 | 2.5e-4 | 3e-4 | 3e-4 |
| GPU Utilization | Low (3-10%) | High (50-90%) | Medium (20-40%) | High (40-70%) |
| Training Time | Fast (minutes) | Slow (hours) | Medium (hours) | Medium (hours) |
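The 84x84 grayscale, frame-stacked input in the table is the standard DeepMind-style Atari preprocessing. A minimal sketch using `gym`'s built-in wrappers (assuming a classic `gym` release that ships them and the Atari extras installed; the full pipeline usually also adds no-op resets, frame skipping, and reward clipping):

```python
import gym
from gym.wrappers import FrameStack, GrayScaleObservation, ResizeObservation

def make_atari_env(gym_id="BreakoutNoFrameskip-v4", num_stack=4):
    """Grayscale, resize to 84x84, and stack the last `num_stack` frames."""
    env = gym.make(gym_id)                  # requires the Atari extras
    env = GrayScaleObservation(env)         # RGB -> single channel
    env = ResizeObservation(env, shape=84)  # downsample to 84x84
    env = FrameStack(env, num_stack)        # stack frames for motion cues
    return env

env = make_atari_env()
print(env.observation_space.shape)  # (4, 84, 84)
```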
Use `ppo_descrete.py` when:
- You are working with classic control tasks (CartPole, LunarLander, Acrobot)
- You need fast experimentation
- You have limited computational resources
- You are learning PPO fundamentals
- The action space is discrete
Use `ppo_atari.py` when:
- You are working with Atari games (Breakout, Pong, SpaceInvaders)
- You need high GPU utilization
- You are researching computer vision + RL
- You have sufficient computational resources
- The action space is discrete with visual input
Use `ppo_continous.py` when:
- You are working with MuJoCo/PyBullet environments (HalfCheetah, Ant, Humanoid)
- The action space is continuous
- You are targeting robotics applications
- You are working on physics simulation tasks
- You have medium computational resources
Use `ppo_racing.py` when:
- You are working with CarRacing environments
- The action space is continuous with visual input
- You are doing autonomous driving research
- You are working on visual navigation tasks
- You have medium to high computational resources
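The discrete/continuous split above comes down to the policy head: the discrete scripts sample actions from a `Categorical` distribution over network logits, while continuous-control PPO typically samples from a diagonal Gaussian whose log standard deviation is a learned, state-independent parameter. Here is a minimal sketch of such a Gaussian head (assumed names, following the common PPO recipe rather than this repo's exact code):

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """Diagonal-Gaussian actor for continuous action spaces."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        # State-independent log std, learned as a free parameter.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        dist = Normal(self.mean(obs), self.log_std.exp())
        action = dist.sample()
        # Sum log probs over action dimensions for a diagonal Gaussian.
        return action, dist.log_prob(action).sum(-1)
```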
For classic control (`ppo_descrete.py`):

```bash
python ppo_descrete.py --learning-rate 3e-4 --gamma 0.98 --clip-coef 0.2 --gae-lambda 0.95 --num-envs 8
```

For Atari games (`ppo_atari.py`):

```bash
python ppo_atari.py --learning-rate 2.5e-4 --gamma 0.99 --clip-coef 0.1 --gae-lambda 0.95 --num-envs 16
```

For continuous control (`ppo_continous.py`):

```bash
python ppo_continous.py --learning-rate 3e-4 --gamma 0.99 --clip-coef 0.2 --gae-lambda 0.95 --num-envs 4
```

For CarRacing (`ppo_racing.py`):

```bash
python ppo_racing.py --learning-rate 3e-4 --gamma 0.99 --clip-coef 0.2 --gae-lambda 0.95 --num-envs 8
```

To log runs in Weights & Biases, provide:
- `--track`: Enable wandb logging.
- `--wandb-project-name`: Name of your wandb project.
- `--wandb-entity`: (Optional) Wandb team or user entity.
Examples:
```bash
# For classic control tasks
python ppo_descrete.py --track --wandb-project-name ppo-experiments --wandb-entity your_team

# For Atari games
python ppo_atari.py --track --wandb-project-name ppo-experiments --wandb-entity your_team
```

Below is a list of commonly used arguments for all scripts:
| Argument | Description | ppo_descrete.py | ppo_atari.py | ppo_continous.py | ppo_racing.py |
|---|---|---|---|---|---|
| `--gym-id` | ID of the Gym environment | CartPole-v1 | BreakoutNoFrameskip-v4 | HalfCheetahBulletEnv-v0 | CarRacing-v2 |
| `--total-timesteps` | Total timesteps to run the experiment | 25000 | 10000000 | 2000000 | 1000000 |
| `--learning-rate` | Optimizer learning rate | 2.5e-4 | 2.5e-4 | 3e-4 | 3e-4 |
| `--seed` | Random seed for reproducibility | 1 | 1 | 1 | 1 |
| `--track` | Enable wandb logging | False | False | False | False |
| `--wandb-project-name` | Wandb project name | ppo-test-new | ppo-test-new | ppo-implementation-details | ppo-racing |
| `--num-envs` | Number of parallel environments | 4 | 8 | 1 | 4 |
| `--num-steps` | Steps per environment rollout | 128 | 128 | 2048 | 128 |
| `--clip-coef` | Clipping coefficient for PPO | 0.2 | 0.1 | 0.2 | 0.2 |
| `--gae` | Enable Generalized Advantage Estimation (GAE) | True | True | True | True |
| `--gae-lambda` | GAE lambda parameter | 0.95 | 0.95 | 0.95 | 0.95 |
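The `--gae` and `--gae-lambda` arguments above control Generalized Advantage Estimation. As a reference, a typical GAE computation over a rollout looks like the following sketch (generic variable names assumed; not this repo's exact loop):

```python
import torch

def compute_gae(rewards, values, dones, next_value, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation over a T-step rollout.

    rewards, values, dones: float tensors of shape (T,); dones[t] is 1.0 if
    the episode ended at step t. next_value bootstraps the state after the
    final step. Returns (advantages, returns).
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        v_next = next_value if t == T - 1 else values[t + 1]
        # TD error: r_t + gamma * V(s_{t+1}) - V(s_t), masked at episode ends.
        delta = rewards[t] + gamma * v_next * not_done - values[t]
        # Exponentially weighted sum of TD errors (lambda-return form).
        last_gae = delta + gamma * gae_lambda * not_done * last_gae
        advantages[t] = last_gae
    return advantages, advantages + values
```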
- TensorBoard: Logs are stored in the `runs/` directory. Use `tensorboard --logdir=runs` to visualize training metrics.
- Weights & Biases: Track your experiment online with wandb by setting `--track` and specifying the project name and entity.
Running with `--track` enables detailed experiment tracking in wandb, including episode rewards, losses, and training progress visualizations.
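A common way to wire the two logging backends together, used by many PPO reproductions, is to let wandb mirror the TensorBoard writer. This is an assumed sketch rather than a guaranteed excerpt of these scripts:

```python
import wandb
from torch.utils.tensorboard import SummaryWriter

run_name = "CartPole-v1__ppo__seed1"  # hypothetical run name
wandb.init(
    project="ppo-reproduction",  # matches --wandb-project-name
    entity=None,                 # or your --wandb-entity team/user
    sync_tensorboard=True,       # mirror TensorBoard scalars into wandb
    name=run_name,
)
writer = SummaryWriter(f"runs/{run_name}")

# Scalars written to the TensorBoard writer now show up in both tools.
writer.add_scalar("charts/episodic_return", 123.0, global_step=1000)
```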
To visualize experiment logs, the project is configured to use Weights & Biases (wandb). You can open the project's experiment dashboard on wandb to monitor training progress, compare results, and analyze performance across runs.
This project is complete and fully functional. All PPO implementations are stable, tested, and ready for use across different environment types.
- PPO for Discrete Action Spaces (`ppo_descrete.py`): Classic control tasks (CartPole, LunarLander, Acrobot)
- PPO for Atari Games (`ppo_atari.py`): Visual environments with CNN processing (Breakout, Pong, SpaceInvaders)
- PPO for Continuous Control (`ppo_continous.py`): MuJoCo/PyBullet environments (HalfCheetah, Ant, Humanoid, Walker2D)
- PPO for CarRacing (`ppo_racing.py`): Visual continuous control (CarRacing v0, v1, v2)
- TensorBoard and Weights & Biases logging: Comprehensive experiment tracking
- Video recording: Training episode visualization (see the sketch after this list)
- Test utilities: Environment testing and random agent visualization (`test_random_action.py`)
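For context, episode videos are usually captured with gym's `RecordVideo` wrapper. A minimal sketch (assuming a classic `gym` version whose environments expose `rgb_array` rendering, and a hypothetical `videos/` output folder):

```python
import gym
from gym.wrappers import RecordVideo

env = gym.make("CartPole-v1")
# Save a video of every 100th episode into the videos/ folder.
env = RecordVideo(env, video_folder="videos",
                  episode_trigger=lambda ep: ep % 100 == 0)
obs = env.reset()
done = False
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())
env.close()
```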
Contributions are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request.
- None currently reported; all implementations are stable and working correctly
Last Updated: October 2025