TL;DR. This repository implements a comparative benchmark of classic and modern Deep RL algorithms on the VizDoom Deadly Corridor task, using a unified environment wrapper, YAML-based configs, and wandb logging.
All agents share:
- The same VizDoom environment (Deadly Corridor)
- The same preprocessing & frame stacking
- The same logging / checkpointing pipeline
so that we can focus on algorithmic differences in sample efficiency and final performance.
All algorithms act on the same discrete 7-action space and stacked grayscale frames (C, H, W):
- DQN
- DDQN (Double DQN)
- DDDQN (Dueling Double DQN)
- Rainbow-style DQN (Rainbow-lite):
  - Dueling Q-head
  - Double DQN target update (see the target sketch after this list)
  - Prioritized Experience Replay (PER) via a dedicated buffer
  - No noisy nets / no distributional head (simplified Rainbow variant)
- REINFORCE (Monte Carlo policy gradient, with optional value baseline)
- A2C (Advantage Actor-Critic, synchronous)
- A3C (Advantage Actor-Critic, asynchronous-style loop in a single process)
- PPO (Proximal Policy Optimization)
- TRPO (Trust Region Policy Optimization, simplified with a KL penalty)
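For reference, here is a minimal sketch of the Double-DQN bootstrap target shared by the DDQN, DDDQN, and Rainbow-lite agents. Names such as `online_net` and `target_net` are illustrative, not the exact attributes used in rldoom/agents/:

```python
import torch

@torch.no_grad()
def double_dqn_target(online_net, target_net, rewards, next_obs, dones, gamma=0.99):
    """Double-DQN target: the online network selects the next action,
    the target network evaluates it."""
    best_actions = online_net(next_obs).argmax(dim=1, keepdim=True)   # (B, 1)
    next_q = target_net(next_obs).gather(1, best_actions).squeeze(1)  # (B,)
    return rewards + gamma * (1.0 - dones) * next_q
```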
The goal is a clean, reproducible benchmark that can run on a lab server (SSH + tmux) without touching global system packages.
- Scenario: VizDoom – Deadly Corridor
- Goal: Reach the vest at the end of the corridor without dying.
- Map: narrow corridor with monsters on both sides.
- Reward (shaped):
  - +dX for getting closer to the vest
  - -dX for moving away
  - -100 death penalty
We use 7 discrete actions (one-hot):
- MOVE_LEFT
- MOVE_RIGHT
- ATTACK
- MOVE_FORWARD
- MOVE_BACKWARD
- TURN_LEFT
- TURN_RIGHT
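For illustration only, the corresponding one-hot button vectors passed to VizDoom could be built as below; the actual button order is defined by deadly_corridor.cfg and the wrapper, so treat the indices as an assumption:

```python
import numpy as np

# One one-hot vector per discrete action, matching the 7 buttons above
# (index order assumed for illustration).
NUM_ACTIONS = 7
ACTIONS = np.eye(NUM_ACTIONS, dtype=np.int32).tolist()
# e.g. ACTIONS[0] -> MOVE_LEFT, ACTIONS[2] -> ATTACK
```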
Implemented in rldoom/envs/deadly_corridor.py:
- Take raw RGB buffer from VizDoom.
- Convert to grayscale.
- Crop uninformative regions (HUD, floor, etc.).
- Resize to frame_size x frame_size (default: 84×84).
- Normalize to [0, 1].
- Maintain a deque of the last 4 frames, stacked into a tensor of shape (4, 84, 84).
This ensures all algorithms see identical observations.
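A condensed sketch of this pipeline, assuming an (H, W, 3) RGB frame as input; the crop offsets below are illustrative, the real values live in rldoom/envs/deadly_corridor.py:

```python
from collections import deque

import cv2
import numpy as np

def preprocess(frame, frame_size=84):
    """Grayscale -> crop -> resize -> normalize to [0, 1]."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    cropped = gray[30:-10, 30:-30]  # illustrative crop of HUD / floor
    resized = cv2.resize(cropped, (frame_size, frame_size), interpolation=cv2.INTER_AREA)
    return resized.astype(np.float32) / 255.0

def stack_frames(frames, new_frame, stack_size=4, new_episode=False):
    """Keep the last `stack_size` frames and return a (stack_size, H, W) array."""
    if new_episode or frames is None:
        frames = deque([new_frame] * stack_size, maxlen=stack_size)
    else:
        frames.append(new_frame)
    return frames, np.stack(frames, axis=0)
```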
RLDoom/
train.py # Main training entrypoint (single algo/seed)
eval.py # Evaluation entrypoint
requirements.txt # Python dependencies
doom_files/
deadly_corridor.cfg # VizDoom config
deadly_corridor.wad # VizDoom scenario
rldoom/
__init__.py
configs/
__init__.py
deadly_corridor.yaml # YAML config (env, train, logging, algo-specific)
# make_config(algo, seed) → flat cfg object
envs/
__init__.py
deadly_corridor.py # VizDoom wrapper + preprocessing + frame stacking
models/
__init__.py
cnn_backbone.py # Shared convolutional encoder for images
heads.py # Q-heads, dueling heads, actor/critic heads, etc.
buffers/
__init__.py
replay_buffer.py # Standard experience replay buffer
prioritized_replay.py # Prioritized Experience Replay (Rainbow)
rollout_buffer.py # On-policy rollout storage for PPO/TRPO/A2C/A3C
agents/
__init__.py
base.py # Base Agent class (common interface)
dqn.py # Vanilla DQN
ddqn.py # Double DQN
dddqn.py # Dueling Double DQN
rainbow.py # Rainbow-lite DQN with PER
reinforce.py # REINFORCE (MC policy gradient, optional baseline)
a2c.py # Advantage Actor-Critic
a3c.py # A3C-style agent (single-process version)
ppo.py # PPO agent
trpo.py # TRPO-style agent
trainers/
__init__.py
offpolicy.py # Training loop for DQN / DDQN / DDDQN / Rainbow
onpolicy.py # Training loop for REINFORCE / A2C / A3C / PPO / TRPO
utils/
__init__.py
logger.py # Wandb/console logger wrapper
seeding.py # Seeding helper
misc.py # Small utilities (path helpers, etc.)
scripts/
run_train.sh # tmux-friendly single-algorithm launcher
run_eval.sh # tmux-friendly evaluation launcher
launch_queue.py # Multi-GPU job launcher for training
launch_eval_queue.py # Multi-GPU job launcher for evaluation
checkpoints/ # Saved model checkpoints (created at runtime)
logs/ # Text logs + wandb local cache (created at runtime)
README.md
Create and activate a dedicated conda environment:
git clone https://github.com/LeeChangmin0310/RLDoom.git
cd RLDoom
conda create -n doomrl python=3.9 -y
conda activate doomrl
pip install -r requirements.txt
Check that the environment Python is used:
which python
# -> .../anaconda3/envs/doomrl/bin/python
The YAML config assumes the following paths:
env:
cfg_path: "doom_files/deadly_corridor.cfg"
wad_path: "doom_files/deadly_corridor.wad"
Place these files under doom_files/:
RLDoom/
doom_files/
deadly_corridor.cfg
deadly_corridor.wad
You can copy them from the official VizDoom examples or tutorial resources.
All configuration is centralized in:
rldoom/configs/deadly_corridor.yaml
rldoom/configs/__init__.py # make_config
env:
cfg_path: "doom_files/deadly_corridor.cfg"
wad_path: "doom_files/deadly_corridor.wad"
frame_size: 84
stack_size: 4
frame_skip: 4
train:
num_episodes: 3000
max_steps_per_episode: 3000
checkpoint_dir: "checkpoints"
checkpoint_interval: 150
logs_dir: "logs"
defaults:
feature_dim: 512
gamma: 0.99
grad_clip: 10.0
logging:
use_wandb: true
wandb_project: "RLDoom"
wandb_entity: "lee_changmin-sangmyung-uni"
algos:
dqn:
type: "offpolicy"
# ...
ddqn:
dddqn:
rainbow:
reinforce:
a2c:
a3c:
ppo:
trpo:
# optionally: reinforce_tuned / a2c_tuned / dddqn_tuned / ppo_tuned
make_config(algo, seed) flattens this into a simple object:
from rldoom.configs import make_config
cfg = make_config("dddqn", seed=0)
cfg.algo # "dddqn"
cfg.algo_type # "offpolicy" / "onpolicy"
cfg.cfg_path # doom_files/deadly_corridor.cfg
cfg.wad_path # doom_files/deadly_corridor.wad
cfg.frame_size # 84
cfg.stack_size # 4
cfg.frame_skip # 4
cfg.num_episodes
cfg.max_steps_per_episode
cfg.checkpoint_dir
cfg.logs_dir
cfg.feature_dim
cfg.gamma
cfg.grad_clip
# algo-specific examples
cfg.lr
cfg.buffer_size
cfg.batch_size
cfg.learn_start
cfg.eps_start
cfg.eps_end
cfg.eps_decay
cfg.target_update_every
All agents and trainers use this same cfg object.
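A rough sketch of how such a flat object can be built from the YAML sections; the real make_config in rldoom/configs/__init__.py may differ in details:

```python
from types import SimpleNamespace

import yaml

def make_config_sketch(algo, seed, path="rldoom/configs/deadly_corridor.yaml"):
    """Merge env / train / defaults / logging / algos[algo] into one flat namespace."""
    with open(path) as f:
        raw = yaml.safe_load(f)

    flat = {}
    for section in ("env", "train", "defaults", "logging"):
        flat.update(raw.get(section) or {})
    algo_cfg = raw["algos"].get(algo) or {}
    flat.update(algo_cfg)

    flat["algo"] = algo
    flat["algo_type"] = algo_cfg.get("type", "offpolicy")
    flat["seed"] = seed
    return SimpleNamespace(**flat)
```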
Every algorithm logs per episode:
- episode: 1-based episode index (used as the wandb step)
- return: sum of rewards in the episode
- length: number of environment steps in the episode
- global_step: total environment steps so far
Algorithm-specific losses:
- Off-policy (DQN, DDQN, DDDQN, Rainbow)
  - loss: TD loss (total, identical to value_loss)
  - value_loss: the same scalar, logged explicitly
- REINFORCE
  - loss: total loss (policy + value baseline term)
  - policy_loss
  - value_loss (baseline fitting; optional)
- A2C
  - loss: total (policy + vf_coef * value_loss − ent_coef * entropy)
  - policy_loss
  - value_loss
  - entropy
- PPO
  - loss: total (clipped objective + value term − entropy)
  - policy_loss
  - value_loss
- TRPO (simplified)
  - loss: total (policy + KL penalty + value term − entropy)
  - policy_loss
  - value_loss
  - kl: mean KL divergence between the old and new policies
This makes it easy to compare return curves and loss dynamics across algorithms under a unified logging schema.
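In practice the per-episode log call boils down to something like this sketch (metric names follow the schema above; the actual Logger wrapper may organize it differently):

```python
import wandb

def log_episode(episode, ep_return, ep_length, global_step, losses):
    """losses is a dict such as {"loss": ..., "policy_loss": ..., "value_loss": ...}."""
    wandb.log(
        {"return": ep_return, "length": ep_length, "global_step": global_step, **losses},
        step=episode,  # 1-based episode index used as the wandb step
    )
```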
Wandb is configured via environment variables and/or the YAML:
Create a .env file in the project root (or export in your shell):
WANDB_API_KEY="YOUR_KEY"
WANDB_ENTITY="lee_changmin-sangmyung-uni"
WANDB_PROJECT="RLDoom"
WANDB_DIR="/home/cia/disk1/bci_intern/AAAI2026/RLDoom/logs/wandb"
Then either:
export $(grep -v '^#' .env | xargs)
or your logger can load .env internally (e.g., via python-dotenv).
rldoom/utils/logger.py reads:
- cfg.use_wandb
- cfg.wandb_project
- cfg.wandb_entity
- WANDB_API_KEY, WANDB_DIR from the environment
and initializes wandb accordingly.
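A minimal sketch of that initialization path, assuming python-dotenv is installed; the actual Logger may structure this differently:

```python
import os

import wandb
from dotenv import load_dotenv

def init_wandb(cfg):
    load_dotenv()  # picks up WANDB_API_KEY, WANDB_DIR, etc. from .env if present
    if not cfg.use_wandb:
        return None
    return wandb.init(
        project=cfg.wandb_project,
        entity=cfg.wandb_entity,
        dir=os.environ.get("WANDB_DIR", "logs/wandb"),
        config=vars(cfg),
        name=f"{cfg.algo}_seed{cfg.seed}",  # run name is an assumption
    )
```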
From the project root:
conda activate doomrl
cd /home/cia/disk1/bci_intern/AAAI2026/RLDoom
# DDDQN
python train.py --algo dddqn --seed 0
# DQN / DDQN / Rainbow
python train.py --algo dqn --seed 0
python train.py --algo ddqn --seed 0
python train.py --algo rainbow --seed 0
# On-policy algorithms
python train.py --algo reinforce --seed 0
python train.py --algo a2c --seed 0
python train.py --algo a3c --seed 0
python train.py --algo ppo --seed 0
python train.py --algo trpo --seed 0
train.py:
- Builds cfg = make_config(algo, seed)
- Sets seeds (Python / NumPy / Torch)
- Creates obs_shape = (stack_size, frame_size, frame_size) and num_actions = 7
- Instantiates the corresponding Agent subclass
- Wraps logging via Logger(cfg)
- Dispatches to:
  - train_offpolicy(agent, cfg, logger) for DQN / DDQN / DDDQN / Rainbow
  - train_onpolicy(agent, cfg, logger) for REINFORCE / A2C / A3C / PPO / TRPO
Both trainers use tqdm to show progress over episodes.
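The core of train.py therefore reduces to a dispatch like the sketch below. set_seed and build_agent are illustrative names (the seeding helper lives in rldoom/utils/seeding.py, and agent construction may be organized differently); argument parsing is omitted:

```python
from rldoom.configs import make_config
from rldoom.trainers.offpolicy import train_offpolicy
from rldoom.trainers.onpolicy import train_onpolicy
from rldoom.utils.logger import Logger

def main(algo, seed):
    cfg = make_config(algo, seed)
    set_seed(seed)                                    # Python / NumPy / Torch (illustrative name)
    obs_shape = (cfg.stack_size, cfg.frame_size, cfg.frame_size)
    num_actions = 7
    agent = build_agent(cfg, obs_shape, num_actions)  # hypothetical helper picking the Agent subclass
    logger = Logger(cfg)
    if cfg.algo_type == "offpolicy":
        train_offpolicy(agent, cfg, logger)
    else:
        train_onpolicy(agent, cfg, logger)
```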
#!/usr/bin/env bash
set -e
source ~/anaconda3/etc/profile.d/conda.sh
conda activate doomrl
cd /home/cia/disk1/bci_intern/AAAI2026/RLDoom
# Optional: load wandb env vars from .env
if [ -f ".env" ]; then
set -a
source .env
set +a
fi
export CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0}
export WANDB_DIR="${PWD}/logs/wandb"
mkdir -p logs checkpoints
ALGO=${1:-dddqn}
SEED=${2:-0}
python train.py --algo "${ALGO}" --seed "${SEED}" \
2>&1 | tee "logs/train_${ALGO}_seed${SEED}.log"
Usage:
chmod +x scripts/run_train.sh
tmux new -s doomrl
CUDA_VISIBLE_DEVICES=0 ./scripts/run_train.sh dddqn 0
# detach: Ctrl+b, d
# attach: tmux attach -t doomrl
launch_queue.py assigns different algorithms to different GPUs and runs them in parallel:
- Specify a job list: JOBS = [(algo, seed), ...]
- Specify available GPUs: GPUS = ["1", "2", "3"]
- Each job is launched as bash scripts/run_train.sh <algo> <seed> with CUDA_VISIBLE_DEVICES set internally.
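A rough sketch of that queueing logic; the real scripts/launch_queue.py may differ, e.g., in waiting for a GPU to free up instead of starting everything at once:

```python
import itertools
import os
import subprocess

JOBS = [("dqn", 0), ("ddqn", 0), ("dddqn", 0), ("rainbow", 0)]  # (algo, seed) pairs
GPUS = ["1", "2", "3"]

procs = []
for (algo, seed), gpu in zip(JOBS, itertools.cycle(GPUS)):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)            # pin the job to one GPU
    procs.append(subprocess.Popen(["bash", "scripts/run_train.sh", algo, str(seed)], env=env))

for p in procs:
    p.wait()
```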
Example usage:
tmux new -s doomrl_queue
python scripts/launch_queue.py
To evaluate a trained checkpoint:
conda activate doomrl
cd /home/cia/disk1/bci_intern/AAAI2026/RLDoom
python eval.py \
--algo dddqn \
--checkpoint checkpoints/dddqn_seed0_ep001500.pth \
--episodes 10
eval.py:
- Loads cfg = make_config(algo, seed) and disables wandb
- Builds the agent and loads weights (.pth checkpoint)
- Creates the Deadly Corridor environment
- Runs deterministic evaluation episodes (deterministic=True in agent.act)
- Prints per-episode return and mean return
The script also supports repeated evaluation:
- --chunk_episodes: number of episodes per chunk (e.g., 30)
- --n_chunks: number of chunks (e.g., 10)
This yields 10 mean returns over 30 episodes each, useful for more stable comparison.
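Conceptually, the chunked evaluation is just a nested loop like this sketch, where evaluate_episode is a hypothetical helper that runs one deterministic episode and returns its total return:

```python
import numpy as np

def chunked_eval(evaluate_episode, n_chunks=10, chunk_episodes=30):
    """Run n_chunks * chunk_episodes episodes and report one mean return per chunk."""
    chunk_means = []
    for _ in range(n_chunks):
        returns = [evaluate_episode() for _ in range(chunk_episodes)]
        chunk_means.append(float(np.mean(returns)))
    return chunk_means  # e.g., 10 mean returns over 30 episodes each
```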
This repository is intended as a simple, comparable setup for:
- Testing classic value-based methods (DQN family, Rainbow-lite)
- Comparing policy-gradient / actor-critic methods (REINFORCE, A2C/A3C, PPO, TRPO)
- Running controlled experiments on the same Doom task, with unified logging and preprocessing
Adding a new algorithm typically means:
- Implementing a new Agent class under rldoom/agents/.
- Adding a corresponding entry under algos: in rldoom/configs/deadly_corridor.yaml.
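A skeleton of what such a subclass might look like; the base-class name and constructor signature below are assumptions about rldoom/agents/base.py, not its actual interface:

```python
from rldoom.agents.base import BaseAgent  # class name assumed; see rldoom/agents/base.py

class MyNewAgent(BaseAgent):
    def __init__(self, obs_shape, num_actions, cfg):
        super().__init__(obs_shape, num_actions, cfg)  # signature assumed
        # build networks from rldoom/models here

    def act(self, obs, deterministic=False):
        """Select an action for one stacked observation."""
        raise NotImplementedError

    def update(self, batch):
        """Run one optimization step and return a dict of losses for the logger."""
        raise NotImplementedError
```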
Our experiments were mainly run on a NAS server with:
- OS: Ubuntu 22.04 LTS
- CPU/RAM: multi-core CPU with 512 GB of RAM
- GPUs:
  - 2 × NVIDIA RTX A5000 (24 GB each)
  - 1 × NVIDIA GeForce RTX 3090 (24 GB)
The code itself does not depend on this exact setup: any machine with a recent GPU (or even CPU, at slower speed) and a working VizDoom installation should be able to run the experiments.
This project builds on:
- Thomas Simonini, “Dueling Double Deep Q-Learning with PER — Doom Deadly Corridor”
- Mnih et al., “Human-level control through deep reinforcement learning,” Nature, 2015.
- Van Hasselt et al., “Deep Reinforcement Learning with Double Q-learning,” AAAI, 2016.
- Wang et al., “Dueling Network Architectures for Deep Reinforcement Learning,” ICML, 2016.
- Schaul et al., “Prioritized Experience Replay,” ICLR, 2016.
- Mnih et al., “Asynchronous Methods for Deep Reinforcement Learning,” ICML, 2016. (A3C)
- Schulman et al., “Trust Region Policy Optimization,” ICML, 2015.
- Schulman et al., “Proximal Policy Optimization Algorithms,” arXiv, 2017.
- Kempka et al., “ViZDoom: A Doom-based AI Research Platform for Visual Reinforcement Learning,” IEEE CIG, 2016.
All original Doom assets belong to their respective copyright holders.
- Changmin Lee (@LeeChangmin0310), Maintainer
- Suyeon Myung (@suyeonmyeong), Core Contributor
- Ui-Hyun Maeng (@maeng00), Core Contributor



