Train and evaluate reinforcement learning agents for Tetris using
tetris-gymnasium. The repo includes:
- DQN after-state agent with Dellacherie features (holes, bumpiness, aggregate height, lines cleared).
- Baseline heuristics (greedy, random, hard drop) and a manual-play script.
This project uses the Gymnasium-compatible tetris-gymnasium environment
(gym.make("tetris_gymnasium/Tetris")). The score reported in baseline scripts
and evaluation is the environment return:
env_return = sum_t env_reward_t
where env_reward_t is the raw reward returned by env.step(...) at each step
(summed over the whole macro-action sequence in the after-state agent). The exact reward scheme is
defined inside tetris-gymnasium:
env_reward_t = 0.001 (alive bonus)
+ 1.0 * lines_cleared_t
- 2.0 * [game_over_t]
The reported score/return in the scripts is the sum of these per-step rewards over an episode.
The default ActionsMapping in tetris-gymnasium is:
move_left: 0
move_right: 1
move_down: 2
rotate_clockwise: 3
rotate_counterclockwise: 4
hard_drop: 5
swap: 6
no_operation: 7
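A macro-action can be assembled directly from these integer IDs. The sketch below is illustrative (the constants mirror the default ActionsMapping above; `make_macro` is a hypothetical helper, not the repo's exact code):

```python
# Default tetris-gymnasium action IDs, mirrored here as plain constants.
MOVE_LEFT, MOVE_RIGHT, MOVE_DOWN = 0, 1, 2
ROTATE_CW, ROTATE_CCW = 3, 4
HARD_DROP, SWAP, NO_OP = 5, 6, 7

def make_macro(rotations: int, shift: int) -> list[int]:
    """Build a macro-action: rotate `rotations` times clockwise,
    shift `shift` columns (negative = left), then hard drop."""
    actions = [ROTATE_CW] * rotations
    actions += [MOVE_LEFT] * max(-shift, 0) + [MOVE_RIGHT] * max(shift, 0)
    actions.append(HARD_DROP)
    return actions

# Rotate once, move two columns left, then drop:
# make_macro(1, -2) -> [3, 0, 0, 5]
```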
The greedy baseline in tetris_code/policies.py scores many candidate
placements and picks the one with the best heuristic score.
- A micro-action is one primitive environment action (see ActionsMapping above).
- A sequence of actions is a short list of micro-actions.
- A macro-action is a sequence of micro-actions that places the current piece (e.g., rotate, move left/right several times, then hard drop); the set of enumerable macro-actions corresponds to the legal placements available from state s.
The Tetris state space is enormous. A conservative lower bound is the number of binary board configurations for a 20x10 grid:
2^(200) ~= 1.6e60
The true state space is much larger because it also depends on the current piece, its orientation, its position, the queue of upcoming pieces, and other environment details.
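The lower bound above is a one-liner to check:

```python
# Lower bound on the Tetris state space: binary occupancy of a 20x10 board.
n_cells = 20 * 10
lower_bound = 2 ** n_cells
print(f"{lower_bound:.2e}")  # ~1.61e+60
```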
This repo uses a Double DQN (DDQN) with after-states and macro-actions. The design choices below are tailored for Tetris and for data efficiency.
- A macro-action is a short sequence of primitive actions (rotate, shift, hard drop) that results in placing the current piece. We enumerate many such sequences and treat each as one decision.
- An after-state is the board immediately after a macro-action completes. Evaluating these resulting boards directly shrinks the decision problem: the agent compares a handful of candidate placements instead of reasoning over every low-level step.
- The network estimates a value for the after-state, which is combined with the immediate shaped reward to choose the best candidate:
score(candidate) = shaped_reward(candidate) + gamma * V(after_state)
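The candidate-selection rule above can be sketched as follows (a minimal illustration; `pick_macro` and the `(macro, shaped_reward, afterstate_features)` tuple layout are assumptions, not the repo's exact interface):

```python
import numpy as np

def pick_macro(candidates, value_fn, gamma=0.99):
    """Greedy selection over enumerated placements.

    candidates: list of (macro_action, shaped_reward, afterstate_features)
    value_fn:   maps an after-state feature vector to a scalar V(after_state)
    """
    scores = [r + gamma * value_fn(feats) for _, r, feats in candidates]
    best = int(np.argmax(scores))
    return candidates[best][0]
```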
We use the Dellacherie feature set, a compact, hand-crafted summary that is known to correlate well with Tetris performance:
- holes: empty cells with at least one filled cell above in the same column.
- bumpiness: sum of absolute height differences between adjacent columns.
- aggregate height: sum of all column heights.
- lines cleared: number of lines cleared by the last placement.
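The first three features can be computed from a binary occupancy grid. A minimal sketch, assuming row 0 is the top of the board (`board_features` is a hypothetical helper, not the repo's exact implementation):

```python
import numpy as np

def board_features(board: np.ndarray) -> dict:
    """Feature terms from a binary board (1 = filled), row 0 at the top."""
    rows, cols = board.shape
    # Column height = number of rows from the topmost filled cell to the floor.
    heights = np.where(board.any(axis=0), rows - board.argmax(axis=0), 0)
    # Holes: empty cells with at least one filled cell above in the same column.
    holes = int(sum(
        np.sum(board[rows - h:, c] == 0) for c, h in enumerate(heights)
    ))
    # Bumpiness: sum of absolute height differences between adjacent columns.
    bumpiness = int(np.abs(np.diff(heights)).sum())
    return {
        "holes": holes,
        "bumpiness": bumpiness,
        "aggregate_height": int(heights.sum()),
    }
```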
Why these features:
- They capture key board quality factors (holes and bumpiness are strongly predictive of future failure).
- They are low-dimensional and data efficient, enabling fast learning with a small MLP instead of a much larger CNN.
- They reflect classic Tetris heuristics and reduce the need for very large datasets or long training runs.
Training uses shaped rewards to provide denser feedback than the raw env score. The shaping encourages line clears and smooth stacks while penalizing bad placements:
line_reward: {1: 1, 2: 3, 3: 6, 4: 12}
shaped_reward = line_reward - 2.0 * new_holes - 0.5 * bump_increase + 0.1
This makes learning more stable by signaling progress even when full lines are not cleared every move.
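The shaping formula above translates directly into code (a sketch; the function name and signature are illustrative):

```python
# Bonus per number of lines cleared by a single placement.
LINE_REWARD = {0: 0, 1: 1, 2: 3, 3: 6, 4: 12}

def shaped_reward(lines_cleared: int, new_holes: int, bump_increase: float) -> float:
    """Per-placement training reward: reward line clears, penalize new holes
    and increased bumpiness, plus a small +0.1 survival bonus."""
    return LINE_REWARD[lines_cleared] - 2.0 * new_holes - 0.5 * bump_increase + 0.1
```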
Evaluation reports the environment return (not the shaped reward). The printed "Env return" is the running sum of raw env rewards for a single episode, and the final "Average env return" is the mean over all evaluation episodes:
env_return = sum_t env_reward_t
We store transitions in a replay buffer to break correlations between sequential decisions and to reuse experience:
- state_features: after-state feature vector for the chosen macro-action.
- reward: shaped reward for that placement.
- done: terminal flag.
- next_features: features for all next candidate after-states.
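A buffer holding these fields can be as simple as a bounded deque with uniform sampling (a minimal sketch under the field layout above, not the repo's exact class):

```python
import random
from collections import deque, namedtuple

Transition = namedtuple(
    "Transition", ["state_features", "reward", "done", "next_features"]
)

class ReplayBuffer:
    """Fixed-size FIFO buffer; uniform random sampling breaks the
    correlation between consecutive placements."""

    def __init__(self, capacity: int = 50_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off

    def push(self, *args):
        self.buffer.append(Transition(*args))

    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```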
Training flow:
- Enumerate candidate macro-actions and their after-state features.
- Select a macro-action using epsilon-greedy over the DDQN score.
- Execute the macro-action, compute shaped reward, and push to the buffer.
- Sample random mini-batches and update the policy network.
- Periodically sync the target network from the policy network.
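The steps above can be sketched as one episode loop (all helpers here are hypothetical stand-ins for the repo's enumeration, selection, and update functions):

```python
def run_episode(enumerate_candidates, execute, buffer, select, update,
                sync_every=500, step=0):
    """One training episode over macro-action decisions."""
    done = False
    while not done:
        candidates = enumerate_candidates()           # macro-actions + after-state features
        choice = select(candidates)                   # epsilon-greedy over DDQN scores
        reward, done, feats, next_feats = execute(choice)
        buffer.append((feats, reward, done, next_feats))
        update(buffer)                                # mini-batch gradient step
        step += 1
        if step % sync_every == 0:
            pass  # sync target network from policy network here
    return step
```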
We use a DDQN target to reduce overestimation bias:
if done or no next candidates:
    target = reward
else:
    a* = argmax_a Q_policy(next_after_state_a)
    target = reward + gamma * Q_target(next_after_state_{a*})
loss = MSE(Q_policy(current_after_state), target)
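The target computation for a single transition can be sketched with plain NumPy (a minimal illustration; `ddqn_target` and the callable Q-functions are assumptions, not the repo's network code):

```python
import numpy as np

def ddqn_target(reward, done, next_feats, q_policy, q_target, gamma=0.99):
    """Double-DQN target for one transition.

    next_feats: matrix of candidate after-state feature vectors (may be empty).
    q_policy / q_target: map a feature matrix to a vector of values.
    """
    if done or len(next_feats) == 0:
        return reward
    a_star = int(np.argmax(q_policy(next_feats)))        # select with policy net
    return reward + gamma * float(q_target(next_feats)[a_star])  # evaluate with target net
```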
- The max operator in Q-learning can overestimate values. DDQN reduces this by selecting actions with the policy network and evaluating them with a separate target network.
- The target network changes slowly (synced periodically), which stabilizes learning and avoids chasing a moving target at every update.
- Tetris has high branching and delayed consequences; value-based learning with a replay buffer is effective and sample efficient.
- After-states simplify the decision by evaluating final placements rather than every low-level step.
- Reward shaping provides dense feedback, which helps in a sparse-reward game.
- The feature vector is small and encodes expert knowledge, reducing the amount of data needed to learn a good policy.
- A small MLP trains quickly and is easier to stabilize than a deep CNN.
- The raw grid has large spatial redundancy; for this project, features are a strong bias that speeds learning.
- Early episodes often end quickly and may skip training updates until the replay buffer is warm. Later episodes include full backprop updates each step.
- As the policy improves, episodes last longer (more macro-actions per episode), which increases total compute time per episode.
Epsilon is the probability of taking a random macro-action instead of the greedy one. We decay it over time:
epsilon(step) = end + (start - end) * exp(-step / decay)
High epsilon early encourages exploration; lower epsilon later favors exploitation of the learned policy.
The training script saves a plot of the average return per episode and epsilon
over time in sources/dqn_afterstate_training.png:
- The agent scores ~40k on average over 10 episodes with a standard deviation of ~17k.
- This is about 100x higher than the greedy baseline.
- More compute improves performance; as the policy improves, evaluation runs take longer.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
If pip errors on cv2, remove the cv2 line from requirements.txt and keep
opencv-python installed (cv2 is provided by the opencv-python package).
DQN after-state (feature-based):
python train_dqn_afterstate.py
Use --render to watch training (slow).
python evaluate_dqn_afterstate.py --episodes 10
Add --render to visualize gameplay. For the policy gradient agent, pass
--stochastic to sample actions instead of greedy play.
python Baseline/view_episode_policy_greedy.py
python Baseline/view_episode_policy_random.py
python Baseline/view_episode_policy_down.py
python Baseline/evaluate_policy_greedy.py
python Baseline/play_tetris.py
- Checkpoints in checkpoints/ are pre-trained weights. The training and evaluation scripts default to DQN_scripts/checkpoints/; override with --save-path and --model-path to use the existing checkpoints/ folder. Tetris_RL_script_runner.ipynb shows an end-to-end workflow.
