An AlphaZero-inspired project that learns to play Tic-Tac-Toe optimally from scratch using self-play reinforcement learning, Monte Carlo Tree Search (PUCT), and a ResNet policy-value network in PyTorch.
This mini-project reproduces the core ideas of DeepMind’s AlphaZero in a lightweight setting:
- Self-Play — the agent improves by playing itself; no human labels needed.
- MCTS (PUCT) — search balances exploration vs. exploitation using network priors.
- Policy-Value Net (ResNet) — a single network outputs both move probabilities (policy) and a scalar estimate of the expected outcome (value).
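Such a two-headed network might look like the following minimal PyTorch sketch (the class name `PolicyValueNet`, the 3-plane input, and the layer sizes are illustrative assumptions, not necessarily this repo's exact architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """One residual block: conv -> BN -> ReLU -> conv -> BN, plus a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.c1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.b1 = nn.BatchNorm2d(ch)
        self.c2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.b2 = nn.BatchNorm2d(ch)

    def forward(self, x):
        y = F.relu(self.b1(self.c1(x)))
        y = self.b2(self.c2(y))
        return F.relu(x + y)

class PolicyValueNet(nn.Module):
    """Tiny ResNet with a policy head (9 move logits) and a value head (scalar)."""
    def __init__(self, planes=3, ch=32, blocks=2):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(planes, ch, 3, padding=1),
                                  nn.BatchNorm2d(ch), nn.ReLU())
        self.body = nn.Sequential(*[ResBlock(ch) for _ in range(blocks)])
        self.policy = nn.Linear(ch * 9, 9)   # logits over the 9 squares
        self.value = nn.Linear(ch * 9, 1)    # squashed to [-1, 1] by tanh

    def forward(self, x):                    # x: (batch, planes, 3, 3)
        h = self.body(self.stem(x)).flatten(1)
        return self.policy(h), torch.tanh(self.value(h)).squeeze(-1)
```

The two heads share the residual trunk, so the policy and value estimates are learned from the same features, as in AlphaZero.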
- Encode state → tensor (e.g., 2 or 3 planes for current player, opponent, empties).
- Neural net forward pass → returns `(policy_logits, value)` for the position.
- MCTS uses these priors to guide search and produce an improved policy.
- Self-play game: sample moves from the MCTS policy; store `(state, π, z)`, where `π` is the MCTS policy and `z ∈ {-1, 0, +1}` is the final game outcome from the current player's perspective.
- Train the network to fit `π` (policy head, cross-entropy loss) and `z` (value head, MSE loss).
- Repeat: newer networks generate stronger training data.
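The PUCT rule that MCTS uses to pick which child to explore can be sketched in plain Python (the function names, the dict-based node representation, and the `c_puct=1.5` constant are assumptions for illustration, not this repo's actual code):

```python
import math

def puct_score(parent_visits, child_visits, child_value_sum, prior, c_puct=1.5):
    """PUCT: mean value Q plus an exploration bonus U scaled by the network prior P."""
    q = child_value_sum / child_visits if child_visits > 0 else 0.0
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + u

def select_child(children, parent_visits, c_puct=1.5):
    """Pick the child with the highest PUCT score.

    children: list of dicts with 'visits', 'value_sum', 'prior' keys.
    """
    return max(range(len(children)),
               key=lambda i: puct_score(parent_visits,
                                        children[i]["visits"],
                                        children[i]["value_sum"],
                                        children[i]["prior"],
                                        c_puct))
```

Unvisited children start with `Q = 0` but a large bonus `U` proportional to their prior, which is how the network's policy steers early exploration.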

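The combined training objective, cross-entropy against the soft MCTS policy `π` plus MSE on the outcome `z`, can be sketched as follows (the helper name `alphazero_loss` is an assumption; the repo may weight or regularize the terms differently):

```python
import torch
import torch.nn.functional as F

def alphazero_loss(policy_logits, value, pi_target, z_target):
    """AlphaZero-style loss: cross-entropy vs. the soft MCTS policy + MSE on z."""
    # pi_target is a full distribution, so use log-softmax against soft targets.
    policy_loss = -(pi_target * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
    value_loss = F.mse_loss(value, z_target)
    return policy_loss + value_loss
```

With uniform logits and a uniform target over 9 moves, the policy term equals `ln 9 ≈ 2.197`, a useful sanity check at the start of training.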