A Python implementation of a chess neural network engine with policy and value heads, trained with supervised learning from PGN games. It uses an AlphaZero-style architecture with PUCT search driven by the network's value estimates.
- 4672 Move Encoding: 64 squares × 73 planes (56 queen-like, 8 knight, 9 underpromotion)
- Board Encoding: 18 channels (12 pieces, side-to-move, castling rights, en passant)
- Dual-Head ResNet Architecture:
  - Policy Head: outputs move probabilities as `[4672]` logits
  - Value Head: outputs a position evaluation in `[-1, 1]` (win probability from the current player's perspective)
- Training:
- Supervised learning from PGN games (policy + value from game outcomes)
- Combined loss: Cross-entropy (policy) + MSE (value)
- Label smoothing, mixed precision (AMP), gradient clipping
- PUCT Search:
- Uses neural network policy for move priors
- Uses neural network value for position evaluation (replaces material heuristic)
- Configurable simulations and exploration constant
- UCI Interface: Standard chess engine protocol for GUI integration
Using Poetry:
- Install Poetry if needed (`pipx install poetry`, or see the Poetry docs), then install the dependencies: `poetry install`
Or using pip (Linux/macOS): `pip install -r <(printf "python-chess\ntorch\nnumpy\ntqdm\n")`
Python 3.9+ recommended.
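A quick sanity check that the core dependencies import correctly (this is just an environment check, not part of the project scripts):

```python
# Verify that python-chess and PyTorch are installed and importable.
import chess
import torch

print("python-chess", chess.__version__)
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```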
- `src/chess_policy/encoding.py` — Board → tensor encoding (18 channels) and legal move masks
- `src/chess_policy/move_index.py` — Fixed 4672 index mapping and move conversions
- `src/chess_policy/model.py` — ResNet models: `PolicyOnlyResNet` (policy-only, for backward compatibility) and `PolicyValueResNet` (policy + value heads, AlphaZero-style, default)
- `src/chess_policy/data.py` — PGN dataset with value targets extracted from game results
- `src/chess_policy/train.py` — Supervised training (policy + value losses)
- `src/chess_policy/infer.py` — Inference helpers (policy logits, value extraction)
- `src/chess_policy/puct.py` — PUCT search with neural network value integration
- `src/chess_policy/uci.py` — UCI engine interface
- `scripts/train_policy.py` — Train on PGN files (policy + value)
- `scripts/play_uci.py` — Run UCI engine with PUCT search
The default training setup uses a shared-residual ResNet with dual heads:
- Input: `[B, 18, 8, 8]` tensor from `encoding.board_to_tensor` (board, side-to-move, castling, en passant)
- Stem: `Conv2d(18 → width, 3×3)` + BatchNorm + ReLU
- Trunk: `n_blocks` × `ResidualBlock(width)` (each block = 3×3 Conv → BN → ReLU → 3×3 Conv → BN + skip connection)
- Policy head: 1×1 `Conv(width → width)` → BN → ReLU → 1×1 `Conv(width → 73)` → reshape to `[B, 73, 8, 8]` → flatten to `[B, 4672]`
- Value head: 1×1 `Conv(width → 32)` → BN → ReLU → global pooling `AdaptiveAvgPool2d(1)` → flatten to `[B, 32]` → `Linear(32 → 1)` + `tanh` → `[B, 1]` in `[-1, 1]`
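For reference, the layout above corresponds roughly to the following PyTorch module. This is a hand-written sketch of the described architecture, not the actual `PolicyValueResNet` from `src/chess_policy/model.py`; class and attribute names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualBlock(nn.Module):
    """3x3 Conv -> BN -> ReLU -> 3x3 Conv -> BN, plus a skip connection."""

    def __init__(self, width: int):
        super().__init__()
        self.conv1 = nn.Conv2d(width, width, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(width)
        self.conv2 = nn.Conv2d(width, width, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(width)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)


class PolicyValueSketch(nn.Module):
    """Illustrative dual-head ResNet matching the layout above (not the repo's exact class)."""

    def __init__(self, width: int = 64, blocks: int = 8):
        super().__init__()
        # Stem: 18 input planes -> width channels.
        self.stem = nn.Sequential(
            nn.Conv2d(18, width, 3, padding=1, bias=False),
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
        )
        self.trunk = nn.Sequential(*[ResidualBlock(width) for _ in range(blocks)])
        # Policy head: 1x1 convs down to 73 move planes -> 64 * 73 = 4672 logits.
        self.policy_head = nn.Sequential(
            nn.Conv2d(width, width, 1, bias=False),
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, 73, 1),
        )
        # Value head: 1x1 conv -> global average pooling -> linear -> tanh.
        self.value_conv = nn.Sequential(
            nn.Conv2d(width, 32, 1, bias=False),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
        )
        self.value_fc = nn.Linear(32, 1)

    def forward(self, x):                            # x: [B, 18, 8, 8]
        h = self.trunk(self.stem(x))                 # [B, width, 8, 8]
        policy = self.policy_head(h).flatten(1)      # [B, 4672]
        v = self.value_conv(h)                       # [B, 32, 8, 8]
        v = F.adaptive_avg_pool2d(v, 1).flatten(1)   # [B, 32]
        value = torch.tanh(self.value_fc(v))         # [B, 1] in [-1, 1]
        return policy, value
```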
The `scripts/train_policy.py` entrypoint builds a `PolicyValueResNet` with:
- Default: `width=64`, `blocks=8` (≈1–2M params)
- Tunable via CLI: `--width` and `--blocks`
The `PolicyValueResNet` model outputs both policy and value:
- Policy: `[B, 4672]` logits (move probabilities)
- Value: `[B, 1]` in the range `[-1, 1]`, the position evaluation from the current player's perspective:
  - `+1.0` = current player is winning
  - `-1.0` = current player is losing
  - `0.0` = draw
PUCT (Predictor + UCT) search uses:
- Policy priors: From neural network policy head
- Value estimates: From neural network value head (replaces material heuristic)
- MCTS: Monte Carlo Tree Search guided by a PUCT (UCB-style) selection formula (see the sketch after the parameter list below)
The search is configured via:
- `--sims`: Number of simulations (default: 200–400)
- `--c_puct`: Exploration constant (default: 1.2)
- `use_nn_value`: Use neural network value (default: True)
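The node-selection rule is the standard PUCT formula, `Q(s, a) + c_puct · P(s, a) · sqrt(N(s)) / (1 + N(s, a))`. A minimal sketch of per-child scoring is shown below; the statistic names (`parent_visits`, `child_value_sum`, `prior`) are illustrative and need not match those in `puct.py`.

```python
import math


def puct_score(parent_visits: int, child_visits: int, child_value_sum: float,
               prior: float, c_puct: float = 1.2) -> float:
    """PUCT score: exploitation term Q plus prior-weighted exploration bonus U."""
    # Q: mean backed-up value of the child (0 for an unvisited child).
    q = child_value_sum / child_visits if child_visits > 0 else 0.0
    # U: large for high-prior moves that have been visited rarely.
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + u


# At each node, the search descends into the legal move maximizing puct_score,
# then backs the network's value estimate up along the visited path.
```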
Run the model directly from Python using the helpers in `chess_policy`:

```python
import chess
import torch

from chess_policy.train import load_checkpoint
from chess_policy.infer import choose_move

device = "cuda" if torch.cuda.is_available() else "cpu"
model = load_checkpoint("models/best.pt")  # or your .pt file

board = chess.Board()  # start position (or set FEN)
move, probs, value = choose_move(board, model, device=device, temperature=1.0, sample=False)

print("Chosen move:", move.uci() if move else None)
print("Position value (current side):", f"{value:+.3f}")
```

`choose_move` takes care of:
- Encoding the board to `[1, 18, 8, 8]`
- Running the model (policy-only or policy+value)
- Masking illegal moves and sampling/argmax over the 4672 logits
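The masking step means the 4672-way softmax is effectively restricted to the current legal moves. A minimal sketch of that idea is below; `move_to_index` stands in for whatever move-to-index mapping `move_index.py` provides (the name and signature are assumed here).

```python
import chess
import torch


def masked_move_probs(board: chess.Board, logits: torch.Tensor, move_to_index):
    """Restrict the [4672] policy logits to legal moves and renormalize.

    `move_to_index(board, move)` is an assumed helper returning the fixed
    policy index of a move; illegal moves simply get probability zero.
    """
    legal = list(board.legal_moves)
    idx = torch.tensor([move_to_index(board, m) for m in legal])
    probs = torch.softmax(logits[idx], dim=0)  # softmax over legal entries only
    return {m: p.item() for m, p in zip(legal, probs)}
```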
Train a model with both policy and value heads on PGN game data:
```bash
poetry run python scripts/train_policy.py \
  games.pgn \
  --epochs 10 \
  --batch_size 256 \
  --width 64 \
  --blocks 8 \
  --lr 3e-4 \
  --value_loss_weight 1.0 \
  --out policy_value.pt
```

Key Parameters:
- `--value_loss_weight`: Weight for value loss vs. policy loss (default: 1.0)
- The model automatically extracts value targets from PGN game results
- All positions from a winning game get value `+1.0` (from the winner's perspective)
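Conceptually, the dataset built by `data.py` pairs every position with the move actually played and the final result. A simplified sketch of that extraction with `python-chess` (not the repo's actual code):

```python
import chess
import chess.pgn

RESULT_TO_WHITE_SCORE = {"1-0": 1.0, "0-1": -1.0, "1/2-1/2": 0.0}


def iter_samples(pgn_path: str):
    """Yield (fen, played_move_uci, value_from_side_to_move) for every position."""
    with open(pgn_path, encoding="utf-8", errors="replace") as handle:
        while True:
            game = chess.pgn.read_game(handle)
            if game is None:
                break
            result = game.headers.get("Result", "*")
            if result not in RESULT_TO_WHITE_SCORE:
                continue  # skip unfinished or unknown-result games
            white_score = RESULT_TO_WHITE_SCORE[result]
            board = game.board()
            for move in game.mainline_moves():
                # Game outcome from the current player's perspective.
                value = white_score if board.turn == chess.WHITE else -white_score
                yield board.fen(), move.uci(), value
                board.push(move)
```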
Run the engine over UCI with PUCT search:

```bash
poetry run python scripts/play_uci.py \
  --model policy_value.pt \
  --puct \
  --sims 400 \
  --c_puct 1.2
```

The engine uses:
- Neural network policy for move priors
- Neural network value for position evaluation
- PUCT search for move selection
The engine supports two modes:
- Greedy Policy: Direct move selection from the policy head (fast, weaker):
  `poetry run python scripts/play_uci.py --model policy_value.pt`
- PUCT Search: Monte Carlo Tree Search with neural network guidance (slower, stronger):
  `poetry run python scripts/play_uci.py --model policy_value.pt --puct --sims 400`
PUCT Parameters:
- `--sims`: Number of search simulations (more = stronger but slower)
- `--c_puct`: Exploration constant (higher = more exploration)
- `--use_nn_value`: Use neural network value (default: True)
UCI Commands:
- `uci` → Engine info
- `isready` → Check readiness
- `position startpos moves e2e4 e7e5` → Set position
- `go` → Get best move
- `quit` → Exit
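You can also drive the engine programmatically through `python-chess` instead of typing UCI commands by hand. The launch command below mirrors the PUCT invocation above; note that the engine may run its configured number of simulations regardless of the time limit passed here.

```python
import chess
import chess.engine

# Start the UCI engine as a subprocess (same command as the manual invocation).
engine = chess.engine.SimpleEngine.popen_uci(
    ["poetry", "run", "python", "scripts/play_uci.py",
     "--model", "policy_value.pt", "--puct", "--sims", "400"]
)

board = chess.Board()
result = engine.play(board, chess.engine.Limit(time=5.0))  # sends position + go
print("Engine plays:", result.move.uci())

engine.quit()
```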
The value head learns from PGN game results:
- Extracts the game outcome from PGN headers (`[Result "1-0"]`, etc.)
- Converts it to a value target from each position's perspective:
  - If White won: `+1.0` when White to move, `-1.0` when Black to move
  - If Black won: `-1.0` when White to move, `+1.0` when Black to move
  - If draw: `0.0` for both sides
- Trained with MSE loss: `value_loss = MSE(value_pred, value_target)`
Total loss combines policy and value: `loss = policy_loss + value_loss_weight * value_loss`
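A minimal sketch of this combined objective (the argument names mirror the CLI flags; the actual `train.py` may differ, e.g. in its label-smoothing default):

```python
import torch
import torch.nn.functional as F


def combined_loss(policy_logits: torch.Tensor, value_pred: torch.Tensor,
                  target_move_idx: torch.Tensor, value_target: torch.Tensor,
                  value_loss_weight: float = 1.0, label_smoothing: float = 0.05):
    """Cross-entropy over the 4672 move classes plus weighted MSE on the value head."""
    policy_loss = F.cross_entropy(policy_logits, target_move_idx,
                                  label_smoothing=label_smoothing)
    value_loss = F.mse_loss(value_pred.squeeze(-1), value_target)
    return policy_loss + value_loss_weight * value_loss
```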
Recommended Hyperparameters:
- Model: `width=64-128`, `blocks=8-12` (~1–3M params)
- Optimizer: AdamW, `lr=2e-4` to `3e-4`
- Batch size: 256–512
- Value loss weight: 0.5–2.0 (default: 1.0)
- Label smoothing: 0.0–0.1
- PUCT: `sims=200-800`, `c_puct=1.0-1.5`
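Putting these pieces together, one training step with AdamW, mixed precision (AMP), and gradient clipping might look like the sketch below; it reuses the `PolicyValueSketch` and `combined_loss` sketches from earlier sections and is illustrative rather than the repo's `train.py`.

```python
import torch
from torch.nn.utils import clip_grad_norm_

device = "cuda" if torch.cuda.is_available() else "cpu"
model = PolicyValueSketch(width=64, blocks=8).to(device)   # sketch class from above
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))


def train_step(boards, move_targets, value_targets):
    """One supervised step: AMP forward pass, combined loss, clipped backward, AdamW update."""
    boards = boards.to(device)
    move_targets = move_targets.to(device)
    value_targets = value_targets.to(device)

    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):
        policy_logits, value_pred = model(boards)
        loss = combined_loss(policy_logits, value_pred, move_targets, value_targets)

    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                      # unscale so clipping sees true gradient norms
    clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```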
- Value Targets: Uses final game outcome for all positions (simplified but effective)
- Search Depth: PUCT uses limited simulations compared to strong engines
- No Self-Play: Currently supervised learning only (self-play coming soon)
- Estimated Strength: ~1200-1800 Elo (depends on training data and model size)
See TRAINING_GUIDE.md for comprehensive documentation.
MIT (or adjust as you prefer)