This repository contains a PyTorch transformer-based approach for predicting chess Elo from only the sequence of algebraic moves. The default training data come from Lichess.
- Clone the repo and navigate to it
- Obtain a list of zstandard-compressed PGN files to train on and save it under the name `game_list.txt`. The default is a list of all games from January 2013 through September 2023 from Lichess.
- Change the settings in `settings.py`:
  a. `FILE_DIR` - where the data is stored. This location should have enough storage for all the decompressed files (at least several terabytes).
  b. `EMBEDDING_DIM` - the number of dimensions used for the move embedding (default=72)
  c. `SEQUENCE_LENGTH` - the maximum number of moves permitted in a game (default=150)
- Download and process the data using `download_data.py`. The parameters passed to `parse_file` can be modified to filter the data.
  a. By default the data is downloaded and parsed using five processes before being combined and split into training, validation, and testing sets.
- Run either `create_indices.py` to generate an index mapping for every unique move (this will cause training to use a trainable `Embedding` layer), or run `create_embeddings.py` to use `Word2Vec` to pretrain embeddings.
  a. Pretrained embeddings from `Word2Vec` are `embeddings.model` and `move_vecs.wordvectors`; note that these embeddings are of length 100.
- Run the training using either
`train.py` or `train_ddp.py`. This defaults to using trainable embeddings, and the setup may require tuning for the available memory. The current settings are for a setup with four RTX 2080 Tis with 11 GB of memory each.
  a. `train.py` will use PyTorch's `DataParallel` to run on multiple GPUs if applicable; otherwise it will target the GPU or CPU depending on availability.
  b. `train_ddp.py` uses PyTorch's `DistributedDataParallel` to run on multiple GPUs. However, this was slower than the simpler `DataParallel` approach.
  c. The default model is `modelAvg`; see the How it works section for more details.
- Training progress can be checked using
`check_training.py`, which will generate basic plots to check the current training status.
- After training, run `generate_val_results.py` and then `get_percentiles.py` to run the validation scripts.
- If desired, run `convert_to_onnx.py` to create ONNX versions of the models.
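For reference, the three settings above might look like this in `settings.py` (the path is a placeholder, not a real default; the numeric values are the stated defaults):

```python
# settings.py - central configuration read by the other scripts.
FILE_DIR = "/data/chess"   # placeholder; needs several TB for decompressed PGNs
EMBEDDING_DIM = 72         # dimensions of each move embedding
SEQUENCE_LENGTH = 150      # maximum number of moves kept per game
```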
Work in progress
Work in progress
- The raw data should be PGN (Portable Game Notation) files. These are looped through line by line, and the game year/month, time controls (both base and bonus), black and white Elos, and move sequences are parsed out. The result of the game and whether it ended in checkmate are also parsed out, although ideally the model shouldn't need this information.
  a. The data will be parsed with 5 processes (more can cause the Lichess server to refuse to respond). Each process will write to a different text file, with every game on a separate line. Both the downloaded and the decompressed files use temporary files.
  b. After all the data is downloaded, the first file will be split in two for the validation and testing data, and the other 4 will be combined into the training data. The original files will be deleted afterwards.
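The line-by-line parsing described above can be sketched roughly as follows; `parse_game` is a hypothetical helper, not the repository's actual `parse_file`, and real Lichess PGNs need more error handling (e.g. missing Elo tags):

```python
import re

def parse_game(lines):
    """Extract the fields described above from one PGN game's lines (sketch)."""
    game = {}
    for line in lines:
        m = re.match(r'\[(\w+) "(.*)"\]', line)
        if m:                                      # header tag, e.g. [WhiteElo "1845"]
            tag, value = m.groups()
            if tag == "WhiteElo":
                game["white_elo"] = int(value)
            elif tag == "BlackElo":
                game["black_elo"] = int(value)
            elif tag == "TimeControl":             # formatted as "base+bonus" seconds
                base, _, bonus = value.partition("+")
                game["base"], game["bonus"] = int(base), int(bonus or 0)
            elif tag == "Result":
                game["result"] = value
        elif line and not line.startswith("["):    # the move-text line
            game["moves"] = [t for t in line.split()
                             if not t[0].isdigit() and t != "*"]
            game["checkmate"] = "#" in line        # any move ending in mate
    return game
```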
- How stationary the Elo target is can be checked with `check_year_scores.py`; this loops through two years (default 2015 and 2022), grabs all games with identical move sequences, and compares the Elos.
- `create_indices.py` generates a JSON mapping of every unique move to an integer index; `create_onehot.py` does the same thing except each value is a list with zeros everywhere except a 1 at the index.
- `create_embeddings.py` uses `gensim`'s `Word2Vec` to create pretrained embeddings of each move using a window size of 3 and an embedding size of 72. This isn't the approach used in the actual models, so it is only kept as legacy code.
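The index-mapping idea can be illustrated as follows; the hypothetical `build_move_index` below is a sketch, not the actual `create_indices.py` code:

```python
import json

def build_move_index(games):
    """Map every unique move seen across all games to an integer index."""
    index = {}
    for moves in games:
        for move in moves:
            if move not in index:
                index[move] = len(index)
    return index

# Example: two games' move sequences -> a JSON-serializable mapping
index = build_move_index([["e4", "e5", "Nf3"], ["d4", "e5"]])
serialized = json.dumps(index)  # this kind of mapping is what gets written to disk
```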
There are several main model options in `train.py` (`train_ddp.py` is an older version and has not been updated with all options):
The main model approaches follow the diagram above.
This model first passes the sequence of moves through an `Embedding` layer and then a `TransformerEncoder`. It then takes a mean down the embedding dimension (so the result is the same length as the sequence) and passes the result through linear layers until the black and white Elos are predicted.
This model is the same as `modelAvg`, except that after the mean it concatenates the base and bonus time values prior to the linear layers.
This model is the same as `modelAvg`, except that after the mean it concatenates the game result prior to the linear layers.
This model is the same as `modelAvg`, except that after the mean it concatenates both the base and bonus times and the game result prior to the linear layers.
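The `modelAvg` pipeline described above (embed, encode, mean over the embedding dimension, then linear layers) can be sketched in PyTorch as follows; the number of heads, encoder layers, and hidden size are illustrative assumptions, not the repository's actual values:

```python
import torch
import torch.nn as nn

class ModelAvgSketch(nn.Module):
    """Embedding -> TransformerEncoder -> mean over the embedding
    dimension -> linear layers -> (white Elo, black Elo)."""
    def __init__(self, n_moves, emb_dim=72, seq_len=150):
        super().__init__()
        self.embedding = nn.Embedding(n_moves, emb_dim)
        layer = nn.TransformerEncoderLayer(d_model=emb_dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Sequential(               # seq_len inputs after the mean
            nn.Linear(seq_len, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, moves):                    # moves: (batch, seq_len) move indices
        x = self.encoder(self.embedding(moves))  # (batch, seq_len, emb_dim)
        x = x.mean(dim=-1)                       # mean down the embedding dimension
        return self.head(x)                      # (batch, 2): white and black Elo
```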
A few other models were also experimented with.
This model first passes the sequence of moves through an `Embedding` layer. It then flattens the result and passes it through linear layers until the black and white Elos are predicted.
This approach is faster to train and achieves similar results to the transformer-based approaches, but requires orders of magnitude more weights.
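The weight blow-up is easy to see with the default settings: after flattening, the first linear layer sees 150 × 72 = 10,800 inputs, versus 150 after the mean. A quick arithmetic check (the hidden size is an illustrative assumption):

```python
SEQ_LEN, EMB_DIM, HIDDEN = 150, 72, 512    # HIDDEN is illustrative, not the repo's value

flat_inputs = SEQ_LEN * EMB_DIM            # flatten: every position x every dimension
mean_inputs = SEQ_LEN                      # mean over the embedding dimension

flat_weights = flat_inputs * HIDDEN        # first linear layer, flattened input
mean_weights = mean_inputs * HIDDEN        # first linear layer, averaged input
print(flat_weights // mean_weights)        # 72x more weights in that one layer
```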
This model first passes the sequence of moves through an `Embedding` layer and then a `TransformerEncoder`. The results are then flattened and passed through linear layers until the black and white Elos are predicted.
This approach is far slower and larger than it needs to be (the worst of both the transformer and linear worlds).