├── Notebooks/ # Jupyter notebooks with alternative code to filter and reformulate data
├── data/
│   ├── comments/ # All comments extracted from games
│   ├── evaluations/ # Results of model evaluation
│   ├── raw/ # All raw .pgn files to create the dataset
│   ├── *.py # Utility scripts used to manipulate the dataset
│   └── *.csv # Partial or complete datasets
├── models/ # Weights of trained models
├── modules/ # Modules of the project
├── controller.py # Contains the main functions of the project
└── main.py # Entry point of the code
This code was written with Python 3.10. I haven't tested it with any other version.
This code was made to run on a computer with CUDA version >= 12.8. It might work with other versions of CUDA, but I haven't tested them, and you would need to install the matching version of PyTorch. The code should also work on a computer with no GPU, although this is not recommended.
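To see which Python version, PyTorch build and CUDA version are present on a machine, a small check along these lines may help. This is only a sketch and not part of the project: the function name environment_report is made up for illustration, and PyTorch is imported lazily so the check also runs before the requirements are installed.

```python
import importlib.util
import sys


def environment_report():
    """Collect basic facts about the interpreter and the PyTorch install."""
    report = {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
        "cuda_version": None,
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the check also runs without PyTorch

        report["cuda_available"] = torch.cuda.is_available()
        # torch.version.cuda is None on CPU-only builds of PyTorch
        report["cuda_version"] = torch.version.cuda
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If cuda_available is False, the code should still run on the CPU, only more slowly.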
The required libraries are listed in the requirements.txt file and can be installed with:
pip install -r ./requirements.txt
You can copy the .env.example file to create your .env and fill it with the proper values.
ENGINE_PATH="/path/to/chess/engine"
DATA_PATH="./data"
DATA_RAW_PATH="./data/raw"
DATA_ANALYZED_PATH="./data/analyzed"
DATA_COMMENTS_PATH="./data/comments"
DATA_EVALUATIONS_PATH="./data/evaluations"
MODEL_PATH="./models"
HUGGING_FACE_TOKEN="TOKEN HERE"
Download Stockfish and set ENGINE_PATH to the path where you installed it.
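Before the first run, it can be useful to confirm that the .env file defines every variable listed above. The sketch below uses only the standard library and a deliberately simple KEY="value" parser; the helper names parse_env and missing_keys are hypothetical, and how the project itself loads the file is not assumed here.

```python
from pathlib import Path

# Variable names taken from the .env example above.
REQUIRED_KEYS = [
    "ENGINE_PATH",
    "DATA_PATH",
    "DATA_RAW_PATH",
    "DATA_ANALYZED_PATH",
    "DATA_COMMENTS_PATH",
    "DATA_EVALUATIONS_PATH",
    "MODEL_PATH",
    "HUGGING_FACE_TOKEN",
]


def parse_env(text):
    """Parse simple KEY="value" lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    env_file = Path(".env")
    values = parse_env(env_file.read_text()) if env_file.exists() else {}
    print("Missing or empty:", missing_keys(values) or "none")
```

The check only confirms that the values are set. It does not verify that ENGINE_PATH points to a working engine or that the token is valid.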
The script can be run using the following command:
python main.py
This creates a dataset by using Gemma3 1B it to filter and reformulate comments from the ./data/raw folder.
Gemma3 1B it will then be trained on the resulting dataset.
The script's behavior can be changed with the following parameters.
-
--debug
Default: True
Show debug information.
-
--llm
Default: "google/gemma-3-1b-it"
LLM used as base model. This is the model that will be trained on the dataset.
-
--dataset
Default: None
The path to the CSV file used for training/evaluation. If None, the script will create a dataset from the content in ./data/raw/. The CSV file should contain at least the following columns: "moves", "engine_eval", "engine_best_line", "engine_best_alternative" and "reformulated". However, if you want the model to use different columns in its prompts or as a target, you can modify the parameters input_columns and input_target in main.py.
You can use an existing dataset with the following command:
python main.py --dataset "./Notebooks/reformulated_data_20250902_163137.csv"
-
--trained_model
Default: None
Path to a trained model. If None, a model will be trained with the given parameters. By default, trained models are saved in ./models.
-
--llm_filter
Default: "google/gemma-3-1b-it"
LLM used to filter/reformulate comments.
-
--evaluate_model
Default: True
Evaluate the trained model with 20% of the dataset.
-
--evaluate_base_model
Default: True
Evaluate the base model with 20% of the dataset.
-
--save_evaluation
Default: True
Will save the graphs and answers of the evaluation in a folder in ./data/evaluations/.
-
--show_evaluation
Default: False
Will show the graphs of the evaluation.
-
--prompt
Default: None
Will answer the prompt using the trained model.
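When passing your own file to --dataset, it may be worth checking its header first. The sketch below tests a CSV file for the default columns listed under --dataset, using only the standard library. The name missing_columns is hypothetical, and the column set reflects the documented defaults only: if you changed input_columns or input_target in main.py, adjust REQUIRED_COLUMNS to match.

```python
import csv

# Default columns documented for --dataset; edit if main.py was customized.
REQUIRED_COLUMNS = {
    "moves",
    "engine_eval",
    "engine_best_line",
    "engine_best_alternative",
    "reformulated",
}


def missing_columns(csv_path):
    """Return the required column names absent from the CSV header, sorted."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # Only the first row (the header) is read; an empty file yields [].
        header = next(csv.reader(handle), [])
    return sorted(REQUIRED_COLUMNS - set(header))
```

For example, calling missing_columns on the path you plan to pass to --dataset returns an empty list when all default columns are present.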