This repository provides the basic scripts needed to develop a Large Language Model (LLM) for question answering from scratch. The goal is not to provide the best-configured set of codes for the task but to offer a general framework that can be adapted and extended. You can find the PDF version of the presentation slides that accompany this repo here.
This project aims to create a foundational framework for developing a Q&A LLM. It includes scripts for training a base model and fine-tuning it for specific question-and-answer tasks. The repository is designed for educational purposes and can be used as a starting point for more complex implementations.
The base model can be found in the folder gpt-base, which includes the following files and folders:
- data: Contains the input text for model training and the "cleaned" version of the text.
- log: Once the model is trained, a log file is created and placed in this folder. It will contain log files for training the tokenizer and training the base model, as well as an epoch versus loss plot for the training procedure.
- models: Both the tokenizer and the base model will be stored in this folder.
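The epoch-versus-loss plot mentioned above can be produced with matplotlib, one of the project's dependencies. A minimal sketch of how such a plot is generated (the loss values and output file name here are illustrative assumptions, not the repo's actual logging code):

```python
import matplotlib

matplotlib.use("Agg")  # render without a display, e.g. on a headless server
import matplotlib.pyplot as plt

# Hypothetical per-epoch training losses; the real values come from the trainer.
losses = [4.1, 3.2, 2.7, 2.4, 2.2]
epochs = range(1, len(losses) + 1)

plt.plot(epochs, losses, marker="o")
plt.xlabel("Epoch")
plt.ylabel("Training loss")
plt.title("Base model training")
# The repo saves this plot into the log folder; here we write to the
# current directory so the snippet is self-contained.
plt.savefig("epoch_vs_loss.png")
```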
Follow these steps:
- Ensure that the input text is placed in the data folder as a .txt file. The default file is currently the English translation of The Little Prince by Antoine de Saint-Exupéry.
- To clean up the input text, you can use the sample code text_cleaning. This is just a sample, and you need to modify it based on your input text. Set the correct paths for your input and output files.
- Run the tokenizer trainer code. Pay attention to the inputs and change them as needed. The current code assumes execution from ./qagpt_codes/gpt-base. This will generate two files for the output tokenizer, named <path_save_tokenizer>.model and <path_save_tokenizer>.vocab, where path_save_tokenizer is defined in the first few lines of the code.
- Run the gpt trainer script. This code takes as inputs the tokenizer, the cleaned text, and the path for saving the base model after training. Adjust the input parameters as necessary.
The scripts needed to perform a simplified version of fine-tuning can be found in the folder fine-tuning, which includes the following folders:
- data: Contains a JSON file with question & answer pairs. This can be generated by a code like this one or done manually.
- log: Contains log files generated by question-and-answer generation and the fine-tuning scripts, including an epoch versus loss plot for the training procedure.
- models: The Q&A model is saved to this folder by default. If the location is changed in the fine-tuning script, the corresponding folder needs to be created first.
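The exact schema of the question & answer JSON file depends on how it is generated, but it might look like the sketch below; the field names and example pairs are assumptions for illustration, not the repo's verified format:

```python
import json

# Hypothetical Q&A pairs and field names; in the repo these are produced by
# the generation script or written manually into the data folder.
qa_pairs = [
    {"question": "Where did the little prince live?",
     "answer": "On a small planet."},
    {"question": "Who narrates the story?",
     "answer": "A pilot stranded in the desert."},
]

with open("qa_pairs.json", "w") as f:
    json.dump(qa_pairs, f, indent=2)

# The fine-tuning script would read the pairs back like this.
with open("qa_pairs.json") as f:
    loaded = json.load(f)
print(len(loaded))  # → 2
```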
Follow these steps:
- Generate a set of questions and answers either manually (recommended) or using a code like the one here. Keep in mind that developing code that can generate a meaningful set of questions and answers might be challenging and depends on the formatting and general structure of the input text.
- Run the fine-tuning script to train the previously trained base model for the purpose of Q&A. Ensure that you set the input paths (input_tokenizer, input_text, gpt_model_path, model_save_path) properly so that the base model and tokenizer are read from, and the final model is written to, the correct directories/locations.
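Loading a previously saved base model before fine-tuning typically follows PyTorch's state_dict pattern, sketched below. The model class here is a tiny stand-in, and the checkpoint path and use of state_dict checkpoints are assumptions about the repo's conventions; the real checkpoint must match the GPT class defined in the trainer script:

```python
import torch
import torch.nn as nn

# Tiny stand-in architecture; the repo's actual GPT class is defined in its
# trainer script, and gpt_model_path points at a checkpoint of that class.
class TinyGPT(nn.Module):
    def __init__(self, vocab_size=100, d_model=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        return self.head(self.embed(idx))

# Save a "base model" checkpoint, as the base trainer would.
model = TinyGPT()
torch.save(model.state_dict(), "gpt_base.pt")  # plays the role of gpt_model_path

# Reload it for fine-tuning, typically with a smaller learning rate.
finetune_model = TinyGPT()
finetune_model.load_state_dict(torch.load("gpt_base.pt"))
optimizer = torch.optim.AdamW(finetune_model.parameters(), lr=1e-4)

logits = finetune_model(torch.tensor([[1, 2, 3]]))
print(logits.shape)  # → torch.Size([1, 3, 100])
```

After fine-tuning, the updated weights would be saved to model_save_path with the same torch.save call.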
To set up the project, clone the repository and install the required libraries:
git clone git@github.com:MasoudMiM/qpt_qa.git
cd qpt_qa
pip install -r requirements.txt

After setting up the project, you can start training the base model and fine-tuning it for your specific Q&A tasks. Follow the instructions in the Base Model Training and Fine-Tuning Procedure sections to get started.
Here are some example commands to run the scripts:
- Clean the input text:
  python gpt-base/text_cleaning.py
- Train the tokenizer:
  python gpt-base/tokenizer_trainer.py
- Train the base model:
  python gpt-base/gpt_trainer.py
- Fine-tune the model:
  python fine-tune/fine-tune.py
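Once the fine-tuned model is saved, answering a question comes down to an autoregressive generation loop. A minimal greedy-decoding sketch; the model here is a random stand-in, whereas the real loop would use the repo's trained model and the sentencepiece tokenizer to encode the question and decode the generated ids:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Random stand-in for the fine-tuned model: maps token ids to next-token logits.
vocab_size = 20
model = nn.Sequential(nn.Embedding(vocab_size, 16), nn.Linear(16, vocab_size))

def generate(prompt_ids, max_new_tokens=5):
    """Greedy decoding: repeatedly append the highest-probability next token."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        x = torch.tensor([ids])
        logits = model(x)                 # shape (1, seq_len, vocab_size)
        next_id = int(logits[0, -1].argmax())
        ids.append(next_id)
    return ids

out = generate([1, 2, 3])
print(len(out))  # → 8
```

In practice sampling with a temperature is often used instead of pure argmax, but greedy decoding is the simplest version of the loop.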
This project requires the following Python libraries:
- torch
- sentencepiece
- matplotlib
- numpy
- nltk
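A requirements.txt covering these dependencies can be as simple as the following (versions are left unpinned here; pin them for reproducibility):

```
torch
sentencepiece
matplotlib
numpy
nltk
```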
You can install these libraries using pip. You can simply run:
pip install -r requirements.txt

Make sure to set the NLTK data location on your own laptop/server. You can do this by appending the path to your NLTK data directory in the script:
import nltk
nltk.data.path.append("C:\\...\\nltk_data")

- Andrej Karpathy YouTube channel
- Understanding Deep Learning by Simon J.D. Prince
- Deep Learning: Foundations and Concepts by Chris Bishop and Hugh Bishop
- Probabilistic Machine Learning: An Introduction by Kevin Patrick Murphy
- Deep Learning by Aaron Courville, Ian Goodfellow, and Yoshua Bengio
- Transformers (how LLMs work) explained visually from the 3Blue1Brown YouTube channel
- Visualizing Attention, a Transformer's Heart
- Transformers - Fundamental Concepts with Python Implementation