This repository provides the basic scripts needed to develop a Large Language Model (LLM) for question answering from scratch. The goal is not to provide the best-configured set of codes for the task but to offer a general framework that can be adapted and extended. You can find the PDF version of the presentation slides that accompany this repo here.
This project aims to create a foundational framework for developing a Q&A LLM. It includes scripts for training a base model and fine-tuning it for specific question-and-answer tasks. The repository is designed for educational purposes and can be used as a starting point for more complex implementations.
The base model can be found in the folder gpt-base, which includes the following files and folders:
- data: Contains the input text for model training and the "cleaned" version of the text.
- log: Once the model is trained, a log file is created and placed in this folder. It will contain log files for training the tokenizer and training the base model, as well as an epoch versus loss plot for the training procedure.
- models: Both the tokenizer and the base model will be stored in this folder.
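The epoch-versus-loss plot mentioned above can be produced with matplotlib, one of the project's dependencies. A minimal sketch of how such a plot is generated (the loss values and output file name here are illustrative assumptions, not the repo's actual logging code):

```python
import matplotlib

matplotlib.use("Agg")  # render without a display, e.g. on a headless server
import matplotlib.pyplot as plt

# Hypothetical per-epoch training losses; the real values come from the trainer.
losses = [4.1, 3.2, 2.7, 2.4, 2.2]
epochs = range(1, len(losses) + 1)

plt.plot(epochs, losses, marker="o")
plt.xlabel("Epoch")
plt.ylabel("Training loss")
plt.title("Base model training")
# The repo saves this plot into the log folder; here we write to the
# current directory so the snippet is self-contained.
plt.savefig("epoch_vs_loss.png")
```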
Follow these steps:
- Ensure that the input text is placed in the data folder as a .txt file. The default file is currently the English translation of The Little Prince by Antoine de Saint-Exupéry.
- To clean up the input text, you can use the sample code text_cleaning. This is just a sample, and you need to modify it based on your input text. Set the correct paths for your input and output files.
- Run the tokenizer trainer code. Pay attention to the inputs and change them as needed. The current code assumes execution from ./qagpt_codes/gpt-base. This will generate two files for the output tokenizer, named <path_save_tokenizer>.model and <path_save_tokenizer>.vocab, where path_save_tokenizer is defined in the first few lines of the code.
- Run the gpt trainer script. This code takes as inputs the tokenizer, the cleaned text, and the path for saving the base model after training. Adjust the input parameters as necessary.
The scripts needed to perform a simplified version of fine-tuning can be found in the folder fine-tuning, which includes the following folders:
- data: Contains a JSON file with question & answer pairs. This can be generated by a code like this one or done manually.
- log: Contains log files generated by question-and-answer generation and the fine-tuning scripts, including an epoch versus loss plot for the training procedure.
- models: The Q&A model is saved to this folder by default. If the location is changed in the fine-tuning script, the corresponding folder needs to be created first.
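The exact schema of the question & answer JSON file depends on how it is generated, but it might look like the sketch below; the field names and example pairs are assumptions for illustration, not the repo's verified format:

```python
import json

# Hypothetical Q&A pairs and field names; in the repo these are produced by
# the generation script or written manually into the data folder.
qa_pairs = [
    {"question": "Where did the little prince live?",
     "answer": "On a small planet."},
    {"question": "Who narrates the story?",
     "answer": "A pilot stranded in the desert."},
]

with open("qa_pairs.json", "w") as f:
    json.dump(qa_pairs, f, indent=2)

# The fine-tuning script would read the pairs back like this.
with open("qa_pairs.json") as f:
    loaded = json.load(f)
print(len(loaded))  # → 2
```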
Follow these steps:
- Generate a set of questions and answers either manually (recommended) or using a code like the one here. Keep in mind that developing code that can generate a meaningful set of questions and answers might be challenging and depends on the formatting and general structure of the input text.
- Run the fine-tuning script to train the previously trained base model for the purpose of Q&A. Ensure that you set the input paths (input_tokenizer, input_text, gpt_model_path, model_save_path) properly so that the base model and tokenizer are read from, and the final model is written to, the correct directories/locations.
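Loading a previously saved base model before fine-tuning typically follows PyTorch's state_dict pattern, sketched below. The model class here is a tiny stand-in, and the checkpoint path and use of state_dict checkpoints are assumptions about the repo's conventions; the real checkpoint must match the GPT class defined in the trainer script:

```python
import torch
import torch.nn as nn

# Tiny stand-in architecture; the repo's actual GPT class is defined in its
# trainer script, and gpt_model_path points at a checkpoint of that class.
class TinyGPT(nn.Module):
    def __init__(self, vocab_size=100, d_model=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        return self.head(self.embed(idx))

# Save a "base model" checkpoint, as the base trainer would.
model = TinyGPT()
torch.save(model.state_dict(), "gpt_base.pt")  # plays the role of gpt_model_path

# Reload it for fine-tuning, typically with a smaller learning rate.
finetune_model = TinyGPT()
finetune_model.load_state_dict(torch.load("gpt_base.pt"))
optimizer = torch.optim.AdamW(finetune_model.parameters(), lr=1e-4)

logits = finetune_model(torch.tensor([[1, 2, 3]]))
print(logits.shape)  # → torch.Size([1, 3, 100])
```

After fine-tuning, the updated weights would be saved to model_save_path with the same torch.save call.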
To set up the project, clone the repository and install the required libraries:
git clone git@github.com:MasoudMiM/qpt_qa.git
cd qpt_qa
pip install -r requirements.txt

After setting up the project, you can start training the base model and fine-tuning it for your specific Q&A tasks. Follow the instructions in the Base Model Training and Fine-Tuning Procedure sections to get started.
Here are some example commands to run the scripts:
- Clean the input text:
  python gpt-base/text_cleaning.py
- Train the tokenizer:
  python gpt-base/tokenizer_trainer.py
- Train the base model:
  python gpt-base/gpt_trainer.py
- Fine-tune the model:
  python fine-tune/fine-tune.py
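Once the fine-tuned model is saved, answering a question comes down to an autoregressive generation loop. A minimal greedy-decoding sketch; the model here is a random stand-in, whereas the real loop would use the repo's trained model and the sentencepiece tokenizer to encode the question and decode the generated ids:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Random stand-in for the fine-tuned model: maps token ids to next-token logits.
vocab_size = 20
model = nn.Sequential(nn.Embedding(vocab_size, 16), nn.Linear(16, vocab_size))

def generate(prompt_ids, max_new_tokens=5):
    """Greedy decoding: repeatedly append the highest-probability next token."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        x = torch.tensor([ids])
        logits = model(x)                 # shape (1, seq_len, vocab_size)
        next_id = int(logits[0, -1].argmax())
        ids.append(next_id)
    return ids

out = generate([1, 2, 3])
print(len(out))  # → 8
```

In practice sampling with a temperature is often used instead of pure argmax, but greedy decoding is the simplest version of the loop.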
This project requires the following Python libraries:
- torch
- sentencepiece
- matplotlib
- numpy
- nltk
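A requirements.txt covering these dependencies can be as simple as the following (versions are left unpinned here; pin them for reproducibility):

```
torch
sentencepiece
matplotlib
numpy
nltk
```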
You can install these libraries using pip. You can simply run:
pip install -r requirements.txt

Make sure to set the NLTK data location on your own laptop/server. You can do this by appending the path to your NLTK data directory in the script:
import nltk
nltk.data.path.append("C:\\...\\nltk_data")

- Andrej Karpathy YouTube channel
- Understanding Deep Learning by Simon J.D. Prince
- Deep Learning: Foundations and Concepts by Chris Bishop and Hugh Bishop
- Probabilistic Machine Learning: An Introduction by Kevin Patrick Murphy
- Deep Learning by Aaron Courville, Ian Goodfellow, and Yoshua Bengio
- Transformers (how LLMs work) explained visually from the 3Blue1Brown YouTube channel
- Visualizing Attention, a Transformer's Heart
- Transformers - Fundamental Concepts with Python Implementation