Developing a Q&A LLM

This repository provides the basic scripts needed to develop a Large Language Model (LLM) for question-answering purposes from scratch. The goal is not to provide the best-configured set of scripts for the task but to offer a general framework that can be adapted and extended. You can find the PDF version of the presentation slides that accompany this repo here.

Table of Contents

  • Overview
  • Base Model
  • Base Model Training
  • Fine Tuning
  • Fine-Tuning Procedure
  • Installation
  • Usage
  • Example Commands
  • Requirements
  • NLTK Data
  • References

Overview

This project aims to create a foundational framework for developing a Q&A LLM. It includes scripts for training a base model and fine-tuning it for specific question-and-answer tasks. The repository is designed for educational purposes and can be used as a starting point for more complex implementations.

Base Model

The base model code can be found in the folder gpt-base, which includes the following files and folders (a rough layout is sketched after this list):

  • data: Contains the input text for model training and the "cleaned" version of the text.
  • log: Log files for training the tokenizer and training the base model are written to this folder, along with an epoch-versus-loss plot for the training procedure.
  • models: Both the tokenizer and the base model will be stored in this folder.
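
For orientation, the gpt-base folder roughly follows the layout below. The script names are taken from the example commands later in this README, and the exact contents of the repo may differ:

    gpt-base/
    ├── data/                  # input .txt file and its cleaned version
    ├── log/                   # training logs and epoch-versus-loss plots
    ├── models/                # trained tokenizer and base model
    ├── text_cleaning.py
    ├── tokenizer_trainer.py
    └── gpt_trainer.py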

Base Model Training

Follow these steps:

  1. Ensure that the input text is placed in the data folder as a .txt file. The default file is currently the English translation of The Little Prince by Antoine de Saint-Exupéry.
  2. To clean up the input text, you can use the sample script text_cleaning. This is just a sample, and you will need to modify it based on your input text. Set the correct paths for your input and output files.
  3. Run the tokenizer trainer script. Review its input parameters and change them as needed. The current code assumes execution from ./qagpt_codes/gpt-base. Training generates two output files for the tokenizer, named <path_save_tokenizer>.model and <path_save_tokenizer>.vocab, where path_save_tokenizer is defined in the first few lines of the script. A minimal sketch of this step is shown after this list.
  4. Run the gpt trainer script. It takes the tokenizer, the cleaned text, and the path where the base model will be saved after training. Adjust the input parameters as necessary.
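
As a rough illustration of step 3, the snippet below sketches how a SentencePiece tokenizer could be trained on the cleaned text. The paths, vocabulary size, and model type here are placeholder assumptions; the actual tokenizer trainer script in this repo may differ.

    import sentencepiece as spm

    # Hypothetical paths; adjust them to match your setup.
    input_text = "data/cleaned_text.txt"
    path_save_tokenizer = "models/tokenizer"

    # Train a SentencePiece model. This writes
    # <path_save_tokenizer>.model and <path_save_tokenizer>.vocab.
    spm.SentencePieceTrainer.train(
        input=input_text,
        model_prefix=path_save_tokenizer,
        vocab_size=4000,    # placeholder value
        model_type="bpe",   # assumption; the repo may use a different model type
    )

    # Load the trained tokenizer and encode a sample sentence.
    sp = spm.SentencePieceProcessor(model_file=path_save_tokenizer + ".model")
    print(sp.encode("What is this book about?", out_type=str))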

Fine Tuning

The scripts needed to perform a simplified version of fine-tuning can be found in the folder fine-tuning, which includes the following folders:

  • data: Contains a JSON file with question & answer pairs. This file can be generated by a script like this one or written manually; a hypothetical example of generating it is sketched after this list.
  • log: Contains log files generated by the question-and-answer generation and fine-tuning scripts, including an epoch-versus-loss plot for the training procedure.
  • models: The Q&A model is saved in this folder by default. If you change the save location in the fine-tuning script, you need to create the corresponding folder yourself.
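
The repo does not prescribe a particular schema for the Q&A file, so the snippet below is only a hypothetical illustration of how such a file might be created; the field names "question" and "answer" and the output path are assumptions.

    import json

    # Hypothetical Q&A pairs; the field names are illustrative only.
    qa_pairs = [
        {"question": "Who wrote The Little Prince?",
         "answer": "Antoine de Saint-Exupéry."},
        {"question": "Where does the little prince come from?",
         "answer": "A small asteroid known as B-612."},
    ]

    # Write the pairs to the data folder used by the fine-tuning script.
    with open("data/qa_pairs.json", "w", encoding="utf-8") as f:
        json.dump(qa_pairs, f, ensure_ascii=False, indent=2)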

Fine-Tuning Procedure

Follow these steps:

  1. Generate a set of questions and answers either manually (recommended) or using a script like the one here. Keep in mind that writing a script that generates a meaningful set of questions and answers can be challenging and depends on the formatting and general structure of the input text.
  2. Run the fine-tuning script to train the previously trained base model for Q&A. Ensure that the input paths (input_tokenizer, input_text, gpt_model_path, model_save_path) are set properly so the script reads the base model and tokenizer and writes the final model to the correct directories/locations. A heavily simplified conceptual sketch of this step is shown after this list.
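
The sketch below is not the repo's fine-tuning script; it only illustrates the general idea of fine-tuning on question-answer pairs with PyTorch, using the path parameters named above. The tiny stand-in model, hyperparameters, and JSON field names are all assumptions.

    import json
    import torch
    import torch.nn as nn
    import sentencepiece as spm

    # Hypothetical paths mirroring the parameters named above; adjust to your layout.
    input_tokenizer = "gpt-base/models/tokenizer.model"
    input_text      = "fine-tuning/data/qa_pairs.json"
    gpt_model_path  = "gpt-base/models/base_model.pt"   # trained base model (not used by this toy stand-in)
    model_save_path = "fine-tuning/models/qa_model.pt"

    sp = spm.SentencePieceProcessor(model_file=input_tokenizer)
    with open(input_text, encoding="utf-8") as f:
        qa_pairs = json.load(f)

    # Stand-in for the repo's GPT model: an embedding plus a linear head.
    # The actual script would instead rebuild the GPT architecture and load
    # the trained weights, e.g. model.load_state_dict(torch.load(gpt_model_path)).
    vocab = sp.get_piece_size()
    model = nn.Sequential(nn.Embedding(vocab, 128), nn.Linear(128, vocab))
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(3):                      # placeholder epoch count
        for pair in qa_pairs:
            # Concatenate question and answer and train on next-token prediction.
            ids = sp.encode(pair["question"] + " " + pair["answer"])
            x = torch.tensor(ids[:-1])          # input tokens
            y = torch.tensor(ids[1:])           # targets shifted by one token
            loss = loss_fn(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    torch.save(model.state_dict(), model_save_path)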

Installation

To set up the project, clone the repository and install the required libraries:

git clone git@github.com:MasoudMiM/qpt_qa.git
cd qpt_qa
pip install -r requirements.txt

Usage

After setting up the project, you can start training the base model and fine-tuning it for your specific Q&A tasks. Follow the instructions in the Base Model Training and Fine-Tuning Procedure sections to get started.

Example Commands

Here are some example commands to run the scripts:

  1. Clean the input text:

    python gpt-base/text_cleaning.py
  2. Train the tokenizer:

    python gpt-base/tokenizer_trainer.py
  3. Train the base model:

    python gpt-base/gpt_trainer.py
  4. Fine-tune the model:

    python fine-tune/fine-tune.py

Requirements

This project requires the following Python libraries:

  • torch
  • sentencepiece
  • matplotlib
  • numpy
  • nltk

You can install them all with pip by running:

pip install -r requirements.txt
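
For reference, a requirements.txt covering the libraries listed above would contain entries like the following; the repo's actual file may pin specific versions:

    torch
    sentencepiece
    matplotlib
    numpy
    nltk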

NLTK Data

Make sure to set the NLTK data location on your own laptop/server. You can do this by appending the path to your NLTK data directory in the script:

import nltk
nltk.data.path.append("C:\\...\\nltk_data")
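
If the resources the scripts need are not already present in that directory, they can be downloaded into it. The sketch below uses a placeholder path, and which NLTK resources are actually required depends on the cleaning script, so "punkt" is only an example:

    import nltk

    # Assumed custom NLTK data directory; replace with your own path.
    nltk_dir = "/path/to/nltk_data"
    nltk.data.path.append(nltk_dir)

    # Download an example resource into that directory; the scripts in this
    # repo may require different resources.
    nltk.download("punkt", download_dir=nltk_dir)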

References
