Build a Large Language Model from Scratch
This repository contains an implementation of GPT-2, built and trained from scratch on the Tiny Stories Dataset. The dataset consists of short stories ideal for experimenting with language modeling.
- model.py
  Contains the GPT-2 model architecture.
- tokenizer.py
  Implements a custom tokenizer using SentencePiece. This file includes:
  - The CustomTokenizer class, which:
    - Initializes the tokenizer with a model prefix.
    - Trains the tokenizer on a subset of JSON files (from the Tiny Stories Dataset).
    - Loads the trained model.
    - Provides methods to encode text to token IDs and decode token IDs back into text.
  - Utility functions:
    - process_dataset: Processes all JSON files in a data folder, tokenizes the stories, and saves them as binary files.
    - process_data: A helper function that trains the tokenizer and processes the dataset.
- train.py
  The training script that uses the tokenized data to train the GPT-2 model from scratch.
- text_generation.py
  Contains code for generating text using the trained model. This script implements multiple sampling techniques (temperature scaling, top‑k sampling, and top‑p/nucleus sampling) for flexible text generation.
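
To make the process_dataset step more concrete, here is a minimal sketch of how JSON stories could be tokenized and written to binary files. It is not the code in tokenizer.py. The JSON layout (a list of objects with a "story" field), the 16-bit token storage, the load_tokens helper, and the byte-level toy_encode function standing in for the CustomTokenizer encode method are all assumptions made for illustration.

```python
import json
import tempfile
from array import array
from pathlib import Path


def process_dataset(data_dir, out_dir, encode, text_key="story"):
    """Tokenize every JSON file in data_dir and write one .bin file per input file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for json_path in sorted(Path(data_dir).glob("*.json")):
        with open(json_path, encoding="utf-8") as f:
            records = json.load(f)  # assumed layout: a list of {"story": "..."} objects
        ids = array("H")  # unsigned 16-bit integers; assumes every token ID is below 65,536
        for record in records:
            ids.extend(encode(record[text_key]))
        bin_path = out_dir / (json_path.stem + ".bin")
        with open(bin_path, "wb") as f:
            ids.tofile(f)
        written.append(bin_path)
    return written


def load_tokens(bin_path):
    """Read a .bin file written by process_dataset back into a list of token IDs."""
    ids = array("H")
    with open(bin_path, "rb") as f:
        ids.frombytes(f.read())
    return list(ids)


def toy_encode(text):
    """Hypothetical stand-in for the tokenizer: one token ID per UTF-8 byte."""
    return list(text.encode("utf-8"))


# Round trip on a throwaway folder to show the intended usage.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    sample = [{"story": "hi"}, {"story": "yo"}]
    (data_dir / "sample.json").write_text(json.dumps(sample), encoding="utf-8")
    outputs = process_dataset(data_dir, Path(tmp) / "tokens", toy_encode)
    roundtrip = load_tokens(outputs[0])
```

Storing IDs as fixed-width integers keeps the files compact and lets a training script read them back without re-tokenizing. A vocabulary larger than 65,536 entries would need a wider type code than "H".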
- Clone the Repository
  git clone https://github.com/yourusername/llm-from-scratch.git
  cd llm-from-scratch
- Install Dependencies
  Ensure you have Python 3.8+ installed, then install the required packages:
  pip install -r requirements.txt
- Prepare the Dataset
  Download the Tiny Stories Dataset and place the JSON files into the data/ folder.
- Train the Tokenizer and Process the Data
  Run the tokenizer script to train the SentencePiece model and process the dataset:
  python tokenizer.py
- Train the Model
  Use the training script to train your GPT-2 model from scratch:
  python train.py
- Generate Text
  After training, generate text using the following command:
  python text_generation.py --prompt "Once upon a time" --max_length 500 --temperature 0.8 --top_k 40 --top_p 0.9 --weights ./weights/gpt_weights.pt
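
The --temperature, --top_k, and --top_p flags in the generation command correspond to the three sampling techniques the script implements. The sketch below shows one common way to combine them, written in plain Python so it runs without any framework. The function name sample_next_token and the order of the filters are assumptions for illustration, not necessarily what text_generation.py does.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Draw one token ID from raw logits using temperature, top-k, and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature scaling: values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = [x / temperature for x in logits]

    # Softmax, subtracting the maximum first for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token IDs from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely tokens.
    if top_k is not None:
        ranked = ranked[:max(1, top_k)]

    # Top-p (nucleus): keep the smallest prefix of the ranking whose
    # renormalized cumulative probability reaches top_p.
    if top_p is not None:
        mass = sum(probs[i] for i in ranked)
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i] / mass
            if cumulative >= top_p:
                break
        ranked = kept

    # random.choices renormalizes the weights of the surviving tokens.
    weights = [probs[i] for i in ranked]
    return rng.choices(ranked, weights=weights, k=1)[0]


# Same settings as the command above, applied to a made-up four-token vocabulary.
example_logits = [2.0, 1.0, 0.5, -1.0]
token = sample_next_token(example_logits, temperature=0.8, top_k=40, top_p=0.9)
```

In a generation loop, a function like this would be called once per step on the model's logits for the last position, and the chosen ID appended to the prompt until max_length is reached.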
The trained model weights will be made available on our Hugging Face account soon. Check that page for updates and downloads.
- Zaineb Boukhetala
- Kamel Brouthen
- Meriem Bouyahiaoui
- Ouail Mohammed Oucherif
- Lina Amdirt
- Nesrine Abdelhak
- Nada Djedjig
- Larbi Saidchikh