llm-from-scratch

Build a Large Language Model from Scratch
This repository contains an implementation of GPT-2, built and trained from scratch on the Tiny Stories Dataset. The dataset consists of short stories, which makes it well suited to experimenting with language modeling.

Project Structure

model.py
Contains the GPT-2 model architecture.
tokenizer.py
Implements a custom tokenizer using SentencePiece. This file includes:
- The CustomTokenizer class which:
  - Initializes the tokenizer with a model prefix.
  - Trains the tokenizer on a subset of JSON files (from the Tiny Stories Dataset).
  - Loads the trained model.
  - Provides methods to encode text to token IDs and decode token IDs back into text.
- Utility functions:
  - process_dataset: Processes all JSON files in a data folder, tokenizes the stories, and saves them as binary files.
  - process_data: A helper function to train the tokenizer and process the dataset.
train.py
The training script that uses the tokenized data to train the GPT-2 model from scratch.
text_generation.py
Contains code for generating text using the trained model. This script implements multiple sampling techniques (temperature scaling, top‑k sampling, and top‑p/nucleus sampling) for flexible text generation.

How to Get Started

Clone the Repository

```
git clone https://github.com/yourusername/llm-from-scratch.git
cd llm-from-scratch
```

Install Dependencies
Ensure you have Python 3.8+ installed, then install the required packages:
```
pip install -r requirements.txt
```
Prepare the Dataset
Download the Tiny Stories Dataset and place the JSON files into the data/ folder.
Train the Tokenizer and Process the Data
Run the tokenizer script to train the SentencePiece model and process the dataset:
```
python tokenizer.py
```
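As a rough illustration of the processing step, the sketch below tokenizes every JSON file in a folder and writes the token IDs to one binary file per input. It is not the repository's `process_dataset`: the function name `process_dataset_sketch`, the `"story"` field, the list-of-objects JSON layout, and the 16-bit storage format are all assumptions, and `byte_encode` is a stand-in for the trained SentencePiece tokenizer's encode method.

```python
import json
from array import array
from pathlib import Path
from typing import Callable, List


def process_dataset_sketch(
    data_dir: str,
    out_dir: str,
    encode: Callable[[str], List[int]],
    text_key: str = "story",  # assumed field name, not confirmed by the repository
) -> List[Path]:
    """Tokenize every JSON file in data_dir and write one .bin file per input file."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for json_file in sorted(Path(data_dir).glob("*.json")):
        with open(json_file, "r", encoding="utf-8") as f:
            records = json.load(f)  # assumed layout: a list of objects
        # Unsigned 16-bit storage; only valid while every token ID is below 65536.
        ids = array("H")
        for record in records:
            ids.extend(encode(record[text_key]))
        bin_file = out_path / (json_file.stem + ".bin")
        with open(bin_file, "wb") as f:
            ids.tofile(f)
        written.append(bin_file)
    return written


def byte_encode(text: str) -> List[int]:
    """Stand-in tokenizer: maps text to its UTF-8 byte values."""
    return list(text.encode("utf-8"))
```

Calling `process_dataset_sketch("data", "tokenized", byte_encode)` would write one `.bin` file per JSON file into a hypothetical `tokenized/` folder. Swapping `byte_encode` for a real tokenizer's encode method keeps the rest of the flow unchanged.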
Train the Model
Use the training script to train your GPT-2 model from scratch:
```
python train.py
```
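Training a GPT-style model on tokenized data usually means predicting each token from the ones before it, so every training example pairs a window of tokens with the same window shifted by one position. The sketch below shows only that data layout in plain Python. It is not taken from `train.py`: `get_batch`, `batch_size`, and `context_length` are illustrative names, and a real training script would most likely build tensors instead of lists.

```python
import random
from typing import List, Sequence, Tuple


def get_batch(
    tokens: Sequence[int],
    batch_size: int,
    context_length: int,
    rng: random.Random,
) -> Tuple[List[List[int]], List[List[int]]]:
    """Sample random windows; each target row is its input row shifted by one token."""
    if len(tokens) <= context_length:
        raise ValueError("token stream must be longer than context_length")
    inputs, targets = [], []
    for _ in range(batch_size):
        # Leave room for the extra token needed by the shifted target window.
        start = rng.randrange(len(tokens) - context_length)
        inputs.append(list(tokens[start : start + context_length]))
        targets.append(list(tokens[start + 1 : start + context_length + 1]))
    return inputs, targets
```

At position `i` of a row, the model sees `inputs[row][: i + 1]` and is trained to predict `targets[row][i]`.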

Generate Text
After training, generate text using the following command:

```
python text_generation.py --prompt "Once upon a time" --max_length 500 --temperature 0.8 --top_k 40 --top_p 0.9 --weights ./weights/gpt_weights.pt
```
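To show how the `--temperature`, `--top_k`, and `--top_p` options can interact, here is a minimal pure-Python sketch of one sampling step. It is not the code in `text_generation.py`: the function name `sample_next_token` and the order of the filters (top-k first, then top-p on the original probabilities) are assumptions, and other implementations renormalize between the two filters.

```python
import math
import random
from typing import List, Optional


def sample_next_token(
    logits: List[float],
    temperature: float = 1.0,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick one token index from raw logits using temperature, top-k, and top-p."""
    rng = rng or random.Random()
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature scaling: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Candidate indices, most probable first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most probable tokens.
    if top_k is not None:
        order = order[: max(1, top_k)]

    # Top-p (nucleus): keep the shortest prefix whose cumulative probability reaches top_p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept

    # random.choices normalizes the weights of the surviving candidates.
    weights = [probs[i] for i in order]
    return rng.choices(order, weights=weights, k=1)[0]
```

With `top_k=1` the function always returns the most probable token, which is greedy decoding. Generating a full sequence would repeat this step, appending each sampled token to the prompt until the length limit is reached.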

Weights and Model Sharing

The trained model weights will be published on our Hugging Face account soon. Check there for updates and downloads.

Contributors

Zaineb Boukhetala
Kamel Brouthen
Meriem Bouyahiaoui
Ouail Mohammed Oucherif
Lina Amdirt
Nesrine Abdelhak
Nada Djedjig
Larbi Saidchikh
