A custom implementation of a GPT model built from scratch. This project demonstrates the fundamental concepts behind large language models like ChatGPT by implementing each component step by step.
- BPE Tokenizer: Custom implementation of Byte Pair Encoding for text tokenization
- Dataset Handling: Preprocessing and management of text datasets
- Model Training: Train your own GPT model on custom data
- Clean Architecture: Modular design for easy understanding and extension
```bash
# Clone the repository
git clone https://github.com/yourusername/chatgpt_from_scratch.git
cd chatgpt_from_scratch

# Install the package in development mode
pip install -e .

# Install pre-commit hooks
pre-commit install --install-hooks
```

```python
from pxl_gpt.dataset import Dataset
from pxl_gpt.tokenizer import BPETokenizer
# Load and preprocess dataset
dataset = Dataset()
dataset.preprocess()
# Train tokenizer
tokenizer = BPETokenizer()
tokenizer.train(dataset)
# Save tokenizer
tokenizer.save("data/tokenizer.json")
```

```python
# Load a saved tokenizer
tokenizer = BPETokenizer()
tokenizer.load("data/tokenizer.json")
# Test encoding and decoding
test_text = "Bonjour, comment ça va?"
encoded = tokenizer.encode(test_text)
decoded = tokenizer.decode(encoded)
print(f"Original: {test_text}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")chatgpt_from_scratch/
├── data/
│ ├── cleaned_data.csv # Preprocessed dataset
│ └── tokenizer.json # Trained tokenizer
├── extra/
│ ├── emb_comparison.py # Word vector arithmetic demonstrations
│ └── simple_cbow.py # Simple CBOW implementation for word embeddings
├── pxl_gpt/
│ ├── dataset.py # Dataset handling
│ ├── dataset_loader.py # Dataset loading utilities
│ ├── model.py # GPT model implementation
│ └── tokenizer.py # BPE tokenizer implementation
├── scripts/
│ └── train_tokenizer.py # Script to train the tokenizer
└── pyproject.toml # Project configuration
The dataset module handles loading and preprocessing text data. By default, it uses a French Reddit discussion dataset.
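Since the preprocessed dataset is written to `data/cleaned_data.csv` (see the project structure above), a quick way to sanity-check the output is to load it with pandas. This is only an illustrative sketch; the exact column layout depends on what the preprocessing step writes.

```python
import pandas as pd

# Inspect the CSV produced by Dataset.preprocess().
# The exact columns depend on the preprocessing step, so we only
# print the shape and the first few rows rather than assume a schema.
df = pd.read_csv("data/cleaned_data.csv")
print(df.shape)
print(df.head())
```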
A custom implementation of Byte Pair Encoding (BPE) tokenization, similar to what's used in models like GPT. The tokenizer:
- Splits text into initial tokens
- Iteratively merges the most frequent adjacent token pairs
- Builds a vocabulary of subword units
- Provides encoding and decoding functionality
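The merge loop at the heart of BPE can be shown with a minimal, self-contained sketch. It is independent of the `BPETokenizer` class (whose internal data structures may differ) and operates on a toy corpus of character-split words:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: each word is split into characters and mapped to its frequency.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}

merges = []
for _ in range(10):  # the number of merges sets the subword vocabulary budget
    pair = most_frequent_pair(corpus)
    if pair is None:
        break
    corpus = merge_pair(corpus, pair)
    merges.append(pair)

print(merges)  # learned merge rules, most frequent first
```

Each learned merge becomes a rule applied in order during encoding, which is how frequent character sequences grow into subword units.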
The extra directory contains supplementary scripts that demonstrate concepts related to word embeddings:
- emb_comparison.py: Demonstrates word vector arithmetic using Google's pre-trained word2vec model. It shows interesting semantic relationships like "king - man + woman = queen" and similar vector operations.
- simple_cbow.py: Implements a Continuous Bag of Words (CBOW) model from scratch using PyTorch. This script trains word embeddings on a small French corpus and visualizes the resulting word vectors, showing how related words cluster together in the embedding space.
These scripts help understand the embedding concepts that underlie modern language models.
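The word-vector arithmetic idea behind `emb_comparison.py` can be reproduced with gensim. This sketch assumes the pre-trained Google News word2vec binary has already been downloaded; the file name below is an assumption and should match your local copy.

```python
from gensim.models import KeyedVectors

# Load Google's pre-trained word2vec vectors.
# NOTE: the file name/path is an assumption; point it at your local copy.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# "king - man + woman" should land near "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Plain similarity queries work as well.
print(vectors.similarity("king", "queen"))
```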
The model module (`pxl_gpt/model.py`) contains the GPT model architecture implementation.
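As a rough illustration of the components such an architecture involves (not necessarily the exact layout of `pxl_gpt/model.py`), a single pre-norm Transformer decoder block with masked self-attention can be sketched in PyTorch:

```python
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal mask (illustrative sketch)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        seq_len = x.size(1)
        # Upper-triangular boolean mask blocks attention to future positions.
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out

class TransformerBlock(nn.Module):
    """Pre-norm decoder block: attention + feed-forward, each with a residual connection."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x

# Example: a batch of 2 sequences of 16 token embeddings of size 64.
block = TransformerBlock(d_model=64, n_heads=4)
print(block(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```

A full GPT stacks several such blocks between a token/position embedding layer and a language-modeling head that projects back to vocabulary logits.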
- Code in Java by MrQsdf - the code translated to Java
- Attention Is All You Need - The original Transformer paper
- Improving Language Understanding with Unsupervised Learning - OpenAI's GPT approach
This project is licensed under the MIT License - see the LICENSE file for details.
Created by PixelCrafted