A custom implementation of a GPT model built from scratch. This project demonstrates the fundamental concepts behind large language models like ChatGPT by implementing each component step by step.
- BPE Tokenizer: Custom implementation of Byte Pair Encoding for text tokenization
- Dataset Handling: Preprocessing and management of text datasets
- Model Training: Train your own GPT model on custom data
- Clean Architecture: Modular design for easy understanding and extension
```bash
# Clone the repository
git clone https://github.com/yourusername/chatgpt_from_scratch.git
cd chatgpt_from_scratch

# Install the package in development mode
pip install -e .

# Install pre-commit hooks
pre-commit install --install-hooks
```

```python
from pxl_gpt.dataset import Dataset
from pxl_gpt.tokenizer import BPETokenizer
# Load and preprocess dataset
dataset = Dataset()
dataset.preprocess()
# Train tokenizer
tokenizer = BPETokenizer()
tokenizer.train(dataset)
# Save tokenizer
tokenizer.save("data/tokenizer.json")
```

```python
# Load a saved tokenizer
tokenizer = BPETokenizer()
tokenizer.load("data/tokenizer.json")
# Test encoding and decoding
test_text = "Bonjour, comment ça va?"
encoded = tokenizer.encode(test_text)
decoded = tokenizer.decode(encoded)
print(f"Original: {test_text}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")chatgpt_from_scratch/
├── data/
│ ├── cleaned_data.csv # Preprocessed dataset
│ └── tokenizer.json # Trained tokenizer
├── extra/
│ ├── emb_comparison.py # Word vector arithmetic demonstrations
│ └── simple_cbow.py # Simple CBOW implementation for word embeddings
├── pxl_gpt/
│ ├── dataset.py # Dataset handling
│ ├── dataset_loader.py # Dataset loading utilities
│ ├── model.py # GPT model implementation
│ └── tokenizer.py # BPE tokenizer implementation
├── scripts/
│ └── train_tokenizer.py # Script to train the tokenizer
└── pyproject.toml # Project configuration
The dataset module handles loading and preprocessing text data. By default, it uses a French Reddit discussion dataset.
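Since the preprocessed dataset is written to `data/cleaned_data.csv` (see the project structure above), a quick way to sanity-check the output is to load it with pandas. This is only an illustrative sketch; the exact column layout depends on what the preprocessing step writes.

```python
import pandas as pd

# Inspect the CSV produced by Dataset.preprocess().
# The exact columns depend on the preprocessing step, so we only
# print the shape and the first few rows rather than assume a schema.
df = pd.read_csv("data/cleaned_data.csv")
print(df.shape)
print(df.head())
```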
A custom implementation of Byte Pair Encoding (BPE) tokenization, similar to what's used in models like GPT. The tokenizer:
- Splits text into initial tokens
- Iteratively merges the most frequent adjacent token pairs
- Builds a vocabulary of subword units
- Provides encoding and decoding functionality
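The merge loop at the heart of BPE can be shown with a minimal, self-contained sketch. It is independent of the `BPETokenizer` class (whose internal data structures may differ) and operates on a toy corpus of character-split words:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: each word is split into characters and mapped to its frequency.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}

merges = []
for _ in range(10):  # the number of merges sets the subword vocabulary budget
    pair = most_frequent_pair(corpus)
    if pair is None:
        break
    corpus = merge_pair(corpus, pair)
    merges.append(pair)

print(merges)  # learned merge rules, most frequent first
```

Each learned merge becomes a rule applied in order during encoding, which is how frequent character sequences grow into subword units.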
The extra directory contains supplementary scripts that demonstrate concepts related to word embeddings:
- emb_comparison.py: Demonstrates word vector arithmetic using Google's pre-trained word2vec model. It shows interesting semantic relationships like "king - man + woman = queen" and similar vector operations.
- simple_cbow.py: Implements a Continuous Bag of Words (CBOW) model from scratch using PyTorch. This script trains word embeddings on a small French corpus and visualizes the resulting word vectors, showing how related words cluster together in the embedding space.
These scripts help understand the embedding concepts that underlie modern language models.
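The word-vector arithmetic idea behind `emb_comparison.py` can be reproduced with gensim. This sketch assumes the pre-trained Google News word2vec binary has already been downloaded; the file name below is an assumption and should match your local copy.

```python
from gensim.models import KeyedVectors

# Load Google's pre-trained word2vec vectors.
# NOTE: the file name/path is an assumption; point it at your local copy.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# "king - man + woman" should land near "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Plain similarity queries work as well.
print(vectors.similarity("king", "queen"))
```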
The model module (`pxl_gpt/model.py`) contains the GPT model architecture implementation.
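As a rough illustration of the components such an architecture involves (not necessarily the exact layout of `pxl_gpt/model.py`), a single pre-norm Transformer decoder block with masked self-attention can be sketched in PyTorch:

```python
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal mask (illustrative sketch)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        seq_len = x.size(1)
        # Upper-triangular boolean mask blocks attention to future positions.
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out

class TransformerBlock(nn.Module):
    """Pre-norm decoder block: attention + feed-forward, each with a residual connection."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x

# Example: a batch of 2 sequences of 16 token embeddings of size 64.
block = TransformerBlock(d_model=64, n_heads=4)
print(block(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```

A full GPT stacks several such blocks between a token/position embedding layer and a language-modeling head that projects back to vocabulary logits.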
- Code in Java by MrQsdf - the code translated to Java
- Attention Is All You Need - The original Transformer paper
- Improving Language Understanding with Unsupervised Learning - OpenAI's GPT approach
This project is licensed under the MIT License - see the LICENSE file for details.
Created by PixelCrafted