PXL-GPT: Build ChatGPT From Scratch

A custom implementation of a GPT model built from scratch. This project demonstrates the fundamental concepts behind large language models like ChatGPT by implementing each component step by step.

🌟 Features

  • BPE Tokenizer: Custom implementation of Byte Pair Encoding for text tokenization
  • Dataset Handling: Preprocessing and management of text datasets
  • Model Training: Train your own GPT model on custom data
  • Clean Architecture: Modular design for easy understanding and extension

🛠️ Installation

# Clone the repository
git clone https://github.com/yourusername/chatgpt_from_scratch.git
cd chatgpt_from_scratch

# Install the package in development mode
pip install -e .

# Install pre-commit hooks
pre-commit install --install-hooks

🚀 Quick Start

Training a Tokenizer

from pxl_gpt.dataset import Dataset
from pxl_gpt.tokenizer import BPETokenizer

# Load and preprocess dataset
dataset = Dataset()
dataset.preprocess()

# Train tokenizer
tokenizer = BPETokenizer()
tokenizer.train(dataset)

# Save tokenizer
tokenizer.save("data/tokenizer.json")

Testing the Tokenizer

from pxl_gpt.tokenizer import BPETokenizer

# Load a saved tokenizer
tokenizer = BPETokenizer()
tokenizer.load("data/tokenizer.json")

# Test encoding and decoding
test_text = "Bonjour, comment ça va?"
encoded = tokenizer.encode(test_text)
decoded = tokenizer.decode(encoded)
print(f"Original: {test_text}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")

📊 Project Structure

chatgpt_from_scratch/
├── data/
│   ├── cleaned_data.csv  # Preprocessed dataset
│   └── tokenizer.json    # Trained tokenizer
├── extra/
│   ├── emb_comparison.py # Word vector arithmetic demonstrations
│   └── simple_cbow.py    # Simple CBOW implementation for word embeddings
├── pxl_gpt/
│   ├── dataset.py        # Dataset handling
│   ├── dataset_loader.py # Dataset loading utilities
│   ├── model.py          # GPT model implementation
│   └── tokenizer.py      # BPE tokenizer implementation
├── scripts/
│   └── train_tokenizer.py # Script to train the tokenizer
└── pyproject.toml        # Project configuration

🔍 Components

Dataset

The dataset module handles loading and preprocessing text data. By default, it uses a French Reddit discussion dataset.
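
To inspect the preprocessed data directly, the cleaned CSV in data/ can be opened with pandas. This is a minimal sketch, assuming cleaned_data.csv is a standard CSV written by Dataset.preprocess(); the actual column names and layout are not specified here.

import pandas as pd

# Load the preprocessed dataset written by Dataset.preprocess().
df = pd.read_csv("data/cleaned_data.csv")
print(df.shape)
print(df.head())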

Tokenizer

A custom implementation of Byte Pair Encoding (BPE) tokenization, similar to what's used in models like GPT. The tokenizer:

  • Splits text into initial tokens
  • Iteratively merges the most frequent adjacent token pairs
  • Builds a vocabulary of subword units
  • Provides encoding and decoding functionality
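
The merge loop described in the list above can be sketched in a few lines of plain Python. This is an illustrative toy, not the repository's BPETokenizer: the function name train_bpe, the character-level starting point, and the fixed number of merges are assumptions.

# Toy BPE training loop: repeatedly count adjacent token pairs and merge the
# most frequent one. Illustrative only; the real BPETokenizer may differ.
from collections import Counter

def train_bpe(words, num_merges):
    corpus = [list(w) for w in words]   # start from individual characters
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of tokens across the corpus.
        pairs = Counter()
        for tokens in corpus:
            for pair in zip(tokens, tokens[1:]):
                pairs[pair] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged_token = best[0] + best[1]
        # Replace every occurrence of the best pair with the merged token.
        new_corpus = []
        for tokens in corpus:
            out, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    out.append(merged_token)
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges

# The learned merge rules, most frequent pairs first.
print(train_bpe(["lower", "lowest", "newer", "wider"], num_merges=5))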

Extra Scripts

The extra directory contains supplementary scripts that demonstrate concepts related to word embeddings:

  • emb_comparison.py: Demonstrates word vector arithmetic using Google's pre-trained word2vec model. It shows interesting semantic relationships like "king - man + woman = queen" and similar vector operations.
  • simple_cbow.py: Implements a Continuous Bag of Words (CBOW) model from scratch using PyTorch. This script trains word embeddings on a small French corpus and visualizes the resulting word vectors, showing how related words cluster together in the embedding space.

These scripts help build an understanding of the embedding concepts that underlie modern language models.
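
For reference, the kind of vector arithmetic demonstrated in emb_comparison.py can be reproduced with gensim's downloadable word2vec vectors. This is a sketch of the idea, not the repository's script.

import gensim.downloader as api

# Downloads Google's pre-trained word2vec vectors on first use (~1.6 GB).
vectors = api.load("word2vec-google-news-300")

# king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))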

Model (Coming Soon)

The GPT model architecture implementation.
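
Until the implementation lands, here is a minimal sketch of the standard decoder-only transformer block that GPT-style models are built from. It uses PyTorch and is purely illustrative; it is not the repository's forthcoming model.py, and all names and sizes below are assumptions.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One pre-norm decoder block: masked self-attention followed by an MLP."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: position i may only attend to positions <= i.
        seq_len = x.size(1)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out                 # residual connection around attention
        x = x + self.mlp(self.norm2(x))  # residual connection around the MLP
        return x

# A full GPT stacks several of these blocks on top of token and position
# embeddings, then projects back to vocabulary logits.
block = TransformerBlock(d_model=128, n_heads=4)
tokens = torch.randn(2, 16, 128)   # (batch, sequence, embedding)
print(block(tokens).shape)         # torch.Size([2, 16, 128])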

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

👨‍💻 Author

Created by PixelCrafted. The project is developed live on the author's YouTube and Twitch channels, where ChatGPT is rebuilt from scratch together with the audience.
