A custom implementation of a GPT model built from scratch. This project demonstrates the fundamental concepts behind large language models like ChatGPT by implementing each component step by step.
- BPE Tokenizer: Custom implementation of Byte Pair Encoding for text tokenization
- Dataset Handling: Preprocessing and management of text datasets
- Model Training: Train your own GPT model on custom data
- Clean Architecture: Modular design for easy understanding and extension
chatgpt_from_scratch/
├── assets/
│ ├── data/
│ │ └── tokenizer.json # Trained tokenizer
│ └── dataset/
│ └── dataset.json # Preprocessed dataset
│
├── fr/mrqsdf/gptlike/
│ ├── resource/
│ │ └── Pair.java # Pair class
│ ├── utils/
│ │ ├── ColoredLogger.java # Custom logger with colored output
│ │ ├── Dataset.java # Dataset class
│ │ ├── BPETokenizer.java # BPE tokenizer implementation
│ │ └── DatasetLoader.java # Dataset loading utilities
│ └── Main.java # Main class for training the tokenizer
The dataset module handles loading and preprocessing text data. By default, it uses a French discussion dataset.
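The loading step can be approximated with plain JDK I/O. The sketch below is illustrative only: the class and method names (`DatasetSketch`, `loadLines`) and the plain-text corpus path are assumptions, not the project's actual `Dataset`/`DatasetLoader` API, which works with the preprocessed `assets/dataset/dataset.json` file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Minimal dataset-preprocessing sketch; names and paths are hypothetical.
public class DatasetSketch {

    // Reads a plain-text corpus and returns one cleaned string per non-empty line.
    public static List<String> loadLines(Path corpus) throws IOException {
        List<String> cleaned = new ArrayList<>();
        for (String line : Files.readAllLines(corpus, StandardCharsets.UTF_8)) {
            String normalized = line.strip().replaceAll("\\s+", " "); // collapse whitespace
            if (!normalized.isEmpty()) {
                cleaned.add(normalized);
            }
        }
        return cleaned;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical path; the project itself ships assets/dataset/dataset.json.
        List<String> lines = loadLines(Path.of("assets/dataset/corpus.txt"));
        System.out.println("Loaded " + lines.size() + " lines");
    }
}
```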
A custom implementation of Byte Pair Encoding (BPE) tokenization, similar to what is used in models like GPT (see the merge-loop sketch after this list). The tokenizer:
- Splits text into initial tokens
- Iteratively merges the most frequent adjacent token pairs
- Builds a vocabulary of subword units
- Provides encoding and decoding functionality
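The merge loop at the heart of BPE training can be sketched in a few lines of plain Java. Everything below (`BpeSketch`, `countPairs`, `mergePair`) is illustrative and does not mirror the project's `BPETokenizer`/`Pair` classes; it only demonstrates the pair-counting and merging idea.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the core BPE training loop; names are hypothetical.
public class BpeSketch {

    // Counts how often each adjacent token pair occurs in the sequence.
    static Map<String, Integer> countPairs(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            // Join with a separator that cannot appear inside a token.
            String pair = tokens.get(i) + "\u0000" + tokens.get(i + 1);
            counts.merge(pair, 1, Integer::sum);
        }
        return counts;
    }

    // Replaces every occurrence of (left, right) with the merged token.
    static List<String> mergePair(List<String> tokens, String left, String right) {
        List<String> merged = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (i + 1 < tokens.size() && tokens.get(i).equals(left) && tokens.get(i + 1).equals(right)) {
                merged.add(left + right);
                i += 2;
            } else {
                merged.add(tokens.get(i));
                i += 1;
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Start from character-level tokens and perform a few merge steps.
        List<String> tokens = new ArrayList<>(List.of("l", "o", "w", " ", "l", "o", "w", "e", "r"));
        for (int step = 0; step < 3; step++) {
            Map<String, Integer> counts = countPairs(tokens);
            String best = counts.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey)
                    .orElse(null);
            if (best == null) break;
            String[] parts = best.split("\u0000", 2);
            tokens = mergePair(tokens, parts[0], parts[1]);
            System.out.println("After merge " + (step + 1) + ": " + tokens);
        }
    }
}
```

In a full tokenizer, each merge is recorded in order so that encoding can replay the merges on new text, and decoding simply concatenates the subword strings back together.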
The GPT model architecture implementation.
- Attention Is All You Need - The original Transformer paper
- Improving Language Understanding with Unsupervised Learning - OpenAI's GPT approach
- Original Creator - The original project by PixelCrafted, from which this Java version was adapted
This project is licensed under the MIT License - see the LICENSE file for details.
Created by PixelCrafted
Translated to Java by MrQsdf