A custom implementation of a GPT model built from scratch. This project demonstrates the fundamental concepts behind large language models like ChatGPT by implementing each component step by step.
- BPE Tokenizer: Custom implementation of Byte Pair Encoding for text tokenization
- Dataset Handling: Preprocessing and management of text datasets
- Model Training: Train your own GPT model on custom data
- Clean Architecture: Modular design for easy understanding and extension
chatgpt_from_scratch/
├── assets/
│ ├── data/
│ │ └── tokenizer.json # Trained tokenizer
│ └── dataset/
│ └── dataset.json # Preprocessed dataset
│
├── fr/mrqsdf/gptlike/
│ ├── resource/
│ │ └── Pair.java # Pair class
│ ├── utils/
│ │ ├── ColoredLogger.java # Custom logger with colored output
│ │ ├── Dataset.java # Dataset class
│ │ ├── BPETokenizer.java # BPE tokenizer implementation
│ │ └── DatasetLoader.java # Dataset loading utilities
│ └── Main.java # Main class for training the tokenizer
The dataset module handles loading and preprocessing text data. By default, it uses a French discussion dataset.
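The loading step can be approximated with plain JDK I/O. The sketch below is illustrative only: the class and method names (`DatasetSketch`, `loadLines`) and the plain-text corpus path are assumptions, not the project's actual `Dataset`/`DatasetLoader` API, which works with the preprocessed `assets/dataset/dataset.json` file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Minimal dataset-preprocessing sketch; names and paths are hypothetical.
public class DatasetSketch {

    // Reads a plain-text corpus and returns one cleaned string per non-empty line.
    public static List<String> loadLines(Path corpus) throws IOException {
        List<String> cleaned = new ArrayList<>();
        for (String line : Files.readAllLines(corpus, StandardCharsets.UTF_8)) {
            String normalized = line.strip().replaceAll("\\s+", " "); // collapse whitespace
            if (!normalized.isEmpty()) {
                cleaned.add(normalized);
            }
        }
        return cleaned;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical path; the project itself ships assets/dataset/dataset.json.
        List<String> lines = loadLines(Path.of("assets/dataset/corpus.txt"));
        System.out.println("Loaded " + lines.size() + " lines");
    }
}
```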
A custom implementation of Byte Pair Encoding (BPE) tokenization, similar to what is used in models like GPT (see the merge-loop sketch after this list). The tokenizer:
- Splits text into initial tokens
- Iteratively merges the most frequent adjacent token pairs
- Builds a vocabulary of subword units
- Provides encoding and decoding functionality
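The merge loop at the heart of BPE training can be sketched in a few lines of plain Java. Everything below (`BpeSketch`, `countPairs`, `mergePair`) is illustrative and does not mirror the project's `BPETokenizer`/`Pair` classes; it only demonstrates the pair-counting and merging idea.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the core BPE training loop; names are hypothetical.
public class BpeSketch {

    // Counts how often each adjacent token pair occurs in the sequence.
    static Map<String, Integer> countPairs(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            // Join with a separator that cannot appear inside a token.
            String pair = tokens.get(i) + "\u0000" + tokens.get(i + 1);
            counts.merge(pair, 1, Integer::sum);
        }
        return counts;
    }

    // Replaces every occurrence of (left, right) with the merged token.
    static List<String> mergePair(List<String> tokens, String left, String right) {
        List<String> merged = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (i + 1 < tokens.size() && tokens.get(i).equals(left) && tokens.get(i + 1).equals(right)) {
                merged.add(left + right);
                i += 2;
            } else {
                merged.add(tokens.get(i));
                i += 1;
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Start from character-level tokens and perform a few merge steps.
        List<String> tokens = new ArrayList<>(List.of("l", "o", "w", " ", "l", "o", "w", "e", "r"));
        for (int step = 0; step < 3; step++) {
            Map<String, Integer> counts = countPairs(tokens);
            String best = counts.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey)
                    .orElse(null);
            if (best == null) break;
            String[] parts = best.split("\u0000", 2);
            tokens = mergePair(tokens, parts[0], parts[1]);
            System.out.println("After merge " + (step + 1) + ": " + tokens);
        }
    }
}
```

In a full tokenizer, each merge is recorded in order so that encoding can replay the merges on new text, and decoding simply concatenates the subword strings back together.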
The GPT model architecture implementation.
- Attention Is All You Need - The original Transformer paper
- Improving Language Understanding with Unsupervised Learning - OpenAI's GPT approach
- Original Creator - The original project by PixelCrafted, from which this Java version was adapted
This project is licensed under the MIT License - see the LICENSE file for details.
Created by PixelCrafted
Translated to Java by MrQsdf