🤖 Building an LLM from Scratch


A step-by-step implementation of a GPT-like Large Language Model following Sebastian Raschka's "Build a Large Language Model (From Scratch)"

📖 About This Project

This repository documents my journey through "Build a Large Language Model (From Scratch)" by Sebastian Raschka. I'm implementing each concept from the book in PyTorch, building a GPT-like language model from the ground up to truly understand how modern LLMs work.

📚 Book Progress: Chapter 6 of 7

Currently implementing: "Fine-tuning for classification"

✅ Completed Chapters

  • Chapter 1: Understanding large language models
  • Chapter 2: Working with text data
  • Chapter 3: Coding attention mechanisms
  • Chapter 4: Implementing a GPT model from scratch to generate text
  • Chapter 5: Pretraining on unlabeled data

🔜 Upcoming Chapters

  • Chapter 7: Fine-tuning to follow instructions

🎯 Learning Objectives

By following along with the book and this implementation, I'm learning:

  • Fundamentals of LLMs: How transformers revolutionized NLP and the architecture behind GPT models
  • Text Processing: Tokenization strategies, vocabulary building, and data preparation for neural networks
  • Attention Mechanisms: The mathematics and intuition behind self-attention and multi-head attention
  • Model Architecture: How to build a complete GPT model with embeddings, transformer blocks, and generation capabilities
  • Training Strategies: Pretraining objectives, loss functions, and optimization techniques (Chapter 5, completed)
  • Fine-tuning: Adapting pretrained models for specific tasks (Chapters 6-7, in progress)

🏗️ Current Implementation Status

✨ Implemented Components

📝 Text Data Processing (Chapter 2)

  • SimpleTokenizerV1 (src/llm_from_scratch/tokenizer/simple_tokenizer.py): Custom regex-based tokenizer
  • Text splitting and preprocessing utilities
  • Vocabulary management and encoding/decoding
  • Dataset preparation for training (a sliding-window sketch follows this list)
  • Text download utilities for fetching training data
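
Below is a minimal sketch of how the sliding-window dataset preparation can look. It is illustrative only: the class and parameter names (SlidingWindowDataset, max_length, stride) are assumptions and may differ from what gpt_dataset.py actually provides; tiktoken is used here for brevity in place of SimpleTokenizerV1.

import torch
import tiktoken
from torch.utils.data import Dataset, DataLoader

class SlidingWindowDataset(Dataset):
    """Pairs each fixed-length window of token IDs with its next-token targets."""

    def __init__(self, text, tokenizer, max_length, stride):
        token_ids = tokenizer.encode(text)
        self.inputs, self.targets = [], []
        # Slide a window over the token stream; targets are the inputs shifted by one.
        for i in range(0, len(token_ids) - max_length, stride):
            self.inputs.append(torch.tensor(token_ids[i:i + max_length]))
            self.targets.append(torch.tensor(token_ids[i + 1:i + max_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

# Build (batch_size, context_length) batches from the sample text
tokenizer = tiktoken.get_encoding("cl100k_base")
with open("tests/the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

dataset = SlidingWindowDataset(raw_text, tokenizer, max_length=256, stride=128)
loader = DataLoader(dataset, batch_size=4, shuffle=True, drop_last=True)
x, y = next(iter(loader))
print(x.shape, y.shape)  # torch.Size([4, 256]) torch.Size([4, 256])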

🎯 Attention Mechanisms (Chapter 3)

🤖 GPT Model Architecture (Chapter 4 - Completed)

🔤 Text Generation (Chapter 4 - Implemented)

📦 Pretraining on Unlabeled Data (Chapter 5 - Completed)

  • Pretraining Utils (src/llm_from_scratch/pretraining/utils.py): Helper functions for training
  • Objective: next-token prediction on unlabeled corpora (language modeling)
  • Data pipeline: tokenize with tiktoken (cl100k_base), create sequences of length context_length with next-token targets
  • Batching: (batch_size, context_length) input IDs with shifted targets
  • Loss: CrossEntropyLoss over vocabulary logits on shifted targets
  • Optimizer: AdamW; regularization via dropout; gradient clipping
  • Training loop: learning-rate warmup and cosine decay (planned), checkpointing and evaluation via perplexity (planned); a minimal training-step sketch follows this list
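
A minimal sketch of the training step described above, assuming a dataloader that yields (input_ids, target_ids) batches; function names here are illustrative and may not match the helpers in pretraining/utils.py:

import torch
import torch.nn.functional as F

def calc_batch_loss(model, input_ids, target_ids, device):
    """Cross-entropy over vocabulary logits, with targets shifted by one token."""
    input_ids, target_ids = input_ids.to(device), target_ids.to(device)
    logits = model(input_ids)                      # (batch, context_length, vocab_size)
    return F.cross_entropy(logits.flatten(0, 1),   # (batch * context_length, vocab_size)
                           target_ids.flatten())   # (batch * context_length,)

def train_one_epoch(model, loader, optimizer, device):
    model.train()
    for input_ids, target_ids in loader:
        optimizer.zero_grad()
        loss = calc_batch_loss(model, input_ids, target_ids, device)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        optimizer.step()
    return loss.item()

# optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)
# perplexity = torch.exp(validation_loss)  # planned evaluation metric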

💬 Fine-tuning for Classification (Chapter 6 - Current Focus)

📁 Project Structure

llm-from-scratch/
├── src/
│   └── llm_from_scratch/
│       ├── __init__.py
│       ├── attention/                    # Chapter 3: Attention implementations
│       │   ├── __init__.py
│       │   ├── simple_attention.py      # Simplified attention for learning
│       │   ├── self_attention.py        # Self-attention basics
│       │   ├── causal_attention.py      # Masked attention for autoregression
│       │   ├── simple_causal_attention.py # Simple causal attention variant
│       │   ├── trainable_attention.py   # Attention with learnable parameters
│       │   ├── multi_head_attention.py  # Multi-head attention mechanism
│       │   ├── multi_head_attention_wrapper.py # MHA wrapper utilities
│       │   └── batched_multiplication.py # Batched tensor operations
│       ├── gpt_architecture/             # Chapter 4: GPT model components
│       │   ├── __init__.py
│       │   ├── dummy_gpt_model.py       # Main GPT model class
│       │   ├── transformer.py           # Transformer block
│       │   ├── feed_forward.py          # FFN layer
│       │   ├── layer_normalization.py   # LayerNorm implementation
│       │   ├── gelu.py                  # GELU activation
│       │   └── text_generation.py       # Greedy decoding utilities
│       ├── tokenizer/                    # Chapter 2: Text processing
│       │   ├── __init__.py
│       │   ├── simple_tokenizer.py      # Tokenizer implementation
│       │   ├── gpt_dataset.py          # Dataset utilities
│       │   ├── sampling.py             # Generation sampling methods
│       │   └── text_download.py        # Text data downloading
│       └── pretraining/                  # Chapter 5: Pretraining components
│           ├── __init__.py
│           └── utils.py                 # Training utilities
├── tests/                                # Test files and scripts
│   ├── test_text_generation.py         # Text generation example
│   ├── test_embeddings.py              # Embeddings testing
│   ├── test_transformer_import.py      # Import verification
│   ├── dummy_gpt_use.py                # GPT model usage example
│   ├── loss_calculation.py             # Loss computation tests
│   ├── text_splitting.py               # Text processing tests
│   └── the-verdict.txt                 # Sample text data
├── docs/                                 # Documentation and notes
│   ├── ffn_importance.md
│   ├── gpt_output.md
│   ├── input_output_dimensions.md
│   ├── llm-optimization-insights.md
│   ├── positional_embedding.md
│   ├── python_project_best_practices.md
│   ├── pytorch_batched_matmul_guide.md
│   ├── self_attention_explained.md
│   ├── self_attention_weights.md
│   └── trainable_weight_matrices.md
├── main.py                              # Main entry point
├── pyproject.toml                       # Project configuration
├── uv.lock                              # Dependency lock file
└── README.md                            # This file

🚀 Installation

Prerequisites

  • Python 3.11+
  • Git
  • (Optional) CUDA-capable GPU for faster computation

Setup Instructions

This project uses uv for fast, reliable dependency management.

  1. Clone the repository:
git clone <your-repo-url>
cd llm-from-scratch
  2. Install uv (if not already installed):
curl -LsSf https://astral.sh/uv/install.sh | sh
  3. Create and activate virtual environment:
uv venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  4. Install dependencies:
uv sync

PyTorch GPU Support

The project uses PyTorch 2.4.0 (CPU version by default). For GPU support:

  1. Visit PyTorch Get Started
  2. Select your configuration and install:
uv pip install torch --index-url https://download.pytorch.org/whl/cu121  # CUDA 12.1

💻 Usage Examples

Basic Tokenization (Chapter 2)

from llm_from_scratch.tokenizer.simple_tokenizer import SimpleTokenizerV1

# Create tokenizer with vocabulary
vocab = {
    "Hello": 0, ",": 1, " ": 2, "world": 3, "!": 4,
    "LLM": 5, "from": 6, "scratch": 7
}
tokenizer = SimpleTokenizerV1(vocab)

# Encode and decode text
text = "Hello, world!"
token_ids = tokenizer.encode(text)
print(f"Tokens: {token_ids}")
print(f"Decoded: {tokenizer.decode(token_ids)}")

Attention Mechanism (Chapter 3)

import torch
from llm_from_scratch.attention.multi_head_attention import MultiHeadAttention

# Setup multi-head attention
batch_size, seq_len, d_model = 2, 10, 768
mha = MultiHeadAttention(
    d_in=d_model,
    d_out=d_model,
    context_length=seq_len,
    dropout=0.1,
    num_heads=12
)

# Process input
x = torch.randn(batch_size, seq_len, d_model)
output = mha(x)
print(f"Output shape: {output.shape}")  # [2, 10, 768]

GPT Model Forward Pass (Chapter 4)

import torch
from llm_from_scratch.gpt_architecture.dummy_gpt_model import DummyGPTModel

# Model configuration
config = {
    "vocab_size": 5000,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False
}

# Initialize model
model = DummyGPTModel(config).eval()  # eval() disables dropout for deterministic outputs

# Forward pass
input_ids = torch.randint(0, config["vocab_size"], (2, 10))
with torch.no_grad():
    logits = model(input_ids)
print(f"Logits shape: {logits.shape}")  # [2, 10, 5000]

Text Generation Quickstart (Chapter 4)

Example using the greedy generation loop:

import torch
from tiktoken import get_encoding
from llm_from_scratch.gpt_architecture.dummy_gpt_model import DummyGPTModel
from llm_from_scratch.gpt_architecture.text_generation import generate_text

# Tokenizer and model configuration
tokenizer = get_encoding("cl100k_base")
config = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}

model = DummyGPTModel(config).eval()

start = "Hello, I am"
encoded = tokenizer.encode(start)
idx = torch.tensor(encoded).unsqueeze(0)

out = generate_text(
    model=model,
    idx=idx,
    max_new_tokens=6,
    context_size=config["context_length"],
)
print(tokenizer.decode(out.squeeze(0).tolist()))

Run the example script directly:

uv run python tests/test_text_generation.py

🔬 Technical Implementation Details

Current Architecture (Chapter 4)

Input Text
    ↓
[Tokenization]
    ↓
Token IDs → Token Embeddings + Positional Embeddings
    ↓
[Transformer Block] × N_LAYERS
    ├── LayerNorm → Multi-Head Attention (with causal mask) → residual add
    └── LayerNorm → Feed-Forward Network (GELU) → residual add
    ↓
[Final Layer Norm]
    ↓
[Output Projection] → Logits
    ↓
[Sampling/Generation] → Generated Text

Key Design Decisions

  • Tokenizer: Simple regex-based splitting in the early chapters; BPE via tiktoken from the pretraining pipeline onward
  • Attention: Scaled dot-product with causal masking for autoregression
  • Positional Encoding: Learned embeddings (not sinusoidal)
  • Activation: GELU in feed-forward networks
  • Normalization: Pre-norm architecture (LayerNorm before each sub-layer; see the sketch after this list)
  • Model Size: Configurable, default similar to GPT-2 small (768 dim, 12 heads, 12 layers)
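
The following is a minimal sketch of the pre-norm residual wiring described above. It uses PyTorch's built-in LayerNorm and GELU in place of the custom layer_normalization.py and gelu.py modules, and the class name and constructor arguments are illustrative; the actual block in gpt_architecture/transformer.py may differ.

import torch
import torch.nn as nn
from llm_from_scratch.attention.multi_head_attention import MultiHeadAttention

class PreNormTransformerBlock(nn.Module):
    """Pre-norm residual block: LayerNorm runs before each sub-layer."""

    def __init__(self, emb_dim, context_length, n_heads, drop_rate):
        super().__init__()
        self.norm1 = nn.LayerNorm(emb_dim)
        self.attn = MultiHeadAttention(
            d_in=emb_dim, d_out=emb_dim,
            context_length=context_length,
            dropout=drop_rate, num_heads=n_heads,
        )
        self.norm2 = nn.LayerNorm(emb_dim)
        self.ffn = nn.Sequential(                    # GELU feed-forward with 4x expansion
            nn.Linear(emb_dim, 4 * emb_dim),
            nn.GELU(),
            nn.Linear(4 * emb_dim, emb_dim),
        )
        self.drop = nn.Dropout(drop_rate)

    def forward(self, x):
        x = x + self.drop(self.attn(self.norm1(x)))  # attention sub-layer + residual
        x = x + self.drop(self.ffn(self.norm2(x)))   # feed-forward sub-layer + residual
        return x

batch_size, seq_len, emb_dim = 2, 10, 768
block = PreNormTransformerBlock(emb_dim, context_length=seq_len, n_heads=12, drop_rate=0.1)
print(block(torch.randn(batch_size, seq_len, emb_dim)).shape)  # torch.Size([2, 10, 768])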

🛠️ Technologies Used

  • Python 3.11+: Core language
  • PyTorch 2.4.0: Deep learning framework
  • NumPy 2.3.2+: Numerical operations
  • tiktoken 0.11.0+: OpenAI's BPE tokenizer (for comparison)
  • uv: Fast Python package management

Development Tools

  • pytest: Testing framework
  • IPython: Interactive development
  • matplotlib: Visualizations

📚 References & Resources

Primary Reference

"Build a Large Language Model (From Scratch)" by Sebastian Raschka

Additional Resources

🚧 Roadmap

Immediate Next Steps (Chapter 6: Fine-tuning)

  • Implement classification head (a rough sketch follows this list)
  • Implement fine-tuning loop for classification
  • Build data loading pipeline for spam classification
  • Implement training metrics and logging for classification
  • Add checkpointing and resumability for fine-tuning
  • Provide a fine-tuning entry point (e.g., train_finetuning.py) and docs
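
One way the classification head could look is sketched below. This is an assumption, not the repository's implementation: it presumes the GPT backbone returns per-token outputs of shape (batch, seq_len, dim) and, following the book's approach, feeds only the last token's representation into a small linear head.

import torch
import torch.nn as nn

class SpamClassifier(nn.Module):
    """Wraps a pretrained GPT backbone with a small classification head."""

    def __init__(self, backbone, dim, num_classes=2):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(dim, num_classes)

    def forward(self, input_ids):
        features = self.backbone(input_ids)  # (batch, seq_len, dim)
        last_token = features[:, -1, :]      # last position attends to the whole sequence
        return self.head(last_token)         # (batch, num_classes)

# Fine-tuning step (sketch): cross-entropy on spam/ham labels
# loss = nn.functional.cross_entropy(classifier(input_ids), labels)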

Backlog

  • Temperature-based sampling for generation
  • Top-k and top-p (nucleus) sampling (a temperature/top-k sketch follows this list)
  • Interactive text generation demo
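
A rough sketch of what temperature and top-k sampling over the model's last-position logits could look like (parameter names are illustrative; top-p/nucleus sampling would filter by cumulative probability instead of a fixed k):

import torch

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Sample one token ID from last-position logits of shape (vocab_size,)."""
    if top_k is not None:
        # Keep only the k most likely tokens; mask the rest before the softmax.
        top_logits, _ = torch.topk(logits, top_k)
        logits = logits.masked_fill(logits < top_logits[-1], float("-inf"))
    if temperature > 0:
        probs = torch.softmax(logits / temperature, dim=-1)
        return torch.multinomial(probs, num_samples=1)
    return torch.argmax(logits, dim=-1, keepdim=True)  # temperature 0 falls back to greedy

next_id = sample_next_token(torch.randn(50257), temperature=0.8, top_k=50)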

Upcoming Chapters

  • Chapter 7: Instruction following capabilities
  • Chapter 7: RLHF concepts

Future Enhancements

  • Add comprehensive test coverage
  • Create Jupyter notebooks for each chapter
  • Build web interface with Gradio
  • Add model checkpointing
  • Performance profiling and optimization
  • Docker containerization

🤝 Contributing

This is a personal learning project following the book's progression. However, I welcome:

  • Bug reports and fixes
  • Clarifications and documentation improvements
  • Discussions about the concepts
  • Suggestions for better implementations

📄 License

This project is for educational purposes and does not yet have an explicit license.

🙏 Acknowledgments

  • Sebastian Raschka for writing this excellent book and making LLMs accessible
  • The PyTorch team for the amazing framework
  • The open-source community for inspiration and resources

"The best way to understand something is to build it from scratch"
🧠 Currently at Chapter 6 of 7 in the book 📚
