A step-by-step implementation of a GPT-like Large Language Model following Sebastian Raschka's "Build a Large Language Model (From Scratch)"
This repository documents my journey through "Build a Large Language Model (From Scratch)" by Sebastian Raschka. I'm implementing each concept from the book in PyTorch, building a GPT-like language model from the ground up to truly understand how modern LLMs work.
Currently implementing: "Fine-tuning for classification"
- Chapter 1: Understanding large language models
- Chapter 2: Working with text data
- Chapter 3: Coding attention mechanisms
- Chapter 4: Implementing a GPT model from scratch to generate text
- Chapter 5: Pretraining on unlabeled data
- Chapter 6: Fine-tuning for classification
- Chapter 7: Fine-tuning to follow instructions
By following along with the book and this implementation, I'm learning:
- Fundamentals of LLMs: How transformers revolutionized NLP and the architecture behind GPT models
- Text Processing: Tokenization strategies, vocabulary building, and data preparation for neural networks
- Attention Mechanisms: The mathematics and intuition behind self-attention and multi-head attention
- Model Architecture: How to build a complete GPT model with embeddings, transformer blocks, and generation capabilities
- Training Strategies: Pretraining objectives, loss functions, and optimization techniques
- Fine-tuning: Adapting pretrained models for specific tasks (in progress)
- SimpleTokenizerV1 (src/llm_from_scratch/tokenizer/simple_tokenizer.py): Custom regex-based tokenizer
- Text splitting and preprocessing utilities
- Vocabulary management and encoding/decoding
- Dataset preparation for training
- Text download utilities for fetching training data
- Simple Attention (src/llm_from_scratch/attention/simple_attention.py): Basic attention implementation for understanding
- Causal Attention (src/llm_from_scratch/attention/causal_attention.py): Masked attention to prevent looking ahead
- Multi-Head Attention (src/llm_from_scratch/attention/multi_head_attention.py): Parallel attention heads with projection
- Trainable Attention (src/llm_from_scratch/attention/trainable_attention.py): Attention with learnable parameters
- Batched Multiplication utilities for efficient tensor operations
- Scaled dot-product attention with proper normalization
- DummyGPTModel (src/llm_from_scratch/gpt_architecture/dummy_gpt_model.py): Complete GPT model implementation
- TransformerBlock (src/llm_from_scratch/gpt_architecture/transformer.py): Core transformer building block
- FeedForward Networks (src/llm_from_scratch/gpt_architecture/feed_forward.py): Position-wise feed-forward with GELU activation
- GELU Activation (src/llm_from_scratch/gpt_architecture/gelu.py): Custom GELU implementation
- Layer Normalization (src/llm_from_scratch/gpt_architecture/layer_normalization.py): Custom implementation for training stability
- Positional embeddings (learned)
- Residual connections and dropout
- Greedy decoding with context window cropping (src/llm_from_scratch/gpt_architecture/text_generation.py)
- Deterministic next-token selection via argmax over the softmax probabilities
- Example script: tests/test_text_generation.py using tiktoken (cl100k_base)
- Pretraining Utils (src/llm_from_scratch/pretraining/utils.py): Helper functions for training
- Objective: next-token prediction on unlabeled corpora (language modeling)
- Data pipeline: tokenize with tiktoken (cl100k_base), create sequences of length context_length with next-token targets
- Batching: (batch_size, context_length) input IDs with shifted targets
- Loss: CrossEntropyLoss over vocabulary logits on shifted targets
- Optimizer: AdamW; regularization via dropout; gradient clipping
- Training loop: learning-rate warmup, cosine decay (planned), checkpointing, and evaluation via perplexity (planned); see the sketch below
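The pipeline above boils down to a short PyTorch loop. The following is a minimal, illustrative sketch only: the names `make_batches` and `train_epoch` are hypothetical and do not reflect the actual signatures in src/llm_from_scratch/pretraining/utils.py, and the planned warmup/cosine schedule and checkpointing are omitted.

```python
import torch
import torch.nn.functional as F
from tiktoken import get_encoding

tokenizer = get_encoding("cl100k_base")
CONTEXT_LENGTH = 256  # assumed value; use the model's configured context_length

def make_batches(text, batch_size):
    """Hypothetical helper: chunk a corpus into (inputs, targets) where targets are inputs shifted by one token."""
    ids = torch.tensor(tokenizer.encode(text))
    n_seqs = (len(ids) - 1) // CONTEXT_LENGTH
    inputs = ids[: n_seqs * CONTEXT_LENGTH].view(n_seqs, CONTEXT_LENGTH)
    targets = ids[1 : n_seqs * CONTEXT_LENGTH + 1].view(n_seqs, CONTEXT_LENGTH)
    for i in range(0, n_seqs, batch_size):
        yield inputs[i : i + batch_size], targets[i : i + batch_size]

def train_epoch(model, text, batch_size=8, lr=4e-4):
    """Hypothetical next-token-prediction loop: AdamW, dropout inside the model, gradient clipping."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.1)
    model.train()
    for inputs, targets in make_batches(text, batch_size):
        logits = model(inputs)                                   # (batch, context, vocab_size)
        loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
```

The key detail is that the targets are simply the inputs shifted by one position, so the cross-entropy loss is computed over vocabulary logits at every position.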
- Spam Dataset (src/llm_from_scratch/clf_finetuning/spam_dataset.py): Dataset preparation for spam classification
- Finetuning Utils (src/llm_from_scratch/clf_finetuning/utils.py): Helper functions for fine-tuning
- Objective: classification on labeled data
- Data pipeline: preparing the spam dataset for classification
- Loss: CrossEntropyLoss for classification
- Training loop: fine-tuning the pretrained model on the classification task (see the sketch below)
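As a rough illustration of the setup described above, the pretrained backbone is reused and a small two-class head is trained with cross-entropy on the last token's representation. The sketch below is hypothetical: it assumes the model exposes its output projection as `out_head` (as in the book) and that batches arrive as `(input_ids, labels)`; adapt the names to the actual SpamDataset and utils implementations.

```python
import torch
import torch.nn.functional as F

def attach_classification_head(model, emb_dim=768, num_classes=2):
    """Swap the vocabulary-sized output projection for a spam/ham head.
    Assumes the attribute is named `out_head`; adjust to the real DummyGPTModel."""
    for param in model.parameters():
        param.requires_grad = False              # optionally freeze the pretrained backbone
    model.out_head = torch.nn.Linear(emb_dim, num_classes)  # the new head stays trainable
    return model

def finetune_step(model, input_ids, labels, optimizer):
    """One fine-tuning step: classify each sequence from the logits of its last token."""
    logits = model(input_ids)[:, -1, :]          # (batch_size, num_classes)
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```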
```
llm-from-scratch/
├── src/
│   └── llm_from_scratch/
│       ├── __init__.py
│       ├── attention/                           # Chapter 3: Attention implementations
│       │   ├── __init__.py
│       │   ├── simple_attention.py              # Simplified attention for learning
│       │   ├── self_attention.py                # Self-attention basics
│       │   ├── causal_attention.py              # Masked attention for autoregression
│       │   ├── simple_causal_attention.py       # Simple causal attention variant
│       │   ├── trainable_attention.py           # Attention with learnable parameters
│       │   ├── multi_head_attention.py          # Multi-head attention mechanism
│       │   ├── multi_head_attention_wrapper.py  # MHA wrapper utilities
│       │   └── batched_multiplication.py        # Batched tensor operations
│       ├── gpt_architecture/                    # Chapter 4: GPT model components
│       │   ├── __init__.py
│       │   ├── dummy_gpt_model.py               # Main GPT model class
│       │   ├── transformer.py                   # Transformer block
│       │   ├── feed_forward.py                  # FFN layer
│       │   ├── layer_normalization.py           # LayerNorm implementation
│       │   ├── gelu.py                          # GELU activation
│       │   └── text_generation.py               # Greedy decoding utilities
│       ├── tokenizer/                           # Chapter 2: Text processing
│       │   ├── __init__.py
│       │   ├── simple_tokenizer.py              # Tokenizer implementation
│       │   ├── gpt_dataset.py                   # Dataset utilities
│       │   ├── sampling.py                      # Generation sampling methods
│       │   └── text_download.py                 # Text data downloading
│       ├── pretraining/                         # Chapter 5: Pretraining components
│       │   ├── __init__.py
│       │   └── utils.py                         # Training utilities
│       └── clf_finetuning/                      # Chapter 6: Classification fine-tuning
│           ├── __init__.py
│           ├── spam_dataset.py                  # Spam dataset preparation
│           └── utils.py                         # Fine-tuning helpers
├── tests/                                       # Test files and scripts
│   ├── test_text_generation.py                  # Text generation example
│   ├── test_embeddings.py                       # Embeddings testing
│   ├── test_transformer_import.py               # Import verification
│   ├── dummy_gpt_use.py                         # GPT model usage example
│   ├── loss_calculation.py                      # Loss computation tests
│   ├── text_splitting.py                        # Text processing tests
│   └── the-verdict.txt                          # Sample text data
├── docs/                                        # Documentation and notes
│   ├── ffn_importance.md
│   ├── gpt_output.md
│   ├── input_output_dimensions.md
│   ├── llm-optimization-insights.md
│   ├── positional_embedding.md
│   ├── python_project_best_practices.md
│   ├── pytorch_batched_matmul_guide.md
│   ├── self_attention_explained.md
│   ├── self_attention_weights.md
│   └── trainable_weight_matrices.md
├── main.py                                      # Main entry point
├── pyproject.toml                               # Project configuration
├── uv.lock                                      # Dependency lock file
└── README.md                                    # This file
```
- Python 3.11+
- Git
- (Optional) CUDA-capable GPU for faster computation
This project uses uv for fast, reliable dependency management.
- Clone the repository:

```bash
git clone <your-repo-url>
cd llm-from-scratch
```

- Install uv (if not already installed):

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

- Create and activate a virtual environment:

```bash
uv venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

- Install dependencies:

```bash
uv sync
```

The project uses PyTorch 2.4.0 (CPU version by default). For GPU support:

- Visit PyTorch Get Started
- Select your configuration and install:

```bash
uv pip install torch --index-url https://download.pytorch.org/whl/cu121  # CUDA 12.1
```

Tokenizer usage example:

```python
from llm_from_scratch.tokenizer.simple_tokenizer import SimpleTokenizerV1

# Create tokenizer with vocabulary
vocab = {
    "Hello": 0, ",": 1, " ": 2, "world": 3, "!": 4,
    "LLM": 5, "from": 6, "scratch": 7,
}
tokenizer = SimpleTokenizerV1(vocab)

# Encode and decode text
text = "Hello, world!"
token_ids = tokenizer.encode(text)
print(f"Tokens: {token_ids}")
print(f"Decoded: {tokenizer.decode(token_ids)}")
```

Multi-head attention example:

```python
import torch
from llm_from_scratch.attention.multi_head_attention import MultiHeadAttention

# Set up multi-head attention
batch_size, seq_len, d_model = 2, 10, 768
mha = MultiHeadAttention(
    d_in=d_model,
    d_out=d_model,
    context_length=seq_len,
    dropout=0.1,
    num_heads=12,
)

# Process input
x = torch.randn(batch_size, seq_len, d_model)
output = mha(x)
print(f"Output shape: {output.shape}")  # [2, 10, 768]
```

GPT model example:

```python
import torch
from llm_from_scratch.gpt_architecture.dummy_gpt_model import DummyGPTModel

# Model configuration
config = {
    "vocab_size": 5000,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}

# Initialize model
model = DummyGPTModel(config)

# Forward pass
input_ids = torch.randint(0, config["vocab_size"], (2, 10))
with torch.no_grad():
    logits = model(input_ids)
print(f"Logits shape: {logits.shape}")  # [2, 10, 5000]
```

Example using the greedy generation loop:

```python
import torch
from tiktoken import get_encoding
from llm_from_scratch.gpt_architecture.dummy_gpt_model import DummyGPTModel
from llm_from_scratch.gpt_architecture.text_generation import generate_text
# Tokenizer and model configuration
tokenizer = get_encoding("cl100k_base")
config = {
    "vocab_size": tokenizer.n_vocab,  # cl100k_base has ~100k tokens; size the embedding to match
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}
model = DummyGPTModel(config).eval()
start = "Hello, I am"
encoded = tokenizer.encode(start)
idx = torch.tensor(encoded).unsqueeze(0)
out = generate_text(
    model=model,
    idx=idx,
    max_new_tokens=6,
    context_size=config["context_length"],
)
print(tokenizer.decode(out.squeeze(0).tolist()))
```

Run the example script directly:

```bash
uv run python tests/test_text_generation.py
```

Data flow through the model:

```
Input Text
    ↓
[Tokenization]
    ↓
Token IDs → Token Embeddings + Positional Embeddings
    ↓
[Transformer Block] × N_LAYERS
    ├── Multi-Head Attention (with causal mask)
    ├── Add & Norm
    ├── Feed-Forward Network
    └── Add & Norm
    ↓
[Final Layer Norm]
    ↓
[Output Projection] → Logits
    ↓
[Sampling/Generation] → Generated Text
```
- Tokenizer: Simple regex-based splitting (will explore BPE in later chapters)
- Attention: Scaled dot-product with causal masking for autoregression (see the sketch after this list)
- Positional Encoding: Learned embeddings (not sinusoidal)
- Activation: GELU in feed-forward networks
- Normalization: Pre-norm architecture (LayerNorm before sub-layers)
- Model Size: Configurable, default similar to GPT-2 small (768 dim, 12 heads, 12 layers)
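To make the attention decision concrete, here is a minimal, self-contained sketch of scaled dot-product attention with a causal mask; it illustrates the math only and is not the code in src/llm_from_scratch/attention/.

```python
import torch

def causal_scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, seq_len, head_dim). Returns attended values of the same shape."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k**0.5                 # scaled dot products
    seq_len = q.shape[-2]
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))            # hide future positions
    weights = torch.softmax(scores, dim=-1)                     # each row sums to 1
    return weights @ v
```

Multi-head attention runs several of these in parallel on learned projections of the input and concatenates the results, which is what the MultiHeadAttention class implements.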
- Python 3.11+: Core language
- PyTorch 2.4.0: Deep learning framework
- NumPy 2.3.2+: Numerical operations
- tiktoken 0.11.0+: OpenAI's BPE tokenizer (for comparison)
- uv: Fast Python package management
- pytest: Testing framework
- IPython: Interactive development
- matplotlib: Visualizations
"Build a Large Language Model (From Scratch)" by Sebastian Raschka
- Attention Is All You Need (Original Transformer Paper)
- GPT-2 Paper
- The Illustrated Transformer
- PyTorch Documentation
- Implement classification head
- Implement fine-tuning loop for classification
- Build data loading pipeline for spam classification
- Implement training metrics and logging for classification
- Add checkpointing and resumability for fine-tuning
- Provide a fine-tuning entry point (e.g., train_finetuning.py) and docs
- Temperature-based sampling for generation
- Top-k and top-p (nucleus) sampling (see the sampling sketch after this roadmap)
- Interactive text generation demo
- Chapter 7: Instruction following capabilities
- Chapter 7: RLHF concepts
- Add comprehensive test coverage
- Create Jupyter notebooks for each chapter
- Build web interface with Gradio
- Add model checkpointing
- Performance profiling and optimization
- Docker containerization
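Since temperature-based and top-k/top-p sampling are still roadmap items, the following is only a hedged sketch of how sampling could replace the argmax step in text_generation.py. The name `sample_next_token` is hypothetical, and top-p would follow the same filter-then-sample pattern as the top-k filter shown here.

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Sample a token id from last-position logits of shape (vocab_size,)."""
    if temperature <= 0:                          # fall back to greedy decoding
        return torch.argmax(logits).item()
    logits = logits / temperature                 # <1 sharpens, >1 flattens the distribution
    if top_k is not None:
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```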
This is a personal learning project following the book's progression. However, I welcome:
- Bug reports and fixes
- Clarifications and documentation improvements
- Discussions about the concepts
- Suggestions for better implementations
This project is for educational purposes; no explicit license has been chosen.
- Sebastian Raschka for writing this excellent book and making LLMs accessible
- The PyTorch team for the amazing framework
- The open-source community for inspiration and resources
🧠 Currently learning at Chapter 6/7 of the book 📚