A step-by-step implementation of a GPT-like Large Language Model following Sebastian Raschka's "Build a Large Language Model (From Scratch)"
This repository documents my journey through "Build a Large Language Model (From Scratch)" by Sebastian Raschka. I'm implementing each concept from the book in PyTorch, building a GPT-like language model from the ground up to truly understand how modern LLMs work.
Currently implementing: "Fine-tuning for classification"
- Chapter 1: Understanding large language models
- Chapter 2: Working with text data
- Chapter 3: Coding attention mechanisms
- Chapter 4: Implementing a GPT model from scratch to generate text
- Chapter 5: Pretraining on unlabeled data
- Chapter 6: Fine-tuning for classification
- Chapter 7: Fine-tuning to follow instructions
By following along with the book and this implementation, I'm learning:
- Fundamentals of LLMs: How transformers revolutionized NLP and the architecture behind GPT models
- Text Processing: Tokenization strategies, vocabulary building, and data preparation for neural networks
- Attention Mechanisms: The mathematics and intuition behind self-attention and multi-head attention
- Model Architecture: How to build a complete GPT model with embeddings, transformer blocks, and generation capabilities
- Training Strategies: Pretraining objectives, loss functions, and optimization techniques
- Fine-tuning: Adapting pretrained models for specific tasks (in progress)
- SimpleTokenizerV1 (src/llm_from_scratch/tokenizer/simple_tokenizer.py): Custom regex-based tokenizer
- Text splitting and preprocessing utilities
- Vocabulary management and encoding/decoding
- Dataset preparation for training
- Text download utilities for fetching training data
- Simple Attention (src/llm_from_scratch/attention/simple_attention.py): Basic attention implementation for understanding
- Causal Attention (src/llm_from_scratch/attention/causal_attention.py): Masked attention to prevent looking ahead
- Multi-Head Attention (src/llm_from_scratch/attention/multi_head_attention.py): Parallel attention heads with projection
- Trainable Attention (src/llm_from_scratch/attention/trainable_attention.py): Attention with learnable parameters
- Batched Multiplication utilities for efficient tensor operations
- Scaled dot-product attention with proper normalization
- DummyGPTModel (src/llm_from_scratch/gpt_architecture/dummy_gpt_model.py): Complete GPT model implementation
- TransformerBlock (src/llm_from_scratch/gpt_architecture/transformer.py): Core transformer building block
- FeedForward Networks (src/llm_from_scratch/gpt_architecture/feed_forward.py): Position-wise feed-forward with GELU activation
- GELU Activation (src/llm_from_scratch/gpt_architecture/gelu.py): Custom GELU implementation
- Layer Normalization (src/llm_from_scratch/gpt_architecture/layer_normalization.py): Custom implementation for training stability
- Positional embeddings (learned)
- Residual connections and dropout
- Greedy decoding with context window cropping (src/llm_from_scratch/gpt_architecture/text_generation.py)
- Deterministic next-token selection via argmax over the softmax probabilities
- Example script: tests/test_text_generation.py using tiktoken (cl100k_base)
- Pretraining Utils (src/llm_from_scratch/pretraining/utils.py): Helper functions for training
- Objective: next-token prediction on unlabeled corpora (language modeling)
- Data pipeline: tokenize with tiktoken (cl100k_base), create sequences of length context_length with next-token targets
- Batching: (batch_size, context_length) input IDs with shifted targets
- Loss: CrossEntropyLoss over vocabulary logits on shifted targets
- Optimizer: AdamW; regularization via dropout; gradient clipping
- Training loop: learning-rate warmup, cosine decay (planned), checkpointing, and evaluation via perplexity (planned); see the sketch below
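The pipeline above boils down to a short PyTorch loop. The following is a minimal, illustrative sketch only: the names `make_batches` and `train_epoch` are hypothetical and do not reflect the actual signatures in src/llm_from_scratch/pretraining/utils.py, and the planned warmup/cosine schedule and checkpointing are omitted.

```python
import torch
import torch.nn.functional as F
from tiktoken import get_encoding

tokenizer = get_encoding("cl100k_base")
CONTEXT_LENGTH = 256  # assumed value; use the model's configured context_length

def make_batches(text, batch_size):
    """Hypothetical helper: chunk a corpus into (inputs, targets) where targets are inputs shifted by one token."""
    ids = torch.tensor(tokenizer.encode(text))
    n_seqs = (len(ids) - 1) // CONTEXT_LENGTH
    inputs = ids[: n_seqs * CONTEXT_LENGTH].view(n_seqs, CONTEXT_LENGTH)
    targets = ids[1 : n_seqs * CONTEXT_LENGTH + 1].view(n_seqs, CONTEXT_LENGTH)
    for i in range(0, n_seqs, batch_size):
        yield inputs[i : i + batch_size], targets[i : i + batch_size]

def train_epoch(model, text, batch_size=8, lr=4e-4):
    """Hypothetical next-token-prediction loop: AdamW, dropout inside the model, gradient clipping."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.1)
    model.train()
    for inputs, targets in make_batches(text, batch_size):
        logits = model(inputs)                                   # (batch, context, vocab_size)
        loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
```

The key detail is that the targets are simply the inputs shifted by one position, so the cross-entropy loss is computed over vocabulary logits at every position.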
- Spam Dataset (src/llm_from_scratch/clf_finetuning/spam_dataset.py): Dataset preparation for spam classification
- Finetuning Utils (src/llm_from_scratch/clf_finetuning/utils.py): Helper functions for fine-tuning
- Objective: classification on labeled data
- Data pipeline: preparing the spam dataset for classification
- Loss: CrossEntropyLoss for classification
- Training loop: fine-tuning the pretrained model on the classification task (see the sketch below)
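As a rough illustration of the setup described above, the pretrained backbone is reused and a small two-class head is trained with cross-entropy on the last token's representation. The sketch below is hypothetical: it assumes the model exposes its output projection as `out_head` (as in the book) and that batches arrive as `(input_ids, labels)`; adapt the names to the actual SpamDataset and utils implementations.

```python
import torch
import torch.nn.functional as F

def attach_classification_head(model, emb_dim=768, num_classes=2):
    """Swap the vocabulary-sized output projection for a spam/ham head.
    Assumes the attribute is named `out_head`; adjust to the real DummyGPTModel."""
    for param in model.parameters():
        param.requires_grad = False              # optionally freeze the pretrained backbone
    model.out_head = torch.nn.Linear(emb_dim, num_classes)  # the new head stays trainable
    return model

def finetune_step(model, input_ids, labels, optimizer):
    """One fine-tuning step: classify each sequence from the logits of its last token."""
    logits = model(input_ids)[:, -1, :]          # (batch_size, num_classes)
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```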
```
llm-from-scratch/
├── src/
│   └── llm_from_scratch/
│       ├── __init__.py
│       ├── attention/                           # Chapter 3: Attention implementations
│       │   ├── __init__.py
│       │   ├── simple_attention.py              # Simplified attention for learning
│       │   ├── self_attention.py                # Self-attention basics
│       │   ├── causal_attention.py              # Masked attention for autoregression
│       │   ├── simple_causal_attention.py       # Simple causal attention variant
│       │   ├── trainable_attention.py           # Attention with learnable parameters
│       │   ├── multi_head_attention.py          # Multi-head attention mechanism
│       │   ├── multi_head_attention_wrapper.py  # MHA wrapper utilities
│       │   └── batched_multiplication.py        # Batched tensor operations
│       ├── gpt_architecture/                    # Chapter 4: GPT model components
│       │   ├── __init__.py
│       │   ├── dummy_gpt_model.py               # Main GPT model class
│       │   ├── transformer.py                   # Transformer block
│       │   ├── feed_forward.py                  # FFN layer
│       │   ├── layer_normalization.py           # LayerNorm implementation
│       │   ├── gelu.py                          # GELU activation
│       │   └── text_generation.py               # Greedy decoding utilities
│       ├── tokenizer/                           # Chapter 2: Text processing
│       │   ├── __init__.py
│       │   ├── simple_tokenizer.py              # Tokenizer implementation
│       │   ├── gpt_dataset.py                   # Dataset utilities
│       │   ├── sampling.py                      # Generation sampling methods
│       │   └── text_download.py                 # Text data downloading
│       ├── pretraining/                         # Chapter 5: Pretraining components
│       │   ├── __init__.py
│       │   └── utils.py                         # Training utilities
│       └── clf_finetuning/                      # Chapter 6: Classification fine-tuning
│           ├── __init__.py
│           ├── spam_dataset.py                  # Spam dataset preparation
│           └── utils.py                         # Fine-tuning helpers
├── tests/                                       # Test files and scripts
│   ├── test_text_generation.py                  # Text generation example
│   ├── test_embeddings.py                       # Embeddings testing
│   ├── test_transformer_import.py               # Import verification
│   ├── dummy_gpt_use.py                         # GPT model usage example
│   ├── loss_calculation.py                      # Loss computation tests
│   ├── text_splitting.py                        # Text processing tests
│   └── the-verdict.txt                          # Sample text data
├── docs/                                        # Documentation and notes
│   ├── ffn_importance.md
│   ├── gpt_output.md
│   ├── input_output_dimensions.md
│   ├── llm-optimization-insights.md
│   ├── positional_embedding.md
│   ├── python_project_best_practices.md
│   ├── pytorch_batched_matmul_guide.md
│   ├── self_attention_explained.md
│   ├── self_attention_weights.md
│   └── trainable_weight_matrices.md
├── main.py                                      # Main entry point
├── pyproject.toml                               # Project configuration
├── uv.lock                                      # Dependency lock file
└── README.md                                    # This file
```
- Python 3.11+
- Git
- (Optional) CUDA-capable GPU for faster computation
This project uses uv for fast, reliable dependency management.
- Clone the repository:

```bash
git clone <your-repo-url>
cd llm-from-scratch
```

- Install uv (if not already installed):

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

- Create and activate a virtual environment:

```bash
uv venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

- Install dependencies:

```bash
uv sync
```

The project uses PyTorch 2.4.0 (CPU version by default). For GPU support:

- Visit PyTorch Get Started
- Select your configuration and install:

```bash
uv pip install torch --index-url https://download.pytorch.org/whl/cu121  # CUDA 12.1
```

Tokenizer usage example:

```python
from llm_from_scratch.tokenizer.simple_tokenizer import SimpleTokenizerV1

# Create tokenizer with vocabulary
vocab = {
    "Hello": 0, ",": 1, " ": 2, "world": 3, "!": 4,
    "LLM": 5, "from": 6, "scratch": 7,
}
tokenizer = SimpleTokenizerV1(vocab)

# Encode and decode text
text = "Hello, world!"
token_ids = tokenizer.encode(text)
print(f"Tokens: {token_ids}")
print(f"Decoded: {tokenizer.decode(token_ids)}")
```

Multi-head attention example:

```python
import torch
from llm_from_scratch.attention.multi_head_attention import MultiHeadAttention

# Set up multi-head attention
batch_size, seq_len, d_model = 2, 10, 768
mha = MultiHeadAttention(
    d_in=d_model,
    d_out=d_model,
    context_length=seq_len,
    dropout=0.1,
    num_heads=12,
)

# Process input
x = torch.randn(batch_size, seq_len, d_model)
output = mha(x)
print(f"Output shape: {output.shape}")  # [2, 10, 768]
```

GPT model example:

```python
import torch
from llm_from_scratch.gpt_architecture.dummy_gpt_model import DummyGPTModel

# Model configuration
config = {
    "vocab_size": 5000,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}

# Initialize model
model = DummyGPTModel(config)

# Forward pass
input_ids = torch.randint(0, config["vocab_size"], (2, 10))
with torch.no_grad():
    logits = model(input_ids)
print(f"Logits shape: {logits.shape}")  # [2, 10, 5000]
```

Example using the greedy generation loop:

```python
import torch
from tiktoken import get_encoding
from llm_from_scratch.gpt_architecture.dummy_gpt_model import DummyGPTModel
from llm_from_scratch.gpt_architecture.text_generation import generate_text
# Tokenizer and model configuration
tokenizer = get_encoding("cl100k_base")
config = {
    "vocab_size": tokenizer.n_vocab,  # cl100k_base has ~100k tokens; size the embedding to match
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}
model = DummyGPTModel(config).eval()
start = "Hello, I am"
encoded = tokenizer.encode(start)
idx = torch.tensor(encoded).unsqueeze(0)
out = generate_text(
    model=model,
    idx=idx,
    max_new_tokens=6,
    context_size=config["context_length"],
)
print(tokenizer.decode(out.squeeze(0).tolist()))
```

Run the example script directly:

```bash
uv run python tests/test_text_generation.py
```

Data flow through the model:

```
Input Text
    ↓
[Tokenization]
    ↓
Token IDs → Token Embeddings + Positional Embeddings
    ↓
[Transformer Block] × N_LAYERS
    ├── Multi-Head Attention (with causal mask)
    ├── Add & Norm
    ├── Feed-Forward Network
    └── Add & Norm
    ↓
[Final Layer Norm]
    ↓
[Output Projection] → Logits
    ↓
[Sampling/Generation] → Generated Text
```
- Tokenizer: Simple regex-based splitting (will explore BPE in later chapters)
- Attention: Scaled dot-product with causal masking for autoregression (see the sketch after this list)
- Positional Encoding: Learned embeddings (not sinusoidal)
- Activation: GELU in feed-forward networks
- Normalization: Pre-norm architecture (LayerNorm before sub-layers)
- Model Size: Configurable, default similar to GPT-2 small (768 dim, 12 heads, 12 layers)
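To make the attention decision concrete, here is a minimal, self-contained sketch of scaled dot-product attention with a causal mask; it illustrates the math only and is not the code in src/llm_from_scratch/attention/.

```python
import torch

def causal_scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, seq_len, head_dim). Returns attended values of the same shape."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k**0.5                 # scaled dot products
    seq_len = q.shape[-2]
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))            # hide future positions
    weights = torch.softmax(scores, dim=-1)                     # each row sums to 1
    return weights @ v
```

Multi-head attention runs several of these in parallel on learned projections of the input and concatenates the results, which is what the MultiHeadAttention class implements.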
- Python 3.11+: Core language
- PyTorch 2.4.0: Deep learning framework
- NumPy 2.3.2+: Numerical operations
- tiktoken 0.11.0+: OpenAI's BPE tokenizer (for comparison)
- uv: Fast Python package management
- pytest: Testing framework
- IPython: Interactive development
- matplotlib: Visualizations
"Build a Large Language Model (From Scratch)" by Sebastian Raschka
- Attention Is All You Need (Original Transformer Paper)
- GPT-2 Paper
- The Illustrated Transformer
- PyTorch Documentation
- Implement classification head
- Implement fine-tuning loop for classification
- Build data loading pipeline for spam classification
- Implement training metrics and logging for classification
- Add checkpointing and resumability for fine-tuning
- Provide a fine-tuning entry point (e.g., train_finetuning.py) and docs
- Temperature-based sampling for generation
- Top-k and top-p (nucleus) sampling (see the sampling sketch after this roadmap)
- Interactive text generation demo
- Chapter 7: Instruction following capabilities
- Chapter 7: RLHF concepts
- Add comprehensive test coverage
- Create Jupyter notebooks for each chapter
- Build web interface with Gradio
- Add model checkpointing
- Performance profiling and optimization
- Docker containerization
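Since temperature-based and top-k/top-p sampling are still roadmap items, the following is only a hedged sketch of how sampling could replace the argmax step in text_generation.py. The name `sample_next_token` is hypothetical, and top-p would follow the same filter-then-sample pattern as the top-k filter shown here.

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Sample a token id from last-position logits of shape (vocab_size,)."""
    if temperature <= 0:                          # fall back to greedy decoding
        return torch.argmax(logits).item()
    logits = logits / temperature                 # <1 sharpens, >1 flattens the distribution
    if top_k is not None:
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```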
This is a personal learning project following the book's progression. However, I welcome:
- Bug reports and fixes
- Clarifications and documentation improvements
- Discussions about the concepts
- Suggestions for better implementations
This project is for educational purposes; no explicit license has been chosen.
- Sebastian Raschka for writing this excellent book and making LLMs accessible
- The PyTorch team for the amazing framework
- The open-source community for inspiration and resources
🧠 Currently learning at Chapter 6/7 of the book 📚