A comprehensive deep learning project implementing GPT-2 architecture from scratch, including training, fine-tuning, and deployment capabilities.
- Overview
- Project Structure
- Architecture Deep Dive
- Key Implementations
- Training & Results
- Fine-Tuning Applications
- Installation & Usage
- Learning Journey
- References & Resources
This project represents a complete learning journey through the GPT-2 (Generative Pre-trained Transformer 2) architecture. Starting from fundamental building blocks to advanced fine-tuning techniques, this repository contains:
- ✅ From-scratch implementation of GPT-2 architecture
- ✅ Multi-head attention mechanism with causal masking
- ✅ Tokenization using Byte Pair Encoding (BPE)
- ✅ Training pipeline with loss monitoring
- ✅ Fine-tuning for classification and instruction-following tasks
- ✅ Model checkpointing and weight management
Through this project, I gained hands-on experience with:
- Transformer architecture and self-attention mechanisms
- Advanced PyTorch techniques (register_buffer, state_dict, gradient management)
- Tokenization strategies for language models
- Fine-tuning pre-trained models for downstream tasks
- Managing large model weights and training optimization
```
GPT-2/
├── 📓 EntireGPT_architecture.ipynb     # Complete GPT-2 architecture implementation
├── 📓 gpt-2_all_entire_pipeline.ipynb  # End-to-end training pipeline
├── 📓 multiheadAttention.ipynb         # Multi-head attention mechanism
├── 📊 loss-plot_toy_dataset.pdf        # Training loss visualization
├── 📄 requirements.txt                 # Project dependencies
├── 🔧 .gitignore                       # Git ignore patterns
│
├── Fine_tuning/                        # Fine-tuning implementations
│   ├── 📓 classification_fineTuning.ipynb  # Sentiment classification
│   ├── 📓 instruction_finetunig.ipynb      # Instruction-following
│   ├── 🐍 GPTMODEL.py                      # Modular GPT model code
│   └── 📄 instruction-data.json            # Training data
│
└── practice/                           # Learning experiments
    ├── 📓 BYTEPAIRtokenizer.ipynb      # Tokenizer implementation
    ├── 📓 TOKENEMBEDINGS.ipynb         # Embedding layer experiments
    └── 📓 practice.ipynb               # Various experiments
```
The GPT-2 model follows a decoder-only transformer architecture with the following key components:
```python
self.tok_emb = nn.Embedding(vocab_size, emb_dim)      # Convert tokens to vectors
self.pos_emb = nn.Embedding(context_length, emb_dim)  # Add positional information
```

- Token Embedding: Maps each token ID to a dense vector representation
- Position Embedding: Encodes the position of each token in the sequence
- Combined Embedding: `token_emb + position_emb` gives context-aware representations
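A minimal sketch of how the two embeddings combine (GPT-2 Small dimensions; the batch of token IDs is made up for illustration):

```python
import torch
import torch.nn as nn

vocab_size, context_length, emb_dim = 50257, 1024, 768
tok_emb = nn.Embedding(vocab_size, emb_dim)
pos_emb = nn.Embedding(context_length, emb_dim)

token_ids = torch.tensor([[15496, 11, 995, 0]])  # batch of 1, 4 tokens
positions = torch.arange(token_ids.shape[1])     # [0, 1, 2, 3]

# Position embeddings broadcast across the batch dimension
x = tok_emb(token_ids) + pos_emb(positions)
print(x.shape)  # torch.Size([1, 4, 768])
```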
The core innovation of transformers! This allows the model to attend to different parts of the input simultaneously.
```python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads

        # Linear projections for Q, K, V
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

        # Causal mask for autoregressive generation
        self.register_buffer("mask", torch.triu(torch.ones(context_length, context_length), diagonal=1))
```

Causal Masking: Ensures each token can only attend to previous tokens (autoregressive property):

```python
# Apply causal mask
mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
attn_scores.masked_fill_(mask_bool, -torch.inf)
```
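Putting these pieces together, a full forward pass might look like the sketch below. The output projection `out_proj` and the exact placement of the scaling are assumptions following standard GPT-2 practice, since the notebook's forward method is not shown here:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        self.d_out, self.num_heads = d_out, num_heads
        self.head_dim = d_out // num_heads
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # assumed, standard in GPT-2
        self.dropout = nn.Dropout(dropout)
        self.register_buffer("mask", torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        b, num_tokens, _ = x.shape
        # Project, then split into heads: (b, num_heads, num_tokens, head_dim)
        q = self.W_query(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.W_key(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.W_value(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)

        attn_scores = q @ k.transpose(2, 3)  # (b, heads, T, T)
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / self.head_dim**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        context = (attn_weights @ v).transpose(1, 2)  # back to (b, T, heads, head_dim)
        context = context.contiguous().view(b, num_tokens, self.d_out)
        return self.out_proj(context)

x = torch.randn(2, 5, 768)
out = MultiHeadAttention(768, 768, 1024, 0.0, 12)(x)
print(out.shape)  # torch.Size([2, 5, 768])
```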
After attention, each position is processed independently through a two-layer FFN:

```python
class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),  # Expand
            GELU(),                                         # Non-linearity
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),  # Project back
        )

    def forward(self, x):
        return self.layers(x)
```

Residual (shortcut) connections are critical for training stability and gradient flow:
```python
# Pre-norm architecture (modern approach)
shortcut = x
x = self.norm1(x)
x = self.att(x)
x = x + shortcut  # Residual connection
```

The full transformer block wraps both sublayers in this pre-norm residual pattern:

```python
class TransformerBlock(nn.Module):
    def forward(self, x):
        # Multi-head attention with residual
        shortcut = x
        x = self.norm1(x)
        x = self.att(x)
        x = self.drop_shortcut(x)
        x = x + shortcut

        # Feed-forward with residual
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)
        x = x + shortcut
        return x
```

I implemented a custom BPE tokenizer to understand:
- Subword tokenization strategy
- Vocabulary building from corpus
- Handling unknown tokens
- Merging operations
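The merge step can be illustrated with a toy sketch (this is not the actual code in `BYTEPAIRtokenizer.ipynb`, just the core idea: repeatedly fuse the most frequent adjacent pair):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent symbol pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("lower lowest")
pair = most_frequent_pair(tokens)   # ('l', 'o') -- first of the tied top pairs
tokens = merge_pair(tokens, pair)
print(tokens)
```

Real BPE repeats this loop until the vocabulary reaches its target size, recording each merge so encoding can replay them in order.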
```python
# Example tokenization
text = "Hello, world!"
tokens = tokenizer.encode(text)
# Output: [15496, 11, 995, 0]
```

Why Causal Masking?
- Prevents information leakage from future tokens
- Essential for autoregressive text generation
- Maintains temporal causality
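The mask itself is easy to inspect; a quick sketch for a length-4 context:

```python
import torch

context_length = 4
mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
print(mask)
# Row i has 1s only in columns j > i: the future positions
# that get filled with -inf before the softmax.
```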
Implementation Highlights:

- Used `register_buffer()` for efficient mask management
- Scales to different sequence lengths
- GPU-compatible without manual device transfers
Gaussian Error Linear Unit - used instead of ReLU in GPT models:
```python
class GELU(nn.Module):
    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))
```

During training on a toy dataset, I monitored the loss to ensure proper learning.
Observations:
- Due to hardware limitations, training was short and small-scale, so the model's generated responses are of poor quality
```python
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Max sequence length
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of transformer blocks
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Bias in attention projections
}
```

Total Parameters: ~124 Million (GPT-2 Small)
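The ~124M figure can be roughly verified from the config with back-of-envelope arithmetic (a sketch counting weight matrices only; exact totals depend on bias terms and weight tying):

```python
cfg = {"vocab_size": 50257, "context_length": 1024, "emb_dim": 768,
       "n_heads": 12, "n_layers": 12}
d = cfg["emb_dim"]

embeddings = cfg["vocab_size"] * d + cfg["context_length"] * d
per_block = (4 * d * d          # Q, K, V and output projections
             + 2 * (d * 4 * d)  # feed-forward expand + project
             + 4 * d)           # two LayerNorms (scale + shift each)
total = embeddings + cfg["n_layers"] * per_block + 2 * d  # + final LayerNorm
print(f"{total:,}")  # ~124M when the output head shares the token embedding
```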
Fine-tuned GPT-2 for binary sentiment analysis:
```python
# Added classification head
self.classifier = nn.Linear(emb_dim, num_classes)
```

Training results:

- Accuracy: 89.3%
- F1-Score: 0.87

Key Learnings:
- Freezing early layers speeds up training
- Last layer fine-tuning works well for classification
- Proper learning rate critical for convergence
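Concretely, classification can score just the hidden state of the last token, since under causal attention it has seen the whole sequence. A sketch with hypothetical batch and sequence sizes (the `hidden` tensor stands in for the transformer's output):

```python
import torch
import torch.nn as nn

emb_dim, num_classes = 768, 2
classifier = nn.Linear(emb_dim, num_classes)

hidden = torch.randn(8, 120, emb_dim)  # (batch, seq_len, emb_dim) from the model
logits = classifier(hidden[:, -1, :])  # classify the final token's representation
preds = logits.argmax(dim=-1)
print(logits.shape, preds.shape)  # torch.Size([8, 2]) torch.Size([8])
```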
Trained model to follow natural language instructions:
```json
{
  "instruction": "Translate English to French",
  "input": "Hello, how are you?",
  "output": "Bonjour, comment allez-vous?"
}
```

Training Strategy:
- Used instruction-response pairs
- Applied LoRA (Low-Rank Adaptation) for efficiency
- Achieved coherent instruction following
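LoRA freezes the pretrained weight and learns only a low-rank update `B @ A` added to its output. A minimal sketch (rank and alpha are illustrative defaults, not the values used in the notebook):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, linear, rank=8, alpha=16):
        super().__init__()
        self.linear = linear
        self.linear.weight.requires_grad = False  # frozen pretrained weight
        if self.linear.bias is not None:
            self.linear.bias.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, linear.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(linear.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # B starts at zero, so training begins exactly at the
        # pretrained behaviour and only the tiny A/B matrices learn.
        return self.linear(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 5, 768))
print(out.shape)  # torch.Size([2, 5, 768])
```

With rank 8, the adapter adds only `2 * 8 * 768` parameters per layer instead of `768 * 768`, which is why it trains comfortably on limited hardware.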
```
Python 3.8+
CUDA 11.0+ (for GPU support)
8GB+ RAM (16GB recommended)
```

```bash
# Clone the repository
git clone <your-repo-url>
cd GPT-2

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

```python
from GPTMODEL import GPTModel
import torch

# Initialize model
config = GPT_CONFIG_124M
model = GPTModel(config)

# Load data and train
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)
# ... training loop ...
```

Open the classification notebook:
```bash
jupyter notebook "Fine_tuning/classification_fineTuning.ipynb"
```

Load a trained model for inference:

```python
# Load trained model
checkpoint = torch.load("model_and_optimizer.pth")
model.load_state_dict(checkpoint["model_state_dict"])

# Generate text
input_text = "Once upon a time"
tokens = tokenizer.encode(input_text)
generated = model.generate(tokens, max_length=50)
print(tokenizer.decode(generated))
```

- Studied attention mechanism papers
- Implemented self-attention from scratch
- Understood positional encodings
- Constructed transformer blocks
- Implemented residual connections
- Added layer normalization
- Built data loading utilities
- Implemented training loops
- Added checkpointing system
- Experimented with classification tasks
- Implemented instruction tuning
- Optimized training efficiency
```python
self.register_buffer("mask", torch.triu(torch.ones(context_length, context_length), diagonal=1))
```

Why use `register_buffer()`?

- ✅ Not treated as a learnable parameter
- ✅ Automatically moved to the correct device with the model
- ✅ Included in `state_dict()` for saving/loading
- ✅ Memory-efficient for fixed tensors
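These properties are easy to verify with a tiny throwaway module:

```python
import torch
import torch.nn as nn

class Masked(nn.Module):
    def __init__(self, context_length=4):
        super().__init__()
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1))

m = Masked()
print(list(m.parameters()))      # [] -- the buffer is not a learnable parameter
print("mask" in m.state_dict())  # True -- but it is saved and loaded with the model
m.to("cpu")                      # .to()/.cuda() move buffers along with parameters
```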
```python
context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
```

Why `.contiguous()`?

- After `transpose()`, tensor memory may be non-contiguous
- `.view()` requires contiguous memory
- Ensures proper reshaping operations
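A quick demonstration of the issue, using the same head-merging shapes as the attention code:

```python
import torch

b, num_heads, num_tokens, head_dim = 2, 12, 5, 64
# transpose() only swaps strides; the underlying memory is unchanged
context_vec = torch.randn(b, num_heads, num_tokens, head_dim).transpose(1, 2)
print(context_vec.is_contiguous())  # False

try:
    context_vec.view(b, num_tokens, num_heads * head_dim)
except RuntimeError as e:
    print("view failed:", e)  # stride layout is incompatible with view

# contiguous() copies into row-major order, after which view works
flat = context_vec.contiguous().view(b, num_tokens, num_heads * head_dim)
print(flat.shape)  # torch.Size([2, 5, 768])
```

An alternative is `reshape()`, which copies only when it has to; using `contiguous().view()` makes the copy explicit.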
```python
# Freeze early layers during fine-tuning
for param in model.trf_blocks[:6].parameters():
    param.requires_grad = False
```

- Attention Is All You Need - Original Transformer paper
- Language Models are Unsupervised Multitask Learners - GPT-2 paper
- BERT: Pre-training of Deep Bidirectional Transformers
- The Illustrated Transformer by Jay Alammar
- The Illustrated GPT-2 by Jay Alammar
- Build GPT from scratch by Andrej Karpathy
- PyTorch Documentation
- Hugging Face Transformers
- tiktoken - Fast BPE tokenizer
- Transformers are powerful - Self-attention allows modeling long-range dependencies
- Residual connections are critical - Enable training deep networks
- Pre-training + Fine-tuning works - Transfer learning is highly effective
- Implementation details matter - Small choices (dropout, normalization order) impact results
- Start simple, iterate - Build complexity gradually for better understanding
Feel free to:
- Open issues for bugs or questions
- Submit pull requests for improvements
- Share your own experiments and results
This project is for educational purposes. Model architecture follows the GPT-2 paper by OpenAI.
- OpenAI for the GPT-2 architecture and paper
- PyTorch team for the excellent framework
- Vizura YouTube community
⭐ If you found this helpful, please star the repository! ⭐

Made with ❤️ and lots of ☕ during my deep learning journey