Note: this description is LLM-generated; the original version was coded by hand, but later improvements have used LLMs.
This repository contains experiments with transformer language models, implementing a decoder-only architecture from scratch using PyTorch. The project focuses on training efficiency, performance optimization, and model evaluation.
The main experiment implements a GPT-style decoder-only transformer model trained on Wikipedia data. The implementation includes modern training techniques and optimizations for efficient learning.
- Architecture: Decoder-only transformer (GPT-style)
- Model dimension: 768 (configurable, tested up to 1024)
- Layers: 12 (tested with 24-layer configurations)
- Attention heads: 8 (tested with 16 heads)
- MLP dimension: 4x model dimension
- Context length: 512 tokens (configurable up to 1024)
- Vocabulary: ~50K tokens (GPT-2 tokenizer)
- Weight tying: Shared weights between token embedding and output projection
- Activation checkpointing: Memory-efficient gradient computation
- Scaled Dot-Product Attention: Uses PyTorch's efficient SDPA implementation (see the decoder-block sketch below)
- Layer normalization: Pre-norm configuration
- Dropout: Configurable dropout in attention and MLP layers
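A minimal sketch of what one such pre-norm decoder block with SDPA can look like; the class name, default sizes, and layout here are illustrative assumptions, not the repository's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """Illustrative pre-norm decoder block: LayerNorm -> causal SDPA -> residual,
    then LayerNorm -> 4x MLP -> residual."""
    def __init__(self, d_model: int = 768, n_heads: int = 8, dropout: float = 0.1):
        super().__init__()
        self.n_heads = n_heads
        self.dropout = dropout
        self.ln1 = nn.LayerNorm(d_model)
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        self.resid_drop = nn.Dropout(dropout)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),   # MLP dimension = 4x model dimension
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        h = self.ln1(x)                              # pre-norm before attention
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        # reshape to (B, n_heads, T, head_dim) for scaled_dot_product_attention
        q, k, v = (t.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
                   for t in (q, k, v))
        attn = F.scaled_dot_product_attention(
            q, k, v,
            dropout_p=self.dropout if self.training else 0.0,
            is_causal=True,                          # causal mask for decoder-only attention
        )
        attn = attn.transpose(1, 2).contiguous().view(B, T, C)
        x = x + self.resid_drop(self.proj(attn))     # residual around attention
        x = x + self.mlp(self.ln2(x))                # pre-norm + residual around MLP
        return x
```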
- Fused AdamW optimizer: Hardware-optimized fused kernels on CUDA (see the training-loop sketch below)
- Gradient accumulation: Efficient large batch training
- Warmup cosine scheduler: Learning rate scheduling with warmup
- Mixed precision training: bfloat16 for memory efficiency
- Optional torch.compile: PyTorch 2.x compilation for speed
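These pieces combine roughly as in the sketch below; `model`, `train_batches`, and all hyperparameter values are hypothetical placeholders, and the model is assumed to return its language-modeling loss directly:

```python
import math
import torch

device = "cuda"
accum_steps, max_steps, warmup_steps, peak_lr = 8, 10_000, 500, 3e-4  # placeholder values

model = model.to(device)
model = torch.compile(model)  # optional PyTorch 2.x compilation

# Fused AdamW kernel on CUDA
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, fused=True)

def lr_at(step: int) -> float:
    # linear warmup followed by cosine decay
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

for step, micro_batches in enumerate(train_batches):   # each element: a list of micro-batches
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    for micro in micro_batches:                         # gradient accumulation
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):  # mixed precision
            loss = model(micro.to(device))              # assumed to return the LM loss
        (loss / accum_steps).backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```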
- Token budget: 200-300M tokens
- Effective batch size: 512 (via gradient accumulation over micro-batches)
- Micro batch size: 32-64 (memory dependent)
- Learning rate: Peak LR with warmup and cosine decay
- Weight decay: Applied to parameters excluding embeddings and norms (see the parameter-grouping sketch below)
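One common way to implement this exclusion is to group parameters by shape and name; the sketch below is an assumption about the approach (including the `"emb"` name check), not the notebook's exact code:

```python
import torch
import torch.nn as nn

def build_optimizer(model: nn.Module, lr: float = 3e-4, weight_decay: float = 0.1):
    """Apply weight decay only to 2D+ weight matrices; biases, norm scales,
    and embeddings are excluded."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # 1D params (biases, LayerNorm weights) and embedding tables get no decay
        if param.ndim < 2 or "emb" in name.lower():   # hypothetical naming convention
            no_decay.append(param)
        else:
            decay.append(param)
    groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    return torch.optim.AdamW(groups, lr=lr, fused=torch.cuda.is_available())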
- Source: Wikipedia streaming dataset
- Processing: Packed sequences with document boundaries marked by EOS tokens (see the packing sketch below)
- Tokenization: HuggingFace GPT-2 tokenizer adapter
- Streaming: Memory-efficient streaming with worker sharding
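A rough sketch of streaming and packing; the dataset name and config (`wikimedia/wikipedia`, `20231101.en`) and the block size are illustrative assumptions, and worker sharding is omitted for brevity:

```python
from datasets import load_dataset
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
stream = load_dataset("wikimedia/wikipedia", "20231101.en",
                      split="train", streaming=True)

def packed_blocks(block_size: int = 512):
    """Tokenize streamed articles, mark document boundaries with EOS,
    and pack tokens contiguously into fixed-length blocks."""
    buffer = []
    for article in stream:
        buffer.extend(tokenizer(article["text"])["input_ids"])
        buffer.append(tokenizer.eos_token_id)          # document boundary
        while len(buffer) >= block_size + 1:
            yield buffer[: block_size + 1]              # +1 so inputs/targets shift by one
            buffer = buffer[block_size + 1 :]
```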
- Bits per token: Primary training metric
- Bits per byte: More interpretable evaluation metric (see the conversion sketch below)
- Next-token accuracy: Validation accuracy
- Attention distance histograms: Analysis of attention patterns
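For reference, both bit metrics derive from the mean cross-entropy loss, which PyTorch reports in nats; the helper names below are illustrative:

```python
import math

def bits_per_token(mean_nll_nats: float) -> float:
    """Convert mean cross-entropy (nats/token) to bits/token."""
    return mean_nll_nats / math.log(2)

def bits_per_byte(mean_nll_nats: float, total_tokens: int, total_bytes: int) -> float:
    """Rescale bits/token to bits/byte using the UTF-8 byte count of the evaluated text."""
    return bits_per_token(mean_nll_nats) * total_tokens / total_bytes
```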
Training runs achieved the following results on 200-300M token budgets:
- Initial baseline: ~14 bits per byte
- Optimized training: ~12.9 bits per byte (with proper weight decay and warmup)
- Improved hyperparameters: ~12.7 bits per byte (reduced dropout, adjusted weight decay)
- Learning rate scheduling: ~8.7 bits per byte (hold phase in LR schedule)
- Latest optimized run: ~11 bits per byte
- Fused AdamW: ~25% speed improvement (55min → 40min for 300M tokens)
- Activation checkpointing: Memory savings with minimal speed impact (usage sketch below)
- Gradient accumulation: Enables larger effective batch sizes
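Activation checkpointing can be applied per block as sketched below, assuming `blocks` is the model's list of decoder blocks (a hypothetical name); activations are recomputed during backward instead of being stored:

```python
import torch
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(blocks, x: torch.Tensor, use_checkpoint: bool = True):
    """Run decoder blocks, optionally checkpointing each one to trade compute for memory."""
    for block in blocks:
        if use_checkpoint and x.requires_grad:
            # non-reentrant checkpointing recomputes this block's activations in backward
            x = checkpoint(block, x, use_reentrant=False)
        else:
            x = block(x)
    return x
```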
- transformer_experiments.ipynb: Main implementation and training notebook
- rlvr.ipynb: Placeholder for future RL with verifiable rewards experiments
- CLAUDE.md: AI assistant guidelines and coding standards
The notebook contains a complete implementation including:
- Model Definition: Full transformer architecture implementation
- Training Loop: Optimized training with logging and validation
- Data Loading: Wikipedia dataset streaming and preprocessing
- Evaluation: Comprehensive metrics and attention analysis
- Visualization: Training curves and attention pattern analysis
- Activation checkpointing for reduced memory usage
- Gradient accumulation for large effective batch sizes
- Mixed precision training with bfloat16
- Proper parameter initialization (GPT-2 style; see the sketch below)
- Gradient clipping for training stability
- Layer norm for improved convergence
- Separate validation data stream
- Comprehensive metrics including bits per byte
- Attention pattern analysis for model interpretability
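A sketch of GPT-2 style initialization, under the assumption that residual output projections are named `*proj`; the details are illustrative, not the notebook's exact code:

```python
import math
import torch.nn as nn

def init_gpt2_style(model: nn.Module, n_layers: int = 12):
    """GPT-2 style init: normal(0, 0.02) for linear/embedding weights, zero biases,
    with residual output projections scaled down by 1/sqrt(2 * n_layers)."""
    for name, module in model.named_modules():
        if isinstance(module, (nn.Linear, nn.Embedding)):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if isinstance(module, nn.Linear) and module.bias is not None:
                nn.init.zeros_(module.bias)
        if isinstance(module, nn.Linear) and name.endswith("proj"):  # hypothetical naming
            nn.init.normal_(module.weight, mean=0.0, std=0.02 / math.sqrt(2 * n_layers))
```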
The rlvr.ipynb notebook is planned to contain experiments with reinforcement learning using verifiable rewards, building on the foundation established in the main transformer experiments.
- PyTorch 2.x+ (for torch.compile and SDPA)
- transformers (HuggingFace)
- datasets
- matplotlib
- tqdm
- CUDA-capable GPU (recommended)
This implementation serves as both a learning exercise and a foundation for more advanced transformer experiments, with a focus on training efficiency and modern optimization techniques.