Educational implementation of the Transformer architecture from the "Attention Is All You Need" paper, built with PyTorch.
- Complete Encoder-Decoder Architecture with cross-attention
- Modular Design - each component can be studied independently
- Multiple Tokenizers - SentencePiece and word-level tokenization
- WMT14 Dataset Integration - German-English translation
- Educational Focus - well-documented code with comprehensive docstrings
- Production-Quality Code - proper error handling, logging, and testing
git clone https://github.com/MayukhSobo/Transformer.git
cd Transformer
# Using uv (recommended)
uv sync
# Or using pip
pip install -r requirements.txt

from model import build_transformer
from config import Config
config = Config(config_file="config.toml")
transformer, dataset = build_transformer(config)
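# The forward pass below expects integer token-ID batches plus padding masks.
# The shapes, dtypes, and mask convention here are illustrative assumptions,
# not guarantees about this repository's API.
import torch
batch_size, src_len, tgt_len = 2, 16, 16
src_batch = torch.randint(0, 37000, (batch_size, src_len))        # 37000 = [model] vocab_size
tgt_batch = torch.randint(0, 37000, (batch_size, tgt_len))
src_pad_mask = torch.ones(batch_size, src_len, dtype=torch.bool)  # assume True marks real tokens
tgt_pad_mask = torch.ones(batch_size, tgt_len, dtype=torch.bool)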
# Forward pass
output = transformer.forward(src_batch, tgt_batch, src_pad_mask, tgt_pad_mask)

python main.py # Train with default config
python main.py --config custom.toml # Train with custom config

python test_runner.py # Run all tests
python test_runner.py pytest # Run with pytest
python test_runner.py coverage # Generate coverage report

Transformer/
├── arch/ # Core transformer modules
│ ├── attentions/ # Self, multi-head, and cross-attention
│ ├── encoder/ # Encoder components
│ ├── decoder/ # Decoder components
│ ├── embedding.py # Token embeddings
│ ├── positional_encoding.py
│ ├── feed_forward.py
│ └── residual_add_norm.py
├── tokenizer/ # Tokenization utilities
├── tests/ # Test suite
├── data/ # Dataset directory
├── config.toml # Model configuration
├── model.py # Model creation and orchestration
├── train.py # Training implementation
├── dataset.py # Dataset loading and preprocessing
└── main.py # CLI entry point
Default model configuration (~101 million parameters, using distinct source and target embeddings):
[model]
vocab_size = 37000
hidden_size = 512
max_seq_len = 512
n_heads = 8
n_layers = 6
ff_hidden_size = 2048
dropout_pe = 0.1
[tokenizer]
kind = "sentencepiece" # or "word"
algorithm = "bpe" # or "unigram"
vocab_size = 32000
[training]
batch_size = 32
epochs = 10
learning_rate = 0.0005
[dataset]
path = "./data"
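As a rough sanity check on the ~101 million figure, the sketch below counts parameters for these hyperparameters, assuming untied source embeddings, target embeddings, and output projection, with biases and post-norm LayerNorms included; the exact total depends on implementation details.

```python
# Back-of-the-envelope parameter count for the default config (assumptions:
# untied source/target embeddings and output projection, biases included).
V, d, L, d_ff = 37000, 512, 6, 2048

attn = 4 * (d * d + d)                   # Q, K, V, and output projections (with biases)
ffn = d * d_ff + d_ff + d_ff * d + d     # two linear layers (512 -> 2048 -> 512)
ln = 2 * d                               # LayerNorm scale and shift

encoder = L * (attn + ffn + 2 * ln)      # self-attention + FFN per layer
decoder = L * (2 * attn + ffn + 3 * ln)  # self-attention + cross-attention + FFN
embeddings = 2 * V * d                   # distinct source and target embeddings
output_proj = d * V + V                  # untied output projection with bias

total = encoder + decoder + embeddings + output_proj
print(f"{total / 1e6:.1f}M")             # ~101.0M
```

Tying the output projection to the target embedding would bring the total down to roughly 82M.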
- Multi-Head Attention: 8 heads with 64 dimensions each
- Positional Encoding: Fixed (non-learnable) sinusoidal encoding (see the sketch after this list)
- Feed-Forward: Two-layer MLP (512 → 2048 → 512)
- Residual Connections: Post-norm architecture with LayerNorm
- Cross-Attention: Full encoder-decoder interaction
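For reference, here is a minimal sketch of the fixed sinusoidal encoding, following the formulation in the original paper; it is not necessarily line-for-line identical to arch/positional_encoding.py.

```python
import math
import torch

def sinusoidal_positional_encoding(max_seq_len: int, hidden_size: int) -> torch.Tensor:
    """Fixed (non-learnable) sinusoidal encodings, shape (max_seq_len, hidden_size)."""
    position = torch.arange(max_seq_len, dtype=torch.float32).unsqueeze(1)   # (max_seq_len, 1)
    div_term = torch.exp(
        torch.arange(0, hidden_size, 2, dtype=torch.float32)
        * (-math.log(10000.0) / hidden_size)
    )
    pe = torch.zeros(max_seq_len, hidden_size)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_seq_len=512, hidden_size=512)  # matches the default config
```

In the standard setup, this table is added to the token embeddings and dropout (the dropout_pe setting above) is applied to the sum.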
- ✅ Complete Architecture: Encoder, decoder, and cross-attention implemented
- ✅ Tokenization: SentencePiece and word-level tokenizers
- ✅ Dataset Integration: WMT14 German-English with streaming support (see the loading sketch after this list)
- ⚠️ Training Pipeline: Forward pass implemented, optimization in progress
- ✅ Testing: Comprehensive test suite with 10.00/10 pylint score
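One common way to stream WMT14 German-English is via the Hugging Face datasets library, as sketched below; this is illustrative only and may differ from what dataset.py actually does.

```python
from itertools import islice
from datasets import load_dataset

# Stream WMT14 German-English without downloading the full corpus up front.
# Illustrative only; the repository's dataset.py may load the data differently.
wmt14 = load_dataset("wmt14", "de-en", split="train", streaming=True)

for example in islice(wmt14, 3):
    pair = example["translation"]        # {"de": ..., "en": ...}
    print(pair["de"], "->", pair["en"])
```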
# Run tests
python test_runner.py
# Run with coverage
python test_runner.py coverage
# Check code quality
pylint $(git ls-files '*.py')
# Format code
black .

- Attention Is All You Need - Original paper
- The Illustrated Transformer - Visual explanation
- The Annotated Transformer - Implementation guide
MIT License - Free to use for educational purposes.
Educational transformer implementation with complete encoder-decoder architecture and cross-attention, ready for sequence-to-sequence tasks.