A clean, optimized, and interpretable implementation of a decoder-only Transformer in PyTorch.
Zoof is a high-efficiency Small Language Model (SLM) engineered from scratch. It demonstrates how modern architectural choices and high-quality data can yield competitive performance in the sub-400M parameter regime, even with limited compute.
- Pre-Norm Architecture: Applies `RMSNorm` before the self-attention and MLP blocks for better gradient flow and training stability (a minimal sketch of the block structure follows this list).
- Rotary Positional Embeddings (RoPE): Replaces the absolute learned positional embeddings from v1 with RoPE, enabling better generalization to longer contexts.
- Flash Attention: Uses PyTorch's `F.scaled_dot_product_attention`, which dispatches to Flash Attention kernels when available, computing exact attention without materializing the full $O(N^2)$ attention matrix.
- Smart Initialization: Scales the residual projections by $1/\sqrt{2L}$ (where $L$ is the number of layers) to stabilize activation variance in deep residual paths.
- Extensive Pre-training: Trained on approximately 79 billion tokens from the FineWeb-Edu dataset, focusing on reasoning-dense content.
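The way these pieces fit together can be sketched in a few lines. The block below is illustrative only, not the code in `src/zoof_v1_2/model.py`: the class name, MLP width, init standard deviation, and use of `nn.RMSNorm` (PyTorch 2.4+) are assumptions, and RoPE is omitted for brevity.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreNormBlock(nn.Module):
    """One pre-norm decoder block: x + Attn(RMSNorm(x)), then x + MLP(RMSNorm(x))."""

    def __init__(self, d_model: int, n_heads: int, n_layers: int):
        super().__init__()
        self.n_heads = n_heads
        self.attn_norm = nn.RMSNorm(d_model)   # normalize *before* attention (pre-norm)
        self.mlp_norm = nn.RMSNorm(d_model)    # normalize *before* the MLP
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.attn_out = nn.Linear(d_model, d_model, bias=False)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model, bias=False),
            nn.GELU(),
            nn.Linear(d_model, d_model, bias=False),
        )
        # Scale the residual-path projections by 1/sqrt(2L): each layer adds two
        # residual branches, so this keeps output variance roughly constant with depth.
        # (The base std of 0.02 is an assumption, not taken from the repo.)
        scale = 1.0 / math.sqrt(2 * n_layers)
        nn.init.normal_(self.attn_out.weight, mean=0.0, std=0.02 * scale)
        nn.init.normal_(self.mlp[-1].weight, mean=0.0, std=0.02 * scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        h = self.attn_norm(x)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        # Reshape to (B, n_heads, T, head_dim); RoPE would be applied to q and k here.
        q, k, v = (t.view(B, T, self.n_heads, -1).transpose(1, 2) for t in (q, k, v))
        # SDPA selects a Flash/memory-efficient kernel when one is available.
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(B, T, C)
        x = x + self.attn_out(attn)           # first residual branch
        x = x + self.mlp(self.mlp_norm(x))    # second residual branch
        return x
```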
You can prompt the Zoof model using Google Colab's free T4 GPUs. This is the fastest way to try the model without installing anything locally.
Click here to open the Interactive Notebook
The notebook handles:
- Cloning the repository.
- Installing dependencies (torch, transformers).
- Loading the model on the GPU (cuda).
- Running the interactive chat loop.
This project uses uv for fast package management, but standard pip works as well (a pip alternative is sketched after the commands below).
- Python 3.8+
- PyTorch (CUDA required for Flash Attention)
git clone https://github.com/yourusername/zoof.git
cd zoof
uv sync
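If you prefer not to use uv, installing the runtime dependencies directly with pip should be enough. This assumes torch and transformers cover what the prompt script needs (as listed in the Colab section above); see `pyproject.toml` for the authoritative dependency list.

```bash
# pip alternative to `uv sync` (dependency list is an assumption; check pyproject.toml)
pip install torch transformers
```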
I've provided a script to chat with a pre-trained & fine-tuned version of the model (zoof-v1.2-394M-chat) hosted on Hugging Face.
Run the following to prompt the model:
python prompt_zoof.py
This script will:
- Download the config and model weights from `Jiraya/zoof-250M-chat`.
- Download the tokenizer from `Jiraya/zoof-tokenizer`.
- Launch an interactive session (the loading flow is sketched below).
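For reference, the loading flow performed by the script looks roughly like the following. This is an illustration, not the actual contents of `prompt_zoof.py`: the checkpoint filenames and the `Zoof`/`ZoofConfig` class names are assumptions about what `src/zoof_v1_2/model.py` defines.

```python
# Rough sketch of the download/load flow (illustrative; filenames and class
# names inside the Hugging Face repos are assumptions).
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Download config and model weights from the chat model repo.
weights_path = hf_hub_download("Jiraya/zoof-250M-chat", "model.pt")     # filename assumed
config_path = hf_hub_download("Jiraya/zoof-250M-chat", "config.json")   # filename assumed

# 2. Download the tokenizer from its own repo.
tokenizer = AutoTokenizer.from_pretrained("Jiraya/zoof-tokenizer")

# 3. Build the model and load the checkpoint (placeholder names for whatever
#    src/zoof_v1_2/model.py actually exposes), then start the chat loop.
# from zoof_v1_2.model import Zoof, ZoofConfig
# model = Zoof(ZoofConfig.from_json(config_path)).to(device)
# model.load_state_dict(torch.load(weights_path, map_location=device))
# model.eval()
```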
Despite being trained on significantly less data than industry baselines, zoof-v1.2-394M demonstrates competitive performance, particularly in tasks requiring boolean logic and physical commonsense.
| Benchmark | Metric | Zoof-v1.2-394M | SmolLM-360M | SmolLM2-360M | Qwen2.5-0.5B |
|---|---|---|---|---|---|
| Training Tokens | Data Efficiency | 79B | 600B | 4T | 18T |
| PIQA | Physical Commonsense | 69.5 | 71.6 | 71.7 | 69.9 |
| BoolQ | Boolean Reasoning | 59.9 | - | - | - |
| WinoGrande | Pronoun Resolution | 53.8 | 52.8 | 52.5 | 54.1 |
| HellaSwag | Commonsense NLI | 47.0 | 51.8 | 54.5 | 51.2 |
| OBQA | OpenBookQA | 37.2 | 37.2 | 37.4 | 37.4 |
| ARC-E | Science (Easy) | 44.3 | - | - | - |
| ARC-C | Science (Challenge) | 32.3 | - | - | 35.6 |
| SIQA | Social Commonsense | 40.3 | - | - | - |
| MMLU (cloze) | General Knowledge | 28.5 | 34.4 | 35.8 | 33.7 |
| MMLU | General Knowledge | 29.6 | - | - | - |
| RACE | Reading Comprehension | 38.3 | - | - | - |
Note: Zoof achieves these scores with ~2% of the training compute used for SmolLM2 (79B vs 4T tokens), highlighting the efficiency of the architecture and FineWeb-Edu dataset.
│
├── src/
│ ├── zoof_v1
│ │ └── model.py # Model definition for v1
│ ├── zoof_v1_2
│ │ └── model.py # Model definition for v1.2
│ ├── config.py # Configuration dataclass
│ ├── prompt_zoof.py # Interactive CLI chat script
│ └── utils.py # Helper utilities
├── .gitignore
├── .pre-commit-config.yaml
├── pyproject.toml
└── uv.lock # Dependency lock file