Minimal text diffusion in Python.
Karpathy's MicroGPT showed how GPT works in a few hundred lines. This project aims for the same style of explanation, but for discrete text diffusion.
Built for readers who want to understand the core text-diffusion loop without a large framework, large dataset pipeline, or production training stack.
Quick links: Quick Start · Files · Dataset · Reproducibility · Limitations
Most text-generation examples focus on autoregressive models. This repository shows the other path:
- Mask tokens instead of shifting them
- Denoise the full sequence instead of predicting strictly left to right
- Swap the denoiser architecture without changing the diffusion loop
It is best treated as a learning repo and maintenance-friendly reference implementation, not a benchmark or production system.
| Autoregressive (GPT, etc.) | Diffusion (This Project) | |
|---|---|---|
| Generation | Left to right, one token at a time | All at once, refining from noise |
| Attention | Causal (can only look left) | Bidirectional (looks everywhere) |
| Analogy | Writing word by word | Solving a crossword puzzle |
| Training | Predict the next token | Predict the erased tokens |
Take the name "emma":
Forward (training - add noise by masking):
t=0: e m m a <- clean
t=25: e _ m a <- some letters masked
t=50: _ _ m _ <- more masked
t=100: _ _ _ _ <- fully masked
Reverse (generation - remove noise by unmasking):
t=100: _ _ _ _ <- start from all masked
t=75: _ m _ _ <- fill in confident guesses first
t=50: e m _ a <- keep going
t=0: e m m a <- done
Train: given masked text at noise level t, predict the original.
Generate: start from all masks, then reveal the most confident predictions first.
| File | Denoiser | Dependencies | Notes |
|---|---|---|---|
train_minimal.py |
2-layer MLP | NumPy | Smallest runnable version, now script-safe and CLI-driven. |
train_pure.py |
3-layer MLP + skip | NumPy | More commentary, forward-process preview, temperature sweep. |
train.py |
4-layer Transformer | PyTorch | Bidirectional transformer version with the same diffusion loop. |
names.txt |
- | - | 32,000 lowercase names from U.S. SSA-style data. |
All three share the same masking, denoising, and confidence-based unmasking idea.
The NumPy scripts cap sequence length at 12, so they train on 31,979 names.
The PyTorch script keeps a max length of 16 and uses all 32,000 names.
The training data is names.txt, a list of 32,000 lowercase names using only a-z.
- Character vocabulary: 26 letters
- Special tokens: PAD and MASK
- NumPy variants: names longer than 12 characters are filtered out
- PyTorch variant: uses the full file with a max sequence length of 16
This keeps the tokenizer simple and makes it easy to inspect the full corruption and denoising process by eye.
- Python 3
pip- For
train.py, a PyTorch build compatible with your Python version and platform
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python3 train_minimal.py
python3 train_pure.pypip install -r requirements.txt
python3 train.pyIf PyTorch is missing, train.py now exits with a direct install hint instead of a raw import traceback.
Each training script supports a small CLI for quick experiments:
python3 train_minimal.py --steps 100 --samples 5 --seed 7
python3 train_pure.py --steps 100 --samples 5 --no-temperature-sweep
python3 train.py --steps 100 --samples 5 --no-forward-previewUseful options:
--steps: override training iterations--batch-size: override batch size--samples: number of generated names--temperature: primary sampling temperature--seed: make runs repeatable--quiet: reduce logs
Each script trains for a short run, then prints generated names. The exact output changes from run to run, but the structure is always:
Training...
...
Generated names:
amira
kayna
noria
For quick maintenance checks, reduce steps and sample count:
python3 train_minimal.py --steps 10 --samples 3 --seed 7
python3 train_pure.py --steps 10 --samples 3 --seed 7 --no-temperature-sweep
python3 train.py --steps 10 --samples 3 --seed 7 --no-temperature-sweepThis repository is intentionally small, so the main maintenance risks are environmental drift and script behavior drift.
Current baseline:
requirements.txtinstalls the shared NumPy baseline plus PyTorch for the transformer path.- Scripts no longer start training just because they were imported.
train_minimal.pyis compatible with older Python runtimes that do not support dictionary union syntax.train.pyavoids the duplicate temperature-sweep sampling call that previously doubled part of the runtime.
This project is intentionally lightweight, so reproducibility is partial rather than strict.
- All scripts support
--seed - NumPy runs are easy to make repeatable on the same machine
- PyTorch runs may still vary across devices and torch builds
- There is currently no checkpoint saving, evaluation harness, or experiment tracking
If you plan to extend this repository, the next maintenance step is usually adding saved checkpoints and a small regression harness for generated outputs.
Discrete diffusion. Image diffusion adds Gaussian noise to pixels. Text is discrete, so this project uses masking instead: replace tokens with [MASK]. This is an absorbing-state diffusion process.
Cosine schedule. Tokens are masked slowly at first, then faster later. This tends to work better than a constant masking rate.
Bidirectional attention. GPT uses causal masking. Diffusion models can look at both left and right context because they refine the whole sequence together.
Confidence-based unmasking. Each reverse step reveals only the predictions the model is most confident about. Uncertain positions stay masked until later.
Temperature. Lower values stay conservative. Higher values explore more and break down sooner.
MLP versions (train_minimal.py, train_pure.py):
one-hot(noisy tokens) + timestep -> Linear -> ReLU -> Linear -> logits
Transformer version (train.py):
Token Embed + Pos Embed + Time Embed -> Transformer x 4 -> logits
The diffusion loop is swappable. The denoiser could be an MLP, transformer, CNN, or something else.
| Here | Production (MDLM, SEDD, etc.) | |
|---|---|---|
| Data | 32K names | Billions of tokens |
| Vocab | 28 (a-z + pad + mask) | 32K-100K BPE |
| Model | Small MLP / 4-layer Transformer | 12-48 layer Transformer |
| Params | 170K-239K | 100M-10B |
| Training | Minutes, CPU | Days or weeks, GPU cluster |
Same core loop: mask -> denoise -> unmask.
Autoregressive models dominate, but diffusion can:
- Generate all tokens in parallel
- Edit any part of text, not just append
- Control what goes where more easily
- Generate in an order other than left to right
Quality still trails strong autoregressive models, but the gap has narrowed.
- Character-level names only, not wordpiece or BPE tokenization
- No validation split, held-out metrics, or benchmark reporting
- No checkpointing or resume support
- No packaged module layout yet; the project is still script-first
- README examples are educational, not guarantees of sample quality
This is deliberate. The repo optimizes for readability and hackability over completeness.
- D3PM - Austin et al., 2021
- Diffusion-LM - Li et al., 2022
- MDLM - Sahoo et al., 2024
- SimpleDM - Shi et al., 2024
- MicroGPT - Karpathy, inspiration for this project
MIT