A quick implementation of diffusion language models using transformers.
Much of this work is adapted from the paper Large Language Diffusion Models by Nie et al. (2025). I've tried to keep the code clean and concise, so currently the training script has fewer than 80 lines of code.
I recommend using uv to install packages (you can also just use pip):
pip install uv
uv pip install torch transformers datasets accelerate tqdm rich
- Run `accelerate launch train.py` to finetune DistilBERT on the TinyStories dataset.
- Change the training arguments as required by your compute constraints.
- I also uploaded the trained diffusion model to Hugging Face.
- Run
python demo.pyto use a trained model to generate short stories similar to those in the dataset. - See below for details on how the scripts work.
The model used is DistilBERT, which is pretrained for masked language modeling and, as an encoder-only transformer, is well suited to our purposes. Alternatively, you can swap it for any other language model, even a "decoder-only" transformer like GPT; just make sure the attention mask is all ones rather than causal.
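As a rough illustration, loading such a backbone with `transformers` might look like the sketch below; the checkpoint name `distilbert-base-uncased` is my assumption, not necessarily the one the scripts use.

```python
# Minimal sketch of loading an encoder-only masked-LM backbone with transformers.
# The checkpoint name below is an assumption; any AutoModelForMaskedLM-compatible
# checkpoint (or a decoder-only model with a non-causal attention mask) would do.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")

# The tokenizer's [MASK] token is what the diffusion process replaces tokens with.
mask_id = tokenizer.mask_token_id
```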
The training script is adapted from Algorithms 1 and 2 of the Nie et al. paper:
- A sequence `x` is sampled from the training corpus;
- A time step `t` is sampled uniformly between 0 and 1;
- Each token in `x` is masked with probability `t`;
- The model is trained to predict the masked tokens via maximum likelihood.
Importantly, the padding tokens can also be attended to, masked, and modeled.
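For concreteness, here is a minimal sketch of one such training step, assuming a `transformers` masked-LM model and placeholder names (`model`, `input_ids`, `mask_id`). It is not the repo's exact code and omits details such as the exact loss weighting used in the paper.

```python
import torch
import torch.nn.functional as F


def diffusion_training_step(model, input_ids, mask_id):
    """One masked-diffusion training step (sketch of Algorithms 1 and 2 in Nie et al.)."""
    batch_size, seq_len = input_ids.shape

    # Sample one masking ratio t ~ Uniform(0, 1) per sequence in the batch.
    t = torch.rand(batch_size, 1, device=input_ids.device)

    # Mask each token (including padding) independently with probability t.
    masked = torch.rand(batch_size, seq_len, device=input_ids.device) < t
    noisy_ids = torch.where(masked, torch.full_like(input_ids, mask_id), input_ids)

    # Attend to every position, padding included (no causal or padding mask).
    attention_mask = torch.ones_like(input_ids)

    # Predict the original tokens; only masked positions contribute to the loss.
    logits = model(input_ids=noisy_ids, attention_mask=attention_mask).logits
    labels = torch.where(masked, input_ids, torch.full_like(input_ids, -100))
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
    )
```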
The `demo.py` file is based on Algorithm 4:
- We start with a fully masked sequence `x`;
- For `t` going from 1 to 0 linearly in `T` steps:
  - Predict the masked tokens;
  - Remask each of the predicted tokens with probability `s/t`, where `s` is the next value of `t`.
We have a fully unmasked sequence at the end.
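Here is a corresponding sketch of the reverse process, again with placeholder names; the actual `demo.py` may differ in details such as sampling temperature and batching.

```python
import torch


@torch.no_grad()
def generate(model, tokenizer, seq_len=128, T=32, device="cpu"):
    """Sketch of the reverse (generation) process from Algorithm 4 in Nie et al."""
    mask_id = tokenizer.mask_token_id
    x = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)  # fully masked start

    timesteps = torch.linspace(1.0, 0.0, T + 1)  # t goes from 1 to 0 linearly
    for i in range(T):
        t, s = timesteps[i].item(), timesteps[i + 1].item()

        # Predict every token, then fill in only the currently masked positions.
        logits = model(input_ids=x).logits
        pred = torch.distributions.Categorical(logits=logits).sample()
        was_masked = x == mask_id
        x = torch.where(was_masked, pred, x)

        # Remask each freshly predicted token with probability s / t
        # (at the last step s = 0, so nothing gets remasked).
        if s > 0:
            remask = torch.rand(x.shape, device=device) < (s / t)
            x = torch.where(was_masked & remask, torch.full_like(x, mask_id), x)

    return tokenizer.decode(x[0], skip_special_tokens=True)
```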
Note that Nie et al. also describe a "lowest confidence" sampling process, but it is deterministic and unsuitable for unconditional generation. For more details, I recommend reading the paper and its references on language diffusion, such as Simple and Effective Masked Diffusion Language Models by Sahoo et al. (2024).
- Diffusion language models strongly remind me of the novella "Story of Your Life" by Ted Chiang, which the movie Arrival is based on.
- Of course, these models cannot tell the future, but they do process and communicate language non-sequentially. However, the language itself that they are trained to produce is sequential in nature, unlike in Chiang's story. Perhaps there is a way to train them on non-sequential representations of human language - if there are any good systems for that?
- For a deeper understanding, one might build the language model architecture from the ground up in PyTorch without `transformers`, but that was not the point of this project.
  - Still, I might do something like that in the future.
- `extra/train_alternative.py` is supposed to handle single-GPU training without `accelerate`, but I haven't tested it yet. Dependencies can be further removed, and this might grow into something resembling a package.
- Scaling up the model and the pretraining data, and then training on an instruction dataset via conditional maximum likelihood, should achieve results similar to those of Nie et al. (a rough sketch of the conditional objective follows this list). Interestingly, there should also be ways to align language diffusion models via RLHF, since this has been done in the image domain.
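Conceptually, the conditional variant is a small change to the training step sketched earlier: keep the prompt unmasked and only mask and score the response tokens. A hypothetical sketch follows; the `response_start` tensor and the function name are my own, not part of this repo.

```python
import torch
import torch.nn.functional as F


def conditional_training_step(model, input_ids, response_start, mask_id):
    """Hypothetical sketch: mask only the response part of (prompt, response) pairs."""
    batch_size, seq_len = input_ids.shape
    t = torch.rand(batch_size, 1, device=input_ids.device)

    # response_start[i] is the index where the response begins in sequence i.
    positions = torch.arange(seq_len, device=input_ids.device)
    in_response = positions.unsqueeze(0) >= response_start.unsqueeze(1)

    # Mask with probability t, but only inside the response span; the prompt stays clean.
    masked = (torch.rand(batch_size, seq_len, device=input_ids.device) < t) & in_response
    noisy_ids = torch.where(masked, torch.full_like(input_ids, mask_id), input_ids)

    logits = model(input_ids=noisy_ids).logits
    labels = torch.where(masked, input_ids, torch.full_like(input_ids, -100))
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
    )
```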
I welcome any contributions to this repository. As mentioned above, I might want to relax the reliance on dependencies and/or think of instruction tuning and alignment.
