Easy instruction tuning and preference tuning of LLMs using the TRL library. Instruction tuning, a.k.a. supervised fine-tuning (SFT), and preference tuning via direct preference optimization (DPO) are two of the most common procedures in the modern LLM post-training pipeline. They turn a pre-trained language model that only does text completion into a useful, human-aligned dialogue system.
I recommend using uv to install packages:
pip install uv
uv pip install numpy torch transformers datasets trl wandb

Otherwise, simply use pip.
Before running SFT/DPO, you need to specify the model, dataset, sample size (optional), and training arguments in a .yaml file. See configs for examples of such files (or use them directly).
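As a rough illustration, an SFT config might look like the sketch below. The field names are assumptions (typical of TRL's ModelConfig, ScriptArguments, and SFTConfig dataclasses), not necessarily the exact schema parsed by src/sft.py, so check the files in configs for the authoritative keys.

# Hypothetical SFT config sketch; keys are illustrative, not the repo's exact schema.
model_name_or_path: openai-community/gpt2   # base model to post-train
dataset_name: your-org/your-sft-dataset     # placeholder dataset id
learning_rate: 2.0e-5
num_train_epochs: 1
per_device_train_batch_size: 8
output_dir: gpt2-sft
push_to_hub: true
report_to: wandb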
After that, run the following commands (change sft to dpo if needed):
huggingface-cli login
wandb login
# On a single GPU:
python src/sft.py path/to/yaml/file
# On multiple GPUs:
accelerate launch src/sft.py path/to/yaml/file # Add kwargs as needed

For a list of SFT and DPO training arguments, refer here and here. For help with the accelerate command, refer here.
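To make the workflow concrete, here is a minimal sketch of an SFT script in the spirit of src/sft.py. It assumes TRL's TrlParser, ScriptArguments, ModelConfig, SFTConfig, and SFTTrainer; the repo's actual script may parse additional fields (such as the optional sample size), so treat this as an outline rather than the real implementation.

# Minimal SFT sketch; the repo's src/sft.py may differ in details.
from datasets import load_dataset
from trl import ModelConfig, ScriptArguments, SFTConfig, SFTTrainer, TrlParser

def main():
    # Read the model, dataset, and training arguments from the .yaml file
    # given on the command line (python src/sft.py path/to/yaml/file).
    parser = TrlParser((ScriptArguments, ModelConfig, SFTConfig))
    script_args, model_args, training_args = parser.parse_args_and_config()

    # Load the instruction-tuning dataset named in the config.
    dataset = load_dataset(script_args.dataset_name, split=script_args.dataset_train_split)

    # SFTTrainer accepts a model id string and loads the model and tokenizer itself.
    trainer = SFTTrainer(
        model=model_args.model_name_or_path,
        args=training_args,
        train_dataset=dataset,
    )
    trainer.train()

    # Push the post-trained model to the Hugging Face Hub if the config asks for it.
    if training_args.push_to_hub:
        trainer.push_to_hub()

if __name__ == "__main__":
    main()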
The current scripts push the post-trained models to the Hugging Face Hub. The provided .yaml files use GPT-2 by OpenAI and datasets from OLMo 2 32B by AI2. You can find the models trained with these config files here.
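The DPO path follows the same pattern (swap sft for dpo in the commands above). Below is a hedged sketch of what src/dpo.py might look like, assuming TRL's DPOTrainer and DPOConfig and a preference dataset with prompt/chosen/rejected columns; the actual script may differ.

# Minimal DPO sketch; the repo's src/dpo.py may differ in details.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer, ModelConfig, ScriptArguments, TrlParser

def main():
    # Same config-driven entry point as the SFT sketch above.
    parser = TrlParser((ScriptArguments, ModelConfig, DPOConfig))
    script_args, model_args, training_args = parser.parse_args_and_config()

    # Preference dataset with prompt / chosen / rejected columns.
    dataset = load_dataset(script_args.dataset_name, split=script_args.dataset_train_split)

    model = AutoModelForCausalLM.from_pretrained(model_args.model_name_or_path)
    tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # e.g. GPT-2 has no pad token

    trainer = DPOTrainer(
        model=model,                 # the reference model is created automatically if not given
        args=training_args,
        train_dataset=dataset,
        processing_class=tokenizer,  # called `tokenizer` in older TRL releases
    )
    trainer.train()

    if training_args.push_to_hub:
        trainer.push_to_hub()

if __name__ == "__main__":
    main()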
Reinforcement learning with verifiable rewards (RLVR) is a recent method for teaching language models to solve verifiable problems, such as math and programming problems, via chain-of-thought (CoT) reasoning (think DeepSeek R1). This codebase can be easily extended to RLVR using TRL's GRPOTrainer (or PPOTrainer). While running RLVR on the GPT-2 suite is unlikely to be productive, more recent models such as Qwen 3 should be suitable.
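As a hedged sketch of how that extension might look with GRPOTrainer: the model, the toy dataset, and the substring-matching reward below are all illustrative placeholders, not something this repo ships.

# RLVR sketch with TRL's GRPOTrainer; names and reward are placeholders.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Toy verifiable dataset: prompts paired with known answers.
train_dataset = Dataset.from_list([
    {"prompt": "What is 7 * 8? Think step by step, then state the final answer.", "answer": "56"},
    {"prompt": "What is 12 + 30? Think step by step, then state the final answer.", "answer": "42"},
])

def accuracy_reward(completions, answer, **kwargs):
    # Verifiable reward: 1.0 if the reference answer appears in the completion, else 0.0.
    # A real setup would extract and exactly match the final answer (e.g., with a math verifier).
    return [1.0 if a in c else 0.0 for c, a in zip(completions, answer)]

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",                   # a small recent model; GPT-2 would likely be too weak
    reward_funcs=accuracy_reward,              # one or more reward functions
    args=GRPOConfig(output_dir="qwen3-grpo"),  # hypothetical output directory
    train_dataset=train_dataset,
)
trainer.train()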
