Easy instruction tuning and preference tuning of LLMs using the TRL library. Instruction tuning, a.k.a. supervised fine-tuning (SFT), and preference tuning via direct preference optimization (DPO) are two of the most common procedures in the modern LLM post-training pipeline. They turn a pre-trained language model that only does text completion into a useful, human-aligned dialogue system.
I recommend using uv to install packages:
pip install uv
uv pip install numpy torch transformers datasets trl wandb

Otherwise, simply use pip.
Before running SFT/DPO, you need to specify the model, dataset, sample size (optional), and training arguments in a .yaml file. See configs for examples of such files (or use them directly).
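As a rough illustration, an SFT config might look like the sketch below. The field names are assumptions (typical of TRL's ModelConfig, ScriptArguments, and SFTConfig dataclasses), not necessarily the exact schema parsed by src/sft.py, so check the files in configs for the authoritative keys.

# Hypothetical SFT config sketch; keys are illustrative, not the repo's exact schema.
model_name_or_path: openai-community/gpt2   # base model to post-train
dataset_name: your-org/your-sft-dataset     # placeholder dataset id
learning_rate: 2.0e-5
num_train_epochs: 1
per_device_train_batch_size: 8
output_dir: gpt2-sft
push_to_hub: true
report_to: wandb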
After that, run the following commands (change sft to dpo if needed):
huggingface-cli login
wandb login
# On a single GPU:
python src/sft.py path/to/yaml/file
# On multiple GPUs:
accelerate launch src/sft.py path/to/yaml/file # Add kwargs as needed

For a list of SFT and DPO training arguments, refer here and here. For help with the accelerate command, refer here.
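To make the workflow concrete, here is a minimal sketch of an SFT script in the spirit of src/sft.py. It assumes TRL's TrlParser, ScriptArguments, ModelConfig, SFTConfig, and SFTTrainer; the repo's actual script may parse additional fields (such as the optional sample size), so treat this as an outline rather than the real implementation.

# Minimal SFT sketch; the repo's src/sft.py may differ in details.
from datasets import load_dataset
from trl import ModelConfig, ScriptArguments, SFTConfig, SFTTrainer, TrlParser

def main():
    # Read the model, dataset, and training arguments from the .yaml file
    # given on the command line (python src/sft.py path/to/yaml/file).
    parser = TrlParser((ScriptArguments, ModelConfig, SFTConfig))
    script_args, model_args, training_args = parser.parse_args_and_config()

    # Load the instruction-tuning dataset named in the config.
    dataset = load_dataset(script_args.dataset_name, split=script_args.dataset_train_split)

    # SFTTrainer accepts a model id string and loads the model and tokenizer itself.
    trainer = SFTTrainer(
        model=model_args.model_name_or_path,
        args=training_args,
        train_dataset=dataset,
    )
    trainer.train()

    # Push the post-trained model to the Hugging Face Hub if the config asks for it.
    if training_args.push_to_hub:
        trainer.push_to_hub()

if __name__ == "__main__":
    main()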
The current scripts push the post-trained models to the Hugging Face Hub. The provided .yaml files use GPT-2 by OpenAI and datasets from OLMo 2 32B by AI2. You can find the models trained with these config files here.
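The DPO path follows the same pattern (swap sft for dpo in the commands above). Below is a hedged sketch of what src/dpo.py might look like, assuming TRL's DPOTrainer and DPOConfig and a preference dataset with prompt/chosen/rejected columns; the actual script may differ.

# Minimal DPO sketch; the repo's src/dpo.py may differ in details.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer, ModelConfig, ScriptArguments, TrlParser

def main():
    # Same config-driven entry point as the SFT sketch above.
    parser = TrlParser((ScriptArguments, ModelConfig, DPOConfig))
    script_args, model_args, training_args = parser.parse_args_and_config()

    # Preference dataset with prompt / chosen / rejected columns.
    dataset = load_dataset(script_args.dataset_name, split=script_args.dataset_train_split)

    model = AutoModelForCausalLM.from_pretrained(model_args.model_name_or_path)
    tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # e.g. GPT-2 has no pad token

    trainer = DPOTrainer(
        model=model,                 # the reference model is created automatically if not given
        args=training_args,
        train_dataset=dataset,
        processing_class=tokenizer,  # called `tokenizer` in older TRL releases
    )
    trainer.train()

    if training_args.push_to_hub:
        trainer.push_to_hub()

if __name__ == "__main__":
    main()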
Reinforcement learning with verifiable rewards (RLVR) is a recent method for teaching language models to solve verifiable problems, such as math and programming problems, via chain-of-thought (CoT) reasoning (think DeepSeek R1). This codebase can be easily extended to RLVR using TRL's GRPOTrainer (or PPOTrainer). While running RLVR on the GPT-2 suite is unlikely to be productive, more recent models such as Qwen 3 should be suitable.
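As a hedged sketch of how that extension might look with GRPOTrainer: the model, the toy dataset, and the substring-matching reward below are all illustrative placeholders, not something this repo ships.

# RLVR sketch with TRL's GRPOTrainer; names and reward are placeholders.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Toy verifiable dataset: prompts paired with known answers.
train_dataset = Dataset.from_list([
    {"prompt": "What is 7 * 8? Think step by step, then state the final answer.", "answer": "56"},
    {"prompt": "What is 12 + 30? Think step by step, then state the final answer.", "answer": "42"},
])

def accuracy_reward(completions, answer, **kwargs):
    # Verifiable reward: 1.0 if the reference answer appears in the completion, else 0.0.
    # A real setup would extract and exactly match the final answer (e.g., with a math verifier).
    return [1.0 if a in c else 0.0 for c, a in zip(completions, answer)]

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",                   # a small recent model; GPT-2 would likely be too weak
    reward_funcs=accuracy_reward,              # one or more reward functions
    args=GRPOConfig(output_dir="qwen3-grpo"),  # hypothetical output directory
    train_dataset=train_dataset,
)
trainer.train()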
