# Post-Training LLMs with TRL

Easy instruction tuning and preference tuning of LLMs using the TRL library. Instruction tuning, a.k.a. supervised fine-tuning (SFT), and preference tuning via direct preference optimization (DPO) are two of the most common procedures in the modern LLM post-training pipeline. They help turn a pre-trained language model that only does text completion into a useful, human-aligned dialogue system.


## Install

I recommend using `uv` to install packages:

```bash
pip install uv
uv pip install numpy torch transformers datasets trl wandb
```

Otherwise, simply use `pip`.

## Run

Before running SFT/DPO, you need to specify the model, dataset, sample size (optional), and training arguments in a `.yaml` file. See the `configs` directory for examples of such files (or use them directly).
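
For orientation, here is a minimal sketch of what such a config file might look like. The field names below are assumptions for illustration (the actual keys are defined by the scripts in `src/` and by TRL's `SFTConfig`), so check the files in `configs` for the real schema:

```yaml
# Hypothetical SFT config sketch -- field names are illustrative,
# not necessarily the repository's actual schema; see configs/ for real examples.
model_name_or_path: openai-community/gpt2          # base model to post-train
dataset_name: allenai/tulu-3-sft-olmo-2-mixture    # assumed SFT dataset id
sample_size: 10000                                 # optional: subsample the dataset

# Training arguments (forwarded to TRL's SFTConfig)
output_dir: gpt2-sft
num_train_epochs: 1
per_device_train_batch_size: 8
learning_rate: 2.0e-5
push_to_hub: true
report_to: wandb
```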

After that, run the following commands (change `sft` to `dpo` if needed):

```bash
huggingface-cli login
wandb login

# On a single GPU:
python src/sft.py path/to/yaml/file

# On multiple GPUs:
accelerate launch src/sft.py path/to/yaml/file  # Add kwargs as needed
```

For the full list of SFT and DPO training arguments, see the TRL documentation for `SFTConfig` and `DPOConfig`. For help with the `accelerate` command, see the Accelerate documentation or run `accelerate launch --help`.
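
As a rough illustration (not specific to this repo), a typical multi-GPU launch with explicit Accelerate options might look like the following; the config path is a placeholder:

```bash
# Hypothetical invocation: 4 GPUs with bf16 mixed precision.
# --num_processes and --mixed_precision are standard Accelerate flags;
# configs/sft.yaml is a placeholder path.
accelerate launch --num_processes 4 --mixed_precision bf16 src/sft.py configs/sft.yaml
```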

## Results

The current scripts push the post-trained models to the Hugging Face Hub. The provided `.yaml` files use GPT-2 by OpenAI and the post-training datasets from OLMo 2 32B by AI2. The models trained with these config files are published on the Hugging Face Hub.
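
As a sketch of how one of the pushed models could be used after training (the Hub repo id below is a placeholder, not a link to an actual artifact):

```python
from transformers import pipeline

# Placeholder repo id -- substitute the model your config actually pushed to the Hub.
generator = pipeline("text-generation", model="your-username/gpt2-sft")
print(generator("Explain what supervised fine-tuning is.", max_new_tokens=64)[0]["generated_text"])
```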

## Further work

Reinforcement learning with verifiable rewards (RLVR) is a recent method for teaching language models to solve verifiable problems, such as math and programming tasks, via chain-of-thought (CoT) reasoning (think DeepSeek-R1). This codebase can be easily extended to RLVR using TRL's `GRPOTrainer` (or `PPOTrainer`). While running RLVR on the GPT-2 suite is of questionable value, more recent models such as Qwen 3 should be suitable.
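
A rough sketch of such an extension is below. The dataset id, column names, and model id are assumptions for illustration, and the reward is a deliberately simple exact-match check rather than a full verifier:

```python
# Sketch: extending the repo to RLVR with TRL's GRPOTrainer.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Assumed dataset with "prompt" and "answer" columns (e.g., a math QA set);
# "your-math-dataset" is a placeholder id.
dataset = load_dataset("your-math-dataset", split="train")

def correctness_reward(completions, answer, **kwargs):
    # Verifiable reward: 1.0 if the reference answer appears in the completion.
    return [1.0 if a in c else 0.0 for c, a in zip(completions, answer)]

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",  # assumed small Qwen 3 checkpoint
    reward_funcs=correctness_reward,
    args=GRPOConfig(output_dir="qwen3-grpo", push_to_hub=True),
    train_dataset=dataset,
)
trainer.train()
```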
