Useful code for training and inference of Language Models. I currently support the following functionality:
Language Models:
- Inference with HuggingFace Transformers and vLLM (no vLLM support for VLMs at the moment)
- Pretraining
- Finetuning (Classification and Supervised Finetuning for Generation)
- Preference Optimization (Direct Preference Optimization, Contrastive Preference Optimization)
- Unlearning (Gradient Ascent, Negative Preference Optimization)
All code is based on HuggingFace Transformers and TRL and supports FSDP with multiple GPUs.
This branch is being actively worked on and may have breaking changes pushed to it at any time. If you want to use a stable version of the code base without running the examples/tests or actively adding features, see the app branch instead.
First, clone the repo, then follow the instructions to set up the environment with the right packages and Python version. Before running anything, you should make sure to populate the essential fields in the config files. After that, run:
```
python configs/create_env_file.py
```
That's all the setup you need for inference, but for training, you will need to set up a couple of additional things.
Log in to WandB with
```
wandb login
```
That's it! You can now run the code. Test that the code base works fine by running:
```
bash tests/test_all.sh
```
If this fails, then isolate the problem by following the test instructions.
If you want to use FSDP or Accelerate distribution, then set up the accelerate config file. In general, I recommend you do not do this unless you know what you're doing / you have a good reason to try it. Accelerate FSDP etc. is a bit buggy and can considerably slow down model training / saving for what seems to be little gain.
```
accelerate config
```
Common Setup:
- This Machine
- multi-GPU
- 1 node
- No checking distributed ops
- No torch Dynamo
- Enter number of available GPUs when asked
- mixed precision bf16
Basic Setup:
- No DeepSpeed, FSDP, Megatron
- yes numa efficiency
FSDP:
- No DeepSpeed
- Yes FSDP
- FSDP version 2
- Choose defaults for enable resharding (yes) and offload (no)
- Transformer Based Wrap => yes, to use the model's _no_split_modules
- SHARDED_STATE_DICT state dict type
- Yes to CPU RAM efficient model loading
- No to activation checkpointing
- No to parallelism config
Answering this way produces the following accelerate config (at <path_to_huggingface>/accelerate/default_config.yaml):
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: false
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_reshard_after_forward: true
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_version: 2
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
If you want to use accelerate to launch the training script with FSDP etc., then instead of running train.py with `python train.py --args`, you must use `accelerate launch train.py --args`.
I have a set of examples that show how to use the code for different tasks. The examples cover most functionality of the code.
The entry point for inference is the infer.py script. It supports both HuggingFace Transformers and vLLM inference pipelines for both Language Models and Vision Language Models. The call to inference has three components:
- Core arguments: These are found in the click declaration of the `main` function and should be passed in right after the filename, e.g. `python infer.py --modality vlm`.
- Framework selection: This is done with the `hf` or `vllm` command, which selects the HuggingFace Transformers or vLLM inference pipeline respectively, e.g. `python infer.py --model_name <name> hf`.
- Framework-specific arguments: These are passed in after the `hf` or `vllm` command. For example, `python infer.py --model_name <name> hf --batch_size 8` will run inference with the HuggingFace Transformers pipeline. See the huggingface and vllm inference files for the arguments that can be passed in after the `hf` or `vllm` command.
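To make the two-level CLI structure concrete, here is a stand-alone sketch of the same pattern using stdlib argparse subparsers. This is only an illustration: the real script uses click, and apart from the argument names mentioned above, everything here is hypothetical.

```python
import argparse

# Two-level CLI: core arguments first, then an hf/vllm subcommand
# carrying the framework-specific arguments (illustrative stand-in).
parser = argparse.ArgumentParser(prog="infer.py")
parser.add_argument("--model_name")           # core argument
parser.add_argument("--modality", default="lm")
sub = parser.add_subparsers(dest="framework")

hf = sub.add_parser("hf")                     # HuggingFace pipeline
hf.add_argument("--batch_size", type=int, default=1)

vllm = sub.add_parser("vllm")                 # vLLM pipeline

# Core args come before the subcommand, framework args after it.
args = parser.parse_args(["--model_name", "my-model", "hf", "--batch_size", "8"])
print(args.framework, args.batch_size)        # prints: hf 8
```

The subcommand cleanly separates the shared arguments from the framework-specific ones, which is the same split the real `hf`/`vllm` commands give you.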
The scripts expect your input to be a csv file with a column named `input` that contains the text to be processed (and an `image` column with a URL or path to an image for VLMs). They also expect that the input file does not contain the columns `output` or `inference_completed`. The output will be saved in the same directory as the input file, with a suffix `_output` added to the filename, as a JSON Lines file (`.jsonl`). The names of the columns, as well as the output path, can be changed with the appropriate arguments.
This output file also automatically acts as a checkpoint if inference stops halfway, and unless you tell it not to, the code will always try to restart from a checkpoint.
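As a sketch of the expected input format, here is how you might prepare an inference input file with the stdlib csv module. The column name `input` and the absence of `output`/`inference_completed` columns follow the rules above; the filename and prompts are just examples.

```python
import csv

# Minimal inference input file: one row per prompt, a single
# `input` column, and no `output` / `inference_completed` columns.
rows = [
    {"input": "What is the capital of France?"},
    {"input": "Summarize the plot of Hamlet."},
]

with open("prompts.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input"])
    writer.writeheader()
    writer.writerows(rows)
```

Following the naming convention above, inference on this file would write its results (and checkpoint its progress) to prompts_output.jsonl in the same directory.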
The entry point for training is the train.py script. The final model is always saved to `output_dir/final_checkpoint`.
WandB is used to log the metrics, and you can always recover the history of a prior run with:
```python
from utils import get_history

history = get_history("run_name")
```
There are two sets of arguments this script accepts:
- ScriptArguments: Check these out in the `ScriptArguments` class.
- Learning-specific arguments: These depend on the kind of training you are doing, and all of them are taken from HuggingFace or HuggingFace TRL. Classification takes the same arguments as `TrainingArguments`, Supervised Finetuning takes the same arguments as `SFTConfig`, and Direct Preference Optimization takes the same arguments as `DPOConfig`.
To see the parameters that can be used on the command line, see the respective config files. All arguments that are used internally in a `Trainer` class are passed on to that class. So, for example, if you want to set the number of epochs for classification finetuning, you must add `--num_train_epochs <number>` to the set of args passed in. Essentially, go through the config file argument options, knowing that you can ignore any argument that carries the following disclaimer: "This argument is not directly used by Trainer, it's intended to be used by your training/evaluation scripts instead" (e.g. `--do_eval`). The only exception to this rule is `--resume_from_checkpoint`, which takes `<True/False/path-to-checkpoint>` and is used in the script.
By default, we try to resume from a checkpoint. If the output directory is not found, training begins from the first step. If there is an output directory with no valid checkpoint, the code will fail unless `--resume_from_checkpoint` is False. When using LoRA + FSDP, the checkpoint files are not complete models but sharded adapters, and cannot be loaded as normal HuggingFace models. To use a saved checkpoint, relaunch the script with the `--num_train_epochs` or `--max_steps` value set lower than the checkpoint's; the script will then load the model and immediately save it. There's an example of this being done here.
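The resume rules above can be sketched as follows. This is a simplified illustration, not the repository's actual code; only the `checkpoint-<step>` directory naming (the HuggingFace Trainer convention) and the behavior described above are assumed.

```python
import glob
import os

def resolve_resume(output_dir, resume_from_checkpoint=True):
    """Illustrative mirror of the resume rules described above."""
    if not os.path.isdir(output_dir):
        return None  # no output dir: start training from the first step
    ckpts = glob.glob(os.path.join(output_dir, "checkpoint-*"))
    if ckpts:
        # resume from the latest checkpoint by step number
        return max(ckpts, key=lambda p: int(p.rsplit("-", 1)[-1]))
    if resume_from_checkpoint is False:
        return None  # explicitly start fresh despite an existing dir
    # output dir exists but holds no valid checkpoint: fail loudly
    raise RuntimeError("output_dir exists but holds no valid checkpoint")
```

The key design point is the last branch: an existing output directory with no checkpoint is treated as an error rather than silently restarting, so you never accidentally overwrite a run.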
The essential format to follow for each training paradigm is given below:
- Classification: input files must be `.csv` with `input` and `output` columns.
- Pretraining: input files can be either `.txt` or `.csv`; a csv must have a column `input` with the text to learn. You can also include an `output` column, in which case we will concatenate the two columns and pretrain on the whole thing. If you instead want to pretrain only on the `input` column, pass in `--pretrain_with_output False`.
- Supervised Finetuning: input files must be a csv with `input` and `output` columns. Loss is only computed on completions/output; if you want loss to be computed on the input prompt as well, that is handled by the pretraining paradigm.
- Direct Preference Optimization: input files must be a csv with `input`, `chosen` and `rejected` columns.
- Gradient Ascent: input files must be a csv with `input`, `output` and `forget` columns, where `forget` is a binary indicator of whether that particular example should be forgotten. Setting `forget=0` for all rows is equivalent to running SFT.
- Negative Preference Optimization: input files must be a csv with `input` and `output` columns.
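For instance, a Direct Preference Optimization input file could be assembled like this. The toy data and filename are hypothetical; only the column names `input`, `chosen`, and `rejected` are mandated above.

```python
import csv

# Toy DPO dataset: each row pairs a preferred (`chosen`) response
# with a dispreferred (`rejected`) one for the same `input` prompt.
rows = [
    {"input": "Explain gravity briefly.",
     "chosen": "Gravity is the attraction between objects with mass.",
     "rejected": "Gravity is a kind of magnetism."},
    {"input": "What does a compiler do?",
     "chosen": "It translates source code into machine code.",
     "rejected": "It runs programs line by line."},
]

with open("dpo_train.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "chosen", "rejected"])
    writer.writeheader()
    writer.writerows(rows)
```

The other paradigms follow the same pattern with their respective column sets (e.g. `input`/`output` for SFT, plus `forget` for Gradient Ascent).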