itu-rad/data-selection-framework

Data Selection Framework

This is a suite based on torchtune that aims to fairly compare a wide range of data selection methods, providing an overview of the field and introducing resource metrics via radT. We evaluate data selection methods on a range of fine-tuning tasks.

 

Getting Started

# Create the 'selection' conda environment
conda env create -f conda.yaml
conda activate selection



# To install extra dependencies, add them to conda.yaml and run:
conda env update --file conda.yaml



# Now set the HF_TOKEN environment variable in your conda environment
conda env config vars set HF_TOKEN=<enter token here>
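Since the download step below fails without credentials, it can help to check the token up front. A minimal sketch (`check_hf_token` is a hypothetical helper, not part of torchtune):

```shell
# Fail fast if HF_TOKEN is missing before attempting any gated download.
# (check_hf_token is a hypothetical helper name, not part of torchtune.)
check_hf_token() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set; run 'conda env config vars set HF_TOKEN=<token>' first" >&2
    return 1
  fi
  echo "HF_TOKEN is set"
}
```

Call `check_hf_token` before `tune download`; note that `conda env config vars set` only takes effect after reactivating the environment.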

Follow the instructions on the official meta-llama repository to ensure you have access to the official Llama model weights. Once you have confirmed access, you can run the following command to download the weights to your local machine. This will also download the tokenizer model and a responsible use guide.

torchtune supports the following models:

| Model | Sizes |
|-------|-------|
| Llama3.3 | 70B |
| Llama3.2-Vision | 11B, 90B |
| Llama3.2 | 1B, 3B |
| Llama3.1 | 8B, 70B, 405B |
| Llama3 | 8B, 70B |
| Llama2 | 7B, 13B, 70B |
| Code-Llama2 | 7B, 13B, 70B |
| Mistral | 7B |
| Gemma | 2B, 7B |
| Gemma2 | 2B, 9B, 27B |
| Microsoft Phi3 | Mini |
| Qwen2 | 0.5B, 1.5B, 7B |
| Qwen2.5 | 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B |

We recommend getting started with the small Llama3.2 models.

 

Downloading the model

# Set the Hugging Face organization and model name as shown on the model page.
model_company="meta-llama"
model_name="Llama-3.2-1B-Instruct"

tune download $model_company/$model_name --ignore-patterns "original/consolidated.00.pth" --output-dir ./model_cache/downloaded_models/$model_name
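A quick sanity check of the output directory can catch interrupted downloads before training starts. The expected files (`config.json` plus a tokenizer file) are an assumption based on the usual Hugging Face repo layout, and `verify_download` is a hypothetical helper, not part of torchtune:

```shell
# Check that a downloaded model directory looks complete.
# (verify_download is a hypothetical helper, not part of torchtune;
# the expected file names are an assumption about the HF repo layout.)
verify_download() {
  dir="$1"
  [ -f "$dir/config.json" ] || { echo "missing config.json in $dir" >&2; return 1; }
  set -- "$dir"/*tokenizer*
  [ -e "$1" ] || { echo "no tokenizer file in $dir" >&2; return 1; }
  echo "download looks complete"
}
```

Usage: `verify_download ./model_cache/downloaded_models/$model_name`.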

 

Creating recipes and configs

To list all available torchtune recipes & configs:

tune ls

Creating a recipe at the path

recipe="full_finetune_single_device"
recipe_path="./recipe/full_finetune"
tune cp $recipe $recipe_path --make-parents

Creating a config at the path.

By default, configs use the Linux /tmp folder, so downloaded and finetuned models are deleted after each session.

# TODO: integrate the local model_cache path scheme into the config download pipeline.
model_config="llama3_2/1B_full_single_device"
config_path="./config/llama3_2/1b_full/train.yaml"
tune cp $model_config $config_path --make-parents
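Until that integration lands, one workaround for the /tmp default noted above is to rewrite the copied config's paths to point at the persistent model_cache directory. A sketch under the assumption that the config stores /tmp paths as plain strings (`redirect_tmp_paths` is a hypothetical helper, not part of torchtune; review the edited config before training):

```shell
# Rewrite /tmp paths in a copied config to a persistent cache directory
# so checkpoints survive the session. Assumes plain-string /tmp paths.
# (redirect_tmp_paths is a hypothetical helper, not part of torchtune.)
redirect_tmp_paths() {
  config="$1"
  cache_dir="${2:-./model_cache}"
  # GNU sed in-place edit; on macOS use: sed -i '' ...
  sed -i "s|/tmp|${cache_dir}|g" "$config"
}
```

Usage: `redirect_tmp_paths $config_path ./model_cache`.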

 

Running finetuning recipes

You can finetune Llama3.2 1B on a single GPU using the following command (with/without radT):

python tune.py run recipe/full_finetune.py --config config/llama3_2/1b_full/train.yaml
python -m radt --local --manual tune.py run recipe/full_finetune.py --config config/llama3_2/1b_full/train.yaml

Or with LoRA:

python tune.py run recipe/lora_finetune.py --config config/llama3_2/1b_lora/train.yaml
python -m radt --local --manual tune.py run recipe/lora_finetune.py --config config/llama3_2/1b_lora/train.yaml

Logging with radT to a specific MLflow experiment ID

# Set experiment_id to the MLflow experiment ID
experiment_id="<enter experiment ID here>"
# Full finetune
python -m radt -e $experiment_id --local --manual tune.py run recipe/full_finetune.py --config config/llama3_2/1b_full/train.yaml
# LoRA finetune
python -m radt -e $experiment_id --local --manual tune.py run recipe/lora_finetune.py --config config/llama3_2/1b_lora/train.yaml

 

Evaluating Models

Each training config should be accompanied by evaluation configs. To evaluate the models trained above:

tune run recipe/eval.py --config config/llama3_2/1b_full/eval_base.yaml
tune run recipe/eval.py --config config/llama3_2/1b_full/eval_finetuned.yaml

Or with LoRA:

tune run recipe/eval.py --config config/llama3_2/1b_lora/eval_base.yaml
tune run recipe/eval.py --config config/llama3_2/1b_lora/eval_finetuned.yaml

 

Evaluation tasks

A full list of evaluation tasks can be found here: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/README.md

Additionally, a full list of datasets to train on can be found here: https://pytorch.org/torchtune/0.2/api_ref_datasets.html#datasets

Further torchtune examples: https://github.com/pytorch/torchtune/blob/main/docs/source/tutorials/llama3.rst

 

Running Inference

Begin by creating a custom generation config, either by running the following command or writing your own:

tune cp generation ./custom_generation_config.yaml 

To run inference, change the "user" field value in the config, then run:

tune run generate --config ./custom_generation_config.yaml

Alternatively, pass the prompt directly via the torchtune CLI:

tune run generate --config ./custom_generation_config.yaml prompt.user="<Your Prompt Here>"

About

Compare data selection methods on LLMs.
