itu-rad/data-selection-framework

Data Selection Framework

This is a suite based on torchtune that aims to fairly compare a wide range of data selection methods, providing an overview of the field and introducing resource metrics via radT. We evaluate data selection methods on a range of fine-tuning tasks.

 

Getting Started

# Create the 'selection' conda environment
conda env create -f conda.yaml
conda activate selection



# To install extra dependencies, add them to conda.yaml and run:
conda env update --file conda.yaml



# Now set the HF_TOKEN environment variable in your conda environment
conda env config vars set HF_TOKEN=<enter token here>
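Since the download step below fails without credentials, it can help to check the token up front. A minimal sketch (`check_hf_token` is a hypothetical helper, not part of torchtune):

```shell
# Fail fast if HF_TOKEN is missing before attempting any gated download.
# (check_hf_token is a hypothetical helper name, not part of torchtune.)
check_hf_token() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set; run 'conda env config vars set HF_TOKEN=<token>' first" >&2
    return 1
  fi
  echo "HF_TOKEN is set"
}
```

Call `check_hf_token` before `tune download`; note that `conda env config vars set` only takes effect after reactivating the environment.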

Follow the instructions on the official meta-llama repository to ensure you have access to the official Llama model weights. Once you have confirmed access, you can run the following command to download the weights to your local machine. This will also download the tokenizer model and a responsible use guide.

torchtune supports the following models:

| Model | Sizes |
|-------|-------|
| Llama3.3 | 70B |
| Llama3.2-Vision | 11B, 90B |
| Llama3.2 | 1B, 3B |
| Llama3.1 | 8B, 70B, 405B |
| Llama3 | 8B, 70B |
| Llama2 | 7B, 13B, 70B |
| Code-Llama2 | 7B, 13B, 70B |
| Mistral | 7B |
| Gemma | 2B, 7B |
| Gemma2 | 2B, 9B, 27B |
| Microsoft Phi3 | Mini |
| Qwen2 | 0.5B, 1.5B, 7B |
| Qwen2.5 | 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B |

We recommend getting started with the small Llama3.2 models.

 

Downloading the model

# Set the Hugging Face organization and model name as shown on the model page.
model_company="meta-llama"
model_name="Llama-3.2-1B-Instruct"

tune download $model_company/$model_name --ignore-patterns "original/consolidated.00.pth" --output-dir ./model_cache/downloaded_models/$model_name
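A quick sanity check of the output directory can catch interrupted downloads before training starts. The expected files (`config.json` plus a tokenizer file) are an assumption based on the usual Hugging Face repo layout, and `verify_download` is a hypothetical helper, not part of torchtune:

```shell
# Check that a downloaded model directory looks complete.
# (verify_download is a hypothetical helper, not part of torchtune;
# the expected file names are an assumption about the HF repo layout.)
verify_download() {
  dir="$1"
  [ -f "$dir/config.json" ] || { echo "missing config.json in $dir" >&2; return 1; }
  set -- "$dir"/*tokenizer*
  [ -e "$1" ] || { echo "no tokenizer file in $dir" >&2; return 1; }
  echo "download looks complete"
}
```

Usage: `verify_download ./model_cache/downloaded_models/$model_name`.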

 

Creating recipes and configs

To list all available torchtune recipes & configs:

tune ls

Creating a recipe at the path

recipe="full_finetune_single_device"
recipe_path="./recipe/full_finetune"
tune cp $recipe $recipe_path --make-parents

Creating a config at the path.

By default, configs use the Linux /tmp folder, so downloaded and finetuned models are deleted after each session.

# TODO: integrate the local model_cache path scheme into the config download pipeline.
model_config="llama3_2/1B_full_single_device"
config_path="./config/llama3_2/1b_full/train.yaml"
tune cp $model_config $config_path --make-parents
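Until that integration lands, one workaround for the /tmp default noted above is to rewrite the copied config's paths to point at the persistent model_cache directory. A sketch under the assumption that the config stores /tmp paths as plain strings (`redirect_tmp_paths` is a hypothetical helper, not part of torchtune; review the edited config before training):

```shell
# Rewrite /tmp paths in a copied config to a persistent cache directory
# so checkpoints survive the session. Assumes plain-string /tmp paths.
# (redirect_tmp_paths is a hypothetical helper, not part of torchtune.)
redirect_tmp_paths() {
  config="$1"
  cache_dir="${2:-./model_cache}"
  # GNU sed in-place edit; on macOS use: sed -i '' ...
  sed -i "s|/tmp|${cache_dir}|g" "$config"
}
```

Usage: `redirect_tmp_paths $config_path ./model_cache`.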

 

Running finetuning recipes

You can finetune Llama3.2 1B on a single GPU using the following command (with/without radT):

python tune.py run recipe/full_finetune.py --config config/llama3_2/1b_full/train.yaml
python -m radt --local --manual tune.py run recipe/full_finetune.py --config config/llama3_2/1b_full/train.yaml

Or with LoRA:

python tune.py run recipe/lora_finetune.py --config config/llama3_2/1b_lora/train.yaml
python -m radt --local --manual tune.py run recipe/lora_finetune.py --config config/llama3_2/1b_lora/train.yaml

Logging with radT to a specific MLflow experiment ID

# Set experiment_id to the MLflow experiment ID
experiment_id="<enter experiment ID here>"
# Full finetune
python -m radt -e $experiment_id --local --manual tune.py run recipe/full_finetune.py --config config/llama3_2/1b_full/train.yaml
# LoRA finetune
python -m radt -e $experiment_id --local --manual tune.py run recipe/lora_finetune.py --config config/llama3_2/1b_lora/train.yaml

 

Evaluating Models

Each training config should be accompanied by evaluation configs. To evaluate the models trained above:

tune run recipe/eval.py --config config/llama3_2/1b_full/eval_base.yaml
tune run recipe/eval.py --config config/llama3_2/1b_full/eval_finetuned.yaml

Or with LoRA:

tune run recipe/eval.py --config config/llama3_2/1b_lora/eval_base.yaml
tune run recipe/eval.py --config config/llama3_2/1b_lora/eval_finetuned.yaml

 

Evaluation tasks

A full list of evaluation tasks can be found here: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/README.md

Additionally, a full list of datasets to train on can be found here: https://pytorch.org/torchtune/0.2/api_ref_datasets.html#datasets

Further torchtune examples: https://github.com/pytorch/torchtune/blob/main/docs/source/tutorials/llama3.rst

 

Running Inference

Begin by creating a custom generation config, either by running the following command or writing your own:

tune cp generation ./custom_generation_config.yaml 

To run inference, change the "user" field value in the config, then run:

tune run generate --config ./custom_generation_config.yaml

Alternatively, pass the prompt directly via the torchtune CLI:

tune run generate --config ./custom_generation_config.yaml prompt.user="<Your Prompt Here>"

About

Compare data selection methods on LLMs.
