LLM-reconstruction-free

Installation

Install the usual Hugging Face and PyTorch libraries. For convenience, we provide the conda environment that was used for development in the file environment.yml (generated by running conda env export --no-builds | grep -v "prefix" > environment.yml), which you can use to create a new environment with

conda env create -n ENVNAME --file environment.yml

Before running any script, make sure to set the following environment variables, e.g., by adding them to your .bashrc:

export OMP_NUM_THREADS=10
export TOKENIZERS_PARALLELISM=false

The first is system dependent and is used by torchrun when running the script; adjust it based on your needs. The second is needed because the tokenizers library does not play well with the other multiprocessing going on during training.

Run the scripts

We leverage torchrun as in the following example:

torchrun --nproc-per-node 8 supervised_finetuning.py --dataset rotten_tomatoes --lora-rank 8 --training-steps 500 --per-device-batch-size 4
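For orientation, a script launched this way can rely on the environment variables torchrun exports to each worker. The following is a generic sketch of that pattern, not code taken from supervised_finetuning.py:

import os

import torch
import torch.distributed as dist

def setup_distributed() -> int:
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every worker process;
    # init_process_group reads RANK/WORLD_SIZE from the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)  # bind this worker to its GPU
    return local_rank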

Furthermore, we leverage Hydra (https://hydra.cc/) to schedule runs through the submitit_slurm launcher. The following is an example of what a call could look like, taking advantage of the ./run_model.sh script:

./run_model.sh ++params.training_steps=500 ++params.per_device_batch_size=4 ++params.use_spurious=True,False ++params.backbone=Snowflake/snowflake-arctic-embed-xs ++params.freeze=0 ++params.pretrained=0  ++params.dataset=common_sense ++params.spurious_proportion=0.75 ++params.spurious_token_proportion=0.1 ++params.spurious_location=random ++params.spurious_label=1 ++params.spurious_test_label=1 ++params.spurious_test_proportion=0.75 ++params.spurious_test_token_proportion=0.1 ++params.spurious_test_location=random ++hydra.launcher.partition=cs-all-gcondo
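For reference, a Hydra entry point compatible with ++params.* overrides like the ones above could look like the minimal sketch below; the config_path and config_name here are illustrative assumptions, not the repository's actual layout. SLURM scheduling then goes through Hydra's submitit launcher plugin, configured by overrides such as ++hydra.launcher.partition.

# Minimal Hydra app sketch; "params" mirrors the ++params.* overrides above.
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="conf", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    print(cfg.params.dataset, cfg.params.training_steps)

if __name__ == "__main__":
    main()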

spurious_corr

This module provides a framework for injecting controlled spurious correlations into text data, for evaluating the robustness of language models under distribution shifts. A self-contained sketch of the core injection idea follows the component list.

Components

  • generators.py: Utilities to generate synthetic samples with or without spurious features.
  • modifiers.py: Injection logic for spurious tokens (e.g., HTML tags, keywords) at different locations or proportions.
  • transform.py: Pipeline to apply transformations and manage injection configurations.
  • utils.py: Helper functions for seed control, sampling, and tokenization.
  • sample_execution.py: Example script for running the injection pipeline on a dataset.
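
As promised above, here is a self-contained illustration of the core injection idea: inserting a spurious token into a chosen proportion of samples at a chosen location. The function name and signature are illustrative only and do not reproduce the actual modifiers.py API.

import random

def inject_spurious(texts, token="<div>", proportion=0.5, location="random", seed=0):
    # Return copies of `texts` where roughly `proportion` of them contain `token`.
    rng = random.Random(seed)  # seeded for reproducibility
    out = []
    for text in texts:
        if rng.random() < proportion:
            words = text.split()
            if location == "start":
                pos = 0
            elif location == "end":
                pos = len(words)
            else:  # "random"
                pos = rng.randrange(len(words) + 1)
            words.insert(pos, token)
            text = " ".join(words)
        out.append(text)
    return out

print(inject_spurious(["a fine movie", "dull and slow"], proportion=1.0))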

Usage

Adjust injection types, locations, and proportions via command-line or script-level configuration.

python sample_execution.py

See supervised_finetuning.py for an example of how to use this in combination with LoRA finetuning.
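If you only want the gist of that script, the LoRA wrapping step with peft generally looks like the sketch below; the backbone, task type, and hyperparameters here are illustrative (r=8 simply mirrors --lora-rank 8 from the torchrun example), not the script's actual settings.

from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

# Load a classification backbone (an illustrative choice from the supported list).
model = AutoModelForSequenceClassification.from_pretrained(
    "Snowflake/snowflake-arctic-embed-xs", num_labels=2
)
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # sequence classification
    r=8,                         # LoRA rank
    lora_alpha=16,
    lora_dropout=0.1,
)
model = get_peft_model(model, lora_config)  # wrap the model with LoRA adapters
model.print_trainable_parameters()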

Datasets

The currently supported datasets (e.g., rotten_tomatoes and common_sense, used in the examples above) are ready to be used out of the box. To learn more, see data.py.
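
All of them are standard Hugging Face datasets, so each loads in one line; data.py presumably wraps calls like the following, shown here for rotten_tomatoes from the torchrun example:

from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes")  # downloads and caches the dataset
print(dataset["train"][0])  # e.g., {'text': '...', 'label': 1}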

Models

We currently support many different models; the list below covers those available out of the box. To learn more, see the __init__.py file in the llm_research directory. A generic loading sketch follows the list.

  • apple/OpenELM-270M
  • apple/OpenELM-450M
  • apple/OpenELM-1_1B
  • apple/OpenELM-3B
  • meta-llama/Meta-Llama-3-8B
  • microsoft/phi-2
  • Snowflake/snowflake-arctic-embed-xs
  • Snowflake/snowflake-arctic-embed-s
  • Snowflake/snowflake-arctic-embed-m
  • Snowflake/snowflake-arctic-embed-l
  • Qwen/Qwen2-0.5B
  • Qwen/Qwen2-1.5B
  • Qwen/Qwen2-7B
  • mistralai/Mistral-7B-v0.1
  • mistralai/Mistral-7B-v0.3
  • google/gemma-2b
  • google/gemma-7b
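
All of these are standard Hugging Face checkpoints. As a generic sketch (not necessarily how __init__.py wires them up), a backbone can be loaded as follows:

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2-0.5B"  # any entry from the list above
tokenizer = AutoTokenizer.from_pretrained(name)
# trust_remote_code is needed for models that ship custom code, e.g. OpenELM;
# note the Snowflake arctic-embed models are encoders and load via AutoModel.
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)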

For OpenELM we had to run the following to add the submodules (there is no need for users to do this again; just clone the repo with git clone --recurse-submodules):

export GIT_LFS_SKIP_SMUDGE=1 

git submodule add https://huggingface.co/apple/OpenELM-270M
git mv OpenELM-270M OpenELM_270M

git submodule add https://huggingface.co/apple/OpenELM-450M
git mv OpenELM-450M OpenELM_450M

git submodule add https://huggingface.co/apple/OpenELM-1_1B
git mv OpenELM-1_1B OpenELM_1_1B

git submodule add https://huggingface.co/apple/OpenELM-3B
git mv OpenELM-3B OpenELM_3B




Config attribute mapping

Different model families name their configuration attributes differently; the mapping from the default (Hugging Face-style) names to each family's names is given below, followed by a code sketch of the normalization. The default attribute names are:

  • attention_dropout
  • num_hidden_layers
  • num_attention_heads
  • vocab_size
  • hidden_size
  • max_position_embeddings

Qwen: uses the defaults.

OpenELM:

  • max_position_embeddings -> max_context_length
  • num_hidden_layers -> num_transformer_layers (also update num_kv_heads, num_query_heads, and ffn_multipliers)
  • hidden_size -> model_dim
  • attention_dropout -> None
  • num_attention_heads -> num_kv_heads and num_query_heads (lists)
  • vocab_size -> vocab_size

Arctic:

  • attention_dropout -> attention_probs_dropout_prob + hidden_dropout_prob
  • hidden_size -> hidden_size + intermediate_size
  • max_position_embeddings -> max_position_embeddings
  • num_hidden_layers -> num_hidden_layers
  • num_attention_heads -> num_attention_heads
  • vocab_size -> vocab_size

Llama 3:

  • attention_dropout -> attention_dropout
  • hidden_size -> hidden_size
  • intermediate_size -> intermediate_size
  • max_position_embeddings -> max_position_embeddings
  • num_attention_heads -> num_attention_heads
  • num_hidden_layers -> num_hidden_layers
  • num_key_value_heads -> num_key_value_heads
  • vocab_size -> vocab_size
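
To make the mapping concrete, a dictionary-based normalization could look like the following sketch; this is illustrative and may differ from how the repository actually implements it.

# Map default (Hugging Face-style) attribute names to family-specific names.
ATTR_MAP = {
    "openelm": {
        "max_position_embeddings": "max_context_length",
        "num_hidden_layers": "num_transformer_layers",
        "hidden_size": "model_dim",
        "attention_dropout": None,  # no OpenELM equivalent
        "num_attention_heads": ("num_kv_heads", "num_query_heads"),  # lists
    },
    "arctic": {
        "attention_dropout": ("attention_probs_dropout_prob", "hidden_dropout_prob"),
        "hidden_size": ("hidden_size", "intermediate_size"),
    },
    # Qwen and Llama 3 use the default names, so they need no entries.
}

def resolve(attr: str, family: str):
    # Fall back to the default name when a family defines no override.
    return ATTR_MAP.get(family, {}).get(attr, attr)

print(resolve("hidden_size", "openelm"))  # -> model_dim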
