We use CSR_training.py for training CSR and its variants.
The required parameters are:
- `--pretrained_emb`: Path to the embedding file for CSR training. (Note that the path must point to an `npz` file.)
- `--embed_dim`: Dimension of the embedding for training.
- `--hidden_size`: Dimension of the hidden latent in CSR.
- `--gpu`: The GPU used for training.
- `--base_model`: Name of the backbone for generating embeddings.
- `--model_suffix`: Suffix added to the end of checkpoints for easier search.
- `--epochs`: Number of epochs for training.
- `--lr`: Learning rate for training.
- `--topk`: TopK selection in CSR training.
- `--use_CL`: Add contrastive loss to training (i.e. train CSR instead of SAE).
- `--cl_coef`: Coefficient of the contrastive loss.
- `--auxk`: Number of auxk in the AuxK loss.
- `--auxk_coef`: Coefficient of the AuxK loss.
To add annealing to CSR, we need to define two parameters:
- `--initial_topk`: TopK at the start of training.
- `--k_decay_ratio`: A float between 0 and 1 that defines when K converges to `--topk`.
To add supervised contrastive loss to training, we need to add the parameter:
--use_label_CL
Note that this file is executed as a subprocess of all_step_pipeline_mteb_evaluation.py and all_step_pipeline_personalized_evaluation.py. Names of the parameters are the same.
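For reference, a minimal single-GPU invocation might look like the sketch below. This is not a command taken from the repository: the embedding path and all hyperparameter values are placeholders, and `--use_CL`, `--initial_topk`, `--k_decay_ratio`, and `--use_label_CL` are included only to show how the optional contrastive-loss, annealing, and supervised contrastive-loss settings described above could be added (here `--use_CL` and `--use_label_CL` are assumed to be boolean switches).

```bash
# Illustrative values only; substitute your own embedding file and hyperparameters.
python CSR_training.py \
--pretrained_emb ./precomputed_embeds/train_embeds.npz \
--embed_dim 4096 \
--hidden_size 16384 \
--gpu 0 \
--base_model intfloat/e5-mistral-7b-instruct \
--model_suffix demo \
--epochs 10 \
--lr 1e-4 \
--topk 32 \
--use_CL \
--cl_coef 0.1 \
--auxk 512 \
--auxk_coef 0.03125 \
--initial_topk 64 \
--k_decay_ratio 0.5 \
--use_label_CL
```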
We use train_mrl_with_sentence_transformers.py to train MRL on embedding models that do not support MRL.
The required parameters are:

- `--model_name`: Name of the embedding model to finetune.
- `--dataset`: Path to the dataset for training.
- `--lora_alpha`: $\alpha$ in LoRA.
- `--lora_r`: $r$ in LoRA.
- `--lr`: Learning rate in training.
- `--batch_size`: Batch size for finetuning.
- `--loss`: Name of the loss for finetuning.
- `--gradient_accumulation_steps`: Number of gradient accumulation steps, whose goal is to achieve a larger effective batch size with limited GPU memory (see the note after this list).
- `--dataset_suffix`: Suffix added to the saving path.
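As a rough illustration (assuming `--batch_size` is the per-device batch size), the example below reaches an effective batch size of 32 × 4 × 7 = 896, i.e. batch size × gradient accumulation steps × number of GPUs.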
One example is:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 torchrun --nproc_per_node=7 --master_port=31233 train_mrl_with_sentence_transformers.py \
--model_name intfloat/e5-mistral-7b-instruct \
--dataset_suffix mrl-0815 \
--lora_alpha 16 \
--batch_size 32 \
--lr 2e-5 \
--gpu 0,1,2,3,4,5,6 \
--load_from_disk \
--loss multiple_negatives_ranking_loss \
--gradient_accumulation_steps 4 \
--dataset ./mrl_dataset
```

We use topk_lora_finetuning.py to finetune the backbone for further performance improvement.
The required parameters are:
- `--dataset`: Path to the dataset the backbone is finetuned on.
- `--model_name`: Model to finetune. This model must be loadable with Sentence Transformers.
- `--loss`: Loss function used in training. `multiple_negatives_ranking_loss` is used for finetuning on all task types except STS, which uses `cosent`.
- `--gpu`: GPU(s) to use.
- `--batch_size`: Batch size.
- `--gradient_accumulation_steps`: Number of gradient accumulation steps, whose goal is to achieve a larger effective batch size with limited GPU memory.
- `--topk_k_list`: List of TopK values.
- `--topk_weights`: Weight of each TopK in `--topk_k_list`.
- `--max_seq_length`: Maximum input sequence length.
- `--lora_r` and `--lora_alpha`: LoRA settings.
- `--lr`: Learning rate for training.
- `--topk_mode`: How "topk" is defined, i.e. whether to take the absolute value of features before TopK masking.
- `--save_steps`: Number of steps between each save of the model weights.
One example is:
```bash
export CUDA_VISIBLE_DEVICES=0,1
for TASK in classification; do
torchrun --nproc_per_node=2 --master_port=31233 topk_lora_finetuning.py \
--dataset ./MTEB_single_task_type_finetuning_dataset/${TASK} \
--model_name intfloat/e5-mistral-7b-instruct \
--loss "multiple_negatives_ranking_loss" \
--dataset_suffix "${TASK}_intfloat" \
--gpu 0,1 \
--batch_size 4 \
--gradient_accumulation_steps 32 \
--topk_k_list "16,32,64,128,256,512,1024,2048,2560" \
--topk_weights "1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0" \
--max_seq_length 512 \
--lora_r 8 \
--lora_alpha 16 \
--lr 1e-5 \
--topk_mode "magnitude" \
--apply_topk_to_backbone \
--load_from_disk \
--save_steps 100
done
```

We use main_visual.py for CSRv2/CSR training.
The required parameters are:
- `--pretrained_emb`: Path to the embedding file for CSR training. (Note that the path must point to an `npz` file.)
- `--embed_dim`: Dimension of the embedding for training.
- `--hidden_size`: Dimension of the hidden latent in CSR.
- `--gpu`: The GPU used for training.
- `--base_model`: Name of the backbone for generating embeddings.
- `--model_suffix`: Suffix added to the end of checkpoints for easier search.
- `--epochs`: Number of epochs for training.
- `--lr`: Learning rate for training.
- `--topk`: TopK selection in CSR training.
- `--use_CL`: Add contrastive loss to training (i.e. train CSR instead of SAE).
- `--cl_coef`: Coefficient of the contrastive loss.
- `--auxk`: Number of auxk in the AuxK loss.
- `--auxk_coef`: Coefficient of the AuxK loss.
To add annealing to CSR, we need to define two parameters:
- `--initial_topk`: TopK at the start of training.
- `--k_decay_ratio`: A float between 0 and 1 that defines when K converges to `--topk`.
To add supervised contrastive loss to training, we need to add the parameter:
--use_label_CL
For backbone finetuning, we need to add the parameter:
--all_param_finetuning
One training example is as follows:
```bash
python main_visual.py \
--pretrained_emb ./CSR-precompute-embeds/FF2048_RN50_Embeds/1K_train_ff2048.npz \
--model_name resnet50d.ra4_e3600_r224_in1k \
--epochs 10 \
--initial_topk 64 \
--topk 32 \
--auxk 1024 \
--auxk_coef 0.03125 \
--cl_coef 0.1 \
--gpu 0 \
--model_suffix demo \
--use_label_CL
```