
Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices

This repository contains the official implementation of the CVPR 2025 paper "Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices". The code is adapted from the official LLaVA repository. Follow the instructions below to set up the environment, prepare the datasets, and reproduce the results.

[Figure: fusion paradigms]


Installation

To set up the environment, run the following commands:

conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
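
As a quick sanity check, you can confirm the editable install is importable; the llava package name is assumed here from the upstream LLaVA repository this code is adapted from:

# Verify the editable install resolves to this repository
python -c "import llava; print(llava.__file__)"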

Preparation

Training Data

For our experiments, we primarily use the LLaVA-1.5 training dataset, which can be prepared following the official guidelines.
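
If you have not downloaded the data yet, one way to fetch it is with huggingface-cli; the liuhaotian/LLaVA-Pretrain dataset id and the target path below are assumptions based on the official guidelines and the example paths later in this README:

# Fetch the LLaVA-1.5 pretraining annotations and image archive (paths assumed)
huggingface-cli download liuhaotian/LLaVA-Pretrain --repo-type dataset \
    --local-dir ./playground/data/LLaVA-Pretrain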

Additionally, for the fine-tuning stage, we conducted experiments with the 737k subset from Cambrian-1, part of the Cambrian-10M dataset. This dataset is available on Hugging Face.

Models Used

The following models, all available on Hugging Face, are used in our experiments:

  • CLIP: OpenAI's CLIP models, such as openai/clip-vit-large-patch14-336.
  • SigLIP: Enhanced visual encoders, such as google/siglip-so400m-patch14-384.
  • MobileLLaMA: Lightweight LLMs, including:
    • mtgv/MobileLLaMA-1.4B-Base
    • mtgv/MobileLLaMA-2.7B-Base

Ensure the datasets and model checkpoints are prepared according to their respective guidelines.
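
If you prefer to pre-fetch the checkpoints rather than letting transformers download them on first use, a minimal sketch with huggingface-cli:

# Pre-download the vision tower and LLM used in the training example below
huggingface-cli download google/siglip-so400m-patch14-384
huggingface-cli download mtgv/MobileLLaMA-1.4B-Base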

Training

Our training approach consists of two stages: pretraining and fine-tuning. The training process is configured via the shell script described below.

Script

The training script is located at:

scripts/MLVF/train.sh

Variables

  • FUSING_STRATEGY: Defines the fusion strategy. Options include:
    • E_D: External Direct Fusion
    • E_M: External Modular Fusion
    • I_D: Internal Direct Fusion
    • I_M: Internal Modular Fusion
  • USING_STRATEGY: Specifies the layer selection strategy. Options include:
    • 18: Selects layer 18.
    • 3-18: Selects layers 3 and 18.
    • 3-18-23: Selects layers 3, 18, and 23.
    • former: Uses the first 12 layers.
    • latter: Uses the last 12 layers.
    • all: Uses all layers.
  • MODEL_NAME: A specific model identifier in the format {Visual Encoder}_{LLM size}_{Dataset size}. For example:
    • siglip_14_665k: SigLIP visual encoder, MobileLLaMA 1.4B, fine-tuning dataset size of 665k.

Example

Below is an example for training:

# Define common variables
FUSING_STRATEGY="I_D"
USING_STRATEGY="3-18-23"
BASE_MODEL_NAME="llava"
MODEL_NAME="siglip_14_665k"

# Define paths
PRETRAIN_DATA_PATH="./playground/data/LLaVA-Pretrain/blip_laion_cc_sbu_558k.json"
PRETRAIN_IMAGE_FOLDER="./playground/data/LLaVA-Pretrain/images"
MODEL_PATH="mtgv/MobileLLaMA-1.4B-Base"
VISION_TOWER="google/siglip-so400m-patch14-384"
FINETUNE_DATA_PATH="./playground/data/llava_v1_5_mix665k.json"
FINETUNE_IMAGE_FOLDER="./playground/data"

Run the pretraining and fine-tuning commands included in the train.sh script.
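
How the script consumes these variables is defined in scripts/MLVF/train.sh itself; below is a minimal sketch of a full run, assuming the script reads them from the environment (if it defines them internally, edit them in the script instead):

# Export the configuration and launch both stages
export FUSING_STRATEGY USING_STRATEGY BASE_MODEL_NAME MODEL_NAME
export PRETRAIN_DATA_PATH PRETRAIN_IMAGE_FOLDER MODEL_PATH VISION_TOWER
export FINETUNE_DATA_PATH FINETUNE_IMAGE_FOLDER
bash scripts/MLVF/train.sh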

Evaluation

Models are evaluated with the lmms-eval framework.

Script

The evaluation script is located at:

scripts/MLVF/eval_toolkits/eval.sh

Define your model paths in the models array. Currently, it is set up for a single model:

models=(
    "/path/to/model"  # Replace with your model path
)
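
To run the evaluation, launch the script. For reference, a typical direct lmms-eval invocation on a single checkpoint is sketched below; the llava model type, the pretrained= argument, and the mme task name are illustrative, not prescribed by this repository:

# Run the repository's evaluation script
bash scripts/MLVF/eval_toolkits/eval.sh

# Or call lmms-eval directly on one checkpoint
accelerate launch -m lmms_eval --model llava \
    --model_args pretrained=/path/to/model \
    --tasks mme --batch_size 1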

Contact

If you have any questions about this project or would like to discuss related topics, feel free to reach out to Junyan Lin via email:
