Robust Multimodal Learning via Cross-Modal Proxy Tokens

Introduction

Multimodal models often experience a significant performance drop when one or more modalities are missing during inference. To address this challenge, we propose a simple yet effective approach that enhances robustness to missing modalities while maintaining strong performance when all modalities are available. Our method introduces Cross-Modal Proxy Tokens (CMPTs), which approximate the class token of a missing modality by attending only to the tokens of the available modality without requiring explicit modality generation or auxiliary networks. To efficiently learn these approximations with minimal computational overhead, we employ low-rank adapters in frozen unimodal encoders and jointly optimize an alignment loss with a task-specific loss. Extensive experiments on five multimodal datasets show that our method outperforms state-of-the-art baselines across various missing rates while achieving competitive results in complete-modality settings. Overall, our method offers a flexible and efficient solution for robust multimodal learning.
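As a rough sketch of the core idea (illustrative PyTorch, not the repository's actual implementation; all names here are hypothetical), a CMPT can be viewed as a learnable query that cross-attends over the available modality's tokens to produce a stand-in for the missing modality's class token:

import torch
import torch.nn as nn

class CrossModalProxyToken(nn.Module):
    # Illustrative sketch: a learnable proxy token attends over the tokens
    # of the available modality to approximate the class token of the
    # missing modality (no generation network, no auxiliary encoder).
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.proxy = nn.Parameter(torch.zeros(1, 1, dim))  # learnable query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, available_tokens: torch.Tensor) -> torch.Tensor:
        # available_tokens: (batch, seq_len, dim) from the present modality
        query = self.proxy.expand(available_tokens.size(0), -1, -1)
        proxy_cls, _ = self.attn(query, available_tokens, available_tokens)
        return proxy_cls.squeeze(1)  # (batch, dim): proxy class token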

For more details, please check our arXiv paper.

Updates

Model Architecture

Figure: (a) We introduce Cross-Modal Proxy Tokens (CMPTs), a novel approach to address missing-modality challenges. CMPTs learn to approximate missing-modality class tokens by adapting pretrained encoders through joint optimization of alignment and task-specific objectives. Our approach accommodates both complete and missing modalities during training and inference, enhancing robustness across varying missing-modality scenarios. (b) CMPTs achieve state-of-the-art performance, consistently outperforming recent baselines in both complete- and missing-modality scenarios. The radar plot shows F1-macro scores on the MM-IMDb dataset across varying modality availability.
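The joint optimization mentioned above can be sketched as a task loss plus an alignment term that pulls the proxy token toward the real class token whenever the corresponding modality is available during training. The cosine form below is an assumption for illustration; the paper specifies the exact objective:

import torch.nn.functional as F

def joint_loss(logits, labels, proxy_cls, true_cls, lam=1.0):
    # Task objective on the model's prediction.
    task_loss = F.cross_entropy(logits, labels)
    # Alignment objective: make the proxy token match the true class token.
    align_loss = 1.0 - F.cosine_similarity(proxy_cls, true_cls, dim=-1).mean()
    return task_loss + lam * align_loss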

Environment Setup

For Image-Text Datasets (UPMC Food-101 and MM-IMDb)

Create and activate a conda environment, then install the requirements:

conda create -n cmpt-image-text python=3.8.19
conda activate cmpt-image-text
pip install -r image_text_requirements.txt

For Audio-Video Datasets (Kinetics-Sound, AVE and CREMA-D)

Create and activate a conda environment, then install the requirements:

conda create -n cmpt-audio-video python=3.8.19
conda activate cmpt-audio-video
pip install -r audio_video_requirements.txt

Configuration File

All configurations for all datasets are in the config.py file. Set the corresponding paths and hyperparameters before training or testing on any dataset.
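For orientation, the entries to check are typically along these lines (the key names below are illustrative, not the exact fields in config.py):

# Illustrative only; the actual keys and structure live in config.py.
data_root = "/path/to/datasets"          # where the preprocessed data lives
model_path = "/path/to/checkpoint.ckpt"  # checkpoint to evaluate (testing)
batch_size = 32
learning_rate = 1e-4
max_epochs = 20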

Data Preprocessing

Please follow the steps below to preprocess the datasets.

UPMC Food-101 Dataset

  • Download the UPMC Food-101 dataset.
  • Then run the following command:
python ./utils/data_preprocessing/make_arrow.py --dataset food101 --root [YOUR_DATASET_ROOT]

MM-IMDb

  • Download the MM-IMDb dataset.
  • Then run the following command:
python ./utils/data_preprocessing/make_arrow.py --dataset mmimdb --root [YOUR_DATASET_ROOT]
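As the script name suggests, make_arrow.py serializes each dataset into Apache Arrow files. A quick way to sanity-check the output is to read a file back with pyarrow (a sketch; the file name below is hypothetical and depends on where the script writes):

import pyarrow as pa

# Hypothetical output path; adjust to the root passed to make_arrow.py.
with pa.memory_map("/path/to/root/food101_train.arrow", "r") as source:
    table = pa.ipc.RecordBatchFileReader(source).read_all()

print(table.num_rows, table.column_names)  # row count and available fields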

Kinetics-Sound

  • Download the Kinetics-400 dataset.
  • Set the correct paths in the following files:
./utils/data_preprocessing/kinetics_convert_avi.py

./utils/data_preprocessing/kinetics_arrange_by_class.py

./utils/data_preprocessing/extract_wav_and_frames.py
  • Run the following commands:
python ./utils/data_preprocessing/kinetics_convert_avi.py

python ./utils/data_preprocessing/kinetics_arrange_by_class.py

python ./utils/data_preprocessing/extract_wav_and_frames.py

AVE

  • Download the AVE dataset.
  • Set the paths in ./utils/data_preprocessing/pre_process_ave.py.
  • Run the following command:
python ./utils/data_preprocessing/pre_process_ave.py

CREMA-D

  • Download the CREMA-D dataset.
  • Set the paths in ./utils/data_preprocessing/preprocess_creamad.py.
  • Run the following command:
python ./utils/data_preprocessing/preprocess_creamad.py

Training Models

To train a CMPT model on any dataset, first set all paths and hyperparameters in the config.py file, then run the command for that dataset below.

For UPMC Food-101 dataset

python -m scripts.train_image_text_model with task_finetune_food101

For MM-IMDb dataset

python -m scripts.train_image_text_model with task_finetune_mmimdb

For Kinetics-Sound dataset

python -m scripts.train_audio_video_model with task_finetune_kinetics_sound

For AVE dataset

python -m scripts.train_audio_video_model with task_finetune_ave

For CREMA-D dataset

python -m scripts.train_audio_video_model with task_finetune_cremad
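The with task_... syntax suggests the training scripts are built on the Sacred experiment framework; if so, individual configuration values can usually be overridden directly on the command line without editing config.py (the key name below is hypothetical), for example:

python -m scripts.train_image_text_model with task_finetune_food101 batch_size=32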

Testing Models

To evaluate a pretrained model on any dataset, set model_path in the config.py file to the saved checkpoint, then run the command for that dataset below.

For UPMC Food-101 dataset

python -m scripts.test_image_text_model with task_finetune_food101

For MM-IMDb dataset

python -m scripts.test_image_text_model with task_finetune_mmimdb

For Kinetics-Sound dataset

python -m scripts.test_audio_video_model with task_finetune_kinetics_sound

For AVE dataset

python -m scripts.test_audio_video_model with task_finetune_ave

For CREMA-D dataset

python -m scripts.test_audio_video_model with task_finetune_cremad

Citations

If you find the Cross-Modal Proxy Tokens (CMPTs) approach useful in your research, please consider citing the following work.

  • Cross-Modal Proxy Tokens (CMPTs) [arXiv][TMLR]
@article{reza2025robust,
    title={Robust Multimodal Learning via Cross-Modal Proxy Tokens},
    author={Md Kaykobad Reza and Ameya Patil and Mashhour Solh and Salman Asif},
    journal={Transactions on Machine Learning Research},
    issn={2835-8856},
    year={2025},
    url={https://openreview.net/forum?id=Wtc6wvcYJ0}
}

Acknowledgements

Our codebase is built upon the Missing Aware Prompts repository. We sincerely thank the authors for making their code publicly available.

Note: This is a research-level repository and may contain bugs. Please contact the authors with any queries.
