Multimodal models often experience a significant performance drop when one or more modalities are missing during inference. To address this challenge, we propose a simple yet effective approach that enhances robustness to missing modalities while maintaining strong performance when all modalities are available. Our method introduces Cross-Modal Proxy Tokens (CMPTs), which approximate the class token of a missing modality by attending only to the tokens of the available modality without requiring explicit modality generation or auxiliary networks. To efficiently learn these approximations with minimal computational overhead, we employ low-rank adapters in frozen unimodal encoders and jointly optimize an alignment loss with a task-specific loss. Extensive experiments on five multimodal datasets show that our method outperforms state-of-the-art baselines across various missing rates while achieving competitive results in complete-modality settings. Overall, our method offers a flexible and efficient solution for robust multimodal learning.
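The sketch below illustrates the core idea in PyTorch: a learnable proxy token attends over the tokens of the available modality to approximate the class token of the missing one, and an alignment loss pulls the proxy toward the real class token whenever both modalities are present during training. This is a minimal illustration, not the released implementation; the module and symbol names (CrossModalProxyToken, lambda_align, the cosine-based alignment term) are placeholders, and the paper's exact attention formulation, alignment objective, and fusion head may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalProxyToken(nn.Module):
    """Learnable proxy token that attends over the available modality's tokens
    to approximate the class token of the missing modality (illustrative sketch)."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.proxy = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)  # learnable query
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, available_tokens: torch.Tensor) -> torch.Tensor:
        # available_tokens: (B, N, d_model) from the present modality's encoder
        B = available_tokens.size(0)
        query = self.proxy.expand(B, -1, -1)                 # (B, 1, d_model)
        proxy_cls, _ = self.attn(query, available_tokens, available_tokens)
        return proxy_cls.squeeze(1)                          # (B, d_model)


def alignment_loss(proxy_cls: torch.Tensor, true_cls: torch.Tensor) -> torch.Tensor:
    # Pull the proxy token toward the real class token of the other modality
    # (only computable when that modality is present during training).
    return 1.0 - F.cosine_similarity(proxy_cls, true_cls, dim=-1).mean()


# Joint objective (lambda_align is a hypothetical weighting hyperparameter):
# loss = task_loss + lambda_align * alignment_loss(proxy_cls, true_cls)
```

At inference time, when a modality is missing, its proxy token (computed from the modality that is present) stands in for the absent class token in the downstream task head.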
For more details, please check our arXiv paper.
- 10/2025: Initialized the repository.
- 10/2025: Released the code for CMPTs.
- 10/2025: Paper accepted by Transactions on Machine Learning Research (TMLR).
Figure: (a) We introduce Cross-Modal Proxy Tokens (CMPTs), a novel approach to address missing modality challenges. CMPTs effectively learn to approximate missing modality class tokens by adapting pretrained encoders through a joint optimization of alignment and task-specific objectives. Our approach accommodates both complete and missing modalities during training and inference, thereby enhancing robustness across varying missing modality scenarios. (b) CMPTs achieve state-of-the-art performance, consistently outperforming recent baseline methods in both complete and missing modality scenarios. The radar plot illustrates F1-macro scores on the MM-IMDb dataset across varying modality availability.
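On the encoder side, the pretrained unimodal backbones stay frozen and only lightweight low-rank adapters (together with the proxy tokens and task head) are trained. Below is a rough sketch of a LoRA-style adapter wrapped around a frozen linear layer; the rank, scaling factor, and the choice of which layers to adapt are illustrative assumptions, not the repository's actual settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer with a trainable low-rank update (sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # keep pretrained weights frozen
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # adapted layer starts equal to the frozen one
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```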
Create a conda environment for the image-text datasets (UPMC Food-101 and MM-IMDb) by running the following commands:
conda create -n cmpt-image-text python=3.8.19
conda activate cmpt-image-text
pip install -r image_text_requirements.txt
Create a separate conda environment for the audio-video datasets (Kinetics-Sound, AVE, and CREMA-D) by running the following commands:
conda create -n cmpt-audio-video python=3.8.19
conda activate cmpt-audio-video
pip install -r audio_video_requirements.txt
All configurations for all datasets are in the config.py file. Set the corresponding paths and hyperparameters there before training or testing on any dataset.
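Since the training and test scripts are launched with Sacred-style named configs (for example, with task_finetune_mmimdb), config.py presumably collects these options as named configurations. The sketch below only shows the pattern; the key names used here (data_root, model_path, batch_size, missing_rate, and so on) are illustrative assumptions and may not match the actual fields in config.py.

```python
from sacred import Experiment

ex = Experiment("cmpt")

@ex.config
def default_config():
    data_root = "/path/to/arrow/files"   # output of the preprocessing scripts
    model_path = ""                      # checkpoint path used by the test scripts
    batch_size = 32
    learning_rate = 1e-4
    missing_rate = 0.7                   # fraction of samples with a missing modality

@ex.named_config
def task_finetune_mmimdb():
    exp_name = "finetune_mmimdb"
    data_root = "/path/to/mmimdb_arrow"
```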
Please follow the steps below to pre-process the datasets.
- Download the UPMC Food-101 dataset.
- Then run the following command:
python ./utils/data_preprocessing/make_arrow.py --dataset food101 --root [YOUR_DATASET_ROOT]
- Download the MM-IMDb dataset.
- Then run the following command:
python ./utils/data_preprocessing/make_arrow.py --dataset mmimdb --root [YOUR_DATASET_ROOT]
- Download the Kinetics-400 dataset.
- Set the correct paths in the following files:
./utils/data_preprocessing/kinetics_convert_avi.py
./utils/data_preprocessing/kinetics_arrange_by_class.py
./utils/data_preprocessing/extract_wav_and_frames.py
- Run the following commands:
python ./utils/data_preprocessing/kinetics_convert_avi.py
python ./utils/data_preprocessing/kinetics_arrange_by_class.py
python ./utils/data_preprocessing/extract_wav_and_frames.py
- Download the AVE dataset.
- Set the paths in the ./utils/data_preprocessing/pre_process_ave.py file.
- Run the following command:
python ./utils/data_preprocessing/pre_process_ave.py
- Download the CREMA-D dataset.
- Set the paths in the ./utils/data_preprocessing/preprocess_creamad.py file.
- Run the following command:
python ./utils/data_preprocessing/preprocess_creamad.py
To train a CMPT model on any dataset, first set all the paths and hyperparameters in the config.py file, then run the appropriate command below.
For the UPMC Food-101 dataset
python -m scripts.train_image_text_model with task_finetune_food101
For the MM-IMDb dataset
python -m scripts.train_image_text_model with task_finetune_mmimdb
For the Kinetics-Sound dataset
python -m scripts.train_audio_video_model with task_finetune_kinetics_sound
For the AVE dataset
python -m scripts.train_audio_video_model with task_finetune_ave
For the CREMA-D dataset
python -m scripts.train_audio_video_model with task_finetune_cremad
To evaluate a pretrained model on any dataset, set model_path in the config.py file to the saved checkpoint, then run the appropriate command below.
For the UPMC Food-101 dataset
python -m scripts.test_image_text_model with task_finetune_food101
For the MM-IMDb dataset
python -m scripts.test_image_text_model with task_finetune_mmimdb
For the Kinetics-Sound dataset
python -m scripts.test_audio_video_model with task_finetune_kinetics_sound
For the AVE dataset
python -m scripts.test_audio_video_model with task_finetune_ave
For the CREMA-D dataset
python -m scripts.test_audio_video_model with task_finetune_cremad
If you find the Cross-Modal Proxy Tokens (CMPTs) approach useful in your research, please consider citing the following work.
@article{reza2025robust,
title={Robust Multimodal Learning via Cross-Modal Proxy Tokens},
author={Md Kaykobad Reza and Ameya Patil and Mashhour Solh and Salman Asif},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2025},
url={https://openreview.net/forum?id=Wtc6wvcYJ0},
}
Our codebase is built upon the Missing Aware Prompts repository. We sincerely thank the authors for making their code publicly available.
Note: This is a research-level repository and may contain issues or bugs. Please contact the authors with any queries.