Multimodal models often experience a significant performance drop when one or more modalities are missing during inference. To address this challenge, we propose a simple yet effective approach that enhances robustness to missing modalities while maintaining strong performance when all modalities are available. Our method introduces Cross-Modal Proxy Tokens (CMPTs), which approximate the class token of a missing modality by attending only to the tokens of the available modality without requiring explicit modality generation or auxiliary networks. To efficiently learn these approximations with minimal computational overhead, we employ low-rank adapters in frozen unimodal encoders and jointly optimize an alignment loss with a task-specific loss. Extensive experiments on five multimodal datasets show that our method outperforms state-of-the-art baselines across various missing rates while achieving competitive results in complete-modality settings. Overall, our method offers a flexible and efficient solution for robust multimodal learning.
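The sketch below illustrates the core idea in PyTorch: a learnable proxy token attends over the tokens of the available modality to approximate the class token of the missing one, and an alignment loss pulls the proxy toward the real class token whenever both modalities are present during training. This is a minimal illustration, not the released implementation; the module and symbol names (CrossModalProxyToken, lambda_align, the cosine-based alignment term) are placeholders, and the paper's exact attention formulation, alignment objective, and fusion head may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalProxyToken(nn.Module):
    """Learnable proxy token that attends over the available modality's tokens
    to approximate the class token of the missing modality (illustrative sketch)."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.proxy = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)  # learnable query
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, available_tokens: torch.Tensor) -> torch.Tensor:
        # available_tokens: (B, N, d_model) from the present modality's encoder
        B = available_tokens.size(0)
        query = self.proxy.expand(B, -1, -1)                 # (B, 1, d_model)
        proxy_cls, _ = self.attn(query, available_tokens, available_tokens)
        return proxy_cls.squeeze(1)                          # (B, d_model)


def alignment_loss(proxy_cls: torch.Tensor, true_cls: torch.Tensor) -> torch.Tensor:
    # Pull the proxy token toward the real class token of the other modality
    # (only computable when that modality is present during training).
    return 1.0 - F.cosine_similarity(proxy_cls, true_cls, dim=-1).mean()


# Joint objective (lambda_align is a hypothetical weighting hyperparameter):
# loss = task_loss + lambda_align * alignment_loss(proxy_cls, true_cls)
```

At inference time, when a modality is missing, its proxy token (computed from the modality that is present) stands in for the absent class token in the downstream task head.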
For more details, please check our arXiv paper.
- 10/2025: Initialized the repository.
- 10/2025: Released the code for CMPTs.
- 10/2025: Paper accepted by Transactions on Machine Learning Research (TMLR).
Figure: (a) We introduce Cross-Modal Proxy Tokens (CMPTs), a novel approach to address missing modality challenges. CMPTs effectively learn to approximate missing modality class tokens by adapting pretrained encoders through a joint optimization of alignment and task-specific objectives. Our approach accommodates both complete and missing modalities during training and inference, thereby enhancing robustness across varying missing modality scenarios. (b) CMPTs achieve state-of-the-art performance, consistently outperforming recent baseline methods in both complete and missing modality scenarios. The radar plot illustrates F1-macro scores on the MM-IMDb dataset across varying modality availability.
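On the encoder side, the pretrained unimodal backbones stay frozen and only lightweight low-rank adapters (together with the proxy tokens and task head) are trained. Below is a rough sketch of a LoRA-style adapter wrapped around a frozen linear layer; the rank, scaling factor, and the choice of which layers to adapt are illustrative assumptions, not the repository's actual settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer with a trainable low-rank update (sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # keep pretrained weights frozen
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # adapted layer starts equal to the frozen one
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```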
Create a conda environment for the image-text datasets (UPMC Food-101 and MM-IMDb) by running the following commands:
conda create -n cmpt-image-text python=3.8.19
conda activate cmpt-image-text
pip install -r image_text_requirements.txt
Create a separate conda environment for the audio-video datasets (Kinetics-Sound, AVE, and CREMA-D) by running the following commands:
conda create -n cmpt-audio-video python=3.8.19
conda activate cmpt-audio-video
pip install -r audio_video_requirements.txt
All configurations for all datasets are in the config.py file. Set the corresponding paths and hyperparameters there before training or testing on any dataset.
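Since the training and test scripts are launched with Sacred-style named configs (for example, with task_finetune_mmimdb), config.py presumably collects these options as named configurations. The sketch below only shows the pattern; the key names used here (data_root, model_path, batch_size, missing_rate, and so on) are illustrative assumptions and may not match the actual fields in config.py.

```python
from sacred import Experiment

ex = Experiment("cmpt")

@ex.config
def default_config():
    data_root = "/path/to/arrow/files"   # output of the preprocessing scripts
    model_path = ""                      # checkpoint path used by the test scripts
    batch_size = 32
    learning_rate = 1e-4
    missing_rate = 0.7                   # fraction of samples with a missing modality

@ex.named_config
def task_finetune_mmimdb():
    exp_name = "finetune_mmimdb"
    data_root = "/path/to/mmimdb_arrow"
```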
Please follow the steps below to pre-process the datasets.
- Download the UPMC Food-101 dataset.
- Then run the following command:
python ./utils/data_preprocessing/make_arrow.py --dataset food101 --root [YOUR_DATASET_ROOT]
- Download the MM-IMDb dataset.
- Then run the following command:
python ./utils/data_preprocessing/make_arrow.py --dataset mmimdb --root [YOUR_DATASET_ROOT]
- Download the Kinetics-400 dataset.
- Set the correct paths in the following files:
./utils/data_preprocessing/kinetics_convert_avi.py
./utils/data_preprocessing/kinetics_arrange_by_class.py
./utils/data_preprocessing/extract_wav_and_frames.py
- Run the following commands:
python ./utils/data_preprocessing/kinetics_convert_avi.py
python ./utils/data_preprocessing/kinetics_arrange_by_class.py
python ./utils/data_preprocessing/extract_wav_and_frames.py
- Download the AVE dataset.
- Set the paths in the ./utils/data_preprocessing/pre_process_ave.py file.
- Run the following command:
python ./utils/data_preprocessing/pre_process_ave.py
- Download the CREMA-D dataset.
- Set the paths in the ./utils/data_preprocessing/preprocess_creamad.py file.
- Run the following command:
python ./utils/data_preprocessing/preprocess_creamad.py
To train a CMPT model on any dataset, first set all the paths and hyperparameters in the config.py file, then run the appropriate command below.
For the UPMC Food-101 dataset
python -m scripts.train_image_text_model with task_finetune_food101
For the MM-IMDb dataset
python -m scripts.train_image_text_model with task_finetune_mmimdb
For the Kinetics-Sound dataset
python -m scripts.train_audio_video_model with task_finetune_kinetics_sound
For the AVE dataset
python -m scripts.train_audio_video_model with task_finetune_ave
For the CREMA-D dataset
python -m scripts.train_audio_video_model with task_finetune_cremad
To evaluate a pretrained model on any dataset, set model_path in the config.py file to the saved checkpoint, then run the appropriate command below.
For the UPMC Food-101 dataset
python -m scripts.test_image_text_model with task_finetune_food101
For the MM-IMDb dataset
python -m scripts.test_image_text_model with task_finetune_mmimdb
For the Kinetics-Sound dataset
python -m scripts.test_audio_video_model with task_finetune_kinetics_sound
For the AVE dataset
python -m scripts.test_audio_video_model with task_finetune_ave
For the CREMA-D dataset
python -m scripts.test_audio_video_model with task_finetune_cremad
If you find the Cross-Modal Proxy Tokens (CMPTs) approach useful in your research, please consider citing the following work.
@article{reza2025robust,
title={Robust Multimodal Learning via Cross-Modal Proxy Tokens},
author={Md Kaykobad Reza and Ameya Patil and Mashhour Solh and Salman Asif},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2025},
url={https://openreview.net/forum?id=Wtc6wvcYJ0},
}
Our codebase is built upon the Missing Aware Prompts repository. We sincerely thank the authors for making their code publicly available.
Note: This is a research-level repository and may contain issues or bugs. Please contact the authors with any queries.