This repository contains the official implementation of our paper:
Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models
Rui Cai, Bangzheng Li, Xiaofei Wen, Muhao Chen, Zhe Zhao
arXiv:2505.19616
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across tasks, yet they often exhibit difficulty in distinguishing task-relevant from irrelevant signals—particularly in tasks like Visual Question Answering (VQA)—which can lead to susceptibility to misleading or spurious inputs. We refer to this broader limitation as the Cross-Modality Competency Problem—the model’s inability to fairly evaluate all modalities. This vulnerability becomes more evident in modality-specific tasks—such as image classification or pure text question answering—where models are expected to rely solely on one modality. In such tasks, spurious information from irrelevant modalities often leads to significant performance degradation. We refer to this failure as Modality Interference, which serves as a concrete and measurable instance of the Cross-Modality Competency Problem, and we further design a perturbation-based causal diagnostic experiment to verify and quantify it. To mitigate modality interference, we propose a novel framework to finetune MLLMs, including perturbation-based data augmentation with both heuristic perturbations and adversarial perturbations via Projected Gradient Descent (PGD), and a consistency regularization strategy applied to model outputs under original and perturbed inputs. Experiments on multiple benchmark datasets (image-heavy, text-heavy, and VQA tasks) and multiple model families with different scales demonstrate significant improvements in robustness and cross-modality competency, indicating our method’s effectiveness in boosting unimodal reasoning ability while enhancing performance on multimodal tasks.
For more details, please refer to our paper.
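As a rough illustration of the adversarial side of the data augmentation described above, the sketch below runs a few PGD steps on the visual input to maximize the model's loss inside a small L-infinity ball. It assumes an HF-style `model(input_ids=..., pixel_values=..., labels=...)` forward that returns `.loss`; the function name, step sizes, and budget are illustrative assumptions, not the exact settings used in our code.

```python
import torch

def pgd_perturb_image(model, pixel_values, input_ids, attention_mask, labels,
                      epsilon=8 / 255, alpha=2 / 255, num_steps=3):
    """Minimal PGD sketch: adversarially perturb the image within an L-inf ball."""
    adv = pixel_values.clone().detach()
    for _ in range(num_steps):
        adv.requires_grad_(True)
        loss = model(input_ids=input_ids, attention_mask=attention_mask,
                     pixel_values=adv, labels=labels).loss
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv + alpha * grad.sign()                        # ascend the loss
            delta = (adv - pixel_values).clamp(-epsilon, epsilon)  # project onto the ball
            adv = (pixel_values + delta).detach()                  # pixel-range clamping omitted
    return adv
```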
We provide separate environments for different model families:
For LLaVA-1.5 and InstructBLIP-vicuna:

```bash
conda env create -n mllm_llava -f MLLM.yml
conda activate mllm_llava
```

For Qwen2.5-VL:

```bash
conda env create -n mllm_qwen -f vllm.yml
conda activate mllm_qwen
```

All training and evaluation data used in our experiments is publicly available at:
👉 HuggingFace: luisrui/training_data
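To inspect the data programmatically, a minimal sketch with the `datasets` library is shown below; the available configs, splits, and columns are whatever the HuggingFace page lists, so treat the details as assumptions.

```python
from datasets import load_dataset

# Illustrative only: split/config names may differ from the actual release.
data = load_dataset("luisrui/training_data")
print(data)  # inspect the available splits and columns before training
```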
To train a model with different settings (example: LLaVA-1.5-7B), use the following command:
```bash
deepspeed src/llava/llava_consistent.py \
    --config configs/model_train/llava-v1.5-7b/args_full_KL_PGD.yaml
```

You may switch configs to enable:
- PGD adversarial training
- KL or JS consistency regularization (sketched below)
- Image-heavy, Text-heavy, or Mixed dataset sampling ratio settings
More examples and configs are available in configs/model_train/.
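The consistency term enforces agreement between the model's output distributions on original and perturbed inputs. The snippet below is a minimal sketch of KL- and JS-based variants over output logits; it assumes plain `[batch, vocab]`-shaped logits and is not the exact code in src/.

```python
import torch.nn.functional as F

def kl_consistency_loss(logits_orig, logits_pert, temperature=1.0):
    # KL(p_orig || p_pert): penalize the perturbed-input distribution
    # for drifting away from the original-input distribution.
    p_log = F.log_softmax(logits_orig / temperature, dim=-1)
    q_log = F.log_softmax(logits_pert / temperature, dim=-1)
    return F.kl_div(q_log, p_log, log_target=True, reduction="batchmean")

def js_consistency_loss(logits_orig, logits_pert):
    # Symmetric Jensen-Shannon variant of the same idea.
    p = F.softmax(logits_orig, dim=-1)
    q = F.softmax(logits_pert, dim=-1)
    m_log = (0.5 * (p + q)).log()
    return 0.5 * (F.kl_div(m_log, p, reduction="batchmean")
                  + F.kl_div(m_log, q, reduction="batchmean"))
```

In training, a term like this would typically be added to the standard cross-entropy loss with a weighting coefficient set in the config.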
Evaluation scripts are provided in analysis/ and support:
- Unimodal perturbation analysis (a minimal sketch of the idea follows below)
- VQA and classification accuracy
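As a rough illustration of the unimodal perturbation analysis, the helper below swaps the image for noise or zeros so you can check whether the answer to a text-only question changes. The function and its mode names are hypothetical; the actual diagnostic scripts live in analysis/.

```python
import torch

def perturb_image(pixel_values, mode="noise"):
    # For a text-only question, a competent model's answer should not change
    # when the (irrelevant) image is replaced by noise or blanked out.
    if mode == "noise":
        return torch.randn_like(pixel_values)
    if mode == "zero":
        return torch.zeros_like(pixel_values)
    raise ValueError(f"unknown perturbation mode: {mode}")
```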
You can directly run the following command for prediction on multiple datasets:
```bash
bash zs_inference.sh \
    --model_name llava-1.5-7b \
    --checkpoint_path path_to_your_checkpoint \
    --batch_size batch_size \
    --tag your_experiment_setting \
    --all
```

For evaluation, you can run:
```bash
bash evaluate.sh \
    --model_name llava-1.5-7b \
    --tag same_tag_as_your_experiment_setting \
    --all
```

We also provide pretrained checkpoints and a unified evaluation interface at 👉 HuggingFace: luisrui/
If you find this repository helpful in your research, please cite our paper:

```bibtex
@article{cai2025diagnosing,
  title={Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models},
  author={Cai, Rui and Li, Bangzheng and Wen, Xiaofei and Chen, Muhao and Zhao, Zhe},
  journal={arXiv preprint arXiv:2505.19616},
  year={2025}
}
```