[2025-12-27] We have updated the code, checkpoints, and predicted results. The training instructions have also been improved!
Multi-modal Encoder Weights:
- Download the visual encoder: openai-clip-vit-large-patch14
- Download the audio encoder: Fine-tuned BEATs_iter3+ (AS2M)
LLM Weights:
- Download the LLM backbone: LLaMA-2-Chat-HF (see the download sketch below)
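The CLIP encoder and the LLaMA-2 backbone are hosted on the Hugging Face Hub, while the BEATs checkpoint is released separately. Below is a minimal download sketch using `huggingface_hub`; the local directory names are assumptions, and the LLaMA-2 repo is gated, so it requires an access token granted by Meta.

```python
# Minimal download sketch (local directory names are placeholders).
from huggingface_hub import snapshot_download

# Visual encoder: CLIP ViT-L/14 released by OpenAI.
snapshot_download(
    repo_id="openai/clip-vit-large-patch14",
    local_dir="checkpoints/clip-vit-large-patch14",
)

# LLM backbone: LLaMA-2-7B-Chat in HF format (gated repo; log in first,
# e.g. with `huggingface-cli login`, using a token approved by Meta).
snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    local_dir="checkpoints/Llama-2-7b-chat-hf",
)

# Audio encoder: the fine-tuned BEATs_iter3+ (AS2M) checkpoint is not on the
# Hub; download it manually from the official BEATs release page and place it
# at, e.g., checkpoints/BEATs_iter3_plus_AS2M.pt (path is a placeholder).
```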
In this repo, we take the audio-visual-text and visual-text cases as examples. Pretraining is based on the llama2-7b-chat-hf model.
- Download the image and video pretraining data from Video-LLaVA.
- Download the audio pretraining data from AudioCaps.
- AVE annotation & JSON: HERE
- AVE raw video: HERE
- MUSIC-AVQA annotation & JSON: HERE
- MUSIC-AVQA raw video: HERE
- Download the training-data JSON HERE. A small set of multiple-choice instructions is integrated with the original LLaVA-Instruct-150K (a loading sketch follows below).
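The training JSON is assumed here to follow the LLaVA-Instruct-style layout (a list of samples, each with an id, a media path, and a `conversations` list of alternating "human"/"gpt" turns); the field names and file path below are placeholders, so adjust them to the file you actually downloaded. A minimal inspection sketch:

```python
# Minimal sketch for inspecting the training JSON (field names assume a
# LLaVA-Instruct-style layout; adjust to the actual downloaded file).
import json

with open("data/train_instructions.json", "r") as f:  # placeholder path
    samples = json.load(f)

print(f"{len(samples)} training samples")
first = samples[0]
print(first.get("id"), first.get("image") or first.get("video"))
for turn in first.get("conversations", []):
    # Each turn is expected to carry a speaker tag and the instruction/answer text.
    print(turn["from"], ":", turn["value"][:80])
```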
See AudioVisualText/README_AVT.md for detailed instructions on the audio-visual-text case.
See VisualText/README_VT.md for detailed instructions on the visual-text case.
@article{wei2025moka,
title={MokA: Multimodal Low-Rank Adaptation for MLLMs},
author={Wei, Yake and Miao, Yu and Zhou, Dongzhan and Hu, Di},
journal={Advances in Neural Information Processing Systems},
year={2025}
}