MokA

🌴 News

[2025-12-27] We have updated the code, checkpoints, and predicted results. The training instructions have also been improved!

🚀 Quick Start

🥑 Pre-trained weights used:

Multi-modal Encoder Weights:

LLM Weights:

🌴 Prepare datasets

In this repo, we take the audio-visual-text and visual-text cases as examples. Pretraining is based on the llama2-7b-chat-hf model.
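A minimal sketch of loading the llama2-7b-chat-hf backbone with Hugging Face transformers, assuming the weights are pulled from the Hub or a local download path (the precision and device flags are illustrative, not the repo's exact settings):

```python
# Hedged sketch: load the LLaMA-2 7B chat backbone before any adaptation.
# The model name may be replaced with a local path to the downloaded weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # or a local checkpoint directory

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # half precision to reduce memory for the 7B model
    device_map="auto",          # place layers across available GPUs automatically
)
```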

Stage 1 dataset:

  • Download image and video pretrain dataset from Video-LLaVA;
  • Download audio pretrain dataset from AudioCaps.

Stage 2 dataset:

Audio-Visual-Text:
  • AVE annotation & JSON: HERE
  • AVE raw video: HERE
  • MUSIC-AVQA annotation & JSON: HERE
  • MUSIC-AVQA raw video: HERE
Visual-Text:
  • Download the JSON of the training data HERE. A small set of multiple-choice-style instructions is integrated with the original LLaVA-Instruct-150K; a sketch for inspecting the file follows this list.
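A quick way to sanity-check the downloaded Stage 2 JSON. The field names below ("image", "conversations") follow the LLaVA-Instruct-150K convention and are an assumption, as is the filename; check the actual file for its exact schema:

```python
# Hedged sketch: peek at the downloaded instruction-tuning JSON.
import json

with open("train_instructions.json") as f:  # hypothetical filename
    records = json.load(f)

print(f"{len(records)} training samples")
sample = records[0]
print(sample.get("image"))  # relative path to the image/video, if present
for turn in sample.get("conversations", []):
    # each turn alternates between the human prompt and the model response
    print(turn["from"], ":", turn["value"][:80])
```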

🔑 Training

Audio-Visual-Text case

See AudioVisualText/README_AVT.md for detailed instructions.

Visual-Text case

See VisualText/README_VT.md for detailed instructions.
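The per-case READMEs contain the full training pipelines. Purely as a point of reference, the sketch below shows a plain LoRA attachment to the LLaMA-2 backbone with the peft library; this is standard low-rank adaptation, not the MokA adaptation itself, and the rank, alpha, and target modules are illustrative assumptions:

```python
# Generic LoRA baseline (NOT MokA): attach low-rank adapters to the attention
# projections of the LLaMA-2 backbone and report the trainable parameter count.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # a common choice for LLaMA attention
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
```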

📃 BibTeX

@article{wei2025moka,
  title={MokA: Multimodal Low-Rank Adaptation for MLLMs},
  author={Wei, Yake and Miao, Yu and Zhou, Dongzhan and Hu, Di},
  journal={Advances in Neural Information Processing Systems},
  year={2025}
}
