
🔬 True Multimodal In-Context Learning Needs Attention to the Visual Context

This repository contains the code for the paper True Multimodal In-Context Learning Needs Attention to the Visual Context (COLM 2025).

✅ Installation

git clone https://github.com/chenxshuo/true-micl.git
cd true-micl

# this script installs torch 2.5.1 + cu121; modify it if you need a different version
bash setup_venv.sh

⚠️ Important: Unsloth version must match your PyTorch, CUDA, and GPU architecture

Unsloth must be installed with the correct tag depending on your hardware and environment. For example:

✅ Our working configuration:

  • GPU: A100 (Ampere)
  • CUDA: 12.1
  • PyTorch: 2.5.1

Use the following:

uv pip install --no-deps 'unsloth[cu121-ampere-torch251] @ git+https://github.com/unslothai/unsloth.git'

If you use a different GPU or CUDA version, refer to the Unsloth installation guide and adjust the tag accordingly.
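
To confirm which tag applies to your machine, here is a minimal Python check (the check_env.py name is just illustrative; it assumes setup_venv.sh has already installed torch):

# check_env.py -- print the versions that determine the Unsloth tag
import torch

print("PyTorch:", torch.__version__)   # e.g. 2.5.1+cu121
print("CUDA:", torch.version.cuda)     # e.g. 12.1
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    # Ampere GPUs such as the A100 report compute capability 8.x
    print("GPU:", torch.cuda.get_device_name(0), f"(compute capability {major}.{minor})")
else:
    print("No CUDA device visible")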


📦 Dataset Preparation

huggingface-cli login
hf download ShuoChen99/TrueMICL --repo-type dataset --local-dir dataset --quiet 

The commands above download the datasets from our data release into a local dataset/ folder, laid out as follows:

dataset/
├── operator_induction/
│   ├── support.json
│   └── query.json
├── sudoku/
│   ├── ...
├── shapes_count/
│   ├── ...
...
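
After downloading, you can sanity-check the layout with a short script (a sketch only; it assumes every task folder follows the support.json / query.json pattern shown above, and the script name is hypothetical):

# verify_dataset.py -- list downloaded tasks and flag missing split files
from pathlib import Path

root = Path("dataset")
for task in sorted(p for p in root.iterdir() if p.is_dir()):
    missing = [f for f in ("support.json", "query.json") if not (task / f).exists()]
    print(f"{task.name}: {'OK' if not missing else 'missing ' + ', '.join(missing)}")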

🧪 Running the Code

This project provides a unified shell script, run.sh, for both training and inference. For example, to run inference on Clock Math:

source .venv/bin/activate
bash run.sh infer clock       # baseline inference
bash run.sh dara_infer clock  # load pre-trained DARA
bash run.sh lora_infer clock  # load pre-trained LoRA
Mode            Description
infer           Run inference with the base model, using 4 randomly chosen shots.
dara_infer      Run inference with a pre-trained DARA model.
lora_infer      Run inference with a LoRA fine-tuned model.
dara_finetune   Train DARA.
lora_finetune   Fine-tune with LoRA; only part of the parameters are trainable.
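
Training uses the same entry point; following the command pattern above, bash run.sh dara_finetune clock should train DARA on Clock Math, after which bash run.sh dara_infer clock evaluates the resulting model.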

✅ Output and Evaluation

After inference, results are saved under ./results/ in JSON format.

To calculate accuracy:

python check_accuracy.py ./results/your_result_file.json
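
If you want to inspect a result file by hand, the sketch below shows the idea; the prediction and answer field names are assumptions, so check check_accuracy.py for the actual schema:

# inspect_results.py -- rough accuracy over a results file (field names are assumed)
import json
import sys

with open(sys.argv[1]) as f:
    records = json.load(f)

correct = sum(r["prediction"] == r["answer"] for r in records)  # hypothetical keys
print(f"accuracy: {correct}/{len(records)} = {correct / len(records):.2%}")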

📁 Project Structure

.
├── ckpt/                          # Your LoRA checkpoints
│   ├── ...
├── dataset/                       # Place downloaded datasets here
│   ├── operator_induction/
│   ├── sudoku/
│   └── ...
├── qwen2_vl_replacement/          # Model override modules
├── check_accuracy.py
├── data_processing.py
├── modeling_qwen2_vl.py
├── pip_requirements.txt
├── qwen2_finetune_new_model.py
├── run.sh                         # Entry script for training/inference
└── samples_of_training_data.json  # Sample JSON for training

📚 References

BibTeX

@article{chen2025true,
  title={True Multimodal In-Context Learning Needs Attention to the Visual Context},
  author={Chen, Shuo and Liu, Jianzhe and Han, Zhen and Xia, Yan and Cremers, Daniel and Torr, Philip and Tresp, Volker and Gu, Jindong},
  journal={arXiv preprint arXiv:2507.15807},
  year={2025}
}
