This repo contains the code for the paper *True Multimodal In-Context Learning Needs Attention to the Visual Context* (COLM 2025).
```bash
git clone https://github.com/chenxshuo/true-micl.git
cd true-micl
# this script installs torch 2.5.1 + cu121; adjust it to your desired version
bash setup_venv.sh
```
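To verify the environment before moving on, a quick check (assuming the virtual environment created by `setup_venv.sh` is activated) confirms that the PyTorch and CUDA versions match the configuration below:

```python
# Quick sanity check for the environment created by setup_venv.sh.
import torch

print("torch:", torch.__version__)           # expected: 2.5.1+cu121
print("cuda available:", torch.cuda.is_available())
print("cuda version:", torch.version.cuda)   # expected: 12.1
if torch.cuda.is_available():
    print("gpu:", torch.cuda.get_device_name(0))
```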
Unsloth must be installed with the tag matching your hardware and environment.
✅ Our working configuration:
- GPU: A100 (Ampere)
- CUDA: 12.1
- PyTorch: 2.5.1
Use the following:
```bash
uv pip install --no-deps 'unsloth[cu121-ampere-torch251] @ git+https://github.com/unslothai/unsloth.git'
```

If you use a different GPU or CUDA version, refer to the Unsloth installation guide and adjust the tag accordingly.
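A minimal import check, assuming a recent Unsloth release that ships `FastVisionModel` (adjust the import if your version differs):

```python
# Confirm that Unsloth installed cleanly against this torch/CUDA build.
try:
    from unsloth import FastVisionModel  # vision entry point in recent Unsloth releases
    print("Unsloth OK:", FastVisionModel.__name__)
except ImportError as exc:
    print("Unsloth import failed; re-check the install tag:", exc)
```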
Log in to Hugging Face and download the required datasets from our data release:

```bash
huggingface-cli login
hf download ShuoChen99/TrueMICL --repo-type dataset --local-dir dataset --quiet
```

This places the datasets under a local `dataset/` folder; the expected layout is shown below.
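If you prefer staying in Python, `huggingface_hub`'s `snapshot_download` mirrors the CLI call above (same repo id and target directory):

```python
# Download the TrueMICL datasets into ./dataset, equivalent to the CLI call above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ShuoChen99/TrueMICL",
    repo_type="dataset",
    local_dir="dataset",
)
```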
```
dataset/
├── operator_induction/
│   ├── support.json
│   └── query.json
├── sudoku/
│   ├── ...
├── shapes_count/
│   ├── ...
...
```
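As a quick smoke test, the splits can be inspected with the standard `json` module; the per-record schema is task-specific, so this sketch only reports example counts (it assumes each file is either a JSON list or a dict of records):

```python
# Inspect one task's support/query splits; field names vary per task,
# so we only print how many examples each split contains.
import json
from pathlib import Path

task_dir = Path("dataset/operator_induction")
for split in ("support.json", "query.json"):
    data = json.loads((task_dir / split).read_text())
    # Handle either a top-level list of records or a dict keyed by example id.
    records = data if isinstance(data, list) else list(data.values())
    print(split, "->", len(records), "examples")
```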
## Running Training and Inference

This project provides a unified shell script, `run.sh`, for both training and inference. For example, to run model inference on Clock Math:
```bash
source .venv/bin/activate
bash run.sh infer clock        # baseline inference
bash run.sh dara_infer clock   # load pre-trained DARA
bash run.sh lora_infer clock   # load pre-trained LoRA
```
| Mode | Description |
|---|---|
| `infer` | Run inference with the base model using a randomly chosen 4-shot context. |
| `dara_infer` | Run inference with a DARA model. |
| `lora_infer` | Run inference with a LoRA fine-tuned model. |
| `dara_finetune` | DARA training mode. |
| `lora_finetune` | Fine-tune with LoRA; only part of the parameters are trainable. |
After inference, results are saved under `./results/` in JSON format.
To calculate accuracy:

```bash
python check_accuracy.py ./results/your_result_file.json
```
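For reference, a minimal sketch of such an accuracy computation is shown below; the keys `prediction` and `answer` are illustrative assumptions, and `check_accuracy.py` itself is the source of truth for the actual result format:

```python
# Hypothetical sketch of what an accuracy check over the results JSON might do.
# The keys "prediction" and "answer" are assumptions for illustration only.
import json
import sys

data = json.load(open(sys.argv[1]))
records = data if isinstance(data, list) else list(data.values())
correct = sum(
    str(r.get("prediction", "")).strip() == str(r.get("answer", "")).strip()
    for r in records
)
print(f"accuracy: {correct}/{len(records)} = {correct / max(len(records), 1):.4f}")
```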
Repository structure:

```
├── ckpt/                          # Your LoRA checkpoints
│   ├── ...
├── dataset/                       # Place downloaded datasets here
│   ├── operator_induction/
│   ├── sudoku/
│   └── ...
├── qwen2_vl_replacement/          # model override modules
├── check_accuracy.py
├── data_processing.py
├── modeling_qwen2_vl.py
├── pip_requirements.txt
├── qwen2_finetune_new_model.py
├── run.sh                         # Entry script for training/inference
└── samples_of_training_data.json  # sample JSON for training
```
If you find this work useful, please cite:

```bibtex
@article{chen2025true,
  title={True Multimodal In-Context Learning Needs Attention to the Visual Context},
  author={Chen, Shuo and Liu, Jianzhe and Han, Zhen and Xia, Yan and Cremers, Daniel and Torr, Philip and Tresp, Volker and Gu, Jindong},
  journal={arXiv preprint arXiv:2507.15807},
  year={2025}
}
```