Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space

Interleaved Vision-Text Latent Reasoning (IVT-LR) is the first VLM framework that unifies textual and visual representations in the latent space and performs multimodal latent reasoning. Specifically, IVT-LR represents each reasoning step by combining two implicit parts: latent text and latent vision. We further introduce a progressive multi-stage training strategy that enables multimodal large language models (MLLMs) to perform these latent reasoning steps.

Quick Overview of IVT-LR.

🔥 News

[2026.01] Model files are now available on Hugging Face!
[2025.10] 🎉🎉 Initial release of the project.

🚀 Quick Start

1. Installation

Clone repo:

git clone https://github.com/ModalityDance/IVT-LR.git
cd IVT-LR

Setup environment:

conda env create -f environment.yml
conda activate ivtlr

Expected folder structure

IVT-LR/
├── chameleon/
│   ├── args/
│   ├── chameleon_dataset.py
│   └── ...
├── qwen_vl/
│   ├── args/
│   ├── dataset.py
│   └── ...
└── environment.yml

2. Data Preparation

Download datasets:

from datasets import load_dataset

m3cot = load_dataset("LightChen2333/M3CoT")
scienceqa = load_dataset("derek-thomas/ScienceQA")

or download manually from:

3. Training

💡 Skip Training: If you want to skip training and directly run inference, you can download our pretrained models from the IVT-LR Collection on Hugging Face.

Qwen2-VL

To train the Qwen2-VL model with IVT-LR on the M3CoT dataset:

cd qwen_vl
export CUDA_VISIBLE_DEVICES=0,1,2,3
export NCCL_P2P_LEVEL=NVL   # if needed
PYTHONUNBUFFERED=1 nohup deepspeed --master_port 29501 qwenvl_run.py args/qwen.yaml --deepspeed --deepspeed_config ds_config.json > qwenvl.log 2>&1 &

To train the Qwen2-VL model with IVT-LR on the ScienceQA dataset:

cd qwen_vl
export CUDA_VISIBLE_DEVICES=0,1,2,3
export NCCL_P2P_LEVEL=NVL   # if needed
PYTHONUNBUFFERED=1 nohup deepspeed --master_port 29501 qwenvl_run_sqa.py args/qwen.yaml --deepspeed --deepspeed_config ds_config.json > qwenvl.log 2>&1 &

Chameleon

For Chameleon on M3CoT:

cd chameleon
export CUDA_VISIBLE_DEVICES=0,1,2,3
export NCCL_P2P_LEVEL=NVL   # if needed
PYTHONUNBUFFERED=1 nohup deepspeed --master_port 29501 chameleon_run.py args/chameleon.yaml --deepspeed --deepspeed_config ds_config.json > chameleon.log 2>&1 &

For Chameleon on ScienceQA:

cd chameleon
export CUDA_VISIBLE_DEVICES=0,1,2,3
export NCCL_P2P_LEVEL=NVL   # if needed
PYTHONUNBUFFERED=1 nohup deepspeed --master_port 29501 chameleon_run_sqa.py args/chameleon.yaml --deepspeed --deepspeed_config ds_config.json > chameleon.log 2>&1 &
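The four training commands above differ only in the working directory, the training script, and the config file. As a rough illustration of that pattern, the sketch below builds the `deepspeed` invocation from a lookup table. The script and config names are copied from the commands above; the `TRAIN_JOBS` table and the `train_command` helper are hypothetical and not part of the repository.

```python
import shlex

# (model, dataset) -> (working directory, training script, config file).
# Values are copied from the README commands; the table itself is hypothetical.
TRAIN_JOBS = {
    ("qwen2-vl", "m3cot"): ("qwen_vl", "qwenvl_run.py", "args/qwen.yaml"),
    ("qwen2-vl", "scienceqa"): ("qwen_vl", "qwenvl_run_sqa.py", "args/qwen.yaml"),
    ("chameleon", "m3cot"): ("chameleon", "chameleon_run.py", "args/chameleon.yaml"),
    ("chameleon", "scienceqa"): ("chameleon", "chameleon_run_sqa.py", "args/chameleon.yaml"),
}


def train_command(model, dataset, master_port=29501):
    """Return (working directory, shell command) for one training run."""
    workdir, script, config = TRAIN_JOBS[(model.lower(), dataset.lower())]
    argv = [
        "deepspeed", "--master_port", str(master_port),
        script, config,
        "--deepspeed", "--deepspeed_config", "ds_config.json",
    ]
    return workdir, shlex.join(argv)


if __name__ == "__main__":
    workdir, command = train_command("Chameleon", "ScienceQA")
    print(f"cd {workdir} && {command}")
```

The helper only assembles the command string; environment variables such as `CUDA_VISIBLE_DEVICES` and the `nohup` log redirection still need to be set as shown above.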

Training Arguments

Key parameters in configuration:

save_path: Checkpoint save directory
name: Experiment name
epochs_per_stage: Epochs per latent reasoning stage (default: 4)
max_latent_stage: Maximum latent reasoning stages (default: 5)
resume: Resume epoch number (default: 0)
batch_size_training: Batch size per GPU (default: 4)
gradient_accumulation_steps: Gradient accumulation steps (default: 4)
num_epochs: Total training epochs (default: 16)
lr: Learning rate (default: 4e-5)
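Two of these defaults interact with the launch setup, and a small calculation makes that visible. The sketch below is illustrative rather than the repository's code: it computes the effective global batch size for the 4-GPU launch shown earlier, and it derives a latent stage per epoch under the assumption that the curriculum advances one stage every `epochs_per_stage` epochs and is capped at `max_latent_stage`. That schedule is an assumption borrowed from Coconut-style training, and both function names are hypothetical.

```python
# Illustrative only: helper names are hypothetical, not part of IVT-LR.

def effective_batch_size(batch_size_training, gradient_accumulation_steps, num_gpus):
    """Number of samples that contribute to one optimizer step across all GPUs."""
    return batch_size_training * gradient_accumulation_steps * num_gpus


def latent_stage(epoch, epochs_per_stage, max_latent_stage):
    """Assumed schedule: one new stage every `epochs_per_stage` epochs, capped."""
    return min(epoch // epochs_per_stage, max_latent_stage)


if __name__ == "__main__":
    # Defaults from the list above, with 4 GPUs as in CUDA_VISIBLE_DEVICES=0,1,2,3.
    print(effective_batch_size(4, 4, 4))
    print([latent_stage(epoch, 4, 5) for epoch in range(16)])
```

Under this assumed schedule, the default 16 epochs with 4 epochs per stage would cover stages 0 through 3. Check the training scripts for the actual schedule before relying on this arithmetic.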

4. Inference

To generate answers on the test split, run the inference script that matches your model and dataset.

Qwen2-VL on M3CoT:

export CUDA_VISIBLE_DEVICES=0
nohup python infer.py > infer.log 2>&1 &

Qwen2-VL on ScienceQA:

export CUDA_VISIBLE_DEVICES=0
nohup python infer_sqa.py > infer.log 2>&1 &

Chameleon on M3CoT:

export CUDA_VISIBLE_DEVICES=0
nohup python infer_chameleon.py > infer.log 2>&1 &

Chameleon on ScienceQA:

export CUDA_VISIBLE_DEVICES=0
nohup python infer_chameleon_scienceqa.py > infer.log 2>&1 &

✨ How It Works

IVT-LR introduces a novel paradigm of multimodal latent reasoning that unifies textual and visual representations within the latent space. Unlike explicit chain-of-thought methods that require labor-intensive vision-text annotations, IVT-LR performs reasoning implicitly, achieving both annotation efficiency and inference speedup.

At a high level, the workflow proceeds as follows:

Interleaved Multimodal Representation — Each reasoning step combines two implicit components: latent text and latent vision. This interleaved structure enables the model to jointly leverage both modalities during reasoning.
Progressive Multi-Stage Training — We employ a curriculum-style training strategy that gradually increases the number of latent reasoning stages. This progressive approach helps MLLMs learn to perform multimodal latent reasoning in a stable and effective manner.
Dynamic Attention Allocation — A key insight from our analysis is that interleaved multimodal reasoning leads to dynamic attention redistribution. As reasoning progresses, the model adaptively shifts attention between visual and textual tokens based on task demands, significantly enhancing visual perception capabilities.
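The interleaved representation can be made concrete with a toy sketch. It assumes, as a simplification, that the latent text of a step is the model's last hidden state from the previous step, and that the latent vision is the top-k image embeddings ranked by attention score. This README does not spell out the mechanism, so treat both choices, along with the names `select_latent_vision` and `build_latent_step`, as illustrative assumptions rather than the project's implementation.

```python
# Toy sketch with plain Python lists standing in for tensors.
# All names are hypothetical and not taken from the IVT-LR codebase.

def select_latent_vision(image_embeds, attn_scores, k):
    """Return the k image embeddings with the highest attention scores,
    kept in their original spatial order."""
    if len(image_embeds) != len(attn_scores):
        raise ValueError("need exactly one attention score per image embedding")
    ranked = sorted(range(len(attn_scores)), key=lambda i: attn_scores[i], reverse=True)
    return [image_embeds[i] for i in sorted(ranked[:k])]


def build_latent_step(last_hidden, image_embeds, attn_scores, k):
    """One latent reasoning step: latent text followed by latent vision."""
    return [last_hidden] + select_latent_vision(image_embeds, attn_scores, k)


if __name__ == "__main__":
    image_embeds = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
    attn_scores = [0.05, 0.60, 0.10, 0.25]   # attention over the four image patches
    last_hidden = [0.9, -0.1]                # hidden state from the previous step
    print(build_latent_step(last_hidden, image_embeds, attn_scores, k=2))
```

In a real model the resulting sequence would be fed back as input embeddings for the next step instead of being decoded into tokens, which is what keeps the reasoning in latent space.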

🔗 Related Projects

📄 Related Papers

Coconut: Training Large Language Models to Reason in a Continuous Latent Space
A pioneering work on latent reasoning that uses continuous thought representations for LLM reasoning.

🌟 Awesome Collections

Awesome Latent Space
A curated collection of resources on latent space methods and applications.
Awesome Latent CoT
A comprehensive list of latent chain-of-thought reasoning resources.

📚 Citation

If you use IVT-LR in your research or applications, please consider citing:

@article{chen2025reasoning,
  title={Reasoning in the dark: Interleaved vision-text reasoning in latent space},
  author={Chen, Chao and Ma, Zhixin and Li, Yongqi and Hu, Yupeng and Wei, Yinwei and Li, Wenjie and Nie, Liqiang},
  journal={arXiv preprint arXiv:2510.12603},
  year={2025}
}

⭐ Thank you for visiting IVT-LR! ⭐
