MVW (M-step Vision Withdrawal) is a method for efficient multimodal reasoning that reduces computation and memory usage in vision-language models (VLMs). During autoregressive generation, attention to visual tokens typically drops after the initial reasoning steps. MVW leverages this by withdrawing the vision tokens after a small number of generated text tokens, retaining reasoning accuracy while accelerating inference. The approach is training-free and fully compatible with KV caching, FlashAttention, and token pruning strategies such as VTW. It works across multiple models, including LLaVA-CoT and Kimi-VL.
Figure 1: Left: an example input to a reasoning VLM, consisting of an image and a text prompt followed by the generated reasoning. Right: the total attention assigned to vision tokens as a function of the generated token index; attention to vision tokens drops rapidly after the first few tokens.
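Conceptually, MVW amounts to dropping the cached keys and values of the vision tokens once a fixed number of text tokens has been generated. The snippet below is a minimal sketch of that idea on a toy per-layer KV cache; the function name, tensor shapes, and token positions are illustrative assumptions rather than the repository's actual implementation.

```python
# Minimal sketch of M-step vision withdrawal on a toy per-layer KV cache.
# Assumes vision tokens occupy a contiguous span [vision_start, vision_end);
# all names and shapes here are illustrative, not the repository's real API.
import torch

def withdraw_vision_tokens(past_key_values, vision_start, vision_end):
    """Remove the cached keys/values of the vision tokens from every layer."""
    trimmed = []
    for keys, values in past_key_values:
        # keys, values: (batch, num_heads, seq_len, head_dim)
        keys = torch.cat([keys[:, :, :vision_start], keys[:, :, vision_end:]], dim=2)
        values = torch.cat([values[:, :, :vision_start], values[:, :, vision_end:]], dim=2)
        trimmed.append((keys, values))
    return trimmed

# Toy example: 2 layers, a 600-token prompt whose positions 5..581 are vision tokens,
# and M = 10 generated text tokens before withdrawal.
M = 10
cache = [(torch.randn(1, 8, 600, 64), torch.randn(1, 8, 600, 64)) for _ in range(2)]
for step in range(20):
    # ... one decoding step using `cache` would go here ...
    if step == M:  # after M generated text tokens, withdraw the vision tokens
        cache = withdraw_vision_tokens(cache, vision_start=5, vision_end=581)
```

Because the withdrawal only shrinks the cache, it composes naturally with standard KV caching and attention kernels such as FlashAttention.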
This project requires two separate Conda environments to avoid dependency conflicts:
- `llava-cot` → for running LLaVA-CoT
- `kimi-vl` → for running Kimi-VL
The base LLaVA-CoT setup requires modifications to the transformers library's `processing_mllama.py` file. As such, the transformers library is installed from a local clone in this project.
Follow these steps:
```bash
# Create and activate the environment
conda create -n llava-cot python=3.10 -y
conda activate llava-cot

# Install basic dependencies
pip install -r requirements_llava.txt

# Install the development version of transformers
cd transformers
pip install -e .
```

To install the base Kimi-VL setup, follow these steps:
```bash
# Create and activate the environment
conda create -n kimi-vl python=3.10 -y
conda activate kimi-vl

# Install dependencies
pip install -r requirements_kimi.txt
```

Once your environment is set up, you have two options:
- Test simple inference using the `inference_kimi.py` or `inference_llava.py` scripts (a rough sketch follows below).
- Run full evaluations with VLMEvalKit.
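For the first option, the bundled scripts are the intended entry points. As a rough orientation only, simple inference with an mllama-based model such as LLaVA-CoT under the patched transformers install might look like the sketch below; the checkpoint path, image, and prompt are placeholders, and `inference_llava.py` remains the authoritative reference.

```python
# Rough sketch of simple inference with an mllama-based checkpoint (e.g. LLaVA-CoT).
# The checkpoint path, image, and prompt are placeholders; see inference_llava.py
# in this repository for the actual setup used here.
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "path/to/llava-cot-checkpoint"  # placeholder: substitute the real checkpoint
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder image
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe the image and reason step by step."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```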
To control which GPU is used, set the `CUDA_VISIBLE_DEVICES` environment variable, and specify the model you want to evaluate.
```bash
# Run eval for Kimi-VL
CUDA_VISIBLE_DEVICES=0 python VLMEvalKit/run.py \
    --data WeMath \
    --model Kimi-VL-A3B-Thinking-2506_local \
    --verbose \
    --mode all

# Run eval for LLaVA-CoT
CUDA_VISIBLE_DEVICES=1 python VLMEvalKit/run.py \
    --data WeMath \
    --model LLaVA-CoT_local \
    --verbose \
    --mode all
```

This work is based on the following repositories: