MVW (M-step Vision Withdrawal) is a method for efficient multimodal reasoning that reduces computation and memory usage in vision-language models (VLMs). During autoregressive generation, attention to visual tokens typically drops after the initial reasoning steps. MVW leverages this by withdrawing the vision tokens after a small number of generated text tokens, retaining reasoning accuracy while accelerating inference. The approach is training-free and fully compatible with KV caching, FlashAttention, and token pruning strategies such as VTW. It works across multiple models, including LLaVA-CoT and Kimi-VL.
Figure 1: Left: an example input to a reasoning VLM, consisting of an image and a text prompt followed by the generated reasoning. Right: the total attention assigned to vision tokens as a function of the generated token index; attention to vision tokens drops rapidly after the first few tokens.
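Conceptually, MVW amounts to dropping the cached keys and values of the vision tokens once a fixed number of text tokens has been generated. The snippet below is a minimal sketch of that idea on a toy per-layer KV cache; the function name, tensor shapes, and token positions are illustrative assumptions rather than the repository's actual implementation.

```python
# Minimal sketch of M-step vision withdrawal on a toy per-layer KV cache.
# Assumes vision tokens occupy a contiguous span [vision_start, vision_end);
# all names and shapes here are illustrative, not the repository's real API.
import torch

def withdraw_vision_tokens(past_key_values, vision_start, vision_end):
    """Remove the cached keys/values of the vision tokens from every layer."""
    trimmed = []
    for keys, values in past_key_values:
        # keys, values: (batch, num_heads, seq_len, head_dim)
        keys = torch.cat([keys[:, :, :vision_start], keys[:, :, vision_end:]], dim=2)
        values = torch.cat([values[:, :, :vision_start], values[:, :, vision_end:]], dim=2)
        trimmed.append((keys, values))
    return trimmed

# Toy example: 2 layers, a 600-token prompt whose positions 5..581 are vision tokens,
# and M = 10 generated text tokens before withdrawal.
M = 10
cache = [(torch.randn(1, 8, 600, 64), torch.randn(1, 8, 600, 64)) for _ in range(2)]
for step in range(20):
    # ... one decoding step using `cache` would go here ...
    if step == M:  # after M generated text tokens, withdraw the vision tokens
        cache = withdraw_vision_tokens(cache, vision_start=5, vision_end=581)
```

Because the withdrawal only shrinks the cache, it composes naturally with standard KV caching and attention kernels such as FlashAttention.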
This project requires two separate Conda environments to avoid dependency conflicts:
- `llava-cot` → for running LLaVA-CoT
- `kimi-vl` → for running Kimi-VL
The base LLaVA-CoT setup requires modifications to the transformers library's `processing_mllama.py` file. As such, the transformers library is installed from a local clone in this project.
Follow these steps:
```bash
# Create and activate the environment
conda create -n llava-cot python=3.10 -y
conda activate llava-cot

# Install basic dependencies
pip install -r requirements_llava.txt

# Install the development version of transformers
cd transformers
pip install -e .
```

To install the base Kimi-VL setup, follow these steps:
```bash
# Create and activate the environment
conda create -n kimi-vl python=3.10 -y
conda activate kimi-vl

# Install dependencies
pip install -r requirements_kimi.txt
```

Once your environment is set up, you have two options:
- Test simple inference using the `inference_kimi.py` or `inference_llava.py` scripts (a rough sketch follows below).
- Run full evaluations with VLMEvalKit.
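For the first option, the bundled scripts are the intended entry points. As a rough orientation only, simple inference with an mllama-based model such as LLaVA-CoT under the patched transformers install might look like the sketch below; the checkpoint path, image, and prompt are placeholders, and `inference_llava.py` remains the authoritative reference.

```python
# Rough sketch of simple inference with an mllama-based checkpoint (e.g. LLaVA-CoT).
# The checkpoint path, image, and prompt are placeholders; see inference_llava.py
# in this repository for the actual setup used here.
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "path/to/llava-cot-checkpoint"  # placeholder: substitute the real checkpoint
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder image
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe the image and reason step by step."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```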
To control which GPU is used, set the `CUDA_VISIBLE_DEVICES` environment variable, and specify the model you want to evaluate.
```bash
# Run eval for Kimi-VL
CUDA_VISIBLE_DEVICES=0 python VLMEvalKit/run.py \
    --data WeMath \
    --model Kimi-VL-A3B-Thinking-2506_local \
    --verbose \
    --mode all

# Run eval for LLaVA-CoT
CUDA_VISIBLE_DEVICES=1 python VLMEvalKit/run.py \
    --data WeMath \
    --model LLaVA-CoT_local \
    --verbose \
    --mode all
```

This work is based on the following repositories: