MVW: Reasoning with Fewer Eyes

MVW (M-step Vision Withdrawal) is a training-free method for efficient multimodal reasoning that reduces computation and memory usage in vision-language models (VLMs). During autoregressive generation, the attention assigned to visual tokens typically drops sharply after the first few reasoning steps. MVW exploits this by withdrawing the vision tokens once a small number of text tokens (M steps) have been generated, retaining reasoning accuracy while accelerating inference. The approach requires no retraining and is fully compatible with KV caching, FlashAttention, and token pruning strategies such as VTW. It works across several models, including LLaVA-CoT and Kimi-VL.
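
To make the idea concrete, the sketch below shows how the cached keys and values of the vision-token span could be dropped from a Hugging Face-style KV cache once the first M text tokens have been generated. The function name, tensor layout, and token positions are illustrative assumptions, not the repository's actual implementation; a full integration must also adjust attention masks and position ids inside the generation loop.

# Conceptual sketch of the vision-withdrawal step (not the repo's code).
# Assumes a KV cache laid out as a tuple of per-layer (key, value) tensors
# of shape (batch, num_heads, seq_len, head_dim), with vision tokens in a
# known contiguous span [vision_start, vision_end) of the cached prompt.
import torch

def withdraw_vision_tokens(past_key_values, vision_start, vision_end):
    """Drop the cached keys/values of the vision-token span from every layer."""
    pruned = []
    for key, value in past_key_values:
        key = torch.cat([key[..., :vision_start, :], key[..., vision_end:, :]], dim=-2)
        value = torch.cat([value[..., :vision_start, :], value[..., vision_end:, :]], dim=-2)
        pruned.append((key, value))
    return tuple(pruned)

# Dummy cache: 2 layers, batch 1, 8 heads, 600 cached positions, head_dim 64,
# with hypothetical vision tokens at positions 5..581.
cache = tuple((torch.randn(1, 8, 600, 64), torch.randn(1, 8, 600, 64)) for _ in range(2))
pruned = withdraw_vision_tokens(cache, vision_start=5, vision_end=581)
print(pruned[0][0].shape)  # torch.Size([1, 8, 24, 64]) -- vision entries removed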

Figure 1: Left: an example input to a reasoning VLM, consisting of an image and a text prompt, followed by the model's reasoning. Right: the total attention assigned to vision tokens as a function of the generated token index. Attention to vision tokens drops rapidly after the first few generated tokens.

Environment Setup

This project requires two separate Conda environments to avoid dependency conflicts:

  • llava-cot → for running LLaVA-CoT
  • kimi-vl → for running Kimi-VL

1. Setup LLaVA-CoT Environment

The base LLaVA-CoT setup requires modifications to the transformers library's processing_mllama.py file. As such, a modified clone of the transformers library is included in this project and installed in editable mode.

Follow these steps:

# Create and activate the environment
conda create -n llava-cot python=3.10 -y
conda activate llava-cot

# Install basic dependencies
pip install -r requirements_llava.txt

# Install the development version of transformers
cd transformers
pip install -e .
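
As an optional sanity check, verify that the editable install is picked up. The expected path below is an assumption based on the layout described above:

import transformers
# Should print a path inside the local ./transformers clone, e.g.
# .../MVW/transformers/src/transformers/__init__.py
print(transformers.__version__, transformers.__file__)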

2. Setup Kimi-VL Environment

To install the base Kimi-VL setup, follow these steps:

# Create and activate the environment
conda create -n kimi-vl python=3.10 -y
conda activate kimi-vl

# Install dependencies
pip install -r requirements_kimi.txt
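
Optionally, confirm that PyTorch in this environment can see a GPU before running inference (assuming requirements_kimi.txt installs a CUDA-enabled PyTorch build):

import torch
# True and >= 1 if a CUDA-capable GPU is visible to this environment
print(torch.cuda.is_available(), torch.cuda.device_count())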

Inference

Once your environment is set up, you have two options:

  1. Test simple inference using the inference_kimi.py or inference_llava.py scripts (a generic sketch of such a call is shown after this list).
  2. Run full evaluations with VLMEvalKit.
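
For option 1, the sketch below shows what a single-image inference call typically looks like with the Hugging Face transformers API for a Llama-3.2-Vision-style model, which LLaVA-CoT builds on. The model id, image path, and prompt are placeholders, and the repository's scripts may differ in their exact arguments and setup.

# Generic single-image VLM inference sketch (placeholders, not the repo scripts).
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "path/or/hub-id-of-the-LLaVA-CoT-checkpoint"  # placeholder
processor = AutoProcessor.from_pretrained(model_id)
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("example.jpg")  # placeholder image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe the image and reason step by step."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))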

To control which GPU is used, set the CUDA_VISIBLE_DEVICES environment variable; select the model and benchmark with the --model and --data flags.

# Run eval for Kimi-VL
CUDA_VISIBLE_DEVICES=0 python VLMEvalKit/run.py \
    --data WeMath \
    --model Kimi-VL-A3B-Thinking-2506_local \
    --verbose \
    --mode all

# Run eval for LLaVA-CoT
CUDA_VISIBLE_DEVICES=1 python VLMEvalKit/run.py \
    --data WeMath \
    --model LLaVA-CoT_local \
    --verbose \
    --mode all

Acknowledgments

This work is based on the following repositories:

  • LLaVA-CoT
  • Kimi-VL
  • VLMEvalKit
  • transformers
