🎬 VideoThinker: Beyond Perceptual Shortcuts

🤗 Model | 📑 Paper

🚀 CVPR 2026 "Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs" 🚀

Jingze Wu, Quan Zhang*, Hongfei Suo, Zeqiang Cai, Hongbo Chen*

Sun Yat-sen University, China

*Corresponding authors

VideoThinker is a causal-inspired framework that enables lightweight multimodal language models (3B parameters) to achieve robust video reasoning. Using only 1K training samples and no SFT, VideoThinker-R1 surpasses models trained on 110K samples and even outperforms larger 7B models on reasoning-heavy benchmarks.

🤔 The Problem: Perceptual Shortcuts in Video Reasoning

Consider this video reasoning scenario from the CLEVRER dataset:

Question: "What will happen if we remove the [object]?"

Two Types of Questions:

🟢 Observational (Easy): "What will happen if we remove the yellow ball?"
→ The yellow ball is causally irrelevant. The model can simply describe what is visible in the video without real reasoning.

🔴 Inferential (Hard): "What will happen if we remove the cyan cylinder?"
→ The cylinder acts as a causal blocker. The model must perform counterfactual reasoning about an alternative scenario.

📊 The Critical Finding

We found that 74% of the training data consists of observational questions, which creates a severe data bias. Our diagnostic experiments reveal:

✅ 3B base model (zero-shot): 73.9% on inferential questions
❌ 3B after GRPO fine-tuning: 63.1% on inferential questions (-10.8 points)
✅ 7B models remain robust (parameter capacity compensates for bias)

The problem: Lightweight 3B models learn perceptual shortcuts from biased data, actively unlearning their reasoning abilities during RL fine-tuning!

💡 Our Solution: VideoThinker

We propose a simple yet effective two-stage framework:

Stage 1: Bias Aware Training
Train a "bias model" that explicitly learns perceptual shortcuts (trained only on observational questions).

Stage 2: Causal Debiasing Policy Optimization (CDPO)
Fine-tune the main model to actively push away from the bias model's behavior using a repulsive objective, while simultaneously pulling toward correct answers.

💡 Key Innovation: The KL-divergence term enters the objective with a positive sign (configured as a negative beta, beta=-0.01, in the training scripts), so training maximizes the distance from the bias model instead of minimizing it. This forces the model to discover genuine reasoning pathways.

Technical Details: See our paper for the formal causal analysis (Structural Causal Model) and complete algorithm.
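For readers who want the rough shape of the objective before opening the paper, here is a sketch. It assumes CDPO keeps the standard GRPO clipped surrogate and changes only the reference policy (the bias model) and the sign of the KL coefficient. The symbols G (group size), r_i (probability ratio), \hat{A}_i (group-normalized advantage), and \epsilon (clip range) are borrowed from common GRPO notation and are assumptions here; the paper's exact formulation may differ.

```latex
\mathcal{J}(\theta)
  = \mathbb{E}\left[
      \frac{1}{G}\sum_{i=1}^{G}
      \Big(
        \min\big(r_i(\theta)\,\hat{A}_i,\;
                 \operatorname{clip}(r_i(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\big)
        \;-\; \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{bias}}\big)
      \Big)
    \right],
  \qquad \beta < 0
```

Because \beta is negative, the term -\beta D_KL is positive, so maximizing the objective rewards divergence from the bias model rather than penalizing it.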

📊 Main Results

🎥 Demo Video

Comparsion_with_VideoRFT_on_Benchmarks.mp4

Comparison with VideoRFT on multiple video reasoning benchmarks

Performance Summary

Model	Training Data	CLEVRER	MMVU	MVBench	TempCompass	VideoMME
Video-UTR-7B	-	-	-	58.8	59.7	52.6
Qwen2.5-VL-3B (CoT)	Zero-shot	44.7	52.8	49.6	30.0	52.0
VideoRFT-3B	110K SFT+RL	59.3	55.1	59.5	61.0	45.4
Qwen2.5-VL-GRPO	1K RL	64.9	52.0	54.9	41.4	50.3
VideoThinker-R1	1K RL	79.1	56.8	60.9	63.5	52.4

Key Takeaways

✅ Sample Efficiency: Outperforms VideoRFT-3B using only 0.9% of training data (1K vs 110K)
✅ Reasoning Boost: +14.2 points over the GRPO baseline on CLEVRER (79.1% vs 64.9%)
✅ Cross-Scale Win: Surpasses 7B models on reasoning-heavy benchmarks (MVBench, TempCompass)
✅ No SFT Needed: Direct RL fine-tuning without a supervised fine-tuning stage
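The headline numbers above can be re-derived from the Performance Summary table. The snippet below is a small arithmetic check rather than part of the repository; the `SCORES` dictionary and the `gain` helper are illustrative names, and the values are copied from the table.

```python
# Scores copied from the Performance Summary table (accuracy in percent).
SCORES = {
    "Video-UTR-7B": {"MVBench": 58.8, "TempCompass": 59.7, "VideoMME": 52.6},
    "VideoRFT-3B": {"CLEVRER": 59.3, "MMVU": 55.1, "MVBench": 59.5,
                    "TempCompass": 61.0, "VideoMME": 45.4},
    "Qwen2.5-VL-GRPO": {"CLEVRER": 64.9, "MMVU": 52.0, "MVBench": 54.9,
                        "TempCompass": 41.4, "VideoMME": 50.3},
    "VideoThinker-R1": {"CLEVRER": 79.1, "MMVU": 56.8, "MVBench": 60.9,
                        "TempCompass": 63.5, "VideoMME": 52.4},
}


def gain(model, baseline, benchmark):
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(SCORES[model][benchmark] - SCORES[baseline][benchmark], 1)


# Share of VideoRFT's 110K samples that the 1K-sample setup uses, in percent.
data_fraction = round(1_000 / 110_000 * 100, 1)

if __name__ == "__main__":
    print("CLEVRER gain over GRPO:", gain("VideoThinker-R1", "Qwen2.5-VL-GRPO", "CLEVRER"))
    print("MVBench gain over Video-UTR-7B:", gain("VideoThinker-R1", "Video-UTR-7B", "MVBench"))
    print("Training data used (%):", data_fraction)
```

Note that the differences are in percentage points, not relative percentages.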

🛠️ Setup

Note

💻 Training is conducted on 2 x NVIDIA RTX A6000 GPUs (48GB each). Training the bias model and main model (500 steps each) takes approximately 8 hours total.

To implement VideoThinker training, complete these three steps: dependency installation, model backbone download, and training dataset download.

🛠️ Step 1: Environment Setup and Dependency Installation

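# Clone the repository and create the conda environment
git clone https://github.com/falonss703/VideoThinker
cd VideoThinker
conda create -n videothinker python=3.10
conda activate videothinker

# Install VideoThinker and FlashAttention
pip3 install -e ".[dev]"
pip3 install flash_attn --no-build-isolation

# Install transformers from source, pinned to v4.50.0
git clone https://github.com/huggingface/transformers
cd transformers
git checkout v4.50.0
pip install .

# Install the bundled qwen-vl-utils package and the decord video reader
cd ../qwen-vl-utils
pip install -e .
pip install decord
cd ..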
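After installation, a quick sanity check can confirm that the key packages are importable. This is a minimal sketch, not a script shipped with the repository. The import names (`flash_attn`, `qwen_vl_utils`, `decord`) are assumed to match the packages installed above, and the expected version string assumes the `v4.50.0` tag reports itself as `4.50.0`.

```python
from importlib import metadata, util

# Import names assumed for the packages installed in Step 1.
REQUIRED_MODULES = ["transformers", "flash_attn", "qwen_vl_utils", "decord"]
EXPECTED_TRANSFORMERS = "4.50.0"


def check_environment(modules=REQUIRED_MODULES):
    """Return {module_name: True/False} depending on whether it can be found."""
    return {name: util.find_spec(name) is not None for name in modules}


def transformers_version_ok(expected=EXPECTED_TRANSFORMERS):
    """True only if transformers is installed and reports the expected version."""
    try:
        return metadata.version("transformers") == expected
    except metadata.PackageNotFoundError:
        return False


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
    print("transformers pinned correctly:", transformers_version_ok())
```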

📥 Step 2: Download Model Backbone

Download the Qwen2.5-VL-3B-Instruct model:

pip install -U huggingface_hub
huggingface-cli download --resume-download Qwen/Qwen2.5-VL-3B-Instruct --local-dir Qwen/Qwen2.5-VL-3B-Instruct
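To confirm the download finished, a small check of the local directory can help. The sketch below assumes the checkpoint ships a `config.json` and one or more `*.safetensors` weight files, which is the usual Hugging Face layout but is not stated in this README; `check_model_dir` is an illustrative helper.

```python
from pathlib import Path


def check_model_dir(model_dir):
    """Report whether a config file and any safetensors weights are present."""
    root = Path(model_dir)
    has_config = (root / "config.json").is_file()
    # Weight shard names vary, so only the extension is checked (assumption).
    weights = sorted(p.name for p in root.glob("*.safetensors")) if root.is_dir() else []
    return {"config": has_config, "weight_files": weights}


if __name__ == "__main__":
    print(check_model_dir("Qwen/Qwen2.5-VL-3B-Instruct"))
```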

🎥 Step 3: Download Training Dataset

🧩 CLEVRER (Primary Training Dataset)

We use counterfactual tasks from CLEVRER for training. The JSON annotation files are already in data/CLEVRER, but the videos must be downloaded separately:

# Create directories
mkdir -p data/CLEVRER/{train_video,validation_video}

# Download training videos
wget -P data/CLEVRER/train_video http://data.csail.mit.edu/clevrer/videos/train/video_train.zip
unzip data/CLEVRER/train_video/video_train.zip -d data/CLEVRER/train_video
rm data/CLEVRER/train_video/video_train.zip

# Download validation videos
wget -P data/CLEVRER/validation_video http://data.csail.mit.edu/clevrer/videos/validation/video_validation.zip
unzip data/CLEVRER/validation_video/video_validation.zip -d data/CLEVRER/validation_video
rm data/CLEVRER/validation_video/video_validation.zip
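Once the archives are unpacked, counting the extracted files is a quick way to spot a truncated download. This is a sketch only: it assumes the videos use the `.mp4` extension and may sit in nested subfolders, and `count_videos` is an illustrative helper rather than a repository function.

```python
from pathlib import Path


def count_videos(root, suffix=".mp4"):
    """Count files with the given suffix anywhere under root (0 if root is missing)."""
    root = Path(root)
    if not root.is_dir():
        return 0
    return sum(1 for p in root.rglob(f"*{suffix}") if p.is_file())


if __name__ == "__main__":
    for split in ("train_video", "validation_video"):
        print(split, count_videos(Path("data/CLEVRER") / split))
```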

🌐 Evaluation Benchmarks

The evaluation datasets are provided with download links below. If you only want to reproduce CLEVRER results, you can skip this step.

📊 Dataset	💾 Size	🔗 Link
CLEVRER	30GB	📥 Download
MMVU	0.9GB	📥 Download
Video-Holmes	5GB	📥 Download
MVBench	16GB	📥 Download
TempCompass	0.4GB	📥 Download
Video-MME	94GB	📥 Download

📝 Important Notes:

JSON files are in data/ directory

Organize video data according to standard dataset structures

See data/README.md for detailed directory organization

🏃‍♂️ Training

VideoThinker employs a two-stage training process to effectively debias lightweight MLLMs:

Stage 1: Bias Aware Training

First, train a dedicated "bias model" that embodies perceptual shortcuts:

bash scripts/bias_training.sh

This stage constructs a model that specifically learns to exploit observational shortcuts in the data, serving as a negative exemplar for the debiasing process. The bias model is trained on observational questions with beta=0.0.

Stage 2: Causal Debiasing Policy Optimization (CDPO)

Once the bias model is trained, run CDPO to train the main model while actively pushing it away from biased reasoning:

bash scripts/grpo.sh

CDPO employs a repulsive objective: a negative beta coefficient (beta=-0.01) turns the KL regularizer into a repulsive force that steers the model away from the bias model's shortcuts while still attracting it toward correct reasoning.

Tip

Key Configuration Differences:

Bias Training: Uses beta=0.0 and trains on observational questions only
CDPO: Uses beta=-0.01 (negative!) and references the bias model via --ref_model_path
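To see why the sign of beta matters, consider a per-token loss of the kind used in GRPO-style trainers. The sketch below is an assumption-laden illustration, not the repository's code. It uses the low-variance "k3" KL estimator common in GRPO implementations, treats the update as on-policy (probability ratio of 1, no clipping), and the names `k3_kl` and `cdpo_token_loss` are hypothetical.

```python
import math


def k3_kl(logp_policy, logp_ref):
    """Non-negative per-token KL estimate between the policy and a reference."""
    d = logp_ref - logp_policy
    return math.exp(d) - d - 1.0


def cdpo_token_loss(logp_policy, logp_bias, advantage, beta=-0.01):
    """Loss to minimize: -(advantage - beta * KL).

    With beta < 0 the KL term lowers the loss, so moving away from the
    bias model is rewarded; with beta > 0 it is penalized as in plain GRPO.
    """
    kl = k3_kl(logp_policy, logp_bias)
    return -(advantage - beta * kl)


if __name__ == "__main__":
    same = cdpo_token_loss(-1.0, -1.0, advantage=1.0)  # policy equals bias model
    far = cdpo_token_loss(-1.0, -3.0, advantage=1.0)   # policy differs from bias model
    print("loss when matching bias model:", same)
    print("loss when diverging from bias model:", far)
```

With beta=-0.01 the diverging case has the lower loss, while a positive beta would reverse the ordering.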

⚙️ Training Configuration Options

We provide convenient parameter settings to help you verify the effects of different designs proposed in the paper:

🎚️ Beta Value (--beta):
- Bias Training: 0.0 (no KL regularization)
- CDPO: -0.01 (negative for repulsive objective)
- Controls the strength of the repulsive force
- More negative values increase repulsion from the bias model
📊 Dataset Selection (--jsonl_path):
- Bias Training: data/CLEVRER/clevrer_counterfactual_train_observational_bias_training.json
- CDPO: data/CLEVRER/clevrer_counterfactual_train.json
🎯 Reference Model (--ref_model_path):
- Only used in CDPO stage
- Path to the pre-trained bias model
- Example: $PRIVATE_DATA_ROOT/Training/Qwen2.5-VL-3B-Instruct_clevrer_counterfactual_bias_model/checkpoint-500
💾 Output Directory (--output_dir):
- Bias Training: $PRIVATE_DATA_ROOT/Training/Qwen2.5-VL-3B-Instruct_clevrer_counterfactual_bias_model
- CDPO: $PRIVATE_DATA_ROOT/Training/Qwen2.5-VL-3B-Instruct_clevrer_counterfactual_videothinker_r1
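The options above can be collected per stage as follows. This is a sketch for comparing the two configurations side by side. It only assembles the four flags listed in this section, the `stage_args` helper is hypothetical, and the training entry point the shell scripts pass these flags to is not shown here.

```python
import os

BIAS_NAME = "Qwen2.5-VL-3B-Instruct_clevrer_counterfactual_bias_model"
MAIN_NAME = "Qwen2.5-VL-3B-Instruct_clevrer_counterfactual_videothinker_r1"


def stage_args(stage, data_root=None):
    """Return the command-line flags for the 'bias' or 'cdpo' stage."""
    root = data_root or os.environ.get("PRIVATE_DATA_ROOT", ".")
    bias_dir = f"{root}/Training/{BIAS_NAME}"
    if stage == "bias":
        return [
            "--beta", "0.0",  # no KL regularization
            "--jsonl_path",
            "data/CLEVRER/clevrer_counterfactual_train_observational_bias_training.json",
            "--output_dir", bias_dir,
        ]
    if stage == "cdpo":
        return [
            "--beta", "-0.01",  # negative: repel from the bias model
            "--jsonl_path", "data/CLEVRER/clevrer_counterfactual_train.json",
            "--ref_model_path", f"{bias_dir}/checkpoint-500",
            "--output_dir", f"{root}/Training/{MAIN_NAME}",
        ]
    raise ValueError(f"unknown stage: {stage!r}")


if __name__ == "__main__":
    for stage in ("bias", "cdpo"):
        print(stage, " ".join(stage_args(stage)))
```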

📊 Evaluation

After downloading the datasets and completing training, or downloading our pre-trained model (available at HuggingFace), you can evaluate VideoThinker-R1 using:

bash scripts/eval_bench.sh

🙏 References & Acknowledgements

We sincerely thank the open-source community for its contributions, including the excellent work behind Qwen2-VL, Video-R1, VideoRFT, and CLEVRER.

📖 Citation

If you find VideoThinker useful in your research, please consider citing our CVPR 2026 paper:

@inproceedings{wu2026videothinker,
  title={Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs},
  author={Wu, Jingze and Zhang, Quan and Suo, Hongfei and Cai, Zeqiang and Chen, Hongbo},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}

🌟 Star this repo if you find it helpful! 🌟
