🤗 Model | 📄 Paper
🎉 CVPR 2026 "Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs" 🎉
Jingze Wu, Quan Zhang*, Hongfei Suo, Zeqiang Cai, Hongbo Chen*
Sun Yat-sen University, China
*Corresponding authors
VideoThinker is a causal-inspired framework that enables lightweight multimodal large language models (MLLMs, 3B parameters) to achieve robust video reasoning. Using only 1K training samples and no SFT, VideoThinker-R1 surpasses models trained on 110K samples and even outperforms larger 7B models on reasoning-heavy benchmarks.
Consider this video reasoning scenario from CLEVRER dataset:
Question: "What will happen if we remove the [object]?"
Two Types of Questions:
🟢 Observational (Easy): "What will happen if we remove the yellow ball?"
→ The yellow ball is causally irrelevant. The model can simply describe what is visible in the video without any real reasoning.
🔴 Inferential (Hard): "What will happen if we remove the cyan cylinder?"
→ The cylinder acts as a causal blocker. The model must perform counterfactual reasoning about an alternative scenario.
We discovered that 74% of the training data consists of observational questions, creating a severe data bias. Our diagnostic experiments reveal:
- ✅ 3B base model (zero-shot): 73.9% on inferential questions
- ❌ 3B after GRPO fine-tuning: 63.1% on inferential questions (-10.8%)
- ✅ 7B models remain robust (parameter capacity compensates for the bias)
The problem: lightweight 3B models learn perceptual shortcuts from biased data, actively unlearning their reasoning abilities during RL fine-tuning!
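The bias diagnosis above starts from a simple tally of question types. A minimal sketch of such a tally is below; the JSON-array layout and the `question_type` field name are illustrative assumptions about the annotation schema, not the released file format:

```python
import json

def question_type_shares(path, type_key="question_type"):
    """Tally question types in an annotation file to quantify the
    observational/inferential imbalance. The JSON-array layout and
    the `question_type` key are assumptions for illustration."""
    with open(path) as f:
        items = json.load(f)
    counts = {}
    for item in items:
        t = item.get(type_key, "unknown")
        counts[t] = counts.get(t, 0) + 1
    total = sum(counts.values()) or 1
    # Return each type's fraction of the dataset
    return {t: round(n / total, 3) for t, n in counts.items()}
```

On a file where roughly three of every four questions are observational, this would report shares near 0.74/0.26, matching the imbalance described above.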
We propose a simple yet effective two-stage framework:
Stage 1: Bias-Aware Training
Train a "bias model" that explicitly learns perceptual shortcuts (trained only on observational questions).
Stage 2: Causal Debiasing Policy Optimization (CDPO)
Fine-tune the main model to actively push away from the bias model's behavior using a repulsive objective, while simultaneously pulling toward correct answers.
💡 Key Innovation: We flip the sign of the KL-divergence coefficient (a negative beta), so the model maximizes its distance from the bias model instead of minimizing it. This forces the model to discover genuine reasoning pathways.
Technical Details: See our paper for the formal causal analysis (Structural Causal Model) and complete algorithm.
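The sign trick at the heart of the two stages can be sketched in a few lines. This is an illustrative sketch, not the paper's exact objective: it uses the k3 KL estimator common in GRPO-style trainers, and the function and variable names are ours:

```python
import math

def kl_term(policy_logprob, bias_logprob, beta):
    """Per-token KL term against the bias (reference) model, using the
    k3 estimator (an assumption; see the paper for the exact objective).
    The term is added to the loss:
      beta > 0  penalizes divergence from the reference (standard GRPO),
      beta = 0  removes the term (Stage 1, bias training),
      beta < 0  rewards divergence (Stage 2, CDPO's repulsive objective).
    """
    log_ratio = bias_logprob - policy_logprob
    kl = math.exp(log_ratio) - log_ratio - 1.0  # k3 estimate, always >= 0
    return beta * kl

# Policy assigns 0.7 to a token the bias model gives only 0.1:
divergent = kl_term(math.log(0.7), math.log(0.1), beta=-0.01)
print(divergent < 0)  # with beta < 0, divergence lowers the loss
```

Because the k3 estimate is non-negative, the sign of beta alone decides whether divergence from the bias model is punished or rewarded.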
Comparison_with_VideoRFT_on_Benchmarks.mp4
Comparison with VideoRFT on multiple video reasoning benchmarks
| Model | Training Data | CLEVRER | MMVU | MVBench | TempCompass | VideoMME |
|---|---|---|---|---|---|---|
| Video-UTR-7B | - | - | - | 58.8 | 59.7 | 52.6 |
| Qwen2.5-VL-3B (CoT) | Zero-shot | 44.7 | 52.8 | 49.6 | 30.0 | 52.0 |
| VideoRFT-3B | 110K SFT+RL | 59.3 | 55.1 | 59.5 | 61.0 | 45.4 |
| Qwen2.5-VL-GRPO | 1K RL | 64.9 | 52.0 | 54.9 | 41.4 | 50.3 |
| VideoThinker-R1 | 1K RL | 79.1 | 56.8 | 60.9 | 63.5 | 52.4 |
- ✅ Sample Efficiency: Outperforms VideoRFT-3B using only 0.9% of its training data (1K vs. 110K samples)
- ✅ Reasoning Boost: +14.2% over the GRPO baseline on CLEVRER (79.1% vs. 64.9%)
- ✅ Cross-Scale Win: Surpasses 7B models on reasoning-heavy benchmarks (MVBench, TempCompass)
- ✅ No SFT Needed: Direct RL fine-tuning without supervised pre-training
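The headline deltas follow directly from the benchmark table:

```python
# Sanity-check the highlighted numbers against the table above.
clevrer_videothinker = 79.1   # VideoThinker-R1, 1K RL
clevrer_grpo = 64.9           # Qwen2.5-VL-GRPO baseline, 1K RL
boost = round(clevrer_videothinker - clevrer_grpo, 1)
print(boost)  # 14.2 points on CLEVRER

data_fraction = 1_000 / 110_000  # 1K RL samples vs. VideoRFT-3B's 110K
print(round(100 * data_fraction, 1))  # 0.9 percent of the training data
```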
Note
💻 Training is conducted on 2× NVIDIA RTX A6000 GPUs (48 GB each). Training the bias model and the main model (500 steps each) takes approximately 8 hours in total.
To implement VideoThinker training, complete these three steps: dependency installation, model backbone download, and training dataset download.
git clone https://github.com/falonss703/VideoThinker
cd VideoThinker
conda create -n videothinker python=3.10
conda activate videothinker
pip3 install -e ".[dev]"
pip3 install flash_attn --no-build-isolation
git clone https://github.com/huggingface/transformers
cd transformers
git checkout v4.50.0
pip install .
cd ../qwen-vl-utils
pip install -e .
pip install decord
cd ..
Download the Qwen2.5-VL-3B-Instruct model:
pip install -U huggingface_hub
huggingface-cli download --resume-download Qwen/Qwen2.5-VL-3B-Instruct --local-dir Qwen/Qwen2.5-VL-3B-Instruct
We use counterfactual tasks from CLEVRER for training. The JSON files are in data/CLEVRER, but you need to download the videos:
# Create directories
mkdir -p data/CLEVRER/{train_video,validation_video}
# Download training videos
wget -P data/CLEVRER/train_video http://data.csail.mit.edu/clevrer/videos/train/video_train.zip
unzip data/CLEVRER/train_video/video_train.zip -d data/CLEVRER/train_video
rm data/CLEVRER/train_video/video_train.zip
# Download validation videos
wget -P data/CLEVRER/validation_video http://data.csail.mit.edu/clevrer/videos/validation/video_validation.zip
unzip data/CLEVRER/validation_video/video_validation.zip -d data/CLEVRER/validation_video
rm data/CLEVRER/validation_video/video_validation.zip
The evaluation datasets are provided with download links below. If you only want to reproduce the CLEVRER results, you can skip this step.
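Before moving on, the CLEVRER download above can be verified with a quick count of the extracted videos. A minimal sketch; the split directory names match the mkdir commands above, and CLEVRER ships .mp4 files:

```python
from pathlib import Path

def count_clevrer_videos(root="data/CLEVRER", ext=".mp4"):
    """Count extracted videos per split to confirm the download/unzip
    steps succeeded. Missing split directories are simply omitted."""
    counts = {}
    for split in ("train_video", "validation_video"):
        split_dir = Path(root) / split
        if split_dir.exists():
            counts[split] = sum(1 for _ in split_dir.rglob(f"*{ext}"))
    return counts

print(count_clevrer_videos())
```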
| 📁 Dataset | 💾 Size | 🔗 Link |
|---|---|---|
| CLEVRER | 30 GB | 📥 Download |
| MMVU | 0.9 GB | 📥 Download |
| Video-Holmes | 5 GB | 📥 Download |
| MVBench | 16 GB | 📥 Download |
| TempCompass | 0.4 GB | 📥 Download |
| Video-MME | 94 GB | 📥 Download |
📌 Important Notes:
- JSON files are in the `data/` directory
- Organize video data according to standard dataset structures
- See `data/README.md` for detailed directory organization
VideoThinker employs a two-stage training process to effectively debias lightweight MLLMs:
First, train a dedicated "bias model" that embodies perceptual shortcuts:
bash scripts/bias_training.sh
This stage constructs a model that specifically learns to exploit observational shortcuts in the data, serving as a negative exemplar for the debiasing process. The bias model is trained on observational questions only, with beta=0.0.
Once the bias model is trained, run CDPO to train the main model while actively pushing it away from biased reasoning:
bash scripts/grpo.sh
CDPO uses a repulsive objective with a negative beta coefficient (beta=-0.01), transforming the KL regularizer into a repulsive force that steers the model away from the bias model's shortcuts while attracting it toward correct reasoning.
Tip
Key Configuration Differences:
- Bias Training: uses `beta=0.0` and trains on observational questions only
- CDPO: uses `beta=-0.01` (negative!) and references the bias model via `--ref_model_path`
We provide convenient parameter settings to help you verify the effects of different designs proposed in the paper:
- 🎛️ Beta Value (`--beta`):
  - Bias Training: `0.0` (no KL regularization)
  - CDPO: `-0.01` (negative for the repulsive objective)
  - Controls the strength of the repulsive force; more negative values increase repulsion from the bias model
- 📊 Dataset Selection (`--jsonl_path`):
  - Bias Training: `data/CLEVRER/clevrer_counterfactual_train_observational_bias_training.json`
  - CDPO: `data/CLEVRER/clevrer_counterfactual_train.json`
- 🎯 Reference Model (`--ref_model_path`):
  - Only used in the CDPO stage
  - Path to the pre-trained bias model
  - Example: `$PRIVATE_DATA_ROOT/Training/Qwen2.5-VL-3B-Instruct_clevrer_counterfactual_bias_model/checkpoint-500`
- 💾 Output Directory (`--output_dir`):
  - Bias Training: `$PRIVATE_DATA_ROOT/Training/Qwen2.5-VL-3B-Instruct_clevrer_counterfactual_bias_model`
  - CDPO: `$PRIVATE_DATA_ROOT/Training/Qwen2.5-VL-3B-Instruct_clevrer_counterfactual_videothinker_r1`
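The settings above map onto the two launch scripts roughly as follows. This is a simplified sketch covering only the flags documented here; see scripts/bias_training.sh and scripts/grpo.sh for the full argument lists:

```python
def stage_flags(stage, data_root="$PRIVATE_DATA_ROOT"):
    """Return the key command-line flags for each training stage,
    per the settings above. Real runs pass many more arguments."""
    prefix = f"{data_root}/Training/Qwen2.5-VL-3B-Instruct_clevrer_counterfactual"
    if stage == "bias":
        return [
            "--beta", "0.0",  # no KL regularization in Stage 1
            "--jsonl_path", "data/CLEVRER/clevrer_counterfactual_train_observational_bias_training.json",
            "--output_dir", f"{prefix}_bias_model",
        ]
    if stage == "cdpo":
        return [
            "--beta", "-0.01",  # negative: repulsive objective
            "--jsonl_path", "data/CLEVRER/clevrer_counterfactual_train.json",
            "--ref_model_path", f"{prefix}_bias_model/checkpoint-500",
            "--output_dir", f"{prefix}_videothinker_r1",
        ]
    raise ValueError(f"unknown stage: {stage}")
```

Note how the CDPO stage is the only one that passes `--ref_model_path`, pointing at the Stage 1 checkpoint it must push away from.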
After downloading the datasets and completing training, or after downloading our pre-trained model (available on Hugging Face), you can evaluate VideoThinker-R1 using:
bash scripts/eval_bench.sh
We sincerely thank the contributions of the open-source community, including the awesome works of Qwen2-VL, Video-R1, VideoRFT, and CLEVRER.
If you find VideoThinker useful in your research, please consider citing our CVPR 2026 paper:
@inproceedings{wu2026videothinker,
title={Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs},
author={Wu, Jingze and Zhang, Quan and Suo, Hongfei and Cai, Zeqiang and Chen, Hongbo},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}


