# SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding

Official implementation of "SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding"

License: MIT | HuggingFace: Models | Paper: arXiv

## 🔍 Overview

SDAR-VL is the first large-scale block-wise discrete diffusion framework for vision-language understanding (VLU).
It provides a stable and efficient alternative to autoregressive (AR) decoders by introducing an integrated training framework that improves convergence, stability, and efficiency.
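The block-wise decoding scheme can be illustrated with a toy sketch (plain NumPy; the random "denoiser", block size, and mask id are stand-ins for illustration, not the paper's implementation): positions within a block are filled in parallel over a few denoising steps, while blocks are produced left to right so each block conditions only on the finished prefix.

```python
import numpy as np

rng = np.random.default_rng(0)

SEQ_LEN, BLOCK = 12, 4      # 3 blocks of 4 tokens
MASK, VOCAB = -1, 10        # hypothetical mask id / vocab size

def denoise_block(tokens, start, end):
    """Toy stand-in for the denoiser: at each call, unmask roughly half
    of the remaining masked positions in the block, in parallel."""
    out = tokens.copy()
    masked = [i for i in range(start, end) if out[i] == MASK]
    for i in masked[: max(1, len(masked) // 2)]:
        out[i] = rng.integers(VOCAB)   # a real model's prediction goes here
    return out

tokens = np.full(SEQ_LEN, MASK)
for b in range(0, SEQ_LEN, BLOCK):          # causal inter-block order
    while MASK in tokens[b : b + BLOCK]:    # intra-block denoising loop
        tokens = denoise_block(tokens, b, b + BLOCK)

print(tokens)
```

The outer loop is autoregressive over blocks; only the inner denoising loop is parallel, which is where the efficiency over token-by-token AR decoding comes from.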

### Key Features

- 🧩 Block-wise Diffusion Backbone — parallel intra-block denoising with causal inter-block dependencies
- 🔄 Asynchronous Block-wise Noise Scheduling — diversifies supervision and smooths optimization
- ⚖️ Effective Mask Ratio Scaling — unbiased loss normalization under stochastic masking
- 📈 Progressive Beta Noise Curriculum — improves convergence and coverage over training
- 📊 SOTA Performance — matches or surpasses AR models such as LLaVA-OneVision under matched setups, and achieves state-of-the-art results among diffusion-based multimodal models
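One way to read the effective mask ratio scaling bullet (a sketch of the idea under our own assumptions, not the paper's exact formula): with stochastic masking, the realized number of masked tokens fluctuates around the nominal ratio `t`, so normalizing the summed loss by `t * L` is biased for any single draw, whereas normalizing by the count actually masked is not.

```python
import numpy as np

rng = np.random.default_rng(1)
L = 16                      # block length
t = 0.3                     # sampled (nominal) mask ratio for this block

per_token_loss = rng.random(L)     # stand-in for per-token CE loss
mask = rng.random(L) < t           # stochastic masking: count varies per draw

# Naive normalization by the nominal ratio: biased when the draw deviates.
naive = per_token_loss[mask].sum() / (t * L)

# Effective-ratio scaling: normalize by how many tokens were actually masked.
effective = per_token_loss[mask].sum() / max(mask.sum(), 1)

print(naive, effective)
```

With effective-ratio scaling, the normalized loss is exactly the mean loss over the masked positions, independent of how (un)lucky the masking draw was.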

## 🤗 Model Zoo

| Model | Type | Link |
| --- | --- | --- |
| SDAR-VL-Instruct-4B | Instruct | https://huggingface.co/JetLM/SDAR-VL-Instruct-4B |
| SDAR-VL-Instruct-8B | Instruct | https://huggingface.co/JetLM/SDAR-VL-Instruct-8B |
| SDAR-VL-Think-4B | Think | https://huggingface.co/JetLM/SDAR-VL-Think-4B |
| SDAR-VL-Think-8B | Think | https://huggingface.co/JetLM/SDAR-VL-Think-8B |

βš™οΈ Usage

### Inference

```shell
python generate.py
```

### Training

For detailed instructions on how to fine-tune the model on your own dataset, please refer to the guide in the training directory: training/README.md.

## 📬 Contact

For issues or inquiries, please open an issue on this repository.

## 🔬 Citation

```bibtex
@misc{cheng2025sdarvl,
  title        = {SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding},
  author       = {Cheng, Shuang and Jiang, Yuhua and Zhou, Zineng and Liu, Dawei and Wang, Tao and Zhang, Linfeng and Qi, Biqing and Zhou, Bowen},
  year         = {2025},
  note         = {Zhejiang University, Shanghai AI Laboratory, Tsinghua University, Shanghai Jiao Tong University, ByteDance}
}
```