# SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding

Official implementation of "SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding"

License: MIT | HuggingFace: Models | Paper: arXiv

## 🔍 Overview

SDAR-VL is the first large-scale block-wise discrete diffusion framework for vision-language understanding (VLU).
It provides a stable and efficient alternative to autoregressive (AR) decoders by introducing an integrated training framework that improves convergence, stability, and efficiency.
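The block-wise decoding scheme can be illustrated with a toy sketch (plain NumPy; the random "denoiser", block size, and mask id are stand-ins for illustration, not the paper's implementation): positions within a block are filled in parallel over a few denoising steps, while blocks are produced left to right so each block conditions only on the finished prefix.

```python
import numpy as np

rng = np.random.default_rng(0)

SEQ_LEN, BLOCK = 12, 4      # 3 blocks of 4 tokens
MASK, VOCAB = -1, 10        # hypothetical mask id / vocab size

def denoise_block(tokens, start, end):
    """Toy stand-in for the denoiser: at each call, unmask roughly half
    of the remaining masked positions in the block, in parallel."""
    out = tokens.copy()
    masked = [i for i in range(start, end) if out[i] == MASK]
    for i in masked[: max(1, len(masked) // 2)]:
        out[i] = rng.integers(VOCAB)   # a real model's prediction goes here
    return out

tokens = np.full(SEQ_LEN, MASK)
for b in range(0, SEQ_LEN, BLOCK):          # causal inter-block order
    while MASK in tokens[b : b + BLOCK]:    # intra-block denoising loop
        tokens = denoise_block(tokens, b, b + BLOCK)

print(tokens)
```

The outer loop is autoregressive over blocks; only the inner denoising loop is parallel, which is where the efficiency over token-by-token AR decoding comes from.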

### Key Features

- 🧩 Block-wise Diffusion Backbone — parallel intra-block denoising with causal inter-block dependencies
- 🔄 Asynchronous Block-wise Noise Scheduling — diversifies supervision and smooths optimization
- ⚖️ Effective Mask Ratio Scaling — unbiased loss normalization under stochastic masking
- 📈 Progressive Beta Noise Curriculum — improves convergence and coverage over training
- 📊 SOTA Performance — matches or surpasses AR models such as LLaVA-OneVision under matched setups, and achieves state-of-the-art results among diffusion-based multimodal models
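One way to read the effective mask ratio scaling bullet (a sketch of the idea under our own assumptions, not the paper's exact formula): with stochastic masking, the realized number of masked tokens fluctuates around the nominal ratio `t`, so normalizing the summed loss by `t * L` is biased for any single draw, whereas normalizing by the count actually masked is not.

```python
import numpy as np

rng = np.random.default_rng(1)
L = 16                      # block length
t = 0.3                     # sampled (nominal) mask ratio for this block

per_token_loss = rng.random(L)     # stand-in for per-token CE loss
mask = rng.random(L) < t           # stochastic masking: count varies per draw

# Naive normalization by the nominal ratio: biased when the draw deviates.
naive = per_token_loss[mask].sum() / (t * L)

# Effective-ratio scaling: normalize by how many tokens were actually masked.
effective = per_token_loss[mask].sum() / max(mask.sum(), 1)

print(naive, effective)
```

With effective-ratio scaling, the normalized loss is exactly the mean loss over the masked positions, independent of how (un)lucky the masking draw was.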

## 🤗 Model Zoo

| Model | Type | Link |
| --- | --- | --- |
| SDAR-VL-Instruct-4B | Instruct | https://huggingface.co/JetLM/SDAR-VL-Instruct-4B |
| SDAR-VL-Instruct-8B | Instruct | https://huggingface.co/JetLM/SDAR-VL-Instruct-8B |
| SDAR-VL-Think-4B | Think | https://huggingface.co/JetLM/SDAR-VL-Think-4B |
| SDAR-VL-Think-8B | Think | https://huggingface.co/JetLM/SDAR-VL-Think-8B |

βš™οΈ Usage

### Inference

```shell
python generate.py
```

### Training

For detailed instructions on how to fine-tune the model on your own dataset, please refer to the guide in the training directory: training/README.md.

## 📬 Contact

For issues or inquiries, please open an issue on this repository.

## 🔬 Citation

```bibtex
@misc{cheng2025sdarvl,
  title        = {SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding},
  author       = {Cheng, Shuang and Jiang, Yuhua and Zhou, Zineng and Liu, Dawei and Wang, Tao and Zhang, Linfeng and Qi, Biqing and Zhou, Bowen},
  year         = {2025},
  note         = {Zhejiang University, Shanghai AI Laboratory, Tsinghua University, Shanghai Jiao Tong University, ByteDance}
}
```