
DARE: dLLM Alignment and Reinforcement Executor


🎯 Overview

We introduce DARE (dLLM Alignment and Reinforcement Executor), a flexible and efficient supervised fine-tuning (SFT) and reinforcement learning (RL) training framework designed specifically for diffusion large language models (dLLMs). DARE also integrates dLLMs into a comprehensive evaluation platform. It aims to be flexible and user-friendly, offering:

  • Easy extension of diverse RL algorithms for dLLMs
  • Easy extension of extra benchmark evaluations for dLLMs
  • Easy integration of popular and upcoming dLLM infrastructures and HuggingFace weights

DARE is a work in progress; we plan to support more models and algorithms for training and evaluation. We warmly welcome the research community to collaborate, give feedback, and share suggestions. Let's advance diffusion large language models together! 👊

Optimization Plan in RL Pipeline

Tip

For MDLMs (LLaDA or Dream), we decouple the attention backend used during training from the one used during rollout. Rollout uses flash_attn_func or flash_attn_with_kvcache for KV caching, while training adopts flash_attn_varlen_func to skip wasted computation on padding tokens. This speeds up the entire pipeline by approximately ⚡️ 4×.
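As a rough illustration of the training-side backend, here is a minimal, self-contained sketch (hypothetical shapes and values, not DARE's actual code) of calling flash_attn_varlen_func on packed sequences, so padding tokens never reach the kernel:

# Sketch: variable-length attention over packed sequences.
# Sequences are concatenated into one flat tensor and delimited by
# cumulative sequence lengths, so no FLOPs are spent on pad tokens.
import torch
from flash_attn import flash_attn_varlen_func

nheads, headdim = 8, 64
seqlens = [5, 9, 3]                       # three unpadded sequences
total = sum(seqlens)                      # 17 tokens packed together

# Packed (total_tokens, nheads, headdim) tensors instead of padded (B, T, ...)
q = torch.randn(total, nheads, headdim, dtype=torch.bfloat16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Cumulative lengths mark where each sequence starts/ends in the packed tensor
cu_seqlens = torch.tensor([0, 5, 14, 17], dtype=torch.int32, device="cuda")
max_seqlen = max(seqlens)

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
    causal=False,  # MDLMs attend bidirectionally
)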

Tip

For BDLMs like SDAR, we roll out with a compatible lmdeploy inference engine and adopt SDAR's logits-free fused_linear_cross_entropy to cut memory usage, enabling online weight updates for the rollout policy. This accelerates the entire pipeline by more than ⚡️ 14×.
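For intuition, here is a conceptual PyTorch sketch of the chunked idea behind a logits-free linear cross-entropy (the function name and chunk size are illustrative, and SDAR's real fused kernel also handles the backward pass without storing per-chunk logits):

# Sketch: compute the lm_head projection and the loss chunk by chunk,
# so peak memory holds one (chunk, V) block instead of the full
# (tokens, V) logits tensor.
import torch
import torch.nn.functional as F

def chunked_linear_cross_entropy(hidden, lm_head_weight, labels, chunk=1024):
    """hidden: (N, d), lm_head_weight: (V, d), labels: (N,)."""
    total, count = hidden.new_zeros(()), 0
    for i in range(0, hidden.size(0), chunk):
        h, y = hidden[i:i + chunk], labels[i:i + chunk]
        logits = h @ lm_head_weight.T          # (chunk, V), freed each step
        total = total + F.cross_entropy(logits, y, reduction="sum")
        count += y.numel()
    return total / count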

πŸ“’ News

  • [2025-12-28]: Important❗️ Several bugs in dp_actor_algorithm have been fixed. If you hit issues, pull the repo again; any commit after 12-28 includes the fixes.
  • [2025-12-24]: Support online rl (online weight update of rollout) for SDAR.
  • [2025-12-23]: Support vrpo (preference optimization) for Dream.
  • [2025-12-16]: Support vrpo (preference optimization) for LLaDA.
  • [2025-12-12]: Support sft/peft of SDAR.
  • [2025-12-11]: Support evaluation of LLaDAMoE and LLaDA2.0-mini.
  • [2025-12-08]: Support coupled-grpo, cj-grpo and spg algorithm.
  • [2025-12-03]: Support sequence parallel to enable longer generation ability for dLLMs.
  • [2025-12-01]: We initialized the codebase of DARE (dLLM Alignment and Reinforcement Executor), including faster sft/peft/rl (d1, bgpo) training (LLaDA/Dream) and evaluation (LLaDA/Dream/SDAR).

πŸ” Catalogue

πŸ† Key Features

  • Acceleration Inference/Rollout for dLLMs
    • Block cache (Fast-dLLM) for LLaDA and Dream: ~2.2× faster rollout
    • Inference engine (lmdeploy) for SDAR: 2-4× faster rollout
  • Parallelism for dLLMs
    • Support sequence parallel
  • Attention Backend
    • Support flash_attn backend
    • Support flash_attn_varlen backend
    • Support flash_attn_with_kvcache backend
  • Model Diversity
    • dLLMs trained from scratch (e.g., LLaDA)
    • dLLMs continually trained from AR models, i.e., AR-to-Diffusion (e.g., Dream, SDAR)
    • Masked diffusion language models (e.g., LLaDA/Dream) and block diffusion language models (e.g., SDAR)
  • Comprehensive Evaluation for dLLMs
  • Upcoming Features
    • Support sglang inference engine, MoE, Multi-Modal, Omni, etc.

πŸ› οΈ Installation and Setup

Our training framework is built on top of verl, providing a robust foundation for supervised fine-tuning and reinforcement learning experiments, and our evaluation framework is built on top of opencompass, providing comprehensive and fast evaluation.

Note

Due to irreconcilable dependency conflicts between packages, we strongly recommend using two separate virtual environments, one for training and one for evaluation.

πŸš€ Quick Installation

Clone the DARE repo:

git clone https://github.com/yjyddq/DARE

Build the training virtual environment:

# Create and activate environment
conda create -n DARE python=3.10 -y
conda activate DARE

# Install dependencies
cd DARE
pip install -r requirements.txt
pip install flash-attn==2.8.3 --no-build-isolation
# or (recommended): install from a prebuilt wheel
# wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
# pip install flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

Build the evaluation virtual environment:

# Create and activate environment
conda create --name opencompass python=3.10 -y
conda activate opencompass

# Install dependencies
cd DARE/opencompass
pip install -e .

# For HumanEval evaluation, install the additional dependency:
git clone https://github.com/open-compass/human-eval.git
cd human-eval && pip install -e .
cd ..

# For Math evaluation, pip install the additional dependency:
pip install math_verify latex2sympy2_extended

## Full installation (with support for more datasets)
# pip install "opencompass[full]"

## Environment with model acceleration frameworks
# pip install "opencompass[lmdeploy]"
# or
# pip install lmdeploy==0.10.1

πŸ”§ Model Setup

After downloading LLaDA-8B-Instruct, replace the source files with our modified versions to enable several key features:

# Copy modified files to your LLaDA model directory
cp models/xxx/* <path_to_llada_model>/

Alternatively, copy the model weights (.safetensors) into models/xxx/:

# Copy weights to models/xxx/ directory
cp <path_to_llada_model>/*.safetensors models/xxx/

The same applies to Dream, SDAR, etc.

Note

Because the RL-pipeline optimizations swap in different attention-computation backends, this step is indispensable.

πŸ—„οΈ Dataset Setup

Preprocessed datasets are under data/preprocessed. Please refer to verl.utils.preprocess to organize your own datasets; a hypothetical sketch of the expected layout follows.
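# A hypothetical sketch of writing a dataset to parquet (column names
# are assumptions for illustration; follow verl.utils.preprocess for
# the schema DARE actually expects).
import os
import pandas as pd

rows = [{
    "prompt": "What is 2 + 2?",
    "answer": "4",
    "data_source": "toy_math",
}]
os.makedirs("data/preprocessed/toy_math", exist_ok=True)
pd.DataFrame(rows).to_parquet("data/preprocessed/toy_math/train.parquet")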

πŸ‹οΈ Training

πŸš€ SFT Quick Start

bash scripts/run_sft.sh # | scripts/run_sft_peft.sh

Alternatively, use or write scripts of the form recipe/run_xxx.sh:

# peft for llada_8b_instruct
bash recipe/run_sft_peft_llada_8b_instruct.sh 

# sft for dream_7b_instruct
bash recipe/run_sft_dream_7b_instruct.sh 

# peft for sdar_8b_chat
bash recipe/run_sft_peft_sdar_8b_chat.sh 

πŸš€ RL Quick Start

# online rl for llada_8b_instruct
bash recipe/run_d1_llada_8b_instruct.sh --task math # use Fast-dLLM for rollout acceleration

# online rl for dream_7b_instruct
bash recipe/run_coupled_grpo_dream_7b_instruct.sh --task math # use Fast-dLLM for rollout acceleration

# online rl for sdar_8b_chat
bash recipe/run_bgpo_sdar_8b_chat.sh --task math # use lmdeploy engine for rollout acceleration

πŸš€ DPO/VRPO Quick Start

Run an example of preference optimization: first download argilla/ultrafeedback-binarized-preferences-cleaned, then run scripts/preprocess_dpo_dataset.sh to save ultrafeedback.parquet under data/preprocessed/dpo/train and data/preprocessed/dpo/test. A rough sketch of this preprocessing is shown below.
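# A rough, hypothetical sketch of what the preprocessing produces (this
# is NOT scripts/preprocess_dpo_dataset.sh; output column names and the
# split ratio are assumptions). The dataset stores chosen/rejected as
# message lists, so we keep the final assistant turn.
import os
import pandas as pd
from datasets import load_dataset

ds = load_dataset("argilla/ultrafeedback-binarized-preferences-cleaned", split="train")
df = pd.DataFrame({
    "prompt": ds["prompt"],
    "chosen": [msgs[-1]["content"] for msgs in ds["chosen"]],
    "rejected": [msgs[-1]["content"] for msgs in ds["rejected"]],
})
split = int(len(df) * 0.98)  # assumed train/test split
os.makedirs("data/preprocessed/dpo/train", exist_ok=True)
os.makedirs("data/preprocessed/dpo/test", exist_ok=True)
df[:split].to_parquet("data/preprocessed/dpo/train/ultrafeedback.parquet")
df[split:].to_parquet("data/preprocessed/dpo/test/ultrafeedback.parquet")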

# preference optimization for llada_8b_instruct
bash recipe/run_vrpo_llada_8b_instruct.sh --task ultrafeedback

# preference optimization for dream_7b_instruct
bash recipe/run_vrpo_dream_7b_instruct.sh --task ultrafeedback

πŸ“Š Evaluation

πŸš€ Quick Start

First, please follow opencompass for benchmark dataset preparation. Then specify the model path in opencompass/opencompass/configs/models/dllm/*. For example, llada_instruct_8b.py:

from opencompass.models import LLaDAModel

models = [
    dict(
        type=LLaDAModel,
        abbr='llada-8b-instruct',
        path='/TO/YOUR/PATH', # Need to modify
        max_out_len=1024,
        batch_size=1,
        run_cfg=dict(num_gpus=1),
    )
]

Evaluation of LLaDA-8B-Instruct on mmlu with hf backend:

bash scripts/eval_llada.sh --task mmlu

Evaluation of SDAR-8B-Chat on mmlu with lmdeploy backend:

bash scripts/eval_sdar_8b_chat.sh --task mmlu --engine lmdeploy

If you want to add more benchmarks, models, or custom datasets, please refer to the Evaluation Guideline.

πŸ“¦ Supported Models

| Model | Params | Training Support | Evaluation Support | Inference Acceleration |
| LLaDA-8B-Base | 8B | sft/rl | ✅ (hf) | Fast-dLLM |
| LLaDA-8B-Instruct | 8B | sft/rl | ✅ (hf) | Fast-dLLM |
| LLaDA-1.5 | 8B | sft/rl | ✅ (hf) | Fast-dLLM |
| Dream-7B-Instruct | 7B | sft/rl | ✅ (hf) | Fast-dLLM |
| SDAR-1.7B-Chat | 1.7B | sft/rl | ✅ | lmdeploy |
| SDAR-4B-Chat | 4B | sft/rl | ✅ | lmdeploy |
| SDAR-8B-Chat | 8B | sft/rl | ✅ | lmdeploy |

🌱 Supported RL Algorithms

| Algorithm | arXiv | Source Code |
| d1 | 2504.12216 | dllm-reasoning/d1 |
| vrpo | 2505.19223 | ML-GSAI/LLaDA-1.5 (closed source) |
| coupled-grpo | 2506.20639 | apple/ml-diffucoder |
| mdpo (todo) | 2508.13148 | autonomousvision/mdpo |
| cj-grpo | 2509.23924 | yjyddq/EOSER-ASS-RL |
| spg | 2510.09541 | facebookresearch/SPG |
| bgpo | 2510.11683 | THU-KEG/BGPO |

πŸ“ˆ Performance

Evaluation Result Reproduction

| Bench \ Model | LLaDA-8B-Instruct | LLaDA-8B-Instruct + Fast-dLLM | Dream-7B-Instruct | SDAR-8B-Chat | SDAR-8B-Chat + lmdeploy |
| MMLU | 65.24 | 65.17 | 66.83 | 75.40 | 75.40 |
| MMLU-Pro | 36.82 | 34.58 | 31.89 | 52.07 | 52.07 |
| Hellaswag | 75.30 | 74.41 | 63.23 | 67.67 | 87.59 |
| ARC-C | 87.80 | 87.80 | 81.36 | 69.83 | 86.78 |
| GSM8k | 79.68 | 78.39 | 83.24 | 88.10 | 87.95 |
| MATH | 41.08 | 40.58 | 48.02 | 48.10 | 52.80 |
| GPQA | 30.81 | 31.82 | 26.77 | 28.28 | 36.36 |
| AIME24 | 0.83 | 2.08 | 0.83 | 8.75 | 6.67 |
| AIME25 | 0.42 | 0.00 | 0.00 | 10.00 | 6.67 |
| Olympiad | 8.95 | 9.70 | 12.22 | 17.81 | 17.35 |
| HumanEval | 46.34 | 43.29 | 78.05 | 73.17 | 73.17 |
| MBPP | 38.80 | 20.00 | 56.40 | 53.80 | 55.40 |

πŸ“§ Contact

For any questions or collaboration inquiries, feel free to reach out to Jingyi Yang at yangjingyi946@gmail.com.

πŸ‘·β€β™‚οΈ Contributor

We look forward to your contributions.

πŸ“š Citation

If you find our work useful, please consider citing:

@misc{yang2025dare,
  title={DARE: dLLM Alignment and Reinforcement Executor},
  author={Yang, Jingyi and Jiang, Yuxian and Hu, Xuhao and Shao, Jing},
  howpublished={\url{https://github.com/yjyddq/DARE}},
  year={2025}
}

@article{yang2025taming,
  title={Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Step},
  author={Yang, Jingyi and Chen, Guanxu and Hu, Xuhao and Shao, Jing},
  journal={arXiv preprint arXiv:2509.23924},
  year={2025}
}

πŸ™ Acknowledgments

We thank the open-source community for their wonderful work and valuable contributions.
