d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models


Introduction

d-TreeRPO is a reliable reinforcement learning framework for diffusion language models (dLLMs). It improves policy optimization by:

  • organizing rollouts into a tree structure and propagating verifiable outcome rewards bottom-up to obtain fine-grained, step-wise advantages, and
  • introducing a time-scheduled self-distillation loss that gradually increases model determinism later in training, enabling more accurate transition-probability estimation for any-order decoding.
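To make the first point concrete, here is a minimal sketch of bottom-up reward propagation in a rollout tree. The `Node` structure, the mean-of-children backup, and the sibling-mean baseline for advantages are illustrative assumptions, not the paper's exact formulation:

```python
# Hypothetical sketch: propagate verifiable outcome rewards (at the leaves)
# bottom-up, then score each branch relative to its siblings.
from dataclasses import dataclass, field


@dataclass
class Node:
    reward: float = 0.0                       # verifiable outcome reward (leaves only)
    children: list = field(default_factory=list)
    value: float = 0.0                        # filled in by backup()
    advantage: float = 0.0                    # value relative to sibling mean


def backup(node: Node) -> float:
    """A node's value is its reward (leaf) or the mean of its children's values."""
    if not node.children:
        node.value = node.reward
        return node.value
    child_vals = [backup(c) for c in node.children]
    node.value = sum(child_vals) / len(child_vals)
    # step-wise advantage: how much better each child is than the sibling average
    for c in node.children:
        c.advantage = c.value - node.value
    return node.value


# Example: two branches from the root; the left branch always reaches a correct answer.
leaf = lambda r: Node(reward=r)
root = Node(children=[Node(children=[leaf(1.0), leaf(1.0)]),
                      Node(children=[leaf(0.0), leaf(1.0)])])
backup(root)
print(root.value)                  # 0.75
print(root.children[0].advantage)  # 0.25 (left branch above average)
```

The point of the tree is that intermediate nodes get a credit signal even though only full completions are verifiable, yielding per-step advantages rather than a single sequence-level one.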

d-TreeRPO achieves strong gains on reasoning benchmarks such as Sudoku, Countdown, GSM8K, and Math500.

Environment Setup

conda env create -f env.yml
conda activate dtreerpo

Training

We provide several training scripts for LLaDA-8B-Instruct and LLaDA-MoE-7BA1B-Instruct in dtreerpo/. For example, to train LLaDA-8B-Instruct on the Sudoku task, run:

cd dtreerpo
bash run_llada_sudoku_dtreerpo.sh

Note: following previous works (Diffu-GRPO, wd1), we use LoRA training.

Below are the reward curves for d-TreeRPO on different tasks (LLaDA-8B-Instruct), with and without the self-distillation loss, as well as a variant with a diversity-promoting loss (see the paper for details).

Evaluation

We provide several evaluation scripts for LLaDA-8B-Instruct and LLaDA-MoE-7BA1B-Instruct in eval/. For example, to evaluate checkpoints using LLaDA-8B-Instruct as the base model, run:

cd eval
bash eval_llada_dtreerpo.sh

This script will load all saved LoRA adapters from the specified directory, merge (attach) each adapter into the base model, and then run evaluation.
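For intuition on what the merge step does, here is a toy NumPy illustration of folding a LoRA update into a base weight; the dimensions, scaling, and variable names are assumptions for illustration, not the script's actual code:

```python
import numpy as np

# Toy illustration of "merging" a LoRA adapter: the low-rank update
# B @ A (scaled by alpha/r) is added into the frozen base weight, so
# evaluation can run the merged model with no extra adapter pass.
rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 4
W = rng.standard_normal((d, d))   # frozen base weight
A = rng.standard_normal((r, d))   # trained LoRA down-projection
B = rng.standard_normal((d, r))   # trained LoRA up-projection
W_merged = W + (alpha / r) * (B @ A)

x = rng.standard_normal(d)
y_side = W @ x + (alpha / r) * (B @ (A @ x))  # adapter-on-the-side forward
y_merged = W_merged @ x                        # merged forward
print(np.allclose(y_side, y_merged))           # True: same outputs either way
```

In practice this is what adapter-merging utilities (e.g. PEFT's merge-and-unload) do for every LoRA-adapted weight matrix.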

After generating the results, run the following script to acquire the final scores:

python parse_and_get_acc.py

Performance Comparison

Below is a performance comparison with previous works: d-TreeRPO outperforms several baselines under most settings, especially on planning tasks such as Sudoku and Countdown.

Acknowledgements

This repository is built upon the Diffu-GRPO and wd1 repositories.

Citation

@article{pan2025d,
    title={d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models},
    author={Pan, Leyi and Tao, Shuchang and Zhai, Yunpeng and Fu, Zheyu and Fang, Liancheng and He, Minghua and Zhang, Lingzhe and Liu, Zhaoyang and Ding, Bolin and Liu, Aiwei and others},
    journal={arXiv preprint arXiv:2512.09675},
    year={2025}
}
