d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models


Introduction

d-TreeRPO is a reliable reinforcement learning framework for diffusion language models (dLLMs). It improves policy optimization by:

  • organizing rollouts into a tree structure and propagating verifiable outcome rewards bottom-up to obtain fine-grained, step-wise advantages, and
  • introducing a time-scheduled self-distillation loss that gradually increases model determinism later in training, enabling more accurate transition-probability estimation for any-order decoding.
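To make the first point concrete, here is a minimal sketch of bottom-up reward propagation in a rollout tree. The `Node` structure, the mean-of-children backup, and the sibling-mean baseline for advantages are illustrative assumptions, not the paper's exact formulation:

```python
# Hypothetical sketch: propagate verifiable outcome rewards (at the leaves)
# bottom-up, then score each branch relative to its siblings.
from dataclasses import dataclass, field


@dataclass
class Node:
    reward: float = 0.0                       # verifiable outcome reward (leaves only)
    children: list = field(default_factory=list)
    value: float = 0.0                        # filled in by backup()
    advantage: float = 0.0                    # value relative to sibling mean


def backup(node: Node) -> float:
    """A node's value is its reward (leaf) or the mean of its children's values."""
    if not node.children:
        node.value = node.reward
        return node.value
    child_vals = [backup(c) for c in node.children]
    node.value = sum(child_vals) / len(child_vals)
    # step-wise advantage: how much better each child is than the sibling average
    for c in node.children:
        c.advantage = c.value - node.value
    return node.value


# Example: two branches from the root; the left branch always reaches a correct answer.
leaf = lambda r: Node(reward=r)
root = Node(children=[Node(children=[leaf(1.0), leaf(1.0)]),
                      Node(children=[leaf(0.0), leaf(1.0)])])
backup(root)
print(root.value)                  # 0.75
print(root.children[0].advantage)  # 0.25 (left branch above average)
```

The point of the tree is that intermediate nodes get a credit signal even though only full completions are verifiable, yielding per-step advantages rather than a single sequence-level one.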

d-TreeRPO achieves strong gains on reasoning benchmarks such as Sudoku, Countdown, GSM8K, and Math500.

Environment Setup

conda env create -f env.yml
conda activate dtreerpo

Training

We provide several training scripts for LLaDA-8B-Instruct and LLaDA-MoE-7BA1B-Instruct in dtreerpo/. For example, to train LLaDA-8B-Instruct on the Sudoku task, run:

cd dtreerpo
bash run_llada_sudoku_dtreerpo.sh

Note: following previous works (Diffu-GRPO, wd1), we use LoRA training.

Below are the reward curves for d-TreeRPO on different tasks (LLaDA-8B-Instruct), with and without the self-distillation loss, as well as a variant with a diversity-promoting loss (see the paper for details).

Evaluation

We provide several evaluation scripts for LLaDA-8B-Instruct and LLaDA-MoE-7BA1B-Instruct in eval/. For example, to evaluate checkpoints using LLaDA-8B-Instruct as the base model, run:

cd eval
bash eval_llada_dtreerpo.sh

This script will load all saved LoRA adapters from the specified directory, merge (attach) each adapter into the base model, and then run evaluation.
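For intuition on what the merge step does, here is a toy NumPy illustration of folding a LoRA update into a base weight; the dimensions, scaling, and variable names are assumptions for illustration, not the script's actual code:

```python
import numpy as np

# Toy illustration of "merging" a LoRA adapter: the low-rank update
# B @ A (scaled by alpha/r) is added into the frozen base weight, so
# evaluation can run the merged model with no extra adapter pass.
rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 4
W = rng.standard_normal((d, d))   # frozen base weight
A = rng.standard_normal((r, d))   # trained LoRA down-projection
B = rng.standard_normal((d, r))   # trained LoRA up-projection
W_merged = W + (alpha / r) * (B @ A)

x = rng.standard_normal(d)
y_side = W @ x + (alpha / r) * (B @ (A @ x))  # adapter-on-the-side forward
y_merged = W_merged @ x                        # merged forward
print(np.allclose(y_side, y_merged))           # True: same outputs either way
```

In practice this is what adapter-merging utilities (e.g. PEFT's merge-and-unload) do for every LoRA-adapted weight matrix.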

After generating the results, run the following script to acquire the final scores:

python parse_and_get_acc.py

Performance Comparison

Below is a performance comparison with previous works: d-TreeRPO outperforms several baselines under most settings, especially on planning tasks such as Sudoku and Countdown.

Acknowledgements

This repository is built upon the Diffu-GRPO and wd1 repositories.

Citation

@article{pan2025d,
    title={d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models},
    author={Pan, Leyi and Tao, Shuchang and Zhai, Yunpeng and Fu, Zheyu and Fang, Liancheng and He, Minghua and Zhang, Lingzhe and Liu, Zhaoyang and Ding, Bolin and Liu, Aiwei and others},
    journal={arXiv preprint arXiv:2512.09675},
    year={2025}
}
