📚 [Paper] | 🤗 [Hugging Face]
We provide the full list of dependencies required to run and reproduce our experiments in the requirements.txt file, which can be installed into any Python environment via pip:

```bash
pip install -r requirements.txt
```

In the cfgs/ folder, we provide the full list of configurations and hyper-parameters used in our work to train and evaluate L2D. In particular, the cfgs/model/ subfolder contains model-specific configurations named as follows:
- `{base_model}_lad.cfg` for L2D full diffusion path finetuning.
- `{base_model}_lad_lora.cfg` for L2D diffusion path finetuning with LoRA.
For instance: `llama_3.1_8b_instruct_lad_lora.cfg`.
However, you can train and evaluate any existing local model, or one hosted on Hugging Face, by simply modifying:
```
pretrained_model_dir = "my/model/name/or/path"
tokenizer_dir = "my/model/name/or/path"
```

While we make use of distributed training and evaluation setups with the deepspeed library, our experiments should be reproducible even with small computation budgets and a single GPU by regulating the micro_batch_size parameter, as sketched below. In the scripts/ folder, we provide further scripts to facilitate running experiments with our repository.
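For instance, a memory-constrained single-GPU run might lower the per-device batch directly in the chosen config file. Note that micro_batch_size is the only name here taken from our configurations; the appropriate value depends on your hardware:

```
# Reduce the per-GPU batch to fit on a single, smaller GPU.
micro_batch_size = 1
```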
By default, checkpoints and results are saved in the experiments/ folder.
Please use the scripts/run_training.sh script, passing as the first argument the GPUs to utilize (e.g., `0` or `0,1` or `0,1,2,3`, etc.) and as the second argument the path to the relevant config file (e.g., `llama_3.2_1b_instruct_lad_lora.cfg`):
```bash
scripts/run_training.sh 0,1 cfgs/model/llama_3.2_1b_instruct_lad_lora.cfg
```

By default, this training phase uses a subset of the Smoltalk dataset. However, it can be easily extended to any custom dataset by creating another training task following the example structure in tasks/smoltalk.py, as sketched below.
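The actual task interface is defined in tasks/smoltalk.py; the following is only a rough, hypothetical sketch of what such a task could look like. The module name, function name, dataset, and column names are all illustrative, not the repository's actual API:

```python
# tasks/my_custom_task.py -- hypothetical file; copy the real structure
# from tasks/smoltalk.py, whose interface may differ from this sketch.
from datasets import load_dataset


def build_dataset(tokenizer, split="train"):
    """Load a chat dataset and render each conversation as plain text."""
    # "my_org/my_chat_dataset" is a placeholder for any Hugging Face (or
    # local) dataset with a "messages" column of chat turns.
    dataset = load_dataset("my_org/my_chat_dataset", split=split)

    def to_text(example):
        # apply_chat_template is the standard transformers tokenizer API
        # for rendering a list of chat messages into a single string.
        return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

    return dataset.map(to_text, remove_columns=dataset.column_names)
```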
Please use the scripts/run_bench_full.sh script, passing as the first argument the GPUs to utilize (e.g., `0` or `0,1` or `0,1,2,3`, etc.), as the second argument the path to the relevant config file (e.g., `cfgs/model/llama_3.2_1b_lad_lora.cfg`), and as the third argument the path to the PyTorch checkpoint file saved after training:
```bash
scripts/run_bench_full.sh 0,1 cfgs/model/llama_3.2_1b_lad_lora.cfg $CHECKPOINT_PATH
```

In our experiments, we made use of the lighteval/MATH dataset for our results on the MATH task. Since this dataset has been temporarily removed from Hugging Face, our default configuration files forego this setting. Please add an equivalent local or hosted dataset back to cfgs/benchmark.cfg to reactivate MATH evaluation.
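If you do, the exact schema should be copied from the existing entries in cfgs/benchmark.cfg; the field name below is purely illustrative and not the repository's actual key:

```
# Illustrative only: mirror the existing benchmark entries in cfgs/benchmark.cfg.
math_dataset = "my/local/or/hosted/math-dataset"
```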
Running experiments requires downloading models and datasets hosted on Hugging Face. Hence, it requires logging into a Hugging Face account with an access token, as explained in the Hugging Face documentation, with the following command:
```bash
huggingface-cli login
```

The default logging functionality saves results locally via TensorBoard. Weights & Biases logging is also supported (it additionally requires being authenticated with W&B, e.g., via `wandb login`). To use it, please modify the provided configuration files by adding:
```
save_wandb = True
```

To cite our work, you can use the following:
```bibtex
@article{sakana2025l2d,
  title={Large Language Models to Diffusion Finetuning},
  author={Cetin, Edoardo and Zhao, Tianyu and Tang, Yujin},
  journal={arXiv preprint arXiv:2501.15781},
  year={2025}
}
```
