
🏆 NeurIPS 2025 Main Conference Paper


Junqi Gao 1, Zhichang Guo 1, Dazhi Zhang 1, Dong Li 1, Runze Liu 3, Pengfei Li 1,5, Kai Tian 4, Biqing Qi 2,†

1 School of Mathematics, Harbin Institute of Technology

2 Shanghai Artificial Intelligence Laboratory

3 Tsinghua Shenzhen International Graduate School, Tsinghua University

4 Department of Electronic Engineering, Tsinghua University

5 Shanghai Innovation Institute

† Corresponding Author

📄 Introduction

Bohdi is a novel framework for heterogeneous Large Language Model (LLM) fusion that integrates the strengths of multiple source LLMs into a target LLM through adaptive knowledge exploration and automatic data generation. Unlike existing methods that rely on real data from limited domains and use fixed data allocation proportions, Bohdi dynamically adjusts sampling based on the target LLM's performance and generates data automatically through a hierarchical knowledge tree structure. This ensures comprehensive domain coverage and balanced capability enhancement without the need for real data.
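The hierarchical knowledge tree can be pictured with a minimal sketch (illustrative only; the class and method names below are hypothetical and not Bohdi's actual API): Sprout attaches newly explored sub-domains under an existing node, and Harvest walks down from the root to pick a leaf domain for which data will be generated.

```python
import random

class KnowledgeNode:
    """A node in an illustrative hierarchical knowledge tree."""
    def __init__(self, name):
        self.name = name
        self.children = []

    def sprout(self, sub_domains):
        """Sprout: attach newly explored sub-domains under this node."""
        for d in sub_domains:
            self.children.append(KnowledgeNode(d))

    def harvest(self, rng=random):
        """Harvest: descend to a leaf domain to generate data for."""
        node = self
        while node.children:
            node = rng.choice(node.children)
        return node.name

root = KnowledgeNode("knowledge")
root.sprout(["math", "coding"])
root.children[0].sprout(["algebra", "geometry"])
print(root.harvest())  # prints one leaf domain, e.g. "algebra"
```

In the actual framework the descent is guided by the target LLM's performance rather than a uniform random choice; the sketch only shows the tree mechanics.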

✨ Features

🚀 Synthetic-Data-Only Fusion: Bohdi operates without relying on real data, making it highly efficient and versatile.

🌳 Dynamic Domain Exploration: Through the hierarchical knowledge tree and Sprout/Harvest operations, Bohdi explores new domains and generates data automatically.

🔄 Adaptive Data Allocation: The DynaBranches mechanism with IR ensures dynamic adjustment of data sampling proportions based on the target LLM’s capabilities.
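As a rough illustration of adaptive data allocation (a minimal sketch under assumed inputs, not the actual DynaBranches/IR mechanism): domains where the target LLM lags the source LLMs most receive a larger share of the sampling budget, via a softmax over per-domain capability gaps.

```python
import math

def allocation(target_scores, source_scores, temperature=0.1):
    """Softmax over per-domain capability gaps: larger gap -> more samples.

    Scores are accuracies in [0, 1]; keys are domain names.
    """
    gaps = {d: source_scores[d] - target_scores[d] for d in target_scores}
    exps = {d: math.exp(g / temperature) for d, g in gaps.items()}
    total = sum(exps.values())
    return {d: e / total for d, e in exps.items()}

props = allocation(
    target_scores={"math": 0.55, "coding": 0.70, "reasoning": 0.60},
    source_scores={"math": 0.80, "coding": 0.75, "reasoning": 0.72},
)
# "math" has the largest gap, so it gets the biggest sampling proportion
```

Lowering the temperature concentrates the budget on the weakest domain; raising it moves the allocation toward uniform.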

⚙️ Installation

Main Environment for Distillation

conda env create -f environment_Bohdi.yaml

Environment for Evaluation

conda env create -f opencompass_env.yaml

Preparation for Evaluation Suite

# The version we used: opencompass 0.3.4
git clone https://github.com/open-compass/opencompass opencompass
cd [your project path]/opencompass
pip install -e .

⏳ Distillation Training

To train the target LLM using Bohdi, follow these steps:

  1. Prepare Source LLMs: Ensure you have access to the source LLMs you want to fuse. If you want to follow our setup, please download the following models:
    # Source Models
    Qwen/Qwen2.5-14B-Instruct
    mistralai/Mistral-Small-24B-Instruct-2501
    microsoft/phi-4
    # Target Models
    meta-llama/Llama-3.2-3B-Instruct
    meta-llama/Llama-3.1-8B-Instruct
    Qwen/Qwen2.5-7B-Instruct
    google/gemma-2-9b-it
  2. Run Bohdi for Distillation: First configure the relevant paths in run_bohdi.sh according to your actual paths, then run:
    source activate bohdi
    cd [your project path]/Bohdi
    bash run_bohdi.sh

📏 Evaluation

We use OpenCompass for evaluation and perform inference with vLLM. To evaluate your model, configure the relevant paths in eval_opencompass.sh according to your actual paths, then run:

source activate opencompass
cd [your project path]/opencompass
bash eval_opencompass.sh

Direct Download and Usage

If you would like to use the distilled models for evaluation directly, they are available on Hugging Face:

ChetKao/Bohdi-Llama-3.2-3B-Instruct
ChetKao/Bohdi-Llama-3.1-8B-Instruct
ChetKao/Bohdi-Qwen2.5-7B-Instruct
ChetKao/Bohdi-gemma-2-9b-it
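The distilled checkpoints can be loaded like any Hugging Face causal LM. A minimal sketch with transformers (assuming a recent version with chat-template support; the helper function name is ours, and the model weights download on first use):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate_reply(model_name, user_message, max_new_tokens=128):
    """Load a distilled Bohdi checkpoint and answer one chat message."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
    messages = [{"role": "user", "content": user_message}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the prompt
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

if __name__ == "__main__":
    print(generate_reply("ChetKao/Bohdi-Llama-3.2-3B-Instruct", "Hello!"))
```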

📚 Citation

@article{gao2025bohdi,
  title={Bohdi: Heterogeneous LLM Fusion with Automatic Data Exploration},
  author={Junqi Gao and Zhichang Guo and Dazhi Zhang and Dong Li and Runze Liu and Pengfei Li and Kai Tian and Biqing Qi},
  journal={arXiv preprint arXiv:2506.15721},
  year={2025},
  url={https://doi.org/10.48550/arXiv.2506.15721}
}

About

Bohdi is a novel framework for heterogeneous Large Language Model (LLM) fusion, enabling efficient knowledge transfer from multiple source LLMs to a compact target model without relying on real-world data.
