Tiansheng Wen*1,2,
Yifei Wang*3,
Aosong Feng4,
Long Ma2,
Xinyang Liu5,
Yifan Wang1,
Lixuan Guo2,
Bo Chen2,
Stefanie Jegelka6,3,
Chenyu You1
1Stony Brook University
2Xidian University
3MIT
4Yale University
5University of Texas at Austin
6TU Munich
- 2025-11-09: We uploaded our work to arXiv.
NOTE: This repo is built heavily on top of Densemixer. We follow their training paradigm with minor modifications to the MoE routing process.
Currently, SeqTopK supports two MoE models: OLMoE (7B parameters) and Qwen1.5-MoE (14B parameters).
We run experiments with open-instruct on the OLMoE model.
Set up your environment as follows:
cd experiments/train/open-instruct
conda create -n openinstruct python=3.12
conda activate openinstruct
conda install -c conda-forge cuda-nvcc=12.1 -y
bash init_env.sh
To activate Densemixer and SeqTopK, one additional setup step is required:
conda activate openinstruct
pip install densemixer
densemixer setup
For convenience, we use environment variables for configuration:
| Variable | Description | Default |
|---|---|---|
| `DENSEMIXER_ENABLED` | Master switch (set to `1` to enable) | `0` |
| `DENSEMIXER_QWEN2` | Enable for Qwen1.5-MoE models | `1` |
| `DENSEMIXER_OLMOE` | Enable for OLMoE models | `1` |
| `DENSEMIXER_TOPK_MODE` | Routing type (`seq_topk` for sequence-level, `topk` for token-level) | `seq_topk` |
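Putting the table together, a typical configuration for sequence-level routing on OLMoE could look like the following sketch (the variable names come from the table above; the values are illustrative):

```shell
# Enable the Densemixer master switch (default is 0, i.e. disabled).
export DENSEMIXER_ENABLED=1
# Patch OLMoE models (default: 1).
export DENSEMIXER_OLMOE=1
# Use sequence-level top-k routing; set to "topk" for token-level routing.
export DENSEMIXER_TOPK_MODE=seq_topk
```

Set these in the shell (or training script) before launching training so the patched routing path is picked up.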
We use the datasets from DeepSeek's ESFT paper for both training and evaluation. Pre-processed datasets are available on Hugging Face: GSM (math reasoning), CodeAlpaca (code generation), ESFT-law (legal reasoning), ESFT-summary (summarization).
For code generation evaluation, we use MBPP and HumanEval.
We support the following fine-tuning methods. For each method, replace {dataset_name} with your target dataset (gsm, codealpaca, law, summary).
cd experiments/train/open-instruct
export WANDB_API_KEY="YOUR_WANDB_API_KEY"
bash run/densemixer_full/train_{dataset_name}.sh
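As a concrete example, substituting `gsm` for `{dataset_name}` gives the following invocation (the W&B key is a placeholder you must replace):

```shell
cd experiments/train/open-instruct
export WANDB_API_KEY="YOUR_WANDB_API_KEY"  # replace with your actual key
bash run/densemixer_full/train_gsm.sh      # {dataset_name} = gsm
```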
Evaluation scripts are in the eval/ directory. See eval/README.md for details and environment setup.
We run experiments with LLaMA-Factory on the Qwen1.5-MoE model.
Set up your environment as follows:
cd experiments/train/LLaMA-Factory
bash installation.sh
To activate Densemixer and SeqTopK, one additional setup step is required:
conda activate openinstruct
pip install densemixer
densemixer setup
We use the same datasets as above: GSM (math reasoning), CodeAlpaca (code generation), ESFT-law (legal reasoning), and ESFT-summary (summarization). For code generation evaluation, we use MBPP and HumanEval.
We support the following fine-tuning methods. For each method, replace {dataset_name} with your target dataset (gsm, codealpaca, law, summary).
cd experiments/train/LLaMA-Factory
export WANDB_API_KEY="YOUR_WANDB_API_KEY"
bash run/qwen1.5/densemixer_full/train_{dataset_name}_densemixer.sh
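For instance, substituting `codealpaca` for `{dataset_name}` yields (the W&B key is a placeholder you must replace):

```shell
cd experiments/train/LLaMA-Factory
export WANDB_API_KEY="YOUR_WANDB_API_KEY"                    # replace with your actual key
bash run/qwen1.5/densemixer_full/train_codealpaca_densemixer.sh  # {dataset_name} = codealpaca
```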
Evaluation scripts are in the eval/ directory. See eval/README.md for details and environment setup.
If you find this work useful, please cite the accompanying paper: