Y-Research-SBU/SeqTopK
✨ Route Experts by Sequence, Not by Token ✨


Tiansheng Wen*1,2, Yifei Wang*3, Aosong Feng4, Long Ma2, Xinyang Liu5, Yifan Wang1, Lixuan Guo2,
Bo Chen2, Stefanie Jegelka6,3, Chenyu You1

1Stony Brook University   2Xidian University   3MIT  
4Yale University   5University of Texas at Austin   6TU Munich



🚀 News

  • 2025-11-09: We uploaded our work to arXiv.

Usage Guide

NOTE: This repo builds heavily on top of DenseMixer. We follow their training paradigm with minor modifications to the MoE routing process.

Currently, SeqTopK supports two MoE models: OLMoE (7B parameters) and Qwen1.5-MoE (14B parameters).
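To illustrate the idea in the title ("route experts by sequence, not by token"), here is a toy NumPy sketch of the difference between the two routing modes this repo exposes (token-level `topk` vs. sequence-level `seq_topk`). This is an illustrative reading of the approach, not the repository's implementation; all function names are made up, and the real code patches the model's routing forward pass via DenseMixer.

```python
import numpy as np

def token_topk(logits, k):
    """Token-level routing: each token independently keeps its k best experts."""
    # logits: (seq_len, num_experts) router scores; returns a boolean routing mask.
    idx = np.argsort(logits, axis=-1)[:, -k:]
    mask = np.zeros_like(logits, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

def seq_topk(logits, k):
    """Sequence-level routing: spend the same total budget (seq_len * k)
    on the best (token, expert) pairs across the whole sequence, so
    individual tokens may activate unequal numbers of experts."""
    seq_len, num_experts = logits.shape
    budget = seq_len * k
    flat_idx = np.argsort(logits, axis=None)[-budget:]  # top pairs globally
    mask = np.zeros(logits.size, dtype=bool)
    mask[flat_idx] = True
    return mask.reshape(seq_len, num_experts)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))  # 4 tokens, 8 experts
k = 2
tok = token_topk(logits, k)
seq = seq_topk(logits, k)
# Both modes activate the same total number of (token, expert) pairs,
# so the compute budget is identical; only the allocation differs.
assert tok.sum() == seq.sum() == 4 * k
print(tok.sum(axis=-1))  # exactly k experts for every token
print(seq.sum(axis=-1))  # per-token expert counts may vary
```

The point of the sketch: under the same budget, `seq_topk` can give hard tokens more experts and easy tokens fewer, which is the behavior the `DENSEMIXER_TOPK_MODE=seq_topk` setting selects below.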

Getting started with OLMoE

1. Installation

We run the OLMoE experiments with open-instruct.
Set up your environment as follows:

cd experiments/train/open-instruct
conda create -n openinstruct python=3.12
conda activate openinstruct
conda install -c conda-forge cuda-nvcc=12.1 -y
bash init_env.sh

To activate DenseMixer and SeqTopK, one additional setup step is needed:

conda activate openinstruct
pip install densemixer
densemixer setup

For convenience, configuration is done through environment variables:

Variable              Description                                                     Default
DENSEMIXER_ENABLED    Master switch (set to 1 to enable)                              0
DENSEMIXER_QWEN2      Enable for Qwen1.5-MoE models                                   1
DENSEMIXER_OLMOE      Enable for OLMoE models                                         1
DENSEMIXER_TOPK_MODE  Routing type: seq_topk (sequence-level) or topk (token-level)   seq_topk
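For example, assuming a bash-like shell, enabling DenseMixer with sequence-level routing for an OLMoE run might look like the following (variable names and defaults are taken from the table above; export them before launching training):

```shell
# Enable DenseMixer's patched routing and select sequence-level top-k.
export DENSEMIXER_ENABLED=1           # master switch (default 0)
export DENSEMIXER_OLMOE=1             # patch OLMoE models
export DENSEMIXER_TOPK_MODE=seq_topk  # use "topk" instead for token-level routing
```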

2. Data Preparation

We use the datasets from Deepseek's ESFT paper for both training and evaluation. Pre-processed datasets are available on Hugging Face: GSM (math reasoning), CodeAlpaca (code generation), ESFT-law (legal reasoning), ESFT-summary (summarization).

For code generation evaluation, we use MBPP and HumanEval.


3. Training

We support the following fine-tuning methods. For each method, replace {dataset_name} with your target dataset (gsm, codealpaca, law, summary).

cd experiments/train/open-instruct
export WANDB_API_KEY="YOUR_WANDB_API_KEY"
bash run/densemixer_full/train_{dataset_name}.sh

4. Evaluation

Evaluation scripts are in the eval/ directory. See eval/README.md for details and environment setup.


Getting started with Qwen1.5-MoE

1. Installation

We run the Qwen1.5-MoE experiments with LLaMA-Factory.
Set up your environment as follows:

cd experiments/train/LLaMA-Factory
bash installation.sh

To activate DenseMixer and SeqTopK, one additional setup step is needed:

conda activate openinstruct
pip install densemixer
densemixer setup

2. Data Preparation

We use the same datasets as above: GSM (math reasoning), CodeAlpaca (code generation), ESFT-law (legal reasoning), and ESFT-summary (summarization). For code generation evaluation, we use MBPP and HumanEval.


3. Training

We support the following fine-tuning methods. For each method, replace {dataset_name} with your target dataset (gsm, codealpaca, law, summary).

cd experiments/train/LLaMA-Factory
export WANDB_API_KEY="YOUR_WANDB_API_KEY"
bash run/qwen1.5/densemixer_full/train_{dataset_name}_densemixer.sh

4. Evaluation

Evaluation scripts are in the eval/ directory. See eval/README.md for details and environment setup.

Citing this paper

If you find this work useful, please cite the accompanying paper:

About

Official Repository for SeqTopK
