HookMoE: A learnable performance compensation strategy of Mixture-of-Experts for LLM inference acceleration

Longkai Cheng*, Along He*, Mulin Li, Xueshuo Xie†, Tao Li†

Paper Link

Abstract

Mixture of Experts (MoE) architectures have emerged as a promising paradigm for scaling model capacity through top-k routing mechanisms. Although reducing the number of activated experts inherently enables inference acceleration, this efficiency gain typically comes at the cost of significant performance degradation. To address this trade-off between efficiency and performance, we propose HookMoE, a plug-and-play single-layer compensation framework that effectively restores performance using only a small post-training calibration set. Our method strategically inserts a lightweight trainable Hook module immediately preceding selected transformer blocks. In comprehensive evaluations on four popular MoE models, our method reduces the number of activated experts by more than 50% and achieves a 1.42x inference speed-up during the prefill stage, with an average performance degradation of only 2.5% across various benchmarks. Through systematic analysis, we further reveal that the upper layers require fewer active experts, offering actionable insights for refining dynamic expert selection strategies and enhancing the overall efficiency of MoE models.
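The top-k routing the abstract refers to, and why shrinking k directly cuts compute, can be seen in a tiny self-contained sketch (plain Python, illustrative names only; real MoE routers operate on per-token logit tensors):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def topk_route(router_logits, k):
    """Standard MoE top-k routing: keep the k highest-scoring experts
    and renormalize their gates. Returns [(expert_id, gate_weight)].
    Only the k selected experts' FFNs run, so halving k roughly halves
    the expert FLOPs per token."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    gates = softmax([router_logits[i] for i in ranked])
    return list(zip(ranked, gates))

# 8 router logits, as in a Mixtral-style layer with 8 experts
logits = [0.1, 2.0, -1.0, 1.5, 0.3, 0.0, -0.5, 1.0]
print(topk_route(logits, 2))  # activates experts 1 and 3
print(topk_route(logits, 1))  # activates expert 1 only
```

HookMoE's point is that when k is reduced this way, a small trained compensation module can recover most of the lost quality.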

Overview

Poster

Train

We use LLaMA-Factory for training.

Install llama-factory package

cd ./train/LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation

Patch the Hook module in loader.py

# Wrap every layer's MoE block; only the selected layer gains a Hook transform
for l, layer in enumerate(model.model.layers):
    layer.block_sparse_moe = LlamaMLPWrapper(layer.block_sparse_moe, l)

print(model)
# Freeze the whole model, then unfreeze only the Hook transform in layer 14
for param in model.parameters():
    param.requires_grad = False
for param in model.model.layers[14].block_sparse_moe.transform.parameters():
    param.requires_grad = True
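The `LlamaMLPWrapper` class itself is defined in the repository and not shown above. A minimal sketch consistent with the evaluation code below (a two-layer bottleneck MLP on the MoE output of the selected layer) might look like the following; the class signature, dimensions, residual combination, and the assumption that the wrapped module returns a plain tensor (Mixtral's MoE actually returns a tuple the real wrapper must unpack) are all ours, not the repo's:

```python
import torch
import torch.nn as nn

class LlamaMLPWrapper(nn.Module):
    """Illustrative sketch: wraps a layer's MoE block and, for the
    selected layer only, adds a trainable bottleneck Hook transform."""

    def __init__(self, moe, layer_id, hook_layer=14,
                 input_dim=4096, hidden_dim=1024):
        super().__init__()
        self.moe = moe
        self.layer_id = layer_id
        self.transform = None
        if layer_id == hook_layer:
            self.transform = nn.Sequential(
                nn.Linear(input_dim, hidden_dim, bias=False),
                nn.ReLU(),
                nn.Linear(hidden_dim, input_dim, bias=False),
            )

    def forward(self, hidden_states):
        out = self.moe(hidden_states)
        if self.transform is not None:
            # Residual compensation is an assumption here; the paper's
            # exact way of combining Hook and MoE outputs may differ.
            out = out + self.transform(out)
        return out
```

With this shape, the freezing loop above trains only `transform`'s two weight matrices while the base model stays fixed.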

Run training

cd ./train/LLaMA-Factory
llamafactory-cli train ./examples/train_hook/mixtral_pt.yaml
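The `mixtral_pt.yaml` file referenced above lives in the repository; for orientation, a LLaMA-Factory pre-training config generally follows this shape (all values below are illustrative placeholders, not the paper's actual hyperparameters):

```yaml
### model
model_name_or_path: mistralai/Mixtral-8x7B-v0.1

### method
stage: pt              # continued pre-training objective
do_train: true
finetuning_type: full  # the patch above already froze everything but the Hook

### dataset (illustrative calibration set)
dataset: c4_demo
cutoff_len: 2048

### output / training
output_dir: saves/mixtral-hook
per_device_train_batch_size: 1
learning_rate: 1.0e-4
num_train_epochs: 1.0
bf16: true
```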

Eval

We use lm-evaluation-harness for evaluation.

Install lm-eval package

cd ./eval/lm-evaluation-harness
pip install -e .

Set the trained Hook weight (.pt) paths in custom_mixtral8_7b.py

# Rebuild the Hook transform for layer 14 and load its trained weights
if layer_id == 14:
    self.transform = nn.Sequential(
        nn.Linear(input_dim, hidden_dim, bias=False, dtype=torch.bfloat16),
        nn.ReLU(),
        nn.Linear(hidden_dim, input_dim, bias=False, dtype=torch.bfloat16)
    )
    path_0 = '/path/to/14_transform_0_weight.pt'
    path_2 = '/path/to/14_transform_2_weight.pt'
    self.transform[0].weight = torch.nn.Parameter(torch.load(path_0))
    self.transform[2].weight = torch.nn.Parameter(torch.load(path_2))
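The two `.pt` files loaded above must first be exported from the trained Hook module. A small helper sketch (the function name is hypothetical; only the `<layer>_transform_<index>_weight.pt` naming matches what the snippet above expects) could be:

```python
import os
import torch
import torch.nn as nn

def save_hook_weights(transform: nn.Sequential, layer_id: int, out_dir: str):
    """Export the two Linear weights of a Hook transform as separate .pt
    files named like '14_transform_0_weight.pt' (hypothetical helper)."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for idx in (0, 2):  # positions of the two nn.Linear layers in the Sequential
        path = os.path.join(out_dir, f"{layer_id}_transform_{idx}_weight.pt")
        torch.save(transform[idx].weight.detach().cpu(), path)
        paths.append(path)
    return paths
```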

Run evaluation

cd ./eval/lm-evaluation-harness
bash eval_model.sh

Questions

Feel free to discuss the paper or code with us via issues or email.

Citation

If you find our paper and code useful in your research, please cite:

@inproceedings{longkai-etal-2025-hookmoe,
    title = "{H}ook{M}o{E}: A learnable performance compensation strategy of Mixture-of-Experts for {LLM} inference acceleration",
    author = "Longkai, Cheng  and
      He, Along  and
      Li, Mulin  and
      Xueshuo, Xie  and
      Li, Tao",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    year = "2025",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.1610/",
    pages = "31582--31594",
    ISBN = "979-8-89176-332-6"
}
