[Feature] Support Qwen3-MoE model via fused MoE #59
base: main
Conversation
support qwen3_moe
support moe
@DarkSharpness Could you review my PR?
Delete .ds_store
update __init__.py
from typing import Tuple
from minisgl.layers.moe.topk import select_experts
from sgl_kernel import moe_align_block_size as sgl_moe_align_block_size
Please format this file by running:
pip install pre-commit
pre-commit install
pre-commit run -a
Okay, I’ll do it right now
@triton.jit
def fused_moe_kernel(
Move all triton kernels to minisgl/kernel
Got it.
@DarkSharpness
I have migrated the Triton kernels to minisgl/kernel and run pre-commit run -a; please see the screenshots below. After the migration I also redeployed, and both qwen3-8b and qwen3-30b deployed successfully.
Please let me know if there are any issues or if further modifications are needed. I am more than happy to make any necessary adjustments.
Great. MoE is a complex module and I will take a detailed look and test tomorrow.
@DarkSharpness Alternatively, could we set up a Discord channel? It would facilitate real-time communication and make it easier for more contributors to get involved.
@DarkSharpness I've created a Discord channel for mini-sgl. Some of my friends are also interested in helping develop and improve the project, so I thought this would be a great opportunity for everyone to join and collaborate together. Here is the invite link: https://discord.gg/wA5g4msx
@DarkSharpness Got it, I've joined. Thanks
pre-commit
Decoupling
fix pre-commit run -a
fix pre-commit
@DarkSharpness I have added the --moe-backend argument to args and abstracted the MoE module. This makes it more convenient to add future MoE implementations and avoids code coupling.
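For context, here is a minimal sketch of what such a decoupled MoE interface could look like; BaseMoEBackend and its method signature are my own illustrative assumptions for this comment, not the actual classes in this PR.

```python
# Illustrative sketch only -- the real abstraction in this PR may differ.
from abc import ABC, abstractmethod

import torch


class BaseMoEBackend(ABC):
    """Interface a MoE implementation (fused, naive, future EP, ...) can satisfy,
    so model code depends on this abstraction rather than a concrete kernel."""

    @abstractmethod
    def forward(
        self,
        hidden_states: torch.Tensor,  # [num_tokens, hidden_size]
        router_logits: torch.Tensor,  # [num_tokens, num_experts]
    ) -> torch.Tensor:
        """Return the combined expert outputs with the same shape as hidden_states."""
```

A concrete backend selected via --moe-backend would then implement this forward method.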
import torch
import triton
import triton.language as tl
from minisgl.kernel.triton.fused_moe import fused_moe_kernel
It's OK to put all the Triton kernels in this file.
if isinstance(param, torch.Tensor):
    item = state_dict.pop(_concat_prefix(prefix, name))
    if "experts" in prefix:
I guess we need to refactor the weight loading logic. The current state_dict and load_state_dict implementation in the main branch is terrible. This needs to be cleaned up in future PRs.
My thoughts exactly
python/minisgl/layers/base.py
Outdated
if not _internal and state_dict:
    raise RuntimeError(f"Unexpected keys in state_dict: {list(state_dict.keys())}")
    keys = list(state_dict.keys())
    raise RuntimeError(
keep the old logic
Got it. I'll make the changes right away
python/minisgl/layers/base.py
Outdated
if not _internal and state_dict:
    _ = prefix
    raise RuntimeError(f"Unexpected keys in state_dict: {list(state_dict.keys())}")
    keys = list(state_dict.keys())
keep the old logic
python/minisgl/layers/base.py
Outdated
if not _internal and state_dict:
    raise RuntimeError(f"Unexpected keys in state_dict: {list(state_dict.keys())}")
    keys = list(state_dict.keys())
keep the old logic
Got it. I'll make the changes right away
python/minisgl/models/__init__.py
Outdated
from .llama import LlamaForCausalLM

    return LlamaForCausalLM(model_config)
elif "qwen3" in model_name and "30b" in model_name:
Similar to weight_loading, this model class dispatch logic is also terrible... We must refactor this later
(Don't do the refactor within this PR, this is just some irrelevant comment)
Exactly.
Personally, I would put this into moe/__init__.py (and import FusedMoE within the get_moe_backend function).
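A rough sketch of that suggestion (the module layout and backend name below are assumptions inferred from the comment, not the code in this PR):

```python
# python/minisgl/layers/moe/__init__.py -- hypothetical sketch of the suggested layout.
# FusedMoE is imported lazily inside get_moe_backend so that importing the moe
# package stays cheap and does not pull in triton unless the backend is selected.


def get_moe_backend(name: str):
    if name == "fused_moe":
        from .fused_moe import FusedMoE  # local import, as suggested above
        return FusedMoE
    raise ValueError(f"Unsupported MoE backend: {name!r}")
```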
else:
    out_hidden_states = torch.empty_like(hidden_states)

for chunk in range((num_tokens // CHUNK_SIZE) + 1):
Remove this chunking logic. Modern LLM systems will chunk requests by default, and you can assume that num_tokens will never exceed CHUNK_SIZE (i.e. only 1 iteration). This loop will mess things up.
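Conceptually, under that assumption the loop collapses to a single kernel launch; a hedged sketch of the simplified control flow (run_fused_moe is a placeholder name, not the PR's actual helper):

```python
import torch


def forward_without_chunking(hidden_states: torch.Tensor, run_fused_moe) -> torch.Tensor:
    # Assumes the serving stack already bounds the batch, i.e. num_tokens <= CHUNK_SIZE,
    # so a single pass over all tokens suffices and no chunk loop is needed.
    out_hidden_states = torch.empty_like(hidden_states)
    run_fused_moe(hidden_states, out_hidden_states)  # placeholder for the fused kernel call
    return out_hidden_states
```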
Got it.
python/minisgl/models/config.py
Outdated
rotary_dim=head_dim,
max_position=config.max_position_embeddings,
base=config.rope_theta,
max_position=getattr(config, "max_position_embeddings", 2048),
Why do we need this change here? Could you share an example where max_position_embeddings is not provided?
Maybe I'm just overthinking it; I should have left this part as it was.
Refactoring the moe backend
keep old logic
keep old logic
keep old logic
keep old logic
add moe.py
pre-commit run -a
pre-commit run -a
@DarkSharpness I have refactored the moe_backend using attn_backend as a reference, making the overall MoE implementation more abstract and decoupled. I also added the ability to read the model architecture from the HF config.json and load the corresponding model class. Could you please take a look and review it?
I have implemented support for the Qwen3-MoE model in mini-sglang; it can now be loaded and deployed directly. According to results from bench_qwen.py, MoE throughput in mini-sglang is comparable to that of sglang. The fused MoE implementation follows sglang's approach and supports multi-GPU Tensor Parallel (TP) deployment.
support #9
In summary, this PR achieves the following:
Implemented Fused MoE at the layer level.
Added support for the Qwen3-MoE model.
Enabled Multi-GPU Tensor Parallelism for Qwen3-MoE.
Verified that throughput performance is on par with Sglang through benchmarks.
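For reviewers who want to sanity-check the math, below is a naive, unfused reference of what a fused MoE layer computes for Qwen3-style SwiGLU experts. This is only a readability sketch under my own assumptions (stacked [num_experts, ...] weight tensors, softmax top-k routing with renormalization); it is not the Triton kernel in this PR, and the TP all-reduce over sharded expert outputs is omitted.

```python
import torch
import torch.nn.functional as F


def moe_reference(
    hidden_states: torch.Tensor,   # [num_tokens, hidden_size]
    router_logits: torch.Tensor,   # [num_tokens, num_experts]
    w_gate: torch.Tensor,          # [num_experts, intermediate, hidden_size]
    w_up: torch.Tensor,            # [num_experts, intermediate, hidden_size]
    w_down: torch.Tensor,          # [num_experts, hidden_size, intermediate]
    top_k: int,
) -> torch.Tensor:
    # Top-k routing: softmax over experts, keep the k largest, renormalize the weights.
    probs = F.softmax(router_logits, dim=-1, dtype=torch.float32)
    topk_w, topk_ids = probs.topk(top_k, dim=-1)
    topk_w = (topk_w / topk_w.sum(dim=-1, keepdim=True)).to(hidden_states.dtype)

    out = torch.zeros_like(hidden_states)
    for e in range(w_gate.shape[0]):
        token_idx, slot = (topk_ids == e).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue
        x = hidden_states[token_idx]
        # SwiGLU expert MLP: down(silu(gate(x)) * up(x)).
        h = F.silu(x @ w_gate[e].T) * (x @ w_up[e].T)
        out[token_idx] += topk_w[token_idx, slot, None] * (h @ w_down[e].T)
    return out
```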
Next, I will present the benchmark results. The tests so far were conducted on a machine with H800 GPUs.
First, here is the benchmark comparison of Qwen3-30B-MoE between mini-sglang and sglang when TP=1.
The launch command is:
mini-sglang tp=1 benchmark:
The launch command is:
sglang tp=1 benchmark:
For a more intuitive comparison, I have created a line chart to visualize the results.
Next, here is the comparison when TP=2.
The launch command is:
mini-sglang tp=2 benchmark:
The launch command is:
sglang tp=2 benchmark:
Finally, I ran an accuracy test to ensure the model's generation is correct. We used the GSM8K dataset for benchmarking, but for a quick verification, we only tested the first 100 samples.
results:
This MoE implementation is still quite simple and only supports Tensor Parallelism (no Expert Parallelism yet). I truly appreciate the mini-sglang project, and I hope to contribute to its growth through this PR and to keep helping refine the project over time.