[Feature] Support Qwen3-MoE model via fused MoE #59
base: main
Conversation
support qwen3_moe
support moe
@DarkSharpness Could you review my PR?
Delete .ds_store
update __init__.py
from typing import Tuple
from minisgl.layers.moe.topk import select_experts
from sgl_kernel import moe_align_block_size as sgl_moe_align_block_size
Please format this file by running:
pip install pre-commit
pre-commit install
pre-commit run -a
Okay, I’ll do it right now
@triton.jit
def fused_moe_kernel(
Move all triton kernels to minisgl/kernel
Got it.
@DarkSharpness
I have migrated the Triton kernels to minisgl/kernel and run pre-commit run -a; please see the screenshots below. After the migration I also redeployed, and both qwen3-8b and qwen3-30b deployed successfully.
Please let me know if there are any issues or if further modifications are needed. I am more than happy to make any necessary adjustments.
Great. MoE is a complex module and I will take a detailed look and test tomorrow.
@DarkSharpness Alternatively, could we set up a Discord channel? It would facilitate real-time communication and make it easier for more contributors to get involved.
@DarkSharpness I've created a Discord channel for mini-sgl. Some of my friends are also interested in helping develop and improve the project, so I thought this would be a great opportunity for everyone to join and collaborate together. Here is the invite link: https://discord.gg/wA5g4msx
@DarkSharpness Got it, I've joined. Thanks
pre-commit
Decoupling
fix pre-commit run -a
fix pre-commit
@DarkSharpness I have added the --moe-backend argument to args and abstracted the MoE module. This makes it more convenient to add future MoE implementations and avoids code coupling.
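For context, here is a minimal sketch of what such a decoupled MoE interface could look like; BaseMoEBackend and its method signature are my own illustrative assumptions for this comment, not the actual classes in this PR.

```python
# Illustrative sketch only -- the real abstraction in this PR may differ.
from abc import ABC, abstractmethod

import torch


class BaseMoEBackend(ABC):
    """Interface a MoE implementation (fused, naive, future EP, ...) can satisfy,
    so model code depends on this abstraction rather than a concrete kernel."""

    @abstractmethod
    def forward(
        self,
        hidden_states: torch.Tensor,  # [num_tokens, hidden_size]
        router_logits: torch.Tensor,  # [num_tokens, num_experts]
    ) -> torch.Tensor:
        """Return the combined expert outputs with the same shape as hidden_states."""
```

A concrete backend selected via --moe-backend would then implement this forward method.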
import torch
import triton
import triton.language as tl
from minisgl.kernel.triton.fused_moe import fused_moe_kernel
It's OK to put all the Triton kernels in this file.
if isinstance(param, torch.Tensor):
    item = state_dict.pop(_concat_prefix(prefix, name))
    if "experts" in prefix:
I guess we need to refactor the weight loading logic. The current state_dict and load_state_dict implementation in the main branch is terrible. This needs to be cleaned up in future PRs.
My thoughts exactly
python/minisgl/layers/base.py
Outdated
if not _internal and state_dict:
    raise RuntimeError(f"Unexpected keys in state_dict: {list(state_dict.keys())}")
    keys = list(state_dict.keys())
    raise RuntimeError(
keep the old logic
Got it. I'll make the changes right away
python/minisgl/layers/base.py
Outdated
if not _internal and state_dict:
    _ = prefix
    raise RuntimeError(f"Unexpected keys in state_dict: {list(state_dict.keys())}")
    keys = list(state_dict.keys())
keep the old logic
python/minisgl/layers/base.py
Outdated
if not _internal and state_dict:
    raise RuntimeError(f"Unexpected keys in state_dict: {list(state_dict.keys())}")
    keys = list(state_dict.keys())
keep the old logic
Got it. I'll make the changes right away
python/minisgl/models/__init__.py
Outdated
from .llama import LlamaForCausalLM

    return LlamaForCausalLM(model_config)
elif "qwen3" in model_name and "30b" in model_name:
Similar to weight_loading, this model class dispatch logic is also terrible... We must refactor this later
(Don't do the refactor within this PR, this is just some irrelevant comment)
Exactly.
Personally, I would put this into moe/__init__.py (and import FusedMoE within the get_moe_backend function).
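A rough sketch of that suggestion (the module layout and backend name below are assumptions inferred from the comment, not the code in this PR):

```python
# python/minisgl/layers/moe/__init__.py -- hypothetical sketch of the suggested layout.
# FusedMoE is imported lazily inside get_moe_backend so that importing the moe
# package stays cheap and does not pull in triton unless the backend is selected.


def get_moe_backend(name: str):
    if name == "fused_moe":
        from .fused_moe import FusedMoE  # local import, as suggested above
        return FusedMoE
    raise ValueError(f"Unsupported MoE backend: {name!r}")
```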
else:
    out_hidden_states = torch.empty_like(hidden_states)

for chunk in range((num_tokens // CHUNK_SIZE) + 1):
Remove this chunking logic. Modern LLM systems will chunk requests by default, and you can assume that num_tokens will never exceed CHUNK_SIZE (i.e. only 1 iteration). This loop will mess things up.
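Conceptually, under that assumption the loop collapses to a single kernel launch; a hedged sketch of the simplified control flow (run_fused_moe is a placeholder name, not the PR's actual helper):

```python
import torch


def forward_without_chunking(hidden_states: torch.Tensor, run_fused_moe) -> torch.Tensor:
    # Assumes the serving stack already bounds the batch, i.e. num_tokens <= CHUNK_SIZE,
    # so a single pass over all tokens suffices and no chunk loop is needed.
    out_hidden_states = torch.empty_like(hidden_states)
    run_fused_moe(hidden_states, out_hidden_states)  # placeholder for the fused kernel call
    return out_hidden_states
```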
Got it.
python/minisgl/models/config.py
Outdated
rotary_dim=head_dim,
max_position=config.max_position_embeddings,
base=config.rope_theta,
max_position=getattr(config, "max_position_embeddings", 2048),
Why do we need this change here? Could you share an example where max_position_embeddings is not provided?
Maybe I'm just overthinking it; I should have left this part as it was.
Refactoring the moe backend
keep old logic
keep old logic
keep old logic
keep old logic
add moe.py
pre-commit run -a
pre-commit run -a
@DarkSharpness I have refactored the moe_backend using attn_backend as a reference, making the overall MoE implementation more abstract and decoupled. I also added the ability to read the model architecture from the HF config.json and load the corresponding model class. Could you please take a look and review it?
I have implemented support for the Qwen3-MoE model in mini-sglang; it can now be loaded and deployed directly. According to results from bench_qwen.py, MoE throughput in mini-sglang is comparable to that of sglang. The fused MoE implementation follows sglang's approach and supports multi-GPU Tensor Parallel (TP) deployment.
support #9
In summary, this PR achieves the following:
Implemented Fused MoE at the layer level.
Added support for the Qwen3-MoE model.
Enabled Multi-GPU Tensor Parallelism for Qwen3-MoE.
Verified that throughput performance is on par with Sglang through benchmarks.
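For reviewers who want to sanity-check the math, below is a naive, unfused reference of what a fused MoE layer computes for Qwen3-style SwiGLU experts. This is only a readability sketch under my own assumptions (stacked [num_experts, ...] weight tensors, softmax top-k routing with renormalization); it is not the Triton kernel in this PR, and the TP all-reduce over sharded expert outputs is omitted.

```python
import torch
import torch.nn.functional as F


def moe_reference(
    hidden_states: torch.Tensor,   # [num_tokens, hidden_size]
    router_logits: torch.Tensor,   # [num_tokens, num_experts]
    w_gate: torch.Tensor,          # [num_experts, intermediate, hidden_size]
    w_up: torch.Tensor,            # [num_experts, intermediate, hidden_size]
    w_down: torch.Tensor,          # [num_experts, hidden_size, intermediate]
    top_k: int,
) -> torch.Tensor:
    # Top-k routing: softmax over experts, keep the k largest, renormalize the weights.
    probs = F.softmax(router_logits, dim=-1, dtype=torch.float32)
    topk_w, topk_ids = probs.topk(top_k, dim=-1)
    topk_w = (topk_w / topk_w.sum(dim=-1, keepdim=True)).to(hidden_states.dtype)

    out = torch.zeros_like(hidden_states)
    for e in range(w_gate.shape[0]):
        token_idx, slot = (topk_ids == e).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue
        x = hidden_states[token_idx]
        # SwiGLU expert MLP: down(silu(gate(x)) * up(x)).
        h = F.silu(x @ w_gate[e].T) * (x @ w_up[e].T)
        out[token_idx] += topk_w[token_idx, slot, None] * (h @ w_down[e].T)
    return out
```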
Next, I will present the benchmark results. The tests so far were conducted on a machine with H800 GPUs.
First, here is the benchmark comparison of Qwen3-30B-MoE between mini-sglang and sglang when TP=1.
The launch command is:
mini-sglang tp=1 benchmark:
The launch command is:
sglang tp=1 benchmark:
For a more intuitive comparison, I have created a line chart to visualize the results.
Next, here is the comparison when TP=2.
The launch command is:
mini-sglang tp=2 benchmark:
The launch command is:
sglang tp=2 benchmark:
Finally, I ran an accuracy test to ensure the model's generation is correct. We used the GSM8K dataset for benchmarking, but for a quick verification, we only tested the first 100 samples.
results:
This MoE implementation is still quite simple and only supports Tensor Parallelism (no Expert Parallelism yet). I truly appreciate the mini-sglang project, and I hope to contribute to its growth through this PR and to keep helping refine the project over time.