@jiahe7ay (Contributor) commented Jan 5, 2026

I have implemented support for the Qwen3-MoE model in mini-sglang. The model can now be loaded and deployed directly. According to results from bench_qwen.py, MoE throughput in mini-sglang is comparable to that of SGLang. The Fused MoE implementation follows SGLang's approach and supports multi-GPU Tensor Parallel (TP) deployment.

support #9

In summary, this PR achieves the following:

- Implemented Fused MoE at the layer level.
- Added support for the Qwen3-MoE model.
- Enabled multi-GPU Tensor Parallelism for Qwen3-MoE.
- Verified through benchmarks that throughput is on par with SGLang.
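
For readers unfamiliar with MoE layers, the routing step that a fused MoE kernel batches can be sketched in plain Python. This is an illustrative sketch only, not the Triton kernel from this PR: the function names, the scalar "experts", and the `renormalize` flag are assumptions made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_topk(router_logits, top_k, renormalize=True):
    """Pick the top_k experts for one token; return (expert_ids, weights)."""
    probs = softmax(router_logits)
    ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    weights = [probs[i] for i in ids]
    if renormalize:
        # Rescale so the selected experts' weights sum to 1.
        s = sum(weights)
        weights = [w / s for w in weights]
    return ids, weights


def moe_forward(x, router_logits, experts, top_k):
    """Weighted sum of the selected experts' outputs for one token."""
    ids, weights = route_topk(router_logits, top_k)
    return sum(w * experts[i](x) for i, w in zip(ids, weights))


# Toy experts: expert k multiplies its input by k (hypothetical stand-ins for MLPs).
experts = [lambda x, k=k: k * x for k in range(4)]
logits = [0.0, 2.0, 1.0, -1.0]
ids, weights = route_topk(logits, top_k=2)
out = moe_forward(1.0, logits, experts, top_k=2)
print(ids, weights, out)
```

A fused implementation performs the same selection and weighted sum for every token at once, grouping tokens by expert so that each expert's matrix multiply runs as one batched operation.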

Next, I will present the benchmark results. The tests were initially conducted on a machine equipped with H800 GPUs.

First, here is the benchmark comparison of Qwen3-30B-MoE between mini-sglang and sglang when TP=1.

The launch command is:

python -m minisgl --model /mnt/models/Qwen/qwen3_30b_moe  --tp 1 --disable-pynccl   --cuda-graph-max-bs 256

mini-sglang tp=1 benchmark:

[2026-01-05|16:23:22] INFO     Start benchmarking with 1000 requests using model /mnt/models/Qwen/qwen3_30b_moe...
Requests sent 100%|██████|     1000/1000 [    6.49it/s    02:34/00:00   ]
Requests done 100%|██████|     1000/1000 [    6.49it/s    02:34/00:00   ]
Prefill token 100%|██████|     1000/1000 [    6.49it/s    02:34/00:00   ]
Decode token  100%|██████| 357310/357310 [ 2318.57it/s    02:34/00:00   ]
[2026-01-05|16:25:57] INFO     Max inflight requests: 157, Max queued requests: 10
[2026-01-05|16:25:57] INFO     Num requests: #1000, Num tokens: #359491
[2026-01-05|16:25:57] INFO     TTFT: 126.89 ms (p50: 109.70 ms, p90: 158.57 ms, p99: 659.98 ms, max:   1035 ms)
[2026-01-05|16:25:57] INFO     TPOT:  37.06 ms (p50:  28.90 ms, p90:  65.07 ms, p99: 157.24 ms, max: 807.71 ms)
[2026-01-05|16:25:57] INFO     E2E:   13.37  s (p50:  11.21  s, p90:  26.39  s, p99:  53.11  s, max:  99.36  s)
[2026-01-05|16:25:57] INFO     Duration: 153.08 s
[2026-01-05|16:25:57] INFO     Throughput:   2348 token/s, 6.5324 req/s
Requests sent 100%|██████|     1000/1000 [    5.52it/s    03:01/00:00   ]
Requests done 100%|██████|     1000/1000 [    5.52it/s    03:01/00:00   ]
Prefill token 100%|██████|     1000/1000 [    5.52it/s    03:01/00:00   ]
Decode token  100%|██████| 357310/357310 [ 1970.72it/s    03:01/00:00   ]
[2026-01-05|16:28:58] INFO     Max inflight requests: 88, Max queued requests: 8
[2026-01-05|16:28:58] INFO     Num requests: #1000, Num tokens: #359391
[2026-01-05|16:28:58] INFO     TTFT:  95.31 ms (p50:  87.96 ms, p90: 132.29 ms, p99: 235.77 ms, max: 374.48 ms)
[2026-01-05|16:28:58] INFO     TPOT:  18.17 ms (p50:  12.60 ms, p90:  39.25 ms, p99:  90.83 ms, max: 343.63 ms)
[2026-01-05|16:28:58] INFO     E2E:  6.5895  s (p50: 5.1240  s, p90:  13.51  s, p99:  29.79  s, max:  49.54  s)
[2026-01-05|16:28:58] INFO     Duration: 180.30 s
[2026-01-05|16:28:58] INFO     Throughput:   1993 token/s, 5.5463 req/s
Requests sent 100%|██████|     1000/1000 [    4.87it/s    03:25/00:00   ]
Requests done 100%|██████|     1000/1000 [    4.87it/s    03:25/00:00   ]
Prefill token 100%|██████|     1000/1000 [    4.87it/s    03:25/00:00   ]
Decode token  100%|██████| 357310/357310 [ 1739.20it/s    03:25/00:00   ]
[2026-01-05|16:32:24] INFO     Max inflight requests: 49, Max queued requests: 8
[2026-01-05|16:32:24] INFO     Num requests: #1000, Num tokens: #359359
[2026-01-05|16:32:24] INFO     TTFT:  88.81 ms (p50:  81.10 ms, p90: 125.00 ms, p99: 187.38 ms, max: 318.41 ms)
[2026-01-05|16:32:24] INFO     TPOT:  14.67 ms (p50:  11.49 ms, p90:  23.64 ms, p99:  69.33 ms, max: 261.67 ms)
[2026-01-05|16:32:24] INFO     E2E:  5.3326  s (p50: 4.3726  s, p90:  10.69  s, p99:  22.89  s, max:  44.93  s)
[2026-01-05|16:32:24] INFO     Duration: 204.44 s
[2026-01-05|16:32:24] INFO     Throughput:   1757 token/s, 4.8915 req/s
Requests sent 100%|██████|     1000/1000 [    4.25it/s    03:55/00:00   ]
Requests done 100%|██████|     1000/1000 [    4.25it/s    03:55/00:00   ]
Prefill token 100%|██████|     1000/1000 [    4.25it/s    03:55/00:00   ]
Decode token  100%|██████| 357310/357310 [ 1517.84it/s    03:55/00:00   ]
[2026-01-05|16:36:19] INFO     Max inflight requests: 40, Max queued requests: 8
[2026-01-05|16:36:19] INFO     Num requests: #1000, Num tokens: #359356
[2026-01-05|16:36:19] INFO     TTFT:  83.47 ms (p50:  73.03 ms, p90: 120.09 ms, p99: 192.51 ms, max: 266.95 ms)
[2026-01-05|16:36:19] INFO     TPOT:  11.96 ms (p50: 9.6328 ms, p90:  13.97 ms, p99:  60.65 ms, max: 242.92 ms)
[2026-01-05|16:36:19] INFO     E2E:  4.3572  s (p50: 3.5358  s, p90: 8.5287  s, p99:  20.56  s, max:  45.49  s)
[2026-01-05|16:36:19] INFO     Duration: 234.40 s
[2026-01-05|16:36:19] INFO     Throughput:   1533 token/s, 4.2662 req/s
Requests sent 100%|██████|     1000/1000 [    3.77it/s    04:25/00:00   ]
Requests done 100%|██████|     1000/1000 [    3.77it/s    04:25/00:00   ]
Prefill token 100%|██████|     1000/1000 [    3.77it/s    04:25/00:00   ]
Decode token  100%|██████| 357310/357310 [ 1346.27it/s    04:25/00:00   ]
[2026-01-05|16:40:45] INFO     Max inflight requests: 34, Max queued requests: 6
[2026-01-05|16:40:45] INFO     Num requests: #1000, Num tokens: #359345
[2026-01-05|16:40:45] INFO     TTFT:  82.11 ms (p50:  70.75 ms, p90: 116.98 ms, p99: 180.83 ms, max: 309.87 ms)
[2026-01-05|16:40:45] INFO     TPOT:  11.08 ms (p50: 9.2071 ms, p90:  11.65 ms, p99:  49.46 ms, max: 267.36 ms)
[2026-01-05|16:40:45] INFO     E2E:  4.0406  s (p50: 3.3624  s, p90: 7.8921  s, p99:  20.07  s, max:  45.44  s)
[2026-01-05|16:40:45] INFO     Duration: 264.40 s
[2026-01-05|16:40:45] INFO     Throughput:   1359 token/s, 3.7821 req/s
Requests sent 100%|██████|     1000/1000 [    1.96it/s    08:29/00:00   ]
Requests done 100%|██████|     1000/1000 [    1.96it/s    08:29/00:00   ]
Prefill token 100%|██████|     1000/1000 [    1.96it/s    08:29/00:00   ]
Decode token  100%|██████| 357310/357310 [  701.60it/s    08:29/00:00   ]
[2026-01-05|16:49:14] INFO     Max inflight requests: 21, Max queued requests: 3
[2026-01-05|16:49:15] INFO     Num requests: #1000, Num tokens: #359324
[2026-01-05|16:49:15] INFO     TTFT:  74.35 ms (p50:  63.94 ms, p90: 106.86 ms, p99: 150.30 ms, max: 268.74 ms)
[2026-01-05|16:49:15] INFO     TPOT: 8.7083 ms (p50: 8.0260 ms, p90: 9.0444 ms, p99:  43.70 ms, max: 148.49 ms)
[2026-01-05|16:49:15] INFO     E2E:  3.1860  s (p50: 2.6301  s, p90: 6.1256  s, p99:  15.87  s, max:  38.94  s)
[2026-01-05|16:49:15] INFO     Duration: 508.27 s
[2026-01-05|16:49:15] INFO     Throughput: 706.96 token/s, 1.9675 req/s
[2026-01-05|16:49:15] INFO     Benchmarking completed.

The launch command is:

python3 -m sglang.launch_server --model-path /mnt/models/Qwen/qwen3_30b_moe  --tp 1  --port 1919 

sglang tp=1 benchmark:

[2026-01-05|17:02:12] INFO     Start benchmarking with 1000 requests using model /mnt/models/Qwen/qwen3_30b_moe...
Requests sent 100%|██████|     1000/1000 [    6.36it/s    02:37/00:00   ]
Requests done 100%|██████|     1000/1000 [    6.36it/s    02:37/00:00   ]
Prefill token 100%|██████|     1000/1000 [    6.36it/s    02:37/00:00   ]
Decode token   99%|█████▉| 352801/357310 [ 2242.38it/s    02:37/00:02   ]
[2026-01-05|17:04:50] INFO     Max inflight requests: 128, Max queued requests: 10
[2026-01-05|17:04:50] INFO     Num requests: #1000, Num tokens: #355313
[2026-01-05|17:04:50] INFO     TTFT:  98.45 ms (p50:  94.70 ms, p90: 135.84 ms, p99: 220.60 ms, max: 605.24 ms)
[2026-01-05|17:04:50] INFO     TPOT:  28.61 ms (p50:  22.15 ms, p90:  47.71 ms, p99: 139.73 ms, max: 611.94 ms)
[2026-01-05|17:04:50] INFO     E2E:   10.21  s (p50: 8.3017  s, p90:  20.85  s, p99:  42.45  s, max:  75.36  s)
[2026-01-05|17:04:50] INFO     Duration: 156.31 s
[2026-01-05|17:04:50] INFO     Throughput:   2273 token/s, 6.3974 req/s
Requests sent 100%|██████|     1000/1000 [    5.40it/s    03:05/00:00   ]
Requests done 100%|██████|     1000/1000 [    5.40it/s    03:05/00:00   ]
Prefill token 100%|██████|     1000/1000 [    5.40it/s    03:05/00:00   ]
Decode token   99%|█████▉| 353027/357310 [ 1907.33it/s    03:05/00:02   ]
[2026-01-05|17:07:55] INFO     Max inflight requests: 92, Max queued requests: 10
[2026-01-05|17:07:55] INFO     Num requests: #1000, Num tokens: #355433
[2026-01-05|17:07:55] INFO     TTFT:  94.01 ms (p50:  86.31 ms, p90: 132.91 ms, p99: 198.02 ms, max: 518.89 ms)
[2026-01-05|17:07:55] INFO     TPOT:  24.70 ms (p50:  20.04 ms, p90:  43.99 ms, p99: 111.82 ms, max: 792.54 ms)
[2026-01-05|17:07:55] INFO     E2E:  8.8244  s (p50: 7.2252  s, p90:  17.73  s, p99:  38.48  s, max:  69.10  s)
[2026-01-05|17:07:55] INFO     Duration: 184.07 s
[2026-01-05|17:07:55] INFO     Throughput:   1930 token/s, 5.4327 req/s
Requests sent 100%|██████|     1000/1000 [    4.69it/s    03:33/00:00   ]
Requests done 100%|██████|     1000/1000 [    4.69it/s    03:33/00:00   ]
Prefill token 100%|██████|     1000/1000 [    4.69it/s    03:33/00:00   ]
Decode token   99%|█████▉| 353100/357310 [ 1654.73it/s    03:33/00:02   ]
[2026-01-05|17:11:29] INFO     Max inflight requests: 67, Max queued requests: 8
[2026-01-05|17:11:29] INFO     Num requests: #1000, Num tokens: #355503
[2026-01-05|17:11:29] INFO     TTFT:  85.74 ms (p50:  78.55 ms, p90: 117.02 ms, p99: 173.26 ms, max: 423.73 ms)
[2026-01-05|17:11:29] INFO     TPOT:  21.82 ms (p50:  18.38 ms, p90:  42.30 ms, p99:  82.36 ms, max: 422.90 ms)
[2026-01-05|17:11:29] INFO     E2E:  7.7997  s (p50: 6.3952  s, p90:  15.67  s, p99:  35.94  s, max:  59.08  s)
[2026-01-05|17:11:29] INFO     Duration: 212.37 s
[2026-01-05|17:11:29] INFO     Throughput:   1673 token/s, 4.7087 req/s
Requests sent 100%|██████|     1000/1000 [    4.15it/s    04:00/00:00   ]
Requests done 100%|██████|     1000/1000 [    4.15it/s    04:00/00:00   ]
Prefill token 100%|██████|     1000/1000 [    4.15it/s    04:00/00:00   ]
Decode token   99%|█████▉| 353391/357310 [ 1467.86it/s    04:00/00:02   ]
[2026-01-05|17:15:29] INFO     Max inflight requests: 56, Max queued requests: 10
[2026-01-05|17:15:30] INFO     Num requests: #1000, Num tokens: #355819
[2026-01-05|17:15:30] INFO     TTFT:  83.01 ms (p50:  72.48 ms, p90: 115.51 ms, p99: 207.54 ms, max: 496.66 ms)
[2026-01-05|17:15:30] INFO     TPOT:  19.77 ms (p50:  17.01 ms, p90:  30.92 ms, p99:  69.58 ms, max: 624.89 ms)
[2026-01-05|17:15:30] INFO     E2E:  7.0763  s (p50: 5.8383  s, p90:  14.12  s, p99:  33.07  s, max:  56.50  s)
[2026-01-05|17:15:30] INFO     Duration: 239.74 s
[2026-01-05|17:15:30] INFO     Throughput:   1484 token/s, 4.1712 req/s
Requests sent 100%|██████|     1000/1000 [    3.71it/s    04:29/00:00   ]
Requests done 100%|██████|     1000/1000 [    3.71it/s    04:29/00:00   ]
Prefill token 100%|██████|     1000/1000 [    3.71it/s    04:29/00:00   ]
Decode token   99%|█████▉| 353598/357310 [ 1310.23it/s    04:29/00:02   ]
[2026-01-05|17:20:00] INFO     Max inflight requests: 48, Max queued requests: 6
[2026-01-05|17:20:00] INFO     Num requests: #1000, Num tokens: #356067
[2026-01-05|17:20:00] INFO     TTFT:  79.51 ms (p50:  69.99 ms, p90: 111.69 ms, p99: 163.54 ms, max: 236.67 ms)
[2026-01-05|17:20:00] INFO     TPOT:  18.36 ms (p50:  16.08 ms, p90:  23.47 ms, p99:  64.71 ms, max: 378.16 ms)
[2026-01-05|17:20:00] INFO     E2E:  6.5792  s (p50: 5.3202  s, p90:  13.08  s, p99:  31.60  s, max:  57.75  s)
[2026-01-05|17:20:00] INFO     Duration: 268.86 s
[2026-01-05|17:20:00] INFO     Throughput:   1324 token/s, 3.7194 req/s
Requests sent 100%|██████|     1000/1000 [    1.96it/s    08:30/00:00   ]
Requests done 100%|██████|     1000/1000 [    1.96it/s    08:30/00:00   ]
Prefill token 100%|██████|     1000/1000 [    1.96it/s    08:30/00:00   ]
Decode token  100%|█████▉| 355573/357310 [  696.19it/s    08:30/00:02   ]
[2026-01-05|17:28:30] INFO     Max inflight requests: 25, Max queued requests: 4
[2026-01-05|17:28:31] INFO     Num requests: #1000, Num tokens: #358371
[2026-01-05|17:28:31] INFO     TTFT:  66.15 ms (p50:  58.42 ms, p90:  98.00 ms, p99: 130.92 ms, max: 174.77 ms)
[2026-01-05|17:28:31] INFO     TPOT:  12.65 ms (p50:  11.80 ms, p90:  14.25 ms, p99:  46.01 ms, max: 433.40 ms)
[2026-01-05|17:28:31] INFO     E2E:  4.5748  s (p50: 3.7151  s, p90: 8.8337  s, p99:  22.33  s, max:  56.86  s)
[2026-01-05|17:28:31] INFO     Duration: 509.73 s
[2026-01-05|17:28:31] INFO     Throughput: 703.06 token/s, 1.9618 req/s
[2026-01-05|17:28:31] INFO     Benchmarking completed.
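
The throughput lines in the two logs above can be compared programmatically. The following is a small sketch that assumes the log format shown here; the helper names are made up for the example, and only the first two runs of each log are included. Note that the sglang runs report slightly fewer total tokens than the mini-sglang runs, so the token/s figures are close but not computed over identical token counts.

```python
import re

# Matches lines such as "Throughput:   2348 token/s, 6.5324 req/s".
THROUGHPUT_RE = re.compile(r"Throughput:\s*([\d.]+) token/s,\s*([\d.]+) req/s")


def parse_throughput(log_text):
    """Return a list of (tokens_per_s, requests_per_s) tuples, in log order."""
    return [(float(t), float(r)) for t, r in THROUGHPUT_RE.findall(log_text)]


def relative_gap(a, b):
    """Relative difference of a with respect to b."""
    return (a - b) / b


mini_log = """
[2026-01-05|16:25:57] INFO     Throughput:   2348 token/s, 6.5324 req/s
[2026-01-05|16:28:58] INFO     Throughput:   1993 token/s, 5.5463 req/s
"""
sglang_log = """
[2026-01-05|17:04:50] INFO     Throughput:   2273 token/s, 6.3974 req/s
[2026-01-05|17:07:55] INFO     Throughput:   1930 token/s, 5.4327 req/s
"""

mini_runs = parse_throughput(mini_log)
sglang_runs = parse_throughput(sglang_log)
for (m_tok, _), (s_tok, _) in zip(mini_runs, sglang_runs):
    gap = relative_gap(m_tok, s_tok)
    print(f"mini-sglang {m_tok:.0f} vs sglang {s_tok:.0f} token/s ({gap:+.1%})")
```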

For a more intuitive comparison, I have created a line chart to visualize the results.

[Image: tp1 (line chart comparing mini-sglang and sglang at TP=1)]

Next, here is the comparison when TP=2.

The launch command is:

python -m minisgl --model /mnt/models/Qwen/qwen3_30b_moe  --tp 2 --disable-pynccl   --cuda-graph-max-bs 256

mini-sglang tp=2 benchmark:

[2026-01-05|14:58:06] INFO     Start benchmarking with 1000 requests using model /mnt/models/Qwen/qwen3_30b_moe...
Requests sent 100%|██████|     1000/1000 [    6.20it/s    02:41/00:00   ]
Requests done 100%|██████|     1000/1000 [    6.20it/s    02:41/00:00   ]
Prefill token 100%|██████|     1000/1000 [    6.20it/s    02:41/00:00   ]
Decode token  100%|██████| 357310/357310 [ 2214.73it/s    02:41/00:00   ]
[2026-01-05|15:00:48] INFO     Max inflight requests: 218, Max queued requests: 17
[2026-01-05|15:00:48] INFO     Num requests: #1000, Num tokens: #359404
[2026-01-05|15:00:48] INFO     TTFT: 257.57 ms (p50: 247.60 ms, p90: 336.94 ms, p99: 724.43 ms, max:   1496 ms)
[2026-01-05|15:00:48] INFO     TPOT:  44.05 ms (p50:  17.33 ms, p90:  95.13 ms, p99: 546.03 ms, max:   1073 ms)
[2026-01-05|15:00:48] INFO     E2E:   16.00  s (p50:  12.56  s, p90:  33.39  s, p99:  66.53  s, max: 107.83  s)
[2026-01-05|15:00:48] INFO     Duration: 160.31 s
[2026-01-05|15:00:48] INFO     Throughput:   2241 token/s, 6.2379 req/s
Requests sent 100%|██████|     1000/1000 [    5.37it/s    03:06/00:00   ]
Requests done 100%|██████|     1000/1000 [    5.37it/s    03:06/00:00   ]
Prefill token 100%|██████|     1000/1000 [    5.37it/s    03:06/00:00   ]
Decode token  100%|██████| 357310/357310 [ 1918.42it/s    03:06/00:00   ]
[2026-01-05|15:03:54] INFO     Max inflight requests: 125, Max queued requests: 12
[2026-01-05|15:03:54] INFO     Num requests: #1000, Num tokens: #359382
[2026-01-05|15:03:54] INFO     TTFT: 230.52 ms (p50: 229.52 ms, p90: 312.51 ms, p99: 364.28 ms, max: 495.91 ms)
[2026-01-05|15:03:54] INFO     TPOT:  32.14 ms (p50:  16.99 ms, p90:  79.21 ms, p99: 351.51 ms, max: 909.16 ms)
[2026-01-05|15:03:54] INFO     E2E:   11.72  s (p50: 8.9679  s, p90:  24.59  s, p99:  49.76  s, max:  88.37  s)
[2026-01-05|15:03:54] INFO     Duration: 185.24 s
[2026-01-05|15:03:54] INFO     Throughput:   1940 token/s, 5.3984 req/s
Requests sent 100%|██████|     1000/1000 [    4.70it/s    03:32/00:00   ]
Requests done 100%|██████|     1000/1000 [    4.70it/s    03:32/00:00   ]
Prefill token 100%|██████|     1000/1000 [    4.70it/s    03:32/00:00   ]
Decode token  100%|██████| 357310/357310 [ 1678.05it/s    03:32/00:00   ]
[2026-01-05|15:07:27] INFO     Max inflight requests: 89, Max queued requests: 10
[2026-01-05|15:07:27] INFO     Num requests: #1000, Num tokens: #359372
[2026-01-05|15:07:27] INFO     TTFT: 212.17 ms (p50: 215.79 ms, p90: 301.82 ms, p99: 365.31 ms, max: 538.08 ms)
[2026-01-05|15:07:27] INFO     TPOT:  25.30 ms (p50:  13.13 ms, p90:  45.20 ms, p99: 244.20 ms, max: 820.31 ms)
[2026-01-05|15:07:27] INFO     E2E:  9.2530  s (p50: 7.3189  s, p90:  18.79  s, p99:  42.27  s, max:  72.86  s)
[2026-01-05|15:07:27] INFO     Duration: 211.92 s
[2026-01-05|15:07:27] INFO     Throughput:   1695 token/s, 4.7187 req/s
Requests sent 100%|██████|     1000/1000 [    4.18it/s    03:59/00:00   ]
Requests done 100%|██████|     1000/1000 [    4.18it/s    03:59/00:00   ]
Prefill token 100%|██████|     1000/1000 [    4.18it/s    03:59/00:00   ]
Decode token  100%|██████| 357310/357310 [ 1494.15it/s    03:59/00:00   ]
[2026-01-05|15:11:26] INFO     Max inflight requests: 61, Max queued requests: 8
[2026-01-05|15:11:27] INFO     Num requests: #1000, Num tokens: #359354
[2026-01-05|15:11:27] INFO     TTFT: 202.93 ms (p50: 206.54 ms, p90: 293.34 ms, p99: 346.54 ms, max: 376.01 ms)
[2026-01-05|15:11:27] INFO     TPOT:  19.43 ms (p50:  10.79 ms, p90:  26.91 ms, p99: 141.81 ms, max:   1118 ms)
[2026-01-05|15:11:27] INFO     E2E:  7.1471  s (p50: 5.7765  s, p90:  14.33  s, p99:  33.95  s, max:  55.06  s)
[2026-01-05|15:11:27] INFO     Duration: 238.13 s
[2026-01-05|15:11:27] INFO     Throughput:   1509 token/s, 4.1994 req/s
Requests sent 100%|██████|     1000/1000 [    3.72it/s    04:28/00:00   ]
Requests done 100%|██████|     1000/1000 [    3.72it/s    04:28/00:00   ]
Prefill token 100%|██████|     1000/1000 [    3.72it/s    04:28/00:00   ]
Decode token  100%|██████| 357310/357310 [ 1330.23it/s    04:28/00:00   ]
[2026-01-05|15:15:55] INFO     Max inflight requests: 51, Max queued requests: 8
[2026-01-05|15:15:55] INFO     Num requests: #1000, Num tokens: #359361
[2026-01-05|15:15:55] INFO     TTFT: 194.56 ms (p50: 193.30 ms, p90: 279.30 ms, p99: 349.36 ms, max: 462.76 ms)
[2026-01-05|15:15:55] INFO     TPOT:  17.26 ms (p50:  10.27 ms, p90:  20.00 ms, p99: 133.23 ms, max: 671.54 ms)
[2026-01-05|15:15:55] INFO     E2E:  6.3640  s (p50: 5.1628  s, p90:  12.55  s, p99:  29.83  s, max:  55.41  s)
[2026-01-05|15:15:55] INFO     Duration: 267.60 s
[2026-01-05|15:15:55] INFO     Throughput:   1342 token/s, 3.7369 req/s
Requests sent 100%|██████|     1000/1000 [    1.96it/s    08:30/00:00   ]
Requests done 100%|██████|     1000/1000 [    1.96it/s    08:30/00:00   ]
Prefill token 100%|██████|     1000/1000 [    1.96it/s    08:30/00:00   ]
Decode token  100%|██████| 357310/357310 [  699.87it/s    08:30/00:00   ]
[2026-01-05|14:24:49] INFO     Max inflight requests: 24, Max queued requests: 5
[2026-01-05|14:24:49] INFO     Num requests: #1000, Num tokens: #359332
[2026-01-05|14:24:49] INFO     TTFT: 129.77 ms (p50: 109.58 ms, p90: 186.49 ms, p99: 256.30 ms, max: 280.30 ms)
[2026-01-05|14:24:49] INFO     TPOT: 9.9564 ms (p50: 8.1882 ms, p90:  10.01 ms, p99:  84.50 ms, max: 380.30 ms)
[2026-01-05|14:24:49] INFO     E2E:  3.6875  s (p50: 3.0475  s, p90: 7.1581  s, p99:  18.33  s, max:  44.84  s)
[2026-01-05|14:24:49] INFO     Duration: 509.53 s
[2026-01-05|14:24:49] INFO     Throughput: 705.22 token/s, 1.9626 req/s
[2026-01-05|14:24:49] INFO     Benchmarking completed.

The launch command is:

python3 -m sglang.launch_server --model-path /mnt/models/Qwen/qwen3_30b_moe  --tp 2  --port 1919 

sglang tp=2 benchmark:

[2026-01-05|15:56:08] INFO     Start benchmarking with 1000 requests using model /mnt/models/Qwen/qwen3_30b_moe...
Requests sent 100%|██████|     1000/1000 [    6.78it/s    02:27/00:00   ]
Requests done 100%|██████|     1000/1000 [    6.78it/s    02:27/00:00   ]
Prefill token 100%|██████|     1000/1000 [    6.78it/s    02:27/00:00   ]
Decode token   98%|█████▉| 350789/357310 [ 2376.95it/s    02:27/00:02   ]
[2026-01-05|15:58:35] INFO     Max inflight requests: 108, Max queued requests: 9
[2026-01-05|15:58:35] INFO     Num requests: #1000, Num tokens: #353074
[2026-01-05|15:58:35] INFO     TTFT: 117.59 ms (p50:  95.63 ms, p90: 207.66 ms, p99: 379.78 ms, max: 566.05 ms)
[2026-01-05|15:58:35] INFO     TPOT:  21.22 ms (p50:  13.90 ms, p90:  41.41 ms, p99: 179.75 ms, max: 668.99 ms)
[2026-01-05|15:58:35] INFO     E2E:  7.5683  s (p50: 6.0035  s, p90:  15.57  s, p99:  32.35  s, max:  60.51  s)
[2026-01-05|15:58:35] INFO     Duration: 146.56 s
[2026-01-05|15:58:35] INFO     Throughput:   2409 token/s, 6.8232 req/s
Requests sent 100%|██████|     1000/1000 [    5.72it/s    02:54/00:00   ]
Requests done 100%|██████|     1000/1000 [    5.72it/s    02:54/00:00   ]
Prefill token 100%|██████|     1000/1000 [    5.72it/s    02:54/00:00   ]
Decode token   98%|█████▊| 349537/357310 [ 1998.08it/s    02:54/00:03   ]
[2026-01-05|16:01:30] INFO     Max inflight requests: 69, Max queued requests: 9
[2026-01-05|16:01:30] INFO     Num requests: #1000, Num tokens: #351813
[2026-01-05|16:01:30] INFO     TTFT: 101.33 ms (p50:  82.33 ms, p90: 173.59 ms, p99: 380.05 ms, max: 478.26 ms)
[2026-01-05|16:01:30] INFO     TPOT:  17.10 ms (p50:  12.27 ms, p90:  34.32 ms, p99: 112.50 ms, max: 614.90 ms)
[2026-01-05|16:01:30] INFO     E2E:  6.0838  s (p50: 4.8755  s, p90:  12.21  s, p99:  29.63  s, max:  45.74  s)
[2026-01-05|16:01:30] INFO     Duration: 173.91 s
[2026-01-05|16:01:30] INFO     Throughput:   2022 token/s, 5.7500 req/s
Requests sent 100%|██████|     1000/1000 [    4.92it/s    03:23/00:00   ]
Requests done 100%|██████|     1000/1000 [    4.92it/s    03:23/00:00   ]
Prefill token 100%|██████|     1000/1000 [    4.92it/s    03:23/00:00   ]
Decode token   99%|█████▉| 352661/357310 [ 1734.92it/s    03:23/00:02   ]
[2026-01-05|16:04:54] INFO     Max inflight requests: 55, Max queued requests: 8
[2026-01-05|16:04:54] INFO     Num requests: #1000, Num tokens: #355074
[2026-01-05|16:04:54] INFO     TTFT:  91.40 ms (p50:  70.75 ms, p90: 148.32 ms, p99: 328.66 ms, max: 485.49 ms)
[2026-01-05|16:04:54] INFO     TPOT:  14.70 ms (p50:  11.31 ms, p90:  19.73 ms, p99:  83.82 ms, max: 591.45 ms)
[2026-01-05|16:04:54] INFO     E2E:  5.2833  s (p50: 4.2166  s, p90:  10.53  s, p99:  25.51  s, max:  44.58  s)
[2026-01-05|16:04:54] INFO     Duration: 202.26 s
[2026-01-05|16:04:54] INFO     Throughput:   1755 token/s, 4.9442 req/s
Requests sent 100%|██████|     1000/1000 [    4.29it/s    03:53/00:00   ]
Requests done 100%|██████|     1000/1000 [    4.29it/s    03:53/00:00   ]
Prefill token 100%|██████|     1000/1000 [    4.29it/s    03:53/00:00   ]
Decode token   99%|█████▉| 352795/357310 [ 1513.04it/s    03:53/00:02   ]
[2026-01-05|16:08:47] INFO     Max inflight requests: 41, Max queued requests: 7
[2026-01-05|16:08:47] INFO     Num requests: #1000, Num tokens: #355312
[2026-01-05|16:08:47] INFO     TTFT:  86.12 ms (p50:  63.49 ms, p90: 143.87 ms, p99: 308.34 ms, max: 479.04 ms)
[2026-01-05|16:08:47] INFO     TPOT:  13.03 ms (p50:  10.31 ms, p90:  13.76 ms, p99:  63.56 ms, max: 565.53 ms)
[2026-01-05|16:08:47] INFO     E2E:  4.6905  s (p50: 3.8275  s, p90: 9.1918  s, p99:  22.62  s, max:  45.81  s)
[2026-01-05|16:08:47] INFO     Duration: 232.15 s
[2026-01-05|16:08:47] INFO     Throughput:   1530 token/s, 4.3075 req/s
Requests sent 100%|██████|     1000/1000 [    3.80it/s    04:22/00:00   ]
Requests done 100%|██████|     1000/1000 [    3.80it/s    04:22/00:00   ]
Prefill token 100%|██████|     1000/1000 [    3.80it/s    04:22/00:00   ]
Decode token   99%|█████▉| 353054/357310 [ 1343.18it/s    04:22/00:03   ]
[2026-01-05|16:13:10] INFO     Max inflight requests: 35, Max queued requests: 6
[2026-01-05|16:13:10] INFO     Num requests: #1000, Num tokens: #355519
[2026-01-05|16:13:10] INFO     TTFT:  82.83 ms (p50:  60.43 ms, p90: 136.34 ms, p99: 322.03 ms, max: 455.08 ms)
[2026-01-05|16:13:10] INFO     TPOT:  11.77 ms (p50: 9.6149 ms, p90:  11.98 ms, p99:  53.01 ms, max: 569.36 ms)
[2026-01-05|16:13:10] INFO     E2E:  4.2447  s (p50: 3.4942  s, p90: 8.2890  s, p99:  20.53  s, max:  46.70  s)
[2026-01-05|16:13:10] INFO     Duration: 261.83 s
[2026-01-05|16:13:10] INFO     Throughput:   1357 token/s, 3.8192 req/s
Requests sent 100%|██████|     1000/1000 [    1.98it/s    08:26/00:00   ]
Requests done 100%|██████|     1000/1000 [    1.98it/s    08:26/00:00   ]
Prefill token 100%|██████|     1000/1000 [    1.98it/s    08:26/00:00   ]
Decode token   99%|█████▉| 354607/357310 [  700.51it/s    08:26/00:03   ]
[2026-01-05|16:21:37] INFO     Max inflight requests: 21, Max queued requests: 5
[2026-01-05|16:21:37] INFO     Num requests: #1000, Num tokens: #357302
[2026-01-05|16:21:37] INFO     TTFT:  69.70 ms (p50:  50.63 ms, p90: 118.17 ms, p99: 289.39 ms, max: 387.65 ms)
[2026-01-05|16:21:37] INFO     TPOT: 7.9237 ms (p50: 7.1553 ms, p90: 8.6552 ms, p99:  40.34 ms, max: 532.50 ms)
[2026-01-05|16:21:37] INFO     E2E:  2.8850  s (p50: 2.4000  s, p90: 5.5705  s, p99:  14.27  s, max:  36.88  s)
[2026-01-05|16:21:37] INFO     Duration: 505.20 s
[2026-01-05|16:21:37] INFO     Throughput: 707.25 token/s, 1.9794 req/s
[2026-01-05|16:21:37] INFO     Benchmarking completed.

Finally, I ran an accuracy test to check that the model generates correct output. I used the GSM8K dataset and, for quick verification, tested only the first 100 samples.

results:

Reading file: gsm8k.parquet
Starting evaluation for qwen-30b-moe, total 100 samples...
Accuracy: 96.77%:  93%|███████████████▊ | 93/100 [10:30<00:44,  6.30s/it]
Accuracy: 96.00%: 100%|████████████████| 100/100 [12:04<00:00,  7.24s/it]

Evaluation finished! Accuracy: 96.00% (96/100)

This MoE implementation is currently quite simple and supports only Tensor Parallelism (Expert Parallelism is not supported yet). I truly appreciate the mini-sglang project, and I hope to contribute to its growth through this PR and to keep helping refine it over time.

support qwen3_moe
support moe
@jiahe7ay (Contributor Author) commented Jan 5, 2026

@DarkSharpness Could you review my PR?

Delete .ds_store
@jiahe7ay jiahe7ay mentioned this pull request Jan 6, 2026
update __init_.py
@DarkSharpness DarkSharpness self-assigned this Jan 6, 2026
from typing import Tuple
from minisgl.layers.moe.topk import select_experts
from sgl_kernel import moe_align_block_size as sgl_moe_align_block_size

Collaborator

Please format this file by:

  1. pip install pre-commit
  2. pre-commit install
  3. pre-commit run -a

Contributor Author

Okay, I’ll do it right now



@triton.jit
def fused_moe_kernel(
Collaborator

Move all triton kernels to minisgl/kernel

Contributor Author

Got it.

Contributor Author

@DarkSharpness
I have migrated the Triton kernels to minisgl/kernel and executed pre-commit run -a; please see the screenshots below. Following the migration, I also performed deployments, and both qwen3-8b and qwen3-30b were deployed successfully.

[Screenshots: image, image]

Please let me know if there are any issues or if further modifications are needed. I am more than happy to make any necessary adjustments.

Collaborator

Great. MoE is a complex module and I will take a detailed look and test tomorrow.

Contributor Author

@jiahe7ay jiahe7ay Jan 8, 2026

@DarkSharpness Alternatively, could we set up a Discord channel? It would facilitate real-time communication and make it easier for more contributors to get involved.

Contributor Author

@DarkSharpness I've created a Discord channel for mini-sgl. Some of my friends are also interested in helping develop and improve the project, so I thought this would be a great opportunity for everyone to join and collaborate together. Here is the invite link: https://discord.gg/wA5g4msx

Collaborator

@jiahe7ay I opened a channel #mini-sglang in SGLang slack. You can join it via this link

Contributor Author

@jiahe7ay jiahe7ay Jan 9, 2026

@DarkSharpness Got it, I've joined. Thanks

pre-commit
Decoupling
fix
fix init
fix pre-commit run -a
fix pre-commit
@jiahe7ay (Contributor Author)

@DarkSharpness I have added the --moe-backend argument to args and abstracted the MoE module. This makes it easier to add future MoE implementations and reduces code coupling.
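
As a rough illustration of this kind of abstraction, the sketch below wires a `--moe-backend` flag to a small backend registry. Only the flag name comes from this PR; the registry, the class names, and the backend name `fused` are hypothetical and are not the code in this change.

```python
import argparse

# Hypothetical registry mapping backend names to backend classes.
MOE_BACKENDS = {}


def register_moe_backend(name):
    """Class decorator that records a backend under the given name."""
    def decorator(cls):
        MOE_BACKENDS[name] = cls
        return cls
    return decorator


class BaseMoeBackend:
    """Interface that every MoE backend in this sketch implements."""

    def forward(self, hidden_states, router_logits):
        raise NotImplementedError


@register_moe_backend("fused")  # "fused" is an assumed name for illustration
class FusedMoeBackend(BaseMoeBackend):
    def forward(self, hidden_states, router_logits):
        # Placeholder: a real backend would launch the fused kernel here.
        return hidden_states


def create_moe_backend(name):
    """Instantiate the backend registered under name."""
    if name not in MOE_BACKENDS:
        raise ValueError(f"Unknown MoE backend: {name!r}")
    return MOE_BACKENDS[name]()


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--moe-backend", default="fused", choices=sorted(MOE_BACKENDS))
    return parser


args = build_parser().parse_args(["--moe-backend", "fused"])
backend = create_moe_backend(args.moe_backend)
print(type(backend).__name__)
```

With this shape, adding another implementation means registering one more class; the model code only calls `create_moe_backend` and never imports a concrete backend directly.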

import torch
import triton
import triton.language as tl
from minisgl.kernel.triton.fused_moe import fused_moe_kernel
Collaborator

It's OK to put all the Triton kernels in this file.


if isinstance(param, torch.Tensor):
item = state_dict.pop(_concat_prefix(prefix, name))
if "experts" in prefix:
Collaborator

I guess we need to refactor the weight loading logic. The current state_dict and load_state_dict implementation in the main branch is terrible. This needs to be cleaned up in future PRs.

Contributor Author

My thoughts exactly

if not _internal and state_dict:
raise RuntimeError(f"Unexpected keys in state_dict: {list(state_dict.keys())}")
keys = list(state_dict.keys())
raise RuntimeError(
Collaborator

keep the old logic

Contributor Author

Got it. I'll make the changes right away

if not _internal and state_dict:
_ = prefix
raise RuntimeError(f"Unexpected keys in state_dict: {list(state_dict.keys())}")
keys = list(state_dict.keys())
Collaborator

keep the old logic


if not _internal and state_dict:
raise RuntimeError(f"Unexpected keys in state_dict: {list(state_dict.keys())}")
keys = list(state_dict.keys())
Collaborator

keep the old logic

Contributor Author

Got it. I'll make the changes right away

from .llama import LlamaForCausalLM

return LlamaForCausalLM(model_config)
elif "qwen3" in model_name and "30b" in model_name:
Collaborator

Similar to weight loading, this model class dispatch logic is also terrible... We must refactor this later.

(Don't do the refactor within this PR; this is just a side comment.)

Contributor Author

Exactly.

Collaborator

Personally, I would put this into moe/__init__.py (and import FusedMoE within the get_moe_backend function).

else:
out_hidden_states = torch.empty_like(hidden_states)

for chunk in range((num_tokens // CHUNK_SIZE) + 1):
Collaborator

Remove this chunking logic. Modern LLM systems will chunk requests by default, and you can assume that num_tokens will never exceed CHUNK_SIZE (i.e. only 1 iteration). This loop will mess things up.
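
To make the iteration count concrete, here is a small sketch of the loop bound from the diff, `range((num_tokens // CHUNK_SIZE) + 1)`. The chunk size of 4 and the helper names are assumptions for the example, and the loop body in the PR is not shown in this thread, so whether it guards against an empty chunk is unknown. The sketch shows that this bound yields a trailing empty chunk whenever `num_tokens` is an exact multiple of `CHUNK_SIZE`, whereas a ceiling division does not.

```python
CHUNK_SIZE = 4  # deliberately tiny, illustrative value


def chunk_bounds_floor_plus_one(num_tokens, chunk_size=CHUNK_SIZE):
    """Chunk boundaries produced by range((num_tokens // chunk_size) + 1)."""
    bounds = []
    for chunk in range((num_tokens // chunk_size) + 1):
        begin = chunk * chunk_size
        end = min(begin + chunk_size, num_tokens)
        bounds.append((begin, end))
    return bounds


def chunk_bounds_ceil(num_tokens, chunk_size=CHUNK_SIZE):
    """Chunk boundaries using ceiling division, which never yields an empty chunk."""
    num_chunks = (num_tokens + chunk_size - 1) // chunk_size
    return [
        (c * chunk_size, min((c + 1) * chunk_size, num_tokens))
        for c in range(num_chunks)
    ]


for n in (3, 4, 9):
    print(n, chunk_bounds_floor_plus_one(n), chunk_bounds_ceil(n))
```

If the scheduler already guarantees that `num_tokens` never exceeds `CHUNK_SIZE`, both helpers reduce to a single `(0, num_tokens)` range, which is why the loop can be dropped.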

Contributor Author

got it

rotary_dim=head_dim,
max_position=config.max_position_embeddings,
base=config.rope_theta,
max_position=getattr(config, "max_position_embeddings", 2048),
Collaborator

Why do we need this change here? Could you share an example where max_position_embeddings is not provided?

Contributor Author

Maybe I'm just overthinking it. I think I should have left this part as it was.

Refactoring the moe backend
fix
keep old logic
keep old logic
keep old logic
keep old logic
add moe.py
pre-commit run -a
pre-commit run -a
fix
@jiahe7ay (Contributor Author) commented Jan 20, 2026

@DarkSharpness I have refactored the moe_backend using attn_backend as a reference, making the overall MoE implementation more abstract and decoupled. I also added the functionality to read modules from the HF config.json and load the corresponding models. Could you please take a look and review it?
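
One way to dispatch on the Hugging Face config.json rather than on substrings of the model name is to read its `architectures` field. The sketch below is an assumption about the general approach, not the code in this PR: the registry contents and the function name are hypothetical, and the registry values are plain strings standing in for model classes.

```python
import json
import os
import tempfile

# Hypothetical registry: keys are values of the "architectures" field in a
# Hugging Face config.json; values stand in for the model classes to build.
MODEL_REGISTRY = {
    "LlamaForCausalLM": "llama",
    "Qwen3ForCausalLM": "qwen3",
    "Qwen3MoeForCausalLM": "qwen3_moe",
}


def resolve_model_key(model_dir):
    """Return the registry entry for the first supported architecture."""
    with open(os.path.join(model_dir, "config.json"), encoding="utf-8") as f:
        config = json.load(f)
    architectures = config.get("architectures", [])
    for arch in architectures:
        if arch in MODEL_REGISTRY:
            return MODEL_REGISTRY[arch]
    raise ValueError(f"Unsupported architectures: {architectures}")


# Demo with a throwaway config.json so the example is self-contained.
with tempfile.TemporaryDirectory() as model_dir:
    with open(os.path.join(model_dir, "config.json"), "w", encoding="utf-8") as f:
        json.dump({"architectures": ["Qwen3MoeForCausalLM"]}, f)
    resolved = resolve_model_key(model_dir)
print(resolved)
```

Keying on the architecture name avoids relying on how a local checkpoint directory happens to be named.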
