SemiAnalysisAI · functionstackx · Mar 5, 2026 · Mar 5, 2026
diff --git a/.github/workflows/claude-pr-review.yml b/.github/workflows/claude-pr-review.yml
@@ -129,6 +129,26 @@ jobs:
             - **STP (Single Token Prediction)**: Standard autoregressive decoding — one token per forward pass. No speculative decoding or MTP. Benchmarks labeled "STP only" use vanilla decoding.
             - **MTP (Multi-Token Prediction)**: Predicts multiple tokens per forward pass using speculative decoding (e.g., EAGLE, NEXTN).
 
+            ## Expert Parallelism Validation:
+            When reviewing benchmark scripts for MoE models, verify that expert parallelism flags are used correctly:
+
+            - **vLLM** (`vllm serve`): Uses `--enable-expert-parallel` (boolean flag). Does NOT accept `--expert-parallel-size`. EP size is automatically determined by vLLM.
+            - **SGLang** (`sglang.launch_server`): Uses `--expert-parallel-size N` (explicit integer).
+            - **ATOM** (AMD vLLM fork): Uses `--enable-expert-parallel` (same as vLLM).
+
+            **Required pattern for vLLM/ATOM scripts:**
+            Scripts must NOT hardcode `--enable-expert-parallel`. Instead, they should conditionally enable it based on the `EP_SIZE` env var:
+            ```bash
+            if [ "$EP_SIZE" -gt 1 ]; then
+              EP=" --enable-expert-parallel"
+            else
+              EP=" "
+            fi
+            ```
+            If a script hardcodes `--enable-expert-parallel` without checking `EP_SIZE`:
+            - This is a 🟡 **WARNING** issue
+            - Comment: "Expert parallelism should be conditional on the `EP_SIZE` env var from the config YAML, not hardcoded. Use the `if [ \"$EP_SIZE\" -gt 1 ]` pattern to conditionally enable `--enable-expert-parallel`."
+
             Remember: Silence is golden. No comment is better than a low-value comment.
 
             ## Container Image Accessibility Validation:

diff --git a/.github/workflows/claude.yml b/.github/workflows/claude.yml
@@ -274,3 +274,21 @@ jobs:
             - **STP (Single Token Prediction)**: Standard autoregressive decoding — one token per forward pass. No speculative decoding or MTP. Benchmarks labeled "STP only" use vanilla decoding.
             - **MTP (Multi-Token Prediction)**: Predicts multiple tokens per forward pass using speculative decoding (e.g., EAGLE, NEXTN).
 
+            ### Expert Parallelism in Benchmark Scripts
+            vLLM and SGLang handle expert parallelism differently. When writing or reviewing benchmark scripts for MoE models:
+
+            - **vLLM** (`vllm serve`): Uses `--enable-expert-parallel` (a boolean flag). vLLM does NOT accept `--expert-parallel-size`. When EP is enabled, vLLM automatically determines the EP size based on TP and the number of available GPUs.
+            - **SGLang** (`sglang.launch_server`): Uses `--expert-parallel-size N` (an explicit integer). Pass the `EP_SIZE` env var value directly.
+            - **ATOM** (AMD vLLM fork): Uses `--enable-expert-parallel` (same as vLLM).
+
+            **Required pattern for vLLM/ATOM scripts:** Scripts must conditionally enable `--enable-expert-parallel` based on the `EP_SIZE` env var from the config YAML, rather than hardcoding it:
+            ```bash
+            if [ "$EP_SIZE" -gt 1 ]; then
+              EP=" --enable-expert-parallel"
+            else
+              EP=" "
+            fi
+            # Then use $EP in the vllm serve command
+            ```
+            This ensures the script respects the `ep` setting in the master config YAML's search-space.
+