From 694d147c347899fe4eb68e594278292e170e4b46 Mon Sep 17 00:00:00 2001
From: "claude[bot]" <41898282+claude[bot]@users.noreply.github.com>
Date: Thu, 5 Mar 2026 04:06:44 +0000
Subject: [PATCH] Add expert parallelism documentation to Claude workflow
 configs

Document the differences between vLLM's --enable-expert-parallel (boolean
flag) and SGLang's --expert-parallel-size N (explicit integer) in both
claude.yml and claude-pr-review.yml. Scripts should conditionally enable
--enable-expert-parallel based on the EP_SIZE env var rather than
hardcoding it.

Co-authored-by: functionstackx
---
 .github/workflows/claude-pr-review.yml | 20 ++++++++++++++++++++
 .github/workflows/claude.yml           | 18 ++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/.github/workflows/claude-pr-review.yml b/.github/workflows/claude-pr-review.yml
index 1b5e3f96e..8c848b733 100644
--- a/.github/workflows/claude-pr-review.yml
+++ b/.github/workflows/claude-pr-review.yml
@@ -129,6 +129,26 @@ jobs:
 
 - **STP (Single Token Prediction)**: Standard autoregressive decoding — one token per forward pass. No speculative decoding or MTP. Benchmarks labeled "STP only" use vanilla decoding.
 - **MTP (Multi-Token Prediction)**: Predicts multiple tokens per forward pass using speculative decoding (e.g., EAGLE, NEXTN).
+## Expert Parallelism Validation:
+When reviewing benchmark scripts for MoE models, verify that expert parallelism flags are used correctly:
+
+- **vLLM** (`vllm serve`): Uses `--enable-expert-parallel` (boolean flag). Does NOT accept `--expert-parallel-size`. EP size is automatically determined by vLLM.
+- **SGLang** (`sglang.launch_server`): Uses `--expert-parallel-size N` (explicit integer).
+- **ATOM** (AMD vLLM fork): Uses `--enable-expert-parallel` (same as vLLM).
+
+**Required pattern for vLLM/ATOM scripts:**
+Scripts must NOT hardcode `--enable-expert-parallel`. Instead, they should conditionally enable it based on the `EP_SIZE` env var:
+```bash
+if [ "$EP_SIZE" -gt 1 ]; then
+  EP=" --enable-expert-parallel"
+else
+  EP=" "
+fi
+```
+If a script hardcodes `--enable-expert-parallel` without checking `EP_SIZE`:
+- This is a 🟡 **WARNING** issue
+- Comment: "Expert parallelism should be conditional on the `EP_SIZE` env var from the config YAML, not hardcoded. Use the `if [ \"$EP_SIZE\" -gt 1 ]` pattern to conditionally enable `--enable-expert-parallel`."
+
 Remember: Silence is golden. No comment is better than a low-value comment.
 
 ## Container Image Accessibility Validation:

diff --git a/.github/workflows/claude.yml b/.github/workflows/claude.yml
index c280ea3ca..1be4b1b98 100644
--- a/.github/workflows/claude.yml
+++ b/.github/workflows/claude.yml
@@ -274,3 +274,21 @@ jobs:
 
 - **STP (Single Token Prediction)**: Standard autoregressive decoding — one token per forward pass. No speculative decoding or MTP. Benchmarks labeled "STP only" use vanilla decoding.
 - **MTP (Multi-Token Prediction)**: Predicts multiple tokens per forward pass using speculative decoding (e.g., EAGLE, NEXTN).
+### Expert Parallelism in Benchmark Scripts
+vLLM and SGLang handle expert parallelism differently. When writing or reviewing benchmark scripts for MoE models:
+
+- **vLLM** (`vllm serve`): Uses `--enable-expert-parallel` (a boolean flag). vLLM does NOT accept `--expert-parallel-size`. When EP is enabled, vLLM automatically determines the EP size based on TP and the number of available GPUs.
+- **SGLang** (`sglang.launch_server`): Uses `--expert-parallel-size N` (an explicit integer). Pass the `EP_SIZE` env var value directly.
+- **ATOM** (AMD vLLM fork): Uses `--enable-expert-parallel` (same as vLLM).
+
+**Required pattern for vLLM/ATOM scripts:** Scripts must conditionally enable `--enable-expert-parallel` based on the `EP_SIZE` env var from the config YAML, rather than hardcoding it:
+```bash
+if [ "$EP_SIZE" -gt 1 ]; then
+  EP=" --enable-expert-parallel"
+else
+  EP=" "
+fi
+# Then use $EP in the vllm serve command
+```
+This ensures the script respects the `ep` setting in the master config YAML's search-space.
+
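
---

The engine-specific branching that both workflow prompts describe can also be factored into a single helper. A minimal sketch, not part of the patch itself: the `ep_flags` function name, the `engine` argument, and the `MODEL` placeholder are all illustrative assumptions; only the flag names come from the documentation above.

```bash
#!/usr/bin/env bash
# Map the EP_SIZE value from the config YAML to engine-specific
# expert-parallelism flags, per the rules documented in the workflows.
ep_flags() {
  local engine="$1" ep_size="$2"
  if [ "$ep_size" -gt 1 ]; then
    case "$engine" in
      # vLLM/ATOM: boolean flag only; EP size is derived automatically.
      vllm|atom) echo "--enable-expert-parallel" ;;
      # SGLang: the size must be spelled out explicitly.
      sglang)    echo "--expert-parallel-size $ep_size" ;;
    esac
  fi
  # EP_SIZE <= 1: emit nothing, so the flag is omitted entirely.
}

# Example: build a vLLM serve command line with EP_SIZE=8.
echo "vllm serve MODEL $(ep_flags vllm 8)"
```

With `EP_SIZE=1` the helper emits nothing, which satisfies the "no hardcoding" rule the review prompt enforces.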