Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 20 additions & 0 deletions .github/workflows/claude-pr-review.yml
Original file line number Diff line number Diff line change
Expand Up @@ -129,6 +129,26 @@ jobs:
- **STP (Single Token Prediction)**: Standard autoregressive decoding — one token per forward pass. No speculative decoding or MTP. Benchmarks labeled "STP only" use vanilla decoding.
- **MTP (Multi-Token Prediction)**: Predicts multiple tokens per forward pass using speculative decoding (e.g., EAGLE, NEXTN).

## Expert Parallelism Validation:
When reviewing benchmark scripts for MoE models, verify that expert parallelism flags are used correctly:

- **vLLM** (`vllm serve`): Uses `--enable-expert-parallel` (boolean flag). Does NOT accept `--expert-parallel-size`. EP size is automatically determined by vLLM.
- **SGLang** (`sglang.launch_server`): Uses `--expert-parallel-size N` (explicit integer).
- **ATOM** (AMD vLLM fork): Uses `--enable-expert-parallel` (same as vLLM).

**Required pattern for vLLM/ATOM scripts:**
Scripts must NOT hardcode `--enable-expert-parallel`. Instead, they should conditionally enable it based on the `EP_SIZE` env var:
```bash
if [ "$EP_SIZE" -gt 1 ]; then
EP=" --enable-expert-parallel"
else
EP=" "
fi
```
If a script hardcodes `--enable-expert-parallel` without checking `EP_SIZE`:
- This is a 🟡 **WARNING** issue
- Comment: "Expert parallelism should be conditional on the `EP_SIZE` env var from the config YAML, not hardcoded. Use the `if [ \"$EP_SIZE\" -gt 1 ]` pattern to conditionally enable `--enable-expert-parallel`."

Remember: Silence is golden. No comment is better than a low-value comment.

## Container Image Accessibility Validation:
Expand Down
18 changes: 18 additions & 0 deletions .github/workflows/claude.yml
Original file line number Diff line number Diff line change
Expand Up @@ -274,3 +274,21 @@ jobs:
- **STP (Single Token Prediction)**: Standard autoregressive decoding — one token per forward pass. No speculative decoding or MTP. Benchmarks labeled "STP only" use vanilla decoding.
- **MTP (Multi-Token Prediction)**: Predicts multiple tokens per forward pass using speculative decoding (e.g., EAGLE, NEXTN).

### Expert Parallelism in Benchmark Scripts
vLLM and SGLang handle expert parallelism differently. When writing or reviewing benchmark scripts for MoE models:

- **vLLM** (`vllm serve`): Uses `--enable-expert-parallel` (a boolean flag). vLLM does NOT accept `--expert-parallel-size`. When EP is enabled, vLLM automatically determines the EP size based on TP and the number of available GPUs.
- **SGLang** (`sglang.launch_server`): Uses `--expert-parallel-size N` (an explicit integer). Pass the `EP_SIZE` env var value directly.
- **ATOM** (AMD vLLM fork): Uses `--enable-expert-parallel` (same as vLLM).

**Required pattern for vLLM/ATOM scripts:** Scripts must conditionally enable `--enable-expert-parallel` based on the `EP_SIZE` env var from the config YAML, rather than hardcoding it:
```bash
if [ "$EP_SIZE" -gt 1 ]; then
EP=" --enable-expert-parallel"
else
EP=" "
fi
# Then use $EP in the vllm serve command
```
This ensures the script respects the `ep` setting in the master config YAML's search-space.