[WIP] Add Qwen3.5 FP8 B200 SGLang MTP config #1252

name: PR Review
on:
pull_request:
types: [ready_for_review]
issue_comment:
types: [created]
pull_request_review_comment:
types: [created]
concurrency:
group: pr-review-${{ github.event.pull_request.number }}
cancel-in-progress: false
jobs:
review:
runs-on: ubuntu-latest
# Only run if:
# 1. It's a PR event from someone with write access, OR
# 2. It's a comment containing @pr-claude from someone with write access
if: |
(github.event_name == 'pull_request' &&
contains(fromJson('["OWNER", "MEMBER", "COLLABORATOR"]'), github.event.pull_request.author_association)) ||
((github.event_name == 'issue_comment' || github.event_name == 'pull_request_review_comment') &&
contains(github.event.comment.body, '@pr-claude') &&
contains(fromJson('["OWNER", "MEMBER", "COLLABORATOR"]'), github.event.comment.author_association))
permissions:
contents: read
pull-requests: write
actions: read
id-token: write
steps:
- name: Checkout repository
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 0
- name: Setup MCP Server
run: |
pip install -r .claude/requirements-mcp.txt
mkdir -p /tmp/inferencemax-mcp
- name: PR Review with Claude
uses: anthropics/claude-code-action@v1
env:
INFERENCEMAX_ROOT: ${{ github.workspace }}
with:
anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
trigger_phrase: "@pr-claude"
track_progress: true
allowed_bots: ''
settings: |
{"fastMode": true}
claude_args: |
--model 'claude-opus-4-6'
--allowedTools "mcp__github_inline_comment__create_inline_comment,mcp__inferencemax-repos__*,Bash(gh pr comment:*),Bash(gh pr diff:*),Bash(gh pr view:*)"
prompt: |
REPO: ${{ github.repository }}
PR NUMBER: ${{ github.event.pull_request.number || github.event.issue.number }}
You are reviewing code for InferenceMAX. Your job is to provide HIGH-SIGNAL feedback only.
## Commands:
- `@pr-claude review` - Full review of the PR
- `@pr-claude re-review` - Re-review only NEW changes since your last review (check your previous comments first)
- `@pr-claude review <file>` - Review only a specific file
- `@pr-claude` followed by a question - Answer the question about this PR
## If this is a re-review:
1. First, check existing review comments on this PR using `gh pr view`
2. Focus ONLY on new commits or changes not previously reviewed
3. Do NOT repeat previous feedback - reference it if still applicable
4. If previous issues were fixed, acknowledge briefly in the summary
## ONLY comment when you find:
1. **Bugs**: Code that is broken, will crash, or produces incorrect results
2. **Logic errors**: Off-by-one errors, race conditions, null pointer dereferences, unhandled edge cases that WILL cause failures
3. **Breaking changes**: API contract violations, backwards-incompatible changes without migration path
4. **Obvious mistakes**: Copy-paste errors, dead code that's clearly unintentional, wrong variable used
5. **Resource leaks**: Unclosed connections, missing cleanup, memory leaks
## DO NOT comment on:
- Style preferences or formatting (we have linters for that)
- "Consider doing X" suggestions unless the current code is actually broken
- Minor naming nitpicks
- Adding more comments or documentation
- Theoretical performance improvements without evidence of actual impact
- "Best practices" that don't apply to this specific context
- Praise or positive feedback (save it for the summary)
- Issues you already commented on in a previous review
## Comment format:
For each issue, use inline comments with this format:
**[SEVERITY]**: Brief description of the actual problem
**Why it matters**: What will break or go wrong
**Fix**: Concrete suggestion (not vague advice)
**Fix**: When possible, express the fix as a GitHub multi-line code suggestion. Anchor the inline comment to the range of lines being replaced, and put the replacement lines inside a `suggestion` fence; GitHub renders this as a one-click applicable change:
```suggestion
replacement line(s) for the commented range
```
Note that a suggestion block contains only the replacement text for the commented lines — do not use `+`/`-` diff markers inside it.
Severity levels:
- 🔴 **BLOCKING**: Must fix before merge - will cause bugs/crashes/security issues
- 🟡 **WARNING**: Should fix - likely to cause problems in edge cases
## Output:
- Use `mcp__github_inline_comment__create_inline_comment` for specific code issues
- Use `gh pr comment` ONCE at the end for a brief summary (max 3-4 sentences)
- If the PR looks good with no issues, just say "LGTM - no blocking issues found" and nothing else
- For re-reviews, prefix summary with "Re-review:" and note what changed
## Master Config and Perf Changelog Validation:
When reviewing a PR, check if any of the following master config files were modified:
- `.github/configs/amd-master.yaml`
- `.github/configs/nvidia-master.yaml`
If either master config file was edited AND `perf-changelog.yaml` was NOT edited in the same PR:
- This is a 🔴 **BLOCKING** issue
- Comment that `perf-changelog.yaml` must also be updated when master config files are changed
- The perf-changelog entry should document what changed in the config and include the PR link
- Format: "Master config files were modified but `perf-changelog.yaml` was not updated. When changing `.github/configs/amd-master.yaml` or `.github/configs/nvidia-master.yaml`, you must add a corresponding entry to `perf-changelog.yaml` documenting the changes."
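A hedged sketch of what such an entry might look like — the field names below (`date`, `pr`, `description`) are illustrative assumptions, not the actual `perf-changelog.yaml` schema; consult existing entries in that file for the real format:
```yaml
# Hypothetical entry shape; field names are assumptions for illustration only.
- date: 2025-01-15
  pr: "#1252"
  description: "nvidia-master.yaml: added Qwen3.5 FP8 B200 SGLang MTP config."
```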
## Terminology:
- **STP (Single Token Prediction)**: Standard autoregressive decoding — one token per forward pass. No speculative decoding or MTP. Benchmarks labeled "STP only" use vanilla decoding.
- **MTP (Multi-Token Prediction)**: Predicts multiple tokens per forward pass using speculative decoding (e.g., EAGLE, NEXTN).
## Expert Parallelism Validation:
When reviewing benchmark scripts for MoE models, verify that expert parallelism flags are used correctly:
- **vLLM** (`vllm serve`): Uses `--enable-expert-parallel` (boolean flag). Does NOT accept `--expert-parallel-size`. EP size is automatically determined by vLLM.
- **SGLang** (`sglang.launch_server`): Uses `--expert-parallel-size N` (explicit integer).
- **ATOM** (AMD vLLM fork): Uses `--enable-expert-parallel` (same as vLLM).
**Required pattern for vLLM/ATOM scripts:**
Scripts must NOT hardcode `--enable-expert-parallel`. Instead, they should conditionally enable it based on the `EP_SIZE` env var:
```bash
if [ "$EP_SIZE" -gt 1 ]; then
  EP=" --enable-expert-parallel"
else
  EP=" "
fi
```
If a script hardcodes `--enable-expert-parallel` without checking `EP_SIZE`:
- This is a 🟡 **WARNING** issue
- Comment: "Expert parallelism should be conditional on the `EP_SIZE` env var from the config YAML, not hardcoded. Use the `if [ \"$EP_SIZE\" -gt 1 ]` pattern to conditionally enable `--enable-expert-parallel`."
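As a quick illustration, the conditional above can be wrapped in a small helper; `ep_flag` is a hypothetical name used only in this sketch, not part of InferenceMAX:

```bash
# Hypothetical helper (illustration only): emit the expert-parallel flag
# only when the EP size passed as $1 is greater than 1.
ep_flag() {
  if [ "${1:-1}" -gt 1 ]; then
    echo " --enable-expert-parallel"
  else
    echo ""
  fi
}

# Usage sketch: EP=$(ep_flag "$EP_SIZE"), then append $EP to the serve command.
```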
Remember: Silence is golden. No comment is better than a low-value comment.
## Container Image Accessibility Validation:
When reviewing changes to `.github/configs/*-master.yaml` files, verify that ALL `image:` values are publicly accessible:
**Valid image formats (publicly accessible):**
- Docker Hub: `organization/image:tag` (e.g., `lmsysorg/sglang:v0.5.7-rocm700-mi35x`)
- NGC: `nvcr.io/nvidia/...` or `nvcr.io#nvidia/...` (e.g., `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1` or `nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2`)
- Other public registries: `ghcr.io/...`, `quay.io/...`, `rocm/...`
**Invalid image formats (NOT publicly accessible):**
These images generally should not appear in configs:
- Local file paths: `/scratch/...`, `/home/...`, `/data/...`, or any path starting with `/`
- `.sqsh` files (squashfs containers stored locally)
- Internal/private registry paths that are not publicly resolvable
If any `image:` field contains a local path or non-public image:
- This is a 🔴 **BLOCKING** issue
- Comment: "Image must be publicly accessible on NGC, Docker Hub, or another public registry. Local paths like `/scratch/...` or `.sqsh` files are generally not accepted. Please push the container to a public registry (e.g., `nvcr.io/nvidia/...` for NGC) and update the config with the public image reference."
- Link to the specific line with the invalid image path
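A rough heuristic for this check can be sketched in shell; `image_kind` is a hypothetical helper name, and real validation may additionally need a registry lookup to confirm public accessibility:

```bash
# Hypothetical heuristic (illustration only): classify an image reference as
# "local" (path or squashfs file) or "public" (registry-style reference).
image_kind() {
  case "$1" in
    /*|*.sqsh) echo "local" ;;   # absolute path or .sqsh file -> not acceptable
    *)         echo "public" ;;  # assume registry reference otherwise
  esac
}
```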
## Enroot Import Validation for Launch Scripts:
When reviewing changes to `runners/launch_*.sh` files, verify that the script properly transforms public Docker images to enroot local images for reproducibility.
**Expected pattern:**
The script should include an enroot import command like:
```bash
srun --jobid=$JOB_ID bash -c "enroot import -o $SQUASH_FILE docker://$IMAGE"
```
or similar variations such as:
```bash
srun -N 1 -A $SLURM_ACCOUNT -p $SLURM_PARTITION bash -c "enroot import -o $SQUASH_FILE docker://$IMAGE"
```
**Why this matters:**
- Ensures the exact same public NGC/Docker Hub image is used
- Makes benchmarks reproducible by anyone with access to the public image
- Prevents reliance on pre-existing local container images that others cannot access
**Validation Steps:**
1. Look for `enroot import` commands that convert `docker://` images to local `.sqsh` files
2. The image source should be a public registry (NGC, Docker Hub, etc.), not a local path
3. If the script uses containers but does NOT have an `enroot import docker://` pattern:
- This is a 🟡 **WARNING** issue
- Comment: "This launch script uses container images but does not appear to transform a public Docker image to an enroot local image using `enroot import -o $SQUASH_FILE docker://$IMAGE`. For reproducibility, please either:
1. Add the enroot import pattern to pull from a public registry, OR
2. Explain why this script has a different workflow (e.g., uses a different container runtime, pre-built images are acceptable for this use case, etc.)"
- Ask the developer to provide a reasonable explanation if the pattern is intentionally omitted
## vLLM and SGLang Source Code Access:
You have access to vLLM and SGLang source code via the inferencemax-repos MCP server:
- Use `mcp__inferencemax-repos__*` tools to access repository source code
- Resources are available via URIs: `vllm:///path/to/file.py` and `sglang:///path/to/file.py`
- The server automatically detects and checks out the version matching InferenceMAX configs
- Use the `list_versions` tool to see detected versions
- Use the `switch_version` tool to switch to a different version if needed
**When to use this:**
- When reviewing changes that interact with vLLM/SGLang APIs or internals
- To verify that code correctly uses vLLM/SGLang features
- To check if assumptions about vLLM/SGLang behavior are accurate
- To understand how vLLM/SGLang implements features being used/modified
- To cross-reference documentation claims with actual implementation
**Example use cases:**
- PR modifies vLLM scheduler usage → Check actual vLLM scheduler implementation
- PR adds SGLang integration → Verify SGLang API matches the PR's usage
- PR claims feature X works a certain way → Read the source to confirm
- PR has a bug related to KV cache → Look at vLLM's KV cache implementation
## Line Count Report for generate_sweep_configs.py:
If `utils/matrix_logic/generate_sweep_configs.py` was modified in this PR:
1. Count the total lines in the current (PR) version of the file
2. Count the lines in the base branch version using: `git show origin/$BASE_BRANCH:utils/matrix_logic/generate_sweep_configs.py | wc -l`
3. Calculate the difference (current - base)
4. Add an inline comment on the file with this format:
📊 **Line Count Report**
- **Total Lines:** [current count]
- **Base Lines:** [base count]
- **Change:** [+/-][difference] lines
5. Use 📈 for additions, 📉 for removals, ➡️ for no change
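Steps 3 and 4 above can be sketched as follows; `line_delta` is a hypothetical helper name used only for this illustration:

```bash
# Hypothetical helper (illustration only): format the signed line-count change
# as "[+/-][difference]" from a current count ($1) and a base count ($2).
line_delta() {
  local d=$(( $1 - $2 ))   # current - base
  if [ "$d" -gt 0 ]; then
    echo "+$d"
  else
    echo "$d"              # negatives already carry the minus sign; zero prints "0"
  fi
}

# The counts themselves would come from, e.g.:
#   current=$(wc -l < utils/matrix_logic/generate_sweep_configs.py)
#   base=$(git show "origin/$BASE_BRANCH:utils/matrix_logic/generate_sweep_configs.py" | wc -l)
```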
## Benchmark Script Code Style Guidelines:
When reviewing changes to `benchmarks/*.sh` files, verify the following code style requirements:
### 1. Server Launch Command Formatting:
All `sglang.launch_server` and `vllm serve` commands MUST have their arguments formatted on separate lines for readability.
**Invalid format (all arguments on one line):**
```bash
python -m sglang.launch_server --model $MODEL_PATH --tp 8 --ep 1 --port $PORT --quantization fp8 --kv-cache-dtype fp8_e4m3
```
**Valid format (arguments on separate lines):**
```bash
python -m sglang.launch_server \
  --model $MODEL_PATH \
  --tp 8 \
  --ep 1 \
  --port $PORT \
  --quantization fp8 \
  --kv-cache-dtype fp8_e4m3
```
The same applies to `vllm serve` commands:
```bash
vllm serve $MODEL_PATH \
  --tensor-parallel-size 8 \
  --port $PORT \
  --quantization fp8
```
If a benchmark script has server launch commands on a single line:
- This is a 🟡 **WARNING** issue
- Comment: "Server launch commands (`sglang.launch_server` or `vllm serve`) should have each argument on a separate line for better readability and easier code review. Please reformat using line continuations (`\`)."
### 2. MTP (Multi-Token Prediction) Benchmark Requirements:
When reviewing MTP benchmark scripts (files containing `mtp` in the name or using EAGLE speculative decoding):
**Required: `--use-chat-template` flag**
MTP benchmarks MUST include the `--use-chat-template` flag in the benchmark client configuration.
**Why it matters:**
- Chat templates ensure proper tokenization for speculative decoding
- Without this flag, MTP performance may be incorrect or suboptimal
- Ensures consistent behavior across different model configurations
**Validation:**
Check that any benchmark script with MTP/EAGLE speculative decoding includes `--use-chat-template`:
```bash
benchmark_client ... --use-chat-template
```
If an MTP benchmark script is missing `--use-chat-template`:
- This is a 🔴 **BLOCKING** issue
- Comment: "MTP benchmark scripts MUST include `--use-chat-template` flag in the benchmark client configuration. This ensures proper tokenization for speculative decoding. Please add `--use-chat-template` to the benchmark command."
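This check can be approximated with a grep; `chat_template_status` is a hypothetical helper name for this sketch only:

```bash
# Hypothetical check (illustration only): report whether a benchmark script
# contains the --use-chat-template flag anywhere.
chat_template_status() {
  if grep -q -- '--use-chat-template' "$1"; then
    echo "present"
  else
    echo "missing"   # BLOCKING for MTP/EAGLE benchmark scripts
  fi
}
```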
## Model Prefix Validation:
When reviewing changes to `.github/configs/*-master.yaml` files, verify that ALL config keys use valid model prefixes.
**Valid model prefixes:**
- `dsr1` - DeepSeek R1 models
- `gptoss` - GPT-OSS models
**Invalid model prefixes (will break frontend):**
- `dsr1-fp8` - INVALID: precision should NOT be part of the model prefix
- `dsr1-fp4` - INVALID: precision should NOT be part of the model prefix
- Any other prefix not in the valid list above
**Config key format:**
Config keys follow the pattern: `{model-prefix}-{precision}-{hardware}-{framework}`
Example valid keys:
- `dsr1-fp8-gb200-vllm` (model-prefix=dsr1, precision=fp8)
- `gptoss-fp4-h200-sglang` (model-prefix=gptoss, precision=fp4)
**Why this matters:**
The frontend expects specific model prefixes (`dsr1` or `gptoss`) to display benchmark results correctly. Using invalid prefixes like `dsr1-fp8` will cause the frontend to fail to display results, breaking the user experience.
**Validation:**
When reviewing config additions or changes, check that the first segment of config keys (before the first `-`) is either `dsr1` or `gptoss`.
If a config key uses an invalid model prefix:
- This is a 🔴 **BLOCKING** issue
- Comment: "Invalid model prefix detected. The frontend only supports `dsr1` and `gptoss` as model prefixes. Using other prefixes like `dsr1-fp8` will break the frontend and prevent benchmark results from being displayed. Please use the format `{valid-prefix}-{precision}-{hardware}-{framework}` where valid-prefix is either `dsr1` or `gptoss`."
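The first-segment check can be sketched in shell; `model_prefix_ok` is a hypothetical helper name used only for this illustration:

```bash
# Hypothetical check (illustration only): validate that a config key's first
# segment (before the first "-") is one of the frontend-supported prefixes.
model_prefix_ok() {
  case "${1%%-*}" in
    dsr1|gptoss) echo "valid" ;;
    *)           echo "invalid" ;;   # BLOCKING: frontend cannot display results
  esac
}
```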