[NV] Kimi fp4 b200 vllm configs #1241
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| name: PR Review | |
| on: | |
| pull_request: | |
| types: [ready_for_review] | |
| issue_comment: | |
| types: [created] | |
| pull_request_review_comment: | |
| types: [created] | |
| concurrency: | |
| group: pr-review-${{ github.event.pull_request.number }} | |
| cancel-in-progress: false | |
| jobs: | |
| review: | |
| runs-on: ubuntu-latest | |
| # Only run if: | |
| # 1. It's a PR event from someone with write access, OR | |
| # 2. It's a comment containing @pr-claude from someone with write access | |
| if: | | |
| (github.event_name == 'pull_request' && | |
| contains(fromJson('["OWNER", "MEMBER", "COLLABORATOR"]'), github.event.pull_request.author_association)) || | |
| ((github.event_name == 'issue_comment' || github.event_name == 'pull_request_review_comment') && | |
| contains(github.event.comment.body, '@pr-claude') && | |
| contains(fromJson('["OWNER", "MEMBER", "COLLABORATOR"]'), github.event.comment.author_association)) | |
| permissions: | |
| contents: read | |
| pull-requests: write | |
| actions: read | |
| id-token: write | |
| steps: | |
| - name: Checkout repository | |
| uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 | |
| with: | |
| fetch-depth: 0 | |
| - name: Setup MCP Server | |
| run: | | |
| pip install -r .claude/requirements-mcp.txt | |
| mkdir -p /tmp/inferencemax-mcp | |
| - name: PR Review with Claude | |
| uses: anthropics/claude-code-action@v1 | |
| env: | |
| INFERENCEMAX_ROOT: ${{ github.workspace }} | |
| with: | |
| anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }} | |
| trigger_phrase: "@pr-claude" | |
| track_progress: true | |
| allowed_bots: '' | |
| settings: | | |
| {"fastMode": true} | |
| claude_args: | | |
| --model 'claude-opus-4-6' | |
| --allowedTools "mcp__github_inline_comment__create_inline_comment,mcp__inferencemax-repos__*,Bash(gh pr comment:*),Bash(gh pr diff:*),Bash(gh pr view:*)" | |
| prompt: | | |
| REPO: ${{ github.repository }} | |
| PR NUMBER: ${{ github.event.pull_request.number || github.event.issue.number }} | |
| You are reviewing code for InferenceMAX. Your job is to provide HIGH-SIGNAL feedback only. | |
| ## Commands: | |
| - `@pr-claude review` - Full review of the PR | |
| - `@pr-claude re-review` - Re-review only NEW changes since your last review (check your previous comments first) | |
| - `@pr-claude review <file>` - Review only a specific file | |
| - `@pr-claude` followed by a question - Answer the question about this PR | |
| ## If this is a re-review: | |
| 1. First, check existing review comments on this PR using `gh pr view` | |
| 2. Focus ONLY on new commits or changes not previously reviewed | |
| 3. Do NOT repeat previous feedback - reference it if still applicable | |
| 4. If previous issues were fixed, acknowledge briefly in the summary | |
| ## ONLY comment when you find: | |
| 1. **Bugs**: Code that is broken, will crash, or produces incorrect results | |
| 2. **Logic errors**: Off-by-one errors, race conditions, null pointer dereferences, unhandled edge cases that WILL cause failures | |
| 3. **Breaking changes**: API contract violations, backwards-incompatible changes without migration path | |
| 4. **Obvious mistakes**: Copy-paste errors, dead code that's clearly unintentional, wrong variable used | |
| 5. **Resource leaks**: Unclosed connections, missing cleanup, memory leaks | |
| ## DO NOT comment on: | |
| - Style preferences or formatting (we have linters for that) | |
| - "Consider doing X" suggestions unless the current code is actually broken | |
| - Minor naming nitpicks | |
| - Adding more comments or documentation | |
| - Theoretical performance improvements without evidence of actual impact | |
| - "Best practices" that don't apply to this specific context | |
| - Praise or positive feedback (save it for the summary) | |
| - Issues you already commented on in a previous review | |
| ## Comment format: | |
| For each issue, use inline comments with this format: | |
| **[SEVERITY]**: Brief description of the actual problem | |
| **Why it matters**: What will break or go wrong | |
| **Fix**: Concrete suggestion (not vague advice) | |
| **Fix** When possible, the fix should use the GitHub Multi line Code Suggestion. Here is an example on how to use it | |
| start with 3 backtick symbols followed by the keyword suggestion and then for lines to suggest to delete use the minus symbol and for lines to suggest add, use the plus symbol | |
| ```suggestion | |
| - line to delete | |
| + line to add | |
| ``` | |
| Severity levels: | |
| - 🔴 **BLOCKING**: Must fix before merge - will cause bugs/crashes/security issues | |
| - 🟡 **WARNING**: Should fix - likely to cause problems in edge cases | |
| ## Output: | |
| - Use `mcp__github_inline_comment__create_inline_comment` for specific code issues | |
| - Use `gh pr comment` ONCE at the end for a brief summary (max 3-4 sentences) | |
| - If the PR looks good with no issues, just say "LGTM - no blocking issues found" and nothing else | |
| - For re-reviews, prefix summary with "Re-review:" and note what changed | |
| ## Master Config and Perf Changelog Validation: | |
| When reviewing a PR, check if any of the following master config files were modified: | |
| - `.github/configs/amd-master.yaml` | |
| - `.github/configs/nvidia-master.yaml` | |
| If either master config file was edited AND `perf-changelog.yaml` was NOT edited in the same PR: | |
| - This is a 🔴 **BLOCKING** issue | |
| - Comment that `perf-changelog.yaml` must also be updated when master config files are changed | |
| - The perf-changelog entry should document what changed in the config and include the PR link | |
| - Format: "Master config files were modified but `perf-changelog.yaml` was not updated. When changing `.github/configs/amd-master.yaml` or `.github/configs/nvidia-master.yaml`, you must add a corresponding entry to `perf-changelog.yaml` documenting the changes." | |
| ## Terminology: | |
| - **STP (Single Token Prediction)**: Standard autoregressive decoding — one token per forward pass. No speculative decoding or MTP. Benchmarks labeled "STP only" use vanilla decoding. | |
| - **MTP (Multi-Token Prediction)**: Predicts multiple tokens per forward pass using speculative decoding (e.g., EAGLE, NEXTN). | |
| ## Expert Parallelism Validation: | |
| When reviewing benchmark scripts for MoE models, verify that expert parallelism flags are used correctly: | |
| - **vLLM** (`vllm serve`): Uses `--enable-expert-parallel` (boolean flag). Does NOT accept `--expert-parallel-size`. EP size is automatically determined by vLLM. | |
| - **SGLang** (`sglang.launch_server`): Uses `--expert-parallel-size N` (explicit integer). | |
| - **ATOM** (AMD vLLM fork): Uses `--enable-expert-parallel` (same as vLLM). | |
| **Required pattern for vLLM/ATOM scripts:** | |
| Scripts must NOT hardcode `--enable-expert-parallel`. Instead, they should conditionally enable it based on the `EP_SIZE` env var: | |
| ```bash | |
| if [ "$EP_SIZE" -gt 1 ]; then | |
| EP=" --enable-expert-parallel" | |
| else | |
| EP=" " | |
| fi | |
| ``` | |
| If a script hardcodes `--enable-expert-parallel` without checking `EP_SIZE`: | |
| - This is a 🟡 **WARNING** issue | |
| - Comment: "Expert parallelism should be conditional on the `EP_SIZE` env var from the config YAML, not hardcoded. Use the `if [ \"$EP_SIZE\" -gt 1 ]` pattern to conditionally enable `--enable-expert-parallel`." | |
| Remember: Silence is golden. No comment is better than a low-value comment. | |
| ## Container Image Accessibility Validation: | |
| When reviewing changes to `.github/configs/*-master.yaml` files, verify that ALL `image:` values are publicly accessible: | |
| **Valid image formats (publicly accessible):** | |
| - Docker Hub: `organization/image:tag` (e.g., `lmsysorg/sglang:v0.5.7-rocm700-mi35x`) | |
| - NGC: `nvcr.io/nvidia/...` or `nvcr.io#nvidia/...` (e.g., `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1` or `nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2`) | |
| - Other public registries: `ghcr.io/...`, `quay.io/...`, `rocm/...` | |
| **Invalid image formats (NOT publicly accessible):** | |
| - generally these images are not best practices to have in: | |
| - Local file paths: `/scratch/...`, `/home/...`, `/data/...`, or any path starting with `/` | |
| - `.sqsh` files (squashfs containers stored locally) | |
| - Internal/private registry paths that are not publicly resolvable | |
| If any `image:` field contains a local path or non-public image: | |
| - This is a 🔴 **BLOCKING** issue | |
| - Comment: "Image must be publicly accessible on NGC, Docker Hub, or another public registry. Local paths like `/scratch/...` or `.sqsh` files are generally not accepted. Please push the container to a public registry (e.g., `nvcr.io/nvidia/...` for NGC) and update the config with the public image reference." | |
| - Link to the specific line with the invalid image path | |
| ## Enroot Import Validation for Launch Scripts: | |
| When reviewing changes to `runners/launch_*.sh` files, verify that the script properly transforms public Docker images to enroot local images for reproducibility. | |
| **Expected pattern:** | |
| The script should include an enroot import command like: | |
| ```bash | |
| srun --jobid=$JOB_ID bash -c "enroot import -o $SQUASH_FILE docker://$IMAGE" | |
| ``` | |
| or similar variations such as: | |
| ```bash | |
| srun -N 1 -A $SLURM_ACCOUNT -p $SLURM_PARTITION bash -c "enroot import -o $SQUASH_FILE docker://$IMAGE" | |
| ``` | |
| **Why this matters:** | |
| - Ensures the exact same public NGC/Docker Hub image is used | |
| - Makes benchmarks reproducible by anyone with access to the public image | |
| - Prevents reliance on pre-existing local container images that others cannot access | |
| **Validation Steps:** | |
| 1. Look for `enroot import` commands that convert `docker://` images to local `.sqsh` files | |
| 2. The image source should be a public registry (NGC, Docker Hub, etc.), not a local path | |
| 3. If the script uses containers but does NOT have an `enroot import docker://` pattern: | |
| - This is a 🟡 **WARNING** issue | |
| - Comment: "This launch script uses container images but does not appear to transform a public Docker image to an enroot local image using `enroot import -o $SQUASH_FILE docker://$IMAGE`. For reproducibility, please either: | |
| 1. Add the enroot import pattern to pull from a public registry, OR | |
| 2. Explain why this script has a different workflow (e.g., uses a different container runtime, pre-built images are acceptable for this use case, etc.)" | |
| - Ask the developer to provide a reasonable explanation if the pattern is intentionally omitted | |
| ## vLLM and SGLang Source Code Access: | |
| You have access to vLLM and SGLang source code via the inferencemax-repos MCP server: | |
| - Use `mcp__inferencemax-repos__*` tools to access repository source code | |
| - Resources are available via URIs: `vllm:///path/to/file.py` and `sglang:///path/to/file.py` | |
| - The server automatically detects and checks out the version matching InferenceMAX configs | |
| - Use the `list_versions` tool to see detected versions | |
| - Use the `switch_version` tool to switch to a different version if needed | |
| **When to use this:** | |
| - When reviewing changes that interact with vLLM/SGLang APIs or internals | |
| - To verify that code correctly uses vLLM/SGLang features | |
| - To check if assumptions about vLLM/SGLang behavior are accurate | |
| - To understand how vLLM/SGLang implements features being used/modified | |
| - To cross-reference documentation claims with actual implementation | |
| **Example use cases:** | |
| - PR modifies vLLM scheduler usage → Check actual vLLM scheduler implementation | |
| - PR adds SGLang integration → Verify SGLang API matches the PR's usage | |
| - PR claims feature X works a certain way → Read the source to confirm | |
| - PR has a bug related to KV cache → Look at vLLM's KV cache implementation | |
| ## Line Count Report for generate_sweep_configs.py: | |
| If `utils/matrix_logic/generate_sweep_configs.py` was modified in this PR: | |
| 1. Count the total lines in the current (PR) version of the file | |
| 2. Count the lines in the base branch version using: `git show origin/$BASE_BRANCH:utils/matrix_logic/generate_sweep_configs.py | wc -l` | |
| 3. Calculate the difference (current - base) | |
| 4. Add an inline comment on the file with this format: | |
| 📊 **Line Count Report** | |
| - **Total Lines:** [current count] | |
| - **Base Lines:** [base count] | |
| - **Change:** [+/-][difference] lines | |
| 5. Use 📈 for additions, 📉 for removals, ➡️ for no change | |
| ## Benchmark Script Code Style Guidelines: | |
| When reviewing changes to `benchmarks/*.sh` files, verify the following code style requirements: | |
| ### 1. Server Launch Command Formatting: | |
| All `sglang.launch_server` and `vllm serve` commands MUST have their arguments formatted on separate lines for readability. | |
| **Invalid format (all arguments on one line):** | |
| ```bash | |
| python -m sglang.launch_server --model $MODEL_PATH --tp 8 --ep 1 --port $PORT --quantization fp8 --kv-cache-dtype fp8_e4m3 | |
| ``` | |
| **Valid format (arguments on separate lines):** | |
| ```bash | |
| python -m sglang.launch_server \ | |
| --model $MODEL_PATH \ | |
| --tp 8 \ | |
| --ep 1 \ | |
| --port $PORT \ | |
| --quantization fp8 \ | |
| --kv-cache-dtype fp8_e4m3 | |
| ``` | |
| The same applies to `vllm serve` commands: | |
| ```bash | |
| vllm serve $MODEL_PATH \ | |
| --tensor-parallel-size 8 \ | |
| --port $PORT \ | |
| --quantization fp8 | |
| ``` | |
| If a benchmark script has server launch commands on a single line: | |
| - This is a 🟡 **WARNING** issue | |
| - Comment: "Server launch commands (`sglang.launch_server` or `vllm serve`) should have each argument on a separate line for better readability and easier code review. Please reformat using line continuations (`\`)." | |
| ### 2. MTP (Multi-Token Prediction) Benchmark Requirements: | |
| When reviewing MTP benchmark scripts (files containing `mtp` in the name or using EAGLE speculative decoding): | |
| **Required: `--use-chat-template` flag** | |
| MTP benchmarks MUST include the `--use-chat-template` flag in the benchmark client configuration. | |
| **Why it matters:** | |
| - Chat templates ensure proper tokenization for speculative decoding | |
| - Without this flag, MTP performance may be incorrect or suboptimal | |
| - Ensures consistent behavior across different model configurations | |
| **Validation:** | |
| Check that any benchmark script with MTP/EAGLE speculative decoding includes `--use-chat-template`: | |
| ```bash | |
| benchmark_client ... --use-chat-template | |
| ``` | |
| If an MTP benchmark script is missing `--use-chat-template`: | |
| - This is a 🔴 **BLOCKING** issue | |
| - Comment: "MTP benchmark scripts MUST include `--use-chat-template` flag in the benchmark client configuration. This ensures proper tokenization for speculative decoding. Please add `--use-chat-template` to the benchmark command." | |
| ## Model Prefix Validation: | |
| When reviewing changes to `.github/configs/*-master.yaml` files, verify that ALL config keys use valid model prefixes. | |
| **Valid model prefixes:** | |
| - `dsr1` - DeepSeek R1 models | |
| - `gptoss` - GPT-OSS models | |
| **Invalid model prefixes (will break frontend):** | |
| - `dsr1-fp8` - INVALID: precision should NOT be part of the model prefix | |
| - `dsr1-fp4` - INVALID: precision should NOT be part of the model prefix | |
| - Any other prefix not in the valid list above | |
| **Config key format:** | |
| Config keys follow the pattern: `{model-prefix}-{precision}-{hardware}-{framework}` | |
| Example valid keys: | |
| - `dsr1-fp8-gb200-vllm` (model-prefix=dsr1, precision=fp8) | |
| - `gptoss-fp4-h200-sglang` (model-prefix=gptoss, precision=fp4) | |
| **Why this matters:** | |
| The frontend expects specific model prefixes (`dsr1` or `gptoss`) to display benchmark results correctly. Using invalid prefixes like `dsr1-fp8` will cause the frontend to fail to display results, breaking the user experience. | |
| **Validation:** | |
| When reviewing config additions or changes, check that the first segment of config keys (before the first `-`) is either `dsr1` or `gptoss`. | |
| If a config key uses an invalid model prefix: | |
| - This is a 🔴 **BLOCKING** issue | |
| - Comment: "Invalid model prefix detected. The frontend only supports `dsr1` and `gptoss` as model prefixes. Using other prefixes like `dsr1-fp8` will break the frontend and prevent benchmark results from being displayed. Please use the format `{valid-prefix}-{precision}-{hardware}-{framework}` where valid-prefix is either `dsr1` or `gptoss`." |