Update Qwen3-Coder-480B-A35B.md for AMD #222
@@ -132,3 +132,59 @@ ERROR [multiproc_executor.py:511] ValueError: The output_size of gate's and up's
- [EvalPlus](https://github.com/evalplus/evalplus)
- [Qwen3-Coder](https://github.com/QwenLM/Qwen3-Coder)
- [vLLM Documentation](https://docs.vllm.ai/)
## AMD GPU Support

Recommended approaches by hardware type:

- MI300X/MI325X/MI355X
Please follow the steps below to install and run Qwen3-Coder models on AMD MI300X/MI325X/MI355X GPUs.
### Step 1: Installing vLLM (AMD ROCm Backend: MI300X, MI325X, MI355X)

> Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment does not meet these requirements, please use the Docker-based setup as described in the [documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#pre-built-images).
```bash
uv venv
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
```
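Before installing, you can sanity-check the prerequisites from the note above. This is a minimal sketch, not part of the official instructions; the version-comparison trick relies on `sort -V` being available:

```shell
# Check Python is 3.12 and glibc >= 2.35, as required by the ROCm wheel.
python3 -c 'import sys; print("python OK" if sys.version_info[:2] == (3, 12) else "need python 3.12")'

glibc_ver=$(ldd --version | head -n1 | grep -oE '[0-9]+\.[0-9]+$')
min_ver=2.35
# sort -V orders versions numerically; if min_ver sorts first, glibc_ver >= min_ver.
if [ "$(printf '%s\n' "$min_ver" "$glibc_ver" | sort -V | head -n1)" = "$min_ver" ]; then
    echo "glibc $glibc_ver OK"
else
    echo "glibc $glibc_ver too old; use the Docker-based setup instead"
fi
```

If either check fails, fall back to the pre-built Docker images linked in the note.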
### Step 2: Start the vLLM server

Start vLLM online serving with one of the commands below.

### BF16
```shell
VLLM_ROCM_USE_AITER=1 vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct \
  --trust-remote-code \
  --max-model-len 131072 \
  --enable-expert-parallel \
  --data-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```

> Reviewer: Let's make it a multiline command and be consistent with the existing format.
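Once the server reports it is ready, a quick smoke test of the OpenAI-compatible endpoint can confirm it is serving. This is a sketch; `localhost` and the default vLLM port 8000 are assumptions about your setup:

```python
import json
import urllib.request

# Build a minimal /v1/completions request (default vLLM port 8000 assumed).
payload = {
    "model": "Qwen/Qwen3-Coder-480B-A35B-Instruct",
    "prompt": "Write a Python function that reverses a string.",
    "max_tokens": 128,
    "temperature": 0.2,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server from Step 2 is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
print(req.get_full_url())
```

Any OpenAI-compatible client (for example the `openai` Python package pointed at the same base URL) would work equally well here.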
### FP8

```shell
VLLM_ROCM_USE_AITER=1 vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
  --trust-remote-code \
  --max-model-len 131072 \
  --enable-expert-parallel \
  --data-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```

> Reviewer: Let's make it a multiline command and be consistent with the existing format. @haic0, please also set up the commands as multiline in all your PRs for better readability.
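Since both serve commands enable tool calling (`--enable-auto-tool-choice --tool-call-parser qwen3_coder`), the chat endpoint also accepts OpenAI-style `tools`. A sketch of such a request body follows; the `get_weather` tool is hypothetical, purely for illustration:

```python
import json

# OpenAI-style chat request with a (hypothetical) tool definition,
# suitable for POSTing to /v1/chat/completions on the running server.
payload = {
    "model": "Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical example tool
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}
body = json.dumps(payload)
```

When the model decides to call the tool, the response carries a `tool_calls` entry that the `qwen3_coder` parser extracts from the model output.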
### Step 4: Run Benchmark

Open a new terminal and run the following command to execute the benchmark script inside the container.

> Reviewer: @haic0, must we use this separate instruction? Is it fine to use the benchmark instruction specified for NVIDIA? We should try to group the commands together under the command section, as in PR #202.
```shell
vllm bench serve \
  --backend vllm \
  --model Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
  --endpoint /v1/completions \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 1024 \
  --max-concurrency 10 \
  --num-prompts 100
```
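As a back-of-envelope check on the numbers the benchmark reports, output-token throughput is roughly total generated tokens over wall-clock time. The duration below is a made-up placeholder; the real value comes from the tool's summary output:

```python
# Rough throughput estimate for the benchmark configuration above.
num_prompts = 100        # number of prompts sent
output_len = 1024        # random output length per prompt
duration_s = 120.0       # placeholder: wall-clock time reported by the tool

total_output_tokens = num_prompts * output_len
output_tok_per_s = total_output_tokens / duration_s
print(f"{output_tok_per_s:.1f} output tok/s")
```

The benchmark's own summary also reports latency percentiles (TTFT, TPOT), which this simple average does not capture.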
> Reviewer (on lines +177 to +190): The purpose of Step 4 is unclear in relation to Step 3. After starting a server in Step 3, this step describes running what appears to be an offline benchmark, which would not use the running server. To avoid confusion, please clarify whether the intention is to benchmark the running server (in which case a client benchmark tool should be used) or to run a separate offline benchmark.
> Reviewer: @haic0 Let's merge all the commands together under the same header as CUDA; we shouldn't create a whole new section. You can refer to this PR: https://github.com/vllm-project/recipes/pull/219/changes