From 451114d70a86b256fd9f4e2e41004f972c043346 Mon Sep 17 00:00:00 2001
From: jiacao-amd
Date: Tue, 27 Jan 2026 13:59:35 -0800
Subject: [PATCH 1/2] Add AMD MI300X, MI325X and MI355X GPU recipes for
 Mistral-Large-3

Signed-off-by: jiacao-amd

add uv launch support

Signed-off-by: jiacao-amd
---
 Mistral/Mistral-Large-3.md | 73 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

diff --git a/Mistral/Mistral-Large-3.md b/Mistral/Mistral-Large-3.md
index 43e8d369..0d233bc2 100644
--- a/Mistral/Mistral-Large-3.md
+++ b/Mistral/Mistral-Large-3.md
@@ -330,3 +330,76 @@ response = client.chat.completions.create(
 assistant_message = response.choices[0].message.content
 print(assistant_message)
 ```
+
+
+## AMD GPU Support
+
+Please follow the steps here to install and run Mistral-Large-3 on AMD MI300X, MI325X and MI355X.
+You can choose either Option A (Docker) or Option B (install with uv).
+
+### Option A: Run on Host with uv
+ > Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment does not meet these requirements, please use the Docker-based setup as described in the [documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#pre-built-images).
+ ```bash
+ uv venv
+ source .venv/bin/activate
+ uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
+ ```
+
+### Option B: Run with Docker
+Pull the latest vllm docker:
+```shell
+docker pull vllm/vllm-openai-rocm:latest
+```
+Launch the ROCm vLLM docker:
+```shell
+docker run -d -it \
+  --ipc=host \
+  --entrypoint /bin/bash \
+  --network=host \
+  --privileged \
+  --cap-add=CAP_SYS_ADMIN \
+  --device=/dev/kfd \
+  --device=/dev/dri \
+  --device=/dev/mem \
+  --group-add video \
+  --cap-add=SYS_PTRACE \
+  --security-opt seccomp=unconfined \
+  -v /:/work \
+  -e SHELL=/bin/bash \
+  -p 8000:8000 \
+  --name Mistral-Large-3 \
+  vllm/vllm-openai-rocm:latest
+```
+### Log in to Hugging Face
+Log in to your Hugging Face account:
+```shell
+huggingface-cli login
+```
+
+### Start the vLLM server
+
+Run the vllm online serving with this sample command:
+```shell
+SAFETENSORS_FAST_GPU=1 \
+VLLM_USE_V1=1 \
+VLLM_USE_TRITON_FLASH_ATTN=0 \
+VLLM_ROCM_USE_AITER=1 \
+vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
+  --tensor-parallel-size 8 \
+  --no-enable-prefix-caching \
+  --trust-remote-code
+```
+
+### Run Benchmark
+Open a new terminal and run the following command to execute the benchmark script inside the container.
+```shell
+docker exec -it Mistral-Large-3 vllm bench serve \
+  --model "mistralai/Mistral-Large-3-675B-Instruct-2512" \
+  --dataset-name random \
+  --random-input-len 8192 \
+  --random-output-len 1024 \
+  --request-rate 10000 \
+  --num-prompts 16 \
+  --ignore-eos \
+  --trust-remote-code
+```
\ No newline at end of file

From 76d09592268236cd9c617a6766afce6706d5555b Mon Sep 17 00:00:00 2001
From: jiacao-amd
Date: Wed, 25 Feb 2026 16:59:41 -0800
Subject: [PATCH 2/2] code update

Signed-off-by: jiacao-amd
---
 Mistral/Mistral-Large-3.md | 135 +++++++++++++++++--------------------
 1 file changed, 60 insertions(+), 75 deletions(-)

diff --git a/Mistral/Mistral-Large-3.md b/Mistral/Mistral-Large-3.md
index 0d233bc2..9be155ff 100644
--- a/Mistral/Mistral-Large-3.md
+++ b/Mistral/Mistral-Large-3.md
@@ -9,15 +9,62 @@ Here are the links to the different formats:
 
 ## Installing vLLM
 
+### CUDA
+
 ```bash
 uv venv
 source .venv/bin/activate
 uv pip install -U vllm --torch-backend auto
 ```
 
+### ROCm
+
+You can choose either Option A (Docker) or Option B (install with uv).
+
+#### Option A: Run on Host with uv
+> Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment does not meet these requirements, please use the Docker-based setup as described in the [documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#pre-built-images).
+```bash
+uv venv
+source .venv/bin/activate
+uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
+```
+
+#### Option B: Run with Docker
+Pull the latest vLLM Docker image:
+```shell
+docker pull vllm/vllm-openai-rocm:latest
+```
+Launch the ROCm vLLM Docker container:
+```shell
+docker run -d -it \
+  --ipc=host \
+  --entrypoint /bin/bash \
+  --network=host \
+  --privileged \
+  --cap-add=CAP_SYS_ADMIN \
+  --device=/dev/kfd \
+  --device=/dev/dri \
+  --device=/dev/mem \
+  --group-add video \
+  --cap-add=SYS_PTRACE \
+  --security-opt seccomp=unconfined \
+  -v /:/work \
+  -e SHELL=/bin/bash \
+  -p 8000:8000 \
+  --name Mistral-Large-3 \
+  vllm/vllm-openai-rocm:latest
+```
+
+Log in to your Hugging Face account:
+```shell
+huggingface-cli login
+```
+
 ## Running the model
 
-## Running Mistral-Large-3-Instruct FP8 on 8xH200
+### CUDA
+
+#### Running Mistral-Large-3-Instruct FP8 on 8xH200
 
 The Mistral-Large-3-Instruct FP8 format can be used on one 8xH200 node. We recommend using this format if you plan to fine-tune, as it can be more precise than NVFP4 in some situations.
 
@@ -43,7 +90,7 @@ Additional flags:
 * You can set `--max-model-len` to preserve memory. By default it is set to `262144` which is quite large but not necessary for most scenarios.
 * You can set `--max-num-batched-tokens` to balance throughput and latency, higher means higher throughput but higher latency.
 
-## Running Mistral-Large-3-Instruct NVFP4 on 4xB200
+#### Running Mistral-Large-3-Instruct NVFP4 on 4xB200
 
 We recommend using this format if you plan to deploy Mistral-Large-3, as it achieves performance similar to FP8 with less memory. However, please note that for large contexts (`> 64k`) we observed a drop in performance. In such cases, please use the FP8 weights. Otherwise, on B200 (Blackwell 200) we observe a significant speed-up and a minor regression on vision datasets, probably due to the calibration having been performed mainly on text data.
@@ -72,6 +119,17 @@ Additional flags:
 * You can set `--max-num-batched-tokens` to balance throughput and latency, higher means higher throughput but higher latency.
 * You can set `--limit-mm-per-prompt.image 0` to skip loading the vision encoder to have additional space for KV cache if the model is used for text-only tasks.
 
+### ROCm
+
+Run vLLM online serving with this sample command:
+```shell
+SAFETENSORS_FAST_GPU=1 \
+VLLM_ROCM_USE_AITER=1 \
+vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
+  --tensor-parallel-size 8 \
+  --no-enable-prefix-caching \
+  --trust-remote-code
+```
 
 ## Usage of the model
 
@@ -329,77 +387,4 @@ response = client.chat.completions.create(
 assistant_message = response.choices[0].message.content
 print(assistant_message)
-```
-
-
-## AMD GPU Support
-
-Please follow the steps here to install and run Mistral-Large-3 on AMD MI300X, MI325X and MI355X.
-You can choose either Option A (Docker) or Option B (install with uv).
-
-### Option A: Run on Host with uv
- > Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment does not meet these requirements, please use the Docker-based setup as described in the [documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#pre-built-images).
- ```bash
- uv venv
- source .venv/bin/activate
- uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
- ```
-
-### Option B: Run with Docker
-Pull the latest vllm docker:
-```shell
-docker pull vllm/vllm-openai-rocm:latest
-```
-Launch the ROCm vLLM docker:
-```shell
-docker run -d -it \
-  --ipc=host \
-  --entrypoint /bin/bash \
-  --network=host \
-  --privileged \
-  --cap-add=CAP_SYS_ADMIN \
-  --device=/dev/kfd \
-  --device=/dev/dri \
-  --device=/dev/mem \
-  --group-add video \
-  --cap-add=SYS_PTRACE \
-  --security-opt seccomp=unconfined \
-  -v /:/work \
-  -e SHELL=/bin/bash \
-  -p 8000:8000 \
-  --name Mistral-Large-3 \
-  vllm/vllm-openai-rocm:latest
-```
-### Log in to Hugging Face
-Log in to your Hugging Face account:
-```shell
-huggingface-cli login
-```
-
-### Start the vLLM server
-
-Run the vllm online serving with this sample command:
-```shell
-SAFETENSORS_FAST_GPU=1 \
-VLLM_USE_V1=1 \
-VLLM_USE_TRITON_FLASH_ATTN=0 \
-VLLM_ROCM_USE_AITER=1 \
-vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
-  --tensor-parallel-size 8 \
-  --no-enable-prefix-caching \
-  --trust-remote-code
-```
-
-### Run Benchmark
-Open a new terminal and run the following command to execute the benchmark script inside the container.
-```shell
-docker exec -it Mistral-Large-3 vllm bench serve \
-  --model "mistralai/Mistral-Large-3-675B-Instruct-2512" \
-  --dataset-name random \
-  --random-input-len 8192 \
-  --random-output-len 1024 \
-  --request-rate 10000 \
-  --num-prompts 16 \
-  --ignore-eos \
-  --trust-remote-code
-```
\ No newline at end of file
+```
\ No newline at end of file
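The benchmark command in this patch fixes the workload size exactly (`--ignore-eos` forces every request to generate the full output length), so the aggregate token counts behind the reported throughput can be checked with quick arithmetic. A minimal back-of-envelope sketch, using only the flag values from the patch (not part of the patch itself, and not assuming anything about vLLM's output format):

```python
# Workload implied by the `vllm bench serve` flags above:
# 16 prompts, each with 8192 random input tokens and, because of
# --ignore-eos, exactly 1024 generated output tokens.
num_prompts = 16   # --num-prompts
input_len = 8192   # --random-input-len
output_len = 1024  # --random-output-len

total_input = num_prompts * input_len    # tokens prefilled across the run
total_output = num_prompts * output_len  # tokens decoded across the run

print(total_input, total_output)  # 131072 16384
```

Dividing `total_output` by the measured wall-clock time of the run gives a rough decode-throughput number that can be compared across configurations (e.g. FP8 on 8xH200 vs. the ROCm setup).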