From 451114d70a86b256fd9f4e2e41004f972c043346 Mon Sep 17 00:00:00 2001
From: jiacao-amd
Date: Tue, 27 Jan 2026 13:59:35 -0800
Subject: [PATCH 1/2] Add AMD MI300X, MI325X and MI355X GPU recipes for
 Mistral-Large-3

Signed-off-by: jiacao-amd

add uv launch support

Signed-off-by: jiacao-amd
---
 Mistral/Mistral-Large-3.md | 73 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

diff --git a/Mistral/Mistral-Large-3.md b/Mistral/Mistral-Large-3.md
index 43e8d369..0d233bc2 100644
--- a/Mistral/Mistral-Large-3.md
+++ b/Mistral/Mistral-Large-3.md
@@ -330,3 +330,76 @@ response = client.chat.completions.create(
 assistant_message = response.choices[0].message.content
 print(assistant_message)
 ```
+
+
+## AMD GPU Support
+
+Please follow the steps here to install and run Mistral-Large-3 on AMD MI300X, MI325X and MI355X.
+You can choose either Option A (Docker) or Option B (install with uv).
+
+### Option A: Run on Host with uv
+ > Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment does not meet these requirements, please use the Docker-based setup as described in the [documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#pre-built-images).
+ ```bash
+ uv venv
+ source .venv/bin/activate
+ uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
+ ```
+
+### Option B: Run with Docker
+Pull the latest vllm docker:
+```shell
+docker pull vllm/vllm-openai-rocm:latest
+```
+Launch the ROCm vLLM docker:
+```shell
+docker run -d -it \
+  --ipc=host \
+  --entrypoint /bin/bash \
+  --network=host \
+  --privileged \
+  --cap-add=CAP_SYS_ADMIN \
+  --device=/dev/kfd \
+  --device=/dev/dri \
+  --device=/dev/mem \
+  --group-add video \
+  --cap-add=SYS_PTRACE \
+  --security-opt seccomp=unconfined \
+  -v /:/work \
+  -e SHELL=/bin/bash \
+  -p 8000:8000 \
+  --name Mistral-Large-3 \
+  vllm/vllm-openai-rocm:latest
+```
+### Log in to Hugging Face
+Log in to your Hugging Face account:
+```shell
+huggingface-cli login
+```
+
+### Start the vLLM server
+
+Run the vllm online serving with this sample command:
+```shell
+SAFETENSORS_FAST_GPU=1 \
+VLLM_USE_V1=1 \
+VLLM_USE_TRITON_FLASH_ATTN=0 \
+VLLM_ROCM_USE_AITER=1 \
+vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
+  --tensor-parallel-size 8 \
+  --no-enable-prefix-caching \
+  --trust-remote-code
+```
+
+### Run Benchmark
+Open a new terminal and run the following command to execute the benchmark script inside the container.
+```shell
+docker exec -it Mistral-Large-3 vllm bench serve \
+  --model "mistralai/Mistral-Large-3-675B-Instruct-2512" \
+  --dataset-name random \
+  --random-input-len 8192 \
+  --random-output-len 1024 \
+  --request-rate 10000 \
+  --num-prompts 16 \
+  --ignore-eos \
+  --trust-remote-code
+```
\ No newline at end of file

From 76d09592268236cd9c617a6766afce6706d5555b Mon Sep 17 00:00:00 2001
From: jiacao-amd
Date: Wed, 25 Feb 2026 16:59:41 -0800
Subject: [PATCH 2/2] code update

Signed-off-by: jiacao-amd
---
 Mistral/Mistral-Large-3.md | 135 +++++++++++++++++--------------------
 1 file changed, 60 insertions(+), 75 deletions(-)

diff --git a/Mistral/Mistral-Large-3.md b/Mistral/Mistral-Large-3.md
index 0d233bc2..9be155ff 100644
--- a/Mistral/Mistral-Large-3.md
+++ b/Mistral/Mistral-Large-3.md
@@ -9,15 +9,62 @@ Here are the links to the different formats:
 
 ## Installing vLLM
 
+### CUDA
+
 ```bash
 uv venv
 source .venv/bin/activate
 uv pip install -U vllm --torch-backend auto
 ```
 
+### ROCm
+
+You can choose either Option A (Docker) or Option B (install with uv).
+
+#### Option A: Run on Host with uv
+> Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment does not meet these requirements, please use the Docker-based setup as described in the [documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#pre-built-images).
+```bash
+uv venv
+source .venv/bin/activate
+uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
+```
+
+#### Option B: Run with Docker
+Pull the latest vLLM Docker image:
+```shell
+docker pull vllm/vllm-openai-rocm:latest
+```
+Launch the ROCm vLLM Docker container:
+```shell
+docker run -d -it \
+  --ipc=host \
+  --entrypoint /bin/bash \
+  --network=host \
+  --privileged \
+  --cap-add=CAP_SYS_ADMIN \
+  --device=/dev/kfd \
+  --device=/dev/dri \
+  --device=/dev/mem \
+  --group-add video \
+  --cap-add=SYS_PTRACE \
+  --security-opt seccomp=unconfined \
+  -v /:/work \
+  -e SHELL=/bin/bash \
+  -p 8000:8000 \
+  --name Mistral-Large-3 \
+  vllm/vllm-openai-rocm:latest
+```
+
+Log in to your Hugging Face account:
+```shell
+huggingface-cli login
+```
+
 ## Running the model
 
-## Running Mistral-Large-3-Instruct FP8 on 8xH200
+### CUDA
+
+#### Running Mistral-Large-3-Instruct FP8 on 8xH200
 
 The Mistral-Large-3-Instruct FP8 format can be used on one 8xH200 node. We recommend using this format if you plan to fine-tune, as it can be more precise than NVFP4 in some situations.
 
@@ -43,7 +90,7 @@ Additional flags:
 * You can set `--max-model-len` to preserve memory. By default it is set to `262144` which is quite large but not necessary for most scenarios.
 * You can set `--max-num-batched-tokens` to balance throughput and latency, higher means higher throughput but higher latency.
 
-## Running Mistral-Large-3-Instruct NVFP4 on 4xB200
+#### Running Mistral-Large-3-Instruct NVFP4 on 4xB200
 
 We recommend using this format if you plan to deploy Mistral-Large-3, as it achieves performance similar to FP8 with less memory. However, please note that for large contexts (`> 64k`) we observed a drop in performance. In such cases, please use the FP8 weights. Otherwise, on B200 (Blackwell 200) we observe a significant speed-up and a minor regression on vision datasets, probably due to the calibration having been performed mainly on text data.
@@ -72,6 +119,17 @@ Additional flags:
 * You can set `--max-num-batched-tokens` to balance throughput and latency, higher means higher throughput but higher latency.
 * You can set `--limit-mm-per-prompt.image 0` to skip loading the vision encoder to have additional space for KV cache if the model is used for text-only tasks.
 
+### ROCm
+
+Run vLLM online serving with this sample command:
+```shell
+SAFETENSORS_FAST_GPU=1 \
+VLLM_ROCM_USE_AITER=1 \
+vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
+  --tensor-parallel-size 8 \
+  --no-enable-prefix-caching \
+  --trust-remote-code
+```
 
 ## Usage of the model
 
@@ -329,77 +387,4 @@ response = client.chat.completions.create(
 assistant_message = response.choices[0].message.content
 print(assistant_message)
-```
-
-
-## AMD GPU Support
-
-Please follow the steps here to install and run Mistral-Large-3 on AMD MI300X, MI325X and MI355X.
-You can choose either Option A (Docker) or Option B (install with uv).
-
-### Option A: Run on Host with uv
- > Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment does not meet these requirements, please use the Docker-based setup as described in the [documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#pre-built-images).
- ```bash
- uv venv
- source .venv/bin/activate
- uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
- ```
-
-### Option B: Run with Docker
-Pull the latest vllm docker:
-```shell
-docker pull vllm/vllm-openai-rocm:latest
-```
-Launch the ROCm vLLM docker:
-```shell
-docker run -d -it \
-  --ipc=host \
-  --entrypoint /bin/bash \
-  --network=host \
-  --privileged \
-  --cap-add=CAP_SYS_ADMIN \
-  --device=/dev/kfd \
-  --device=/dev/dri \
-  --device=/dev/mem \
-  --group-add video \
-  --cap-add=SYS_PTRACE \
-  --security-opt seccomp=unconfined \
-  -v /:/work \
-  -e SHELL=/bin/bash \
-  -p 8000:8000 \
-  --name Mistral-Large-3 \
-  vllm/vllm-openai-rocm:latest
-```
-### Log in to Hugging Face
-Log in to your Hugging Face account:
-```shell
-huggingface-cli login
-```
-
-### Start the vLLM server
-
-Run the vllm online serving with this sample command:
-```shell
-SAFETENSORS_FAST_GPU=1 \
-VLLM_USE_V1=1 \
-VLLM_USE_TRITON_FLASH_ATTN=0 \
-VLLM_ROCM_USE_AITER=1 \
-vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
-  --tensor-parallel-size 8 \
-  --no-enable-prefix-caching \
-  --trust-remote-code
-```
-
-### Run Benchmark
-Open a new terminal and run the following command to execute the benchmark script inside the container.
-```shell
-docker exec -it Mistral-Large-3 vllm bench serve \
-  --model "mistralai/Mistral-Large-3-675B-Instruct-2512" \
-  --dataset-name random \
-  --random-input-len 8192 \
-  --random-output-len 1024 \
-  --request-rate 10000 \
-  --num-prompts 16 \
-  --ignore-eos \
-  --trust-remote-code
-```
\ No newline at end of file
+```
\ No newline at end of file
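The benchmark command in this patch fixes the workload size exactly (`--ignore-eos` forces every request to generate the full output length), so the aggregate token counts behind the reported throughput can be checked with quick arithmetic. A minimal back-of-envelope sketch, using only the flag values from the patch (not part of the patch itself, and not assuming anything about vLLM's output format):

```python
# Workload implied by the `vllm bench serve` flags above:
# 16 prompts, each with 8192 random input tokens and, because of
# --ignore-eos, exactly 1024 generated output tokens.
num_prompts = 16   # --num-prompts
input_len = 8192   # --random-input-len
output_len = 1024  # --random-output-len

total_input = num_prompts * input_len    # tokens prefilled across the run
total_output = num_prompts * output_len  # tokens decoded across the run

print(total_input, total_output)  # 131072 16384
```

Dividing `total_output` by the measured wall-clock time of the run gives a rough decode-throughput number that can be compared across configurations (e.g. FP8 on 8xH200 vs. the ROCm setup).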