From 8ab2603682966621f77e0b6bec4d993bcc7d9fa7 Mon Sep 17 00:00:00 2001
From: hyukjlee
Date: Mon, 26 Jan 2026 14:24:22 +0000
Subject: [PATCH 1/3] Mistral-3-Instruct update for AMD GPU

Signed-off-by: hyukjlee
---
 Mistral/Mistral-3-Instruct-AMD.md | 91 +++++++++++++++++++++++++++++++
 1 file changed, 91 insertions(+)
 create mode 100644 Mistral/Mistral-3-Instruct-AMD.md

diff --git a/Mistral/Mistral-3-Instruct-AMD.md b/Mistral/Mistral-3-Instruct-AMD.md
new file mode 100644
index 00000000..b1d4c4d5
--- /dev/null
+++ b/Mistral/Mistral-3-Instruct-AMD.md
@@ -0,0 +1,91 @@
+# Ministral 3 14B Instruct on vLLM - AMD Hardware
+
+## Introduction
+
+This quick start recipe explains how to run the Ministral 3 14B Instruct model on AMD Instinct MI300X, MI325X, or MI355X GPUs using vLLM.
+
+## Key benefits of AMD GPUs for large models and developers
+
+AMD Instinct GPU accelerators are purpose-built to handle the demands of next-gen models like Ministral:
+- Large HBM memory enables longer contexts and higher concurrency.
+- Optimized Triton and AITER kernels provide best-in-class performance and TCO for production deployment.
+- Strong single-node performance reduces infrastructure complexity for serving.
+
+## Access & Licensing
+
+### License and Model parameters
+
+Please check whether you have access to the following model:
+- [Ministral 3 14B Instruct](https://huggingface.co/mistralai/Ministral-3-14B-Instruct-2512)
+
+## Prerequisites
+
+- OS: Linux
+- Drivers: ROCm 7.0 or above
+- GPU: AMD Instinct MI300X, MI325X, or MI355X
+
+## Deployment Steps
+
+### 1. Using the vLLM Docker image (for AMD users)
+
+```bash
+alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 32G -v /data:/data -v $HOME:/myhome -w /myhome --entrypoint /bin/bash'
+drun vllm/vllm-openai-rocm:v0.14.0
+```
+
+### 2. Start vLLM online server (run in background)
+
+```bash
+export TP=1
+export VLLM_ROCM_USE_AITER=1
+export MODEL="mistralai/Ministral-3-14B-Instruct-2512"
+vllm serve $MODEL \
+    --disable-log-requests \
+    --port 9090 \
+    -tp $TP \
+    --config_format mistral \
+    --load_format mistral \
+    --enable-auto-tool-choice \
+    --tool-call-parser mistral &
+```
+
+### 3. Running inference with a sample request
+
+Test the model with a text-only prompt.
+
+```bash
+curl http://localhost:9090/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "mistralai/Ministral-3-14B-Instruct-2512",
+        "prompt": "Explain the benefits of KV cache in transformer decoding.",
+        "max_tokens": 128,
+        "temperature": 0
+    }'
+```
+
+Test result (local run):
+```bash
+"text":" How does it help in reducing the computational cost?\n\n### Understanding KV Cache in Transformer Decoding\n\nThe **KV cache** (Key-Value cache) is a technique used in **transformer-based models** (like GPT, BERT, etc.) during **autoregressive decoding** to improve efficiency. Here's how it works and why it's beneficial:\n\n---\n\n### **1. What is KV Cache?**\nDuring decoding, a transformer model generates tokens one by one. For each new token, the model computes **attention scores** between the current token and all previous tokens in the sequence. The attention mechanism involves:\n"
+```
+
+### 4. Performance benchmark
+
+```bash
+export MODEL="mistralai/Ministral-3-14B-Instruct-2512"
+export ISL=1024
+export OSL=1024
+export REQ=10
+export CONC=10
+vllm bench serve \
+    --backend vllm \
+    --model $MODEL \
+    --dataset-name random \
+    --random-input-len $ISL \
+    --random-output-len $OSL \
+    --num-prompts $REQ \
+    --ignore-eos \
+    --max-concurrency $CONC \
+    --port 9090 \
+    --percentile-metrics ttft,tpot,itl,e2el
+```
\ No newline at end of file

From 46dd5dcdcfa806f41cc81c6a29983a01eda6e8a0 Mon Sep 17 00:00:00 2001
From: Hyukjoon Lee
Date: Mon, 9 Feb 2026 16:54:22 +0900
Subject: [PATCH 2/3] Update Mistral-3-Instruct-AMD.md

Signed-off-by: Hyukjoon Lee
---
 Mistral/Mistral-3-Instruct-AMD.md | 46 ++++++++++++------------------
 1 file changed, 18 insertions(+), 28 deletions(-)

diff --git a/Mistral/Mistral-3-Instruct-AMD.md b/Mistral/Mistral-3-Instruct-AMD.md
index b1d4c4d5..b3ec0a7c 100644
--- a/Mistral/Mistral-3-Instruct-AMD.md
+++ b/Mistral/Mistral-3-Instruct-AMD.md
@@ -29,8 +29,20 @@ Please check whether you have access to the following model:
 ### 1. Using the vLLM Docker image (for AMD users)
 
 ```bash
-alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 32G -v /data:/data -v $HOME:/myhome -w /myhome --entrypoint /bin/bash'
-drun vllm/vllm-openai-rocm:v0.14.0
+docker run -it \
+    --network=host \
+    --device=/dev/kfd \
+    --device=/dev/dri \
+    --group-add=video \
+    --ipc=host \
+    --cap-add=SYS_PTRACE \
+    --security-opt seccomp=unconfined \
+    --shm-size 32G \
+    -v /data:/data \
+    -v $HOME:/myhome \
+    -w /myhome \
+    --entrypoint /bin/bash \
+    vllm/vllm-openai-rocm:latest
 ```
 
 ### 2. Start vLLM online server (run in background)
@@ -40,8 +52,7 @@ export TP=1
 export VLLM_ROCM_USE_AITER=1
 export MODEL="mistralai/Ministral-3-14B-Instruct-2512"
 vllm serve $MODEL \
-    --disable-log-requests \
-    --port 9090 \
+    --disable-log-requests \
     -tp $TP \
     --config_format mistral \
     --load_format mistral \
@@ -49,27 +60,7 @@ vllm serve $MODEL \
     --enable-auto-tool-choice \
     --tool-call-parser mistral &
 ```
 
-### 3. Running inference with a sample request
-
-Test the model with a text-only prompt.
-
-```bash
-curl http://localhost:9090/v1/completions \
-    -H "Content-Type: application/json" \
-    -d '{
-        "model": "mistralai/Ministral-3-14B-Instruct-2512",
-        "prompt": "Explain the benefits of KV cache in transformer decoding.",
-        "max_tokens": 128,
-        "temperature": 0
-    }'
-```
-
-Test result (local run):
-```bash
-"text":" How does it help in reducing the computational cost?\n\n### Understanding KV Cache in Transformer Decoding\n\nThe **KV cache** (Key-Value cache) is a technique used in **transformer-based models** (like GPT, BERT, etc.) during **autoregressive decoding** to improve efficiency. Here's how it works and why it's beneficial:\n\n---\n\n### **1. What is KV Cache?**\nDuring decoding, a transformer model generates tokens one by one. For each new token, the model computes **attention scores** between the current token and all previous tokens in the sequence. The attention mechanism involves:\n"
-```
-
-### 4. Performance benchmark
+### 3. Performance benchmark
 
 ```bash
 export MODEL="mistralai/Ministral-3-14B-Instruct-2512"
@@ -85,7 +76,6 @@ vllm bench serve \
     --random-output-len $OSL \
     --num-prompts $REQ \
     --ignore-eos \
-    --max-concurrency $CONC \
-    --port 9090 \
+    --max-concurrency $CONC \
     --percentile-metrics ttft,tpot,itl,e2el
-```
\ No newline at end of file
+```

From 47f9b034416943467cc873ce03e4e933c3686223 Mon Sep 17 00:00:00 2001
From: Hyukjoon Lee
Date: Mon, 9 Feb 2026 17:11:30 +0900
Subject: [PATCH 3/3] Update Mistral-3-Instruct-AMD.md

Signed-off-by: Hyukjoon Lee
---
 Mistral/Mistral-3-Instruct-AMD.md | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/Mistral/Mistral-3-Instruct-AMD.md b/Mistral/Mistral-3-Instruct-AMD.md
index b3ec0a7c..c34c7b02 100644
--- a/Mistral/Mistral-3-Instruct-AMD.md
+++ b/Mistral/Mistral-3-Instruct-AMD.md
@@ -44,7 +44,13 @@ docker run -it \
     --entrypoint /bin/bash \
     vllm/vllm-openai-rocm:latest
 ```
-
+Alternatively, you can use a uv environment.
+> Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment does not meet these requirements, please use the Docker-based setup as described in the [documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#pre-built-images).
+```bash
+uv venv
+source .venv/bin/activate
+uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
+```
 ### 2. Start vLLM online server (run in background)
 
 ```bash
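The serve command in step 2 backgrounds the server with `&`, so requests sent immediately afterwards can fail while the weights are still loading. Below is a minimal readiness-check sketch; it assumes `curl` is available and that the server exposes vLLM's standard `/health` endpoint on the default port 8000 (the serve command no longer pins `--port` after the second patch).

```shell
# Poll a URL until it responds, or give up after a fixed number of tries.
# -s silences curl's progress output; -f makes curl exit nonzero on errors.
wait_for_url() {
  local url=$1 tries=${2:-60} delay=${3:-2}
  local i
  for i in $(seq "$tries"); do
    if curl -sf "$url" > /dev/null; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Example: block until the background server is ready before sending traffic
# wait_for_url "http://localhost:8000/health" && echo "server is up"
```

Keeping the check in a function makes it reusable for both the serve step and the benchmark step, and the retry count bounds how long a broken deployment can hang a script.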
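The second patch drops the curl smoke test, but a quick sanity request before the full benchmark is still useful. This is a sketch, not part of the recipe: it assumes the server from step 2 is running on the default port 8000, and it uses only Python's standard library to pull the generated text out of the OpenAI-style JSON response (no `jq` required).

```shell
# Extract choices[0].text from an OpenAI-compatible completions response.
extract_text() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["text"])'
}

# Example call (uncomment once the server is up; port 8000 is an assumption):
# curl -s http://localhost:8000/v1/completions \
#     -H "Content-Type: application/json" \
#     -d '{"model": "mistralai/Ministral-3-14B-Instruct-2512",
#          "prompt": "Explain the benefits of KV cache in transformer decoding.",
#          "max_tokens": 64, "temperature": 0}' \
#   | extract_text
```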
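The benchmark configuration makes the token totals easy to sanity-check: with `--ignore-eos`, every one of the `REQ` requests generates the full `OSL` output tokens. The sketch below shows the arithmetic; `E2E_SECONDS` is an illustrative placeholder, not a measured result.

```shell
# REQ requests * OSL tokens each = total output tokens for the run;
# dividing by end-to-end wall time gives aggregate output throughput.
REQ=10
OSL=1024
E2E_SECONDS=40   # illustrative placeholder, not a measured value
TOTAL_OUT=$(( REQ * OSL ))
echo "total output tokens: $TOTAL_OUT"
awk -v tok="$TOTAL_OUT" -v t="$E2E_SECONDS" \
    'BEGIN { printf "implied output throughput: %.1f tok/s\n", tok / t }'
```

Comparing this implied figure against the throughput that `vllm bench serve` reports is a quick way to spot a misconfigured run (for example, requests terminating early or concurrency capped below `CONC`).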