From 8ab2603682966621f77e0b6bec4d993bcc7d9fa7 Mon Sep 17 00:00:00 2001
From: hyukjlee
Date: Mon, 26 Jan 2026 14:24:22 +0000
Subject: [PATCH 1/3] Mistral-3-Instruct update for AMD GPU

Signed-off-by: hyukjlee
---
 Mistral/Mistral-3-Instruct-AMD.md | 91 +++++++++++++++++++++++++++++++
 1 file changed, 91 insertions(+)
 create mode 100644 Mistral/Mistral-3-Instruct-AMD.md

diff --git a/Mistral/Mistral-3-Instruct-AMD.md b/Mistral/Mistral-3-Instruct-AMD.md
new file mode 100644
index 00000000..b1d4c4d5
--- /dev/null
+++ b/Mistral/Mistral-3-Instruct-AMD.md
@@ -0,0 +1,91 @@
+# Ministral 3 14B Instruct on vLLM - AMD Hardware
+
+## Introduction
+
+This quick start recipe explains how to run the Ministral 3 14B Instruct model on AMD Instinct MI300X, MI325X, or MI355X GPUs using vLLM.
+
+## Key benefits of AMD GPUs for large models and developers
+
+AMD Instinct GPU accelerators are purpose-built to handle the demands of next-gen models like Ministral:
+- Large HBM memory enables longer contexts and higher concurrency.
+- Optimized Triton and AITER kernels provide best-in-class performance and TCO for production deployment.
+- Strong single-node performance reduces infrastructure complexity for serving.
+
+## Access & Licensing
+
+### License and Model parameters
+
+Please check whether you have access to the following model:
+- [Ministral 3 14B Instruct](https://huggingface.co/mistralai/Ministral-3-14B-Instruct-2512)
+
+## Prerequisites
+
+- OS: Linux
+- Drivers: ROCm 7.0 or above
+- GPU: AMD Instinct MI300X, MI325X, or MI355X
+
+## Deployment Steps
+
+### 1. Using the vLLM Docker image (for AMD users)
+
+```bash
+alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 32G -v /data:/data -v $HOME:/myhome -w /myhome --entrypoint /bin/bash'
+drun vllm/vllm-openai-rocm:v0.14.0
+```
+
+### 2. Start vLLM online server (run in background)
+
+```bash
+export TP=1
+export VLLM_ROCM_USE_AITER=1
+export MODEL="mistralai/Ministral-3-14B-Instruct-2512"
+vllm serve $MODEL \
+    --disable-log-requests \
+    --port 9090 \
+    -tp $TP \
+    --config_format mistral \
+    --load_format mistral \
+    --enable-auto-tool-choice \
+    --tool-call-parser mistral &
+```
+
+### 3. Running inference with a sample request
+
+Test the model with a text-only prompt.
+
+```bash
+curl http://localhost:9090/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "mistralai/Ministral-3-14B-Instruct-2512",
+        "prompt": "Explain the benefits of KV cache in transformer decoding.",
+        "max_tokens": 128,
+        "temperature": 0
+    }'
+```
+
+Test result (local run):
+```bash
+"text":" How does it help in reducing the computational cost?\n\n### Understanding KV Cache in Transformer Decoding\n\nThe **KV cache** (Key-Value cache) is a technique used in **transformer-based models** (like GPT, BERT, etc.) during **autoregressive decoding** to improve efficiency. Here's how it works and why it's beneficial:\n\n---\n\n### **1. What is KV Cache?**\nDuring decoding, a transformer model generates tokens one by one. For each new token, the model computes **attention scores** between the current token and all previous tokens in the sequence. The attention mechanism involves:\n"
+```
+
+### 4. Performance benchmark
+
+```bash
+export MODEL="mistralai/Ministral-3-14B-Instruct-2512"
+export ISL=1024
+export OSL=1024
+export REQ=10
+export CONC=10
+vllm bench serve \
+    --backend vllm \
+    --model $MODEL \
+    --dataset-name random \
+    --random-input-len $ISL \
+    --random-output-len $OSL \
+    --num-prompts $REQ \
+    --ignore-eos \
+    --max-concurrency $CONC \
+    --port 9090 \
+    --percentile-metrics ttft,tpot,itl,e2el
+```
\ No newline at end of file

From 46dd5dcdcfa806f41cc81c6a29983a01eda6e8a0 Mon Sep 17 00:00:00 2001
From: Hyukjoon Lee
Date: Mon, 9 Feb 2026 16:54:22 +0900
Subject: [PATCH 2/3] Update Mistral-3-Instruct-AMD.md

Signed-off-by: Hyukjoon Lee
---
 Mistral/Mistral-3-Instruct-AMD.md | 46 ++++++++++++------------------
 1 file changed, 18 insertions(+), 28 deletions(-)

diff --git a/Mistral/Mistral-3-Instruct-AMD.md b/Mistral/Mistral-3-Instruct-AMD.md
index b1d4c4d5..b3ec0a7c 100644
--- a/Mistral/Mistral-3-Instruct-AMD.md
+++ b/Mistral/Mistral-3-Instruct-AMD.md
@@ -29,8 +29,20 @@ Please check whether you have access to the following model:
 ### 1. Using the vLLM Docker image (for AMD users)
 
 ```bash
-alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 32G -v /data:/data -v $HOME:/myhome -w /myhome --entrypoint /bin/bash'
-drun vllm/vllm-openai-rocm:v0.14.0
+docker run -it \
+    --network=host \
+    --device=/dev/kfd \
+    --device=/dev/dri \
+    --group-add=video \
+    --ipc=host \
+    --cap-add=SYS_PTRACE \
+    --security-opt seccomp=unconfined \
+    --shm-size 32G \
+    -v /data:/data \
+    -v $HOME:/myhome \
+    -w /myhome \
+    --entrypoint /bin/bash \
+    vllm/vllm-openai-rocm:latest
 ```
 
 ### 2. Start vLLM online server (run in background)
@@ -40,8 +52,7 @@ export TP=1
 export VLLM_ROCM_USE_AITER=1
 export MODEL="mistralai/Ministral-3-14B-Instruct-2512"
 vllm serve $MODEL \
-    --disable-log-requests \
-    --port 9090 \
+    --disable-log-requests \
     -tp $TP \
     --config_format mistral \
     --load_format mistral \
@@ -49,27 +60,7 @@ vllm serve $MODEL \
     --enable-auto-tool-choice \
     --tool-call-parser mistral &
 ```
 
-### 3. Running inference with a sample request
-
-Test the model with a text-only prompt.
-
-```bash
-curl http://localhost:9090/v1/completions \
-    -H "Content-Type: application/json" \
-    -d '{
-        "model": "mistralai/Ministral-3-14B-Instruct-2512",
-        "prompt": "Explain the benefits of KV cache in transformer decoding.",
-        "max_tokens": 128,
-        "temperature": 0
-    }'
-```
-
-Test result (local run):
-```bash
-"text":" How does it help in reducing the computational cost?\n\n### Understanding KV Cache in Transformer Decoding\n\nThe **KV cache** (Key-Value cache) is a technique used in **transformer-based models** (like GPT, BERT, etc.) during **autoregressive decoding** to improve efficiency. Here's how it works and why it's beneficial:\n\n---\n\n### **1. What is KV Cache?**\nDuring decoding, a transformer model generates tokens one by one. For each new token, the model computes **attention scores** between the current token and all previous tokens in the sequence. The attention mechanism involves:\n"
-```
-
-### 4. Performance benchmark
+### 3. Performance benchmark
 
 ```bash
 export MODEL="mistralai/Ministral-3-14B-Instruct-2512"
@@ -85,7 +76,6 @@ vllm bench serve \
     --random-output-len $OSL \
     --num-prompts $REQ \
     --ignore-eos \
-    --max-concurrency $CONC \
-    --port 9090 \
+    --max-concurrency $CONC \
     --percentile-metrics ttft,tpot,itl,e2el
-```
\ No newline at end of file
+```

From 47f9b034416943467cc873ce03e4e933c3686223 Mon Sep 17 00:00:00 2001
From: Hyukjoon Lee
Date: Mon, 9 Feb 2026 17:11:30 +0900
Subject: [PATCH 3/3] Update Mistral-3-Instruct-AMD.md

Signed-off-by: Hyukjoon Lee
---
 Mistral/Mistral-3-Instruct-AMD.md | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/Mistral/Mistral-3-Instruct-AMD.md b/Mistral/Mistral-3-Instruct-AMD.md
index b3ec0a7c..c34c7b02 100644
--- a/Mistral/Mistral-3-Instruct-AMD.md
+++ b/Mistral/Mistral-3-Instruct-AMD.md
@@ -44,7 +44,13 @@ docker run -it \
     --entrypoint /bin/bash \
     vllm/vllm-openai-rocm:latest
 ```
-
+Alternatively, you can use a uv environment.
+> Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment does not meet these requirements, please use the Docker-based setup as described in the [documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#pre-built-images).
+```bash
+uv venv
+source .venv/bin/activate
+uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
+```
 ### 2. Start vLLM online server (run in background)
 
 ```bash
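The serve command in step 2 backgrounds the server with `&`, so requests sent immediately afterwards can fail while the weights are still loading. Below is a minimal readiness-check sketch; it assumes `curl` is available and that the server exposes vLLM's standard `/health` endpoint on the default port 8000 (the serve command no longer pins `--port` after the second patch).

```shell
# Poll a URL until it responds, or give up after a fixed number of tries.
# -s silences curl's progress output; -f makes curl exit nonzero on errors.
wait_for_url() {
  local url=$1 tries=${2:-60} delay=${3:-2}
  local i
  for i in $(seq "$tries"); do
    if curl -sf "$url" > /dev/null; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Example: block until the background server is ready before sending traffic
# wait_for_url "http://localhost:8000/health" && echo "server is up"
```

Keeping the check in a function makes it reusable for both the serve step and the benchmark step, and the retry count bounds how long a broken deployment can hang a script.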
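The second patch drops the curl smoke test, but a quick sanity request before the full benchmark is still useful. This is a sketch, not part of the recipe: it assumes the server from step 2 is running on the default port 8000, and it uses only Python's standard library to pull the generated text out of the OpenAI-style JSON response (no `jq` required).

```shell
# Extract choices[0].text from an OpenAI-compatible completions response.
extract_text() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["text"])'
}

# Example call (uncomment once the server is up; port 8000 is an assumption):
# curl -s http://localhost:8000/v1/completions \
#     -H "Content-Type: application/json" \
#     -d '{"model": "mistralai/Ministral-3-14B-Instruct-2512",
#          "prompt": "Explain the benefits of KV cache in transformer decoding.",
#          "max_tokens": 64, "temperature": 0}' \
#   | extract_text
```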
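The benchmark configuration makes the token totals easy to sanity-check: with `--ignore-eos`, every one of the `REQ` requests generates the full `OSL` output tokens. The sketch below shows the arithmetic; `E2E_SECONDS` is an illustrative placeholder, not a measured result.

```shell
# REQ requests * OSL tokens each = total output tokens for the run;
# dividing by end-to-end wall time gives aggregate output throughput.
REQ=10
OSL=1024
E2E_SECONDS=40   # illustrative placeholder, not a measured value
TOTAL_OUT=$(( REQ * OSL ))
echo "total output tokens: $TOTAL_OUT"
awk -v tok="$TOTAL_OUT" -v t="$E2E_SECONDS" \
    'BEGIN { printf "implied output throughput: %.1f tok/s\n", tok / t }'
```

Comparing this implied figure against the throughput that `vllm bench serve` reports is a quick way to spot a misconfigured run (for example, requests terminating early or concurrency capped below `CONC`).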