From 234bbc331d78d0ebcdc5a81de4dbad8c0cf0f411 Mon Sep 17 00:00:00 2001
From: Yuan Yue <yueyuan@amd.com>
Date: Mon, 26 Jan 2026 10:57:57 +0000
Subject: [PATCH] Add AMD GPU support for Nemotron-3-Nano-30B-A3B

Tested and verified on AMD MI300X:
- Docker image: rocm/vllm-dev:nightly
- AITER disabled for compatibility
- Benchmark passed: 4/4 requests successful (1024/1024 input/output)

Co-authored-by: billishyahao <bill.he@amd.com>
Signed-off-by: Yuan Yue <yueyuan@amd.com>
---
 NVIDIA/Nemotron-3-Nano-30B-A3B.md | 45 +++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/NVIDIA/Nemotron-3-Nano-30B-A3B.md b/NVIDIA/Nemotron-3-Nano-30B-A3B.md
index 3fcfb439..42baac46 100644
--- a/NVIDIA/Nemotron-3-Nano-30B-A3B.md
+++ b/NVIDIA/Nemotron-3-Nano-30B-A3B.md
@@ -277,3 +277,48 @@ The two main tunable configs for Nemotron Nano 3 are the `--tensor-parallel-size
 - Therefore, increasing TP (which would lower the throughput at the same BS) may allow higher BS to run (which would increase the throughput), and the net throughput gain/loss depends on models and configurations.
 
 Note that the statements above assume that the concurrency setting on the client side, like the `--max-concurrency` flag in the performance benchmarking command, matches the `--max-num-seqs` (BS) setting on the server side.
+
+### AMD GPU Support
+
+Follow these steps to install and run Nemotron-3-Nano-30B-A3B models on AMD MI300X GPUs.
+### Step 1: Prepare Docker Environment
+Pull the latest ROCm vLLM Docker image:
+```shell
+docker pull rocm/vllm-dev:nightly
+```
+Launch the ROCm vLLM Docker container:
+```shell
+docker run -it --ipc=host --network=host --privileged --cap-add=CAP_SYS_ADMIN --device=/dev/kfd --device=/dev/dri --device=/dev/mem --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $(pwd):/work -e SHELL=/bin/bash --name Nemotron-Nano rocm/vllm-dev:nightly
+```
+### Step 2: Log in to Hugging Face
+Authenticate with Hugging Face inside the container:
+```shell
+huggingface-cli login
+```
+### Step 3: Start the vLLM server
+Start the vLLM online serving endpoint inside the container.
+Sample command:
+```shell
+SAFETENSORS_FAST_GPU=1 \
+VLLM_USE_TRITON_FLASH_ATTN=0 \
+VLLM_ROCM_USE_AITER=0 \
+vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
+  --tensor-parallel-size 1 \
+  --max-model-len 32768 \
+  --max-num-seqs 256 \
+  --trust-remote-code \
+  --disable-log-requests
+```
+### Step 4: Run Benchmark
+Open a new terminal and run the following command to execute the benchmark script inside the container.
+```shell
+docker exec -it Nemotron-Nano vllm bench serve \
+  --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
+  --dataset-name random \
+  --random-input-len 1024 \
+  --random-output-len 1024 \
+  --request-rate 1 \
+  --num-prompts 4 \
+  --ignore-eos \
+  --trust-remote-code
+```