45 changes: 45 additions & 0 deletions NVIDIA/Nemotron-3-Nano-30B-A3B.md
@@ -277,3 +277,48 @@ The two main tunable configs for Nemotron Nano 3 are the `--tensor-parallel-size`
- Therefore, increasing TP (which would lower the throughput at the same BS) may allow higher BS to run (which would increase the throughput), and the net throughput gain/loss depends on models and configurations.

Note that the statements above assume that the concurrency setting on the client side, like the `--max-concurrency` flag in the performance benchmarking command, matches the `--max-num-seqs` (BS) setting on the server side.
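
For example, the two settings can be kept in sync with a single shared variable (a sketch only; the model name and the use of `vllm bench serve` as the benchmarking command are assumptions carried over from the commands later in this guide):

```shell
# Shared setting: keep server-side batch size and client-side concurrency equal.
CONCURRENCY=256

# Server side: cap the number of concurrently scheduled sequences (BS).
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    --max-num-seqs "$CONCURRENCY" &

# Client side: drive the benchmark at the same concurrency so the
# server's configured batch size is actually saturated.
vllm bench serve \
    --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    --dataset-name random \
    --max-concurrency "$CONCURRENCY" \
    --num-prompts 1024
```

If the client concurrency is lower than `--max-num-seqs`, the measured throughput reflects an underfilled batch rather than the server's configured limit.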

### AMD GPU Support

Please follow the steps here to install and run Nemotron-3-Nano-30B-A3B models on AMD MI300X GPU.

> **Review comment (medium):** The PR description mentions support for MI300X/MI325X/MI355X, but the documentation only lists MI300X. To provide complete information to users, please list all supported AMD GPU models.
>
> Suggested change:
>
> ```diff
> -Please follow the steps here to install and run Nemotron-3-Nano-30B-A3B models on AMD MI300X GPU.
> +Please follow the steps here to install and run Nemotron-3-Nano-30B-A3B models on AMD MI300X/MI325X/MI355X GPUs.
> ```

### Step 1: Prepare Docker Environment
Pull the latest vLLM Docker image:
```shell
docker pull rocm/vllm-dev:nightly
```
Launch the ROCm vLLM container:
```shell
docker run -it --ipc=host --network=host --privileged --cap-add=CAP_SYS_ADMIN --device=/dev/kfd --device=/dev/dri --device=/dev/mem --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $(pwd):/work -e SHELL=/bin/bash --name Nemotron-Nano rocm/vllm-dev:nightly
```

> **Review comment (medium):** This `docker run` command is very long and scrolls horizontally, which harms readability. Consider breaking it into multiple lines using backslashes (`\`) for better clarity and ease of use:
>
> ```shell
> docker run -it \
>     --ipc=host \
>     --network=host \
>     --privileged \
>     --cap-add=CAP_SYS_ADMIN \
>     --device=/dev/kfd \
>     --device=/dev/dri \
>     --device=/dev/mem \
>     --group-add video \
>     --cap-add=SYS_PTRACE \
>     --security-opt seccomp=unconfined \
>     -v $(pwd):/work \
>     -e SHELL=/bin/bash \
>     --name Nemotron-Nano \
>     rocm/vllm-dev:nightly
> ```
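
Once inside the container, it can help to confirm that the AMD GPUs are visible before going further (a quick sanity check; `rocm-smi` is assumed to be present in this ROCm-based image):

```shell
# Inside the container: list the visible AMD GPUs by product name.
# rocm-smi ships with the ROCm stack; if it is absent from the image,
# `rocminfo` is an alternative way to enumerate gfx devices.
rocm-smi --showproductname
```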
### Step 2: Log in to Hugging Face
Log in with your Hugging Face access token:
```shell
huggingface-cli login
```
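
For non-interactive setups (scripts, CI), the token can instead be supplied via the `HF_TOKEN` environment variable, which `huggingface_hub` reads automatically; the value below is a placeholder, not a real token:

```shell
# Non-interactive alternative to `huggingface-cli login`:
# huggingface_hub (and therefore vLLM's weight download) picks up
# HF_TOKEN from the environment without an explicit login step.
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx   # placeholder token
```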
### Step 3: Start the vLLM server
Start vLLM online serving. Sample command:
```shell
SAFETENSORS_FAST_GPU=1 \
VLLM_USE_TRITON_FLASH_ATTN=0 \
VLLM_ROCM_USE_AITER=0 \
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--max-num-seqs 256 \
--trust-remote-code \
--disable-log-requests
```

> **Review comment (medium, on lines +302 to +310):** The documentation for AMD GPU support only provides instructions for the BF16 model variant. The model also has an FP8 variant. Please clarify whether FP8 is supported on AMD GPUs. If it is, please provide instructions for running it; if not, it would be helpful to state this explicitly as a known limitation. The NVIDIA section provides a good example of how to handle different data types using a `DTYPE` environment variable.
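
Once the server reports it is ready, a quick smoke test can be run against vLLM's OpenAI-compatible endpoint (assuming the default port 8000; the prompt and `max_tokens` value are illustrative):

```shell
# Smoke-test the OpenAI-compatible chat endpoint on the default port.
curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
          "messages": [{"role": "user", "content": "Hello!"}],
          "max_tokens": 32
        }'
```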
### Step 4: Run Benchmark
Open a new terminal and run the following command to execute the benchmark script inside the container.
```shell
docker exec -it Nemotron-Nano vllm bench serve \
--model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 1024 \
--request-rate 1 \
--num-prompts 4 \
--ignore-eos \
--trust-remote-code
```
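
The command above sends a light trickle of traffic (1 request/s over 4 prompts). To explore behavior under heavier load, the same command can be swept across request rates (a sketch reusing the flags above; the rates and prompt counts are illustrative):

```shell
# Sweep request rates to observe how latency and throughput change
# as the server approaches saturation. Prompt count scales with rate
# so each run covers a comparable wall-clock duration.
for RATE in 1 2 4 8; do
  docker exec -it Nemotron-Nano vllm bench serve \
      --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
      --dataset-name random \
      --random-input-len 1024 \
      --random-output-len 1024 \
      --request-rate "$RATE" \
      --num-prompts $((RATE * 16)) \
      --ignore-eos \
      --trust-remote-code
done
```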

> **Review comment (high):** The pull request description mentions "Document known limitations", but this section is missing. It's important to include this for users. For example, the PR description states the setup requires `VLLM_ROCM_USE_AITER=0` for compatibility, and while this is set in the server launch command, there is no explanation of why it's needed. Please add a "Known Limitations" section to explain this and any other caveats for running on AMD hardware.