This guide describes how to run the Nemotron-Nano-12B-v2-VL series on the targeted accelerators.

## Installing vLLM

### CUDA

* vLLM 0.11.0 does not include Nemotron-Nano-12B-v2-VL, so either [install from source](https://docs.vllm.ai/en/v0.6.0/getting_started/installation.html) or use [this nightly build](https://hub.docker.com/layers/vllm/vllm-openai/nightly-8bff831f0aa239006f34b721e63e1340e3472067/images/sha256-ef112680ed30e4b9d7bf794dcda4abd829e9405a73e013f9e046658cf22d0577):
```bash
docker pull vllm/vllm-openai:nightly-8bff831f0aa239006f34b721e63e1340e3472067
```

For DGX Spark, a container release is available at https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=25.12.post1-py3:

```bash
docker pull nvcr.io/nvidia/vllm:25.12.post1-py3
```
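
A hedged sketch of launching the pulled container follows; the mounts, port, and runtime flags here are assumptions, so adjust them for your environment (and swap in the nightly image tag if you pulled that instead):
```bash
# Illustrative only: run the pulled image with GPU access, a Hugging Face cache mount,
# and the OpenAI-compatible port exposed.
docker run --rm -it \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  nvcr.io/nvidia/vllm:25.12.post1-py3
```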

### ROCm

You can choose either Option A (install with uv) or Option B (Docker).

#### Option A: Run on Host with uv
> Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment does not meet these requirements, please use the Docker-based setup as described in the [documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#pre-built-images).
```bash
uv venv
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
```
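
Optionally, a quick sanity check that the wheel installed correctly:
```bash
python -c "import vllm; print(vllm.__version__)"
```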

#### Option B: Run with Docker
Pull the latest vLLM ROCm Docker image:
```shell
docker pull vllm/vllm-openai-rocm:latest
```
Launch the ROCm vLLM Docker container:
```shell
docker run -d -it \
--ipc=host \
--entrypoint /bin/bash \
--network=host \
--privileged \
--cap-add=CAP_SYS_ADMIN \
--device=/dev/kfd \
--device=/dev/dri \
--device=/dev/mem \
--group-add video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-v /:/work \
-e SHELL=/bin/bash \
-p 8000:8000 \
--name Nemotron-Nano-12B \
vllm/vllm-openai-rocm:latest
```

Log in to your Hugging Face account:
```shell
huggingface-cli login
```
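
If you prefer a non-interactive login (for example inside the container), you can pass a token directly; `HF_TOKEN` here is assumed to hold your Hugging Face access token:
```shell
huggingface-cli login --token "$HF_TOKEN"
```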

## Serving Nemotron-Nano-12B-v2-VL

### CUDA

#### Server:
The following command will launch an inference server on 1 GPU.

Notes:
```bash
python3 -m vllm.entrypoints.openai.api_server \
    ... \
    --served-model-name "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16"
```
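
For orientation, a minimal launch might look like the sketch below; every flag other than `--served-model-name` is an assumption here, and the port is chosen to match the client examples that follow:
```bash
python3 -m vllm.entrypoints.openai.api_server \
    --model nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 \
    --trust-remote-code \
    --port 5566 \
    --served-model-name "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16"
```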

#### Client (bash):
```bash
curl -X 'POST' \
'http://127.0.0.1:5566/v1/chat/completions' \
  ...
}'
```
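
For reference, a complete request along the same lines might look like this; the prompt, image URL, and `max_tokens` value are illustrative, not taken from this guide:
```bash
curl -X 'POST' \
  'http://127.0.0.1:5566/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image."},
          {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}}
        ]
      }
    ],
    "max_tokens": 256
  }'
```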

#### Client (Python):
```python
from openai import OpenAI
client = OpenAI(
    ...
)

completion = client.chat.completions.create(
    ...
)
print(completion.choices[0].message.content)
```

#### vLLM `LLM` API

Notes:
* The examples use the [BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16) precision model. We encourage you to try [FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8) and [NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-NVFP4-QAD) as well!
* You can set `max_model_len <len>` ([doc](https://docs.vllm.ai/en/latest/configuration/engine_args.html#-max-model-len)) to reduce memory usage. The model is trained with a ~131K context length, but unless the use case is long-context video, a smaller context fits as well.
* You can set `allowed_local_media_path <root>` ([doc](https://docs.vllm.ai/en/latest/configuration/engine_args.html#-allowed-local-media-path)) to limit which local files are accessible.

##### Efficient Video Sampling (EVS)
* You can set `video_pruning_rate <fraction>` to tweak video compression. Read more about EVS on [arXiv](https://arxiv.org/abs/2510.14624).
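
A minimal sketch of combining these engine arguments when constructing the `LLM` object; the concrete values are illustrative, and passing `video_pruning_rate` as a constructor keyword is an assumption based on the note above:
```python
from vllm import LLM

llm = LLM(
    model="nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
    max_model_len=32768,               # smaller than the ~131K training context to save memory
    allowed_local_media_path="/data",  # assumed root directory for local images/videos
    video_pruning_rate=0.75,           # EVS: prune this fraction of video tokens (assumption)
)
```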


##### Usage with image path
```python
from vllm import LLM, SamplingParams

...

for o in outputs:
    print(o.outputs[0].text)
```
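
A self-contained sketch of image-path usage under stated assumptions (a local `file://` URL inside `allowed_local_media_path`, and `llm.chat` with OpenAI-style messages); it is not the guide's exact script:
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
    allowed_local_media_path="/data",  # assumed local media root
)
sampling = SamplingParams(temperature=0.2, max_tokens=256)

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "file:///data/sample.jpg"}},  # assumed path
    ],
}]

outputs = llm.chat(messages, sampling)
for o in outputs:
    print(o.outputs[0].text)
```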

##### Usage with video path
* See Efficient Video Sampling (EVS) above: it affects videos only and defines how many of the video tokens to prune.
```python
import os
...

for o in outputs:
    print(o.outputs[0].text)
```
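
A corresponding sketch for a local video; the `video_url` content part and the `video_pruning_rate` keyword are assumptions, and the path is illustrative:
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
    allowed_local_media_path="/data",  # assumed local media root
    video_pruning_rate=0.75,           # EVS: prune this fraction of video tokens (assumption)
)
sampling = SamplingParams(temperature=0.2, max_tokens=256)

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Summarize this video."},
        {"type": "video_url", "video_url": {"url": "file:///data/sample.mp4"}},  # assumed path
    ],
}]

outputs = llm.chat(messages, sampling)
for o in outputs:
    print(o.outputs[0].text)
```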

##### Usage with video tensors and custom sampling
```python
from vllm import LLM, SamplingParams
import decord
...


def main():
    ...


if __name__ == "__main__":
    main()
```
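
If you only need the custom frame-sampling piece, here is a hedged decord snippet; the path and frame count are illustrative, and the resulting array is what you would hand to vLLM as the video input:
```python
import numpy as np
import decord

vr = decord.VideoReader("/data/sample.mp4")                    # assumed local video path
num_frames = 16                                                # illustrative sampling budget
indices = np.linspace(0, len(vr) - 1, num_frames).astype(int)  # uniform frame indices
frames = vr.get_batch(indices).asnumpy()                       # (num_frames, H, W, 3) uint8 array
print(frames.shape)
```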

### ROCm

Run vLLM online serving with this sample command. Note that `--tensor-parallel-size 8` assumes an 8-GPU setup; adjust it to the number of GPUs available (e.g. `--tensor-parallel-size 1` for a single GPU):

```shell
SAFETENSORS_FAST_GPU=1 \
vllm serve nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 \
--tensor-parallel-size 8 \
--no-enable-prefix-caching \
--trust-remote-code
```
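
Once the server is up, you can check that it is reachable (assuming the default port 8000):
```shell
curl http://localhost:8000/v1/models
```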