diff --git a/NVIDIA/Nemotron-Nano-12B-v2-VL.md b/NVIDIA/Nemotron-Nano-12B-v2-VL.md
index 393f7fa1..fe950530 100644
--- a/NVIDIA/Nemotron-Nano-12B-v2-VL.md
+++ b/NVIDIA/Nemotron-Nano-12B-v2-VL.md
@@ -4,20 +4,68 @@ This guide describes how to run Nemotron-Nano-12B-v2-VL series on the targeted a
 ## Installing vLLM
 
+### CUDA
+
 * vLLM 0.11.0 does not include Nemotron-Nano-12B-v2-VL, so either [install from source](https://docs.vllm.ai/en/v0.6.0/getting_started/installation.html) or refer to [this](https://hub.docker.com/layers/vllm/vllm-openai/nightly-8bff831f0aa239006f34b721e63e1340e3472067/images/sha256-ef112680ed30e4b9d7bf794dcda4abd829e9405a73e013f9e046658cf22d0577) nightly build
 ```bash
 docker pull vllm/vllm-openai:nightly-8bff831f0aa239006f34b721e63e1340e3472067
 ```
-For DGX Spark, container relase is avaiable
+For DGX Spark, a container release is available: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=25.12.post1-py3
 ```bash
 docker pull nvcr.io/nvidia/vllm:25.12.post1-py3
 ```
+
+### ROCm
+
+You can choose either Option A (install with uv on the host) or Option B (run with Docker).
+
+#### Option A: Run on Host with uv
+> Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment does not meet these requirements, please use the Docker-based setup as described in the [documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#pre-built-images).
+```bash
+uv venv
+source .venv/bin/activate
+uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
+```
+
+#### Option B: Run with Docker
+Pull the latest vLLM ROCm Docker image:
+```shell
+docker pull vllm/vllm-openai-rocm:latest
+```
+Launch the ROCm vLLM container:
+```shell
+docker run -d -it \
+  --ipc=host \
+  --entrypoint /bin/bash \
+  --network=host \
+  --privileged \
+  --cap-add=CAP_SYS_ADMIN \
+  --device=/dev/kfd \
+  --device=/dev/dri \
+  --device=/dev/mem \
+  --group-add video \
+  --cap-add=SYS_PTRACE \
+  --security-opt seccomp=unconfined \
+  -v /:/work \
+  -e SHELL=/bin/bash \
+  -p 8000:8000 \
+  --name Nemotron-Nano-12B \
+  vllm/vllm-openai-rocm:latest
+```
+
+Log in to your Hugging Face account:
+```shell
+huggingface-cli login
+```
+
 ## Serving Nemotron-Nano-12B-v2-VL
-### Server:
+
+### CUDA
+
+#### Server:
 The following command will launch an inference server on 1 GPU.
 
 Notes:
@@ -45,7 +93,7 @@ python3 -m vllm.entrypoints.openai.api_server \
     --served-model-name "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16"
 ```
 
-### Client (bash):
+#### Client (bash):
 ```bash
 curl -X 'POST' \
   'http://127.0.0.1:5566/v1/chat/completions' \
@@ -57,7 +105,7 @@ curl -X 'POST' \
   }'
 ```
 
-### Client (Python):
+#### Client (Python):
 ```python
 from openai import OpenAI
 client = OpenAI(
@@ -112,18 +160,18 @@ completion = client.chat.completions.create(
 print(completion.choices[0].message.content)
 ```
 
-### vLLM `LLM` API
+#### vLLM `LLM` API
 
 Notes:
 * Examples use the [BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16) precision model. We encourage you to try [FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8) and [NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-NVFP4-QAD) as well!
 * You can set `max_model_len` ([doc](https://docs.vllm.ai/en/latest/configuration/engine_args.html#-max-model-len)) to reduce memory usage. The model is trained on a context length of ~131K, but unless the use case involves long-context videos, a smaller context length is sufficient.
 * You can set `allowed_local_media_path` ([doc](https://docs.vllm.ai/en/latest/configuration/engine_args.html#-allowed-local-media-path)) to restrict which local files may be accessed; both arguments are combined in the sketch below.
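+
+A minimal sketch of how these two engine arguments might be combined when constructing the `LLM` object; the context length and media path below are illustrative assumptions, not tuned recommendations:
+```python
+from vllm import LLM
+
+llm = LLM(
+    model="nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
+    # Assumed example value: well below the ~131K training context, to save memory.
+    max_model_len=32768,
+    # Assumed example path: only media files under this directory may be loaded.
+    allowed_local_media_path="/data/media",
+)
+```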
 
-#### Efficient Video Sampling (EVS)
+##### Efficient Video Sampling (EVS)
 * You can set `video_pruning_rate` to control how aggressively video tokens are pruned. Read more about EVS on [arXiv](https://arxiv.org/abs/2510.14624).
 
-#### Usage with image path
+##### Usage with image path
 ```python
 from vllm import LLM, SamplingParams
@@ -165,7 +213,7 @@ for o in outputs:
     print(o.outputs[0].text)
 ```
 
-#### Usage with video path
+##### Usage with video path
 * See Efficient Video Sampling (EVS) above: it affects videos only and defines how many of the video tokens to prune
 ```python
 import os
@@ -213,7 +261,7 @@ for o in outputs:
     print(o.outputs[0].text)
 ```
 
-#### Usage with video tensors and custom sampling
+##### Usage with video tensors and custom sampling
 ```python
 from vllm import LLM, SamplingParams
 import decord
@@ -373,3 +421,14 @@ def main():
 if __name__ == "__main__":
     main()
 ```
+
+### ROCm
+
+Start vLLM online serving with this sample command:
+```shell
+SAFETENSORS_FAST_GPU=1 \
+vllm serve nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 \
+  --tensor-parallel-size 8 \
+  --no-enable-prefix-caching \
+  --trust-remote-code
+```
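+
+Once the server is up, it exposes the same OpenAI-compatible API used in the CUDA examples above. A minimal smoke test, assuming the default port 8000 and the model name exactly as served:
+```shell
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
+    "messages": [{"role": "user", "content": "Give a one-sentence description of what you can do."}]
+  }'
+```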