From 0f065ac8ad59230bca8c23ff9b5dfbdbbdc64578 Mon Sep 17 00:00:00 2001
From: ChangLiu0709
Date: Tue, 24 Feb 2026 17:40:17 +0000
Subject: [PATCH 1/4] Add Ernie4.5-VL recipe with AMD MI300X/MI325X/MI355X
 support

Signed-off-by: seungrokj
Signed-off-by: ChangLiu0709
---
 Ernie/Ernie4.5-VL.md | 100 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 99 insertions(+), 1 deletion(-)

diff --git a/Ernie/Ernie4.5-VL.md b/Ernie/Ernie4.5-VL.md
index 87a37b09..a4ed8ff8 100644
--- a/Ernie/Ernie4.5-VL.md
+++ b/Ernie/Ernie4.5-VL.md
@@ -4,7 +4,7 @@ This guide describes how to run [ERNIE-4.5-VL-28B-A3B-PT](https://huggingface.co

 ## Installing vLLM

-Ernie4.5-VL support was recently added to vLLM main branch and is not yet available in any official release:
+ERNIE-4.5-VL support was recently added to the vLLM main branch and is not yet available in any official release:
 ```bash
 uv venv --python 3.12 --seed
 source .venv/bin/activate
 uv pip install -U vllm --torch-backend auto
 ```
@@ -101,3 +101,101 @@
 Median ITL (ms): 36.35
 P99 ITL (ms): 236.49
 ==================================================
+
+
+## AMD GPU Support
+
+Please follow the steps here to install and run the ERNIE-4.5-VL model on AMD MI300X, MI325X, MI355X GPUs.
+
+### Step 1: Prepare Environment
+#### Option 1: Installation from pre-built wheels (For AMD ROCm: MI300x/MI325x/MI355x)
+We recommend using the official package for AMD GPUs (MI300x/MI325x/MI355x).
+```bash
+uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm
+```
+⚠️ The vLLM wheel for ROCm is compatible with Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment is incompatible, please use the Docker flow in [vLLM](https://vllm.ai/).
+
+#### Option 2: Docker image
+Pull the latest vLLM Docker image:
+
+```bash
+docker pull vllm/vllm-openai-rocm:latest
+```
+
+Launch the ROCm vLLM Docker container:
+
+```bash
+docker run -it \
+  --ipc=host \
+  --network=host \
+  --privileged \
+  --cap-add=CAP_SYS_ADMIN \
+  --device=/dev/kfd \
+  --device=/dev/dri \
+  --device=/dev/mem \
+  --group-add video \
+  --cap-add=SYS_PTRACE \
+  --security-opt seccomp=unconfined \
+  -v $(pwd):/work \
+  -e SHELL=/bin/bash \
+  --name Ernie-4.5-VL \
+  vllm/vllm-openai-rocm:latest
+```
+
+After running the command above, you are already inside the container. Proceed to Step 2 in that shell. If you detached from the container or started it in detached mode, attach to the container with:
+
+```bash
+docker attach Ernie-4.5-VL
+```
+
+### Step 2: Log in to Hugging Face
+Hugging Face login:
+
+```bash
+huggingface-cli login
+```
+
+### Step 3: Start the vLLM server
+
+Run the vLLM online serving.
+Sample command:
+```bash
+VLLM_ROCM_USE_AITER=1 \
+SAFETENSORS_FAST_GPU=1 \
+vllm serve baidu/ERNIE-4.5-VL-28B-A3B-PT \
+  --tensor-parallel-size 4 \
+  --gpu-memory-utilization 0.9 \
+  --disable-log-requests \
+  --no-enable-prefix-caching \
+  --trust-remote-code
+```
+
+
+### Step 4: Run Benchmark
+Open a new terminal and run the following command to execute the benchmark script:
+
+```bash
+vllm bench serve \
+  --model baidu/ERNIE-4.5-VL-28B-A3B-PT \
+  --dataset-name random \
+  --random-input-len 8000 \
+  --random-output-len 1000 \
+  --request-rate 10000 \
+  --num-prompts 16 \
+  --trust-remote-code \
+  --ignore-eos
+```
+
+If you are using a Docker environment, open a new terminal and run the benchmark inside the container with:
+
+```bash
+docker exec -it Ernie-4.5-VL vllm bench serve \
+  --model baidu/ERNIE-4.5-VL-28B-A3B-PT \
+  --dataset-name random \
+  --random-input-len 8000 \
+  --random-output-len 1000 \
+  --request-rate 10000 \
+  --num-prompts 16 \
+  --trust-remote-code \
+  --ignore-eos
+```
\ No newline at end of file

From b751ce5c176edbf0b65a0e9b3a85b6cabf265000 Mon Sep 17 00:00:00 2001
From: ChangLiu0709
Date: Fri, 27 Feb 2026 15:55:47 +0000
Subject: [PATCH 2/4] Remove docker benchmark command

Signed-off-by: ChangLiu0709
---
 Ernie/Ernie4.5-VL.md | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/Ernie/Ernie4.5-VL.md b/Ernie/Ernie4.5-VL.md
index a4ed8ff8..70c2a178 100644
--- a/Ernie/Ernie4.5-VL.md
+++ b/Ernie/Ernie4.5-VL.md
@@ -185,17 +185,3 @@ vllm bench serve \
   --model baidu/ERNIE-4.5-VL-28B-A3B-PT \
   --dataset-name random \
   --random-input-len 8000 \
   --random-output-len 1000 \
   --request-rate 10000 \
   --num-prompts 16 \
   --trust-remote-code \
   --ignore-eos
 ```
-
-If you are using a Docker environment, open a new terminal and run the benchmark inside the container with:
-
-```bash
-docker exec -it Ernie-4.5-VL vllm bench serve \
-  --model baidu/ERNIE-4.5-VL-28B-A3B-PT \
-  --dataset-name random \
-  --random-input-len 8000 \
-  --random-output-len 1000 \
-  --request-rate 10000 \
-  --num-prompts 16 \
-  --trust-remote-code \
-  --ignore-eos
-```
\ No newline at end of file

From 81e69b6644341f0a911a8abf8957bb84330b3ebf Mon Sep 17 00:00:00 2001
From: ChangLiu0709
Date: Fri, 27 Feb 2026 16:52:55 +0000
Subject: [PATCH 3/4] Reformat the content merging AMD and NVIDIA settings
 together

Signed-off-by: ChangLiu0709
---
 Ernie/Ernie4.5-VL.md | 106 ++++++++++-------------------------------------------------------
 1 file changed, 21 insertions(+), 85 deletions(-)

diff --git a/Ernie/Ernie4.5-VL.md b/Ernie/Ernie4.5-VL.md
index 70c2a178..40fe44c7 100644
--- a/Ernie/Ernie4.5-VL.md
+++ b/Ernie/Ernie4.5-VL.md
@@ -11,8 +11,15 @@ source .venv/bin/activate
 uv pip install -U vllm --torch-backend auto
 ```

+## Installing vLLM (For AMD ROCm: MI300x/MI325x/MI355x)
+```bash
+uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.14.1/rocm700
+```
+⚠️ The vLLM wheel for ROCm is compatible with Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment is incompatible, please use the Docker flow in [vLLM](https://vllm.ai/).
+
 ## Running Ernie4.5-VL
+### Serving Ernie4.5-VL Model on H100 GPUs
 NOTE: torch.compile and CUDA graph are not supported due to the heterogeneous expert architecture (vision and text experts).
 ```bash
 # 28B model 80G*1 GPU
@@ -37,7 +44,6 @@ vllm serve baidu/ERNIE-4.5-VL-424B-A47B-PT \
   --cpu-offload-gb 50
 ```
-
 If your single node's GPU memory is insufficient, native BF16 deployment may require multiple nodes; for multi-node deployment, refer to the [vLLM doc](https://docs.vllm.ai/en/latest/serving/parallelism_scaling.html?#multi-node-deployment) to start a Ray cluster. Then run vLLM on the master node:
 ```bash
 # 424B model 80G*16 GPU with native BF16
@@ -46,6 +52,20 @@ vllm serve baidu/ERNIE-4.5-VL-424B-A47B-PT \
   --tensor-parallel-size 16
 ```

+### Serving Ernie4.5-VL Model on MI300X/MI325X/MI355X GPUs
+
+Run the vLLM online serving on AMD GPUs using the command below:
+```bash
+VLLM_ROCM_USE_AITER=1 \
+SAFETENSORS_FAST_GPU=1 \
+vllm serve baidu/ERNIE-4.5-VL-28B-A3B-PT \
+  --tensor-parallel-size 4 \
+  --gpu-memory-utilization 0.9 \
+  --disable-log-requests \
+  --no-enable-prefix-caching \
+  --trust-remote-code
+```
+
 ## Benchmarking

 For benchmarking, use only the first `vllm bench serve` run after service startup to ensure it is not affected by the prefix cache
@@ -101,87 +121,3 @@
 Median ITL (ms): 36.35
 P99 ITL (ms): 236.49
 ==================================================
-
-
-## AMD GPU Support
-
-Please follow the steps here to install and run the ERNIE-4.5-VL model on AMD MI300X, MI325X, MI355X GPUs.
-
-### Step 1: Prepare Environment
-#### Option 1: Installation from pre-built wheels (For AMD ROCm: MI300x/MI325x/MI355x)
-We recommend using the official package for AMD GPUs (MI300x/MI325x/MI355x).
-```bash
-uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm
-```
-⚠️ The vLLM wheel for ROCm is compatible with Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment is incompatible, please use the Docker flow in [vLLM](https://vllm.ai/).
-
-#### Option 2: Docker image
-Pull the latest vLLM Docker image:
-
-```bash
-docker pull vllm/vllm-openai-rocm:latest
-```
-
-Launch the ROCm vLLM Docker container:
-
-```bash
-docker run -it \
-  --ipc=host \
-  --network=host \
-  --privileged \
-  --cap-add=CAP_SYS_ADMIN \
-  --device=/dev/kfd \
-  --device=/dev/dri \
-  --device=/dev/mem \
-  --group-add video \
-  --cap-add=SYS_PTRACE \
-  --security-opt seccomp=unconfined \
-  -v $(pwd):/work \
-  -e SHELL=/bin/bash \
-  --name Ernie-4.5-VL \
-  vllm/vllm-openai-rocm:latest
-```
-
-After running the command above, you are already inside the container. Proceed to Step 2 in that shell. If you detached from the container or started it in detached mode, attach to the container with:
-
-```bash
-docker attach Ernie-4.5-VL
-```
-
-### Step 2: Log in to Hugging Face
-Hugging Face login:
-
-```bash
-huggingface-cli login
-```
-
-### Step 3: Start the vLLM server
-
-Run the vLLM online serving.
-Sample command:
-```bash
-VLLM_ROCM_USE_AITER=1 \
-SAFETENSORS_FAST_GPU=1 \
-vllm serve baidu/ERNIE-4.5-VL-28B-A3B-PT \
-  --tensor-parallel-size 4 \
-  --gpu-memory-utilization 0.9 \
-  --disable-log-requests \
-  --no-enable-prefix-caching \
-  --trust-remote-code
-```
-
-
-### Step 4: Run Benchmark
-Open a new terminal and run the following command to execute the benchmark script:
-
-```bash
-vllm bench serve \
-  --model baidu/ERNIE-4.5-VL-28B-A3B-PT \
-  --dataset-name random \
-  --random-input-len 8000 \
-  --random-output-len 1000 \
-  --request-rate 10000 \
-  --num-prompts 16 \
-  --trust-remote-code \
-  --ignore-eos
-```

From 437d5bfc31498ce28166a702db577405a08c9fc9 Mon Sep 17 00:00:00 2001
From: ChangLiu0709
Date: Thu, 5 Mar 2026 12:35:09 +0000
Subject: [PATCH 4/4] Add sections for CUDA and ROCm

Signed-off-by: ChangLiu0709
---
 Ernie/Ernie4.5-VL.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/Ernie/Ernie4.5-VL.md b/Ernie/Ernie4.5-VL.md
index 40fe44c7..2c38a1b2 100644
--- a/Ernie/Ernie4.5-VL.md
+++ b/Ernie/Ernie4.5-VL.md
@@ -4,6 +4,7 @@ This guide describes how to run [ERNIE-4.5-VL-28B-A3B-PT](https://huggingface.co

 ## Installing vLLM

+### CUDA
 ERNIE-4.5-VL support was recently added to the vLLM main branch and is not yet available in any official release:
 ```bash
 uv venv --python 3.12 --seed
 source .venv/bin/activate
@@ -11,9 +12,9 @@ source .venv/bin/activate
 uv pip install -U vllm --torch-backend auto
 ```

-## Installing vLLM (For AMD ROCm: MI300x/MI325x/MI355x)
+### AMD ROCm: MI300X/MI325X/MI355X
 ```bash
-uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.14.1/rocm700
+uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.16.0/rocm700
 ```
 ⚠️ The vLLM wheel for ROCm is compatible with Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment is incompatible, please use the Docker flow in [vLLM](https://vllm.ai/).
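Once a server from one of the serving commands in this recipe is up, the endpoint can be smoke-tested with an OpenAI-compatible chat request. The sketch below only builds and prints such a payload; the `http://localhost:8000/v1` address is the vLLM default (an assumption, adjust if you changed host or port), and the image URL is a placeholder, not part of the recipe.

```python
import json
import urllib.request

# vLLM exposes an OpenAI-compatible API; localhost:8000 is its default
# bind address (assumption: the server was started with default --host/--port).
BASE_URL = "http://localhost:8000/v1"
MODEL = "baidu/ERNIE-4.5-VL-28B-A3B-PT"


def build_vl_request(image_url: str, question: str, model: str = MODEL) -> dict:
    """Build an OpenAI-style chat payload with one image part and one text part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    # Placeholder image URL; replace with a real, reachable image.
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": 128,
    }


payload = build_vl_request("https://example.com/cat.png", "What is in this image?")
print(json.dumps(payload, indent=2))

# With the server running, send it (commented out so the sketch runs offline):
# req = urllib.request.Request(
#     f"{BASE_URL}/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

The same payload works against both the CUDA and the ROCm deployments, since only the launch flags differ, not the serving API.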