This guide describes how to run [ERNIE-4.5-21B-A3B-PT](https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-PT) with vLLM.

## Installing vLLM
Note: requires transformers >= 4.54.0 and vllm >= 0.10.1.

### CUDA
```bash
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
```
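A quick way to confirm that the installed versions satisfy the note above is to print them from Python:

```shell
# Print the installed versions to confirm they satisfy
# transformers >= 4.54.0 and vllm >= 0.10.1.
python -c "import vllm; print('vllm', vllm.__version__)"
python -c "import transformers; print('transformers', transformers.__version__)"
```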

### AMD ROCm: MI300x/MI325x/MI355x
```bash
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.16.0/rocm700
```
⚠️ The vLLM wheel for ROCm is compatible with Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment is incompatible, use the Docker flow described in the [vLLM documentation](https://vllm.ai/).

## Running Ernie4.5
### Serving Ernie4.5 Model on H100 GPUs
```bash
# 21B model 80G*1 GPU
vllm serve baidu/ERNIE-4.5-21B-A3B-PT

# 300B model
vllm serve baidu/ERNIE-4.5-300B-A47B-PT \
--tensor-parallel-size 16
```
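Once a server is up, it can be smoke-tested through vLLM's OpenAI-compatible API. The snippet below is a minimal sketch: it assumes the default port 8000 and the 21B model from the first command above.

```shell
# Request payload for the OpenAI-compatible chat endpoint.
payload='{"model": "baidu/ERNIE-4.5-21B-A3B-PT", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 32}'

# POST it to the default vLLM port; the || branch just reports
# when the server is not reachable instead of aborting.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$payload" || echo "server not reachable"
```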

### Serving Ernie4.5 Model on MI300X/MI325X/MI355X GPUs
Run vLLM online serving on AMD GPUs using the command below:
```bash
VLLM_ROCM_USE_AITER=1 \
SAFETENSORS_FAST_GPU=1 \
vllm serve baidu/ERNIE-4.5-21B-A3B-PT \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--disable-log-requests \
--trust-remote-code
```

## Running Ernie4.5 MTP
```bash
# 21B MTP model 80G*1 GPU
vllm serve baidu/ERNIE-4.5-21B-A3B-PT \
  --speculative-config '{"method": "ernie_mtp","model": "baidu/ERNIE-4.5-21B-A3B-PT","num_speculative_tokens": 1}'

# 300B MTP model
vllm serve baidu/ERNIE-4.5-300B-A47B-PT \
--speculative-config '{"method": "ernie_mtp","model": "baidu/ERNIE-4.5-300B-A47B-PT","num_speculative_tokens": 1}'
```


## Benchmarking

For benchmarking, use only the first `vllm bench serve` run after service startup, so results are not skewed by the prefix cache.



```bash
# Prompt-heavy benchmark (8k/1k)
vllm bench serve \
  --model baidu/ERNIE-4.5-21B-A3B-PT \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --num-prompts 16
```

### Benchmark Configurations

Test different workloads by adjusting input/output lengths:

- **Prompt-heavy**: 8000 input / 1000 output
- **Decode-heavy**: 1000 input / 8000 output
- **Balanced**: 1000 input / 1000 output

Test different batch sizes by changing `--num-prompts`, e.g., 1, 16, 32, 64, 128, 256, 512
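The batch-size sweep can be scripted. This is a sketch under the same assumptions as the benchmark command above (random dataset, balanced 1k/1k workload, 21B model); with `DRY_RUN=1` (the default here) each invocation is only printed, so the loop can be inspected before running it for real.

```shell
# Sweep --num-prompts over the batch sizes listed above.
# DRY_RUN=1 only prints the commands; unset it to execute them.
DRY_RUN=${DRY_RUN:-1}
for n in 1 16 32 64 128 256 512; do
  cmd="vllm bench serve --model baidu/ERNIE-4.5-21B-A3B-PT --dataset-name random --random-input-len 1000 --random-output-len 1000 --num-prompts $n"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$cmd"
  else
    $cmd
  fi
done
```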

### Expected Output



```shell
============ Serving Benchmark Result ============
Successful requests: 16
Mean ITL (ms): 16.84
Median ITL (ms): 15.49
P99 ITL (ms): 20.69
==================================================
```
