From 2a449a6a51927445733ba8c6bfdc9785a0010b93 Mon Sep 17 00:00:00 2001
From: hyukjlee
Date: Wed, 28 Jan 2026 01:58:03 +0000
Subject: [PATCH 1/3] Llama3.3-70B update for AMD GPU

Signed-off-by: hyukjlee
---
 Llama/Llama3.3_70B_AMD.md | 84 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 84 insertions(+)
 create mode 100644 Llama/Llama3.3_70B_AMD.md

diff --git a/Llama/Llama3.3_70B_AMD.md b/Llama/Llama3.3_70B_AMD.md
new file mode 100644
index 00000000..28a65aaa
--- /dev/null
+++ b/Llama/Llama3.3_70B_AMD.md
@@ -0,0 +1,84 @@
+# Llama 3.3 70B Instruct on vLLM - AMD Hardware
+
+## Introduction
+
+This quick start recipe explains how to run the Llama 3.3 70B Instruct model on AMD Instinct MI300X, MI325X, and MI355X GPUs using vLLM.
+
+## Key benefits of AMD GPUs for large models and developers
+
+AMD Instinct GPU accelerators are purpose-built to handle the demands of next-generation models like Llama 3.3:
+- Run large 70B-parameter models with strong throughput on a single node.
+- Massive HBM capacity enables extended context lengths and larger batch sizes.
+- Optimized Triton and AITER kernels provide best-in-class performance and TCO for production deployments.
+
+## Access & Licensing
+
+### License and model access
+
+To use the Llama 3.3 model, you must first request access to the model repository on Hugging Face.
+- [Llama 3.3 70B Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)
+
+## Prerequisites
+
+- OS: Linux
+- Drivers: ROCm 7.0 or later
+- GPU: AMD Instinct MI300X, MI325X, or MI355X
+
+## Deployment Steps
+
+### 1. Using the vLLM Docker image (for AMD users)
+
+```bash
+alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 32G -v /data:/data -v $HOME:/myhome -w /myhome --entrypoint /bin/bash'
+drun vllm/vllm-openai-rocm:v0.14.1
+```
+
+### 2. Start vLLM online server (run in background)
+
+```bash
+export TP=2
+export MODEL="meta-llama/Llama-3.3-70B-Instruct"
+export VLLM_ROCM_USE_AITER=1
+vllm serve $MODEL \
+  --disable-log-requests \
+  --port 8005 \
+  -tp $TP &
+```
+
+### 3. Run inference with a test request
+
+Test the model with a text-only prompt.
+
+```bash
+curl http://localhost:8005/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "meta-llama/Llama-3.3-70B-Instruct",
+    "prompt": "Summarize the key differences between throughput and latency in LLM serving.",
+    "max_tokens": 128,
+    "temperature": 0
+  }'
+```
+
+### 4. Performance benchmark
+
+```bash
+export MODEL="meta-llama/Llama-3.3-70B-Instruct"
+export ISL=1024
+export OSL=1024
+export REQ=10
+export CONC=10
+vllm bench serve \
+  --backend vllm \
+  --model $MODEL \
+  --dataset-name random \
+  --random-input-len $ISL \
+  --random-output-len $OSL \
+  --num-prompts $REQ \
+  --ignore-eos \
+  --max-concurrency $CONC \
+  --port 8005 \
+  --percentile-metrics ttft,tpot,itl,e2el
+```

From 2d8abbeb1714decd40e4ea14c3609feaddc49c21 Mon Sep 17 00:00:00 2001
From: Hyukjoon Lee
Date: Mon, 9 Feb 2026 16:53:22 +0900
Subject: [PATCH 2/3] Update Llama3.3_70B_AMD.md

Signed-off-by: Hyukjoon Lee
---
 Llama/Llama3.3_70B_AMD.md | 39 +++++++++++++++++----------------------
 1 file changed, 17 insertions(+), 22 deletions(-)

diff --git a/Llama/Llama3.3_70B_AMD.md b/Llama/Llama3.3_70B_AMD.md
index 28a65aaa..2513d7bb 100644
--- a/Llama/Llama3.3_70B_AMD.md
+++ b/Llama/Llama3.3_70B_AMD.md
@@ -29,8 +29,20 @@ To use the Llama 3.3 model, you must first request access to the model reposito
 ### 1. Using the vLLM Docker image (for AMD users)
 
 ```bash
-alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 32G -v /data:/data -v $HOME:/myhome -w /myhome --entrypoint /bin/bash'
-drun vllm/vllm-openai-rocm:v0.14.1
+docker run -it \
+  --network=host \
+  --device=/dev/kfd \
+  --device=/dev/dri \
+  --group-add=video \
+  --ipc=host \
+  --cap-add=SYS_PTRACE \
+  --security-opt seccomp=unconfined \
+  --shm-size 32G \
+  -v /data:/data \
+  -v $HOME:/myhome \
+  -w /myhome \
+  --entrypoint /bin/bash \
+  vllm/vllm-openai-rocm:latest
 ```
 
 ### 2. Start vLLM online server (run in background)
@@ -40,27 +52,11 @@ export TP=2
 export MODEL="meta-llama/Llama-3.3-70B-Instruct"
 export VLLM_ROCM_USE_AITER=1
 vllm serve $MODEL \
-  --disable-log-requests \
-  --port 8005 \
+  --disable-log-requests \
   -tp $TP &
 ```
 
-### 3. Run inference with a test request
-
-Test the model with a text-only prompt.
-
-```bash
-curl http://localhost:8005/v1/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "meta-llama/Llama-3.3-70B-Instruct",
-    "prompt": "Summarize the key differences between throughput and latency in LLM serving.",
-    "max_tokens": 128,
-    "temperature": 0
-  }'
-```
-
-### 4. Performance benchmark
+### 3. Performance benchmark
 
 ```bash
 export MODEL="meta-llama/Llama-3.3-70B-Instruct"
@@ -76,8 +72,7 @@ vllm bench serve \
   --random-output-len $OSL \
   --num-prompts $REQ \
   --ignore-eos \
-  --max-concurrency $CONC \
-  --port 8005 \
+  --max-concurrency $CONC \
   --percentile-metrics ttft,tpot,itl,e2el
 ```
 

From f463c16aadad14356f19008051259004f346c7e2 Mon Sep 17 00:00:00 2001
From: Hyukjoon Lee
Date: Mon, 9 Feb 2026 17:10:15 +0900
Subject: [PATCH 3/3] Update Llama3.3_70B_AMD.md

Signed-off-by: Hyukjoon Lee
---
 Llama/Llama3.3_70B_AMD.md | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/Llama/Llama3.3_70B_AMD.md b/Llama/Llama3.3_70B_AMD.md
index 2513d7bb..bc15ce4d 100644
--- a/Llama/Llama3.3_70B_AMD.md
+++ b/Llama/Llama3.3_70B_AMD.md
@@ -44,7 +44,13 @@ docker run -it \
   --entrypoint /bin/bash \
   vllm/vllm-openai-rocm:latest
 ```
-
+Alternatively, you can use a uv virtual environment.
+> Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment does not meet these requirements, please use the Docker-based setup as described in the [documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#pre-built-images).
+```bash
+uv venv
+source .venv/bin/activate
+uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
+```
 ### 2. Start vLLM online server (run in background)
 
 ```bash