Llama 3.3-70B update for AMD GPU #212
# Llama 3.3 70B Instruct on vLLM - AMD Hardware
## Introduction

This quick start recipe explains how to run the Llama 3.3 70B Instruct model on AMD MI300X, MI325X, and MI355X GPUs using vLLM.
## Key benefits of AMD GPUs for large models and developers

AMD Instinct GPU accelerators are purpose-built to handle the demands of next-generation models like Llama 3.3:
- Runs large 70B-parameter models with strong throughput on a single node.
- Massive HBM memory capacity enables extended context lengths and larger batch sizes.
- Optimized Triton and AITER kernels provide best-in-class performance and TCO for production deployments.
## Access & Licensing

### License and model parameters

To use the Llama 3.3 model, you must first request access to the model repository on Hugging Face:

- [Llama 3.3 70B Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)
## Prerequisites

- OS: Linux
- Drivers: ROCm 7.0 or above
- GPU: AMD MI300X, MI325X, or MI355X
## Deployment Steps

### 1. Using the vLLM Docker image (for AMD users)
```bash
docker run -it \
  --network=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --shm-size 32G \
  -v /data:/data \
  -v $HOME:/myhome \
  -w /myhome \
  --entrypoint /bin/bash \
  vllm/vllm-openai-rocm:latest
```
Alternatively, you can install vLLM in a uv virtual environment.

> Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment does not meet these requirements, please use the Docker-based setup described in the [documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#pre-built-images).

```bash
uv venv
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
```
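The Python and glibc requirements in the note above can be checked programmatically before attempting the install. This is an illustrative sketch, not part of the recipe; the helper name and version thresholds are taken from the note, and `platform.libc_ver()` only reports a version on glibc-based systems:

```python
import platform
import sys

def meets_wheel_requirements(py_version, glibc_version):
    """Return True if a (major, minor) Python version tuple and a glibc
    version string satisfy the wheel's stated minimums (3.12 and 2.35)."""
    py_ok = tuple(py_version)[:2] >= (3, 12)
    glibc_ok = tuple(int(p) for p in glibc_version.split(".")[:2]) >= (2, 35)
    return py_ok and glibc_ok

# Check the current interpreter and libc; libc_ver() is empty off glibc,
# so fall back to "0" (which fails the check, as it should).
print(meets_wheel_requirements(sys.version_info[:2],
                               platform.libc_ver()[1] or "0"))
```

If this prints `False`, fall back to the Docker-based setup above.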
### 2. Start vLLM online server (run in background)

```bash
export TP=2
export MODEL="meta-llama/Llama-3.3-70B-Instruct"
export VLLM_ROCM_USE_AITER=1
vllm serve $MODEL \
  --disable-log-requests \
  -tp $TP &
```
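`vllm serve` exposes an OpenAI-compatible API, by default on port 8000. As a minimal sketch (the prompt text, `max_tokens`, and `temperature` values here are example choices, not part of the recipe), this is the request body you would POST to `/v1/chat/completions` once the server is up:

```python
import json

MODEL = "meta-llama/Llama-3.3-70B-Instruct"

# Request body for the OpenAI-compatible chat completions endpoint,
# e.g. POST http://localhost:8000/v1/chat/completions
payload = {
    "model": MODEL,
    "messages": [
        {"role": "user", "content": "Summarize vLLM in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

body = json.dumps(payload)
print(body)
```

With the server running, you can send this body with `curl -H "Content-Type: application/json" -d "$BODY"` or point the `openai` Python client at `http://localhost:8000/v1`.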
|
|
||||||
### 3. Performance benchmark

```bash
export MODEL="meta-llama/Llama-3.3-70B-Instruct"
export ISL=1024
export OSL=1024
export REQ=10
export CONC=10
vllm bench serve \
  --backend vllm \
  --model $MODEL \
  --dataset-name random \
  --random-input-len $ISL \
  --random-output-len $OSL \
  --num-prompts $REQ \
  --ignore-eos \
  --max-concurrency $CONC \
  --percentile-metrics ttft,tpot,itl,e2el
```
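The percentile metrics requested above are related: for a single request, end-to-end latency (e2el) is roughly the time to first token (ttft) plus one time-per-output-token interval (tpot) for each remaining output token. A back-of-the-envelope sketch, using made-up (not measured) figures:

```python
def approx_e2e_latency(ttft_s, tpot_s, output_len):
    """Rough end-to-end latency estimate: time to first token, plus one
    inter-token interval for each of the remaining output tokens."""
    return ttft_s + tpot_s * (output_len - 1)

# Hypothetical figures: 0.5 s TTFT, 20 ms per output token, OSL=1024
print(round(approx_e2e_latency(0.5, 0.020, 1024), 2))
```

This is useful for sanity-checking benchmark output: if the reported e2el is far above ttft + tpot * (OSL - 1), requests are likely queuing behind the concurrency limit.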