85 changes: 85 additions & 0 deletions Llama/Llama3.3_70B_AMD.md
@@ -0,0 +1,85 @@
# Llama 3.3 70B Instruct on vLLM - AMD Hardware

## Introduction

This quick start recipe explains how to run the Llama 3.3 70B Instruct model on AMD MI300X/MI355X GPUs using vLLM.

Review comment (Contributor, medium):

For consistency, please consider including the MI325X GPU in this introductory sentence, as it is mentioned in the 'Prerequisites' section below.

Suggested change:
- This quick start recipe explains how to run the Llama 3.3 70B Instruct model on AMD MI300X/MI355X GPUs using vLLM.
+ This quick start recipe explains how to run the Llama 3.3 70B Instruct model on AMD MI300X, MI325X, and MI355X GPUs using vLLM.


## Key benefits of AMD GPUs for large models and developers

The AMD Instinct GPUs accelerators are purpose-built to handle the demands of next-gen models like Llama 3.3:

Review comment (Contributor, medium):

The phrase 'GPUs accelerators' is redundant. Please consider rephrasing to either 'GPU accelerators' or simply 'GPUs' for conciseness and clarity.

Suggested change:
- The AMD Instinct GPUs accelerators are purpose-built to handle the demands of next-gen models like Llama 3.3:
+ The AMD Instinct GPU accelerators are purpose-built to handle the demands of next-gen models like Llama 3.3:

- Can run large 70B-parameter models with strong throughput on a single node.
- Massive HBM memory capacity enables support for extended context lengths and larger batch sizes.
- Using Optimized Triton and AITER kernels provide best-in-class performance and TCO for production deployment.

Review comment (Contributor, medium):

The phrasing 'Using Optimized Triton...' is a bit awkward for a list item. To improve readability, consider rephrasing to start with a noun or adjective, similar to the other items in the list.

Suggested change:
- - Using Optimized Triton and AITER kernels provide best-in-class performance and TCO for production deployment.
+ - Optimized Triton and AITER kernels provide best-in-class performance and TCO for production deployment.


## Access & Licensing

### License and Model parameters

To use the Llama 3.3 model, you must first request access to the model repository on Hugging Face:

- [Llama 3.3 70B Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)

## Prerequisites

- OS: Linux
- Drivers: ROCm 7.0 or above
- GPU: AMD MI300X, MI325X, or MI355X
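
The prerequisites above can be partly checked from a shell before going further. The following is a minimal sketch, not an official check: `check_paths_exist` and `check_amd_host` are hypothetical helper names, and the sketch assumes that the ROCm driver exposes the `/dev/kfd` and `/dev/dri` device nodes that the Docker command in the next section passes through. It does not verify the ROCm version or the GPU model.

```bash
# Hypothetical helper: report each missing path and return non-zero if any is absent.
check_paths_exist() {
  local missing=0 path
  for path in "$@"; do
    if [ ! -e "$path" ]; then
      echo "missing: $path" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Hypothetical helper: confirm the host runs Linux and exposes the ROCm device nodes.
# Assumption: /dev/kfd and /dev/dri are present whenever the ROCm driver is loaded.
check_amd_host() {
  if [ "$(uname -s)" != "Linux" ]; then
    echo "unsupported OS: $(uname -s)" >&2
    return 1
  fi
  check_paths_exist /dev/kfd /dev/dri
}
```

Running `check_amd_host && echo "host looks ready"` prints the message only when both checks pass. If the ROCm tools are installed, `rocminfo` can additionally be used to list the detected GPUs.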

## Deployment Steps

### 1. Use the vLLM Docker image (for AMD users)

```bash
# Pass the ROCm device nodes (/dev/kfd, /dev/dri) into the container and open a shell
docker run -it \
  --network=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --shm-size 32G \
  -v /data:/data \
  -v "$HOME":/myhome \
  -w /myhome \
  --entrypoint /bin/bash \
  vllm/vllm-openai-rocm:latest
```

Alternatively, you can install vLLM into a uv virtual environment.

> Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment does not meet these requirements, please use the Docker-based setup as described in the [documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#pre-built-images).

```bash
# Create the environment with Python 3.12, which the ROCm wheel requires
uv venv --python 3.12
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
```
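
The Python and glibc requirements from the note above can be checked from inside the activated environment. This is a minimal sketch under stated assumptions: `version_ge` and `check_wheel_requirements` are hypothetical helper names, the comparison relies on GNU `sort -V`, and the glibc version is read with `getconf GNU_LIBC_VERSION`. The ROCm version is not checked here.

```bash
# version_ge A B: succeeds when version A >= version B (relies on GNU `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: compare the active Python and the system glibc
# against the versions named in the note (Python 3.12, glibc >= 2.35).
check_wheel_requirements() {
  local py glibc ok=0
  py="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])' 2>/dev/null)"
  glibc="$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}')"
  if [ "$py" != "3.12" ]; then
    echo "Python 3.12 required, found: ${py:-none}" >&2
    ok=1
  fi
  if [ -z "$glibc" ] || ! version_ge "$glibc" "2.35"; then
    echo "glibc >= 2.35 required, found: ${glibc:-unknown}" >&2
    ok=1
  fi
  return "$ok"
}
```

Run `check_wheel_requirements` after `source .venv/bin/activate` so that `python3` refers to the environment's interpreter. A non-zero exit status suggests falling back to the Docker-based setup.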

### 2. Start the vLLM online server (run in background)

```bash
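# Tensor-parallel size: the number of GPUs the model is sharded across
export TP=2
export MODEL="meta-llama/Llama-3.3-70B-Instruct"
# Enable AITER kernels on ROCm
export VLLM_ROCM_USE_AITER=1
# The trailing & runs the server in the background
vllm serve "$MODEL" \
  --disable-log-requests \
  -tp "$TP" &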
```
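
Because the server starts in the background, loading the 70B weights can take a while before it accepts requests. The sketch below polls until the server responds. It assumes the server listens on vLLM's default address `http://localhost:8000` and serves the `/health` endpoint. `wait_for_vllm` is a hypothetical helper name, and the default retry count and delay are illustrative values, not recommendations.

```bash
# Hypothetical helper: poll the server's /health endpoint until it answers.
# Arguments (all optional): base URL, number of attempts, delay in seconds.
wait_for_vllm() {
  local base_url="${1:-http://localhost:8000}"
  local attempts="${2:-120}"
  local delay="${3:-10}"
  local i
  for ((i = 1; i <= attempts; i++)); do
    # --fail makes curl return non-zero on HTTP errors as well as connection errors
    if curl --silent --fail --output /dev/null "${base_url}/health"; then
      echo "vLLM server is ready at ${base_url}"
      return 0
    fi
    sleep "$delay"
  done
  echo "vLLM server did not become ready after ${attempts} attempts" >&2
  return 1
}
```

Calling `wait_for_vllm` with no arguments before the benchmark step keeps the benchmark from failing on connection errors while the model is still loading.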

### 3. Performance benchmark

```bash
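export MODEL="meta-llama/Llama-3.3-70B-Instruct"
export ISL=1024   # input sequence length (tokens per prompt)
export OSL=1024   # output sequence length (tokens per response)
export REQ=10     # total number of prompts to send
export CONC=10    # maximum number of concurrent requests
vllm bench serve \
  --backend vllm \
  --model "$MODEL" \
  --dataset-name random \
  --random-input-len "$ISL" \
  --random-output-len "$OSL" \
  --num-prompts "$REQ" \
  --ignore-eos \
  --max-concurrency "$CONC" \
  --percentile-metrics ttft,tpot,itl,e2el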
```
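
A single run at one concurrency level gives only one point on the latency/throughput curve. The following sketch repeats the same benchmark command at several concurrency levels. `bench_sweep` and the `PROMPTS_PER_CONC` variable are hypothetical names introduced for illustration, and scaling the prompt count with the concurrency level is an assumption rather than part of the recipe. With the default multiplier of 1, the prompt count equals the concurrency, as in the command above.

```bash
# Hypothetical helper: run the benchmark once per concurrency level.
# Usage: bench_sweep MODEL CONC [CONC ...]
bench_sweep() {
  local model="$1"
  shift
  local conc prompts
  for conc in "$@"; do
    # Assumption: send PROMPTS_PER_CONC prompts per concurrent slot (default 1)
    prompts=$((conc * ${PROMPTS_PER_CONC:-1}))
    echo "=== max concurrency: ${conc}, prompts: ${prompts} ==="
    vllm bench serve \
      --backend vllm \
      --model "$model" \
      --dataset-name random \
      --random-input-len "${ISL:-1024}" \
      --random-output-len "${OSL:-1024}" \
      --num-prompts "$prompts" \
      --ignore-eos \
      --max-concurrency "$conc" \
      --percentile-metrics ttft,tpot,itl,e2el
  done
}
```

For example, `bench_sweep "$MODEL" 1 4 16` runs three benchmarks in sequence. Raising `PROMPTS_PER_CONC` lengthens each run, which may reduce noise in the reported percentiles.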