Llama 3.3-70B update for AMD GPU #212
# Llama 3.3 70B Instruct on vLLM - AMD Hardware
## Introduction

This quick start recipe explains how to run the Llama 3.3 70B Instruct model on AMD MI300X, MI325X, and MI355X GPUs using vLLM.
## Key benefits of AMD GPUs for large models and developers

AMD Instinct GPU accelerators are purpose-built to handle the demands of next-generation models like Llama 3.3:
- Runs large 70B-parameter models with strong throughput on a single node.
- Massive HBM memory capacity enables extended context lengths and larger batch sizes.
- Optimized Triton and AITER kernels provide best-in-class performance and TCO for production deployments.
## Access & Licensing

### License and model parameters

To use the Llama 3.3 model, you must first request access to the model repository on Hugging Face:

- [Llama 3.3 70B Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)
## Prerequisites

- OS: Linux
- Drivers: ROCm 7.0 or above
- GPU: AMD MI300X, MI325X, or MI355X
## Deployment Steps

### 1. Using the vLLM Docker image (for AMD users)
```bash
docker run -it \
  --network=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --shm-size 32G \
  -v /data:/data \
  -v $HOME:/myhome \
  -w /myhome \
  --entrypoint /bin/bash \
  vllm/vllm-openai-rocm:latest
```
Alternatively, you can install vLLM in a uv virtual environment.

> Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment does not meet these requirements, please use the Docker-based setup described in the [documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#pre-built-images).

```bash
uv venv
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
```
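The Python and glibc requirements in the note above can be checked programmatically before attempting the install. This is an illustrative sketch, not part of the recipe; the helper name and version thresholds are taken from the note, and `platform.libc_ver()` only reports a version on glibc-based systems:

```python
import platform
import sys

def meets_wheel_requirements(py_version, glibc_version):
    """Return True if a (major, minor) Python version tuple and a glibc
    version string satisfy the wheel's stated minimums (3.12 and 2.35)."""
    py_ok = tuple(py_version)[:2] >= (3, 12)
    glibc_ok = tuple(int(p) for p in glibc_version.split(".")[:2]) >= (2, 35)
    return py_ok and glibc_ok

# Check the current interpreter and libc; libc_ver() is empty off glibc,
# so fall back to "0" (which fails the check, as it should).
print(meets_wheel_requirements(sys.version_info[:2],
                               platform.libc_ver()[1] or "0"))
```

If this prints `False`, fall back to the Docker-based setup above.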
### 2. Start vLLM online server (run in background)

```bash
export TP=2
export MODEL="meta-llama/Llama-3.3-70B-Instruct"
export VLLM_ROCM_USE_AITER=1
vllm serve $MODEL \
  --disable-log-requests \
  -tp $TP &
```
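`vllm serve` exposes an OpenAI-compatible API, by default on port 8000. As a minimal sketch (the prompt text, `max_tokens`, and `temperature` values here are example choices, not part of the recipe), this is the request body you would POST to `/v1/chat/completions` once the server is up:

```python
import json

MODEL = "meta-llama/Llama-3.3-70B-Instruct"

# Request body for the OpenAI-compatible chat completions endpoint,
# e.g. POST http://localhost:8000/v1/chat/completions
payload = {
    "model": MODEL,
    "messages": [
        {"role": "user", "content": "Summarize vLLM in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

body = json.dumps(payload)
print(body)
```

With the server running, you can send this body with `curl -H "Content-Type: application/json" -d "$BODY"` or point the `openai` Python client at `http://localhost:8000/v1`.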
|
|
||||||
### 3. Performance benchmark

```bash
export MODEL="meta-llama/Llama-3.3-70B-Instruct"
export ISL=1024
export OSL=1024
export REQ=10
export CONC=10
vllm bench serve \
  --backend vllm \
  --model $MODEL \
  --dataset-name random \
  --random-input-len $ISL \
  --random-output-len $OSL \
  --num-prompts $REQ \
  --ignore-eos \
  --max-concurrency $CONC \
  --percentile-metrics ttft,tpot,itl,e2el
```
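The percentile metrics requested above are related: for a single request, end-to-end latency (e2el) is roughly the time to first token (ttft) plus one time-per-output-token interval (tpot) for each remaining output token. A back-of-the-envelope sketch, using made-up (not measured) figures:

```python
def approx_e2e_latency(ttft_s, tpot_s, output_len):
    """Rough end-to-end latency estimate: time to first token, plus one
    inter-token interval for each of the remaining output tokens."""
    return ttft_s + tpot_s * (output_len - 1)

# Hypothetical figures: 0.5 s TTFT, 20 ms per output token, OSL=1024
print(round(approx_e2e_latency(0.5, 0.020, 1024), 2))
```

This is useful for sanity-checking benchmark output: if the reported e2el is far above ttft + tpot * (OSL - 1), requests are likely queuing behind the concurrency limit.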