Signed-off-by: hyukjlee <hyukjlee@amd.com>
Summary of Changes (Gemini Code Assist): This pull request adds a new, detailed guide for deploying and utilizing Llama 4 Scout and Maverick large language models on AMD Instinct GPUs. The documentation covers the entire process, from environment setup using vLLM docker images to executing inference and benchmarking, providing a clear pathway for users to leverage AMD hardware for these specific models.
Code Review
This pull request adds a new guide for running Llama 4 models on AMD hardware. The guide is well-structured and provides useful commands. I've identified several areas for improvement, mainly related to consistency, clarity, and potential typos in commands. My suggestions focus on correcting these minor issues to enhance the guide's readability and ensure the commands run without errors. Key points include correcting an inconsistent list of supported GPUs, fixing minor grammatical errors and typos, and flagging a potentially incorrect Docker image version that could prevent users from following the instructions successfully.
Llama/Llama4-Scout_AMD.md
```bash
alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 32G -v /data:/data -v $HOME:/myhome -w /myhome --entrypoint /bin/bash'
drun vllm/vllm-openai-rocm:v0.14.1
```
The vLLM Docker image version v0.14.1 for ROCm seems unusual and might be a typo. Recent vLLM versions are in the v0.5.x range, and older ROCm-compatible versions were around v0.4.x. This could be a typo for v0.4.1. Using an incorrect tag will likely cause the docker run command to fail. Please verify the correct image tag.
```markdown
## Introduction

This quick start recipe explains how to run Llama 4 Scout 16 experts and Maverick 128 experts models on MI300X and MI355X GPUs.
```
The introduction mentions support for MI300X and MI355X GPUs, but the prerequisites section on line 25 also lists MI325X. For consistency and clarity, please update the introduction to include all supported GPUs.
Suggested change:

```diff
- This quick start recipe explains how to run Llama 4 Scout 16 experts and Maverick 128 experts models on MI300X and MI355X GPUs.
+ This quick start recipe explains how to run Llama 4 Scout 16 experts and Maverick 128 experts models on MI300X, MI325X, and MI355X GPUs.
```
```markdown
The AMD Instinct GPUs accelerators are purpose-built to handle the demands of next-gen models like Llama 4:
- Massive HBM memory capacity enables support for extended context lengths, delivering smooth and efficient performance.
- Using Optimized Triton and AITER kernels provide best-in-class performance and TCO for production deployment.
```
The phrasing 'Using Optimized Triton and AITER kernels provide...' is slightly awkward. To improve clarity and maintain a parallel structure with the previous bullet point, consider rephrasing.
Suggested change:

```diff
- - Using Optimized Triton and AITER kernels provide best-in-class performance and TCO for production deployment.
+ - Optimized Triton and AITER kernels provide best-in-class performance and TCO for production deployment.
```
Llama/Llama4-Scout_AMD.md
```markdown
### License and Model parameters

To use Llama 4 Scout and Maverick models, you must first need to gain access to the model repos under Huggingface.
```
The phrase 'you must first need to' is redundant. Using either 'you must first' or 'you first need to' would be more concise.
Suggested change:

```diff
- To use Llama 4 Scout and Maverick models, you must first need to gain access to the model repos under Huggingface.
+ To use Llama 4 Scout and Maverick models, you first need to gain access to the model repos under Huggingface.
```
Llama/Llama4-Scout_AMD.md
```markdown
- [Llama4 Scout 16E](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct)
- [Llama4 Maverick 128E](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct)
```
The model names use 'Llama4' without a space, which is inconsistent with 'Llama 4' used elsewhere in the document (e.g., the title). For consistency, please add a space.
Suggested change:

```diff
- - [Llama4 Scout 16E](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct)
- - [Llama4 Maverick 128E](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct)
+ - [Llama 4 Scout 16E](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct)
+ - [Llama 4 Maverick 128E](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct)
```
```markdown
### 2. Start vLLM online server (run in background)
```

```bash
export TP=8
export VLLM_ROCM_USE_AITER=1
vllm serve $MODEL \
  --disable-log-requests \
  -tp $TP \
  --max-model-len 32000 &
```
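Because the serve command above is backgrounded with `&`, it returns before the model is actually loaded. A readiness check that polls the vLLM OpenAI server's `/health` endpoint could be added before the inference step; the function below is an illustrative sketch, and the default URL, retry count, and delay are assumptions rather than values from the guide:

```shell
# Illustrative readiness check for the backgrounded vLLM server.
# Default URL, retries, and delay are assumed values, not guide values.
wait_for_vllm() {
  local url=${1:-http://localhost:8000/health}
  local retries=${2:-60}
  local delay=${3:-5}
  local i
  for i in $(seq 1 "$retries"); do
    # curl -sf exits non-zero until the server answers /health successfully
    if curl -sf "$url" > /dev/null 2>&1; then
      echo "server ready"
      return 0
    fi
    sleep "$delay"
  done
  echo "server not reachable after $retries attempts"
  return 1
}
```

For example, `wait_for_vllm && curl http://localhost:8000/v1/models` would only query the model list once the server is up.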
```markdown
### 3. Running Inference using benchmark script
```
The section title 'Running Inference using benchmark script' is a bit misleading, as this section demonstrates a single inference request using curl, not a benchmark script. The next section is dedicated to performance benchmarking. A more accurate title would be 'Running Inference' or 'Basic Inference Test'.
Suggested change:

```diff
- ### 3. Running Inference using benchmark script
+ ### 3. Running Inference
```
Llama/Llama4-Scout_AMD.md
```markdown
### 3. Running Inference using benchmark script

Let the Ll4 Scout model to describe the following two images.
```
Llama/Llama4-Scout_AMD.md
```markdown

```

```bash
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
```
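The request body is cut off in the diff excerpt above. For orientation, a complete single-prompt request to the vLLM `/v1/completions` endpoint looks roughly like the sketch below; the model name, prompt text, and sampling parameters are illustrative assumptions, not values taken from the guide:

```shell
# Illustrative /v1/completions payload; model name, prompt, and
# max_tokens are assumed values, not taken from the guide's diff.
MODEL=${MODEL:-meta-llama/Llama-4-Scout-17B-16E-Instruct}
PAYLOAD=$(cat <<EOF
{
  "model": "$MODEL",
  "prompt": "Describe the image in one sentence.",
  "max_tokens": 128,
  "temperature": 0
}
EOF
)
echo "$PAYLOAD"
# With the server from step 2 running, the request would be sent with:
# curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d "$PAYLOAD"
```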
Signed-off-by: Hyukjoon Lee <hyukjlee@amd.com>
Requesting review for the following PR.