
Llama4-Scout update for AMD GPU#214

Open
hyukjlee wants to merge 4 commits into vllm-project:main from hyukjlee:llama_4

Conversation

@hyukjlee

Requesting review for the following PR.

Signed-off-by: hyukjlee <hyukjlee@amd.com>
@gemini-code-assist

Summary of Changes

Hello @hyukjlee, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds a new, detailed guide for deploying and utilizing Llama 4 Scout and Maverick large language models on AMD Instinct GPUs. The documentation covers the entire process from environment setup using vLLM docker images to executing inference and benchmarking, providing a clear pathway for users to leverage AMD hardware for these specific models.

Highlights

  • New Documentation for AMD GPUs: Introduced a new quick start guide, Llama/Llama4-Scout_AMD.md, detailing how to run Llama 4 Scout and Maverick models on AMD MI300X, MI325X, and MI355X GPUs.
  • vLLM Integration: The guide provides comprehensive steps for deploying these models using a vLLM docker image, including environment setup and server configuration.
  • Inference and Benchmarking: Instructions are included for running inference with example curl commands and for performing performance benchmarks using vllm bench serve.
  • Model Access and Prerequisites: Details on accessing Llama 4 models via Huggingface and necessary prerequisites like ROCm 7.0+ drivers are outlined.
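For reference, the `vllm bench serve` invocation the guide mentions can be sketched as below. This is only an illustrative assembly of the command: the flag names follow vLLM's bench CLI, and the dataset choice and prompt count are placeholders, not values taken from the guide.

```shell
# Sketch of a benchmark invocation against an already-running vLLM server.
# Flag names follow the `vllm bench serve` CLI; values are placeholders.
bench_cmd=(
  vllm bench serve
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct
  --dataset-name random
  --num-prompts 100
)
# Print the assembled command for inspection before running it:
printf '%s ' "${bench_cmd[@]}"; echo
# To actually run it (requires the server from the guide's serve step):
# "${bench_cmd[@]}"
```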




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds a new guide for running Llama 4 models on AMD hardware. The guide is well-structured and provides useful commands. I've identified several areas for improvement, mainly related to consistency, clarity, and potential typos in commands. My suggestions focus on correcting these minor issues to enhance the guide's readability and ensure the commands run without errors. Key points include correcting an inconsistent list of supported GPUs, fixing minor grammatical errors and typos, and flagging a potentially incorrect Docker image version that could prevent users from following the instructions successfully.


```bash
alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 32G -v /data:/data -v $HOME:/myhome -w /myhome --entrypoint /bin/bash'
drun vllm/vllm-openai-rocm:v0.14.1
```


high

The vLLM Docker image version v0.14.1 for ROCm seems unusual and might be a typo. Recent vLLM versions are in the v0.5.x range, and older ROCm-compatible versions were around v0.4.x. This could be a typo for v0.4.1. Using an incorrect tag will likely cause the docker run command to fail. Please verify the correct image tag.
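One cheap local check before debugging a failed `docker pull` is to confirm the tag string at least matches vLLM's `vMAJOR.MINOR.PATCH` tagging scheme. The helper below is only an illustrative sketch; the tag value is the one quoted in the guide, and a format check of course cannot confirm the tag actually exists on the registry.

```shell
# Sanity-check an image tag string against a vX.Y.Z pattern before pulling.
check_tag_format() {
  if echo "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'; then
    echo "tag format ok: $1"
  else
    echo "unexpected tag format: $1"
  fi
}
check_tag_format "v0.14.1"   # the tag used in the guide
```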


## Introduction

This quick start recipe explains how to run Llama 4 Scout 16 experts and Maverick 128 experts models on MI300X and MI355X GPUs.


medium

The introduction mentions support for MI300X and MI355X GPUs, but the prerequisites section on line 25 also lists MI325X. For consistency and clarity, please update the introduction to include all supported GPUs.

Suggested change
This quick start recipe explains how to run Llama 4 Scout 16 experts and Maverick 128 experts models on MI300X and MI355X GPUs.
This quick start recipe explains how to run Llama 4 Scout 16 experts and Maverick 128 experts models on MI300X, MI325X, and MI355X GPUs.


The AMD Instinct GPUs accelerators are purpose-built to handle the demands of next-gen models like Llama 4:
- Massive HBM memory capacity enables support for extended context lengths, delivering smooth and efficient performance.
- Using Optimized Triton and AITER kernels provide best-in-class performance and TCO for production deployment.


medium

The phrasing 'Using Optimized Triton and AITER kernels provide...' is slightly awkward. To improve clarity and maintain a parallel structure with the previous bullet point, consider rephrasing.

Suggested change
- Using Optimized Triton and AITER kernels provide best-in-class performance and TCO for production deployment.
- Optimized Triton and AITER kernels provide best-in-class performance and TCO for production deployment.


### License and Model parameters

To use Llama 4 Scout and Maverick models, you must first need to gain access to the model repos under Huggingface.


medium

The phrase 'you must first need to' is redundant. Using either 'you must first' or 'you first need to' would be more concise.

Suggested change
To use Llama 4 Scout and Maverick models, you must first need to gain access to the model repos under Huggingface.
To use Llama 4 Scout and Maverick models, you first need to gain access to the model repos under Huggingface.

Comment on lines +18 to +19
- [Llama4 Scout 16E](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct)
- [Llama4 Maverick 128E](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct)


medium

The model names use 'Llama4' without a space, which is inconsistent with 'Llama 4' used elsewhere in the document (e.g., the title). For consistency, please add a space.

Suggested change
- [Llama4 Scout 16E](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct)
- [Llama4 Maverick 128E](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct)
- [Llama 4 Scout 16E](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct)
- [Llama 4 Maverick 128E](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct)

### 2. Start vLLM online server (run in background)

```bash
export TP=8
```


medium

This line has a trailing space. It's best to remove it for code cleanliness.

Suggested change
export TP=8
export TP=8

```bash
export VLLM_ROCM_USE_AITER=1
vllm serve $MODEL \
--disable-log-requests \
-tp $TP \
```


medium

This line has trailing spaces after the line continuation character. It's best to remove them for code cleanliness.

Suggested change
-tp $TP \
-tp $TP \

```bash
--max-model-len 32000 &
```
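Since the guide backgrounds the server with `&`, requests sent immediately after launch can fail while the model is still loading. A readiness poll such as the sketch below can help; the `/health` route is standard vLLM server behavior, and the default URL and retry budget here are illustrative assumptions.

```shell
# Poll the server's health endpoint before sending traffic.
# Usage: wait_for_server [url] [max_tries]
wait_for_server() {
  url="${1:-http://localhost:8000/health}"
  tries="${2:-60}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -sf "$url" > /dev/null 2>&1; then
      echo "server is up"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "server not reachable after $tries tries"
  return 1
}
```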

### 3. Running Inference using benchmark script


medium

The section title 'Running Inference using benchmark script' is a bit misleading, as this section demonstrates a single inference request using curl, not a benchmark script. The next section is dedicated to performance benchmarking. A more accurate title would be 'Running Inference' or 'Basic Inference Test'.

Suggested change
### 3. Running Inference using benchmark script
### 3. Running Inference


### 3. Running Inference using benchmark script

Let the Ll4 Scout model to describe the following two images.


medium

There are a couple of small errors in this sentence. 'Ll4' appears to be a typo for 'Llama 4', and 'to describe' is ungrammatical. It should be 'describe'.

Suggested change
Let the Ll4 Scout model to describe the following two images.
Let the Llama 4 Scout model describe the following two images.

![second image](./images/cat.png)

```bash
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
```


medium

The curl command has multiple spaces between arguments, which can be cleaned up for better readability.

Suggested change
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
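When hand-editing a long `-d` payload like the one above, it can help to keep the JSON body in a shell variable and validate it locally before sending. In this sketch the model name matches the Scout repo referenced in the guide, while the prompt and `max_tokens` values are placeholders.

```shell
# Build the request body separately and sanity-check it with a JSON parser.
payload='{
  "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
  "prompt": "Describe this image.",
  "max_tokens": 64
}'
echo "$payload" | python3 -m json.tool > /dev/null && echo "payload is valid JSON"
# Then send it to the running server:
# curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d "$payload"
```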

Signed-off-by: Hyukjoon Lee <hyukjlee@amd.com>
