
Llama 3.3-70B update for AMD GPU#212

Open
hyukjlee wants to merge 3 commits into vllm-project:main from hyukjlee:llama_3_3

Conversation

@hyukjlee

Requesting review for the following PR

@gemini-code-assist
Contributor

Summary of Changes

Hello @hyukjlee, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces new documentation to facilitate the deployment and benchmarking of Meta's Llama 3.1 8B Instruct and Llama 3.3 70B Instruct models on AMD MI300X/MI355X GPUs. These guides provide step-by-step instructions for leveraging vLLM to serve these large language models, enhancing accessibility and performance insights for users with AMD hardware.

Highlights

  • Llama 3.1 8B Instruct Guide: Introduced a new quick start guide for deploying the Llama 3.1 8B Instruct model on AMD MI300X/MI355X GPUs using vLLM.
  • Llama 3.3 70B Instruct Guide: Added a comprehensive quick start guide for running the Llama 3.3 70B Instruct model on AMD MI300X/MI355X GPUs, also leveraging vLLM.
  • Standardized Deployment Workflow: Both guides outline a clear four-step deployment process: using a vLLM Docker image, starting the vLLM online server, running inference with a 'curl' command, and performing performance benchmarks.
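As a rough illustration of the inference step in that workflow, the `curl` command in the guides targets vLLM's OpenAI-compatible server. The sketch below builds the JSON body such a request would carry; the model name, port, and prompt are illustrative assumptions, not taken from this PR.

```python
import json

# Hypothetical request body for vLLM's OpenAI-compatible chat endpoint.
# The model identifier and parameters below are assumptions for illustration.
payload = {
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [
        {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 128,
    "temperature": 0.0,
}

# Serialize to the JSON string that curl would POST (typically to
# http://localhost:8000/v1/chat/completions with Content-Type: application/json).
body = json.dumps(payload)
print(body)
```

The same body works with any OpenAI-compatible client; only the base URL and model name need to match the running server.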


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (Contributor) left a comment


Code Review

This pull request adds documentation for running Llama 3.1 8B and Llama 3.3 70B models on AMD hardware. The changes are well-structured and provide useful command-line examples. I've provided a few suggestions to improve clarity and consistency in the new markdown files. Specifically, I've pointed out some minor inconsistencies in the listed hardware and suggested improvements to phrasing for better readability.


## Introduction

This quick start recipe explains how to run the Llama 3.1 8B Instruct model on AMD MI300X/MI355X GPUs using vLLM.

Severity: medium

For consistency, please consider including the MI325X GPU in this introductory sentence, as it is mentioned in the 'Prerequisites' section below.

Suggested change
This quick start recipe explains how to run the Llama 3.1 8B Instruct model on AMD MI300X/MI355X GPUs using vLLM.
This quick start recipe explains how to run the Llama 3.1 8B Instruct model on AMD MI300X, MI325X, and MI355X GPUs using vLLM.


## Key benefits of AMD GPUs on large models and developers

The AMD Instinct GPUs accelerators are purpose-built to handle the demands of next-gen models like Llama 3.1:

Severity: medium

The phrase 'GPUs accelerators' is redundant. Please consider rephrasing to either 'GPU accelerators' or simply 'GPUs' for conciseness and clarity.

Suggested change
The AMD Instinct GPUs accelerators are purpose-built to handle the demands of next-gen models like Llama 3.1:
The AMD Instinct GPU accelerators are purpose-built to handle the demands of next-gen models like Llama 3.1:

-tp $TP &
```

### 3. Running Inference using benchmark script

Severity: medium

The title 'Running Inference using benchmark script' is a bit misleading, as this section demonstrates a single inference request rather than running a benchmark script. A title like 'Running a Test Inference' or 'Running a Sample Inference' would be more accurate.

Suggested change
### 3. Running Inference using benchmark script
### 3. Running a Test Inference


## Introduction

This quick start recipe explains how to run the Llama 3.3 70B Instruct model on AMD MI300X/MI355X GPUs using vLLM.

Severity: medium

For consistency, please consider including the MI325X GPU in this introductory sentence, as it is mentioned in the 'Prerequisites' section below.

Suggested change
This quick start recipe explains how to run the Llama 3.3 70B Instruct model on AMD MI300X/MI355X GPUs using vLLM.
This quick start recipe explains how to run the Llama 3.3 70B Instruct model on AMD MI300X, MI325X, and MI355X GPUs using vLLM.


## Key benefits of AMD GPUs on large models and developers

The AMD Instinct GPUs accelerators are purpose-built to handle the demands of next-gen models like Llama 3.3:

Severity: medium

The phrase 'GPUs accelerators' is redundant. Please consider rephrasing to either 'GPU accelerators' or simply 'GPUs' for conciseness and clarity.

Suggested change
The AMD Instinct GPUs accelerators are purpose-built to handle the demands of next-gen models like Llama 3.3:
The AMD Instinct GPU accelerators are purpose-built to handle the demands of next-gen models like Llama 3.3:

The AMD Instinct GPUs accelerators are purpose-built to handle the demands of next-gen models like Llama 3.3:
- Can run large 70B-parameter models with strong throughput on a single node.
- Massive HBM memory capacity enables support for extended context lengths and larger batch sizes.
- Using Optimized Triton and AITER kernels provide best-in-class performance and TCO for production deployment.

Severity: medium

The phrasing 'Using Optimized Triton...' is a bit awkward for a list item. To improve readability, consider rephrasing to start with a noun or adjective, similar to the other items in the list.

Suggested change
- Using Optimized Triton and AITER kernels provide best-in-class performance and TCO for production deployment.
- Optimized Triton and AITER kernels provide best-in-class performance and TCO for production deployment.

Signed-off-by: hyukjlee <hyukjlee@amd.com>
Signed-off-by: Hyukjoon Lee <hyukjlee@amd.com>
Signed-off-by: Hyukjoon Lee <hyukjlee@amd.com>
