
Add MiniMax-M2.5 Usage Guide #257

Draft
faradawn wants to merge 1 commit into vllm-project:main from faradawn:minimax-2.5-recipe

Conversation

@faradawn

Add detailed configs for different deployments

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @faradawn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new, detailed usage guide for integrating the MiniMax-M2.5 model with vLLM. The guide provides essential steps for setting up the environment, deploying the model with various configurations, and evaluating its performance through benchmarking, aiming to streamline the adoption and utilization of the model.

Highlights

  • New Usage Guide Added: A comprehensive guide for running the MiniMax-M2.5 model with vLLM has been added, covering installation, deployment, and benchmarking procedures.
  • vLLM Installation Instructions: Detailed steps for installing vLLM via pip (including a note recommending nightly builds if corrupted output is observed) and via Docker are now available.
  • Model Deployment Configurations: Specific Docker commands for deploying MiniMax-M2.5 are provided, including recommended GPU setups (e.g., 4x H200/H20 or A100/A800) and a configuration for B200 (FP8).
  • Benchmarking Script and Expected Output: A script for benchmarking MiniMax-M2.5 performance using vllm bench serve is included, along with an example of the expected output metrics.


Changelog
  • MiniMax/MiniMax-M2.5.md
    • Added a new usage guide for MiniMax-M2.5 with vLLM.
Activity
  • No human activity has been recorded on this pull request yet.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces a new usage guide for MiniMax-M2.5. However, the current draft overlaps substantially with the existing MiniMax/MiniMax-M2.md and lacks the 'detailed configs' promised in the description: it includes duplicate Docker commands and is missing hardware-specific optimizations for B200. I recommend consolidating this information into the existing comprehensive guide, or expanding this file with unique, optimized deployment configurations and verified benchmark data.

Comment on lines +1 to +3
# MiniMax-M2.5 Usage Guide

This guide describes how to run [MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) with vLLM.

medium

This new guide significantly overlaps with the existing MiniMax/MiniMax-M2.md, which already covers MiniMax-M2.5 and provides more comprehensive details such as system requirements, advanced parallelism (DP/EP), and verified benchmarks. Consider merging any unique M2.5-specific information into the existing guide instead of creating a separate file to avoid documentation fragmentation and maintenance overhead.


## Running MiniMax-M2.5

MiniMax-M2.5 can be run on different GPU configurations. The recommended setup uses 4x H200/H20 or 4x A100/A800 GPUs with tensor parallelism.

medium

To fulfill the goal of providing 'detailed configs for different deployments', it would be beneficial to include examples for Data Parallelism (DP) and Expert Parallelism (EP). Since pure TP8 is not supported for this model, providing the DP8+EP or TP+EP commands is crucial for users scaling beyond 4 GPUs.

Comment on lines +38 to +47
docker run --gpus all \
-p 8000:8000 \
--ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:nightly MiniMaxAI/MiniMax-M2.5 \
--tensor-parallel-size 4 \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--enable-auto-tool-choice \
--trust-remote-code

medium

This Docker command is identical to the one provided in the installation section (lines 19-29). For a specific 'B200 (FP8)' deployment, it should include the necessary environment variables (e.g., VLLM_USE_FLASHINFER_MOE_FP8=0) to address known compatibility issues on this hardware, as documented in the general MiniMax guide.

Suggested change

Before:

docker run --gpus all \
-p 8000:8000 \
--ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:nightly MiniMaxAI/MiniMax-M2.5 \
--tensor-parallel-size 4 \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--enable-auto-tool-choice \
--trust-remote-code

After:

docker run --gpus all \
-e VLLM_USE_FLASHINFER_MOE_FP8=0 \
-p 8000:8000 \
--ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:nightly MiniMaxAI/MiniMax-M2.5 \
--tensor-parallel-size 4 \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--enable-auto-tool-choice \
--trust-remote-code
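Once a container launched with either variant is serving on port 8000, the endpoint can be exercised with a standard OpenAI-compatible request (the prompt below is purely illustrative):

```shell
# Query the OpenAI-compatible /v1/chat/completions endpoint exposed by vLLM.
# Assumes the server from the Docker command above is up on localhost:8000.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "MiniMaxAI/MiniMax-M2.5",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```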

Comment on lines +71 to +94
============ Serving Benchmark Result ============
Successful requests: xxx
Failed requests: xxx
Maximum request concurrency: xxx
Benchmark duration (s): xxx
Total input tokens: xxx
Total generated tokens: xxx
Request throughput (req/s): xxx
Output token throughput (tok/s): xxx
Peak output token throughput (tok/s): xxx
Peak concurrent requests: xxx
Total Token throughput (tok/s): xxx
---------------Time to First Token----------------
Mean TTFT (ms): xxx
Median TTFT (ms): xxx
P99 TTFT (ms): xxx
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): xxx
Median TPOT (ms): xxx
P99 TPOT (ms): xxx
---------------Inter-token Latency----------------
Mean ITL (ms): xxx
Median ITL (ms): xxx
P99 ITL (ms): xxx

medium

The benchmark results are currently empty placeholders (xxx). Providing actual representative performance data or removing this section until verified metrics are available would improve the guide's utility and credibility.
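For reference, a `vllm bench serve` invocation that produces output in the shape quoted above might look like the following. The dataset choice and token counts here are assumptions for illustration, not values taken from the PR:

```shell
# Sketch: random-dataset serving benchmark against a local vLLM server.
# Parameter values are illustrative; tune them to your workload.
vllm bench serve \
  --model MiniMaxAI/MiniMax-M2.5 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 512 \
  --num-prompts 200 \
  --max-concurrency 32
```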

@jeejeelee
Collaborator

@functionstackx
Contributor

hi @kedarpotdar-nv @faradawn, for H100 I'm seeing that TP4EP4 is the best recipe versus TP8EP8:

--enable-expert-parallel --tensor-parallel-size=4

can you add this to your PR instead of me starting another PR?
