Conversation
Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
**Summary of Changes** (Gemini Code Assist): This pull request introduces a new, detailed usage guide for integrating the MiniMax-M2.5 model with vLLM. The guide provides the essential steps for setting up the environment, deploying the model with various configurations, and evaluating its performance through benchmarking, aiming to streamline adoption of the model.
Code Review
The pull request introduces a new usage guide for MiniMax-M2.5. However, the current draft overlaps significantly with the existing MiniMax/MiniMax-M2.md and lacks the 'detailed configs' promised in the description: it duplicates the same Docker command and omits the hardware-specific optimizations for B200. I recommend consolidating this information into the existing comprehensive guide, or expanding this file with unique, optimized deployment configurations and verified benchmark data.
> # MiniMax-M2.5 Usage Guide
>
> This guide describes how to run [MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) with vLLM.
This new guide significantly overlaps with the existing MiniMax/MiniMax-M2.md, which already covers MiniMax-M2.5 and provides more comprehensive details such as system requirements, advanced parallelism (DP/EP), and verified benchmarks. Consider merging any unique M2.5-specific information into the existing guide instead of creating a separate file to avoid documentation fragmentation and maintenance overhead.
> ## Running MiniMax-M2.5
>
> MiniMax-M2.5 can be run on different GPU configurations. The recommended setup uses 4x H200/H20 or 4x A100/A800 GPUs with tensor parallelism.
To fulfill the goal of providing 'detailed configs for different deployments', it would be beneficial to include examples for Data Parallelism (DP) and Expert Parallelism (EP). Since pure TP8 is not supported for this model, providing the DP8+EP or TP+EP commands is crucial for users scaling beyond 4 GPUs.
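As a starting point for such an addition, here is a sketch of what a DP8 + EP deployment command could look like, extrapolated from the TP4 command in this PR. The `--data-parallel-size` and `--enable-expert-parallel` flags exist in recent vLLM releases, but their exact behavior for MiniMax-M2.5 has not been verified here:

```shell
# Hypothetical 8-GPU DP8 + expert-parallel deployment, extrapolated from
# the TP4 command in this guide; unverified for MiniMax-M2.5.
docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:nightly MiniMaxAI/MiniMax-M2.5 \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --enable-auto-tool-choice \
  --trust-remote-code
```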
```shell
docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:nightly MiniMaxAI/MiniMax-M2.5 \
  --tensor-parallel-size 4 \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --enable-auto-tool-choice \
  --trust-remote-code
```
This Docker command is identical to the one provided in the installation section (lines 19-29). For a specific 'B200 (FP8)' deployment, it should include the necessary environment variables (e.g., VLLM_USE_FLASHINFER_MOE_FP8=0) to address known compatibility issues on this hardware, as documented in the general MiniMax guide.
Suggested change:

```diff
 docker run --gpus all \
+  -e VLLM_USE_FLASHINFER_MOE_FP8=0 \
   -p 8000:8000 \
   --ipc=host \
   -v ~/.cache/huggingface:/root/.cache/huggingface \
   vllm/vllm-openai:nightly MiniMaxAI/MiniMax-M2.5 \
   --tensor-parallel-size 4 \
   --tool-call-parser minimax_m2 \
   --reasoning-parser minimax_m2_append_think \
   --enable-auto-tool-choice \
   --trust-remote-code
```
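Once either server variant is up, it can be smoke-tested through the OpenAI-compatible API. A minimal sketch, assuming the server is reachable on `localhost:8000` as in the commands above:

```shell
# Send a single chat completion to the running server; assumes the
# default port mapping (-p 8000:8000) from the docker commands above.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "MiniMaxAI/MiniMax-M2.5",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 256
      }'
```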
```text
============ Serving Benchmark Result ============
Successful requests:                     xxx
Failed requests:                         xxx
Maximum request concurrency:             xxx
Benchmark duration (s):                  xxx
Total input tokens:                      xxx
Total generated tokens:                  xxx
Request throughput (req/s):              xxx
Output token throughput (tok/s):         xxx
Peak output token throughput (tok/s):    xxx
Peak concurrent requests:                xxx
Total Token throughput (tok/s):          xxx
---------------Time to First Token----------------
Mean TTFT (ms):                          xxx
Median TTFT (ms):                        xxx
P99 TTFT (ms):                           xxx
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          xxx
Median TPOT (ms):                        xxx
P99 TPOT (ms):                           xxx
---------------Inter-token Latency----------------
Mean ITL (ms):                           xxx
Median ITL (ms):                         xxx
P99 ITL (ms):                            xxx
```
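The placeholder table above matches the output format of vLLM's serving benchmark. A sketch of a command that would produce it, assuming a server is already running on the default port (flag names follow recent vLLM; verify against the installed version):

```shell
# Random-prompt load test against a local server; the input/output
# lengths and concurrency here are illustrative, not values from this PR.
vllm bench serve \
  --model MiniMaxAI/MiniMax-M2.5 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 512 \
  --num-prompts 256 \
  --max-concurrency 64
```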
We have added this model in https://github.com/vllm-project/recipes/blob/main/MiniMax/MiniMax-M2.md
hi @kedarpotdar-nv @faradawn, for H100 we are seeing that TP4EP4 is the best recipe versus TP8EP8.
Can you add this to your PR instead of me starting another PR?
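For reference, the TP4EP4 recipe mentioned above would presumably be the TP4 command from this PR with expert parallelism enabled. A hedged sketch, unverified for H100:

```shell
# Hypothetical TP4 + EP4 deployment for H100: the TP4 command from this
# guide plus --enable-expert-parallel; not verified here.
docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:nightly MiniMaxAI/MiniMax-M2.5 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --enable-auto-tool-choice \
  --trust-remote-code
```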
Add detailed configs for different deployments