perf(MoE): Use TE quant/dequant for SwiGLU fp8 input store to improve performance and stability#1753

Draft
xiaoxi-wangfj wants to merge 3 commits into NVIDIA:main from 021ai:optimize-swiglu-input-fp8-quant

Conversation

@xiaoxi-wangfj
Contributor

Description

Replace native .to(fp8) casting in SwiGLU with Transformer Engine quant/dequant interfaces for storing activation inputs in FP8.

Benefits:

  1. Higher performance – for the combined quant+dequant operation, the dynamic quantization and dequantization path in transformer_engine provides a 1.48x to 1.66x speedup over the native method.
  2. Better numerical stability – the dynamic quantization and dequantization path in transformer_engine handles extreme values more gracefully. A native .to(fp8) cast can underflow very small values to 0 or overflow large values to inf, while TE's dynamic scaling reduces both issues.
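The stability claim in point 2 can be illustrated with a small, self-contained sketch. The rounding function below is a simplified software simulation of the FP8 E4M3 format, not TE's actual implementation, and all function names here are hypothetical; it shows how per-tensor dynamic scaling (scale = fp8_max / amax) rescues small values that a direct cast would flush to zero:

```python
import math

E4M3_MAX = 448.0            # largest normal E4M3 value
E4M3_MIN_NORMAL = 2.0 ** -6
E4M3_MIN_SUBNORMAL = 2.0 ** -9

def fp8_e4m3_round(x: float) -> float:
    """Crude round-to-nearest simulation of FP8 E4M3 (saturating, simplified)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = abs(x)
    if a > E4M3_MAX:
        return sign * E4M3_MAX                     # saturate instead of overflowing
    if a < E4M3_MIN_NORMAL:
        # subnormal range: fixed step of 2^-9; anything below half a step -> 0
        return sign * round(a / E4M3_MIN_SUBNORMAL) * E4M3_MIN_SUBNORMAL
    m, e = math.frexp(a)                           # a = m * 2**e, m in [0.5, 1)
    m_q = round(m * 16) / 16                       # keep 1+3 significant mantissa bits
    return sign * math.ldexp(m_q, e)

def store_native(x):
    """Analogue of a direct .to(fp8) cast: quantize with no rescaling."""
    return [fp8_e4m3_round(v) for v in x], 1.0

def store_dynamic(x):
    """Dynamic per-tensor scaling: map amax onto the FP8 range before rounding."""
    amax = max(abs(v) for v in x) or 1.0
    scale = E4M3_MAX / amax
    return [fp8_e4m3_round(v * scale) for v in x], scale

def dequant(q, scale):
    return [v / scale for v in q]
```

For a tensor of small activations such as `[1e-4, 2e-4]`, `store_native` flushes both values to 0.0 (they sit below the smallest E4M3 subnormal), while `store_dynamic` followed by `dequant` recovers them to within floating-point noise. TE's real quantization interfaces add amax history, fused kernels, and proper E4M3 semantics on top of this basic idea.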

@copy-pr-bot

copy-pr-bot bot commented Aug 19, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Member

@ksivaman ksivaman left a comment


LGTM, except a small nit: avoid importing the internal API, since the exact structure of the files might change or move around in TE.

@xiaoxi-wangfj xiaoxi-wangfj requested review from a team as code owners December 31, 2025 01:45
… performance and stability

Co-authored-by: xiaoxi-wangfj <690912414@qq.com>
Co-authored-by: pumpkinsm <123sssmmm@gmail.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
@xiaoxi-wangfj xiaoxi-wangfj force-pushed the optimize-swiglu-input-fp8-quant branch from ccdbe18 to cd7f9fe Compare December 31, 2025 02:07
@xiaoxi-wangfj
Contributor Author

@ksivaman Thanks for the review! I’ve addressed all the comments and updated the PR. Could you please re-review when you get a chance?

@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Jan 11, 2026
@Phlip79
Member

Phlip79 commented Mar 4, 2026

We are changing our review process and marking all open, unlabeled PRs as draft. This change will go into effect once #3659 is merged.

Moving forward, all PRs will be required to start as draft PRs. If you wish to get your PR merged, mark your PR as “Ready for review”. Read more about the new process at submit.md.

@Phlip79 Phlip79 marked this pull request as draft March 4, 2026 23:05
@Phlip79 Phlip79 removed the needs-follow-up Issue needs follow-up label Mar 4, 2026
@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Mar 5, 2026
