Scaled fp8 mfma gfx950 #2246
Open
stefankoncarevic wants to merge 4 commits into mfma-enable-kpack-values-gfx950
Pull request overview
This PR adds support for scaled FP8 MFMA instructions (32x32x64 and 16x16x128) on the gfx950 architecture. These instructions use OCP FP8 types (f8E4M3FN, f8E5M2) with implicit scale factors and share the same hardware instructions as FP4 scaled MFMAs, but with different configuration parameters (cbsz=0, blgp=0). The PR depends on #2242, which relaxes the isCoherentWithK validation to allow kpack < k_base for double-buffer pipelines, enabling configurations such as kpack=4 with k_base=32.
Changes:
- Added 4 new MfmaTypeId enum values for scaled FP8 type combinations (Fp8Fp8ScaledTyId, Fp8Bf8ScaledTyId, Bf8Fp8ScaledTyId, Bf8Bf8ScaledTyId)
- Implemented scaled FP8 MFMA selection logic in MfmaInsnGroup that tries scaled FP8 MFMAs first when kPerBlock is large enough
- Updated AccelEmitter to generate scaled MFMA operations with neutral scale values for FP8 types without explicit scale buffers
- Added 9 tests covering both MFMA sizes (16x16x128, 32x32x64), schedule versions 1 through 4, the four FP8 type combinations, and kpack values 1, 4, 8, and 32
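As an illustration of the four new type IDs, here is a minimal C++ sketch of how an operand type pair might map to one of them. Only the four enumerator names come from this PR. The `Fp8Kind` enum and the `scaledTypeIdFor` function are hypothetical, and the sketch assumes that "Fp8" denotes f8E4M3FN and "Bf8" denotes f8E5M2.

```cpp
// Hypothetical stand-in for the two OCP FP8 element types named in the PR.
enum class Fp8Kind { E4M3FN, E5M2 };

// Enumerator names are taken from the PR; the enum here is a reduced copy
// for illustration, not the real MfmaTypeId definition.
enum class MfmaTypeId {
  Fp8Fp8ScaledTyId,
  Fp8Bf8ScaledTyId,
  Bf8Fp8ScaledTyId,
  Bf8Bf8ScaledTyId
};

// Map the (A, B) operand element types to a scaled FP8 type ID.
// Assumption: "Fp8" means f8E4M3FN and "Bf8" means f8E5M2.
MfmaTypeId scaledTypeIdFor(Fp8Kind a, Fp8Kind b) {
  if (a == Fp8Kind::E4M3FN) {
    return b == Fp8Kind::E4M3FN ? MfmaTypeId::Fp8Fp8ScaledTyId
                                : MfmaTypeId::Fp8Bf8ScaledTyId;
  }
  return b == Fp8Kind::E4M3FN ? MfmaTypeId::Bf8Fp8ScaledTyId
                              : MfmaTypeId::Bf8Bf8ScaledTyId;
}
```

The real selection code in MfmaInsnGroup.cpp works on MLIR types rather than a two-value enum, so this only shows the shape of the mapping.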
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| mlir/test/Dialect/Rock/lowering_xdlops_gemm.mlir | Added 9 tests for scaled FP8 MFMA operations covering single/double-buffer pipelines and various kpack configurations |
| mlir/include/mlir/Dialect/Rock/IR/MfmaInsnGroup.h | Added 4 new enum values for scaled FP8 type IDs and isScaledFp8() method declaration |
| mlir/lib/Dialect/Rock/IR/MfmaInsnGroup.cpp | Implemented scaled FP8 MFMA instruction mapping, selection logic in selectForGfx950(), and isScaledFp8() method |
| mlir/lib/Dialect/Rock/utility/AccelEmitter.cpp | Added logic to emit scaled MFMA operations with neutral scale values for FP8 types without explicit scale buffers |
Force-pushed from eb61882 to f93c548.
16x16x128) on gfx950 architecture. These tests cover:
- Single buffering (scheduleVersion 1, 3) with kpack=32 and kpack=1
- Double buffering (scheduleVersion 2, 4) with kpack=32
- Double buffering with kpack < k_base (kpack=1, 4, 8)
- All FP8 type combinations: FP8×FP8, BF8×BF8, FP8×BF8, BF8×FP8

The tests verify that amdgpu.scaled_mfma operations are correctly generated for OCP FP8 types (f8E4M3FN, f8E5M2) with implicit scale factors.
- Remove duplicate entries in getMfmaInsnInfoMap
- Clarify neutral scale creation comment in AccelEmitter.cpp
- Rename zeroAttr to neutralScaleAttr for clarity
Force-pushed from f93c548 to ff1d1c9.
Resolves: https://amd-hub.atlassian.net/browse/AIROCMLIR-477
Motivation
This PR adds support for scaled FP8 MFMA instructions (32x32x64 and 16x16x128) on gfx950 architecture. The scaled FP8 MFMAs use OCP FP8 types (f8E4M3FN, f8E5M2) with implicit scale factors and provide improved performance for 8-bit floating-point matrix operations.
PR #2242 relaxes the isCoherentWithK validation to allow kpack < k_base for double-buffer pipelines (scheduleVersion 2 or 4), which is required for some of the test configurations in this PR to work correctly.
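The condition under which a configuration depends on that relaxation can be written as a small predicate. This is a sketch only: `isDoubleBuffered` and `reliesOnKpackRelaxation` are hypothetical names, and the real isCoherentWithK has its own signature and performs other checks that are not modelled here.

```cpp
// Per the description above, scheduleVersion 2 and 4 are the
// double-buffer pipelines.
bool isDoubleBuffered(int scheduleVersion) {
  return scheduleVersion == 2 || scheduleVersion == 4;
}

// Returns true when a configuration uses kpack < k_base on a double-buffer
// pipeline, i.e. the case that the relaxation in #2242 is described as
// allowing. Hypothetical helper, not part of rocMLIR.
bool reliesOnKpackRelaxation(int kpack, int kBase, int scheduleVersion) {
  return kpack < kBase && isDoubleBuffered(scheduleVersion);
}
```

For example, kpack=4 with k_base=32 on scheduleVersion 2 falls into this case, while kpack=32 with k_base=32 does not.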
Technical Details
Scaled FP8 MFMA Instructions on gfx950
The gfx950 architecture introduces scaled MFMA instructions for OCP FP8 types (f8E4M3FN, f8E5M2) in two sizes: 32x32x64 and 16x16x128.
These instructions differ from native FP8 MFMAs (32x32x16 with k_base=8) by using implicit scale factors. The compiler generates amdgpu.scaled_mfma operations with constant scale values.
MFMA Selection Logic
The MfmaInsnGroup::select function in MfmaInsnGroup.cpp tries scaled FP8 MFMAs first on gfx950 when the operands are OCP FP8 types (f8E4M3FN, f8E5M2) and kPerBlock is large enough.
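A "try scaled first, otherwise fall back" selection could look roughly like the C++ sketch below. The `MfmaChoice` struct, the `pickFp8Mfma` function, and the `minKForScaled` parameter are hypothetical. The actual threshold and the choice between 32x32x64 and 16x16x128 live in MfmaInsnGroup.cpp and are not reproduced here.

```cpp
// Hypothetical description of a selected instruction: tile shape plus
// whether it is the scaled variant.
struct MfmaChoice {
  int m;
  int n;
  int k;
  bool scaled;
};

// Prefer the scaled 32x32x64 FP8 MFMA when enough K is available,
// otherwise fall back to the native 32x32x16 FP8 MFMA.
// Assumption: the caller supplies minKForScaled; the real condition on
// kPerBlock is not spelled out in this description.
MfmaChoice pickFp8Mfma(int kAvailable, int minKForScaled) {
  if (kAvailable >= minKForScaled) {
    return {32, 32, 64, true};
  }
  return {32, 32, 16, false};
}
```

Keeping the threshold as a parameter avoids hard-coding a value that this description does not state.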
Test Plan
Added 9 tests to mlir/test/Dialect/Rock/lowering_xdlops_gemm.mlir covering both MFMA sizes, single- and double-buffer scheduleVersion values, the four FP8 type combinations, and several kpack values.
All tests pass.
Test Result
Submission Checklist