Scaled fp8 mfma gfx950 by stefankoncarevic · Pull Request #2246 · ROCm/rocMLIR

stefankoncarevic · 2026-02-20T14:02:17Z

⚠️ This PR depends on #2242 and should not be merged before that.
Resolves: https://amd-hub.atlassian.net/browse/AIROCMLIR-477

Motivation

This PR adds support for scaled FP8 MFMA instructions (32x32x64 and 16x16x128) on the gfx950 architecture. The scaled FP8 MFMAs operate on OCP FP8 types (f8E4M3FN, f8E5M2) with implicit scale factors and provide improved performance for 8-bit floating-point matrix operations.
PR #2242 relaxes the isCoherentWithK validation to allow kpack < k_base for double-buffer pipelines (scheduleVersion 2 or 4), which is required for some of the test configurations in this PR to work correctly.

Technical Details

Scaled FP8 MFMA Instructions on gfx950
The gfx950 architecture introduces scaled MFMA instructions for OCP FP8 types (f8E4M3FN, f8E5M2):

32x32x64 MFMA: M=32, N=32, K=64, k_base=32, output vector<16xf32>
16x16x128 MFMA: M=16, N=16, K=128, k_base=32, output vector<4xf32>

These instructions differ from native FP8 MFMAs (32x32x16 with k_base=8) by using implicit scale factors. The compiler generates amdgpu.scaled_mfma operations with constant scale values.
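As a quick sanity check on the numbers above, the following sketch recomputes the output vector lengths and K extents from the instruction shapes. The `MfmaShape` struct and the helper names are hypothetical and are not rocMLIR types; the wave size of 64 is an assumption, and the "lane groups along K" reading is only one interpretation that happens to reproduce the listed values.

```cpp
#include <cassert>
#include <cstdint>

// Shape parameters copied from the PR description. The struct is
// hypothetical and is not rocMLIR's internal MFMA representation.
struct MfmaShape {
  int64_t m;
  int64_t n;
  int64_t k;
  int64_t kBase;
};

// Assumption: MFMA on gfx950 runs in wave64 mode.
constexpr int64_t kWaveSize = 64;

constexpr MfmaShape kScaled32x32x64{32, 32, 64, 32};
constexpr MfmaShape kScaled16x16x128{16, 16, 128, 32};
constexpr MfmaShape kNative32x32x16{32, 32, 16, 8};

// Number of f32 accumulator elements each lane holds for one M x N tile.
constexpr int64_t accumulatorsPerLane(const MfmaShape &s) {
  return s.m * s.n / kWaveSize;
}

// K extent if (waveSize / M) lane groups each supply kBase elements.
constexpr int64_t kFromLaneLayout(const MfmaShape &s) {
  return (kWaveSize / s.m) * s.kBase;
}

// Number of MFMA issues needed to cover kElements along K (rounded up).
constexpr int64_t mfmasAlongK(const MfmaShape &s, int64_t kElements) {
  return (kElements + s.k - 1) / s.k;
}

// The derived values match vector<16xf32>, vector<4xf32>, and the K sizes
// quoted in the description.
static_assert(accumulatorsPerLane(kScaled32x32x64) == 16, "32x32 tile");
static_assert(accumulatorsPerLane(kScaled16x16x128) == 4, "16x16 tile");
static_assert(kFromLaneLayout(kScaled32x32x64) == 64, "K of 32x32x64");
static_assert(kFromLaneLayout(kScaled16x16x128) == 128, "K of 16x16x128");
static_assert(kFromLaneLayout(kNative32x32x16) == 16, "K of native 32x32x16");
```

Under these assumptions, covering 128 elements along K takes `mfmasAlongK(kScaled32x32x64, 128)` scaled issues, compared with `mfmasAlongK(kNative32x32x16, 128)` native ones.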

MFMA Selection Logic
The MfmaInsnGroup::select function in MfmaInsnGroup.cpp selects scaled FP8 MFMAs when:

Architecture is gfx950
Input types are OCP FP8 (f8E4M3FN or f8E5M2)
isCoherentWithK validation passes for the given kpack, kpackPerBlock, and scheduleVersion
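The three conditions above can be sketched as a single predicate. This is only an illustration: `ElemType`, `isOcpFp8`, and `shouldSelectScaledFp8` are hypothetical names, the exact-match comparison on the architecture string is an assumption, and the result of isCoherentWithK is passed in as a plain boolean because its actual rules (including the relaxation from #2242) are not reproduced here.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the MLIR element types involved.
enum class ElemType { F8E4M3FN, F8E5M2, F16, F32 };

// True for the two OCP FP8 types named in the PR description.
bool isOcpFp8(ElemType t) {
  return t == ElemType::F8E4M3FN || t == ElemType::F8E5M2;
}

// Sketch of the selection conditions. `coherentWithK` stands for the
// outcome of isCoherentWithK for the given kpack, kpackPerBlock, and
// scheduleVersion; that check is computed elsewhere and not modeled here.
bool shouldSelectScaledFp8(const std::string &arch, ElemType typeA,
                           ElemType typeB, bool coherentWithK) {
  return arch == "gfx950" && isOcpFp8(typeA) && isOcpFp8(typeB) &&
         coherentWithK;
}
```

For example, `shouldSelectScaledFp8("gfx950", ElemType::F8E4M3FN, ElemType::F8E5M2, true)` holds, while any other architecture, a non-FP8 operand, or a failed coherence check falls through to whatever selection path applies otherwise.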

Test Plan

Added 9 tests to mlir/test/Dialect/Rock/lowering_xdlops_gemm.mlir covering all combinations of MFMA sizes, scheduleVersion values, FP8 type combinations, and kpack values.
All tests pass

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull request overview

This PR adds support for scaled FP8 MFMA instructions (32x32x64 and 16x16x128) on the gfx950 architecture. These instructions use OCP FP8 types (f8E4M3FN, f8E5M2) with implicit scale factors, sharing the same hardware instructions as FP4 scaled MFMAs but with different configuration parameters (cbsz=0, blgp=0). The PR depends on #2242 which relaxes isCoherentWithK validation to allow kpack < k_base for double-buffer pipelines, enabling configurations like kpack=4 with k_base=32.

Changes:

Added 4 new MfmaTypeId enum values for scaled FP8 type combinations (Fp8Fp8ScaledTyId, Fp8Bf8ScaledTyId, Bf8Fp8ScaledTyId, Bf8Bf8ScaledTyId)
Implemented scaled FP8 MFMA selection logic in MfmaInsnGroup that tries scaled FP8 MFMAs first when kPerBlock is large enough
Updated AccelEmitter to generate scaled MFMA operations with neutral scale values for FP8 types without explicit scale buffers
Added 9 comprehensive tests covering all combinations of MFMA sizes (16x16x128, 32x32x64), schedule versions (1, 2, 3, 4), FP8 type combinations, and kpack values (1, 4, 8, 32)
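To illustrate how the four new type IDs could correspond to operand type pairs, here is a minimal sketch. It assumes the usual AMD naming in which "Fp8" denotes f8E4M3FN and "Bf8" denotes f8E5M2, and that the first half of each enum name refers to operand A and the second to operand B; neither assumption is confirmed by this page. The enum and function below are a standalone model, not the definitions in MfmaInsnGroup.h.

```cpp
#include <cassert>

// Assumed naming: Fp8 = f8E4M3FN, Bf8 = f8E5M2.
enum class Fp8Kind { Fp8, Bf8 };

// Enumerator names taken from the change summary; the enum itself is a
// standalone model and may not match the real declaration order or values.
enum class MfmaTypeId {
  Fp8Fp8ScaledTyId,
  Fp8Bf8ScaledTyId,
  Bf8Fp8ScaledTyId,
  Bf8Bf8ScaledTyId
};

// Maps an (A, B) operand pair to a type ID, assuming A-then-B naming.
MfmaTypeId scaledTypeIdFor(Fp8Kind a, Fp8Kind b) {
  if (a == Fp8Kind::Fp8)
    return b == Fp8Kind::Fp8 ? MfmaTypeId::Fp8Fp8ScaledTyId
                             : MfmaTypeId::Fp8Bf8ScaledTyId;
  return b == Fp8Kind::Fp8 ? MfmaTypeId::Bf8Fp8ScaledTyId
                           : MfmaTypeId::Bf8Bf8ScaledTyId;
}
```

Keeping one ID per ordered pair means mixed combinations such as FP8×BF8 and BF8×FP8 stay distinguishable, which matches the four type combinations the tests are said to cover.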

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
mlir/test/Dialect/Rock/lowering_xdlops_gemm.mlir	Added 9 tests for scaled FP8 MFMA operations covering single/double-buffer pipelines and various kpack configurations
mlir/include/mlir/Dialect/Rock/IR/MfmaInsnGroup.h	Added 4 new enum values for scaled FP8 type IDs and isScaledFp8() method declaration
mlir/lib/Dialect/Rock/IR/MfmaInsnGroup.cpp	Implemented scaled FP8 MFMA instruction mapping, selection logic in selectForGfx950(), and isScaledFp8() method
mlir/lib/Dialect/Rock/utility/AccelEmitter.cpp	Added logic to emit scaled MFMA operations with neutral scale values for FP8 types without explicit scale buffers

16x16x128) on gfx950 architecture. These tests cover:
- Single buffering (scheduleVersion 1, 3) with kpack=32 and kpack=1
- Double buffering (scheduleVersion 2, 4) with kpack=32
- Double buffering with kpack < k_base (kpack=1, 4, 8)
- All FP8 type combinations: FP8×FP8, BF8×BF8, FP8×BF8, BF8×FP8

The tests verify that amdgpu.scaled_mfma operations are correctly generated for OCP FP8 types (f8E4M3FN, f8E5M2) with implicit scale factors.

stefankoncarevic requested a review from causten as a code owner February 20, 2026 14:02

stefankoncarevic requested review from dhernandez0, djramic, justinrosner, pabloantoniom and umangyadav February 20, 2026 14:03

umangyadav requested a review from Copilot February 20, 2026 18:30

Copilot started reviewing on behalf of umangyadav February 20, 2026 18:32

Copilot AI reviewed Feb 20, 2026

mlir/lib/Dialect/Rock/IR/MfmaInsnGroup.cpp (outdated, resolved)

mlir/lib/Dialect/Rock/utility/AccelEmitter.cpp (outdated, resolved)

stefankoncarevic force-pushed the scaled-fp8-mfma-gfx950 branch 2 times, most recently from eb61882 to f93c548 Compare February 24, 2026 08:47

stefankoncarevic added 4 commits February 24, 2026 05:56

WIP: Scaled FP8 MFMA support

0a3d564

Clean up scaled FP8 MFMA code based on review feedback

d9b4eac

- Remove duplicate entries in getMfmaInsnInfoMap
- Clarify neutral scale creation comment in AccelEmitter.cpp
- Rename zeroAttr to neutralScaleAttr for clarity

Clang format

ff1d1c9

stefankoncarevic force-pushed the scaled-fp8-mfma-gfx950 branch from f93c548 to ff1d1c9 Compare February 24, 2026 11:56

stefankoncarevic wants to merge 4 commits into mfma-enable-kpack-values-gfx950 from scaled-fp8-mfma-gfx950
