Scaled fp8 mfma gfx950#2246

Open
stefankoncarevic wants to merge 4 commits into mfma-enable-kpack-values-gfx950 from scaled-fp8-mfma-gfx950

@stefankoncarevic
Contributor

⚠️ This PR depends on #2242 and should not be merged before that.
Resolves: https://amd-hub.atlassian.net/browse/AIROCMLIR-477

Motivation

This PR adds support for scaled FP8 MFMA instructions (32x32x64 and 16x16x128) on the gfx950 architecture. The scaled FP8 MFMAs use OCP FP8 types (f8E4M3FN, f8E5M2) with implicit scale factors and provide improved performance for 8-bit floating-point matrix operations.
PR #2242 relaxes the isCoherentWithK validation to allow kpack < k_base for double-buffer pipelines (scheduleVersion 2 or 4), which is required for some of the test configurations in this PR to work correctly.

Technical Details

Scaled FP8 MFMA Instructions on gfx950
The gfx950 architecture introduces scaled MFMA instructions for OCP FP8 types (f8E4M3FN, f8E5M2):

  • 32x32x64 MFMA: M=32, N=32, K=64, k_base=32, output vector<16xf32>
  • 16x16x128 MFMA: M=16, N=16, K=128, k_base=32, output vector<4xf32>

These instructions differ from native FP8 MFMAs (32x32x16 with k_base=8) by using implicit scale factors. The compiler generates amdgpu.scaled_mfma operations with constant scale values.
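
The shape parameters listed above can be cross-checked with a small sketch. It assumes a 64-lane wavefront and uses a hypothetical `MfmaShape` struct with illustrative helper names (none of this is rocMLIR's actual data structure or API). Under that assumption, M*N/64 reproduces the vector<16xf32> and vector<4xf32> output types, and (64/M)*k_base reproduces K for both scaled shapes and for the native 32x32x16 shape.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical descriptor for illustration; not the rocMLIR data structure.
struct MfmaShape {
  int64_t m;
  int64_t n;
  int64_t k;
  int64_t kBase; // k_base as quoted in the PR description
};

// Assumption: a 64-lane wavefront.
constexpr int64_t kWaveSize = 64;

// Shapes quoted in the PR description.
constexpr MfmaShape kScaled32x32x64{32, 32, 64, 32};
constexpr MfmaShape kScaled16x16x128{16, 16, 128, 32};
constexpr MfmaShape kNative32x32x16{32, 32, 16, 8};

// Number of f32 accumulator elements each lane holds for one M x N tile.
inline int64_t accElemsPerLane(const MfmaShape &s) {
  assert((s.m * s.n) % kWaveSize == 0 && "tile must divide evenly across lanes");
  return (s.m * s.n) / kWaveSize;
}

// K extent implied when the lanes are spread over M rows and each lane
// contributes kBase elements along K.
inline int64_t impliedK(const MfmaShape &s) {
  assert(kWaveSize % s.m == 0 && "M must divide the wave size");
  return (kWaveSize / s.m) * s.kBase;
}
```

For example, `accElemsPerLane(kScaled16x16x128)` gives 4, matching the vector<4xf32> output listed for the 16x16x128 instruction.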

MFMA Selection Logic
The MfmaInsnGroup::select function in MfmaInsnGroup.cpp selects scaled FP8 MFMAs when:

  1. Architecture is gfx950
  2. Input types are OCP FP8 (f8E4M3FN or f8E5M2)
  3. isCoherentWithK validation passes for the given kpack, kpackPerBlock, and scheduleVersion
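
The three conditions above amount to a simple conjunction, sketched below. This is not the code from MfmaInsnGroup.cpp: `ElemType`, `SelectQuery`, and `wouldSelectScaledFp8` are hypothetical names, and the outcome of isCoherentWithK is passed in as a plain boolean because its internal rules are not described here.

```cpp
#include <string>

// Element types relevant to this sketch; names mirror the MLIR type names.
enum class ElemType { F16, F8E4M3FN, F8E5M2 };

// True for the two OCP FP8 types named in the PR.
inline bool isOcpFp8(ElemType t) {
  return t == ElemType::F8E4M3FN || t == ElemType::F8E5M2;
}

// Hypothetical query object. `coherentWithK` stands in for the result of the
// isCoherentWithK check for a given kpack, kpackPerBlock, and scheduleVersion.
struct SelectQuery {
  std::string arch;
  ElemType a;
  ElemType b;
  bool coherentWithK;
};

// All three conditions from the list must hold.
inline bool wouldSelectScaledFp8(const SelectQuery &q) {
  return q.arch == "gfx950" && isOcpFp8(q.a) && isOcpFp8(q.b) &&
         q.coherentWithK;
}
```

Keeping the coherence result as an input keeps the sketch honest: it only shows how the conditions combine, not how kpack and scheduleVersion are validated.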

Test Plan

Added 9 tests to mlir/test/Dialect/Rock/lowering_xdlops_gemm.mlir covering all combinations of MFMA sizes, scheduleVersion values, FP8 type combinations, and kpack values.
All tests pass.

Test Result

Submission Checklist

Contributor

Copilot AI left a comment

Pull request overview

This PR adds support for scaled FP8 MFMA instructions (32x32x64 and 16x16x128) on the gfx950 architecture. These instructions use OCP FP8 types (f8E4M3FN, f8E5M2) with implicit scale factors, sharing the same hardware instructions as FP4 scaled MFMAs but with different configuration parameters (cbsz=0, blgp=0). The PR depends on #2242, which relaxes isCoherentWithK validation to allow kpack < k_base for double-buffer pipelines, enabling configurations such as kpack=4 with k_base=32.

Changes:

  • Added 4 new MfmaTypeId enum values for scaled FP8 type combinations (Fp8Fp8ScaledTyId, Fp8Bf8ScaledTyId, Bf8Fp8ScaledTyId, Bf8Bf8ScaledTyId)
  • Implemented scaled FP8 MFMA selection logic in MfmaInsnGroup that tries scaled FP8 MFMAs first when kPerBlock is large enough
  • Updated AccelEmitter to generate scaled MFMA operations with neutral scale values for FP8 types without explicit scale buffers
  • Added 9 comprehensive tests covering all combinations of MFMA sizes (16x16x128, 32x32x64), schedule versions (1, 2, 3, 4), FP8 type combinations, and kpack values (1, 4, 8, 32)
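
As an illustration of how the four type-ID names in the first bullet could map onto operand types, here is a minimal sketch. It assumes that "Fp8" denotes f8E4M3FN, that "Bf8" denotes f8E5M2, and that the first component names operand A; `Fp8Type`, `ScaledTypeId`, and `scaledTypeIdFor` are hypothetical stand-ins, not the actual MfmaTypeId enum or any function in the PR.

```cpp
// The two OCP FP8 element types handled by the scaled path.
enum class Fp8Type { F8E4M3FN, F8E5M2 };

// Hypothetical mirror of the four enum values named in the review summary.
enum class ScaledTypeId {
  Fp8Fp8ScaledTyId,
  Fp8Bf8ScaledTyId,
  Bf8Fp8ScaledTyId,
  Bf8Bf8ScaledTyId
};

// Assumption: "Bf8" means f8E5M2 and "Fp8" means f8E4M3FN, with operand A
// first and operand B second in the enumerator name.
inline ScaledTypeId scaledTypeIdFor(Fp8Type a, Fp8Type b) {
  const bool aIsBf8 = (a == Fp8Type::F8E5M2);
  const bool bIsBf8 = (b == Fp8Type::F8E5M2);
  if (!aIsBf8 && !bIsBf8)
    return ScaledTypeId::Fp8Fp8ScaledTyId;
  if (!aIsBf8 && bIsBf8)
    return ScaledTypeId::Fp8Bf8ScaledTyId;
  if (aIsBf8 && !bIsBf8)
    return ScaledTypeId::Bf8Fp8ScaledTyId;
  return ScaledTypeId::Bf8Bf8ScaledTyId;
}
```

Under these assumptions, an f8E4M3FN A operand with an f8E5M2 B operand maps to `Fp8Bf8ScaledTyId`.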

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| mlir/test/Dialect/Rock/lowering_xdlops_gemm.mlir | Added 9 tests for scaled FP8 MFMA operations covering single/double-buffer pipelines and various kpack configurations |
| mlir/include/mlir/Dialect/Rock/IR/MfmaInsnGroup.h | Added 4 new enum values for scaled FP8 type IDs and isScaledFp8() method declaration |
| mlir/lib/Dialect/Rock/IR/MfmaInsnGroup.cpp | Implemented scaled FP8 MFMA instruction mapping, selection logic in selectForGfx950(), and isScaledFp8() method |
| mlir/lib/Dialect/Rock/utility/AccelEmitter.cpp | Added logic to emit scaled MFMA operations with neutral scale values for FP8 types without explicit scale buffers |

@stefankoncarevic stefankoncarevic force-pushed the scaled-fp8-mfma-gfx950 branch 2 times, most recently from eb61882 to f93c548 Compare February 24, 2026 08:47
16x16x128) on gfx950 architecture. These tests cover:

- Single buffering (scheduleVersion 1, 3) with kpack=32 and kpack=1
- Double buffering (scheduleVersion 2, 4) with kpack=32
- Double buffering with kpack < k_base (kpack=1, 4, 8)
- All FP8 type combinations: FP8×FP8, BF8×BF8, FP8×BF8, BF8×FP8

The tests verify that amdgpu.scaled_mfma operations are correctly
generated for OCP FP8 types (f8E4M3FN, f8E5M2) with implicit scale
factors.
- Remove duplicate entries in getMfmaInsnInfoMap
- Clarify neutral scale creation comment in AccelEmitter.cpp
- Rename zeroAttr to neutralScaleAttr for clarity