Implement mxfp4 split-k gemm #958

Open
willghatch wants to merge 3 commits into main from users/willghatch/splitk-mxfp4

Conversation

@willghatch
Contributor

@willghatch willghatch commented Feb 24, 2026

The core addition is a split-K GEMM; it is tested for (1) generation of the buffer_atomic_pk_add_bf16 instruction that we wanted to use, and (2) GEMM correctness.

Overview of changes unrelated to wave_asm:

  • remove_global_indexing in general_utils.py: Zeroes out tiling constraint starts (e.g. K_SPLIT_OFF) alongside workgroup IDs before dimension scaling, so that the subtraction of the start offset doesn't mix scaled and unscaled units (K vs K/32 for MXFP4 scales).

  • Fixing spurious bounds on split-K tiling that prevented scale vector merging: TilingConstraint.get_index_bound was conservatively generating bounds for the split-K case because sympy could not prove that ceiling(Min(K, f(wg)) / tile) * tile <= K. These bounds prevented merge_contiguous_reads from combining scalar scale reads into vector<4xi8> loads (it skips reads that already have bounds). Add _work_may_exceed_dim() to structurally detect the aligned split-K pattern and prove no overshoot, avoiding the spurious bound. (This was necessary to get scale_preshuffle to have 4x vector loads when combined with split-k.)
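The no-overshoot property behind _work_may_exceed_dim() can be sanity-checked numerically. A minimal sketch (the names `dim`, `tile`, and `overshoots` are illustrative, not from the wave code):

```python
import math

def overshoots(dim: int, tile: int, x: int) -> bool:
    # work_bound = tile * ceiling(min(dim, x) / tile); the split-K claim
    # is that this never exceeds dim when dim is tile-aligned.
    work = math.ceil(min(dim, x) / tile) * tile
    return work > dim

dim, tile = 128, 32  # tile-aligned: 128 % 32 == 0
assert all(not overshoots(dim, tile, x) for x in range(4 * dim))

# With a misaligned dim the bound really is needed:
assert overshoots(130, 32, 129)  # ceil(129/32) * 32 = 160 > 130
```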

@willghatch
Contributor Author

@harsh-nod This has splitk with preshuffle_scales functional with the 4x vector load. I've done some basic cleanup, but as mentioned there are still parts of it that I haven't fully reviewed or understood.

@willghatch willghatch force-pushed the users/willghatch/splitk-mxfp4 branch 2 times, most recently from 88b0c99 to 8de6506 on February 24, 2026 18:44
@willghatch
Contributor Author

@harsh-nod this is now rebased on top of main, which now has the wave_asm backend commit that you carved out of this one. So it should be ready to go.

Returns ``False`` (no overshoot) when we can prove that the tiled work
never exceeds the tensor dimension. In particular this handles the
split-K pattern where ``work_bound = tile * ceiling(Min(dim, f(wg)) / tile)``
and ``dim`` is tile-aligned: ``ceiling(Min(dim, x) / tile) * tile <= dim``.
Collaborator

Can you explain what is happening here?

Contributor Author

The logic is now tightened; it improves the analysis of whether a read might overshoot its bounds. By matching the tile pattern where reads are bounded by min(X, bound), we know the access stays within the bound, so we don't need to emit the bound guard rails. This allows the read merge to take effect, loading 4xi8 vectors instead of individual bytes.

@willghatch willghatch force-pushed the users/willghatch/splitk-mxfp4 branch from 6bbc407 to f02dc5b on February 25, 2026 21:10
Signed-off-by: William G Hatch <william@hatch.uno>
@willghatch willghatch force-pushed the users/willghatch/splitk-mxfp4 branch from f02dc5b to 65575f0 on February 25, 2026 23:28
Signed-off-by: William G Hatch <william@hatch.uno>
@willghatch willghatch force-pushed the users/willghatch/splitk-mxfp4 branch from 65575f0 to ca5f8e8 on February 25, 2026 23:31
)
for _p in [str(_EXAMPLES_DIR), str(_WAVE_ROOT), str(_E2E_DIR)]:
    if _p not in sys.path:
        sys.path.insert(0, _p)
Collaborator

Instead of this, could you modify the imports so we can do something like `import WaveASMCompiler, capture_wave_kernel_info`?

torch.cuda.synchronize()

bf16_eps = 2**-7
atol = num_splits * bf16_eps * max(torch_ref.abs().max().item(), 1.0)
Collaborator

Why does atol depend on num_splits? Shouldn't it be independent of num_splits?

Contributor Author

No, the number of splits increases the accumulated error: i.e. each partial result incurs error from the cast to BF16, and the partials are then accumulated in BF16, so we accumulate that error num_splits times.
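The effect can be demonstrated without the kernel by emulating the BF16 rounding of each partial sum. A hedged sketch, pure stdlib (the `to_bf16` helper is an illustrative emulation of bfloat16 rounding, not code from this PR):

```python
import struct

def to_bf16(x: float) -> float:
    """Round an fp32 value to bfloat16 precision (keep top 7 mantissa bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x8000) & 0xFFFF0000  # round to nearest bf16, then truncate
    return struct.unpack("<f", struct.pack("<I", bits))[0]

k, num_splits = 4096, 8
vals = [1.0 + (i % 7) / 7.0 for i in range(k)]  # deterministic positive data
ref = sum(vals)  # high-precision reference

# Emulate split-K: each split's partial sum is cast to BF16 and the partials
# are then accumulated in BF16, as the atomic packed BF16 add does.
chunk = k // num_splits
acc = 0.0
for s in range(num_splits):
    partial = sum(vals[s * chunk : (s + 1) * chunk])
    acc = to_bf16(acc + to_bf16(partial))

bf16_eps = 2**-7
err = abs(acc - ref)
# Each BF16 cast/add can contribute on the order of eps * |ref|, so the
# worst case scales with num_splits -- hence atol must too.
assert err <= num_splits * bf16_eps * max(abs(ref), 1.0)
```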

torch.cuda.synchronize()

bf16_eps = 2**-7
atol = num_splits * bf16_eps * max(torch_ref.abs().max().item(), 1.0)
Collaborator

same here regarding num_splits

w_scales_gpu = w_scales.cuda()
c_gpu = device_zeros(m, n, dtype=torch.bfloat16)

splitk_gemm(x_gpu, x_scales_gpu, w_t_gpu, w_scales_gpu, c_gpu)
Collaborator

Can we have the bitcast to fp16 controllable through a flag, so that we can disable the bitcast for the correctness tests?

dim_int = int(dim_bound)
tile, ceil_expr = _extract_tile_and_ceiling(work_bound)
if tile is not None and dim_int % tile == 0 and ceil_expr is not None:
    numerator = (ceil_expr.args[0] * tile).simplify()
Collaborator

Is the use of .simplify important here? Are you relying on sympy to transform this to a canonical form?

Collaborator

And if so, will this work

numer, _ = ceil_expr.args[0].as_numer_denom()
if isinstance(numer, Min) and any(a == dim_int for a in numer.args):
    return False
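For reference, a quick check of that structural approach in sympy (symbol names are illustrative; this assumes sympy keeps `Min` unevaluated for a symbolic argument, which it does here):

```python
from sympy import Min, Symbol, ceiling

x = Symbol("x", positive=True, integer=True)
dim_int, tile = 256, 32

# work_bound / tile as it appears in the split-K case:
expr = ceiling(Min(dim_int, x) / tile)

# Peel off the denominator; the Min node survives structurally, so no
# reliance on simplify() reaching a canonical form is needed.
numer, denom = expr.args[0].as_numer_denom()
assert denom == tile
assert isinstance(numer, Min)
assert any(a == dim_int for a in numer.args)
```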
