Conversation

abrown commented Jan 26, 2026

This change replaces the `llvm.fmul` and `llvm.fadd` instructions with the fused `llvm.fma` operation. This should have no downstream impact on the emitted machine code which, due to auto-vectorization and other LLVM magic, already ends up using `VFMADD213PS`.

What _is_ unclear about this change is that we materialize some fastmath flags from thin air: it seems like we should be able to configure this somewhere at the user level (TODO).
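
For illustration, here is a minimal sketch of what the fused lowering could look like, assuming the MLIR LLVM dialect's `FMAOp` and the variable names (`tgtTy`, `aElem`, `bElem`, `accum`) from the surrounding lowering code; the exact `FMAOp::create` overload that accepts the fastmath attribute is an assumption, not the verbatim code from this PR:

```cpp
#include "mlir/Dialect/LLVMIR/LLVMDialect.h"
#include "mlir/IR/Builders.h"

// Sketch only: fuse the multiply/accumulate pair into a single llvm.intr.fma.
static mlir::Value emitFusedMulAdd(mlir::OpBuilder &builder, mlir::Location loc,
                                   mlir::Type tgtTy, mlir::Value aElem,
                                   mlir::Value bElem, mlir::Value accum) {
  // Fastmath flags are materialized here; ideally they would come from a
  // user-level option rather than being hardcoded (the TODO above).
  auto flags = mlir::LLVM::FastmathFlagsAttr::get(
      builder.getContext(), mlir::LLVM::FastmathFlags::fast);
  // accum = aElem * bElem + accum as one fused op, replacing the separate
  // FMulOp + FAddOp pair. The overload taking `flags` is assumed.
  return mlir::LLVM::FMAOp::create(builder, loc, tgtTy, aElem, bElem, accum,
                                   flags);
}
```

Making the fusion explicit at the IR level avoids relying on LLVM's auto-vectorizer to form the FMA, even though the emitted machine code ends up the same.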

abrown commented Jan 26, 2026

This is a draft for now until we can discuss what to do about the fastmath flags.

abrown requested a review from alexbaden January 27, 2026 21:31
// Multiply and accumulate.
auto mul = LLVM::FMulOp::create(builder, loc, tgtTy, aElem, bElem);
accum = LLVM::FAddOp::create(builder, loc, tgtTy, accum, mul);
auto flags = LLVM::FastmathFlagsAttr::get(builder.getContext(),
`tl.dot_scaled` has a fast math flag, but Triton typically prefers fast math to be off.
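
A hedged sketch of one way to avoid hardcoding the flags (`allowFastMath` is a hypothetical option, not existing code in Triton or this PR):

```cpp
// Hypothetical: derive the fastmath attribute from an op attribute or a
// lowering option instead of materializing `fast` unconditionally.
bool allowFastMath = false; // e.g. read from the op or a pass option
auto flags = mlir::LLVM::FastmathFlagsAttr::get(
    builder.getContext(), allowFastMath ? mlir::LLVM::FastmathFlags::fast
                                        : mlir::LLVM::FastmathFlags::none);
```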

abrown marked this pull request as ready for review January 29, 2026 17:24

abrown commented Jan 29, 2026

This has no effect on performance. I still see `vfmadd231ps` being used in the emitted machine code, and my benchmarking infrastructure shows the same results as before:

$ for i in {1..5}; do python bench-triton/matmul.py --size 1024 --device cpu --provider triton --block-size-m 8 --block-size-n 256 --block-size-k 1; done
0.4107300620226039
0.41122981901199657
0.41164295082399166
0.4109849791469111
0.41048660226802114

abrown requested a review from alexbaden January 29, 2026 18:38