[Navi] Add support with Infinity Cache (LLC) awareness for improved performance #169

Closed

tianwyan wants to merge 259 commits into main from tianwyan/navi_experiment

Conversation

@tianwyan

Motivation

This PR enables Flash Attention Triton support for AMD RDNA3 (Navi) GPUs, specifically targeting the gfx1100 architecture. The goal is to bring Flash Attention performance optimizations to consumer-grade AMD GPUs while leveraging the unique Infinity Cache (LLC) architecture for improved memory throughput.

Technical Details

New Architecture Support:

  • Added gfx1100 (RDNA3/Navi 31) to the supported GPU architectures in the Triton Flash Attention backend
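
A minimal sketch of how that gating can look, assuming a ROCm build of PyTorch that exposes `gcnArchName`; the helper names below are illustrative, not the backend's actual API:

```python
# Hypothetical helpers for gating the Navi3 (gfx1100) code path.
import torch

_NAVI3_ARCHS = ("gfx1100",)  # RDNA3 / Navi 31


def _gfx_arch() -> str:
    # gcnArchName is reported by ROCm builds of PyTorch,
    # e.g. "gfx1100" or "gfx90a:sramecc+:xnack-".
    return torch.cuda.get_device_properties(0).gcnArchName.split(":")[0]


def is_navi3() -> bool:
    return _gfx_arch() in _NAVI3_ARCHS
```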

Performance Optimizations:

  • Implemented Infinity Cache (LLC) awareness to optimize memory access patterns and reduce DRAM bandwidth pressure
  • Enabled the exp2 instruction by default for faster exponential calculations on RDNA3
  • Added Triton autotuning configurations tuned for Navi's wavefront and cache characteristics (see the sketch after this list)
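
The two performance ideas above can be made concrete with a small Triton sketch. The config values, kernel, and shapes below are illustrative only (the PR's real kernel is the full Flash Attention forward); it shows (a) Navi-oriented autotune configs with tile sizes small enough for the working set to stay resident in the 96 MB Infinity Cache, and (b) replacing `exp(x)` with `exp2(x * log2(e))` so the exponential maps to RDNA3's native exp2 instruction:

```python
# Illustrative sketch, not the PR's actual kernel or config values.
import math

import triton
import triton.language as tl

LOG2_E = math.log2(math.e)  # exp(x) == exp2(x * log2(e))

# Hypothetical Navi-oriented configs: modest tiles and warp counts so the
# Q/K/V tiles a workgroup touches can stay resident in the LLC.
NAVI_CONFIGS = [
    triton.Config({"BLOCK_M": 64, "BLOCK_N": 64}, num_warps=4, num_stages=1),
    triton.Config({"BLOCK_M": 64, "BLOCK_N": 32}, num_warps=4, num_stages=1),
    triton.Config({"BLOCK_M": 32, "BLOCK_N": 32}, num_warps=2, num_stages=1),
]


@triton.autotune(configs=NAVI_CONFIGS, key=["SEQLEN_Q", "SEQLEN_K", "HEAD_DIM"])
@triton.jit
def _exp2_softmax_numerator(scores_ptr, out_ptr,
                            SEQLEN_Q, SEQLEN_K, HEAD_DIM,
                            BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    # Toy inner step: exponentiate a BLOCK_M chunk of pre-softmax scores with
    # exp2 instead of exp (assumes the buffer length is a multiple of BLOCK_M).
    offs = tl.program_id(0) * BLOCK_M + tl.arange(0, BLOCK_M)
    s = tl.load(scores_ptr + offs)
    p = tl.math.exp2(s * LOG2_E)
    tl.store(out_ptr + offs, p)
```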

Code Cleanup:

  • Renamed "L2 cache" terminology to "Infinity Cache (LLC)" throughout the codebase to accurately reflect AMD's cache hierarchy and avoid confusion with the traditional L2 cache

Test Plan

  • Functional testing on AMD Radeon RX 7900 XTX (gfx1100)
  • Verified Flash Attention forward pass correctness against a reference implementation (see the sketch after this list)
  • Benchmarked memory bandwidth utilization with and without LLC awareness
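
A hedged sketch of the correctness check, comparing the Triton forward pass against PyTorch's SDPA reference in fp32; the `flash_attn_func` import path and the tolerances are assumptions, not taken from the PR:

```python
import torch
from flash_attn import flash_attn_func  # assumed entry point for this repo


def check_fwd(batch=2, seqlen=2048, nheads=8, d=64, dtype=torch.float16):
    torch.manual_seed(0)
    q, k, v = (torch.randn(batch, seqlen, nheads, d, device="cuda", dtype=dtype)
               for _ in range(3))
    out = flash_attn_func(q, k, v, causal=True)
    # Reference: SDPA expects (batch, nheads, seqlen, head_dim) layout and is
    # run in fp32 to serve as the baseline.
    ref = torch.nn.functional.scaled_dot_product_attention(
        q.transpose(1, 2).float(),
        k.transpose(1, 2).float(),
        v.transpose(1, 2).float(),
        is_causal=True,
    ).transpose(1, 2).to(dtype)
    torch.testing.assert_close(out, ref, atol=2e-2, rtol=2e-2)
```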

Test Results

  • All existing Triton Flash Attention tests pass on gfx1100
  • ~2x performance improvement with the LLC-aware implementation on memory-bound attention workloads
  • LLC awareness significantly reduces DRAM bandwidth pressure by better utilizing the 96 MB Infinity Cache on RDNA3

Submission Checklist

tridao and others added 30 commits June 30, 2025 01:35
Signed-off-by: loscrossos <165311345+loscrossos@users.noreply.github.com>
guilhermeleobas and others added 28 commits November 17, 2025 19:18
- Implement block-sparse attention in flash_fwd_sm100.py
- Update interface.py to handle SM100 block size calculations
  (2x multiplier for m_block_size since 1 CTA handles 2*tile_m rows)
- Add mask_mod parameter support in mask.py for block-sparse masking
- Add SM100 test fixtures and tile size handling in test_mask_mod.py

This enables block-sparsity on SM 10.0 architecture, including
mask_mod support and proper block size accounting.
…-AILab#2014)

* use correction warps for epi when varlen (non tma O)

* properly enable fallback epilogue for varlen q

* fix rebase errors

* update tests
* add fastdivmod for oob reads in mask_mods

* Updates for h100
… conditions (Dao-AILab#2033)

* enable deterministic mode for sm100 bwd and fix race conditions

* turn off lpt scheduler for causal

* use more regs for reduce when deterministic

* make a src for tiled mma dK toggleable parameter, remove smem async fence for lse release

* use 100k iterations for default
Not much to see here, but this causes linter noise
* Bump pin

* Switch to new fastdivmod

* cleanup varlen on blackwell

* Allow for only cute install
…s/fa3-compile

Add torch.compile support to flash attention 3
* add local for sm100 bwd

* add deterministic

* update tests

* ruff files

* remove old code

* move comment

* override window_size = None for causal

* revert to fwd test defaults
tianwyan requested a review from micmelesse on January 12, 2026, 15:00
tianwyan closed this on January 12, 2026