[Navi] Add support with Infinity Cache (LLC) awareness for improved performance #169

Closed

tianwyan wants to merge 259 commits into main from tianwyan/navi_experiment

Conversation

@tianwyan

Motivation

This PR enables Flash Attention Triton support for AMD RDNA3 (Navi) GPUs, specifically targeting the gfx1100 architecture. The goal is to bring Flash Attention performance optimizations to consumer-grade AMD GPUs while leveraging the unique Infinity Cache (LLC) architecture for improved memory throughput.

Technical Details

New Architecture Support:

  • Added gfx1100 (RDNA3/Navi 31) to the supported GPU architectures in the Triton Flash Attention backend
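
A minimal sketch of how that gating can look, assuming a ROCm build of PyTorch that exposes `gcnArchName`; the helper names below are illustrative, not the backend's actual API:

```python
# Hypothetical helpers for gating the Navi3 (gfx1100) code path.
import torch

_NAVI3_ARCHS = ("gfx1100",)  # RDNA3 / Navi 31


def _gfx_arch() -> str:
    # gcnArchName is reported by ROCm builds of PyTorch,
    # e.g. "gfx1100" or "gfx90a:sramecc+:xnack-".
    return torch.cuda.get_device_properties(0).gcnArchName.split(":")[0]


def is_navi3() -> bool:
    return _gfx_arch() in _NAVI3_ARCHS
```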

Performance Optimizations:

  • Implemented Infinity Cache (LLC) awareness to optimize memory access patterns and reduce DRAM bandwidth pressure
  • Enabled the exp2 instruction by default for faster exponential calculations on RDNA3
  • Added Triton autotuning configurations tuned for Navi's wavefront and cache characteristics (see the sketch after this list)
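
The two performance ideas above can be made concrete with a small Triton sketch. The config values, kernel, and shapes below are illustrative only (the PR's real kernel is the full Flash Attention forward); it shows (a) Navi-oriented autotune configs with tile sizes small enough for the working set to stay resident in the 96 MB Infinity Cache, and (b) replacing `exp(x)` with `exp2(x * log2(e))` so the exponential maps to RDNA3's native exp2 instruction:

```python
# Illustrative sketch, not the PR's actual kernel or config values.
import math

import triton
import triton.language as tl

LOG2_E = math.log2(math.e)  # exp(x) == exp2(x * log2(e))

# Hypothetical Navi-oriented configs: modest tiles and warp counts so the
# Q/K/V tiles a workgroup touches can stay resident in the LLC.
NAVI_CONFIGS = [
    triton.Config({"BLOCK_M": 64, "BLOCK_N": 64}, num_warps=4, num_stages=1),
    triton.Config({"BLOCK_M": 64, "BLOCK_N": 32}, num_warps=4, num_stages=1),
    triton.Config({"BLOCK_M": 32, "BLOCK_N": 32}, num_warps=2, num_stages=1),
]


@triton.autotune(configs=NAVI_CONFIGS, key=["SEQLEN_Q", "SEQLEN_K", "HEAD_DIM"])
@triton.jit
def _exp2_softmax_numerator(scores_ptr, out_ptr,
                            SEQLEN_Q, SEQLEN_K, HEAD_DIM,
                            BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    # Toy inner step: exponentiate a BLOCK_M chunk of pre-softmax scores with
    # exp2 instead of exp (assumes the buffer length is a multiple of BLOCK_M).
    offs = tl.program_id(0) * BLOCK_M + tl.arange(0, BLOCK_M)
    s = tl.load(scores_ptr + offs)
    p = tl.math.exp2(s * LOG2_E)
    tl.store(out_ptr + offs, p)
```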

Code Cleanup:

  • Renamed "L2 cache" terminology to "Infinity Cache (LLC)" throughout the codebase to accurately reflect AMD's cache hierarchy and avoid confusion with the traditional L2 cache

Test Plan

  • Functional testing on AMD Radeon RX 7900 XTX (gfx1100)
  • Verified Flash Attention forward pass correctness against a reference implementation (see the sketch after this list)
  • Benchmarked memory bandwidth utilization with and without LLC awareness
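
A hedged sketch of the correctness check, comparing the Triton forward pass against PyTorch's SDPA reference in fp32; the `flash_attn_func` import path and the tolerances are assumptions, not taken from the PR:

```python
import torch
from flash_attn import flash_attn_func  # assumed entry point for this repo


def check_fwd(batch=2, seqlen=2048, nheads=8, d=64, dtype=torch.float16):
    torch.manual_seed(0)
    q, k, v = (torch.randn(batch, seqlen, nheads, d, device="cuda", dtype=dtype)
               for _ in range(3))
    out = flash_attn_func(q, k, v, causal=True)
    # Reference: SDPA expects (batch, nheads, seqlen, head_dim) layout and is
    # run in fp32 to serve as the baseline.
    ref = torch.nn.functional.scaled_dot_product_attention(
        q.transpose(1, 2).float(),
        k.transpose(1, 2).float(),
        v.transpose(1, 2).float(),
        is_causal=True,
    ).transpose(1, 2).to(dtype)
    torch.testing.assert_close(out, ref, atol=2e-2, rtol=2e-2)
```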

Test Results

  • All existing Triton Flash Attention tests pass on gfx1100
  • ~2x performance improvement with the LLC-aware implementation on memory-bound attention workloads
  • LLC awareness significantly reduces DRAM bandwidth pressure by better utilizing the 96 MB Infinity Cache on RDNA3

Submission Checklist

tridao and others added 30 commits June 30, 2025 01:35
Signed-off-by: loscrossos <165311345+loscrossos@users.noreply.github.com>
guilhermeleobas and others added 28 commits November 17, 2025 19:18
- Implement block-sparse attention in flash_fwd_sm100.py
- Update interface.py to handle SM100 block size calculations
  (2x multiplier for m_block_size since 1 CTA handles 2*tile_m rows)
- Add mask_mod parameter support in mask.py for block-sparse masking
- Add SM100 test fixtures and tile size handling in test_mask_mod.py

This enables block-sparsity on SM 10.0 architecture, including
mask_mod support and proper block size accounting.
…-AILab#2014)

* use correction warps for epi when varlen (non tma O)

* properly enable fallback epilogue for varlen q

* fix rebase errors

* update tests
* add fastdivmod for oob reads in mask_mods

* Updates for h100
… conditions (Dao-AILab#2033)

* enable deterministic mode for sm100 bwd and fix race conditions

* turn off lpt scheduler for causal

* use more regs for reduce when deterministic

* make a src for tiled mma dK toggleable parameter, remove smem async fence for lse release

* use 100k iterations for default
Not much to see here, but this causes linter noise
* Bump pin

* Switch to new fastdivmod

* cleanup varlen on blackwell

* Allow for only cute install
…s/fa3-compile

Add torch.compile support to flash attention 3
* add local for sm100 bwd

* add deterministic

* update tests

* ruff files

* remove old code

* move comment

* override window_size = None for causal

* revert to fwd test defaults
tianwyan requested a review from micmelesse on January 12, 2026, 15:00
tianwyan closed this on January 12, 2026