Only add one LDS Barrier for one wave blockSizes #2253

Open
umangyadav wants to merge 3 commits into barrierFix from barrierForOneWave

umangyadav commented Feb 24, 2026

Motivation

Depends on #2250

For schedule=1, the following program is generated after barrier placement and before loop pipelining:

scf.for 
{
LDSBarrier (__bwd_barrier__)
GlobalLoad
DSWrite
LDSBarrier (__fwd_barrier__)
DSRead + MFMA
}

After loop pipelining and pushDownBarrier, it becomes:

scf.for {
GlobalLoad
LDSBarrier 
DSRead + MFMA
LDSBarrier 
DSWrite
}

Here the __bwd_barrier__ is required for the loop-carried dependency: it ensures that all waves within the workgroup have finished issuing and completing their reads from the LDS buffers before any wave continues to the next iteration of the for loop, which writes to the same LDS buffer again. Waves can be out of sync, so this barrier guarantees that the reads from all waves have finished before the buffer is overwritten.
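The loop-carried dependency described above can be checked on a toy model. The sketch below is only an illustration and is not the analysis used in the pass: the `Op` enum and `ldsHazardFree` are hypothetical names, and `DSRead + MFMA` is modeled as a single `DSRead`. It walks the loop body twice so that the wrap-around from the end of one iteration to the start of the next is covered, and reports whether every switch between LDS writes and LDS reads has a barrier in between.

```cpp
#include <cassert>
#include <vector>

// Toy op kinds for a loop body; names are illustrative only.
enum class Op { GlobalLoad, DSWrite, DSRead, Barrier };

// Returns true if every change between LDS write and LDS read (in either
// direction) is separated by a barrier, including across the loop back-edge.
bool ldsHazardFree(const std::vector<Op> &body) {
  enum class Last { None, Write, Read };
  Last last = Last::None;
  bool barrierSinceLast = false;
  // Two laps over the body so the last access of iteration i is compared
  // against the first access of iteration i+1.
  for (int lap = 0; lap < 2; ++lap) {
    for (Op op : body) {
      if (op == Op::Barrier) {
        barrierSinceLast = true;
        continue;
      }
      if (op == Op::GlobalLoad)
        continue; // Does not touch LDS in this model.
      const Last cur = (op == Op::DSWrite) ? Last::Write : Last::Read;
      if (last != Last::None && last != cur && !barrierSinceLast)
        return false;
      last = cur;
      barrierSinceLast = false;
    }
  }
  return true;
}
```

Both listings above pass this check. Dropping the backward barrier from the first listing makes it fail, which is the multi-wave hazard the barrier guards against.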

For the special case of blockSize = 1xWaveSize, we don't need to wait for all the waves because there is only one wave. Just issuing the DSReads is enough; we don't have to wait for them to finish before proceeding to the next iteration.

Therefore, for that case we can skip adding __bwd_barrier__.
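As a rough sketch of the effect on the pre-pipelining loop body, the helper below builds the op list with and without the backward barrier. `buildLoopBody` and its parameters are hypothetical and use plain strings rather than MLIR ops. It assumes the forward barrier is always kept, which matches the PR title (one barrier remains) but is not spelled out in the description.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Builds the pre-pipelining loop body as a list of op names.
// blockSize and waveSize are thread counts; both names are illustrative.
std::vector<std::string> buildLoopBody(int64_t blockSize, int64_t waveSize) {
  // Single-wave case from the description: blockSize = 1 x waveSize.
  const bool singleWave = (blockSize == waveSize);
  std::vector<std::string> body;
  if (!singleWave)
    body.push_back("LDSBarrier (__bwd_barrier__)"); // Skipped for one wave.
  body.push_back("GlobalLoad");
  body.push_back("DSWrite");
  body.push_back("LDSBarrier (__fwd_barrier__)"); // Assumed to always stay.
  body.push_back("DSRead + MFMA");
  return body;
}
```

For example, `buildLoopBody(64, 64)` yields four ops starting with `GlobalLoad`, while `buildLoopBody(256, 64)` yields five ops starting with the backward barrier.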

Technical Details

Currently this is only enabled for a single for loop, and therefore only for GEMMs and Convs. Nested for loops may require additional analysis across the loops.
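One way to picture the single-loop restriction is a depth check on the loop nest. This is a sketch under assumptions: `LoopNode`, `loopDepth`, and `isSingleForLoop` are hypothetical, and the actual pass may detect nesting differently.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for a for loop and the loops nested inside it.
struct LoopNode {
  std::vector<LoopNode> nested;
};

// Depth of the loop nest rooted at `loop` (a loop with no inner loops is 1).
int loopDepth(const LoopNode &loop) {
  int deepest = 0;
  for (const LoopNode &child : loop.nested)
    deepest = std::max(deepest, loopDepth(child));
  return 1 + deepest;
}

// The optimization is only considered when there is no nesting at all.
bool isSingleForLoop(const LoopNode &outermost) {
  return loopDepth(outermost) == 1;
}
```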

Test Plan

Added new tests with blockSize = 1xWaveSize

Copilot AI left a comment

Pull request overview

This PR implements an optimization that skips backward LDS barriers for single-wave GPU kernels with specific schedule versions (1 and 3) during loop pipelining. The motivation is that when a workgroup consists of only one wave (blockSize ≤ waveSize), the GPU's in-order instruction execution within the wave eliminates the need for explicit synchronization barriers between iterations.

Changes:

  • Added canSkipBackwardBarrierForOneWave() function to determine when backward barriers can be safely skipped
  • Modified placeBarriers() to accept the parent function and conditionally skip backward barriers for single-wave cases
  • Added comprehensive test coverage including unit tests and end-to-end tests for various scheduleVersions and data types
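The condition described in this overview could look roughly like the predicate below. It is a sketch, not the code in RockPipeline.cpp: the name `canSkipBackwardBarrierSketch` and its signature are assumptions, and the real `canSkipBackwardBarrierForOneWave()` may take different inputs. It follows the overview's wording (scheduleVersion 1 or 3, blockSize ≤ waveSize), whereas the PR description states the case as blockSize = 1xWaveSize.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical predicate mirroring the review overview's description.
bool canSkipBackwardBarrierSketch(int64_t scheduleVersion, int64_t blockSize,
                                  int64_t waveSize) {
  // Only schedule versions 1 and 3 are mentioned as eligible.
  const bool supportedSchedule = scheduleVersion == 1 || scheduleVersion == 3;
  // A workgroup no larger than one wave has no other wave to wait for.
  const bool singleWave =
      waveSize > 0 && blockSize > 0 && blockSize <= waveSize;
  return supportedSchedule && singleWave;
}
```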

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `mlir/lib/Dialect/Rock/Transforms/RockPipeline.cpp` | Core implementation: added barrier optimization logic and modified placeBarriers signature |
| `mlir/test/Dialect/Rock/test_rock_pipeline_wave_barrier.mlir` | Unit tests covering single-wave (scheduleVersion 1,3), multi-wave, and different scheduleVersions |
| `mlir/test/Dialect/Rock/rock-pipeline-early-exit.mlir` | Added arch attribute to existing test function to ensure compatibility |
| `mlir/test/e2e/GemmOneWaveBarrier.toml` | E2E test for single-wave GEMM with scheduleVersion=1 |
| `mlir/test/e2e/GemmOneWaveBarrierDirectToLDS.toml` | E2E test for single-wave GEMM with scheduleVersion=3 (DirectToLDS) |
| `mlir/test/e2e/GemmOneWaveBarrierFp8.toml` | E2E test for single-wave GEMM with fp8 data types |
| `mlir/test/e2e/PrGemmOneWaveBarrier.toml` | PR-specific test for single-wave GEMM optimization |
| `mlir/test/e2e/PrGemmOneWaveBarrierDirectToLDS.toml` | PR-specific test for DirectToLDS variant |
| `mlir/test/e2e/*.cfg` | Configuration files specifying hardware requirements for each test |
| `mlir/test/e2e/CMakeLists.txt` | Added new test suites to the build system |

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>