Skip to content

Comments

Refine LDS barrier placement #2250

Open
umangyadav wants to merge 4 commits intodevelopfrom
barrierFix
Open

Refine LDS barrier placement #2250
umangyadav wants to merge 4 commits intodevelopfrom
barrierFix

Conversation

@umangyadav
Copy link
Member

@umangyadav umangyadav commented Feb 24, 2026

Motivation

Currently for scheduleVersion =1, loop pipelining is inserting barrier before it's necessary.

This is the output after rock-pipeline currently for --schedule_version 1

scf.for 
{
LDSBarrier
GlobalLoad
DSRead + MFMA
LDSBarrier
LDSWrite
}

Notice that LDSbarrier appears before GlobalLoad but instead it should appear after GlobalLoad.

With this PR, it appears after GlobalLoad.

Technical Details

This bug is happening because currently PushDownBarrier runs before removal of stages and PushDownBarrier exits early as soon as it finds a terminator (e.g. rock.yield)

This PR fixes one more bug, if there are no "stages" found then it shouldn't do loop pipelining and it should preserve LDS Barriers in original program. But at present it first removes all the barriers and then exits, which could lead to accuracy problems. In practice we won't get a program without stages, but there was a unit-test which was removing all barriers when there were no stages.

I've modified lit-tests to look for barriers in correct place after removal of stages.

It should appear before LDSRead and not before GlobalLoad.

Also fix a bug where, if there are no stages found then it should preserve barriers in original loop
// CHECK: %[[ALLOC_LDS_A:.*]] = rock.alloc() : memref<16xf16, #gpu.address_space<private>>
// CHECK: %[[ALLOC_LDS_B:.*]] = rock.alloc() : memref<16xf16, #gpu.address_space<private>>
// CHECK: %[[WID_LDS:.*]] = rock.workitem_id : index
// CHECK: memref.load %[[ALLOC_LDS_A]][%c0]
Copy link
Member Author

@umangyadav umangyadav Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test has another bug. It should have two barriers but currently it only adds one because it is using f16 as the data type on LDS Allocs. multi-buffering fails on that (It assumes i8 dtype for LDS Buffers). placeBarrier() fails to identify dependency in this case.

This doesn't impact accuracy as we always allocate LDS Buffers in i8 dtype but it should be fixed still. I've opened AIROCMLIR-530

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes barrier placement issues in the Rock loop-pipelining pipeline for scheduleVersion=1, ensuring LDS barriers are pushed down past non-LDS ops (e.g., global loads) and preventing accidental barrier removal when no rock.stage ops are present.

Changes:

  • Adjust PushBarrierDownRewritePattern to avoid pushing across terminators and to not push past other barriers.
  • Skip barrier stripping / pipelining work for loops that have a pipeline attribute but contain no rock.stage ops.
  • Update/expand Rock pipeline lit tests to check post-remove-stages barrier placement and “no stages” behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
mlir/lib/Dialect/Rock/Transforms/RockPipeline.cpp Refines barrier pushdown behavior and stage/barrier cleanup sequencing for loop pipelining.
mlir/test/Dialect/Rock/test_rock_pipeline_nested.mlir Adds extensive REMOVE-STAGES checks for barrier placement across nested pipelined loops and no-stage cases.
mlir/test/Dialect/Rock/test_rock_pipeline.mlir Adds REMOVE-STAGES checks to validate barrier placement relative to multibuffer extraction and loads/stores.
mlir/test/Dialect/Rock/rock-pipeline-early-exit.mlir Updates FileCheck expectations to validate barrier placement in an early-exit/nested-control scenario.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +808 to 813
RewritePatternSet patterns(&getContext());
patterns.add<RemoveStagesRewritePattern, PushBarrierDownRewritePattern,
RemoveBackToBackBarriersRewritePattern>(&getContext());
if (failed(
applyPatternsGreedily(func, std::move(patternsRemoveStages))))
return signalPassFailure();

RewritePatternSet patternsBackToBack(&getContext());
patternsBackToBack.add<RemoveBackToBackBarriersRewritePattern>(ctx);
if (failed(applyPatternsGreedily(func, std::move(patternsBackToBack))))
applyPatternsGreedily(getOperation(), std::move(patterns)))) {
return signalPassFailure();
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the remove-stages cleanup, applyPatternsGreedily() is now given a single pattern set containing RemoveStagesRewritePattern, PushBarrierDownRewritePattern, and RemoveBackToBackBarriersRewritePattern. Greedy rewrites don't guarantee an explicit phase ordering between patterns, so this makes correct barrier placement depend on the driver’s worklist behavior. To keep the behavior deterministic (and aligned with the intent to push barriers only after stages/terminators are removed), consider running these pattern applications in separate applyPatternsGreedily calls (e.g., RemoveStages first, then PushBarrierDown, then back-to-back cleanup) or assign benefits to enforce the intended ordering.

Copilot uses AI. Check for mistakes.
Comment on lines +99 to +103
// Prologue: barrier before LDS write
// REMOVE-STAGES: memref.load %{{.*}} : memref<16xf16, #gpu.address_space<private>>
// REMOVE-STAGES: rock.lds_barrier
// REMOVE-STAGES: memref.store %{{.*}} : memref<16xf16, #gpu.address_space<workgroup>>
// REMOVE-STAGES: scf.for
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These FileCheck patterns are quite loose for asserting barrier placement in the prologue: REMOVE-STAGES: memref.load ... followed by REMOVE-STAGES: rock.lds_barrier allows unrelated loads/barriers elsewhere in the function to satisfy the check. Since this PR is specifically about barrier position relative to nearby ops, it would be more robust to use REMOVE-STAGES-NEXT (or capture and reuse SSA values) to tightly bind the barrier to the intended load/store sequence.

Copilot uses AI. Check for mistakes.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant