Conversation
The barrier should appear before the LDS read, not before the GlobalLoad. Also fix a bug: if no stages are found, the pass should preserve the barriers in the original loop.
```mlir
// CHECK: %[[ALLOC_LDS_A:.*]] = rock.alloc() : memref<16xf16, #gpu.address_space<private>>
// CHECK: %[[ALLOC_LDS_B:.*]] = rock.alloc() : memref<16xf16, #gpu.address_space<private>>
// CHECK: %[[WID_LDS:.*]] = rock.workitem_id : index
// CHECK: memref.load %[[ALLOC_LDS_A]][%c0]
```
This test has another bug. It should have two barriers, but currently only one is added because the test uses f16 as the data type for the LDS allocs. Multi-buffering fails on that (it assumes an i8 dtype for LDS buffers), so placeBarrier() fails to identify the dependency in this case.
This doesn't impact accuracy, since we always allocate LDS buffers with an i8 dtype, but it should still be fixed. I've opened AIROCMLIR-530.
Pull request overview
This PR fixes barrier placement issues in the Rock loop-pipelining pipeline for scheduleVersion=1, ensuring LDS barriers are pushed down past non-LDS ops (e.g., global loads) and preventing accidental barrier removal when no rock.stage ops are present.
Changes:
- Adjust `PushBarrierDownRewritePattern` to avoid pushing across terminators and to not push past other barriers.
- Skip barrier stripping / pipelining work for loops that have a pipeline attribute but contain no `rock.stage` ops.
- Update/expand Rock pipeline lit tests to check post-`remove-stages` barrier placement and "no stages" behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| mlir/lib/Dialect/Rock/Transforms/RockPipeline.cpp | Refines barrier pushdown behavior and stage/barrier cleanup sequencing for loop pipelining. |
| mlir/test/Dialect/Rock/test_rock_pipeline_nested.mlir | Adds extensive REMOVE-STAGES checks for barrier placement across nested pipelined loops and no-stage cases. |
| mlir/test/Dialect/Rock/test_rock_pipeline.mlir | Adds REMOVE-STAGES checks to validate barrier placement relative to multibuffer extraction and loads/stores. |
| mlir/test/Dialect/Rock/rock-pipeline-early-exit.mlir | Updates FileCheck expectations to validate barrier placement in an early-exit/nested-control scenario. |
```cpp
RewritePatternSet patterns(&getContext());
patterns.add<RemoveStagesRewritePattern, PushBarrierDownRewritePattern,
             RemoveBackToBackBarriersRewritePattern>(&getContext());
if (failed(
        applyPatternsGreedily(func, std::move(patternsRemoveStages))))
  return signalPassFailure();

RewritePatternSet patternsBackToBack(&getContext());
patternsBackToBack.add<RemoveBackToBackBarriersRewritePattern>(ctx);
if (failed(applyPatternsGreedily(func, std::move(patternsBackToBack))))
        applyPatternsGreedily(getOperation(), std::move(patterns)))) {
  return signalPassFailure();
```
In the remove-stages cleanup, applyPatternsGreedily() is now given a single pattern set containing RemoveStagesRewritePattern, PushBarrierDownRewritePattern, and RemoveBackToBackBarriersRewritePattern. Greedy rewrites don't guarantee an explicit phase ordering between patterns, so this makes correct barrier placement depend on the driver’s worklist behavior. To keep the behavior deterministic (and aligned with the intent to push barriers only after stages/terminators are removed), consider running these pattern applications in separate applyPatternsGreedily calls (e.g., RemoveStages first, then PushBarrierDown, then back-to-back cleanup) or assign benefits to enforce the intended ordering.
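A minimal sketch of the phased alternative this comment suggests, assuming the MLIR greedy rewrite driver API and the pattern class names visible in the diff above (`func` is assumed to be the enclosing function op):

```cpp
// Phase 1: remove rock.stage ops (and their terminators) first, so the
// pushdown pattern never has to reason about rock.yield.
RewritePatternSet removeStages(&getContext());
removeStages.add<RemoveStagesRewritePattern>(&getContext());
if (failed(applyPatternsGreedily(func, std::move(removeStages))))
  return signalPassFailure();

// Phase 2: push barriers down past non-LDS ops such as global loads.
RewritePatternSet pushDown(&getContext());
pushDown.add<PushBarrierDownRewritePattern>(&getContext());
if (failed(applyPatternsGreedily(func, std::move(pushDown))))
  return signalPassFailure();

// Phase 3: clean up any back-to-back barriers the pushdown produced.
RewritePatternSet dedupBarriers(&getContext());
dedupBarriers.add<RemoveBackToBackBarriersRewritePattern>(&getContext());
if (failed(applyPatternsGreedily(func, std::move(dedupBarriers))))
  return signalPassFailure();
```

Alternatively, passing `PatternBenefit` values to `add<>` biases the greedy driver toward the intended order, but separate phases make the sequencing explicit rather than a property of the worklist.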
```mlir
// Prologue: barrier before LDS write
// REMOVE-STAGES: memref.load %{{.*}} : memref<16xf16, #gpu.address_space<private>>
// REMOVE-STAGES: rock.lds_barrier
// REMOVE-STAGES: memref.store %{{.*}} : memref<16xf16, #gpu.address_space<workgroup>>
// REMOVE-STAGES: scf.for
```
These FileCheck patterns are quite loose for asserting barrier placement in the prologue: REMOVE-STAGES: memref.load ... followed by REMOVE-STAGES: rock.lds_barrier allows unrelated loads/barriers elsewhere in the function to satisfy the check. Since this PR is specifically about barrier position relative to nearby ops, it would be more robust to use REMOVE-STAGES-NEXT (or capture and reuse SSA values) to tightly bind the barrier to the intended load/store sequence.
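For example, a tightened variant of the checks quoted above (a sketch; the op text would still need to match the test's actual IR, and `-NEXT` requires the ops to be strictly adjacent in the output):

```mlir
// REMOVE-STAGES:      memref.load %{{.*}} : memref<16xf16, #gpu.address_space<private>>
// REMOVE-STAGES-NEXT: rock.lds_barrier
// REMOVE-STAGES-NEXT: memref.store %{{.*}} : memref<16xf16, #gpu.address_space<workgroup>>
```

If other ops legitimately sit between the load and the barrier, capturing an SSA value (e.g. `%[[V:.*]] = memref.load ...` and reusing `%[[V]]` in the store check) ties the checks together without requiring strict adjacency.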
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Motivation
Currently, for scheduleVersion = 1, loop pipelining inserts the barrier earlier than necessary.
This is the current output of `rock-pipeline` for `--schedule_version 1`. Notice that the LDS barrier appears before the GlobalLoad, when it should instead appear after the GlobalLoad.
With this PR, it appears after GlobalLoad.
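An illustrative sketch of the change in ordering (the op names other than `rock.lds_barrier` are hypothetical shorthand, simplified from the real `rock-pipeline` output):

```mlir
// Before this PR: the barrier lands above the global load, which it
// does not need to guard, delaying the load unnecessarily.
rock.lds_barrier
%v = rock.global_load ...    // global memory read, no LDS dependency
memref.load %lds_buf[...]    // LDS read the barrier actually protects

// After this PR: the barrier is pushed down past the global load and
// sits immediately before the LDS read.
%v = rock.global_load ...
rock.lds_barrier
memref.load %lds_buf[...]
```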
Technical Details
This bug happens because `PushDownBarrier` currently runs before the removal of stages, and `PushDownBarrier` exits early as soon as it finds a terminator (e.g. `rock.yield`).

This PR fixes one more bug: if no "stages" are found, the pass should not do loop pipelining and should preserve the LDS barriers in the original program. At present, however, it first removes all the barriers and then exits, which could lead to accuracy problems. In practice we won't get a program without stages, but there was a unit test in which all barriers were removed when there were no stages.
I've modified the lit tests to check that barriers end up in the correct place after the removal of stages.