I am referring to our approach as a stream rather than a double buffer, since the compute values are always taken from the same buffer in LDS. We stream tiles through VGPRs into that one buffer, so we are really relying on higher occupancy to get perf.
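To make the distinction concrete, here is a minimal Python sketch of the two shapes, under loud assumptions: `stream_single_buffer` and `double_buffer` are hypothetical names, plain lists stand in for LDS buffers and VGPRs, and nothing here is taken from the rocMLIR code base. It only models which buffer the compute step reads from.

```python
def stream_single_buffer(tiles):
    """Stream tiles through a register staging area into ONE LDS buffer."""
    lds = None                 # the single LDS buffer (assumed model)
    staged = tiles[0]          # "global_load" of tile 0 into VGPRs
    acc = 0
    for i in range(len(tiles)):
        lds = list(staged)     # "ds_write": VGPRs -> the one LDS buffer
        if i + 1 < len(tiles):
            staged = tiles[i + 1]  # next "global_load" lands in VGPRs only
        acc += sum(lds)        # compute always reads the same buffer
    return acc


def double_buffer(tiles):
    """Alternate between TWO LDS buffers: fill one while reading the other."""
    lds = [None, None]
    lds[0] = list(tiles[0])    # prologue: fill buffer 0
    acc = 0
    for i in range(len(tiles)):
        if i + 1 < len(tiles):
            lds[(i + 1) % 2] = list(tiles[i + 1])  # fill the other buffer
        acc += sum(lds[i % 2])  # compute reads the buffer filled earlier
    return acc
```

Both functions produce the same sum for non-empty input; the difference is only where the next tile sits while the current one is consumed (registers in the first case, a second LDS buffer in the second).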
Having looked at our code, I'm still not sure we do software pipelining at all. Let me state the theory first. Once software-pipelined, the loop should look like:

The above is just an example that hides load latency with compute and accumulate. I think ours is not doing software pipelining because I read our code as follows:

The above just peels the first iteration out of the loop but does not software-pipeline it.
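As a rough illustration of the distinction drawn in this comment (a generic sketch, not the rocMLIR code itself), the Python below contrasts a loop that only peels its first iteration with one that is software-pipelined. The helper names `load`, `compute`, `peeled_only`, and `software_pipelined` are hypothetical, and each function records a trace so the issue order is visible. Both assume a non-empty list of tiles.

```python
def load(tiles, i, trace):
    trace.append(("load", i))      # stands in for issuing a global load
    return tiles[i]


def compute(acc, tile, i, trace):
    trace.append(("compute", i))   # stands in for compute and accumulate
    return acc + sum(tile)


def peeled_only(tiles):
    trace = []
    cur = load(tiles, 0, trace)    # first iteration hoisted out of the loop
    acc = compute(0, cur, 0, trace)
    for i in range(1, len(tiles)):
        cur = load(tiles, i, trace)        # load of tile i ...
        acc = compute(acc, cur, i, trace)  # ... immediately consumed
    return acc, trace


def software_pipelined(tiles):
    trace = []
    cur = load(tiles, 0, trace)    # prologue: first load only
    acc = 0
    for i in range(len(tiles) - 1):
        nxt = load(tiles, i + 1, trace)    # issue the NEXT load first
        acc = compute(acc, cur, i, trace)  # compute on the tile already loaded
        cur = nxt
    acc = compute(acc, cur, len(tiles) - 1, trace)  # epilogue: last compute
    return acc, trace
```

In the peeled version every load is still followed directly by the compute that depends on it, so no latency is hidden. In the pipelined version the load of tile `i + 1` is issued before the compute of tile `i`, which is what gives the hardware independent work to overlap with the load.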
I would like to discuss how we implement software pipelining in rocMLIR.
In the graphs, light color means global_load is issued but the data is not ready. Dark color means the data is committed.

Double-buffer case

Since we do global_load before lds_barrier at the beginning of the loop, at the moment of global_load we need two sets of VGPRs: one to wait for global_load and the other to wait for ds_write. As shown in the graph below:

If we switch the global_load and lds_barrier, as shown in the graph below, we'll need only one set of VGPRs.

Triple-buffer case

By doing lds_barrier before global_load, we can achieve triple-buffer software pipelining with two sets of VGPRs, as shown below.

Note that at every moment, only two sets of VGPRs are in use.
So my question for @krzysz00 and @sjw36: