
Conversation

@yyan7223
Collaborator

After talking with @ShangkunLi, I redesigned the streaming LD; here is the timing diagram:
[timing diagram]
As you can see, recv_opt.rdy stays 0 until the streaming LD finishes, which means the element blocks itself during the streaming LD. The fu_crossbar and routing_crossbar are also blocked.

@ShangkunLi, we can perform the streaming LD simply by configuring (base, step, bound)
[image: (base, step, bound) configuration]
and sending an OPT_STREAM_LD (it does not consume any operands, so the routing_crossbar stays idle).
[image]
Please check cgra/test/CgraRTL_streaming_read_test.py for the streaming LD configuration; feel free to contact me if there are any concerns.

@tancheng, I think we should deprecate PR #235 (the previous streaming LD design); we now do the streaming LD control directly in MemUnitRTL.

@tancheng
Owner

What's the functionality of #235? recv_opt.rdy stays 0 as well, doesn't it?

@yyan7223
Collaborator Author

> What's the functionality of #235? recv_opt.rdy stays 0 as well, doesn't it?

PR #235 modifies DataMemControllerRTL and DataMemWrapperRTL to support streaming LD, but I found it also needs to interact with the FU to hold recv_opt.rdy at 0 during streaming, which is not that elegant.
So we redesigned the streaming LD from scratch and control recv_opt.rdy directly in MemUnitRTL.

@@ -0,0 +1,345 @@
"""
==========================================================================
CgraRTL_fir_test.py
Owner

update comment

kTotalCtrlSteps = 3
src_ctrl_pkt = []
src_opt_pkt = [
  # tile 0
Owner

Briefly describe what the test does.

@tancheng
Owner

> PR #235 modifies DataMemControllerRTL and DataMemWrapperRTL to support streaming LD, but I found it also needs to interact with the FU to hold recv_opt.rdy at 0 during streaming, which is not that elegant. So we redesigned the streaming LD from scratch and control recv_opt.rdy directly in MemUnitRTL.

Either should be fine. If going with this PR's strategy, we need to make sure that if the consumer cannot consume the streamed data in time, the uncompleted reading is blocked.

@yyan7223
Collaborator Author

> Either should be fine. If going with this PR's strategy, we need to make sure that if the consumer cannot consume the streamed data in time, the uncompleted reading is blocked.

Thanks for the reminder; now we have:

  elif s.from_mem_rdata.val & s.from_mem_rdata.rdy:
    # Moves to next address only when current read finishes.
    s.streaming_raddr <<= s.streaming_raddr + s.streaming_stride

So if the consumer tile cannot consume the streamed data in time, its input channel becomes full. The producer's MemUnitRTL then has s.send_out[0].rdy = 0 and s.from_mem_rdata.rdy = 0, which blocks the uncompleted reading by stopping the increment of s.streaming_raddr.
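
For illustration, a minimal standalone PyMTL3 sketch of this backpressure chain (a toy component with hypothetical flattened signal names, not the PR's actual code):

```python
from pymtl3 import *

# Toy model of the blocking behavior described above: consumer-side rdy
# gates acceptance of memory responses, and a stalled consumer freezes
# the streaming address register.
class BackpressureSketch( Component ):
  def construct( s, AddrType = Bits32 ):
    s.send_out_rdy       = InPort (1)          # consumer channel has space
    s.from_mem_rdata_val = InPort (1)          # memory response is valid
    s.from_mem_rdata_rdy = OutPort(1)          # we can accept the response
    s.streaming_stride   = InPort ( AddrType )
    s.streaming_raddr    = Wire   ( AddrType )

    @update
    def comb_rdy():
      # Accept a response only when the outgoing channel can take it.
      s.from_mem_rdata_rdy @= s.send_out_rdy

    @update_ff
    def ff_raddr():
      if s.from_mem_rdata_val & s.from_mem_rdata_rdy:
        # Advance only when the current read completes; otherwise hold.
        s.streaming_raddr <<= s.streaming_raddr + s.streaming_stride
```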

@yyan7223 yyan7223 requested a review from tancheng January 10, 2026 10:52
@tancheng
Owner

> So if the consumer tile cannot consume the streamed data in time, its input channel becomes full. The producer's MemUnitRTL then has s.send_out[0].rdy = 0 and s.from_mem_rdata.rdy = 0, which blocks the uncompleted reading by stopping the increment of s.streaming_raddr.

Hmm, that's actually a problem. If the MemUnit performs this s.streaming_raddr + s.streaming_stride only when a response completes, the second request is blocked by the first request's response. They are not pipelined. Your previous PR is more pipelinable.

@yyan7223
Collaborator Author

> Hmm, that's actually a problem. If the MemUnit performs this s.streaming_raddr + s.streaming_stride only when a response completes, the second request is blocked by the first request's response. They are not pipelined. Your previous PR is more pipelinable.

OK, then we can achieve pipelining by performing s.streaming_raddr + s.streaming_stride as long as to_mem_raddr.rdy == 1, and use a counter to decide when to quit streaming. For example, if the streaming LD requests 4 data in total, we can increment the counter every time s.from_mem_rdata.val & s.from_mem_rdata.rdy, and quit streaming when the counter reaches 3. What do you think?
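
For illustration, a toy PyMTL3 sketch of this quit-by-counter scheme (hypothetical standalone component with flattened signal names, not the PR's code), assuming a fixed total of kTotal reads:

```python
from pymtl3 import *

# Toy sketch: count consumed results and leave the streaming state once
# the expected total has been consumed.
class QuitCounterSketch( Component ):
  def construct( s, kTotal = 4, CountType = Bits8 ):
    s.from_mem_rdata_val = InPort (1)
    s.from_mem_rdata_rdy = InPort (1)
    s.streaming_status   = OutPort(1)   # 1 while the stream is in flight

    s.consumed = Wire( CountType )

    @update
    def comb_status():
      # Still streaming until all kTotal results have been consumed.
      s.streaming_status @= s.consumed < kTotal

    @update_ff
    def ff_count():
      if s.from_mem_rdata_val & s.from_mem_rdata_rdy:
        s.consumed <<= s.consumed + 1
```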

@tancheng
Owner

> OK, then we can achieve pipelining by performing s.streaming_raddr + s.streaming_stride as long as to_mem_raddr.rdy == 1, and use a counter to decide when to quit streaming.

Sounds good to me.

s.to_mem_raddr.val @= s.streaming_status & ~s.already_sent_raddr
# Keep issuing new LD requests as long as address buffer has free space,
# so that LD requests can be processed in pipeline.
s.to_mem_raddr.val @= s.streaming_status & s.to_mem_raddr.rdy
Owner

I don't think this is correct. streaming_status depends on streaming_done. But streaming_done depends on s.streaming_results_consumed_counter == (s.streaming_end_raddr - s.streaming_start_raddr) // s.streaming_stride.

What if the requests have already been sent out but streaming_results_consumed_counter is not yet full? We would then wrongly send out more than the required number of requests towards memory.

Collaborator Author

Oh, I see. Maybe we should also have a streaming_requests_sent_counter.
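
For illustration, a toy PyMTL3 sketch of that refinement (hypothetical flattened names, not the PR's code): a second counter tracks issued requests, so request issue stays pipelined without overshooting, while done still waits for consumption:

```python
from pymtl3 import *

# Toy sketch: requests_sent gates issuing (pipelined, never more than
# required), while results_consumed decides completion.
class TwoCounterSketch( Component ):
  def construct( s, CountType = Bits8 ):
    s.streaming_total    = InPort ( CountType )  # total reads to perform
    s.to_mem_raddr_rdy   = InPort (1)
    s.from_mem_rdata_val = InPort (1)
    s.from_mem_rdata_rdy = InPort (1)
    s.to_mem_raddr_val   = OutPort(1)
    s.streaming_done     = OutPort(1)

    s.requests_sent    = Wire( CountType )
    s.results_consumed = Wire( CountType )

    @update
    def comb():
      # Keep issuing only while some addresses are still unrequested.
      s.to_mem_raddr_val @= s.requests_sent < s.streaming_total
      # Done only once every streamed result has been consumed.
      s.streaming_done @= s.results_consumed == s.streaming_total

    @update_ff
    def ff():
      if s.to_mem_raddr_val & s.to_mem_raddr_rdy:
        s.requests_sent <<= s.requests_sent + 1
      if s.from_mem_rdata_val & s.from_mem_rdata_rdy:
        s.results_consumed <<= s.results_consumed + 1
```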

s.streaming_done @= s.from_mem_rdata.val & (s.streaming_raddr == s.streaming_end_raddr)
# Streaming LD is done when the last streaming result is consumed.
s.streaming_done @= (s.streaming_results_consumed_counter == \
(s.streaming_end_raddr - s.streaming_start_raddr) // s.streaming_stride) &\
Owner

I don't think // is synthesizable. Maybe we need to increment the counter using streaming_stride instead.

Collaborator Author

OK, I changed it back to incrementing the counter.
But DivRTL.py may also have a synthesis problem:

if s.recv_opt.val:
  if s.recv_opt.msg.operation == OPT_DIV:
    s.send_out[0].msg.payload @= s.recv_in[s.in0_idx].msg.payload // s.recv_in[s.in1_idx].msg.payload
    s.send_out[0].msg.predicate @= s.recv_in[s.in0_idx].msg.predicate & \
                                   s.recv_in[s.in1_idx].msg.predicate & \
                                   s.reached_vector_factor

Owner

Yes, so we use @HobbitQia's Verilog DIV for synthesis:

fuType2RTL["Div" ] = ExclusiveDivRTL
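
For illustration, a toy PyMTL3 sketch of the increment-by-stride alternative discussed above (hypothetical flattened names, not the PR's code): walk an address register by streaming_stride and compare for equality, so no division is needed:

```python
from pymtl3 import *

# Toy sketch: replace (end - start) // stride with an address register
# that walks by stride; equality with the end address signals done.
class StrideDoneSketch( Component ):
  def construct( s, AddrType = Bits32 ):
    s.start_pulse           = InPort (1)   # begin a new stream
    s.consume_pulse         = InPort (1)   # one streamed result consumed
    s.streaming_start_raddr = InPort ( AddrType )
    s.streaming_stride      = InPort ( AddrType )
    s.streaming_end_raddr   = InPort ( AddrType )
    s.streaming_done        = OutPort(1)

    s.consumed_raddr = Wire( AddrType )

    @update
    def comb_done():
      # Pure equality check: synthesizable, no division involved.
      s.streaming_done @= s.consumed_raddr == s.streaming_end_raddr

    @update_ff
    def ff_walk():
      if s.start_pulse:
        s.consumed_raddr <<= s.streaming_start_raddr
      elif s.consume_pulse & ~s.streaming_done:
        s.consumed_raddr <<= s.consumed_raddr + s.streaming_stride
```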

@yyan7223 yyan7223 requested a review from tancheng January 12, 2026 07:31

@update_ff
def update_operation_reg():
  s.operation_reg <<= trunc(s.recv_opt.msg.operation, operation_nbits)
Owner

Relying on whether the next op is streaming_ld is not safe. Any other idea?

Collaborator Author

Currently not. Previously I detected the rising edge of recv_opt.val and checked whether recv_opt.msg == OPT_STREAM_LD to decide whether to enter the streaming status, but I found that recv_opt.val stays high even when recv_opt.msg changes in adjacent cycles. That's why we now detect whether recv_opt.msg changes.

Do you have an example illustrating why it is not safe? Both recv_opt.msg and operation_reg are register values, which should be stable within the whole clock cycle.

Owner

What if the next op is still streaming_ld? (Though that may not happen.)

We have s.ctrl_addr_inport to indicate the current op's index; could that be used in this case?

Collaborator Author

Yeah, that's a good idea.
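
For illustration, a toy PyMTL3 sketch of this ctrl-index idea (hypothetical flattened names, not the PR's code): register the index and flag a new op when it changes, so back-to-back STREAM_LDs at different ctrl steps are still told apart:

```python
from pymtl3 import *

# Toy sketch: a change in the ctrl-memory index marks a new op even when
# two consecutive ops carry the same opcode.
class NewOpDetectSketch( Component ):
  def construct( s, CtrlAddrType = Bits4 ):
    s.recv_opt_val     = InPort (1)
    s.ctrl_addr_inport = InPort ( CtrlAddrType )
    s.new_op           = OutPort(1)

    s.ctrl_addr_reg = Wire( CtrlAddrType )

    @update
    def comb_detect():
      # Note: a real design also needs to handle the very first op, whose
      # index may happen to equal the register's reset value.
      s.new_op @= s.recv_opt_val & ( s.ctrl_addr_inport != s.ctrl_addr_reg )

    @update_ff
    def ff_track():
      if s.recv_opt_val:
        s.ctrl_addr_reg <<= s.ctrl_addr_inport
```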

Comment on lines +62 to +67
s.streaming_start_raddr = InPort(AddrType)
s.streaming_stride = InPort(AddrType)
s.streaming_end_raddr = InPort(AddrType)
# This is for blocking fu_crossbar and routing_crossbar
# when performing streaming LD operation.
s.streaming_done = OutPort(b1)
Owner

Why do we need these ports to be available for all FU units? Can we keep them specific to the ld/st unit?

Collaborator Author

Keeping them specific to the ld/st unit seems to hit a Verilog translation issue, so I added these ports for all FU units, just like the redundant interfaces for MemUnitRTL and PhiRTL in Fu.py:

# Redundant interface, only used by PhiRTL.
s.clear = InPort(b1)
# Components.
# Redundant interfaces for MemUnit
s.to_mem_raddr = SendIfcRTL(DataAddrType)
s.from_mem_rdata = RecvIfcRTL(DataType)
s.to_mem_waddr = SendIfcRTL(DataAddrType)
s.to_mem_wdata = SendIfcRTL(DataType)

Owner

I saw you connect streaming_xxx with the tile's wires via the CMD msg. Then it sounds like each tile can only be configured once.

Also, the stride/start/end can then only be filled in by the user, instead of by previous operations. I don't think this is correct?

Collaborator Author

I discussed with @ShangkunLi; the current scenario is an affine loop, where the (base, step, bound) of the streaming LD is fixed and should be configured statically.

Collaborator Author

I previously thought the streaming LD should be an operation that waits for its input operands (base, step, bound) from other operations. But anyway, let me know once you and Shangkun have reached an agreement.

Owner

> tile just wait

The tile doesn't just wait; it also sends out data, right?

1. What about a bit in the data pkt/format from data_mem_controller indicating streaming_done?

Collaborator Author

1. Yes, it sends out data, but the MemUnit is idle; we could have utilized it.
2. We might need an extra command to route this bit to the target tile, and this should be considered during CGRA mapping. Also, streaming_done may arrive several cycles later than expected if the tile is far away from the controller, but that's fine if we only use it to unblock the FU.

I'm okay with either the centralized (controller) or the distributed (tile) way.

Owner

The bit should be embedded by the mem_wrapper, or whichever component counts the number of requests.

Collaborator Author

> I saw you connect streaming_xxx with the tile's wires via the CMD msg. Then it sounds like each tile can only be configured once.
>
> Also, the stride/start/end can then only be filled in by the user, instead of by previous operations. I don't think this is correct?

How about we keep it in MemUnitRTL, so that we can update it in the future to let stride/start/end be filled by predecessor DFG nodes?

Owner

It is fine. But I don't like:

    # Interfaces for streaming LD.
    s.streaming_start_raddr = InPort(AddrType)
    s.streaming_stride = InPort(AddrType)
    s.streaming_end_raddr = InPort(AddrType)

I suggest the FU include a

s.recv_from_controller_pkt = RecvIfcRTL(CtrlPktType)

And whenever the cmd relates to streaming_xxx, the MemUnit's local wire/reg would be updated.

Moreover, let's have a standalone StreamingMemUnitRTL for this.
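
For illustration, a toy PyMTL3 sketch of this suggested direction (the cmd encodings and flattened packet fields below are made up, not an existing API): the streaming unit owns local registers and updates them whenever a controller packet carries a streaming-related cmd:

```python
from pymtl3 import *

# Toy sketch: local config registers updated via a controller packet,
# instead of per-FU streaming_xxx InPorts. The packet is flattened into
# val/cmd/data/rdy signals here; the cmd codes are hypothetical.
class StreamingCfgSketch( Component ):
  def construct( s, AddrType = Bits32, CmdType = Bits4 ):
    s.pkt_val  = InPort (1)
    s.pkt_cmd  = InPort ( CmdType )
    s.pkt_data = InPort ( AddrType )
    s.pkt_rdy  = OutPort(1)

    # Local registers owned by the streaming mem unit.
    s.streaming_start_raddr = Wire( AddrType )
    s.streaming_stride      = Wire( AddrType )
    s.streaming_end_raddr   = Wire( AddrType )

    # Hypothetical cmd encodings, purely for this sketch.
    CMD_CFG_START, CMD_CFG_STRIDE, CMD_CFG_END = 1, 2, 3

    @update
    def comb_rdy():
      s.pkt_rdy @= 1   # config registers can always absorb an update

    @update_ff
    def ff_cfg():
      if s.pkt_val & s.pkt_rdy:
        if s.pkt_cmd == CMD_CFG_START:
          s.streaming_start_raddr <<= s.pkt_data
        elif s.pkt_cmd == CMD_CFG_STRIDE:
          s.streaming_stride <<= s.pkt_data
        elif s.pkt_cmd == CMD_CFG_END:
          s.streaming_end_raddr <<= s.pkt_data
```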

@yyan7223 yyan7223 requested a review from tancheng January 13, 2026 02:09