
Conversation

@yyan7223
Collaborator

After talking with @ShangkunLi, I redesigned the streaming LD; here is the timing diagram:
[timing diagram]
As you can see, recv_opt.rdy stays 0 until the streaming LD finishes, which means the element blocks itself during the streaming LD. The fu_crossbar and routing_crossbar are also blocked.

@ShangkunLi, we can perform the streaming LD simply by configuring (base, step, bound)
[image: (base, step, bound) configuration]
and sending an OPT_STREAM_LD (it does not consume any operands, so the routing_crossbar stays idle).
[image]
Please check cgra/test/CgraRTL_streaming_read_test.py for the streaming LD configuration; feel free to contact me if there are any concerns.

@tancheng, I think we should deprecate PR #235 (the previous streaming LD design); we now do the streaming LD control directly in MemUnitRTL.

@tancheng
Owner

What's the functionality of #235? recv_opt.rdy stays 0 as well, doesn't it?

@yyan7223
Collaborator Author

> What's the functionality of #235? recv_opt.rdy stays 0 as well, doesn't it?

PR #235 modifies DataMemControllerRTL and DataMemWrapperRTL to support streaming LD, but I found it also needs to interact with the FU to hold recv_opt.rdy at 0 during streaming, which is not that elegant.
So we redesigned the streaming LD from scratch and control recv_opt.rdy directly in MemUnitRTL.

@@ -0,0 +1,345 @@
"""
==========================================================================
CgraRTL_fir_test.py
Owner

update comment

kTotalCtrlSteps = 3
src_ctrl_pkt = []
src_opt_pkt = [
  # tile 0
Owner

Briefly describe what the test does.

@tancheng
Owner

> PR #235 modifies DataMemControllerRTL and DataMemWrapperRTL to support streaming LD, but I found it also needs to interact with the FU to hold recv_opt.rdy at 0 during streaming, which is not that elegant. So we redesigned the streaming LD from scratch and control recv_opt.rdy directly in MemUnitRTL.

Either should be fine. If going with this PR's strategy, we need to make sure that if the consumer cannot consume the streamed data in time, the uncompleted reading is blocked.

@yyan7223
Collaborator Author

> Either should be fine. If going with this PR's strategy, we need to make sure that if the consumer cannot consume the streamed data in time, the uncompleted reading is blocked.

Thanks for the reminder; now we have:

  elif s.from_mem_rdata.val & s.from_mem_rdata.rdy:
    # Moves to next address only when current read finishes.
    s.streaming_raddr <<= s.streaming_raddr + s.streaming_stride

So if the consumer tile cannot consume the streamed data in time, its input channel becomes full. The producer's MemUnitRTL then has s.send_out[0].rdy = 0 and s.from_mem_rdata.rdy = 0, which blocks the uncompleted reading by stopping the increment of s.streaming_raddr.
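
For illustration, a minimal standalone PyMTL3 sketch of this backpressure chain (a toy component with hypothetical flattened signal names, not the PR's actual code):

```python
from pymtl3 import *

# Toy model of the blocking behavior described above: consumer-side rdy
# gates acceptance of memory responses, and a stalled consumer freezes
# the streaming address register.
class BackpressureSketch( Component ):
  def construct( s, AddrType = Bits32 ):
    s.send_out_rdy       = InPort (1)          # consumer channel has space
    s.from_mem_rdata_val = InPort (1)          # memory response is valid
    s.from_mem_rdata_rdy = OutPort(1)          # we can accept the response
    s.streaming_stride   = InPort ( AddrType )
    s.streaming_raddr    = Wire   ( AddrType )

    @update
    def comb_rdy():
      # Accept a response only when the outgoing channel can take it.
      s.from_mem_rdata_rdy @= s.send_out_rdy

    @update_ff
    def ff_raddr():
      if s.from_mem_rdata_val & s.from_mem_rdata_rdy:
        # Advance only when the current read completes; otherwise hold.
        s.streaming_raddr <<= s.streaming_raddr + s.streaming_stride
```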

@yyan7223 yyan7223 requested a review from tancheng January 10, 2026 10:52
@tancheng
Owner

> So if the consumer tile cannot consume the streamed data in time, its input channel becomes full. The producer's MemUnitRTL then has s.send_out[0].rdy = 0 and s.from_mem_rdata.rdy = 0, which blocks the uncompleted reading by stopping the increment of s.streaming_raddr.

Hmm, that's actually a problem. If the MemUnit performs this s.streaming_raddr + s.streaming_stride only when a response completes, the second request is blocked by the first request's response. They are not pipelined. Your previous PR is more pipelinable.

@yyan7223
Collaborator Author

> Hmm, that's actually a problem. If the MemUnit performs this s.streaming_raddr + s.streaming_stride only when a response completes, the second request is blocked by the first request's response. They are not pipelined. Your previous PR is more pipelinable.

OK, then we can achieve pipelining by performing s.streaming_raddr + s.streaming_stride as long as to_mem_raddr.rdy == 1, and use a counter to decide when to quit streaming. For example, if the streaming LD requests 4 data in total, we can increment the counter every time s.from_mem_rdata.val & s.from_mem_rdata.rdy, and quit streaming when the counter reaches 3. What do you think?
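
For illustration, a toy PyMTL3 sketch of this quit-by-counter scheme (hypothetical standalone component with flattened signal names, not the PR's code), assuming a fixed total of kTotal reads:

```python
from pymtl3 import *

# Toy sketch: count consumed results and leave the streaming state once
# the expected total has been consumed.
class QuitCounterSketch( Component ):
  def construct( s, kTotal = 4, CountType = Bits8 ):
    s.from_mem_rdata_val = InPort (1)
    s.from_mem_rdata_rdy = InPort (1)
    s.streaming_status   = OutPort(1)   # 1 while the stream is in flight

    s.consumed = Wire( CountType )

    @update
    def comb_status():
      # Still streaming until all kTotal results have been consumed.
      s.streaming_status @= s.consumed < kTotal

    @update_ff
    def ff_count():
      if s.from_mem_rdata_val & s.from_mem_rdata_rdy:
        s.consumed <<= s.consumed + 1
```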

@tancheng
Owner

> OK, then we can achieve pipelining by performing s.streaming_raddr + s.streaming_stride as long as to_mem_raddr.rdy == 1, and use a counter to decide when to quit streaming.

Sounds good to me.

s.to_mem_raddr.val @= s.streaming_status & ~s.already_sent_raddr
# Keep issuing new LD requests as long as address buffer has free space,
# so that LD requests can be processed in pipeline.
s.to_mem_raddr.val @= s.streaming_status & s.to_mem_raddr.rdy
Owner

I don't think this is correct. streaming_status depends on streaming_done. But streaming_done depends on s.streaming_results_consumed_counter == (s.streaming_end_raddr - s.streaming_start_raddr) // s.streaming_stride.

What if the requests have already been sent out but streaming_results_consumed_counter is not yet full? We would then wrongly send out more than the required number of requests towards memory.

Collaborator Author

Oh, I see. Maybe we should also have a streaming_requests_sent_counter.
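
For illustration, a toy PyMTL3 sketch of that refinement (hypothetical flattened names, not the PR's code): a second counter tracks issued requests, so request issue stays pipelined without overshooting, while done still waits for consumption:

```python
from pymtl3 import *

# Toy sketch: requests_sent gates issuing (pipelined, never more than
# required), while results_consumed decides completion.
class TwoCounterSketch( Component ):
  def construct( s, CountType = Bits8 ):
    s.streaming_total    = InPort ( CountType )  # total reads to perform
    s.to_mem_raddr_rdy   = InPort (1)
    s.from_mem_rdata_val = InPort (1)
    s.from_mem_rdata_rdy = InPort (1)
    s.to_mem_raddr_val   = OutPort(1)
    s.streaming_done     = OutPort(1)

    s.requests_sent    = Wire( CountType )
    s.results_consumed = Wire( CountType )

    @update
    def comb():
      # Keep issuing only while some addresses are still unrequested.
      s.to_mem_raddr_val @= s.requests_sent < s.streaming_total
      # Done only once every streamed result has been consumed.
      s.streaming_done @= s.results_consumed == s.streaming_total

    @update_ff
    def ff():
      if s.to_mem_raddr_val & s.to_mem_raddr_rdy:
        s.requests_sent <<= s.requests_sent + 1
      if s.from_mem_rdata_val & s.from_mem_rdata_rdy:
        s.results_consumed <<= s.results_consumed + 1
```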

s.streaming_done @= s.from_mem_rdata.val & (s.streaming_raddr == s.streaming_end_raddr)
# Streaming LD is done when the last streaming result is consumed.
s.streaming_done @= (s.streaming_results_consumed_counter == \
(s.streaming_end_raddr - s.streaming_start_raddr) // s.streaming_stride) &\
Owner

I don't think // is synthesizable. Maybe we need to increment the counter using streaming_stride instead.

Collaborator Author

OK, I changed it back to incrementing the counter.
But DivRTL.py may also have a synthesis problem:

if s.recv_opt.val:
  if s.recv_opt.msg.operation == OPT_DIV:
    s.send_out[0].msg.payload @= s.recv_in[s.in0_idx].msg.payload // s.recv_in[s.in1_idx].msg.payload
    s.send_out[0].msg.predicate @= s.recv_in[s.in0_idx].msg.predicate & \
                                   s.recv_in[s.in1_idx].msg.predicate & \
                                   s.reached_vector_factor

Owner

Yes, so we use @HobbitQia's Verilog DIV for synthesis:

fuType2RTL["Div" ] = ExclusiveDivRTL
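
For illustration, a toy PyMTL3 sketch of the increment-by-stride alternative discussed above (hypothetical flattened names, not the PR's code): walk an address register by streaming_stride and compare for equality, so no division is needed:

```python
from pymtl3 import *

# Toy sketch: replace (end - start) // stride with an address register
# that walks by stride; equality with the end address signals done.
class StrideDoneSketch( Component ):
  def construct( s, AddrType = Bits32 ):
    s.start_pulse           = InPort (1)   # begin a new stream
    s.consume_pulse         = InPort (1)   # one streamed result consumed
    s.streaming_start_raddr = InPort ( AddrType )
    s.streaming_stride      = InPort ( AddrType )
    s.streaming_end_raddr   = InPort ( AddrType )
    s.streaming_done        = OutPort(1)

    s.consumed_raddr = Wire( AddrType )

    @update
    def comb_done():
      # Pure equality check: synthesizable, no division involved.
      s.streaming_done @= s.consumed_raddr == s.streaming_end_raddr

    @update_ff
    def ff_walk():
      if s.start_pulse:
        s.consumed_raddr <<= s.streaming_start_raddr
      elif s.consume_pulse & ~s.streaming_done:
        s.consumed_raddr <<= s.consumed_raddr + s.streaming_stride
```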

@yyan7223 yyan7223 requested a review from tancheng January 12, 2026 07:31

@update_ff
def update_operation_reg():
  s.operation_reg <<= trunc(s.recv_opt.msg.operation, operation_nbits)
Owner

Relying on whether the next op is streaming_ld is not safe. Any other idea?

Collaborator Author

Currently not. Previously I detected the rising edge of recv_opt.val and checked whether recv_opt.msg == OPT_STREAM_LD to decide whether to enter the streaming status, but I found that recv_opt.val stays high even when recv_opt.msg changes in adjacent cycles. That's why we now detect whether recv_opt.msg changes.

Do you have an example illustrating why it is not safe? Both recv_opt.msg and operation_reg are register values, which should be stable within the whole clock cycle.

Owner

What if the next op is still streaming_ld? (Though that may not happen.)

We have s.ctrl_addr_inport to indicate the current op's index; could that be used in this case?

Collaborator Author

Yeah, that's a good idea.
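
For illustration, a toy PyMTL3 sketch of this ctrl-index idea (hypothetical flattened names, not the PR's code): register the index and flag a new op when it changes, so back-to-back STREAM_LDs at different ctrl steps are still told apart:

```python
from pymtl3 import *

# Toy sketch: a change in the ctrl-memory index marks a new op even when
# two consecutive ops carry the same opcode.
class NewOpDetectSketch( Component ):
  def construct( s, CtrlAddrType = Bits4 ):
    s.recv_opt_val     = InPort (1)
    s.ctrl_addr_inport = InPort ( CtrlAddrType )
    s.new_op           = OutPort(1)

    s.ctrl_addr_reg = Wire( CtrlAddrType )

    @update
    def comb_detect():
      # Note: a real design also needs to handle the very first op, whose
      # index may happen to equal the register's reset value.
      s.new_op @= s.recv_opt_val & ( s.ctrl_addr_inport != s.ctrl_addr_reg )

    @update_ff
    def ff_track():
      if s.recv_opt_val:
        s.ctrl_addr_reg <<= s.ctrl_addr_inport
```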

Comment on lines +62 to +67
s.streaming_start_raddr = InPort(AddrType)
s.streaming_stride = InPort(AddrType)
s.streaming_end_raddr = InPort(AddrType)
# This is for blocking fu_crossbar and routing_crossbar
# when performing streaming LD operation.
s.streaming_done = OutPort(b1)
Owner

Why do we need these ports to be available for all FU units? Can we keep them specific to the ld/st unit?

Collaborator Author

Keeping them specific to the ld/st unit seems to hit a Verilog translation issue, so I added these ports for all FU units, just like the redundant interfaces for MemUnitRTL and PhiRTL in Fu.py:

# Redundant interface, only used by PhiRTL.
s.clear = InPort(b1)
# Components.
# Redundant interfaces for MemUnit
s.to_mem_raddr = SendIfcRTL(DataAddrType)
s.from_mem_rdata = RecvIfcRTL(DataType)
s.to_mem_waddr = SendIfcRTL(DataAddrType)
s.to_mem_wdata = SendIfcRTL(DataType)

Owner

I saw you connect streaming_xxx with the tile's wires via the CMD msg. Then it sounds like each tile can only be configured once.

Also, the stride/start/end can then only be filled in by the user, instead of by previous operations. I don't think this is correct?

Collaborator Author

I discussed with @ShangkunLi; the current scenario is an affine loop, where the (base, step, bound) of the streaming LD is fixed and should be configured statically.

Collaborator Author

I previously thought the streaming LD should be an operation that waits for its input operands (base, step, bound) from other operations. But anyway, let me know once you and Shangkun have reached an agreement.

Owner

> tile just wait

The tile doesn't just wait; it also sends out data, right?

1. What about a bit in the data pkt/format from data_mem_controller indicating streaming_done?

Collaborator Author

1. Yes, it sends out data, but the MemUnit is idle; we could have utilized it.
2. We might need an extra command to route this bit to the target tile, and this should be considered during CGRA mapping. Also, streaming_done may arrive several cycles later than expected if the tile is far away from the controller, but that's fine if we only use it to unblock the FU.

I'm okay with either the centralized (controller) or the distributed (tile) way.

Owner

The bit should be embedded by the mem_wrapper, or whichever component counts the number of requests.

Collaborator Author

> I saw you connect streaming_xxx with the tile's wires via the CMD msg. Then it sounds like each tile can only be configured once.
>
> Also, the stride/start/end can then only be filled in by the user, instead of by previous operations. I don't think this is correct?

How about we keep it in MemUnitRTL, so that we can update it in the future to let stride/start/end be filled by predecessor DFG nodes?

Owner

It is fine. But I don't like:

    # Interfaces for streaming LD.
    s.streaming_start_raddr = InPort(AddrType)
    s.streaming_stride = InPort(AddrType)
    s.streaming_end_raddr = InPort(AddrType)

I suggest the FU include a

s.recv_from_controller_pkt = RecvIfcRTL(CtrlPktType)

And whenever the cmd relates to streaming_xxx, the MemUnit's local wire/reg would be updated.

Moreover, let's have a standalone StreamingMemUnitRTL for this.
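
For illustration, a toy PyMTL3 sketch of this suggested direction (the cmd encodings and flattened packet fields below are made up, not an existing API): the streaming unit owns local registers and updates them whenever a controller packet carries a streaming-related cmd:

```python
from pymtl3 import *

# Toy sketch: local config registers updated via a controller packet,
# instead of per-FU streaming_xxx InPorts. The packet is flattened into
# val/cmd/data/rdy signals here; the cmd codes are hypothetical.
class StreamingCfgSketch( Component ):
  def construct( s, AddrType = Bits32, CmdType = Bits4 ):
    s.pkt_val  = InPort (1)
    s.pkt_cmd  = InPort ( CmdType )
    s.pkt_data = InPort ( AddrType )
    s.pkt_rdy  = OutPort(1)

    # Local registers owned by the streaming mem unit.
    s.streaming_start_raddr = Wire( AddrType )
    s.streaming_stride      = Wire( AddrType )
    s.streaming_end_raddr   = Wire( AddrType )

    # Hypothetical cmd encodings, purely for this sketch.
    CMD_CFG_START, CMD_CFG_STRIDE, CMD_CFG_END = 1, 2, 3

    @update
    def comb_rdy():
      s.pkt_rdy @= 1   # config registers can always absorb an update

    @update_ff
    def ff_cfg():
      if s.pkt_val & s.pkt_rdy:
        if s.pkt_cmd == CMD_CFG_START:
          s.streaming_start_raddr <<= s.pkt_data
        elif s.pkt_cmd == CMD_CFG_STRIDE:
          s.streaming_stride <<= s.pkt_data
        elif s.pkt_cmd == CMD_CFG_END:
          s.streaming_end_raddr <<= s.pkt_data
```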

@yyan7223 yyan7223 requested a review from tancheng January 13, 2026 02:09