-
Notifications
You must be signed in to change notification settings - Fork 27
Rethinking about streaming LD #249
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
What's the functionality of #235? |
PR#235 modifies |
| @@ -0,0 +1,345 @@ | |||
| """ | |||
| ========================================================================== | |||
| CgraRTL_fir_test.py | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update comment
| kTotalCtrlSteps = 3 | ||
| src_ctrl_pkt = [] | ||
| src_opt_pkt = [ | ||
| # tile 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Briefly describe what the test does.
Either should be fine. If going with this PR's strategy, we need to make sure if the consumer cannot consume the streamed data in time, the uncompleted reading is being blocked. |
Thanks for reminding, now we have: So if the consumer tile cannot consume the streamed data in time, its input channel becomes full. Then producer's MemUnitRTL has |
Hmm, that's actually a problem.. If the memUnit perform this |
Ok then we can achieve the pipeline by performing |
Sounds good to me. |
fu/single/MemUnitRTL.py
Outdated
| s.to_mem_raddr.val @= s.streaming_status & ~s.already_sent_raddr | ||
| # Keep issuing new LD requests as long as address buffer has free space, | ||
| # so that LD requests can be processed in pipeline. | ||
| s.to_mem_raddr.val @= s.streaming_status & s.to_mem_raddr.rdy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is correct. streaming_status depends on streaming_done. But streaming_done depends on s.streaming_results_consumed_counter == (s.streaming_end_raddr - s.streaming_start_raddr) // s.streaming_stride.
What if the requests already sent out but streaming_results_consumed_counter not yet full? We seem wrongly send out more than enough/required requests towards memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh……I see, maybe we should also have a streaming_requests_sent_counter
fu/single/MemUnitRTL.py
Outdated
| s.streaming_done @= s.from_mem_rdata.val & (s.streaming_raddr == s.streaming_end_raddr) | ||
| # Streaming LD is done when the last streaming result is consumed. | ||
| s.streaming_done @= (s.streaming_results_consumed_counter == \ | ||
| (s.streaming_end_raddr - s.streaming_start_raddr) // s.streaming_stride) &\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think // is synthesizable. maybe we need to increment counter using streaming_stride
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I change it back to increment counter
But DivRTL.py may also have synthesize problem,
VectorCGRA/fu/single/DivRTL.py
Lines 70 to 75 in 2d40b3a
| if s.recv_opt.val: | |
| if s.recv_opt.msg.operation == OPT_DIV: | |
| s.send_out[0].msg.payload @= s.recv_in[s.in0_idx].msg.payload // s.recv_in[s.in1_idx].msg.payload | |
| s.send_out[0].msg.predicate @= s.recv_in[s.in0_idx].msg.predicate & \ | |
| s.recv_in[s.in1_idx].msg.predicate & \ | |
| s.reached_vector_factor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, so we use @HobbitQia's Verilog DIV for synthesis:
| fuType2RTL["Div" ] = ExclusiveDivRTL |
fu/single/MemUnitRTL.py
Outdated
|
|
||
| @update_ff | ||
| def update_operation_reg(): | ||
| s.operation_reg <<= trunc(s.recv_opt.msg.operation, operation_nbits) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Relying on whether the next op is streaming_ld, is not safe. any other idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently not, previously I detected the rising edge of recv_opt.val and whether recv_opt.msg==OPT_STREAM_LD to decide whether we should enter the streaming status, but found that recv_opt.val kept high even if recv_opt.msg changed at adjacent cycles. So that's why now we detect whether recv_opt.msg changes.
Any example to illustrate why it is not safe? Both recv_opt.msg and operation_reg are registers value, which should be stable within the whole clock cycle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if next op is still streaming_ld? (though may not happen)
We have s.ctrl_addr_inport indicate the current op's index, which can be used in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's a good idea.
| s.streaming_start_raddr = InPort(AddrType) | ||
| s.streaming_stride = InPort(AddrType) | ||
| s.streaming_end_raddr = InPort(AddrType) | ||
| # This is for blocking fu_crossbar and routing_crossbar | ||
| # when performing streaming LD operation. | ||
| s.streaming_done = OutPort(b1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need these port be available for all FU units? Can we keep them be specific for ld/st unit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep them specific for ld/st unit seems to have the verilog translation issue. So I add these ports for all FU units, just like those redundant interfaces for MemUnitRTL and PhiRTL in Fu.py.
Lines 53 to 61 in 2d40b3a
| # Redundant interface, only used by PhiRTL. | |
| s.clear = InPort(b1) | |
| # Components. | |
| # Redundant interfaces for MemUnit | |
| s.to_mem_raddr = SendIfcRTL(DataAddrType) | |
| s.from_mem_rdata = RecvIfcRTL(DataType) | |
| s.to_mem_waddr = SendIfcRTL(DataAddrType) | |
| s.to_mem_wdata = SendIfcRTL(DataType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw you connect streaming_xxx with tile's wires via CMD msg. Then it sounds each tile can only be configured once.
Also, then the stride/start/end can only be filled by user, instead of previous operations. I don't think this is correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I discussed with @ShangkunLi , current scenario is affine loop, where (base, step, bound) of streaming LD is fixed and should be configured statically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I previously thoughput about streaming LD should be a operation that waits for its input operands (base, step, bound) from other operations. But anyway, let me know if you and Shangkun have made an agreement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tile just wait
Tile not just waits, it also sends out data, right?
- What about a bit in data pkg/format from data_mem_controller indicating
streaming_done?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Yes, it sends out data, but MemUnit is idle, we could have utilized it.
- We might need extra command to route this bit to the target tile, this should be considered during CGRA mapping. And streaming_done may be several cycles late than expected if tile is far away from controller, but it's fine if we only use it to unblock the FU.
I'm okay with either centralized (controller) or distributed (tile) ways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bit should be embedded by the mem_wrapper or whom counting the number of requests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw you connect streaming_xxx with tile's wires via CMD msg. Then it sounds each tile can only be configured once.
Also, then the stride/start/end can only be filled by user, instead of previous operations. I don't think this is correct?
How about we keep it in MemUnitRTL, so that we can update it in the future to let stride/start/end filled by predecessor DFG nodes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is fine. But I don't like:
# Interfaces for streaming LD.
s.streaming_start_raddr = InPort(AddrType)
s.streaming_stride = InPort(AddrType)
s.streaming_end_raddr = InPort(AddrType)
I suggest the FU includes a
Line 73 in 9271efa
| s.recv_from_controller_pkt = RecvIfcRTL(CtrlPktType) |
And whenever the the cmd related to streaming_xxx, the memUnit's local wire/reg would be updated.
Moreover, let's have a standalone StreamingMemUnitRTL for this.
After talking with @ShangkunLi , I re-design the streaming LD, here is the timing graph:

As you can see,
recv_opt.rdykeeps to be 0 until streaming LD finishes, which meanselementwill block itself during streaming LD. Thefu_crossbarandrouting_crossbarwill also be blocked.@ShangkunLi , we can perform the streaming LD by simply configuring the (base, step, bound)


and sending a
OPT_STREAM_LD(it does not consume any operands so routing_crossbar is idle).Please check
cgra/test/CgraRTL_streaming_read_test.pyfor configuration about streaming LD, feel free to contact me if there are any concerns.@tancheng , I think we should deprecate the PR#235 (previous design of streaming LD) , now we do the streaming LD control directly in the
MemUnitRTL.