Conversation

@microsoft-github-policy-service agree

why not convert

While the two gather ops are both gathers, my understanding is that they operate at different levels (@python3kgae can correct me). tts::GatherOp is from

Hi @bmyerz0, just wanted to check if my understanding is correct: the current lowering of triton::GatherOp to linalg is fine, and the only concern is the naming of tts::GatherOp to avoid confusion. Please let me know if there’s anything else I should address. Thanks!

Yes, I do not think it is good to lower triton::GatherOp to tts::GatherOp. The suggestion is to lower triton::GatherOp directly to linalg.
But on some NPU hardware, a 'gather' op is needed. I thought triton-shared should provide a middle-layer dialect to preserve this semantic. If we directly convert 'triton gather' to 'linalg.generic', we will lose it. Could we consider 'tts gather'?

Yes, for the sake of backend flexibility, I think it is reasonable to make any tts op's lowering to linalg.generic optional. But we should still provide the lowering. And I think tts should include two different ops: one from tl.load and one from tl.gather.
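The distinction between the two ops can be sketched in NumPy (names, values, and shapes here are illustrative, not triton-shared APIs): a tl.load-style gather reads arbitrary flat offsets out of a buffer, while tl.gather indexes along one axis of a tensor, which is what triton::GatherOp expresses.

```python
import numpy as np

# tl.load-style gather: arbitrary flat offsets into a buffer,
# the pattern tts::GatherOp was originally introduced for.
memory = np.arange(16, dtype=np.float32)          # flat buffer
offsets = np.array([3, 0, 7, 7])
loaded = memory[offsets]                          # -> [3., 0., 7., 7.]

# tl.gather-style gather: structured indexing along one axis of a
# tensor, matching triton::GatherOp semantics:
#   out[i, j] = src[idx[i, j], j]   for axis=0
src = np.arange(12, dtype=np.float32).reshape(3, 4)
idx = np.array([[2, 1, 0, 0]])                    # shape (1, 4)
gathered = np.take_along_axis(src, idx, axis=0)   # -> [[8., 5., 2., 3.]]
```

The key design difference: the load-style gather needs pointer/offset analysis to recover structure, while the axis gather carries its structure in the op itself.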

Maybe we should use tts.gather for tl.load and introduce a new ttx.gather for tl.gather?

The current PR implements a fully functional lowering for triton::GatherOp on the CPU backend. Would it be possible to merge this PR first? Support for a ttx.gather op for NPU hardware can be addressed separately, if needed.
    ([32], [64], 0),
    ([4, 4], [8, 4], 0),
    ([128, 64], [256, 64], 0),
    ([128, 64], [128, 128], 1),
These all appear to increase the size of the tensor. Can you add test cases that contract the size of the tensor?
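A gather's output shape matches the index tensor's shape, so a contracting case just needs an index tensor smaller than the source along the gathered axis. A minimal NumPy shape sketch, mirroring the ([src_shape], [idx_shape], axis) test tuples above (the contracting shape (32, 64) is a hypothetical example):

```python
import numpy as np

# The gather result always takes the index tensor's shape.
src = np.zeros((128, 64))

expand = np.zeros((256, 64), dtype=np.int64)    # axis 0: 128 -> 256
contract = np.zeros((32, 64), dtype=np.int64)   # axis 0: 128 -> 32

out_expand = np.take_along_axis(src, expand, axis=0)
out_contract = np.take_along_axis(src, contract, axis=0)

assert out_expand.shape == (256, 64)
assert out_contract.shape == (32, 64)
```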
I think the direct triton::GatherOp-to-linalg lowering that you have is a good solution. It mirrors what we do for other ops on tensors.
Added support for triton::GatherOp conversion. Using cf.assert for out-of-bound indices, consistent with the CUDA backend behavior (which triggers a device assert).
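A sketch of the runtime check that the emitted cf.assert models, in NumPy (the checked_gather helper is hypothetical, for illustration only):

```python
import numpy as np

def checked_gather(src, idx, axis):
    """Hypothetical model of the lowered gather: validate indices
    (the role of the emitted cf.assert), then gather along `axis`."""
    bound = src.shape[axis]
    # On the CUDA backend, this check surfaces as a device-side assert.
    assert ((idx >= 0) & (idx < bound)).all(), "gather index out of bounds"
    return np.take_along_axis(src, idx, axis)

src = np.arange(8.0).reshape(2, 4)                # [[0,1,2,3],[4,5,6,7]]
ok = checked_gather(src, np.array([[1, 0, 1, 0]]), axis=0)  # [[4.,1.,6.,3.]]
```

An out-of-range index (e.g. 5 with only 2 rows) trips the assertion instead of reading out of bounds.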