Permuting after loading from `PermutedDimsArray` for coalesced access #71

I was messing around with a batched matmul kernel and noticed a 2-4x slowdown when passing a `PermutedDimsArray`. I then wrote a version that takes the underlying (parent) array and permutes each tile after loading it, which came much closer to the contiguous baseline. I assume the difference is that this version can load with coalesced access and do the permutation within registers.
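To make the two access patterns concrete, here is a minimal CPU-only sketch in plain Julia. It is not the kernel from this issue: `load_tile_via_parent` is a hypothetical helper, it only handles the 2-D transposed case, and on a GPU the permutation would happen on a tile held in registers rather than on a heap-allocated array.

```julia
# CPU-only illustration with Base Julia; no GPU or kernel library involved.
A = reshape(collect(Float32, 1:24), 4, 6)   # column-major parent, strides (1, 4)
P = PermutedDimsArray(A, (2, 1))            # lazy 6x4 view of A, strides (4, 1)

# Hypothetical helper: read a tile of P by loading the matching tile from
# the parent (unit stride along the parent's first dimension), then
# permuting the small tile afterwards.
function load_tile_via_parent(P::PermutedDimsArray{T,2,(2,1)}, rows, cols) where {T}
    tile = parent(P)[cols, rows]       # contiguous along the parent's first dim
    return permutedims(tile, (2, 1))   # permute after the load
end

direct  = P[1:3, 1:2]                          # strided reads through the view
viaperm = load_tile_via_parent(P, 1:3, 1:2)    # contiguous reads, then permute

@assert strides(P) == (4, 1)
@assert direct == viaperm
```

Both paths return the same tile; the only difference is the order in which memory is touched, which is what would matter for coalescing.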

After #69, would it be possible to use type information like `StrideDivBy` to load non-contiguous data as contiguous and permute *after* the load itself? I'm not sure how this relates to this sentence from the docstring: `stride_div_by[i] = 4 means stride[i] is divisible by 4 (enables vectorized access)`. I'm also not sure which optimizations the compiler can already make.
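As a rough sketch of the stride information such an optimization would need, the snippet below derives the memory order and the divisibility property from `strides` in plain Julia. `memory_order` and `divisible_by` are hypothetical helpers written for this illustration; they are not part of `StrideDivBy` or any library API, and they say nothing about what the compiler does today.

```julia
# Hypothetical helpers, not part of any library API.

# Dimension order from fastest- to slowest-varying in memory.
memory_order(A) = Tuple(sortperm(collect(strides(A))))

# True where a stride is a multiple of `n` (the property the docstring describes).
divisible_by(A, n) = map(s -> s % n == 0, strides(A))

A = zeros(Float32, 8, 16, 4)           # strides (1, 8, 128)
P = PermutedDimsArray(A, (2, 1, 3))    # strides (8, 1, 128)

memory_order(P)      # (2, 1, 3): dimension 2 is the unit-stride one
divisible_by(P, 4)   # (true, false, true)
```

In this example the unit-stride dimension of `P` is its second one, so a load that wants contiguous access would have to iterate fastest along dimension 2 and permute the tile afterwards.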
