Permuting after loading from PermutedDimsArray for coalesced access #71

@AntonOresten

I was messing around with a batched matmul kernel and noticed a 2-4x slowdown when passing a PermutedDimsArray. I then wrote a version that takes the underlying array and permutes the tile after loading, which was much closer to the contiguous baseline. I assume the difference is that this version can load with coalesced access and do the permutation within registers.
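To make the two access patterns concrete, here is a minimal CPU-only sketch in Base Julia (no GPU or cuTile.jl calls; the array sizes and variable names are made up for illustration). It shows that the permuted view moves the unit stride off the first dimension, and that loading the matching tile from the parent and permuting it afterwards yields the same values:

```julia
# Plain Base Julia on the CPU; sizes and names are illustrative assumptions.
A = reshape(collect(Float32, 1:48), 4, 6, 2)   # column-major parent array
P = PermutedDimsArray(A, (2, 1, 3))            # lazy permuted view, size (6, 4, 2)

@show strides(A)   # (1, 4, 24): unit stride along dim 1
@show strides(P)   # (4, 1, 24): unit stride has moved to dim 2

# Path 1: read a 3x2 tile through the permuted view.
# Consecutive elements along dim 1 are 4 elements apart in memory.
tile_view = P[1:3, 1:2, 1]

# Path 2: read the matching 2x3 tile from the parent, where dim 1 is
# contiguous, then permute the small tile after the load.
tile_parent = A[1:2, 1:3, 1]
tile_permuted = permutedims(tile_parent, (2, 1))

@assert tile_view == tile_permuted   # both paths produce the same tile
```

This only demonstrates that the two paths are equivalent in value. Whether the second path actually results in coalesced loads on the GPU depends on the kernel and the compiler, which this sketch does not model.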

After #69, would it be possible to use type information like StrideDivBy to load non-contiguous data as if it were contiguous and do the permutation after the load? I'm not sure how this relates to this sentence from the docstring: "stride_div_by[i] = 4 means stride[i] is divisible by 4 (enables vectorized access)", or which of these optimizations the compiler can already make.
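As a rough illustration of why I'm unsure, the sketch below (Base Julia only) compares a "divisible by 4" fact per stride with the position of the unit stride. The helpers div4 and unit_dim are hypothetical names for this example and not part of any library; the shapes are arbitrary:

```julia
# Hypothetical helpers for illustration only, not an existing API.
div4(X) = map(s -> s % 4 == 0, strides(X))                # per-dim "divisible by 4"
unit_dim(X) = findfirst(==(1), collect(strides(X)))       # dim with stride 1, if any

A = zeros(Float32, 8, 16, 4)
P = PermutedDimsArray(A, (2, 1, 3))

@show strides(A) div4(A) unit_dim(A)
@show strides(P) div4(P) unit_dim(P)
```

For the permuted view, the first stride is divisible by 4 but is not the unit stride, so, at least in this toy case, the divisibility information on its own does not say which dimension is contiguous. That seems to be the piece a load-then-permute strategy would need.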
