I was experimenting with a batched matmul kernel and noticed a 2-4x slowdown when passing a PermutedDimsArray. I then wrote a version that takes the underlying array and permutes each tile after loading it, which was much closer to the contiguous baseline. I assume the difference is that this version can load with coalesced access and do the permutation within registers.
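To make the two approaches concrete, here is a minimal CPU-only sketch in plain Julia. It only shows that reading a tile through the permuted view gives the same values as reading the matching tile from the parent and permuting it afterwards; it says nothing about GPU performance. The array sizes and tile ranges are made up for illustration, and the permutation (2, 1, 3) is its own inverse, which keeps the range bookkeeping simple.

```julia
# CPU-only sketch in plain Julia; it shows the indexing equivalence, not GPU performance.
A = reshape(collect(Float32, 1:8*6*2), 8, 6, 2)   # contiguous parent array
perm = (2, 1, 3)                                  # swap the first two dims (self-inverse)
P = PermutedDimsArray(A, perm)                    # lazy permuted view, size (6, 8, 2)

# Tile ranges expressed in the view's coordinates (hypothetical tile size).
rows, cols, batch = 1:3, 1:4, 1:1

# Path 1: read the tile through the permuted view (strided reads from the parent).
tile_via_view = P[rows, cols, batch]

# Path 2: read the matching contiguous tile from the parent, then permute it.
tile_via_parent = permutedims(A[cols, rows, batch], perm)

println(tile_via_view == tile_via_parent)
```

In a kernel, the second path corresponds to a coalesced load followed by an on-chip permute, which is the variant that came out close to the contiguous baseline for me.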
After #69, would it be possible to use type information like StrideDivBy to load non-contiguous data as if it were contiguous and do the permutation after the load? I'm not sure how this relates to this sentence from the docstring: "stride_div_by[i] = 4 means stride[i] is divisible by 4 (enables vectorized access)", or which optimizations the compiler can already make.
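For reference, this is the stride information I have in mind, sketched with Base Julia only. The helper divisible_by is hypothetical and is not the package's StrideDivBy; it just checks divisibility on the strides that Base reports (in elements) for the parent and for the permuted view.

```julia
# Illustrative only: plain Julia on the CPU, no GPU packages involved.
A = zeros(Float32, 64, 32, 8)           # contiguous, column-major parent
P = PermutedDimsArray(A, (2, 1, 3))     # lazy view with the first two dims swapped

parent_strides = strides(A)             # strides in elements: (1, 64, 2048)
view_strides   = strides(P)             # same strides, permuted: (64, 1, 2048)

# Hypothetical helper: which strides are divisible by `n`?
divisible_by(s, n) = map(x -> x % n == 0, s)

println(parent_strides)
println(view_strides)
println(divisible_by(view_strides, 4))
```

The view's unit stride sits in its second dimension rather than its first, so the information needed to choose a contiguous load direction is already available from the strides.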