Thank you for sharing your amazing solutions. Could you give more information on the logic of the L2 cache swizzling used in your kernel?
Is it similar to the "Scheduling and L2 cache" approach in Kernel 6 of this blog post, where the author improved the L2 cache hit rate?
https://cudaforfun.substack.com/p/outperforming-cublas-on-h100-a-worklog
Thanks,
Cong