Hey, I was wondering how feasible it would be to target this machinery at Triton kernels generated by torch.compile? A second question: is the current lack of support for matrix operations due to a fundamental limitation?
I am happy to put up a PR for these if I can get some rough direction. Thanks!