@sdatkinson
Owner

The current implementation of grouped convolutions is slower than simply doing the non-grouped operation on the full weight matrix with zeros in the off-block-diagonal entries.
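To make the comparison concrete, here is a minimal sketch of the two formulations for a single kernel tap and a single frame, so each reduces to a matrix-vector product. It uses only the standard library; `Matrix`, `grouped_apply`, `to_block_diagonal`, and `dense_apply` are hypothetical names for illustration and are not the code in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix helper (illustration only).
struct Matrix {
  std::size_t rows, cols;
  std::vector<float> data;
  Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
  float& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
  float at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Grouped form: one small product per group, each on its own channel slice.
std::vector<float> grouped_apply(const std::vector<Matrix>& blocks,
                                 const std::vector<float>& x) {
  assert(!blocks.empty());
  const std::size_t out_per = blocks[0].rows, in_per = blocks[0].cols;
  assert(x.size() == blocks.size() * in_per);
  std::vector<float> y(blocks.size() * out_per, 0.0f);
  for (std::size_t g = 0; g < blocks.size(); ++g)
    for (std::size_t i = 0; i < out_per; ++i)
      for (std::size_t j = 0; j < in_per; ++j)
        y[g * out_per + i] += blocks[g].at(i, j) * x[g * in_per + j];
  return y;
}

// Non-grouped form: copy the blocks onto the diagonal of one full matrix;
// every off-block-diagonal entry stays zero.
Matrix to_block_diagonal(const std::vector<Matrix>& blocks) {
  assert(!blocks.empty());
  const std::size_t out_per = blocks[0].rows, in_per = blocks[0].cols;
  Matrix w(blocks.size() * out_per, blocks.size() * in_per);
  for (std::size_t g = 0; g < blocks.size(); ++g)
    for (std::size_t i = 0; i < out_per; ++i)
      for (std::size_t j = 0; j < in_per; ++j)
        w.at(g * out_per + i, g * in_per + j) = blocks[g].at(i, j);
  return w;
}

// One dense product over the full matrix, zeros included.
std::vector<float> dense_apply(const Matrix& w, const std::vector<float>& x) {
  assert(x.size() == w.cols);
  std::vector<float> y(w.rows, 0.0f);
  for (std::size_t i = 0; i < w.rows; ++i)
    for (std::size_t j = 0; j < w.cols; ++j)
      y[i] += w.at(i, j) * x[j];
  return y;
}
```

Both functions return the same output. The dense form does more multiply-adds but issues a single product, which is why it can come out ahead when the blocks are small.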

I tried storing the blocks in an array and running a separate GEMM per block, but the per-call GEMM overhead appears to dominate.

To do better, the approach needs to reduce that overhead. Compile-time improvements and/or specialized implementations for specific sizes are likely to work.
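One way the compile-time idea could look is sketched below, assuming the block dimensions are known when the code is compiled. `FixedBlock` and `grouped_apply_fixed` are hypothetical names, not part of this PR, and the sketch has not been benchmarked. Making the sizes template parameters gives the compiler fixed loop bounds, so it may unroll and inline the per-block product instead of going through a general-purpose GEMM call.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical fixed-size block: Out and In are compile-time constants,
// so the loop trip counts below are known to the compiler.
template <std::size_t Out, std::size_t In>
struct FixedBlock {
  std::array<float, Out * In> w{};  // row-major weights

  // y[0..Out) = w * x[0..In)
  void apply(const float* x, float* y) const {
    for (std::size_t i = 0; i < Out; ++i) {
      float acc = 0.0f;
      for (std::size_t j = 0; j < In; ++j)
        acc += w[i * In + j] * x[j];
      y[i] = acc;
    }
  }
};

// Apply every group's block to its own slice of the input channels.
template <std::size_t Groups, std::size_t Out, std::size_t In>
void grouped_apply_fixed(const std::array<FixedBlock<Out, In>, Groups>& blocks,
                         const float* x, float* y) {
  assert(x != nullptr && y != nullptr);
  for (std::size_t g = 0; g < Groups; ++g)
    blocks[g].apply(x + g * In, y + g * Out);
}
```

A specialized implementation for a specific size would then be one instantiation, for example `grouped_apply_fixed<2, 2, 2>`, selected at the call site for the sizes that matter most. Whether this actually beats the dense path would need to be measured.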

This is better than what's currently on main, but it's not good enough. "Little steps".

@sdatkinson sdatkinson merged commit 12f93a2 into main Jan 29, 2026
2 checks passed