I'm profiling the models here and I'm getting poor results when grouped convolutions are used.
For example, `1x1_groups` looks at scaling up the number of groups in an 8-channel 1x1 convolution:
This plot suggests that per-group overhead is dominating the calculation: the total FLOPs actually shrink as 1/groups (each output channel sees fewer input channels), yet the measured compute time scales linearly with the number of groups.
This is bad because it makes grouped convolutions basically useless here: it would be faster to run the full dense convolution with the off-block weights set to zero.
I suspect that compile-time optimizations may solve this, but I'm surprised that it's this bad.
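To make the "full matrix with a ton of zeroes" equivalence concrete, here is a minimal NumPy sketch (not the benchmark code above; all names are illustrative). A 1x1 grouped convolution over flattened pixels is just a block-diagonal matrix multiply, so embedding the per-group weights into a dense block-diagonal matrix reproduces the grouped result exactly, while the dense path does groups-times more FLOPs:

```python
import numpy as np

def grouped_conv1x1(x, w, groups):
    """Grouped 1x1 conv. x: (C_in, N_pixels); w: (groups, C_out/g, C_in/g)."""
    cin_g = x.shape[0] // groups
    # Each group's output channels only see that group's input channels.
    outs = [w[g] @ x[g * cin_g:(g + 1) * cin_g] for g in range(groups)]
    return np.concatenate(outs, axis=0)

def as_block_diagonal(w):
    """Embed grouped weights into a dense block-diagonal matrix (mostly zeros)."""
    groups, cout_g, cin_g = w.shape
    dense = np.zeros((groups * cout_g, groups * cin_g))
    for g in range(groups):
        dense[g * cout_g:(g + 1) * cout_g, g * cin_g:(g + 1) * cin_g] = w[g]
    return dense

rng = np.random.default_rng(0)
groups, cin, cout, npix = 4, 8, 8, 16
w = rng.standard_normal((groups, cout // groups, cin // groups))
x = rng.standard_normal((cin, npix))

# The grouped path and the dense block-diagonal path agree exactly.
assert np.allclose(grouped_conv1x1(x, w, groups), as_block_diagonal(w) @ x)
```

The grouped path does cout*cin*npix/groups multiply-adds versus cout*cin*npix for the dense one, which is why a linear-in-groups slowdown in the profile points to fixed per-group launch/dispatch overhead rather than arithmetic cost.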