We need to improve performance at small numbers of antennas. This issue tracks that work. The strategy will likely require a separate kernel for the diagonal blocks, distinct from the one used for the off-diagonal blocks.
The implementation will be done in the diagonal branch.
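
To illustrate why the diagonal blocks matter more as the antenna count drops, here is a rough sketch. It is not the actual kernel code. It assumes the antenna-pair matrix is tiled into square blocks and that only the lower triangle is computed. The tile size of 8 and the names `split_tiles` and `diagonal_fraction` are hypothetical, chosen only for this example.

```python
from math import ceil


def split_tiles(n_antennas, tile):
    """Split the lower triangle of the antenna-pair matrix into tiles.

    Returns (diagonal, off_diagonal) lists of (row_block, col_block) indices.
    Diagonal tiles straddle the matrix diagonal, so only part of each one
    holds unique antenna pairs; off-diagonal tiles are fully populated.
    """
    n_blocks = ceil(n_antennas / tile)
    diagonal = [(b, b) for b in range(n_blocks)]
    off_diagonal = [(r, c) for r in range(n_blocks) for c in range(r)]
    return diagonal, off_diagonal


def diagonal_fraction(n_antennas, tile):
    """Fraction of all computed tiles that lie on the diagonal."""
    diagonal, off_diagonal = split_tiles(n_antennas, tile)
    return len(diagonal) / (len(diagonal) + len(off_diagonal))


if __name__ == "__main__":
    # Hypothetical tile size; the real block size depends on the kernel.
    for n in (16, 32, 64, 256):
        print(n, round(diagonal_fraction(n, tile=8), 3))
```

With n blocks per side, the diagonal share is 2 / (n + 1), so it grows quickly as the antenna count shrinks. That is the regime in which a kernel specialized for diagonal blocks could plausibly pay off.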