There are a lot of unexplained memory movements around the Cshifts called from Staple (CovShift), which we need to investigate (see profile).

Concerns are around:
- Small host-to-device memory transfers between the loops over the dimensions (the small green lines between the 4 large red/green blocks).
- The large transfers at the start and end of each dimension loop.

See the notes from 17th of Jan and 31st of Jan.