At the moment, xGPU copies the data back to the host in whatever format the kernel uses. That format isn't necessarily the most intuitive, as it is often a tiled-triangular order. We should support reordering on the device, with a choice of output layouts: the existing triangular ordering and a full matrix.
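To make the "full matrix" option concrete, here is a minimal host-side sketch of the index mapping involved. It is not xGPU code: the `cfloat` type, the `tri_index` and `tri_to_full` names, and the input layout (a plain packed lower triangle stored row by row, with no tiling) are all assumptions made for illustration. It also assumes the matrix is Hermitian, so the upper half can be filled with complex conjugates of the lower half.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical complex sample type; not xGPU's own type. */
typedef struct { float re, im; } cfloat;

/* Position of element (row, col), with col <= row, in a packed
 * lower-triangular array stored row by row. Assumed layout only. */
static size_t tri_index(size_t row, size_t col)
{
    assert(col <= row);
    return row * (row + 1) / 2 + col;
}

/* Expand a packed lower triangle of n*(n+1)/2 entries into a full
 * n x n row-major matrix. The upper half is filled with complex
 * conjugates, which assumes the matrix is Hermitian. */
static void tri_to_full(const cfloat *tri, cfloat *full, size_t n)
{
    for (size_t row = 0; row < n; row++) {
        for (size_t col = 0; col <= row; col++) {
            cfloat v = tri[tri_index(row, col)];
            full[row * n + col] = v;
            if (col != row) {
                /* Mirror across the diagonal as the conjugate. */
                cfloat c = { v.re, -v.im };
                full[col * n + row] = c;
            }
        }
    }
}
```

A device-side version could use the same mapping, with each thread handling one `(row, col)` pair instead of the nested loops. Undoing the kernel's tiled order would need an additional mapping that this sketch does not attempt.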