First, thanks for sharing your code! However, while reading it, I think there may be a mistake in src/sptr/attention/attention_cuda_kernel.cu, in the following lines:
// start from h83
int n_h = blockDim.x;
int h_idx = blockIdx.y * n_h + threadIdx.y;
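According to the definition of the thread dimensions, dim3 threads(hdim, n_h);, blockDim.x is hdim, while n_h is blockDim.y. This mistake does not cause an error when n_h = h, because in that case the y size of the grid is 1, so blockIdx.y is always 0 and h_idx reduces to threadIdx.y whatever value n_h takes. When n_h < h, however, blocks with blockIdx.y > 0 would compute the wrong h_idx.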
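To illustrate what I mean, here is a small host-side C++ sketch that only emulates the index arithmetic; it is not the actual kernel. The helper names (Dim2, head_index_as_written, head_index_proposed, all_head_indices) are made up for this example, and I am assuming the grid's y size is h / n_h, which is consistent with it being 1 when n_h = h but is not something I verified in the launch code.

```cpp
#include <cassert>
#include <vector>

// Minimal stand-in for CUDA's dim3; only x and y matter here (hypothetical helper).
struct Dim2 { int x; int y; };

// Index as written in the quoted snippet: n_h is read from blockDim.x.
int head_index_as_written(Dim2 blockDim, int blockIdx_y, int threadIdx_y) {
    int n_h = blockDim.x;
    return blockIdx_y * n_h + threadIdx_y;
}

// Proposed correction: n_h is read from blockDim.y.
int head_index_proposed(Dim2 blockDim, int blockIdx_y, int threadIdx_y) {
    int n_h = blockDim.y;
    return blockIdx_y * n_h + threadIdx_y;
}

// Collect h_idx for every (blockIdx.y, threadIdx.y) pair of one launch.
std::vector<int> all_head_indices(int h, int hdim, int n_h, bool proposed) {
    assert(n_h > 0 && h % n_h == 0);
    Dim2 threads{hdim, n_h};   // mirrors dim3 threads(hdim, n_h)
    int grid_y = h / n_h;      // ASSUMPTION: y size of the grid is h / n_h
    std::vector<int> out;
    for (int by = 0; by < grid_y; ++by) {
        for (int ty = 0; ty < threads.y; ++ty) {
            out.push_back(proposed ? head_index_proposed(threads, by, ty)
                                   : head_index_as_written(threads, by, ty));
        }
    }
    return out;
}
```

With illustrative values h = 8, hdim = 16, n_h = 4, calling all_head_indices(8, 16, 4, false) gives 0, 1, 2, 3, 16, 17, 18, 19, so the second block lands outside the valid head range [0, 8), while all_head_indices(8, 16, 4, true) gives 0 through 7. With n_h = 8 both variants agree, which would explain why the problem stays hidden in that configuration.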
I may have made a mistake in my analysis, but if you agree with it, please fix this.
Looking forward to your reply!