Hi, thanks for the great work and for open-sourcing the code!
While reading the CUDA kernel knnquery_cuda_kernel (in lib/pointops/src/knnquery/knnquery_cuda.cu), I noticed that the pointer offset for dist2 seems to be missing.
Code snippet (current version):
new_xyz += bs_idx * m * 3 + pt_idx * 3;
xyz += bs_idx * n * 3;
idx += bs_idx * m * nsample + pt_idx * nsample;
// dist2 is not offset here
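Later in the kernel, every thread writes its results to dist2[i] through the same un-offset base pointer. That means all threads in the same batch (bs_idx), and as far as I can tell threads in other batches as well, write to the same memory locations and may overwrite each other’s results. This looks like a bug.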
I think it should be:
dist2 += bs_idx * m * nsample + pt_idx * nsample;
so that each thread writes its own (bs_idx, pt_idx, :) slice of dist2.
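To illustrate the difference, here is a minimal host-side C++ sketch that mirrors only the pointer arithmetic, not the real kernel. The function names emulate_writes and count_unwritten are hypothetical, each (bs_idx, pt_idx) loop iteration stands in for one CUDA thread, and I am assuming dist2 has b * m * nsample elements, as the proposed offset implies. On the GPU the write order is nondeterministic; the serial loop here just makes the last thread win.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for the kernel's final write loop.
// Each (bs_idx, pt_idx) pair plays the role of one CUDA thread and writes
// the marker value bs_idx * m + pt_idx into its nsample output slots.
// With offset_dist2 == false, dist2 is used un-offset, as in the current code.
std::vector<float> emulate_writes(int b, int m, int nsample, bool offset_dist2) {
    // Assumed layout: b * m * nsample floats, pre-filled with a -1 sentinel.
    std::vector<float> dist2_buf(static_cast<std::size_t>(b) * m * nsample, -1.0f);
    for (int bs_idx = 0; bs_idx < b; ++bs_idx) {
        for (int pt_idx = 0; pt_idx < m; ++pt_idx) {
            float *dist2 = dist2_buf.data();
            if (offset_dist2) {
                // Proposed fix: move to this thread's (bs_idx, pt_idx, :) slice.
                dist2 += bs_idx * m * nsample + pt_idx * nsample;
            }
            for (int i = 0; i < nsample; ++i) {
                dist2[i] = static_cast<float>(bs_idx * m + pt_idx);
            }
        }
    }
    return dist2_buf;
}

// Counts the slots that were never written (still holding the -1 sentinel).
int count_unwritten(const std::vector<float> &buf) {
    int count = 0;
    for (float v : buf) {
        if (v < 0.0f) ++count;
    }
    return count;
}
```

With b = 2, m = 3, nsample = 4, the un-offset variant only ever touches the first 4 of the 24 slots, so 20 stay unwritten and the first 4 hold whatever the last writer stored. With the offset applied, all 24 slots are written and each slice holds its own thread's marker.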
Thanks!