-
Notifications
You must be signed in to change notification settings - Fork 22
Description
A synchronization problem on Pascal GPU
When I use pba on Pascal GPU, it always break at "quit on numeric errors". After tracing back, I found following piece of code in ProgramCU.cu at line 1624:
__syncthreads();
if(cam >= num) return;
//save all the results?
value[index] = sum;
if(threadIdx.x < 16) value[index] += value[index + 16];
if(threadIdx.x < 8)
//write back
if(threadIdx.x < 8)
{
float temp = value[index] + value[index + 8];
int wpos = threadIdx.x + (cam << 3);
if(add_existing_dq) temp += jtjd[wpos];
jtjd[wpos] = temp;
jtjdi[wpos] = temp == 0? 0 : 1 / (temp);
}
Since __syncthreads() is before the assignment of value[], it will cause minus temp, and cause minus jtjd[] and jtjdi[], and then NAN will occur in rsqrt_kernel_large kernel in the following step.
Doing __syncthreads() after assignment of value[] fixes this problem. Also I think there is an unnecessary if():
if(cam >= num) return;
//save all the results?
value[index] = sum;
__syncthreads();
if(threadIdx.x < 16) value[index] += value[index + 16];
// if(threadIdx.x < 8)
//write back
if(threadIdx.x < 8)
{
float temp = value[index] + value[index + 8];
int wpos = threadIdx.x + (cam << 3);
if(add_existing_dq) temp += jtjd[wpos];
jtjd[wpos] = temp;
jtjdi[wpos] = temp == 0? 0 : 1 / (temp);
}
I'm not sure if it is correct. And also, why this only happen on Pascal GPU? I tried the old code on Maxwell and Turing GPU, with CUDA 10.2/11.1/11.4 on Ubuntu 18.04/Windows 10 ,"quit on numeric errors" will never happen. This is really strange!
I also replace the global texture with texture object for newer CUDA version in my fork here:
https://github.com/cnhzcy14/pba