Skip to content

A synchronization problem on Pascal GPU #8

@cnhzcy14

Description

@cnhzcy14

A synchronization problem on Pascal GPU

When I use pba on Pascal GPU, it always break at "quit on numeric errors". After tracing back, I found following piece of code in ProgramCU.cu at line 1624:

    __syncthreads();

    if(cam >= num) return;
    //save all the results?
    value[index] = sum; 
    if(threadIdx.x < 16) value[index] += value[index + 16];
    if(threadIdx.x < 8)  

    //write back 
    if(threadIdx.x < 8)    
    {
        float temp =    value[index] + value[index + 8];
        int wpos = threadIdx.x + (cam << 3);
		if(add_existing_dq) temp += jtjd[wpos];
        jtjd[wpos] =  temp;
        jtjdi[wpos] = temp == 0? 0 : 1 / (temp);
    }

Since __syncthreads() is before the assignment of value[], it will cause minus temp, and cause minus jtjd[] and jtjdi[], and then NAN will occur in rsqrt_kernel_large kernel in the following step.

Doing __syncthreads() after assignment of value[] fixes this problem. Also I think there is an unnecessary if():

    if(cam >= num) return;
    //save all the results?
    value[index] = sum; 
	__syncthreads();
    if(threadIdx.x < 16) value[index] += value[index + 16];
    // if(threadIdx.x < 8)  

    //write back 
    if(threadIdx.x < 8)    
    {
        float temp =    value[index] + value[index + 8];
        int wpos = threadIdx.x + (cam << 3);
		if(add_existing_dq) temp += jtjd[wpos];
        jtjd[wpos] =  temp;
        jtjdi[wpos] = temp == 0? 0 : 1 / (temp);
    }

I'm not sure if it is correct. And also, why this only happen on Pascal GPU? I tried the old code on Maxwell and Turing GPU, with CUDA 10.2/11.1/11.4 on Ubuntu 18.04/Windows 10 ,"quit on numeric errors" will never happen. This is really strange!

I also replace the global texture with texture object for newer CUDA version in my fork here:
https://github.com/cnhzcy14/pba

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions