Evaluate the performance on AMD GPUs of using the OpenCL 2.0 equivalents of the warp vote and warp shuffle functions.
Based on the performance results, decide whether AMD GPUs (and OpenCL 2.0) warrant a separate kernel implementation, and update the code as needed.