After a few thousand iterations of a loop like this:
{
    kernel_events = cl.enqueueNDRangeKernel(......., true);
    read_event = cl.enqueueReadBuffer(......., true);
    cl.waitForEvents([read_event]);
}
on my NVIDIA SDK/hardware (GTX 970), it starts getting very slow: roughly 1000x slower per iteration than during the first few thousand iterations.
The only thing that prevents this is passing false as the last parameter, so that no events are returned; then it runs fine for basically unlimited iterations.
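
One possibility worth ruling out (an assumption, not a confirmed diagnosis) is that the returned events are never released. In the OpenCL C API, every event handed back by an enqueue call stays alive until clReleaseEvent is called on it, so a loop that requests two events per iteration and drops them would accumulate handles. The sketch below only illustrates that bookkeeping: the `cl` object is a stand-in mock so that it runs without a GPU, and `releaseEvent`, `liveEventCount`, `makeMockCl` and `runLoop` are hypothetical names, not the real wrapper's API.

```javascript
// Stand-in for the real binding: it only counts event handles that are
// still alive. Every name on this mock is hypothetical, including
// releaseEvent; check what your wrapper actually exposes.
function makeMockCl() {
  let liveEvents = 0;
  const newEvent = () => {
    liveEvents += 1;
    return { released: false };
  };
  return {
    // The single boolean argument plays the role of the last parameter
    // in the loop above: true means "return an event".
    enqueueNDRangeKernel: (wantEvent) => (wantEvent ? newEvent() : undefined),
    enqueueReadBuffer: (wantEvent) => (wantEvent ? newEvent() : undefined),
    waitForEvents: (events) => events.length,
    releaseEvent: (ev) => {
      if (!ev.released) {
        ev.released = true;
        liveEvents -= 1;
      }
    },
    liveEventCount: () => liveEvents,
  };
}

// Runs the same loop shape as above and reports how many event handles
// are still alive afterwards.
function runLoop(cl, iterations, release) {
  for (let i = 0; i < iterations; i += 1) {
    const kernelEvent = cl.enqueueNDRangeKernel(true);
    const readEvent = cl.enqueueReadBuffer(true);
    cl.waitForEvents([readEvent]);
    if (release) {
      // Release both handles once the read has completed.
      cl.releaseEvent(kernelEvent);
      cl.releaseEvent(readEvent);
    }
  }
  return cl.liveEventCount();
}

const withoutRelease = runLoop(makeMockCl(), 5000, false);
const withRelease = runLoop(makeMockCl(), 5000, true);
console.log(withoutRelease, withRelease);
```

If the real wrapper has an equivalent of clReleaseEvent, calling it on both events at the end of each iteration would be a cheap experiment to see whether the slowdown goes away.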
On a side note, it also seems that with the NVIDIA SDK, the enqueueXXX functions block on their own, without any call to waitForEvents, cl.finish, etc.