Description
The avx_add, avx_mul, avx_mac, and avx_fma tests hang when compiled with the Nvidia compiler. Oddly, the avx_fmac test seems to run fine.
Backtrace:
Thread 5 (Thread 0x155552ebd700 (LWP 1842419)):
#0 0x0000155553f0e5ae in pthread_barrier_wait () from /lib64/libpthread.so.0
#1 0x0000000000403696 in avx_add (args_in=0x15554c000be0) at src/x86/avx.c:43
#2 0x0000000000402f77 in simd_thread (args_in=0x609310) at src/simd.c:38
#3 0x0000155553f071ca in start_thread () from /lib64/libpthread.so.0
#4 0x00001555534918d3 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x155555522fc0 (LWP 1842402)):
#0 0x0000155553f086cd in __pthread_timedjoin_ex () from /lib64/libpthread.so.0
#1 0x0000000000401b5c in main (argc=<optimized out>, argv=<optimized out>) at src/main.c:177
Looking inside, it seems that r_max is overflowing and then somehow becoming zero. Once r_max is zero, the r_max := 2*r_max progression can never advance, because doubling zero yields zero.
It does not seem to depend on the choice of compiler flags (although AFAIK Nvidia is
rather aggressive about vectorization).
This happened on an AMD EPYC 7H12, but I don't think it's related to AMD instructions.
My first guess is that _mm256_add_pd(), or something else in the timed loops, is being treated as a dummy function that runs in zero time. The loop would then take zero time, so r_max would increase without bound and eventually overflow.
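If the compiler really is discarding the work in the timed loop, one common mitigation is to make the result observable, for example by storing it to a volatile variable. The following is only a scalar sketch of that idea and is not taken from src/x86/avx.c; the names `sink` and `timed_kernel` are hypothetical, and whether this matches what the Nvidia compiler is doing here is unverified.

```c
#include <stddef.h>

/* Hypothetical sink: a volatile store cannot be removed by the compiler,
 * so the value written to it must actually be produced. */
static volatile double sink;

/* Hypothetical stand-in for a timed kernel: adds 1.0 a total of r_max times. */
double timed_kernel(size_t r_max)
{
    double acc = 0.0;

    for (size_t r = 0; r < r_max; r++)
        acc += 1.0;

    /* Publish the result so the accumulation is not dead code. */
    sink = acc;
    return acc;
}
```

Note that a sufficiently aggressive optimizer may still replace the loop with a closed-form result, so this only guarantees that the final value is produced, not that every iteration runs.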
I really don't have time to look into this now, but this needs to be addressed for any of the Nvidia content to be taken seriously. At the very least, we could start checking for r_max overflow.
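A minimal sketch of such an overflow check follows. It assumes r_max is an unsigned 64-bit counter, which this issue does not confirm; if it is a signed type, overflow is undefined behavior in C and the wrap to zero is not guaranteed. The helper names `double_r_max` and `doublings_until_zero` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical guard: doubles *r_max only if the result fits in 64 bits.
 * Returns false, leaving *r_max unchanged, when the value is zero (doubling
 * would make no progress) or too large to double without wrapping. */
bool double_r_max(uint64_t *r_max)
{
    assert(r_max != NULL);

    if (*r_max == 0 || *r_max > UINT64_MAX / 2)
        return false;

    *r_max *= 2;
    return true;
}

/* Illustrates the suspected failure: unchecked doubling starting from 1
 * wraps to zero, because unsigned arithmetic is modulo 2^64. Returns the
 * number of doublings performed before the value becomes zero. */
unsigned doublings_until_zero(void)
{
    uint64_t r = 1;
    unsigned n = 0;

    while (r != 0) {
        r *= 2;
        n++;
    }
    return n;
}
```

With a guard like this, the calibration loop could stop and report an error when `double_r_max` returns false, instead of continuing with an r_max of zero.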