There are two problems with the SSE tests; one is fixable, but the other is more problematic:
- GCC will replace the SSE (128-bit) intrinsics in the `sse_*` tests with AVX instructions, presumably because there is no added benefit to using AVX over SSE. (Why they would do this I have no idea...) This is annoying, but could be solved by replacing any `-mavx` build flags with `-msse` in the `sse*.c` builds. A standalone reproduction is sketched below the list.
- Even when explicit SSE instructions are used, the performance behaves as if they were AVX instructions! If (the misnamed) `VADDPS_LATENCY` is set to the SSE latency, then it produces the expected SSE performance (FLOP/s = 4 x freq), but increasing this "latency" (which actually controls the loop unrolling) will only raise the FLOP rate up to the AVX rate (8 x freq). When registers are depleted, the rate starts to drop, as expected. I have counted the number of FLOPs via perf, and it is reporting the correct number: if `r_max` is hard-coded to 100 million with a latency of 5, then it produces ~2 billion FLOPs (4 x 5 x 0.1e9 = 2e9; see the second sketch below). I cannot find any error in the `sse.c` timings, so I am worried that it may be an optimization happening inside the CPU. I am very reluctant to pursue this possibility, so I am just noting it here and moving on.
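
For reference, the codegen behaviour from the first point is easy to reproduce in isolation. The file below is a hypothetical standalone example, not part of this repo: compiled with `-msse`, the intrinsic lowers to `addps`, while `-mavx` makes GCC emit `vaddps` (the VEX-encoded form) instead.

```c
/* repro.c -- hypothetical standalone example, not from this repo.
 * Shows which instruction GCC emits for a 128-bit SSE intrinsic. */
#include <xmmintrin.h>

__m128 add4(__m128 a, __m128 b)
{
    /* gcc -O2 -msse -S repro.c  ->  addps
     * gcc -O2 -mavx -S repro.c  ->  vaddps (VEX-encoded) */
    return _mm_add_ps(a, b);
}
```

Inspecting the output is just `gcc -O2 -mavx -S repro.c && grep -i addps repro.s`.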
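
To make the FLOP arithmetic in the second point concrete, here is a minimal sketch of how such a latency-hiding kernel is typically structured. This is not the actual `sse.c`: the names `VADDPS_LATENCY` and `r_max` are borrowed from above, and it assumes that `VADDPS_LATENCY` sets the number of independent accumulator chains (the unroll depth).

```c
/* kernel_sketch.c -- minimal sketch, NOT the actual sse.c kernel.
 * Assumes VADDPS_LATENCY = number of independent accumulator chains. */
#include <stdio.h>
#include <xmmintrin.h>

#define VADDPS_LATENCY 5           /* unroll depth / independent chains */
#define R_MAX 100000000L           /* the hard-coded r_max from above   */

int main(void)
{
    __m128 acc[VADDPS_LATENCY];
    const __m128 eps = _mm_set1_ps(1e-6f);

    for (int k = 0; k < VADDPS_LATENCY; k++)
        acc[k] = _mm_set1_ps(0.0f);

    for (long r = 0; r < R_MAX; r++)
        for (int k = 0; k < VADDPS_LATENCY; k++)   /* unrolled in practice */
            acc[k] = _mm_add_ps(acc[k], eps);      /* 4 FLOPs per add      */

    /* Total FLOPs = 4 lanes x VADDPS_LATENCY x R_MAX
     *             = 4 x 5 x 1e8 = 2e9, matching the perf count above.  */
    __m128 total = acc[0];
    for (int k = 1; k < VADDPS_LATENCY; k++)       /* keep all chains live */
        total = _mm_add_ps(total, acc[k]);

    float out[4];
    _mm_storeu_ps(out, total);
    printf("%f\n", out[0] + out[1] + out[2] + out[3]);
    return 0;
}
```

The independent chains exist purely to hide the `addps` latency; the FLOP count itself does not depend on how well they do so, which is why perf reporting the correct total does not rule out something odd in the timing side.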
If these problems are not resolved, then it may be better to just disable the SSE timing test for now.