There are two problems with the SSE tests; one is fixable, but the other is more problematic:
- GCC will replace the SSE (128-bit) intrinsics in the `sse_*` tests with AVX instructions, presumably because there is no added benefit to using AVX over SSE. (Why they would do this I have no idea...) This is annoying, but could be solved by replacing any `-mavx` build flags with `-msse` in the `sse*.c` builds. A standalone reproduction is sketched below the list.
- Even when explicit SSE instructions are used, the performance behaves as if they were AVX instructions! If (the misnamed) `VADDPS_LATENCY` is set to the SSE latency, then it produces the expected SSE performance (FLOP/s = 4 x freq), but increasing this "latency" (which actually controls the loop unrolling) will only raise the FLOP rate up to the AVX rate (8 x freq). When registers are depleted, the rate starts to drop, as expected. I have counted the number of FLOPs via perf, and it is reporting the correct number: if `r_max` is hard-coded to 100 million with a latency of 5, then it produces ~2 billion FLOPs (4 x 5 x 0.1e9 = 2e9; see the second sketch below). I cannot find any error in the `sse.c` timings, so I am worried that it may be an optimization happening inside the CPU. I am very reluctant to pursue this possibility, so I am just noting it here and moving on.
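
For reference, the codegen behaviour from the first point is easy to reproduce in isolation. The file below is a hypothetical standalone example, not part of this repo: compiled with `-msse`, the intrinsic lowers to `addps`, while `-mavx` makes GCC emit `vaddps` (the VEX-encoded form) instead.

```c
/* repro.c -- hypothetical standalone example, not from this repo.
 * Shows which instruction GCC emits for a 128-bit SSE intrinsic. */
#include <xmmintrin.h>

__m128 add4(__m128 a, __m128 b)
{
    /* gcc -O2 -msse -S repro.c  ->  addps
     * gcc -O2 -mavx -S repro.c  ->  vaddps (VEX-encoded) */
    return _mm_add_ps(a, b);
}
```

Inspecting the output is just `gcc -O2 -mavx -S repro.c && grep -i addps repro.s`.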
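
To make the FLOP arithmetic in the second point concrete, here is a minimal sketch of how such a latency-hiding kernel is typically structured. This is not the actual `sse.c`: the names `VADDPS_LATENCY` and `r_max` are borrowed from above, and it assumes that `VADDPS_LATENCY` sets the number of independent accumulator chains (the unroll depth).

```c
/* kernel_sketch.c -- minimal sketch, NOT the actual sse.c kernel.
 * Assumes VADDPS_LATENCY = number of independent accumulator chains. */
#include <stdio.h>
#include <xmmintrin.h>

#define VADDPS_LATENCY 5           /* unroll depth / independent chains */
#define R_MAX 100000000L           /* the hard-coded r_max from above   */

int main(void)
{
    __m128 acc[VADDPS_LATENCY];
    const __m128 eps = _mm_set1_ps(1e-6f);

    for (int k = 0; k < VADDPS_LATENCY; k++)
        acc[k] = _mm_set1_ps(0.0f);

    for (long r = 0; r < R_MAX; r++)
        for (int k = 0; k < VADDPS_LATENCY; k++)   /* unrolled in practice */
            acc[k] = _mm_add_ps(acc[k], eps);      /* 4 FLOPs per add      */

    /* Total FLOPs = 4 lanes x VADDPS_LATENCY x R_MAX
     *             = 4 x 5 x 1e8 = 2e9, matching the perf count above.  */
    __m128 total = acc[0];
    for (int k = 1; k < VADDPS_LATENCY; k++)       /* keep all chains live */
        total = _mm_add_ps(total, acc[k]);

    float out[4];
    _mm_storeu_ps(out, total);
    printf("%f\n", out[0] + out[1] + out[2] + out[3]);
    return 0;
}
```

The independent chains exist purely to hide the `addps` latency; the FLOP count itself does not depend on how well they do so, which is why perf reporting the correct total does not rule out something odd in the timing side.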
If these problems are not resolved, then it may be better to just disable the SSE timing test for now.