Small performance improvement for AVX-512 on Skylake-SP #191
rdolbeau wants to merge 3 commits into FFTW:master
xiegengxin
left a comment
I think the scale should be 8 when using scatter/gather to access two 32-bit elements as one 64-bit element.
const __m256i index = _mm256_set_epi32(7 * ivs, 6 * ivs, 5 * ivs, 4 * ivs, 3 * ivs, 2 * ivs, 1 * ivs, 0 * ivs);

return _mm512_i32gather_ps(index, x, 4);
return (V)_mm512_i32gather_pd(index, x, 4);
return (V)_mm512_i32gather_pd(index, x, 8);
Right?
I wrote that code a while ago, but I think 4 is correct: the indices still refer to the original datatype, a single-precision value of 4 bytes. The _pd variant is used only to access 64 bits at a time explicitly.
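To make the scale argument concrete, here is a minimal scalar sketch in portable C. It is not the PR's code: `model_gather64` and `load_pairs` are hypothetical names, and the model only assumes that each lane of a 64-bit gather loads 8 bytes from `base + index * scale`, with the index vector built as in the diff above (multiples of `ivs`, counted in floats).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of a 64-bit-element gather (hypothetical helper):
 * lane i loads 8 bytes from base + index[i] * scale. */
static void model_gather64(uint64_t out[8], const void *base,
                           const int32_t index[8], int scale)
{
    /* The gather intrinsics only accept these scale values. */
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    for (int i = 0; i < 8; i++) {
        const unsigned char *p =
            (const unsigned char *)base + (ptrdiff_t)index[i] * scale;
        memcpy(&out[i], p, sizeof out[i]);
    }
}

/* Hypothetical helper: load 8 pairs of floats whose starts are ivs
 * floats apart, treating each pair as one 64-bit element. */
static void load_pairs(float out[16], const float *x, int ivs, int scale)
{
    int32_t index[8];
    uint64_t lanes[8];
    for (int i = 0; i < 8; i++)
        index[i] = i * ivs;            /* stride counted in floats */
    model_gather64(lanes, x, index, scale);
    memcpy(out, lanes, sizeof lanes);  /* 8 lanes * 8 bytes = 16 floats */
}
```

With `ivs = 4`, a scale of 4 makes lane 1 read floats 4 and 5, which is the intended pair. A scale of 8 makes the same lane read floats 8 and 9, i.e. it doubles the effective stride.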
/* pretend pair of single are a double */
const __m256i index = _mm256_set_epi32(7 * ovs, 6 * ovs, 5 * ovs, 4 * ovs, 3 * ovs, 2 * ovs, 1 * ovs, 0 * ovs);

_mm512_i32scatter_pd(x, index, (__m512d)v, 4);
_mm512_i32scatter_pd(x, index, (__m512d)v, 8);
Same here: ovs is a stride in 4-byte elements, so the index vector is also in 4-byte elements.
I've updated one of the commit messages after (finally) testing the code on KNL.
…assemble/disassemble the vector in 128-bit chunks. This is faster on Skylake but will not work on Knights Landing (KNL lacks AVX512DQ), so I've added an --enable-avx512-scattergather option to retain the old behavior and allow compiling and using AVX-512 on KNL. This should help with FFTW#143.
This should slightly improve performance by reducing the number of uops needed to do the gather/scatter.
@stevengj @matteo-frigo Can I merge this (old) one? I don't think KNL disliking the new code is much of a problem by now, since I think most of the KNL-based systems have been retired (and there's a configure option to produce a KNL-friendly version anyway, as I still own a KNL myself :-) )
This improves performance a bit on Skylake-SP cores (Xeon Scalable) by replacing gather/scatter with slightly more efficient code: breaking the instruction down into 128-bit chunks for DP, and using 64-bit scatter/gather (instead of 32-bit) for SP. The original gather/scatter code is still available for DP, as it's probably faster on Knights Landing (KNL, Xeon Phi 72xx). The SP code should be a win on KNL as well.
Tested with make check/bigcheck, and for performance on synthetic code; it could probably use some real-life performance testing.
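As a rough illustration of the 128-bit chunk idea for DP, the sketch below models it in portable scalar C rather than with intrinsics. It is not the PR's implementation: the function names are hypothetical, and it assumes a 512-bit vector holds four pairs of doubles whose starting positions are `ivs` (or `ovs`) doubles apart, so each pair is moved as one 16-byte chunk instead of through a gather or scatter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Scalar model (hypothetical helper): assemble a 512-bit vector,
 * i.e. 8 doubles, from four 128-bit chunks that start ivs doubles
 * apart in x. Each memcpy stands in for one 128-bit load. */
static void load_strided_chunks(double out[8], const double *x, ptrdiff_t ivs)
{
    assert(out != NULL && x != NULL);
    for (int i = 0; i < 4; i++)
        memcpy(&out[2 * i], x + i * ivs, 2 * sizeof(double));
}

/* Scalar model (hypothetical helper): disassemble the vector again,
 * writing each 128-bit chunk ovs doubles after the previous one. */
static void store_strided_chunks(double *x, ptrdiff_t ovs, const double v[8])
{
    assert(x != NULL && v != NULL);
    for (int i = 0; i < 4; i++)
        memcpy(x + i * ovs, &v[2 * i], 2 * sizeof(double));
}
```

On real hardware the four copies would presumably map to 128-bit loads and stores combined with insert/extract instructions, which is where the AVX512DQ requirement mentioned in the commit message would come from.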