Hello,
Thank you for creating this library. I ported it to Zig here: https://github.com/steelcake/zint/blob/main/src/fastlanes.zig
I have found that my code performs much better on Ryzen CPUs with AVX-512 disabled. I tested this on an EPYC and a mid-range desktop CPU.
In my implementation, disabling it makes the delta and FFOR encodings about twice as fast.
I have also tested it on this repo by comparing
RUSTFLAGS='-C target-cpu=native' cargo bench --profile release
with
RUSTFLAGS='-C target-cpu=native -C target-feature=-avx512f' cargo bench --profile release
The only major difference I could see was in RLE decode, which goes from 15 GB/s to 19 GB/s with AVX-512 disabled.
Disabling AVX-512 also makes bitpacking about 2% slower in this repo, which wasn't the case in my implementation.
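As a sanity check that the second command really builds without AVX-512, something like the following could be called from a bench or test. This is only a minimal sketch: `compiled_simd_level` is a hypothetical helper, not something that exists in this repo, and it only reports what the compiler was told to target, not what the CPU supports.

```rust
/// Reports the widest x86 SIMD feature set this binary was compiled for.
/// Hypothetical helper for illustration; not part of the library.
/// `cfg!(target_feature = "...")` is resolved at compile time, so the result
/// reflects RUSTFLAGS (e.g. `-C target-feature=-avx512f`), not runtime detection.
pub fn compiled_simd_level() -> &'static str {
    if cfg!(target_feature = "avx512f") {
        "avx512f"
    } else if cfg!(target_feature = "avx2") {
        "avx2"
    } else {
        "baseline"
    }
}
```

Printing this value at the start of each bench run would make it easy to tell the two result sets apart.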
Manually unrolling the transpose loop (as is done here) seems to prevent the compiler from vectorizing it with AVX-512, so it also removes some of the disadvantage in my codebase. However, it generates a huge amount of assembly and feels like a hack. In my case, disabling AVX-512 completely yields even better performance than just unrolling the transpose loops.
Are there other tricks, similar to unrolling the transpose loops, that help remove the disadvantages of AVX-512? (I couldn't find any other difference between this implementation and mine.)
I am also curious why the code in this repo seems to be fine with AVX-512 for the most part, while it makes such a huge difference in mine. Maybe it is due to the Rust/Zig difference?