avx512 effect on performance

Hello,

Thank you for creating this library. I ported it to Zig here: https://github.com/steelcake/zint/blob/main/src/fastlanes.zig

I have found that disabling `avx512` makes my code much faster on Ryzen CPUs. I have tested this on an EPYC and a midrange desktop CPU.

In my implementation, it speeds up delta and FFOR encoding by about 2x.

I have also tested it on this repo by comparing

```
RUSTFLAGS='-C target-cpu=native' cargo bench --profile release
```

with

```
RUSTFLAGS='-C target-cpu=native -C target-feature=-avx512f' cargo bench --profile release
```
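To confirm that the second command really removes AVX-512 from the build, a small check like the one below could be compiled under both sets of `RUSTFLAGS`. This is only a sketch: the helper name `avx512f_enabled_at_compile_time` is hypothetical and not part of this repo, and on older toolchains where the AVX-512 target features were still unstable the `cfg!` check may report `false` regardless of the flags.

```rust
/// Hypothetical helper (not part of this repo): reports whether this build
/// was compiled with the `avx512f` target feature enabled, e.g. through
/// `-C target-cpu=native` or `-C target-feature=+avx512f`.
fn avx512f_enabled_at_compile_time() -> bool {
    // `cfg!` is evaluated at compile time, so this reflects the flags
    // passed to rustc, not what the CPU running the binary supports.
    cfg!(target_feature = "avx512f")
}

fn main() {
    println!(
        "avx512f enabled at compile time: {}",
        avx512f_enabled_at_compile_time()
    );
}
```

Looking for `zmm` registers in the disassembly of the hot loops is probably the more reliable check, since it shows what the compiler actually emitted.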

The main difference I could see was in RLE decode, which goes from 15 GB/s to 19 GB/s with avx512 disabled.

Disabling avx512 also makes bitpacking about 2% slower in this repo, which wasn't the case in my implementation.

Manually unrolling the transpose loop (as is done here) seems to prevent the compiler from vectorizing it with avx512, so it also removes some of the disadvantage in my codebase. However, it generates a huge amount of assembly and feels like a hack. Disabling avx512 completely actually yields even better performance than just unrolling the transpose loops in my case.

Are there other tricks, similar to unrolling the transpose loops, that help remove the disadvantages of avx512? (I couldn't find any other difference between this implementation and mine.)

I am also curious why the code in this repo seems to be fine with avx512 for the most part, while it makes such a huge difference in mine. Maybe it is because of the Rust/Zig difference?
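To make the comparison concrete, here is a toy sketch of the two shapes I mean. It uses a 4x4 transpose purely for illustration. It is not the transpose from this repo or from my port, and the names `transpose_loop` and `transpose_unrolled` are made up. Whether the compiler vectorizes either form depends on the target features and compiler version, so the generated assembly would need to be checked.

```rust
const N: usize = 4;

/// Loop form: the index arithmetic is left to the compiler, which may
/// choose to auto-vectorize it (possibly with wide gather/scatter or
/// permute instructions when they are available).
fn transpose_loop(src: &[u32; N * N], dst: &mut [u32; N * N]) {
    for r in 0..N {
        for c in 0..N {
            dst[c * N + r] = src[r * N + c];
        }
    }
}

/// Manually unrolled form: every element move is written out with
/// constant indices. Same result as the loop, but much more code.
fn transpose_unrolled(src: &[u32; N * N], dst: &mut [u32; N * N]) {
    dst[0] = src[0];  dst[4] = src[1];  dst[8] = src[2];   dst[12] = src[3];
    dst[1] = src[4];  dst[5] = src[5];  dst[9] = src[6];   dst[13] = src[7];
    dst[2] = src[8];  dst[6] = src[9];  dst[10] = src[10]; dst[14] = src[11];
    dst[3] = src[12]; dst[7] = src[13]; dst[11] = src[14]; dst[15] = src[15];
}

fn main() {
    // Fill the source with 0..16 so each element is identifiable.
    let src: [u32; N * N] = std::array::from_fn(|i| i as u32);
    let mut a = [0u32; N * N];
    let mut b = [0u32; N * N];

    transpose_loop(&src, &mut a);
    transpose_unrolled(&src, &mut b);

    // Both forms must produce the same transposed layout.
    assert_eq!(a, b);
    println!("{:?}", a);
}
```

In a real kernel the unrolled form would more likely come from a macro or comptime expansion than from hand-written lines, which is where the large amount of assembly comes from.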
