avx512 effect on performance #110

@ozgrakkurt

Hello,

Thank you for creating this library. I ported it to Zig here: https://github.com/steelcake/zint/blob/main/src/fastlanes.zig

I have found that disabling AVX-512 performs much better on Ryzen CPUs for my code. I have tested this on an EPYC and on a midrange desktop CPU.

In my implementation, it speeds up delta and FFOR encoding by about 2x.
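
For readers unfamiliar with the two encodings, here is a minimal scalar sketch in Rust of delta encoding followed by FFOR (fused frame-of-reference: subtract a reference value and bit-pack the result in the same pass). It is an illustration under assumptions: all function names are hypothetical, and it uses a plain sequential bit layout rather than the FastLanes interleaved/transposed layout, so it says nothing about the performance numbers discussed here.

```rust
/// Hypothetical helper. Delta-encode against `base`: out[i] = values[i] - values[i - 1],
/// with values[-1] taken to be `base`. Wrapping arithmetic keeps it total.
fn delta_encode(values: &[u32], base: u32) -> Vec<u32> {
    let mut prev = base;
    values
        .iter()
        .map(|&v| {
            let d = v.wrapping_sub(prev);
            prev = v;
            d
        })
        .collect()
}

/// Inverse of `delta_encode`: a running (wrapping) prefix sum starting at `base`.
fn delta_decode(deltas: &[u32], base: u32) -> Vec<u32> {
    let mut acc = base;
    deltas
        .iter()
        .map(|&d| {
            acc = acc.wrapping_add(d);
            acc
        })
        .collect()
}

/// Number of bits needed to store `max` (at least 1).
fn bits_needed(max: u32) -> u32 {
    (32 - max.leading_zeros()).max(1)
}

/// FFOR encode (sequential layout, LSB first): subtract `reference` and pack
/// each offset into `width` bits. Assumes every value >= reference and that
/// every offset fits in `width` bits.
fn ffor_encode(values: &[u32], reference: u32, width: u32) -> Vec<u64> {
    assert!((1..=32).contains(&width));
    let w = width as usize;
    let mut out = vec![0u64; (values.len() * w + 63) / 64];
    for (i, &v) in values.iter().enumerate() {
        let offset = (v - reference) as u64;
        let (word, shift) = ((i * w) / 64, (i * w) % 64);
        out[word] |= offset << shift;
        if shift + w > 64 {
            // The value straddles two words: spill its high bits into the next one.
            out[word + 1] |= offset >> (64 - shift);
        }
    }
    out
}

/// FFOR decode: unpack `n` offsets of `width` bits and add `reference` back.
fn ffor_decode(packed: &[u64], n: usize, reference: u32, width: u32) -> Vec<u32> {
    let w = width as usize;
    let mask = (1u64 << width) - 1; // width <= 32, so this shift cannot overflow
    (0..n)
        .map(|i| {
            let (word, shift) = ((i * w) / 64, (i * w) % 64);
            let mut bits = packed[word] >> shift;
            if shift + w > 64 {
                bits |= packed[word + 1] << (64 - shift);
            }
            reference + (bits & mask) as u32
        })
        .collect()
}

fn main() {
    // A monotonically increasing column, so every delta is small.
    let values: Vec<u32> = (0..1024u32).map(|i| 1_000 + 3 * i + (i % 2)).collect();
    let base = values[0];
    let deltas = delta_encode(&values, base);
    let reference = *deltas.iter().min().unwrap();
    let width = bits_needed(deltas.iter().map(|&d| d - reference).max().unwrap());
    let packed = ffor_encode(&deltas, reference, width);
    let restored = delta_decode(&ffor_decode(&packed, deltas.len(), reference, width), base);
    assert_eq!(restored, values);
    println!("{} values -> {} u64 words at {} bits each", values.len(), packed.len(), width);
}
```

An optimized kernel would typically specialize on `width` and process whole blocks at once; this version favors readability over speed.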

I have also tested it on this repo by comparing

RUSTFLAGS='-C target-cpu=native' cargo bench --profile release

with

RUSTFLAGS='-C target-cpu=native -C target-feature=-avx512f' cargo bench --profile release
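
To confirm that the two builds really differ in AVX-512 usage, a small check like the following could be added to a bench or binary built with the same `RUSTFLAGS`. This is a sketch, not part of this repo: `avx512_status` is a hypothetical helper. `cfg!(target_feature = "avx512f")` reports what the compiler was allowed to use for this build, while `std::is_x86_feature_detected!` queries the CPU at run time.

```rust
/// Hypothetical helper. Returns (built_with_avx512f, avx512f_detected).
fn avx512_status() -> (bool, bool) {
    // Compile time: reflects the target-cpu / target-feature flags of this build.
    let compiled = cfg!(target_feature = "avx512f");

    // Run time: asks the CPU. Note that the std macro also returns true
    // whenever the feature was already enabled at compile time.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    let detected = std::is_x86_feature_detected!("avx512f");
    #[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
    let detected = false;

    (compiled, detected)
}

fn main() {
    let (compiled, detected) = avx512_status();
    println!("built with avx512f: {compiled}, avx512f detected: {detected}");
}
```

With the second `RUSTFLAGS` variant on an AVX-512 machine, one would expect "built with" to be false while "detected" stays true.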

The only difference I could see was in RLE decode, which goes from 15 GB/s to 19 GB/s (with AVX-512 disabled).

Disabling AVX-512 also makes bit-packing 2% slower in this repo, which was not the case in my implementation.

Manually unrolling the transpose loop (as is done in this repo) seems to prevent the compiler from vectorizing it with AVX-512, so it also removes some of the slowdown in my codebase. However, it generates a huge amount of assembly and feels like a hack. And in my case, disabling AVX-512 completely is still faster than just unrolling the transpose loops.
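
For context, here is a minimal sketch of what a rolled (not unrolled) transpose loop over a 1024-value block can look like. It is an assumption-laden illustration: `ORDER` is modeled on the 0-4-2-6-1-5-3-7 ordering described in the FastLanes paper, and `transpose_index`, `transpose`, and `untranspose` are hypothetical names, not copied from this repo or from the Zig port.

```rust
/// Hypothetical 8-entry interleaving order, modeled on the 0-4-2-6-1-5-3-7
/// order described in the FastLanes paper.
const ORDER: [usize; 8] = [0, 4, 2, 6, 1, 5, 3, 7];

/// Hypothetical index mapping for a 1024-value block: split the index into
/// (row, order, lane) digits and reassemble them in a different order.
/// This is a bijection on 0..1024.
const fn transpose_index(idx: usize) -> usize {
    let lane = idx % 16;
    let order = (idx / 16) % 8;
    let row = idx / 128;
    lane * 64 + ORDER[order] * 8 + row
}

/// Rolled transpose loop (a scatter through the index mapping). Whether and how
/// this gets auto-vectorized depends on the compiler and enabled target features.
fn transpose(input: &[u32; 1024], output: &mut [u32; 1024]) {
    for (i, &v) in input.iter().enumerate() {
        output[transpose_index(i)] = v;
    }
}

/// Inverse permutation: gather back from the transposed positions.
fn untranspose(input: &[u32; 1024], output: &mut [u32; 1024]) {
    for (i, slot) in output.iter_mut().enumerate() {
        *slot = input[transpose_index(i)];
    }
}

fn main() {
    let input: [u32; 1024] = std::array::from_fn(|i| i as u32);
    let mut transposed = [0u32; 1024];
    let mut restored = [0u32; 1024];
    transpose(&input, &mut transposed);
    untranspose(&transposed, &mut restored);
    assert_eq!(restored, input);
    println!("first transposed values: {:?}", &transposed[..4]);
}
```

A manually unrolled version would replace the loop body with straight-line code for every index, which is where the large amount of generated assembly comes from.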

Are there other tricks, similar to unrolling the transpose loops, that help avoid the disadvantages of AVX-512? (I couldn't find any other difference between this implementation and mine.)

I'm also curious why the code in this repo seems mostly fine even with AVX-512, while it makes such a huge difference in mine. Maybe it is due to a Rust/Zig difference?
