
[DRAFT] Perf: vectorize take_bits (up to 2.5x speedup) #8967

Draft: gstvg wants to merge 1 commit into apache:main from gstvg:perf_vectorize_take_bits

Conversation

@gstvg (Contributor) commented Dec 8, 2025

Which issue does this PR close?

Part of #279, related to #8879

Rationale for this change

EDIT: with -C target-cpu=sapphirerapids:

take bool 512           time:   [360.27 ns 360.48 ns 360.69 ns]
                        change: [−36.426% −36.251% −36.137%] (p = 0.00 < 0.05)
                        Performance has improved.

take bool 1024          time:   [468.75 ns 470.56 ns 474.14 ns]
                        change: [−54.442% −54.266% −54.083%] (p = 0.00 < 0.05)
                        Performance has improved.

take bool 8192          time:   [3.0099 µs 3.0481 µs 3.0882 µs]
                        change: [−59.645% −59.419% −59.186%] (p = 0.00 < 0.05)
                        Performance has improved.

What changes are included in this PR?

- Vectorized take_bits for indices without nulls (https://rust.godbolt.org/z/YnqnqccMK)
- Bitmask packing with intrinsics for SSE2/AVX2 (generic enough to be reused in other parts of the code)
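For context, a minimal scalar sketch of the operation being vectorized (illustrative names and signature, not the kernel's actual API): for each requested index, read one bit from the source bitmap and append it to the output bitmap.

```rust
/// Hypothetical scalar sketch of `take_bits` (illustrative, not the PR's
/// actual code): gather the bit at each requested index from `values`
/// (an LSB-first bitmap) and pack the gathered bits into a new bitmap.
fn take_bits(values: &[u8], indices: &[u32]) -> Vec<u8> {
    let mut out = vec![0u8; (indices.len() + 7) / 8];
    for (i, &idx) in indices.iter().enumerate() {
        // Read bit `idx` from the source bitmap...
        let bit = (values[(idx / 8) as usize] >> (idx % 8)) & 1;
        // ...and set bit `i` in the output bitmap.
        out[i / 8] |= bit << (i % 8);
    }
    out
}
```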

Are these changes tested?

Currently, no; tests will be added if it is decided that this PR should move forward.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 8, 2025
@gstvg gstvg force-pushed the perf_vectorize_take_bits branch from 0475fa3 to 37c1e6b Compare December 8, 2025 07:47
@gstvg (Contributor, Author) commented Dec 8, 2025

Hi @Dandandan, do you think this speedup justifies the use of intrinsics? Taking into account that it affects not only boolean arrays but every array with nulls, and that the intrinsics can also be reused in other parts of the code, like the cmp kernels. Thanks!

@gstvg (Contributor, Author) commented Dec 8, 2025

The justification for using intrinsics is that LLVM currently generates suboptimal vectorization for bitmask packing; see llvm/llvm-project#96395 and llvm/llvm-project#121691.
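To make the issue concrete, this is the shape of the scalar packing loop in question (an illustrative sketch, not the PR's code). LLVM currently auto-vectorizes this pattern poorly, whereas on SSE2 a single `_mm_movemask_epi8` can pack 16 bool-bytes into a mask.

```rust
// Illustrative sketch of scalar bitmask packing (not the PR's actual code):
// turn a slice of bools into an LSB-first packed bitmap. This is the loop
// shape the linked LLVM issues show being vectorized suboptimally.
fn pack_bits_scalar(bools: &[bool]) -> Vec<u8> {
    let mut out = vec![0u8; (bools.len() + 7) / 8];
    for (i, &b) in bools.iter().enumerate() {
        out[i / 8] |= (b as u8) << (i % 8);
    }
    out
}
```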

@alamb (Contributor) commented Dec 10, 2025

run benchmark take_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing perf_vectorize_take_bits (37c1e6b) to a67cd19 diff
BENCH_NAME=take_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench take_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=perf_vectorize_take_bits
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                                                     main                                   perf_vectorize_take_bits
-----                                                                     ----                                   ------------------------
take bool 1024                                                            1.06   1331.0±5.50ns        ? ?/sec    1.00  1257.5±33.70ns        ? ?/sec
take bool 512                                                             1.05   731.5±13.23ns        ? ?/sec    1.00    693.9±8.14ns        ? ?/sec
take bool null indices 1024                                               1.00  1628.8±22.38ns        ? ?/sec    1.00  1633.4±36.37ns        ? ?/sec
take bool null values 1024                                                1.18      2.6±0.02µs        ? ?/sec    1.00      2.2±0.01µs        ? ?/sec
take bool null values null indices 1024                                   1.08      3.3±0.06µs        ? ?/sec    1.00      3.1±0.01µs        ? ?/sec
take check bounds i32 1024                                                1.00   1621.6±4.54ns        ? ?/sec    1.00   1625.4±8.99ns        ? ?/sec
take check bounds i32 512                                                 1.00    830.6±2.59ns        ? ?/sec    1.04    867.2±7.44ns        ? ?/sec
take i32 1024                                                             1.02    728.7±2.67ns        ? ?/sec    1.00    711.5±1.52ns        ? ?/sec
take i32 512                                                              1.00    394.0±3.23ns        ? ?/sec    1.12    441.8±5.24ns        ? ?/sec
take i32 null indices 1024                                                1.00   998.8±20.07ns        ? ?/sec    1.00    995.2±9.91ns        ? ?/sec
take i32 null values 1024                                                 1.10      2.1±0.18µs        ? ?/sec    1.00  1889.7±35.88ns        ? ?/sec
take i32 null values null indices 1024                                    1.00      2.6±0.08µs        ? ?/sec    1.02      2.7±0.08µs        ? ?/sec
take primitive fsb value len: 12, indices: 1024                           1.00      8.4±0.11µs        ? ?/sec    1.00      8.4±0.10µs        ? ?/sec
take primitive fsb value len: 12, null values, indices: 1024              1.00      9.1±0.29µs        ? ?/sec    1.14     10.4±0.12µs        ? ?/sec
take primitive run logical len: 1024, physical len: 512, indices: 1024    1.00     20.8±0.23µs        ? ?/sec    1.00     20.8±0.12µs        ? ?/sec
take str 1024                                                             1.00     11.0±0.27µs        ? ?/sec    1.00     11.0±0.27µs        ? ?/sec
take str 512                                                              1.01      5.3±0.06µs        ? ?/sec    1.00      5.3±0.03µs        ? ?/sec
take str null indices 1024                                                1.02      6.9±0.04µs        ? ?/sec    1.00      6.8±0.04µs        ? ?/sec
take str null indices 512                                                 1.00      3.3±0.03µs        ? ?/sec    1.02      3.4±0.02µs        ? ?/sec
take str null values 1024                                                 1.04      8.7±0.14µs        ? ?/sec    1.00      8.4±0.09µs        ? ?/sec
take str null values null indices 1024                                    1.00      6.4±0.04µs        ? ?/sec    1.02      6.6±0.04µs        ? ?/sec
take stringview 1024                                                      1.00    776.3±5.20ns        ? ?/sec    1.01   784.6±15.08ns        ? ?/sec
take stringview 512                                                       1.00    478.4±2.06ns        ? ?/sec    1.16   554.2±16.13ns        ? ?/sec
take stringview null indices 1024                                         1.00  1399.2±29.55ns        ? ?/sec    1.00  1396.4±23.55ns        ? ?/sec
take stringview null indices 512                                          1.00    684.5±4.05ns        ? ?/sec    1.03   705.3±24.91ns        ? ?/sec
take stringview null values 1024                                          1.15      2.1±0.01µs        ? ?/sec    1.00  1789.2±22.60ns        ? ?/sec
take stringview null values null indices 1024                             1.01      2.9±0.03µs        ? ?/sec    1.00      2.9±0.02µs        ? ?/sec

@gstvg (Contributor, Author) commented Dec 11, 2025

BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench take_kernels

I forgot to mention that I ran the benchmarks with -C target-cpu=sapphirerapids 🤦. I believe the performance starts to improve with Broadwell and up.
But if our main target is the generic CPU, then the added complexity is definitely not worth it, and we can safely close this, no problem.
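One possible middle ground (a hedged sketch, not what this PR does): select the intrinsic path at runtime with `is_x86_feature_detected!`, so a generic build without `-C target-cpu` can still take it. `pack32` and `pack32_avx2` are illustrative names; the payload here packs 32 bool-bytes into a u32 bitmask.

```rust
/// Portable fallback plus AVX2 fast path, selected at runtime
/// (illustrative sketch, not the PR's actual code).
fn pack32(bytes: &[u8; 32]) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safe: AVX2 availability was just verified at runtime.
            return unsafe { pack32_avx2(bytes) };
        }
    }
    // Portable scalar fallback: bit i is set iff byte i is non-zero.
    bytes
        .iter()
        .enumerate()
        .fold(0u32, |m, (i, &b)| m | (((b != 0) as u32) << i))
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn pack32_avx2(bytes: &[u8; 32]) -> u32 {
    use core::arch::x86_64::*;
    let v = _mm256_loadu_si256(bytes.as_ptr() as *const __m256i);
    // Lanes equal to zero become 0xFF; movemask collects each byte's high
    // bit, so inverting the result marks the non-zero lanes.
    let is_zero = _mm256_cmpeq_epi8(v, _mm256_setzero_si256());
    !(_mm256_movemask_epi8(is_zero) as u32)
}
```

The cost is a branch per call (or per batch, if hoisted), which is the usual trade-off against compile-time `target-cpu` selection.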

@gstvg (Contributor, Author) commented Dec 11, 2025

Just tested on AMD; the improvements are also lower:

Details

-C target-cpu=znver3

take bool 512           time:   [483.26 ns 483.65 ns 484.09 ns]
                        change: [−12.699% −12.572% −12.446%] (p = 0.00 < 0.05)
                        Performance has improved.

take bool 1024          time:   [872.58 ns 873.03 ns 873.43 ns]
                        change: [−13.936% −13.860% −13.775%] (p = 0.00 < 0.05)
                        Performance has improved.

take bool 8192          time:   [5.6960 µs 5.7022 µs 5.7096 µs]
                        change: [−23.644% −23.566% −23.486%] (p = 0.00 < 0.05)
                        Performance has improved.

-C target-cpu=znver4

take bool 512           time:   [447.41 ns 447.57 ns 447.74 ns]
                        change: [−9.4770% −9.2941% −9.0973%] (p = 0.00 < 0.05)
                        Performance has improved.

take bool 1024          time:   [787.11 ns 787.30 ns 787.49 ns]
                        change: [−12.403% −12.247% −12.099%] (p = 0.00 < 0.05)
                        Performance has improved.

take bool 8192          time:   [5.6537 µs 5.6553 µs 5.6569 µs]
                        change: [−14.204% −14.055% −13.929%] (p = 0.00 < 0.05)
                        Performance has improved.

The lower gains on Zen 4 than Zen 3 are due to llvm/llvm-project#91370; targeting znver3 on a Zen 4 CPU improves things a bit.

-C target-cpu=znver5

take bool 512           time:   [325.74 ns 325.76 ns 325.79 ns]
                        change: [−6.9312% −6.6598% −6.4780%] (p = 0.00 < 0.05)
                        Performance has improved.

take bool 1024          time:   [492.13 ns 492.42 ns 492.67 ns]
                        change: [−22.795% −22.746% −22.694%] (p = 0.00 < 0.05)
                        Performance has improved.

take bool 8192          time:   [3.3653 µs 3.3656 µs 3.3658 µs]
                        change: [−25.208% −25.171% −25.137%] (p = 0.00 < 0.05)
                        Performance has improved.
