
[DRAFT] Perf: vectorize take_bits (up to 2.5x speedup) #8967

Draft: gstvg wants to merge 1 commit into apache:main from gstvg:perf_vectorize_take_bits

Conversation

@gstvg (Contributor) commented Dec 8, 2025

Which issue does this PR close?

Part of #279, related to #8879

Rationale for this change

EDIT: with -C target-cpu=sapphirerapids:

take bool 512           time:   [360.27 ns 360.48 ns 360.69 ns]
                        change: [−36.426% −36.251% −36.137%] (p = 0.00 < 0.05)
                        Performance has improved.

take bool 1024          time:   [468.75 ns 470.56 ns 474.14 ns]
                        change: [−54.442% −54.266% −54.083%] (p = 0.00 < 0.05)
                        Performance has improved.

take bool 8192          time:   [3.0099 µs 3.0481 µs 3.0882 µs]
                        change: [−59.645% −59.419% −59.186%] (p = 0.00 < 0.05)
                        Performance has improved.

What changes are included in this PR?

- Vectorized take_bits for indices without nulls (https://rust.godbolt.org/z/YnqnqccMK)
- Bitmask packing with intrinsics for SSE2/AVX2 (generic enough to be reused in other parts of the code)
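For context, a minimal scalar sketch of the operation being vectorized (illustrative names and signature, not the kernel's actual API): for each requested index, read one bit from the source bitmap and append it to the output bitmap.

```rust
/// Hypothetical scalar sketch of `take_bits` (illustrative, not the PR's
/// actual code): gather the bit at each requested index from `values`
/// (an LSB-first bitmap) and pack the gathered bits into a new bitmap.
fn take_bits(values: &[u8], indices: &[u32]) -> Vec<u8> {
    let mut out = vec![0u8; (indices.len() + 7) / 8];
    for (i, &idx) in indices.iter().enumerate() {
        // Read bit `idx` from the source bitmap...
        let bit = (values[(idx / 8) as usize] >> (idx % 8)) & 1;
        // ...and set bit `i` in the output bitmap.
        out[i / 8] |= bit << (i % 8);
    }
    out
}
```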

Are these changes tested?

Currently, no; tests will be added if it is decided that this PR should move forward.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 8, 2025
@gstvg gstvg force-pushed the perf_vectorize_take_bits branch from 0475fa3 to 37c1e6b Compare December 8, 2025 07:47
@gstvg (Contributor, Author) commented Dec 8, 2025

Hi @Dandandan, do you think this speedup justifies the use of intrinsics? Taking into account that it affects not only boolean arrays but every array with nulls, and that the intrinsics can also be reused in other parts of the code, like the cmp kernels. Thanks!

@gstvg (Contributor, Author) commented Dec 8, 2025

The justification for using intrinsics is that LLVM currently generates suboptimal vectorization for bitmask packing; see llvm/llvm-project#96395 and llvm/llvm-project#121691.
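To make the issue concrete, this is the shape of the scalar packing loop in question (an illustrative sketch, not the PR's code). LLVM currently auto-vectorizes this pattern poorly, whereas on SSE2 a single `_mm_movemask_epi8` can pack 16 bool-bytes into a mask.

```rust
// Illustrative sketch of scalar bitmask packing (not the PR's actual code):
// turn a slice of bools into an LSB-first packed bitmap. This is the loop
// shape the linked LLVM issues show being vectorized suboptimally.
fn pack_bits_scalar(bools: &[bool]) -> Vec<u8> {
    let mut out = vec![0u8; (bools.len() + 7) / 8];
    for (i, &b) in bools.iter().enumerate() {
        out[i / 8] |= (b as u8) << (i % 8);
    }
    out
}
```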

@alamb (Contributor) commented Dec 10, 2025

run benchmark take_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing perf_vectorize_take_bits (37c1e6b) to a67cd19 diff
BENCH_NAME=take_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench take_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=perf_vectorize_take_bits
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                                                     main                                   perf_vectorize_take_bits
-----                                                                     ----                                   ------------------------
take bool 1024                                                            1.06   1331.0±5.50ns        ? ?/sec    1.00  1257.5±33.70ns        ? ?/sec
take bool 512                                                             1.05   731.5±13.23ns        ? ?/sec    1.00    693.9±8.14ns        ? ?/sec
take bool null indices 1024                                               1.00  1628.8±22.38ns        ? ?/sec    1.00  1633.4±36.37ns        ? ?/sec
take bool null values 1024                                                1.18      2.6±0.02µs        ? ?/sec    1.00      2.2±0.01µs        ? ?/sec
take bool null values null indices 1024                                   1.08      3.3±0.06µs        ? ?/sec    1.00      3.1±0.01µs        ? ?/sec
take check bounds i32 1024                                                1.00   1621.6±4.54ns        ? ?/sec    1.00   1625.4±8.99ns        ? ?/sec
take check bounds i32 512                                                 1.00    830.6±2.59ns        ? ?/sec    1.04    867.2±7.44ns        ? ?/sec
take i32 1024                                                             1.02    728.7±2.67ns        ? ?/sec    1.00    711.5±1.52ns        ? ?/sec
take i32 512                                                              1.00    394.0±3.23ns        ? ?/sec    1.12    441.8±5.24ns        ? ?/sec
take i32 null indices 1024                                                1.00   998.8±20.07ns        ? ?/sec    1.00    995.2±9.91ns        ? ?/sec
take i32 null values 1024                                                 1.10      2.1±0.18µs        ? ?/sec    1.00  1889.7±35.88ns        ? ?/sec
take i32 null values null indices 1024                                    1.00      2.6±0.08µs        ? ?/sec    1.02      2.7±0.08µs        ? ?/sec
take primitive fsb value len: 12, indices: 1024                           1.00      8.4±0.11µs        ? ?/sec    1.00      8.4±0.10µs        ? ?/sec
take primitive fsb value len: 12, null values, indices: 1024              1.00      9.1±0.29µs        ? ?/sec    1.14     10.4±0.12µs        ? ?/sec
take primitive run logical len: 1024, physical len: 512, indices: 1024    1.00     20.8±0.23µs        ? ?/sec    1.00     20.8±0.12µs        ? ?/sec
take str 1024                                                             1.00     11.0±0.27µs        ? ?/sec    1.00     11.0±0.27µs        ? ?/sec
take str 512                                                              1.01      5.3±0.06µs        ? ?/sec    1.00      5.3±0.03µs        ? ?/sec
take str null indices 1024                                                1.02      6.9±0.04µs        ? ?/sec    1.00      6.8±0.04µs        ? ?/sec
take str null indices 512                                                 1.00      3.3±0.03µs        ? ?/sec    1.02      3.4±0.02µs        ? ?/sec
take str null values 1024                                                 1.04      8.7±0.14µs        ? ?/sec    1.00      8.4±0.09µs        ? ?/sec
take str null values null indices 1024                                    1.00      6.4±0.04µs        ? ?/sec    1.02      6.6±0.04µs        ? ?/sec
take stringview 1024                                                      1.00    776.3±5.20ns        ? ?/sec    1.01   784.6±15.08ns        ? ?/sec
take stringview 512                                                       1.00    478.4±2.06ns        ? ?/sec    1.16   554.2±16.13ns        ? ?/sec
take stringview null indices 1024                                         1.00  1399.2±29.55ns        ? ?/sec    1.00  1396.4±23.55ns        ? ?/sec
take stringview null indices 512                                          1.00    684.5±4.05ns        ? ?/sec    1.03   705.3±24.91ns        ? ?/sec
take stringview null values 1024                                          1.15      2.1±0.01µs        ? ?/sec    1.00  1789.2±22.60ns        ? ?/sec
take stringview null values null indices 1024                             1.01      2.9±0.03µs        ? ?/sec    1.00      2.9±0.02µs        ? ?/sec

@gstvg (Contributor, Author) commented Dec 11, 2025

BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench take_kernels

I forgot to mention that I ran the benchmarks with -C target-cpu=sapphirerapids 🤦. I believe the performance starts to improve with Broadwell and up.
But if our main target is the generic CPU, then the added complexity is definitely not worth it, and we can safely close this, no problem.
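One possible middle ground (a hedged sketch, not what this PR does): select the intrinsic path at runtime with `is_x86_feature_detected!`, so a generic build without `-C target-cpu` can still take it. `pack32` and `pack32_avx2` are illustrative names; the payload here packs 32 bool-bytes into a u32 bitmask.

```rust
/// Portable fallback plus AVX2 fast path, selected at runtime
/// (illustrative sketch, not the PR's actual code).
fn pack32(bytes: &[u8; 32]) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safe: AVX2 availability was just verified at runtime.
            return unsafe { pack32_avx2(bytes) };
        }
    }
    // Portable scalar fallback: bit i is set iff byte i is non-zero.
    bytes
        .iter()
        .enumerate()
        .fold(0u32, |m, (i, &b)| m | (((b != 0) as u32) << i))
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn pack32_avx2(bytes: &[u8; 32]) -> u32 {
    use core::arch::x86_64::*;
    let v = _mm256_loadu_si256(bytes.as_ptr() as *const __m256i);
    // Lanes equal to zero become 0xFF; movemask collects each byte's high
    // bit, so inverting the result marks the non-zero lanes.
    let is_zero = _mm256_cmpeq_epi8(v, _mm256_setzero_si256());
    !(_mm256_movemask_epi8(is_zero) as u32)
}
```

The cost is a branch per call (or per batch, if hoisted), which is the usual trade-off against compile-time `target-cpu` selection.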

@gstvg (Contributor, Author) commented Dec 11, 2025

Just tested on AMD; the improvements are also lower:

Details

-C target-cpu=znver3

take bool 512           time:   [483.26 ns 483.65 ns 484.09 ns]
                        change: [−12.699% −12.572% −12.446%] (p = 0.00 < 0.05)
                        Performance has improved.

take bool 1024          time:   [872.58 ns 873.03 ns 873.43 ns]
                        change: [−13.936% −13.860% −13.775%] (p = 0.00 < 0.05)
                        Performance has improved.

take bool 8192          time:   [5.6960 µs 5.7022 µs 5.7096 µs]
                        change: [−23.644% −23.566% −23.486%] (p = 0.00 < 0.05)
                        Performance has improved.

-C target-cpu=znver4

take bool 512           time:   [447.41 ns 447.57 ns 447.74 ns]
                        change: [−9.4770% −9.2941% −9.0973%] (p = 0.00 < 0.05)
                        Performance has improved.

take bool 1024          time:   [787.11 ns 787.30 ns 787.49 ns]
                        change: [−12.403% −12.247% −12.099%] (p = 0.00 < 0.05)
                        Performance has improved.

take bool 8192          time:   [5.6537 µs 5.6553 µs 5.6569 µs]
                        change: [−14.204% −14.055% −13.929%] (p = 0.00 < 0.05)
                        Performance has improved.

The lower gains on Zen 4 than Zen 3 are due to llvm/llvm-project#91370; targeting znver3 on a Zen 4 CPU improves things a bit.

-C target-cpu=znver5

take bool 512           time:   [325.74 ns 325.76 ns 325.79 ns]
                        change: [−6.9312% −6.6598% −6.4780%] (p = 0.00 < 0.05)
                        Performance has improved.

take bool 1024          time:   [492.13 ns 492.42 ns 492.67 ns]
                        change: [−22.795% −22.746% −22.694%] (p = 0.00 < 0.05)
                        Performance has improved.

take bool 8192          time:   [3.3653 µs 3.3656 µs 3.3658 µs]
                        change: [−25.208% −25.171% −25.137%] (p = 0.00 < 0.05)
                        Performance has improved.
