Skip to content

Conversation

@zanmato1984
Copy link
Contributor

@zanmato1984 zanmato1984 commented Aug 20, 2025

Rationale for this change

In order to support special form (#47374), the kernels have to respect the selection vector. Currently none of the kernels does. And it's almost impossible for us to make all existing kernels to respect the selection vector at once (and we probably never will). Thus we need an incremental way to add selection-vector-aware kernels on demand, meanwhile accommodate legacy (selection-vector-non-aware) kernels to be executed "selection-vector-aware"-ly in a general manner - the idea is to first "gather" selected rows from the batch into a new batch, evaluate the expression on the new batch, then "scatter" the result rows into the positions where they belong in the original batch.

This makes the take and scatter functions dependencies of the exec facilities, which is in compute core (libarrow). And take is already in compute core. Now we need to move scatter.

I'm implementing the selective execution of kernels in #47377, including invoking take and scatter as explained above. And I have to write tests of that in exec_test.cc which is deliberately declared to be NOT depending on libarrow_compute.

What changes are included in this PR?

Move scatter compute function into compute core.

Are these changes tested?

Yes. Manually tested.

Are there any user-facing changes?

None.

@zanmato1984
Copy link
Contributor Author

Hi @raulcd @pitrou , mind to take a look? Thanks.

@pitrou
Copy link
Member

pitrou commented Aug 20, 2025

  1. Can you add the rationale directly to the PR description so that we don't have to click through the issue link?
  2. Which non-compute functionality will required the scatter function?

@zanmato1984
Copy link
Contributor Author

zanmato1984 commented Aug 20, 2025

  1. Can you add the rationale directly to the PR description so that we don't have to click through the issue link?

Yes, done.

  1. Which non-compute functionality will required the scatter function?

It will be required by another piece of compute functionality in libarrow (NOT libarrow_compute) - exec.cc. You can see it in my draft PR https://github.com/apache/arrow/pull/47377/files#diff-5c72e5d873cc3d8515401c31ddc1953949989e20057d557f5fd1581c65be3d8bR905-R906 . Though I can actually do this change in #47377 , I want to split things as much as possible - #47377 is a tough one (but I believe it's the right way).

@pitrou
Copy link
Member

pitrou commented Aug 20, 2025

What is the size increase of libarrow.so in a typical release build?

@zanmato1984
Copy link
Contributor Author

On my M1 Mac, the libarrow.so size increased from 10.36M (10,858,816) to 10.80M (11,323,664), diff 453.95K (464,848).

@pitrou
Copy link
Member

pitrou commented Aug 21, 2025

I also tried locally on Ubuntu 22.04, on a RelWithDebInfo build:

  • Before:
$ ls -la libarrow*.so.2200.0.0
-rwxrwxr-x 1 antoine antoine 11031504 août  21 11:14 libarrow_acero.so.2200.0.0
-rwxrwxr-x 1 antoine antoine 96039376 août  21 11:14 libarrow_compute.so.2200.0.0
-rwxrwxr-x 1 antoine antoine  1015536 août  21 11:14 libarrow_cuda.so.2200.0.0
-rwxrwxr-x 1 antoine antoine 12404648 août  21 11:14 libarrow_dataset.so.2200.0.0
-rwxrwxr-x 1 antoine antoine 89553280 août  21 11:13 libarrow.so.2200.0.0
-rwxrwxr-x 1 antoine antoine  7548160 août  21 11:14 libarrow_testing.so.2200.0.0
$ size --format=GNU libarrow*.so.2200.0.0
      text       data        bss      total filename
   1231480     436253       1088    1668821 libarrow_acero.so.2200.0.0
  12275032    2116773      44632   14436437 libarrow_compute.so.2200.0.0
    123700      85770       3016     212486 libarrow_cuda.so.2200.0.0
   1028813     613674       3832    1646319 libarrow_dataset.so.2200.0.0
  10766408    3843396    2180945   16790749 libarrow.so.2200.0.0
   1008311     497462       2600    1508373 libarrow_testing.so.2200.0.0
  • After:
$ ls -la libarrow*.so.2200.0.0
-rwxrwxr-x 1 antoine antoine 11031504 août  21 11:15 libarrow_acero.so.2200.0.0
-rwxrwxr-x 1 antoine antoine 92921808 août  21 11:15 libarrow_compute.so.2200.0.0
-rwxrwxr-x 1 antoine antoine  1015536 août  21 11:15 libarrow_cuda.so.2200.0.0
-rwxrwxr-x 1 antoine antoine 12404648 août  21 11:15 libarrow_dataset.so.2200.0.0
-rwxrwxr-x 1 antoine antoine 92660976 août  21 11:15 libarrow.so.2200.0.0
-rwxrwxr-x 1 antoine antoine  7548160 août  21 11:15 libarrow_testing.so.2200.0.0
$ size --format=GNU libarrow*.so.2200.0.0
      text       data        bss      total filename
   1231480     436253       1088    1668821 libarrow_acero.so.2200.0.0
  11678104    2069268      45544   13792916 libarrow_compute.so.2200.0.0
    123700      85770       3016     212486 libarrow_cuda.so.2200.0.0
   1028813     613674       3832    1646319 libarrow_dataset.so.2200.0.0
  11360776    3889882    2183513   17434171 libarrow.so.2200.0.0
   1008311     497462       2600    1508373 libarrow_testing.so.2200.0.0

So, libarrow.so grows in size by ~4%. @raulcd

@raulcd
Copy link
Member

raulcd commented Aug 22, 2025

A 4% increase in libarrow.so size (from ~85MB to ~93MB) isn't great for memory footprint / minimal deployments, also this sets a precedent and future implications for moving functionality back from Arrow compute which isn't great. There doesn't seem to be much of an alternative though. Making this conditional would add complexity without much benefit. Breaking up vector_swizzle.cc also doesn't seem to be possible as both scatter and inverse_permutation are required.

I can't think of any alternatives and the necessity for selection-vector-aware seems justified.

We probably should have some guidelines of when/how those changes are acceptable / justified for future reference.

@zanmato1984
Copy link
Contributor Author

Hi @pitrou @raulcd , do you think we can move on with this? Thanks.

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting review Awaiting review labels Aug 26, 2025
@zanmato1984
Copy link
Contributor Author

I'll merge. Thanks @pitrou @raulcd for reviewing!

@zanmato1984 zanmato1984 merged commit 6f6138b into apache:main Aug 27, 2025
39 of 40 checks passed
@zanmato1984 zanmato1984 removed the awaiting merge Awaiting merge label Aug 27, 2025
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 6f6138b.

There weren't enough matching historic benchmark results to make a call on whether there were regressions.

The full Conbench report has more details.

kou pushed a commit that referenced this pull request Sep 3, 2025
#47448)

### Rationale for this change

The Meson configuration for compute was broken after the vector swizzle change was introduced; this gets the Meson configuration at parity with CMake again

### What changes are included in this PR?

Move `vector_swizzle.cc` to `libarrow` from `libarrow_compute`.

See also: #47378

### Are these changes tested?

Yes 
### Are there any user-facing changes?

No
* GitHub Issue: #47446

Authored-by: Will Ayd <william.ayd@icloud.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants