GH-47376: [C++][Compute] Support selective execution for kernels #47377
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the openness of the Apache Arrow project. Could you also rename the pull request title to follow the requested format?
### Rationale for this change

In order to support special forms (#47374), kernels have to respect the selection vector. Currently none of the kernels do, and it is practically impossible to make all existing kernels respect the selection vector at once (we probably never will). We therefore need an incremental way to add selection-vector-aware kernels on demand, while still allowing legacy (selection-vector-unaware) kernels to be executed in a selection-vector-aware manner generically: first "gather" the selected rows from the batch into a new batch, evaluate the expression on the new batch, then "scatter" the result rows back to the positions where they belong in the original batch. This makes the `take` and `scatter` functions dependencies of the exec facilities, which live in compute core (libarrow). `take` is already in compute core; now we need to move `scatter`. I'm implementing the selective execution of kernels in #47377, including invoking `take` and `scatter` as explained above, and the tests for that live in `exec_test.cc`, which is deliberately declared NOT to depend on libarrow_compute.

### What changes are included in this PR?

Move the scatter compute function into compute core.

### Are these changes tested?

Yes, manually tested.

### Are there any user-facing changes?

None.

* GitHub Issue: #47375

Authored-by: Rossi Sun <zanmato1984@gmail.com>
Signed-off-by: Rossi Sun <zanmato1984@gmail.com>
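The gather/scatter pattern described above can be illustrated with a minimal, Arrow-free sketch (plain `std::vector` stands in for arrays, and `Gather`/`Scatter` are hypothetical stand-ins for the `take` and `scatter` compute functions, not the actual Arrow API):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Gather: pull the selected rows into a dense, contiguous batch.
std::vector<int32_t> Gather(const std::vector<int32_t>& values,
                            const std::vector<int64_t>& selection) {
  std::vector<int32_t> dense;
  dense.reserve(selection.size());
  for (int64_t idx : selection) dense.push_back(values[idx]);
  return dense;
}

// Scatter: write the dense results back to their original row positions.
void Scatter(const std::vector<int32_t>& dense,
             const std::vector<int64_t>& selection,
             std::vector<int32_t>* out) {
  for (std::size_t i = 0; i < selection.size(); ++i) {
    (*out)[selection[i]] = dense[i];
  }
}
```

The expression is then evaluated on the dense batch between the two calls, so legacy kernels never have to know a selection vector exists.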
Attaching benchmark results. The benchmark employs a trivial kernel that does nothing but spin a specified number of times to simulate CPU intensity; each batch has 4k rows.

- Baseline: regular kernel, no selection vector.
- Sparse: selective kernel, with a selection vector of selectivity from 0 to 100%.
- Dense: regular kernel enclosed by gather/scatter, with a selection vector of selectivity from 0 to 100%.
Some interesting comparisons to note:
Hi @pitrou @bkietz @westonpace @felipecrv, I know this is a big one, but I do hope some of you can help review this PR. It is the most critical prerequisite for the if_else special form. Much appreciated!
Kindly ping @pitrou @bkietz @westonpace @felipecrv.
I have a subsequent PR depending on this one (it has been almost ready in my local branch for quite a while). I would really appreciate it if some reviewer could help move this one forward. Thanks a lot. @pitrou @bkietz @westonpace @felipecrv
I'll be on vacation next week, so I won't be able to take a look at this for the next ~10 days.
No problem at all, thanks for the heads-up! Just wanted to make sure this PR stays on the radar. Have a great vacation!
I've got my next PR for special forms almost ready; only some comments and docs are left. I've pushed it to my own repo: zanmato1984#64, in case you want to see how selective execution is actually used to implement special forms. That PR is derived from this one, so it contains the same content (once this one gets merged I'll rebase it). So I really hope this one can be reviewed and merged soon. @pitrou @bkietz @westonpace @felipecrv Thanks.
Kindly ping @pitrou, @felipecrv. Did you have a chance to take a look? Thanks.
```cpp
using ArrayKernelSelectiveExec = Status (*)(KernelContext*, const ExecSpan&,
                                            const SelectionVectorSpan&, ExecResult*);
```
I don't have a suggestion yet, but if we plan to support bitmaps as well, it would probably be better to pass something here that can be either a selection vector or a bitmap mask. The alternative being yet another KernelExec -- ArrayKernelMaskedExec.
Starting to think that adding another KernelExec will probably be best.
Selection vectors are better than bitmasks for very selective filters. Bitmasks are better when the filter is not very selective. Bitmasks are less important than selection vectors because, when the filter is not very selective, computing on every value is not as bad.
Good suggestion. Shall we do that in follow-up PRs?
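For context, the trade-off described above comes down to the shape of the kernel's inner loop. A minimal, Arrow-free sketch of the two loop shapes (helper names are illustrative, not Arrow API):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Selection-vector driven: work is proportional to the number of selected
// rows, so it wins when the filter is highly selective (few rows pass).
int64_t SumSelected(const std::vector<int64_t>& values,
                    const std::vector<int32_t>& selection) {
  int64_t sum = 0;
  for (int32_t row : selection) sum += values[row];
  return sum;
}

// Bitmask driven: every row is visited, but the sequential access pattern
// tends to win when most rows pass the filter.
int64_t SumMasked(const std::vector<int64_t>& values,
                  const std::vector<bool>& mask) {
  int64_t sum = 0;
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (mask[i]) sum += values[i];
  }
  return sum;
}
```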
```cpp
/// handling (intersect validity bitmaps of inputs).
/// \brief Add a kernel with given input/output types and exec API, no selective exec
/// API, no required state initialization, preallocation for fixed-width types, and
/// default null handling (intersect validity bitmaps of inputs).
```
I think you can keep this one as is.
I think this is being as clear as the rest of the style ("no required state" etc.)
```cpp
if (selection_vector_) {
  selection_length_ = selection_vector_->length();
} else {
  selection_length_ = 0;
```
I think it's less confusing if, without a selection, the "length of the selection" were the length of the whole array.
Hmm, I may see it otherwise.
The naming of the three selection_*_ members implies they are tightly coupled (with selection_vector_ being the "leader"). If selection_vector_ is null, then the value of selection_length_ makes no sense, and 0 is closer to the meaning of "nonsense" (less so than -1, though), I guess?
cpp/src/arrow/compute/exec.cc (outdated):
```cpp
while (indices_begin + num_indices < indices_end &&
       *(indices_begin + num_indices) < chunk_row_id_end) {
  ++num_indices;
}
```
This is a slow placeholder, right? You will have to do an Exponential search from n - 1 to 0.
Good catch. Replaced with a log(N) complexity std::lower_bound().
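For reference, the idea behind that replacement: because the selection indices are sorted, the first index at or beyond the chunk boundary can be found with a binary search instead of a linear scan. A minimal sketch (hypothetical helper name, not the actual Arrow code):

```cpp
#include <algorithm>
#include <cstdint>

// Given sorted selection indices, count how many fall before the end of the
// current chunk. std::lower_bound finds the first index >= chunk_row_id_end
// in O(log N), so this replaces the O(N) linear scan.
int64_t CountIndicesInChunk(const int32_t* indices_begin,
                            const int32_t* indices_end,
                            int32_t chunk_row_id_end) {
  return std::lower_bound(indices_begin, indices_end, chunk_row_id_end) -
         indices_begin;
}
```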
```cpp
///
/// We are not yet using this so this is mostly a placeholder for now.
///
/// [1]: http://cidrdb.org/cidr2005/papers/P19.pdf
```
Nice!
```cpp
void SetSlice(int64_t offset, int64_t length, int32_t index_back_shift = 0);

int32_t operator[](int64_t i) const {
  return indices_[i + offset_] - index_back_shift_;
```
When you use this class in loops, you will probably get better assembly if it's copied into a local variable (on the stack) before the loop, so that SROA [1] can kick in and keep all these members in registers.
My concern is that exposing index_back_shift would be too verbose and error-prone; better to use some encapsulation to hide it. Maybe let the span accept a lambda, within which we can write more compiler-friendly code while keeping index_back_shift hidden?
```cpp
inline void Spin(volatile int64_t count) {
  while (count-- > 0) {
    // Do nothing, just burn CPU cycles.
```
compiler probably optimizes this away
ok, now I see the volatile.
```cpp
VisitSelectionVectorSpanInline(const SelectionVectorSpan& selection,
                               OnSelectionFn&& on_selection) {
  for (int64_t i = 0; i < selection.length(); ++i) {
    RETURN_NOT_OK(on_selection(selection[i]));
```
In theory, returning a Status is a cheap and simple (to the compiler) operation, but in practice it's not. Consider requiring a function that returns bool instead. If it always returns true, inlining will remove the early-return branches inside the loop.
Sorry, I don't get it. Could you elaborate a bit?
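For reference, a bool-returning visitor along the lines suggested above might look like this sketch (hypothetical names, not the actual Arrow code):

```cpp
#include <cstdint>
#include <vector>

// The functor returns bool; returning false stops the visit early. When the
// inlined functor always returns true (the common case for infallible
// kernels), the compiler can fold the early-exit branch away entirely,
// whereas a Status return value (a non-trivial object that must be
// constructed and inspected each iteration) tends to prevent that.
template <typename Fn>
bool VisitSelection(const std::vector<int32_t>& selection, Fn&& fn) {
  for (int32_t row : selection) {
    if (!fn(row)) return false;
  }
  return true;
}
```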
I think this looks good. I hope @pitrou is open and excited about the idea of kernels fused with filtering from selection vectors.
Haha. I'm open on the principle. I just need to make time to look at the many details, sorry...
Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
This reverts commit db970f1e89722dcf65b09553c51ed1b1a4ede3ce.
```cpp
std::vector<Datum> values(batch.num_values());
for (int i = 0; i < batch.num_values(); ++i) {
  if (batch[i].is_scalar()) {
    // XXX: Skip gather for scalars since it is not currently supported by Take.
```
Technically it's not necessary. But the drawback is that we lose the ability to uniformly call Take on any Datum: we have to make sure it's not a scalar and go through a special path, like here, for scalars.
I think maybe we can simply return the scalar as-is from Take (to allow uniform invocation on an arbitrary Datum). Or we insist that taking a scalar makes no sense and do special checks everywhere.
```cpp
  return ExecuteBatch(batch, listener);
}

Datum WrapResults(const std::vector<Datum>& inputs,
```
It's an override of a public method of its parent class, KernelExecutor::WrapResults().
```cpp
} else {
  DCHECK(val.is_array());
  arrays.emplace_back(val.make_array());
}
```
Plus, as a fairly independent free function, I think there's no harm in extending it a little to support chunked arrays.
```cpp
  return kernel_->selective_exec(kernel_ctx_, input, *selection, out);
}
return kernel_->exec(kernel_ctx_, input, out);
```
> This pre-condition that a non-null selection implies a non-null `selective_exec` is very specific.

Sorry, I don't get it. Both call sites have the possibility that selection is non-null, and we need to make sure that selective_exec is also non-null. In other words, if we inlined it, the code would be exactly the same in both places.
Or are you suggesting something performance-wise?
Rationale for this change
In order to support special forms (#47374), being able to execute a kernel "selective"-ly becomes a prerequisite. As mentioned in #47374, we need an incremental way to add selective kernels on demand, while accommodating arbitrary legacy kernels so that they can be executed selectively in a general manner.
What changes are included in this PR?
- `ArrayKernelSelectiveExec(KernelContext*, const ExecSpan&, const SelectionVectorSpan&, ExecResult*)` in the kernel. This is the entry for selectively executing the kernel on a batch with a given selection vector. The kernel author can provide a dedicated implementation of this kernel API so the kernel can be executed "sparse"-ly: only the rows indicated by the selection vector are processed. Otherwise, selective execution falls back to a general "dense" path: gather the selected rows into a new contiguous (dense) batch, execute the kernel using the non-selective exec API, then scatter the result back to the original row positions.
- `ScalarExecutor` with dense execution ability.

Are these changes tested?
Tested and benchmarked.
Are there any user-facing changes?
None.
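The sparse/dense dispatch described above can be modeled with a minimal, Arrow-free sketch (`std::function` and `std::vector` stand in for the real kernel and batch types; all names are illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using Exec = std::function<std::vector<int32_t>(const std::vector<int32_t>&)>;
using SelectiveExec = std::function<std::vector<int32_t>(
    const std::vector<int32_t>&, const std::vector<int32_t>&)>;

// If the kernel supplies a selective (sparse) exec, use it directly;
// otherwise gather the selected rows, run the regular exec densely, and
// scatter the results back to their original positions.
std::vector<int32_t> ExecuteSelective(const std::vector<int32_t>& batch,
                                      const std::vector<int32_t>& selection,
                                      const Exec& exec,
                                      const SelectiveExec& selective_exec) {
  if (selective_exec) return selective_exec(batch, selection);  // sparse path
  // Dense fallback: gather -> exec -> scatter.
  std::vector<int32_t> dense;
  dense.reserve(selection.size());
  for (int32_t row : selection) dense.push_back(batch[row]);
  std::vector<int32_t> result = exec(dense);
  std::vector<int32_t> out(batch.size(), 0);
  for (std::size_t i = 0; i < selection.size(); ++i) out[selection[i]] = result[i];
  return out;
}
```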