
Conversation

@zanmato1984
Contributor

@zanmato1984 zanmato1984 commented Aug 20, 2025

Rationale for this change

In order to support special forms (#47374), being able to execute kernels selectively becomes a prerequisite. As mentioned in #47374, we need an incremental way to add selective kernels on demand, while still accommodating arbitrary legacy kernels so that they can be executed selectively in a general manner.

What changes are included in this PR?

  1. Enrich the selection vector/span functionalities.
  2. Introduce an optional API ArrayKernelSelectiveExec(KernelContext*, const ExecSpan&, const SelectionVectorSpan&, ExecResult*) in the kernel. This is the entry point for selectively executing the kernel on a batch with a given selection vector. The kernel author can provide a dedicated implementation of this API so the kernel can be executed "sparse"-ly: only the rows indicated by the selection vector are processed. Otherwise, selective execution falls back to a general "dense" path: gather the selected rows into a new contiguous (dense) batch, execute the kernel using the non-selective exec API, then scatter the result back to the original row positions.
  3. Extend ScalarExecutor with dense execution ability.
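
The dense fallback in item 2 can be sketched as follows. This is a minimal self-contained illustration using plain `std::vector`; `DenseFallbackExec` and the lambda kernel are hypothetical stand-ins, not the actual Arrow C++ API:

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Sketch of the "dense" fallback path: gather the selected rows into a
// contiguous batch, run the regular (non-selective) kernel on it, then
// scatter the results back to their original row positions.
std::vector<int32_t> DenseFallbackExec(
    const std::vector<int32_t>& batch, const std::vector<int64_t>& selection,
    const std::function<int32_t(int32_t)>& kernel) {
  // Gather: build a dense batch containing only the selected rows.
  std::vector<int32_t> dense;
  dense.reserve(selection.size());
  for (int64_t row : selection) dense.push_back(batch[row]);
  // Execute the regular kernel on the contiguous batch.
  for (auto& v : dense) v = kernel(v);
  // Scatter: write results back to the original row positions; unselected
  // rows keep their input values here (in Arrow they would be left undefined).
  std::vector<int32_t> out = batch;
  for (size_t i = 0; i < selection.size(); ++i) out[selection[i]] = dense[i];
  return out;
}
```
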

Are these changes tested?

Tested and benchmarked.

Are there any user-facing changes?

None.

@zanmato1984 zanmato1984 marked this pull request as draft August 20, 2025 07:17
@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}


@zanmato1984 zanmato1984 changed the title Special form #2: Selective kernel execution GH-47376: [C++][Compute] Support selective execution for kernels Aug 20, 2025
@github-actions

⚠️ GitHub issue #47376 has been automatically assigned in GitHub to PR creator.

@github-actions

⚠️ GitHub issue #47376 has no components, please add labels for components.

zanmato1984 added a commit that referenced this pull request Aug 27, 2025

### Rationale for this change

In order to support special forms (#47374), the kernels have to respect the selection vector. Currently none of them do, and it's almost impossible for us to make all existing kernels respect the selection vector at once (and we probably never will). Thus we need an incremental way to add selection-vector-aware kernels on demand, while accommodating legacy (selection-vector-unaware) kernels so they can still be executed against a selection vector in a general manner - the idea is to first "gather" the selected rows from the batch into a new batch, evaluate the expression on the new batch, then "scatter" the result rows back into the positions where they belong in the original batch.

This makes the `take` and `scatter` functions dependencies of the exec facilities, which is in compute core (libarrow). And `take` is already in compute core. Now we need to move `scatter`.
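
The `scatter` step described above can be sketched with a tiny stand-in, using `std::optional` in place of Arrow's validity bitmap. `ScatterSketch` is an illustrative name, not the actual compute function's signature:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Sketch of scatter: each result row is written back to the position its
// selection index names; unselected positions come out null.
std::vector<std::optional<int64_t>> ScatterSketch(
    const std::vector<int64_t>& values, const std::vector<int32_t>& indices,
    int64_t out_length) {
  std::vector<std::optional<int64_t>> out(out_length, std::nullopt);
  for (size_t i = 0; i < indices.size(); ++i) out[indices[i]] = values[i];
  return out;
}
```
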

I'm implementing the selective execution of kernels in #47377, including invoking `take` and `scatter` as explained above. And I have to write tests for that in `exec_test.cc`, which deliberately does NOT depend on libarrow_compute.

### What changes are included in this PR?

Move scatter compute function into compute core.

### Are these changes tested?

Yes. Manually tested.

### Are there any user-facing changes?

None.
* GitHub Issue: #47375

Authored-by: Rossi Sun <zanmato1984@gmail.com>
Signed-off-by: Rossi Sun <zanmato1984@gmail.com>
@zanmato1984 zanmato1984 force-pushed the fix/gh-47376 branch 2 times, most recently from 0cce8c3 to 99a3217 Compare August 28, 2025 04:11
@zanmato1984 zanmato1984 force-pushed the fix/gh-47376 branch 2 times, most recently from 99098d2 to c005bbb Compare September 10, 2025 07:00
@zanmato1984 zanmato1984 force-pushed the fix/gh-47376 branch 7 times, most recently from c502acc to 469ead1 Compare September 12, 2025 01:20
@zanmato1984 zanmato1984 marked this pull request as ready for review September 13, 2025 01:09
@zanmato1984
Contributor Author

Attaching benchmark results. The benchmark employs a trivial kernel that does nothing but spin a specified number of times to simulate CPU intensity; the batch has 4k rows.

Baseline: Regular kernel, no selection vector.

BM_ExecBaseline/kernel_intensity:0/num_rows:4096                                 2292 ns         2292 ns       307973 items_per_second=1.78713G/s
BM_ExecBaseline/kernel_intensity:20/num_rows:4096                               22858 ns        22856 ns        30544 items_per_second=179.211M/s
BM_ExecBaseline/kernel_intensity:40/num_rows:4096                               43164 ns        43161 ns        16245 items_per_second=94.9008M/s
BM_ExecBaseline/kernel_intensity:60/num_rows:4096                               63424 ns        63419 ns        11244 items_per_second=64.5862M/s
BM_ExecBaseline/kernel_intensity:80/num_rows:4096                               94636 ns        94625 ns         7412 items_per_second=43.2867M/s
BM_ExecBaseline/kernel_intensity:100/num_rows:4096                             113438 ns       113416 ns         6185 items_per_second=36.1148M/s

Sparse: Selective kernel, with a selection vector of selectivity from 0 to 100%.

BM_ExecSelective/sparse/selectivity:0/kernel_intensity:0/num_rows:4096            358 ns          358 ns      2036026 items_per_second=11.4467G/s
BM_ExecSelective/sparse/selectivity:20/kernel_intensity:0/num_rows:4096           892 ns          892 ns       782333 items_per_second=4.59122G/s
BM_ExecSelective/sparse/selectivity:50/kernel_intensity:0/num_rows:4096          1711 ns         1711 ns       405743 items_per_second=2.39461G/s
BM_ExecSelective/sparse/selectivity:100/kernel_intensity:0/num_rows:4096         3144 ns         3137 ns       225033 items_per_second=1.30557G/s
BM_ExecSelective/sparse/selectivity:0/kernel_intensity:20/num_rows:4096           331 ns          331 ns      2011963 items_per_second=12.3736G/s
BM_ExecSelective/sparse/selectivity:20/kernel_intensity:20/num_rows:4096         5008 ns         5007 ns       139395 items_per_second=818.003M/s
BM_ExecSelective/sparse/selectivity:50/kernel_intensity:20/num_rows:4096        11977 ns        11976 ns        59925 items_per_second=342.019M/s
BM_ExecSelective/sparse/selectivity:100/kernel_intensity:20/num_rows:4096       23337 ns        23336 ns        30261 items_per_second=175.525M/s
BM_ExecSelective/sparse/selectivity:0/kernel_intensity:40/num_rows:4096           352 ns          352 ns      2041411 items_per_second=11.6226G/s
BM_ExecSelective/sparse/selectivity:20/kernel_intensity:40/num_rows:4096         8984 ns         8983 ns        76190 items_per_second=455.96M/s
BM_ExecSelective/sparse/selectivity:50/kernel_intensity:40/num_rows:4096        22603 ns        22511 ns        31604 items_per_second=181.959M/s
BM_ExecSelective/sparse/selectivity:100/kernel_intensity:40/num_rows:4096       45411 ns        45408 ns        15878 items_per_second=90.2046M/s
BM_ExecSelective/sparse/selectivity:0/kernel_intensity:60/num_rows:4096           356 ns          356 ns      1930640 items_per_second=11.4986G/s
BM_ExecSelective/sparse/selectivity:20/kernel_intensity:60/num_rows:4096        16540 ns        16539 ns        42808 items_per_second=247.659M/s
BM_ExecSelective/sparse/selectivity:50/kernel_intensity:60/num_rows:4096        45119 ns        45117 ns        14423 items_per_second=90.7867M/s
BM_ExecSelective/sparse/selectivity:100/kernel_intensity:60/num_rows:4096      116450 ns       116443 ns         5983 items_per_second=35.176M/s
BM_ExecSelective/sparse/selectivity:0/kernel_intensity:80/num_rows:4096           341 ns          341 ns      2030557 items_per_second=12.0073G/s
BM_ExecSelective/sparse/selectivity:20/kernel_intensity:80/num_rows:4096        19608 ns        19607 ns        35973 items_per_second=208.904M/s
BM_ExecSelective/sparse/selectivity:50/kernel_intensity:80/num_rows:4096        48072 ns        48069 ns        14415 items_per_second=85.2115M/s
BM_ExecSelective/sparse/selectivity:100/kernel_intensity:80/num_rows:4096       95814 ns        95808 ns         7362 items_per_second=42.7521M/s
BM_ExecSelective/sparse/selectivity:0/kernel_intensity:100/num_rows:4096          354 ns          354 ns      1978016 items_per_second=11.578G/s
BM_ExecSelective/sparse/selectivity:20/kernel_intensity:100/num_rows:4096       23879 ns        23877 ns        29476 items_per_second=171.545M/s
BM_ExecSelective/sparse/selectivity:50/kernel_intensity:100/num_rows:4096       58874 ns        58870 ns        11791 items_per_second=69.577M/s
BM_ExecSelective/sparse/selectivity:100/kernel_intensity:100/num_rows:4096     117665 ns       117519 ns         6018 items_per_second=34.854M/s

Dense: Regular kernel enclosed by gather/scatter, with a selection vector of selectivity from 0 to 100%.

BM_ExecSelective/dense/selectivity:0/kernel_intensity:0/num_rows:4096            2986 ns         2985 ns       232954 items_per_second=1.372G/s
BM_ExecSelective/dense/selectivity:20/kernel_intensity:0/num_rows:4096           5137 ns         5136 ns       139173 items_per_second=797.448M/s
BM_ExecSelective/dense/selectivity:50/kernel_intensity:0/num_rows:4096           8414 ns         8413 ns        80579 items_per_second=486.869M/s
BM_ExecSelective/dense/selectivity:100/kernel_intensity:0/num_rows:4096          8773 ns         8756 ns        80353 items_per_second=467.815M/s
BM_ExecSelective/dense/selectivity:0/kernel_intensity:20/num_rows:4096           2973 ns         2973 ns       234434 items_per_second=1.37767G/s
BM_ExecSelective/dense/selectivity:20/kernel_intensity:20/num_rows:4096          9483 ns         9482 ns        74308 items_per_second=431.964M/s
BM_ExecSelective/dense/selectivity:50/kernel_intensity:20/num_rows:4096         19647 ns        19646 ns        36317 items_per_second=208.492M/s
BM_ExecSelective/dense/selectivity:100/kernel_intensity:20/num_rows:4096        31157 ns        31155 ns        22985 items_per_second=131.473M/s
BM_ExecSelective/dense/selectivity:0/kernel_intensity:40/num_rows:4096           3012 ns         3012 ns       227259 items_per_second=1.36004G/s
BM_ExecSelective/dense/selectivity:20/kernel_intensity:40/num_rows:4096         16151 ns        16147 ns        42675 items_per_second=253.665M/s
BM_ExecSelective/dense/selectivity:50/kernel_intensity:40/num_rows:4096         35706 ns        35704 ns        20202 items_per_second=114.722M/s
BM_ExecSelective/dense/selectivity:100/kernel_intensity:40/num_rows:4096        59021 ns        59017 ns        10707 items_per_second=69.4042M/s
BM_ExecSelective/dense/selectivity:0/kernel_intensity:60/num_rows:4096           3020 ns         3020 ns       234537 items_per_second=1.35637G/s
BM_ExecSelective/dense/selectivity:20/kernel_intensity:60/num_rows:4096         18433 ns        18432 ns        27608 items_per_second=222.225M/s
BM_ExecSelective/dense/selectivity:50/kernel_intensity:60/num_rows:4096         55907 ns        55903 ns        10499 items_per_second=73.2698M/s
BM_ExecSelective/dense/selectivity:100/kernel_intensity:60/num_rows:4096       108566 ns       108559 ns         9644 items_per_second=37.7305M/s
BM_ExecSelective/dense/selectivity:0/kernel_intensity:80/num_rows:4096           3000 ns         3000 ns       239374 items_per_second=1.36546G/s
BM_ExecSelective/dense/selectivity:20/kernel_intensity:80/num_rows:4096         24141 ns        24140 ns        29050 items_per_second=169.68M/s
BM_ExecSelective/dense/selectivity:50/kernel_intensity:80/num_rows:4096         55833 ns        55830 ns        12568 items_per_second=73.3658M/s
BM_ExecSelective/dense/selectivity:100/kernel_intensity:80/num_rows:4096       102989 ns       102963 ns         6737 items_per_second=39.7812M/s
BM_ExecSelective/dense/selectivity:0/kernel_intensity:100/num_rows:4096          3010 ns         3010 ns       231996 items_per_second=1.36102G/s
BM_ExecSelective/dense/selectivity:20/kernel_intensity:100/num_rows:4096        28224 ns        28222 ns        24899 items_per_second=145.134M/s
BM_ExecSelective/dense/selectivity:50/kernel_intensity:100/num_rows:4096        65519 ns        65515 ns        10595 items_per_second=62.5201M/s
BM_ExecSelective/dense/selectivity:100/kernel_intensity:100/num_rows:4096      122453 ns       122444 ns         5579 items_per_second=33.452M/s

@zanmato1984
Contributor Author

Some interesting comparisons to note:

  1. When selectivity is high, sparse execution is slightly slower than the baseline due to the indirection introduced by accessing the selection vector:
    BM_ExecBaseline/kernel_intensity:0/num_rows:4096                                 2292 ns         2292 ns       307973 items_per_second=1.78713G/s
    BM_ExecSelective/sparse/selectivity:100/kernel_intensity:0/num_rows:4096         3144 ns         3137 ns       225033 items_per_second=1.30557G/s
    
  2. When selectivity is low, sparse execution is much faster than the baseline because the (unnecessary) computation for most rows is skipped:
    BM_ExecBaseline/kernel_intensity:0/num_rows:4096                                 2292 ns         2292 ns       307973 items_per_second=1.78713G/s
    BM_ExecSelective/sparse/selectivity:0/kernel_intensity:0/num_rows:4096            358 ns          358 ns      2036026 items_per_second=11.4467G/s
    
    Therefore we can reasonably expect better performance for cases like if (unlikely_condition) then heavy_kernel() else light_kernel() end. (Assuming an if_else special form is in place that executes each branch using a selection vector, and the kernels in both branches have selective exec.)
  3. As the kernel's CPU intensity increases, the performance benefit becomes more significant:
    BM_ExecBaseline/kernel_intensity:100/num_rows:4096                             113438 ns       113416 ns         6185 items_per_second=36.1148M/s
    BM_ExecSelective/sparse/selectivity:0/kernel_intensity:100/num_rows:4096          354 ns          354 ns      1978016 items_per_second=11.578G/s
    
  4. Falling back to dense execution (when the kernel doesn't supply a selective exec) is slow: up to 4x slower than the baseline and 3x slower than sparse execution:
    BM_ExecBaseline/kernel_intensity:0/num_rows:4096                                 2292 ns         2292 ns       307973 items_per_second=1.78713G/s
    BM_ExecSelective/sparse/selectivity:100/kernel_intensity:0/num_rows:4096         3144 ns         3137 ns       225033 items_per_second=1.30557G/s
    BM_ExecSelective/dense/selectivity:100/kernel_intensity:0/num_rows:4096          8773 ns         8756 ns        80353 items_per_second=467.815M/s
    
    This is not very surprising, because dense execution performs two extra function invocations (gather/scatter) under the hood.
  5. Low selectivity also benefits dense execution by reducing unnecessary computation - sometimes enough to outweigh the overhead of the aforementioned extra function invocations:
    BM_ExecBaseline/kernel_intensity:20/num_rows:4096                               22858 ns        22856 ns        30544 items_per_second=179.211M/s
    BM_ExecSelective/dense/selectivity:0/kernel_intensity:20/num_rows:4096           2973 ns         2973 ns       234434 items_per_second=1.37767G/s
    BM_ExecSelective/dense/selectivity:20/kernel_intensity:20/num_rows:4096          9483 ns         9482 ns        74308 items_per_second=431.964M/s
    BM_ExecSelective/dense/selectivity:50/kernel_intensity:20/num_rows:4096         19647 ns        19646 ns        36317 items_per_second=208.492M/s
    
  6. High CPU intensity of the kernel also amortizes the aforementioned dense execution overhead.
    BM_ExecBaseline/kernel_intensity:0/num_rows:4096                                 2292 ns         2292 ns       307973 items_per_second=1.78713G/s
    BM_ExecSelective/dense/selectivity:100/kernel_intensity:0/num_rows:4096          8773 ns         8756 ns        80353 items_per_second=467.815M/s
    
    BM_ExecBaseline/kernel_intensity:100/num_rows:4096                             113438 ns       113416 ns         6185 items_per_second=36.1148M/s
    BM_ExecSelective/dense/selectivity:100/kernel_intensity:100/num_rows:4096      122453 ns       122444 ns         5579 items_per_second=33.452M/s
    
    4x amortized to 1.08x.
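
The expectation in observation 2 can be sketched like this. It is a hedged illustration (hypothetical names, plain vectors) of how an if_else special form might drive both branches with selection vectors; the real implementation lives in the follow-up special-form PR:

```cpp
#include <cstdint>
#include <vector>

// Split rows into two selection vectors from the condition, run each branch's
// kernel only on its selected rows, and merge results by position. With an
// unlikely condition, the heavy kernel runs on very few rows.
std::vector<int64_t> IfElseSelective(
    const std::vector<bool>& cond, const std::vector<int64_t>& input,
    int64_t (*heavy_kernel)(int64_t), int64_t (*light_kernel)(int64_t)) {
  std::vector<int32_t> then_sel, else_sel;
  for (int32_t i = 0; i < static_cast<int32_t>(cond.size()); ++i) {
    (cond[i] ? then_sel : else_sel).push_back(i);
  }
  std::vector<int64_t> out(input.size());
  for (int32_t i : then_sel) out[i] = heavy_kernel(input[i]);
  for (int32_t i : else_sel) out[i] = light_kernel(input[i]);
  return out;
}
```
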

@zanmato1984
Contributor Author

Hi @pitrou @bkietz @westonpace @felipecrv , I know this is a big one, but I do hope some of you can help to review this PR - this is the most critical prerequisite for the if_else special form.

Appreciated!

@zanmato1984 zanmato1984 force-pushed the fix/gh-47376 branch 4 times, most recently from 966d999 to 9072a34 Compare October 7, 2025 23:44
@zanmato1984
Contributor Author

Kindly ping @pitrou @bkietz @westonpace @felipecrv .

@zanmato1984
Contributor Author

I have a subsequent PR depending on this one (it has been almost ready in my local branch for quite a while). I would really appreciate it if a reviewer could help move this one forward. Thanks a lot. @pitrou @bkietz @westonpace @felipecrv

@pitrou
Member

pitrou commented Oct 30, 2025

I'll be on vacation next week, so I won't be able to take a look at this before ~10 days.

@zanmato1984
Contributor Author

No problem at all, thanks for the heads-up! Just wanted to make sure this PR stays on the radar. Have a great vacation!

@zanmato1984
Contributor Author

My next PR for special forms is almost ready; only some comments and docs are left. I sent it to my own repo: zanmato1984#64, just in case you want to see how selective execution is actually utilized to implement special forms.

That PR is derived from this one, so it contains the same content (once this one gets merged I'll rebase that one). So I really hope this one can be reviewed and merged soon. @pitrou @bkietz @westonpace @felipecrv Thanks.

@zanmato1984
Contributor Author

Kindly ping @pitrou , @felipecrv . Did you have a chance to take a look? Thanks.

Comment on lines +562 to +563
using ArrayKernelSelectiveExec = Status (*)(KernelContext*, const ExecSpan&,
const SelectionVectorSpan&, ExecResult*);
Contributor

I don't have a suggestion yet, but if we plan to support bitmaps as well, it would probably be better to pass something here that can be either a selection vector or a bitmap mask. The alternative being yet another KernelExec -- ArrayKernelMaskedExec.

Contributor

Starting to think that adding another KernelExec will probably be best.

Selection vectors are better than bitmaps for very selective filters. Bitmaps are better when the filter is not very selective. Bitmaps are less important than selection vectors, because when the filter is not selective, computing on every value is not as bad.
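
To make the tradeoff concrete, here is a small illustrative sketch (not Arrow API): the same filter carried as a bitmap occupies one bit per row regardless of selectivity, while a selection vector holds one index per selected row, so its size scales with selectivity:

```cpp
#include <cstdint>
#include <vector>

// Convert a per-row bitmap into a selection vector of selected row indices.
// For a very selective filter the resulting vector is tiny; for a
// non-selective one it approaches the full row count.
std::vector<int32_t> BitmapToSelectionVector(const std::vector<bool>& bitmap) {
  std::vector<int32_t> selection;
  for (int32_t i = 0; i < static_cast<int32_t>(bitmap.size()); ++i) {
    if (bitmap[i]) selection.push_back(i);
  }
  return selection;
}
```
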

Contributor Author

Good suggestion. Shall we do that in follow-up PRs?

/// handling (intersect validity bitmaps of inputs).
/// \brief Add a kernel with given input/output types and exec API, no selective exec
/// API, no required state initialization, preallocation for fixed-width types, and
/// default null handling (intersect validity bitmaps of inputs).
Contributor

I think you can keep this one as is.

Contributor Author

I think this is as clear as the rest of the style ("no required state" etc.)

if (selection_vector_) {
selection_length_ = selection_vector_->length();
} else {
selection_length_ = 0;
Contributor

I think it's less confusing if, without a selection, the "length of the selection" were the length of the whole array.

Contributor Author

Hmm, I may see it otherwise.

The naming of the three selection_*_ members implies they are tightly coupled (with selection_vector_ being the "leader"). If selection_vector_ is null, then the value of selection_length_ makes no sense, so 0 is closer to the meaning of "nonsense" (though less so than -1), I guess?

Comment on lines 491 to 494
while (indices_begin + num_indices < indices_end &&
*(indices_begin + num_indices) < chunk_row_id_end) {
++num_indices;
}
Contributor

This is a slow placeholder, right? You will have to do an exponential search from n - 1 to 0.

https://en.wikipedia.org/wiki/Exponential_search

Contributor Author

Good catch. Replaced with an O(log N) std::lower_bound().
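
The fix can be sketched as follows. Variable names follow the diff snippet above, but this is a self-contained illustration, not the merged code:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Replace the linear scan with std::lower_bound: since the selection vector
// is sorted, the first index >= chunk_row_id_end bounds the run of indices
// belonging to the current chunk, found in O(log N).
int64_t CountIndicesInChunk(const int64_t* indices_begin,
                            const int64_t* indices_end,
                            int64_t chunk_row_id_end) {
  const int64_t* it =
      std::lower_bound(indices_begin, indices_end, chunk_row_id_end);
  return it - indices_begin;
}
```
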

Comment on lines -133 to 134
///
/// We are not yet using this so this is mostly a placeholder for now.
///
/// [1]: http://cidrdb.org/cidr2005/papers/P19.pdf
Contributor

Nice!

void SetSlice(int64_t offset, int64_t length, int32_t index_back_shift = 0);

int32_t operator[](int64_t i) const {
return indices_[i + offset_] - index_back_shift_;
Contributor

When you use this class in loops, you will probably get better assembly if it's copied into a local variable (on the "stack") before the loop, so SROA [1] kicks in and all these members can be kept in registers.

[1] https://blog.regehr.org/archives/1603
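
The caller-side copy being suggested can be sketched like this. `TinySelectionSpan` and `SumSelected` are illustrative stand-ins, not the actual Arrow class:

```cpp
#include <cstdint>
#include <vector>

// A tiny span with the same three members as the snippet above.
struct TinySelectionSpan {
  const int32_t* indices_;
  int64_t offset_;
  int32_t index_back_shift_;
  int32_t operator[](int64_t i) const {
    return indices_[i + offset_] - index_back_shift_;
  }
};

int64_t SumSelected(const TinySelectionSpan& span, int64_t length,
                    const std::vector<int64_t>& values) {
  // Copy the span into a local before the hot loop so scalar replacement of
  // aggregates (SROA) can keep its members in registers instead of reloading
  // them through the reference on every iteration.
  const TinySelectionSpan local = span;
  int64_t sum = 0;
  for (int64_t i = 0; i < length; ++i) sum += values[local[i]];
  return sum;
}
```
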

Contributor Author

My concern is that exposing the index_back_shift would be too verbose and error-prone. Better to use some encapsulation to hide it. Maybe let the span accept a lambda, within which we can write more compiler-friendly code while keeping the index_back_shift hidden?
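
A hypothetical sketch of that lambda-based encapsulation (illustrative names, not a proposed final API): the span copies its members into locals internally and drives the loop itself, so callers get the SROA-friendly code without ever seeing index_back_shift_:

```cpp
#include <cstdint>
#include <vector>

class SelectionSpanSketch {
 public:
  SelectionSpanSketch(const int32_t* indices, int64_t length,
                      int32_t index_back_shift)
      : indices_(indices), length_(length),
        index_back_shift_(index_back_shift) {}

  // Visit each (back-shifted) selection index. Locals keep the hot-loop state
  // in registers; the shift stays an implementation detail.
  template <typename Fn>
  void VisitIndices(Fn&& fn) const {
    const int32_t* indices = indices_;
    const int32_t shift = index_back_shift_;
    for (int64_t i = 0; i < length_; ++i) fn(indices[i] - shift);
  }

 private:
  const int32_t* indices_;
  int64_t length_;
  int32_t index_back_shift_;
};
```
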


inline void Spin(volatile int64_t count) {
while (count-- > 0) {
// Do nothing, just burn CPU cycles.
Contributor

compiler probably optimizes this away

ok, now I see the volatile.

VisitSelectionVectorSpanInline(const SelectionVectorSpan& selection,
OnSelectionFn&& on_selection) {
for (int64_t i = 0; i < selection.length(); ++i) {
RETURN_NOT_OK(on_selection(selection[i]));
Contributor

In theory, returning a Status is a cheap and simple (to the compiler) operation, but in practice it's not. Consider requiring a function that returns bool. If it always returns true, inlining will remove the early-return branches inside the loop.
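
My reading of the suggestion, as a self-contained sketch (not the Arrow API): the per-row callback returns bool instead of Status, so when the callback is an always-true lambda, inlining turns the early-exit check into dead code and the branch disappears from the hot loop:

```cpp
#include <cstdint>
#include <vector>

// Visit each selected row; stop early when the callback returns false.
// For an inlined callback that always returns true, the `if` is provably
// dead and the compiler can drop the branch entirely.
template <typename Fn>
bool VisitSelection(const std::vector<int32_t>& selection, Fn&& fn) {
  for (int32_t row : selection) {
    if (!fn(row)) return false;
  }
  return true;
}
```
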

Contributor Author

Sorry, I don't get it. Could you elaborate a bit?

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Dec 17, 2025
@felipecrv
Contributor

I think this looks good. I hope @pitrou is open to and excited about the idea of kernels fused with filtering from selection vectors.

@pitrou
Member

pitrou commented Dec 17, 2025

I think this looks good. I hope @pitrou is open to and excited about the idea of kernels fused with filtering from selection vectors.

Haha. I'm open in principle. I just need to make time to look at the many details, sorry...

Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Dec 17, 2025
std::vector<Datum> values(batch.num_values());
for (int i = 0; i < batch.num_values(); ++i) {
if (batch[i].is_scalar()) {
// XXX: Skip gather for scalars since it is not currently supported by Take.
Contributor Author

Technically it's not necessary. But the drawback is that we lose the ability to uniformly call Take on any Datum - we have to make sure it's not a scalar and go through a special path, like here, for scalars.

I think maybe we can simply return the scalar as is from Take (to allow uniform invocation on an arbitrary Datum). Or we insist that taking a scalar makes no sense and do special checks everywhere.
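
The "return the scalar as is" option can be sketched with a tiny Datum-like variant (`TinyDatum` and `TinyTake` are hypothetical, not the Arrow API): Take forwards scalars unchanged, so callers can invoke it uniformly without a special scalar path:

```cpp
#include <cstdint>
#include <variant>
#include <vector>

using TinyDatum =
    std::variant<int64_t /*scalar*/, std::vector<int64_t> /*array*/>;

TinyDatum TinyTake(const TinyDatum& datum,
                   const std::vector<int32_t>& indices) {
  if (std::holds_alternative<int64_t>(datum)) {
    // A scalar logically broadcasts to every row, so taking is a no-op.
    return datum;
  }
  const auto& arr = std::get<std::vector<int64_t>>(datum);
  std::vector<int64_t> out;
  out.reserve(indices.size());
  for (int32_t i : indices) out.push_back(arr[i]);
  return out;
}
```
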



return ExecuteBatch(batch, listener);
}

Datum WrapResults(const std::vector<Datum>& inputs,
Contributor Author

It's an override of the public method KernelExecutor::WrapResults() of its parent class.

} else {
DCHECK(val.is_array());
arrays.emplace_back(val.make_array());
}
Contributor Author

Plus, as a fairly independent free function, I think there's no harm in extending it a little to support chunked arrays.


return kernel_->selective_exec(kernel_ctx_, input, *selection, out);
}
return kernel_->exec(kernel_ctx_, input, out);
}
Contributor Author

This pre-condition that non-null selection implies non-null selective_exec is very specific.

Sorry, I don't get it. The two callsites both have the possibility that selection is non-null, and we need to make sure that selective_exec is also non-null. In other words, if we inlined it, the code would be exactly the same in both places.

Or are you suggesting something performance-wise?


