Skip to content

Implements Sum,sum_checked,min,max,is Distict,inverse for REE. #7933

Closed
ghost wants to merge 3 commits intomainfrom
unknown repository
Closed

Implements Sum,sum_checked,min,max,is Distict,inverse for REE. #7933
ghost wants to merge 3 commits intomainfrom
unknown repository

Conversation

@ghost
Copy link
Copy Markdown

@ghost ghost commented Jul 15, 2025

Which issue does this PR close?

This PR works towards closing the larger REE Epic

Rationale for this change

Add operations onto the REE datatype such as

  • Sum
  • Sum_checked
  • IS DISTINCT FROM
  • Max/Min
  • Distinct (Arrow already handles this apparently)

What changes are included in this PR?

Allows for REE columns to be used for the previously mentioned functions correctly and efficently.

Are these changes tested?

Yes, comprehensive tests have been added in arrow-ord/src/cmp.rs and arrow-arith/src/aggregate.rs:

  • Basic functionality tests: Same values, different values, mixed scenarios
  • Edge case tests: Empty arrays, single runs, all nulls, mixed nulls and values
  • Data type tests: Float64 and Timestamp types
  • Both operations: distinct and not_distinct operations
  • The tests verify that REE distinct operations work correctly without array expansion and handle null values properly.

Are there any user-facing changes?

Performance improvement: REE distinct operations are now much faster for datasets with repeated values
No API changes: Existing distinct() and not_distinct() functions work the same way but are now more efficient for REE arrays
No breaking changes: All existing functionality is preserved

If there are user-facing changes then we may require documentation to be updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.

…ncludes helper functions for expanding REE into logical represention
@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 15, 2025
@ghost ghost force-pushed the baah/AggregationSupport branch from 4ded72e to 05fb3c0 Compare July 17, 2025 18:47
@ghost ghost force-pushed the baah/AggregationSupport branch from 05fb3c0 to 72bd81a Compare July 17, 2025 18:52
Comment thread arrow-arith/src/aggregate.rs Outdated
Comment thread arrow-ord/src/cmp.rs Outdated
Comment thread arrow-ord/src/cmp.rs Outdated
Comment thread arrow-ord/src/cmp.rs Outdated
Comment thread arrow-arith/src/aggregate.rs Outdated
Comment thread arrow-arith/src/aggregate.rs Outdated
Comment thread arrow-arith/src/aggregate.rs Outdated
Comment thread arrow-array/src/array/mod.rs Outdated
Comment thread arrow-array/src/array/run_array.rs
Comment thread arrow-array/src/array/run_array.rs Outdated
Comment thread arrow-array/src/array/run_array.rs Outdated
Comment thread arrow-ord/src/cmp.rs Outdated
Comment thread arrow-ord/src/cmp.rs Outdated
Comment thread arrow-ord/src/lib.rs Outdated
@ghost ghost force-pushed the baah/AggregationSupport branch from 5688ad3 to 9d00687 Compare July 24, 2025 15:01
@ghost ghost changed the title [Draft] implements Sum,sum_checked,min,max,is Distict,inverse for REE. Implements Sum,sum_checked,min,max,is Distict,inverse for REE. Jul 24, 2025
@ghost ghost marked this pull request as ready for review July 25, 2025 13:54
@Jefffrey
Copy link
Copy Markdown
Contributor

Jefffrey commented Feb 5, 2026

Just want to checkup on this PR, if it's still being worked on? Looking to see if we can get REE support pushed along in arrow-rs

@brunal
Copy link
Copy Markdown
Contributor

brunal commented Feb 8, 2026

According to #3520 (comment), the work was done while @rich-t-kid-datadog was an intern, and he's not available anymore. I can pick it up -- which should mostly be resolving merge conflicts.

@Jefffrey
Copy link
Copy Markdown
Contributor

Jefffrey commented Feb 9, 2026

That sounds good, I'm willing to help review this to get it along

@brunal
Copy link
Copy Markdown
Contributor

brunal commented Feb 10, 2026

This isn't ready as-is:

  • aggregate works, but is inefficient for min-max: it goes through all values (cost O(n*log(n)) for n values), when it could just look at the run-ends child array (cost O(# runs) < O(n))
  • cmp only implements distinct, and is inefficient as well (binary searches for logical->physical index)

I'll send a first PR just for aggregate, and later a second PR for cmp.

@brunal
Copy link
Copy Markdown
Contributor

brunal commented Feb 13, 2026

Aggregate work done in #9409. In particular, it takes care of sliced Ree arrays.

@brunal
Copy link
Copy Markdown
Contributor

brunal commented Feb 21, 2026

CMP done in #9448

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Mar 18, 2026

Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look

@alamb alamb marked this pull request as draft March 18, 2026 14:16
@ghost ghost closed this by deleting the head repository Mar 25, 2026
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants