Skip to content

feat: Add Data/LazyFrame.gather#27028

Draft
gab23r wants to merge 13 commits intopola-rs:mainfrom
gab23r:frame.gather
Draft

feat: Add Data/LazyFrame.gather#27028
gab23r wants to merge 13 commits intopola-rs:mainfrom
gab23r:frame.gather

Conversation

@gab23r
Copy link
Copy Markdown
Contributor

@gab23r gab23r commented Mar 24, 2026

Fixes: #27023

Claude code was used. I carefully review it and think it is correct.

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Mar 24, 2026
@gab23r gab23r changed the title feat: Add Data/LazyFrame.gather feat(python): Add Data/LazyFrame.gather Mar 24, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 63.41463% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.43%. Comparing base (4a25cac) to head (dd05a44).
⚠️ Report is 10 commits behind head on main.

Files with missing lines Patch % Lines
crates/polars-plan/src/plans/ir/tree_format.rs 0.00% 5 Missing ⚠️
...rates/polars-python/src/lazyframe/visitor/nodes.rs 0.00% 5 Missing ⚠️
crates/polars-plan/src/plans/ir/format.rs 0.00% 4 Missing ⚠️
crates/polars-plan/src/plans/ir/dot.rs 0.00% 3 Missing ⚠️
crates/polars-plan/src/plans/visitor/hash.rs 0.00% 3 Missing ⚠️
crates/polars-python/src/lazyframe/general.rs 87.50% 2 Missing ⚠️
py-polars/src/polars/dataframe/frame.py 60.00% 1 Missing and 1 partial ⚠️
py-polars/src/polars/lazyframe/frame.py 60.00% 1 Missing and 1 partial ⚠️
crates/polars-mem-engine/src/executors/gather.rs 91.66% 1 Missing ⚠️
.../polars-plan/src/plans/conversion/dsl_to_ir/mod.rs 0.00% 1 Missing ⚠️
... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #27028      +/-   ##
==========================================
- Coverage   81.51%   81.43%   -0.08%     
==========================================
  Files        1810     1812       +2     
  Lines      249583   249826     +243     
  Branches     3141     3143       +2     
==========================================
+ Hits       203444   203447       +3     
- Misses      45334    45572     +238     
- Partials      805      807       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ritchie46
Copy link
Copy Markdown
Member

This is not how we would want this to be implemented. The expressions are now evaluated multiple times and could not always be removed with CSEE. In a gather it is very like it cannot be.

@gab23r gab23r marked this pull request as draft March 25, 2026 16:03
@gab23r
Copy link
Copy Markdown
Contributor Author

gab23r commented Mar 25, 2026

Understood, the eager part, I used _select_rows.
For the lazy part, do you want a proper gather implementation in rust ?

@gab23r
Copy link
Copy Markdown
Contributor Author

gab23r commented Mar 25, 2026

I went ahead and implemented a proper gather node for lazy execution.
I did a small benchmark out of curiosity:

import polars as pl
import numpy as np

n_rows, n_indices = 1_000_000, 100_000

df = pl.DataFrame({"a": np.random.randn(n_rows), "b": np.random.randn(n_rows)})
indices = np.random.randint(0, n_rows, size=n_indices).tolist()

%timeit _ = df.gather(indices)
# 29.8 ms ± 1.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit _ = df.lazy().gather(indices).collect()
# 16.1 ms ± 610 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# x1.84 speedup

For streaming, this will fallback to in memory

@gab23r gab23r changed the title feat(python): Add Data/LazyFrame.gather feat: Add Data/LazyFrame.gather Mar 25, 2026
@github-actions github-actions bot added the changes-dsl Do not merge if this label is present and red. label Mar 25, 2026
@gab23r gab23r marked this pull request as ready for review March 25, 2026 22:41
@gab23r gab23r requested review from orlp and wence- as code owners March 25, 2026 22:41
Copy link
Copy Markdown
Member

@orlp orlp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indices needs to support arbitrary expressions, not just take an Arc<[IdxSize]> (which is a type I don't like seeing regardless). This means you need a dedicated node in streaming with two independent inputs (since they might not have the same length), and something similar for in-memory.

@gab23r
Copy link
Copy Markdown
Contributor Author

gab23r commented Mar 26, 2026

i added the dedicated node, so now LazyFrame.gather can take an expression. But I am not sure how to manage the dst -> ir.

this make the plan a bit weird:

lf = pl.LazyFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})

print(lf.gather([0]).explain())
# GATHER
#   DF ["a", "b"]; PROJECT */2 COLUMNS
#   SELECT [Series.strict_cast(UInt32)]
#     DF []; PROJECT */0 COLUMNS

@gab23r
Copy link
Copy Markdown
Contributor Author

gab23r commented Mar 26, 2026

The gather takes two independent inputs, so maybe accepting Expr was maybe a foot-gun... I am afraid that here we will miss CSE, WDYT ?

@gab23r gab23r marked this pull request as draft April 15, 2026 09:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changes-dsl Do not merge if this label is present and red. enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add gather to Dataframe

3 participants