Data/LazyFrame.gather
Codecov Report
@@ Coverage Diff @@
## main #27028 +/- ##
==========================================
- Coverage 81.51% 81.43% -0.08%
==========================================
Files 1810 1812 +2
Lines 249583 249826 +243
Branches 3141 3143 +2
==========================================
+ Hits 203444 203447 +3
- Misses 45334 45572 +238
- Partials 805 807 +2
|
|
This is not how we would want this to be implemented. The expressions are now evaluated multiple times, and the duplicates cannot always be removed by CSEE. In a gather it is very likely that they cannot be. |
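To make the concern concrete, here is a toy model in plain Python. It is only a sketch and not Polars internals: `index_expr` is a hypothetical stand-in for an arbitrary index expression, and the counter shows how often it runs when the gather is expanded per column versus evaluated once and reused.

```python
# Toy model only: plain Python, not Polars internals.
calls = {"n": 0}


def index_expr():
    """Hypothetical stand-in for an arbitrary (possibly expensive) index expression."""
    calls["n"] += 1
    return [2, 0, 2]


columns = {"a": [10, 20, 30], "b": [1.0, 2.0, 3.0]}

# Per-column rewrite: the index expression is re-evaluated for every column.
calls["n"] = 0
per_column = {name: [col[i] for i in index_expr()] for name, col in columns.items()}
evals_per_column = calls["n"]

# Dedicated gather: evaluate the indices once, then reuse them for all columns.
calls["n"] = 0
idx = index_expr()
single_pass = {name: [col[i] for i in idx] for name, col in columns.items()}
evals_single_pass = calls["n"]

print(evals_per_column, evals_single_pass)
```

Both variants produce the same rows; only the number of evaluations of the index expression differs, and it grows with the number of columns in the per-column form.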
|
Understood, the eager part, I used |
|
I went ahead and implemented a proper gather node for lazy execution.

import polars as pl
import numpy as np
n_rows, n_indices = 1_000_000, 100_000
df = pl.DataFrame({"a": np.random.randn(n_rows), "b": np.random.randn(n_rows)})
indices = np.random.randint(0, n_rows, size=n_indices).tolist()
%timeit _ = df.gather(indices)
# 29.8 ms ± 1.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit _ = df.lazy().gather(indices).collect()
# 16.1 ms ± 610 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# x1.84 speedup

For streaming, this will fall back to in-memory execution. |
orlp
left a comment
The indices need to support arbitrary expressions, not just an Arc<[IdxSize]> (which is a type I don't like seeing regardless). This means you need a dedicated node in streaming with two independent inputs (since they might not have the same length), and something similar for in-memory.
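As a rough illustration of the "two independent inputs" shape, here is a minimal plain-Python sketch. The name `gather_node` and its signature are hypothetical and are not the Polars implementation; the point is only that the data input and the index input are separate and may have different lengths.

```python
from typing import Sequence


def gather_node(data: dict[str, Sequence], indices: Sequence[int]) -> dict[str, list]:
    """Hypothetical gather over two independent inputs: a frame and an index column."""
    heights = {len(col) for col in data.values()}
    if len(heights) > 1:
        raise ValueError("all data columns must have the same height")
    height = heights.pop() if heights else 0

    # The index input is validated against the data height, not against its own length.
    for i in indices:
        if not 0 <= i < height:
            raise IndexError(f"index {i} is out of bounds for height {height}")

    # The output height equals the number of indices, independent of the data height.
    return {name: [col[i] for i in indices] for name, col in data.items()}


data = {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]}
shorter = gather_node(data, [0])                 # 1 index against 4 rows
longer = gather_node(data, [3, 3, 0, 1, 2, 0])   # 6 indices against 4 rows
print(shorter, longer)
```

The output height follows the index input, so the result can be shorter or longer than the data input.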
|
I added the dedicated node, so LazyFrame.gather can now take an expression. But I am not sure how to manage the DSL -> IR conversion. This makes the plan a bit weird:

lf = pl.LazyFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})
print(lf.gather([0]).explain())
# GATHER
# DF ["a", "b"]; PROJECT */2 COLUMNS
# SELECT [Series.strict_cast(UInt32)]
# DF []; PROJECT */0 COLUMNS |
|
The gather takes two independent inputs, so accepting an Expr may have been a foot-gun... I am afraid that we will miss CSE here. WDYT? |
Fixes: #27023
Claude Code was used. I carefully reviewed the output and think it is correct.