feat: Enable rowgroup skipping for float columns#26805
feat: Enable rowgroup skipping for float columns#26805ritchie46 merged 1 commit intopola-rs:mainfrom
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #26805 +/- ##
==========================================
+ Coverage 81.30% 81.57% +0.26%
==========================================
Files 1802 1802
Lines 246972 246996 +24
Branches 3086 3086
==========================================
+ Hits 200810 201484 +674
+ Misses 45371 44722 -649
+ Partials 791 790 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
This is great ! It would be even greater if, as a user, I could specify that my input files do not have any NaNs and Polars would then optimize all predicates. Later on, this capability could also be used if Polars stored in custom metadata in the parquet files it writes itself, so that it could optimize all queries on them without the user specifying that option manually. Speaking of Parquet, it looks odd to me that the implementation does not seem Parquet-specific, or is skip_batches.rs a Parquet-specific functionality? I mean, other data formats could very well contain the information, right? (I don't know Polars internals and I am sorry if this kind of discussion is not appreciated) |
|
Polars could technically store a "no NaN" flag in the files it writes, since Parquet supports arbitrary custom key-value metadata, but files written that way would only be optimizable by Polars itself, since other readers would just ignore the custom fields. So while it works in a Polars-only pipeline, it's not a general solution. The proper fix is to standardize this at the spec level, and there is actually an ongoing effort to do exactly that: apache/parquet-format#514 proposes adding both an IEEE754TotalOrder column order and explicit nan_count fields to the Parquet statistics schema. Once that lands, any Parquet writer could embed nan_count=0 in standard metadata and any Parquet reader — Polars or otherwise — could fully trust float statistics without needing user hints. As for other file formats: this is inherently Parquet-specific, not because of an arbitrary implementation choice, but because formats like CSV and IPC simply don't store per-chunk column statistics at all. There is nothing to read before touching the data, so row group skipping has nowhere to hook into regardless of NaN. Your idea of a user-facing "I guarantee no NaNs" hint is a nice one and makes intuitive sense! The challenge though is that the NaN guarantee alone isn't quite enough. To actually decide which row groups to skip, Polars also needs per-row-group min/max bounds, which come from the Parquet metadata. If the file has those statistics, Polars can already use them once NaN is out of the way; if it doesn't, there is nothing to skip on regardless. The clean solution really is at the spec level, which is exactly what apache/parquet-format#514 is working towards. And thanks for bringing this up, this is exactly the kind of discussion we appreciate, so please don't hesitate! |
|
Thanks for the nice explanation |
|
Nice one! |
Float columns were previously excluded from Parquet row-group skipping because min/max statistics don't track NaN. This PR selectively enables it based on the operator and literal value, allowing skipping only in cases that are correct despite the missing NaN information.
What we can now skip
col < x/col <= x: NaN never satisfies<or<=, so a hidden NaN can't cause a false skipcol == x(finitex): NaN ≠ any finite value, so if stats show no match we can safely skipcol > NaN/col >= NaN: Nothing is greater than NaN underTotalOrd, so we can skip everythingcol.is_between(a, b)(finite bounds): Even thoughcol >= aalone is unsafe, thecol <= bhalf blocks NaN (NaN is never ≤ any finite value), so the conjunction is safeWhat we still block
col > x/col >= x(finitex): A hidden NaN satisfies>for any finitex, but won't appear in the max stat — skipping could miss matching rowscol == NaN: Stats don't track NaN, so we can't prove its absenceFixes #26238