Skip to content

feat: Enable rowgroup skipping for float columns#26805

Merged
ritchie46 merged 1 commit intopola-rs:mainfrom
azimafroozeh:feat/enable_predicate_pushdown_for_float
Mar 5, 2026
Merged

feat: Enable rowgroup skipping for float columns#26805
ritchie46 merged 1 commit intopola-rs:mainfrom
azimafroozeh:feat/enable_predicate_pushdown_for_float

Conversation

@azimafroozeh
Copy link
Copy Markdown
Collaborator

@azimafroozeh azimafroozeh commented Mar 4, 2026

Float columns were previously excluded from Parquet row-group skipping because min/max statistics don't track NaN. This PR selectively enables it based on the operator and literal value, allowing skipping only in cases that are correct despite the missing NaN information.

What we can now skip

  • col < x / col <= x: NaN never satisfies < or <=, so a hidden NaN can't cause a false skip
  • col == x (finite x): NaN ≠ any finite value, so if stats show no match we can safely skip
  • col > NaN / col >= NaN: Nothing is greater than NaN under TotalOrd, so we can skip everything
  • col.is_between(a, b) (finite bounds): Even though col >= a alone is unsafe, the col <= b half blocks NaN (NaN is never ≤ any finite value), so the conjunction is safe

What we still block

  • col > x / col >= x (finite x): A hidden NaN satisfies > for any finite x, but won't appear in the max stat — skipping could miss matching rows
  • col == NaN: Stats don't track NaN, so we can't prove its absence

Fixes #26238

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Mar 4, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.57%. Comparing base (9f1a742) to head (6db0d52).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #26805      +/-   ##
==========================================
+ Coverage   81.30%   81.57%   +0.26%     
==========================================
  Files        1802     1802              
  Lines      246972   246996      +24     
  Branches     3086     3086              
==========================================
+ Hits       200810   201484     +674     
+ Misses      45371    44722     -649     
+ Partials      791      790       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@gdementen
Copy link
Copy Markdown

This is great ! It would be even greater if, as a user, I could specify that my input files do not have any NaNs and Polars would then optimize all predicates. Later on, this capability could also be used if Polars stored in custom metadata in the parquet files it writes itself, so that it could optimize all queries on them without the user specifying that option manually.

Speaking of Parquet, it looks odd to me that the implementation does not seem Parquet-specific, or is skip_batches.rs a Parquet-specific functionality? I mean, other data formats could very well contain the information, right? (I don't know Polars internals and I am sorry if this kind of discussion is not appreciated)

@azimafroozeh
Copy link
Copy Markdown
Collaborator Author

azimafroozeh commented Mar 5, 2026

Polars could technically store a "no NaN" flag in the files it writes, since Parquet supports arbitrary custom key-value metadata, but files written that way would only be optimizable by Polars itself, since other readers would just ignore the custom fields. So while it works in a Polars-only pipeline, it's not a general solution. The proper fix is to standardize this at the spec level, and there is actually an ongoing effort to do exactly that: apache/parquet-format#514 proposes adding both an IEEE754TotalOrder column order and explicit nan_count fields to the Parquet statistics schema. Once that lands, any Parquet writer could embed nan_count=0 in standard metadata and any Parquet reader — Polars or otherwise — could fully trust float statistics without needing user hints.

As for other file formats: this is inherently Parquet-specific, not because of an arbitrary implementation choice, but because formats like CSV and IPC simply don't store per-chunk column statistics at all. There is nothing to read before touching the data, so row group skipping has nowhere to hook into regardless of NaN.

Your idea of a user-facing "I guarantee no NaNs" hint is a nice one and makes intuitive sense! The challenge though is that the NaN guarantee alone isn't quite enough. To actually decide which row groups to skip, Polars also needs per-row-group min/max bounds, which come from the Parquet metadata. If the file has those statistics, Polars can already use them once NaN is out of the way; if it doesn't, there is nothing to skip on regardless. The clean solution really is at the spec level, which is exactly what apache/parquet-format#514 is working towards.

And thanks for bringing this up, this is exactly the kind of discussion we appreciate, so please don't hesitate!

@gdementen
Copy link
Copy Markdown

Thanks for the nice explanation

@ritchie46
Copy link
Copy Markdown
Member

Nice one!

@ritchie46 ritchie46 merged commit dda10f3 into pola-rs:main Mar 5, 2026
31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parquet predicate pushdown not working for Float columns

3 participants