feat: Enable rowgroup skipping for float columns by azimafroozeh · Pull Request #26805 · pola-rs/polars

azimafroozeh · 2026-03-04T21:10:16Z

Float columns were previously excluded from Parquet row-group skipping because min/max statistics don't track NaN. This PR selectively enables it based on the operator and literal value, allowing skipping only in cases that are correct despite the missing NaN information.

What we can now skip

col < x / col <= x: NaN never satisfies < or <=, so a hidden NaN can't cause a false skip
col == x (finite x): NaN ≠ any finite value, so if stats show no match we can safely skip
col > NaN / col >= NaN: Nothing is greater than NaN under TotalOrd, so we can skip everything
col.is_between(a, b) (finite bounds): Even though col >= a alone is unsafe, the col <= b half blocks NaN (NaN is never ≤ any finite value), so the conjunction is safe
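
The rules in this list can be sketched as a small decision function. This is only an illustration of the reasoning, not the code in this PR: the function name, the string operator tags, and the assumption that `stat_min` / `stat_max` describe the non-NaN values only are all hypothetical. `col >= NaN` is left out for brevity, and anything not listed falls through to "do not skip", which is always safe.

```python
import math


def can_skip_row_group(op, literal, stat_min, stat_max):
    """Return True only if no row in the group can satisfy the predicate.

    Assumption: stat_min / stat_max cover the non-NaN values only, so the
    row group may contain NaNs that the statistics do not reveal.
    """
    if op == "<":
        # NaN never satisfies <, so only the non-NaN values matter.
        return stat_min >= literal
    if op == "<=":
        # Same argument as for <.
        return stat_min > literal
    if op == "==":
        # Only a finite literal is handled; a hidden NaN cannot equal it.
        if not math.isfinite(literal):
            return False
        return literal < stat_min or literal > stat_max
    if op == "is_between":
        low, high = literal
        if not (math.isfinite(low) and math.isfinite(high)):
            return False
        # The `<= high` half rejects NaN, so the range check is safe.
        return stat_max < low or stat_min > high
    if op == ">" and math.isnan(literal):
        # Nothing is greater than NaN when NaN sorts last.
        return True
    # Everything else: keep the row group (never skipping is always correct).
    return False
```

For example, with stats `min=1.0, max=5.0`, the predicate `col < 0.0` allows a skip, while `col > 0.0` does not.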

What we still block

col > x / col >= x (finite x): A hidden NaN satisfies > for any finite x, but won't appear in the max stat — skipping could miss matching rows
col == NaN: Stats don't track NaN, so we can't prove its absence
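
To see why `col > x` stays blocked, here is a minimal pure-Python sketch of a row group with a hidden NaN. It assumes the writer computed min/max over the non-NaN values only and that comparisons treat NaN as larger than every other value; `total_gt` is a hypothetical helper written for this illustration, not a Polars function.

```python
import math


def total_gt(a, b):
    """a > b with NaN treated as larger than every non-NaN value."""
    if math.isnan(a):
        return not math.isnan(b)
    if math.isnan(b):
        return False
    return a > b


row_group = [1.0, 2.0, float("nan")]

# Statistics as a writer that ignores NaN would record them (assumption).
non_nan = [v for v in row_group if not math.isnan(v)]
stat_min, stat_max = min(non_nan), max(non_nan)

x = 5.0
# Trusting the stats alone, nothing in this group exceeds x ...
naive_skip = stat_max <= x
# ... yet the NaN row does match under the NaN-is-largest ordering.
matching = [v for v in row_group if total_gt(v, x)]
```

Here `naive_skip` is true while `matching` still contains the NaN row, so skipping on the max statistic alone would drop a matching row.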

Fixes #26238

codecov · 2026-03-04T22:03:22Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.57%. Comparing base (9f1a742) to head (6db0d52).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #26805      +/-   ##
==========================================
+ Coverage   81.30%   81.57%   +0.26%     
==========================================
  Files        1802     1802              
  Lines      246972   246996      +24     
  Branches     3086     3086              
==========================================
+ Hits       200810   201484     +674     
+ Misses      45371    44722     -649     
+ Partials      791      790       -1


gdementen · 2026-03-05T09:02:14Z

This is great! It would be even greater if, as a user, I could specify that my input files do not contain any NaNs, so that Polars could then optimize all predicates. Later on, this capability could also be used if Polars stored that information as custom metadata in the Parquet files it writes itself, so that it could optimize all queries on them without the user specifying the option manually.

Speaking of Parquet, it looks odd to me that the implementation does not seem Parquet-specific, or is skip_batches.rs a Parquet-specific functionality? I mean, other data formats could very well contain the information, right? (I don't know Polars internals and I am sorry if this kind of discussion is not appreciated)

azimafroozeh · 2026-03-05T10:02:55Z

Polars could technically store a "no NaN" flag in the files it writes, since Parquet supports arbitrary custom key-value metadata, but files written that way would only be optimizable by Polars itself, since other readers would just ignore the custom fields. So while it works in a Polars-only pipeline, it's not a general solution. The proper fix is to standardize this at the spec level, and there is actually an ongoing effort to do exactly that: apache/parquet-format#514 proposes adding both an IEEE754TotalOrder column order and explicit nan_count fields to the Parquet statistics schema. Once that lands, any Parquet writer could embed nan_count=0 in standard metadata and any Parquet reader — Polars or otherwise — could fully trust float statistics without needing user hints.

As for other file formats: this is inherently Parquet-specific, not because of an arbitrary implementation choice, but because formats like CSV and IPC simply don't store per-chunk column statistics at all. There is nothing to read before touching the data, so row group skipping has nowhere to hook into regardless of NaN.

Your idea of a user-facing "I guarantee no NaNs" hint is a nice one and makes intuitive sense! The challenge though is that the NaN guarantee alone isn't quite enough. To actually decide which row groups to skip, Polars also needs per-row-group min/max bounds, which come from the Parquet metadata. If the file has those statistics, Polars can already use them once NaN is out of the way; if it doesn't, there is nothing to skip on regardless. The clean solution really is at the spec level, which is exactly what apache/parquet-format#514 is working towards.
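
A rough sketch of how the two ingredients combine for the currently blocked `col > x` case: the skip needs both a no-NaN guarantee and a max statistic. Everything here is hypothetical; `no_nan_guaranteed` stands in for either a future `nan_count == 0` read from the file or a user-supplied hint, neither of which exists in Polars as of this PR.

```python
import math
from typing import Optional


def can_skip_gt(x: float, stat_max: Optional[float], no_nan_guaranteed: bool) -> bool:
    """Decide whether a row group can be skipped for `col > x`."""
    if math.isnan(x):
        # Nothing is greater than NaN when NaN sorts last.
        return True
    if stat_max is None:
        # No statistics in the file: nothing to decide on.
        return False
    if not no_nan_guaranteed:
        # A hidden NaN could still satisfy `col > x`.
        return False
    # Stats are trustworthy: skip when even the largest value fails.
    return stat_max <= x
```

With `stat_max=2.0` and `x=5.0`, the group is skippable only when the guarantee holds; without statistics the guarantee alone changes nothing.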

And thanks for bringing this up, this is exactly the kind of discussion we appreciate, so please don't hesitate!

gdementen · 2026-03-05T10:34:41Z

Thanks for the nice explanation

ritchie46 · 2026-03-05T10:43:58Z

Nice one!

Commit 6db0d52: implement

azimafroozeh requested review from MarcoGorelli, alexander-beedie, c-peters, orlp and ritchie46 as code owners March 4, 2026 21:10

github-actions bot added the enhancement, python, and rust labels Mar 4, 2026

ritchie46 merged commit dda10f3 into pola-rs:main Mar 5, 2026
31 checks passed

ritchie46 merged 1 commit into pola-rs:main from azimafroozeh:feat/enable_predicate_pushdown_for_float

3 participants
