perf: single-pass plan traversal in Predicate::new #113

Merged
xudong963 merged 3 commits into datafusion-contrib:main from zhuqi-lucas:single_pass_plan_traversal
Jan 20, 2026

@zhuqi-lucas
Contributor

Summary

Optimize SpjNormalForm::new by consolidating multiple plan traversals into a single pass.

Problem

Before this change, building an SpjNormalForm traversed the logical plan 4 times, three times in Predicate::new and once more in SpjNormalForm::new:

  1. Collect schema
  2. Collect columns and initialize equivalence classes
  3. Collect filters
  4. (in SpjNormalForm::new) Collect referenced tables

This redundant traversal caused significant overhead, especially for wide tables.

Solution

  • Merge all 4 traversals into a single plan.apply() call in Predicate::new
  • Add referenced_tables field to Predicate struct
  • Reuse referenced_tables from Predicate in SpjNormalForm::new
  • Pre-allocate vectors with known capacity
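
The consolidation listed above can be illustrated with a minimal, self-contained sketch. It deliberately avoids DataFusion's types: `Plan`, `Collected`, `visit`, `collect_single_pass`, and `example_plan` are hypothetical stand-ins for `LogicalPlan`, the fields gathered by `Predicate::new`, and `plan.apply()`. The actual code in src/rewrite/normal_form.rs is likely to differ in detail.

```rust
use std::collections::BTreeSet;

/// Toy stand-in for a logical plan (hypothetical, not DataFusion's `LogicalPlan`).
enum Plan {
    Scan { table: String, columns: Vec<String> },
    Filter { predicate: String, input: Box<Plan> },
    Projection { input: Box<Plan> },
}

/// Everything the constructor needs, gathered during one walk.
#[derive(Default)]
struct Collected {
    columns: Vec<String>,
    filters: Vec<String>,
    referenced_tables: BTreeSet<String>,
}

/// Pre-order walk that calls `f` once per node; plays the role of `plan.apply()`.
fn visit(plan: &Plan, f: &mut impl FnMut(&Plan)) {
    f(plan);
    match plan {
        Plan::Scan { .. } => {}
        Plan::Filter { input, .. } | Plan::Projection { input } => visit(input, f),
    }
}

/// One traversal fills columns, filters, and referenced tables together,
/// instead of running a separate traversal for each of them.
fn collect_single_pass(plan: &Plan) -> Collected {
    let mut out = Collected::default();
    visit(plan, &mut |node: &Plan| match node {
        Plan::Scan { table, columns } => {
            out.referenced_tables.insert(table.clone());
            out.columns.extend(columns.iter().cloned());
        }
        Plan::Filter { predicate, .. } => out.filters.push(predicate.clone()),
        Plan::Projection { .. } => {}
    });
    out
}

/// Builds Projection(Filter(Scan t1)), loosely the shape of
/// `SELECT a, b FROM t1 WHERE a >= 0 AND b <= 100`.
fn example_plan() -> Plan {
    Plan::Projection {
        input: Box::new(Plan::Filter {
            predicate: "a >= 0 AND b <= 100".to_string(),
            input: Box::new(Plan::Scan {
                table: "t1".to_string(),
                columns: vec!["a".to_string(), "b".to_string()],
            }),
        }),
    }
}
```

Calling `collect_single_pass(&example_plan())` visits each of the three nodes exactly once, so the cost of the walk itself no longer scales with the number of things being collected.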

Benchmark Results

group                                        main                  single_pass
-----                                        ----                  -----------
spj_normal_form_new/cols=1                   1.00   940.3±8.40ns   1.01   951.7±20.96ns
spj_normal_form_new/cols=10                  1.12     4.7±0.03µs   1.00     4.2±0.03µs
spj_normal_form_new/cols=20                  1.23     9.0±0.12µs   1.00     7.4±0.04µs
spj_normal_form_new/cols=40                  1.28    17.5±0.17µs   1.00    13.7±0.08µs
spj_normal_form_new/cols=80                  1.34    36.9±0.25µs   1.00    27.6±0.20µs
spj_normal_form_new/cols=160                 1.34    76.3±0.32µs   1.00    56.8±0.35µs
spj_normal_form_new/cols=320                 1.35   168.9±1.83µs   1.00   124.7±1.70µs

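On wide tables (80+ columns), main takes about 1.34x to 1.35x as long as single_pass, which is roughly a 25% reduction in time per call. At cols=1 the two are within noise of each other.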

Changes

  • src/rewrite/normal_form.rs: Refactored Predicate::new to use single-pass traversal

Testing

Existing tests pass without modification. This is a pure internal refactor with no API changes.

Copilot AI review requested due to automatic review settings January 16, 2026 04:09

Copilot AI left a comment

Pull request overview

This PR optimizes SpjNormalForm::new by consolidating multiple plan traversals into a single pass, improving performance for wide tables.

Changes:

  • Refactored Predicate::new to use a single traversal that collects schema, columns, filters, and referenced tables in one pass
  • Added referenced_tables field to Predicate struct to store tables collected during traversal
  • Modified SpjNormalForm::new to reuse referenced_tables from Predicate instead of performing a separate traversal

@zhuqi-lucas
Contributor Author

cc @xudong963
I will iterate on the performance improvements in follow-up changes, because doing all the optimizations at once would be too complex.

})?;

// Initialize data structures with known capacity
let n = columns_info.len();
Member

nit: n -> meaningful name

Contributor Author

Good point, will address this!
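
As a rough illustration of the rename requested in this thread, combined with the pre-allocation the PR describes, the sketch below uses `column_count` in place of `n`. The function `init_column_state`, the `index_of` map, the `eq_classes` layout, and the use of plain `String` column names are all assumptions made for the example; the element type of `columns_info` is not shown in this thread.

```rust
use std::collections::HashMap;

/// Hypothetical helper: sizes every container from the known column count,
/// so none of them needs to reallocate while being filled.
fn init_column_state(columns_info: &[String]) -> (HashMap<String, usize>, Vec<Vec<usize>>) {
    // A descriptive name instead of `n` makes the capacity arguments self-explanatory.
    let column_count = columns_info.len();

    let mut index_of = HashMap::with_capacity(column_count);
    let mut eq_classes = Vec::with_capacity(column_count);

    for (idx, name) in columns_info.iter().enumerate() {
        index_of.insert(name.clone(), idx);
        // Assumption: each column starts in its own singleton equivalence class.
        eq_classes.push(vec![idx]);
    }

    (index_of, eq_classes)
}
```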

impl Predicate {
/// Create a new Predicate by analyzing the given logical plan.
/// Uses single-pass traversal to collect schema, columns, filters, and referenced tables.
fn new(plan: &LogicalPlan) -> Result<Self> {
Member

How about adding some unit tests for the new() method, to ensure Predicate still collects the same items as before?

Contributor Author

Good point, will add the unit test!

@zhuqi-lucas
Contributor Author

Thank you @xudong963 for the review. I have addressed all comments now.

@zhuqi-lucas zhuqi-lucas requested a review from xudong963 January 20, 2026 03:20
Comment on lines +1283 to +1286
// Note: Join is not supported yet, so we test with a single table
// This test verifies the single-pass traversal works correctly
let plan = ctx
.sql("SELECT a, b FROM t1 WHERE a >= 0 AND b <= 100")
Member

I think this test duplicates the one above.

Member

IIRC, the mv algo supports join

Contributor Author

@xudong963 Joins are still not supported at the moment; see:

"joins are not supported yet".to_string(),

Contributor Author

Removed the duplicated test now.

@zhuqi-lucas zhuqi-lucas requested a review from xudong963 January 20, 2026 05:33
@xudong963 xudong963 merged commit 02bad14 into datafusion-contrib:main Jan 20, 2026
8 checks passed