
21566: Parquet: Add scan-level cache for adapted pruning setup (projection/predicate) to reuse CPU-only work across same-schema files#310

Open
martin-augment wants to merge 6 commits into main from pr-21566-2026-04-13-12-49-06

Conversation

@martin-augment
Owner

21566: To review by AI

kosiew and others added 6 commits April 12, 2026 16:50
Add a scan-local ParquetPruningSetupCache in opener.rs
to reuse adapted projection, adapted predicate, and row-group
pruning predicate setup for same-schema files.

Implement
PhysicalExprAdapterFactory::supports_reusable_rewrites() in
schema_rewriter.rs to allow default adapters while defaulting
custom ones to opt-out.

Connect the cache to ParquetMorselizer in source.rs and
introduce a regression test to ensure the setup is reused
correctly across same-schema files while keeping it
conservative to avoid per-file literal replacements.

Simplify the pruning setup cache in opener.rs by removing the wrapper entry struct, deriving PartialEq/Eq, and returning cloned setup values instead of an extra Arc. Extract the cache-or-build branch into build_or_get_pruning_setup and avoid repeated literal_columns.is_empty() checks. Move returned pruning setup fields directly into prepared and simplify the test-only counting adapter and cache regression test loop. Tighten the supports_reusable_rewrites doc comment in schema_rewriter.rs without changing the public interface.

Refactor cache in opener.rs to use a HashMap, computing cold misses outside the mutex for better performance, with a re-check on insert. Add cache-boundary tests for non-reusable adapters and diverse physical schemas. Clarify documentation in schema_rewriter.rs for supports_reusable_rewrites() to specify the same logical/physical schema rewrite inputs.

Enhance opener.rs with cache-key rationale comments for clarity.

Update ParquetPruningSetupCache to handle concurrent cold misses by implementing a per-entry pending/ready state. Clarify the public contract of supports_reusable_rewrites() in schema_rewriter.rs.
@coderabbitai

coderabbitai bot commented Apr 13, 2026

Walkthrough

This pull request introduces a pruning setup cache mechanism to the Parquet datasource, alongside configuration guidelines for AI agents. The cache reuses CPU-only pruning setup across multiple files sharing the same logical/physical schemas and scan-level projection/predicate identities. The ParquetMorselizer struct now includes a pruning_setup_cache field, and a new supports_reusable_rewrites method was added to the PhysicalExprAdapterFactory trait to indicate when rewritten expressions can be cached. Test infrastructure was expanded to validate cache behavior. Additionally, AI agent configuration files were created to establish review constraints regarding GitHub issue/PR links and agent configuration file reviews.


@claude

claude bot commented Apr 13, 2026

Code Review

This PR introduces a scan-level cache (ParquetPruningSetupCache) that avoids redundant expression rewriting and pruning predicate construction across files sharing the same logical/physical schema. The design is well-motivated for scans over large numbers of same-schema files (the common case).

Correctness Issues

1. Blocking a tokio worker thread via Condvar::wait (medium severity)

prepare_filters is a synchronous function called from ParquetOpenState::transition() -> MorselPlanner::plan(), which executes on a tokio worker thread. When two concurrent file opens collide on the same cache key, the second caller blocks that tokio thread via std::sync::Condvar::wait until the first finishes.

Since build_pruning_setup is pure CPU work expected to complete in microseconds, this is unlikely to cause measurable thread starvation in practice. But it is technically incorrect for an async execution environment. A cleaner long-term approach would be to use tokio::sync::watch or tokio::sync::Notify. At minimum, document the reasoning for accepting the brief blocking.

2. Cache key uses pointer identity (Arc::as_ptr) -- correctness relies on a subtle invariant

predicate_ptr: predicate.map(physical_expr_ptr),
projection_expr_ptrs: projection.iter().map(|expr| physical_expr_ptr(&expr.expr)).collect(),

The per-scan scope (fresh Arc::default() in source.rs and build()) and the existing comment explain the design rationale. However, it is worth adding a debug_assert! or an explicit comment inside get_or_insert_with noting that pointer stability is guaranteed because the cache lifetime is strictly bounded to a single scan execution.
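The thin-pointer extraction the key relies on can be sketched in std-only Rust; the `Expr` trait and `expr_ptr` name below are illustrative stand-ins for DataFusion's `PhysicalExpr` and the PR's `physical_expr_ptr`:

```rust
use std::sync::Arc;

// Hypothetical stand-in for DataFusion's PhysicalExpr trait.
trait Expr: Send + Sync {}
struct Col;
impl Expr for Col {}

// Casting `*const dyn Expr` to `*const ()` keeps the data address and
// discards the vtable half of the fat pointer, so two clones of the
// same Arc compare equal while distinct allocations differ.
fn expr_ptr(expr: &Arc<dyn Expr>) -> *const () {
    Arc::as_ptr(expr) as *const ()
}

fn main() {
    let a: Arc<dyn Expr> = Arc::new(Col);
    let b = Arc::clone(&a); // same allocation
    let c: Arc<dyn Expr> = Arc::new(Col); // different allocation
    assert_eq!(expr_ptr(&a), expr_ptr(&b));
    assert_ne!(expr_ptr(&a), expr_ptr(&c));
}
```

This is exactly why the invariant matters: equality is identity of the allocation, so the key is only meaningful while the scan holds the same `Arc`s alive.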

3. Race window in error path: entry removed after notify_all

entry.ready.notify_all();
drop(state);               // mutex released here
self.entries()?.remove(key); // new callers can find the Failed entry before this line runs

After drop(state) but before remove(key), a new caller can find the Failed entry and return an error without ever computing. This is technically correct, but surprising: a transient make_setup error will propagate to a caller that had no opportunity to retry. If retry-after-failure is intended, move remove before notify_all (still under the state lock) so new callers never observe the Failed entry.
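The suggested ordering can be sketched with a toy version of the cache (all type and method names are illustrative, not the PR's actual code; a single-threaded test is shown for brevity):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};

enum State { Pending, Ready(u64), Failed }

struct Entry { state: Mutex<State>, ready: Condvar }

#[derive(Default)]
struct Cache { entries: Mutex<HashMap<String, Arc<Entry>>> }

impl Cache {
    fn get_or_insert_with<F>(&self, key: &str, make: F) -> Result<u64, String>
    where F: FnOnce() -> Result<u64, String> {
        let entry = {
            let mut map = self.entries.lock().unwrap();
            if let Some(e) = map.get(key) {
                let e = Arc::clone(e);
                drop(map); // never hold the map lock while waiting
                let mut st = e.state.lock().unwrap();
                loop {
                    match &*st {
                        State::Pending => st = e.ready.wait(st).unwrap(),
                        State::Ready(v) => return Ok(*v),
                        State::Failed => return Err("cached failure".into()),
                    }
                }
            }
            let e = Arc::new(Entry { state: Mutex::new(State::Pending), ready: Condvar::new() });
            map.insert(key.to_string(), Arc::clone(&e));
            e
        };
        match make() {
            Ok(v) => {
                *entry.state.lock().unwrap() = State::Ready(v);
                entry.ready.notify_all();
                Ok(v)
            }
            Err(err) => {
                // Fix: remove the map entry while still holding the state
                // lock and before notify_all, so a *new* caller can never
                // observe the Failed entry and will recompute instead.
                let mut st = entry.state.lock().unwrap();
                *st = State::Failed;
                self.entries.lock().unwrap().remove(key);
                entry.ready.notify_all();
                drop(st);
                Err(err)
            }
        }
    }
}

fn main() {
    let cache = Cache::default();
    let mut calls = 0;
    let r1 = cache.get_or_insert_with("k", || { calls += 1; Err("boom".into()) });
    assert!(r1.is_err());
    let r2 = cache.get_or_insert_with("k", || { calls += 1; Ok(7) });
    assert_eq!(r2.unwrap(), 7);
    assert_eq!(calls, 2); // the failure was not cached; the second caller retried
}
```

Waiters already parked on the entry still observe the failure (they raced the original computation, which is acceptable), but fresh callers get a clean retry.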

Design Observations

4. Unbounded cache growth

ParquetPruningSetupCache has no eviction policy. For scans over schema-evolved data with many distinct physical schemas, each combination produces a cache entry that lives for the scan duration. This is likely fine in practice since the cache is scoped per-ParquetMorselizer instance and released when the scan finishes, but a brief comment to that effect would help reviewers verify the bound.

5. Missing test for literal-column bypass

The code correctly sets pruning_setup_reusable = !has_literal_columns, but there is no test verifying that files with partition (literal) columns bypass the cache and call the adapter once per file. This path is easy to miss in a future refactor.

Minor Notes

  • physical_expr_ptr casting via *const () is the correct way to extract the data pointer from a fat pointer -- good.
  • supports_reusable_rewrites() defaulting to false for custom implementations is the right safe default -- good.
  • The three tests covering cache reuse, non-reusable adapter bypass, and schema isolation are a solid baseline.
  • Factoring build_pruning_setup into its own function (separate from build_or_get_pruning_setup) is clean and makes the cache bypass path easy to follow.

Summary

The core approach is sound. The two items most worth addressing before merge are: (1) the Condvar::wait on a tokio thread -- at minimum document why it is acceptable; and (2) the error-path race -- move remove before notify_all if retry-after-failure is intended.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a scan-local cache, ParquetPruningSetupCache, to reuse CPU-only pruning setups across Parquet files with identical physical schemas and expression inputs. It also adds rules for AI agents regarding PR linking and config file reviews. Several critical issues were identified in the implementation, including compilation errors due to SchemaRef not implementing Hash and DataFusionError not being clonable for the cache's error state. Additionally, the cache initialization in ParquetSource needs to be adjusted to ensure it is shared across partitions rather than being partition-local.

Mutex<HashMap<ParquetPruningSetupCacheKey, Arc<ParquetPruningSetupCacheEntry>>>,
}

#[derive(Debug, Clone, PartialEq, Eq, Hash)]

critical

The ParquetPruningSetupCacheKey struct derives Hash, but it contains logical_file_schema and physical_file_schema of type SchemaRef (which is Arc<Schema>). In the arrow-rs crate, Schema does not implement the Hash trait. Consequently, Arc<Schema> also does not implement Hash, and this derive will cause a compilation error. You should implement Hash manually for ParquetPruningSetupCacheKey, perhaps by using pointer equality for the schemas if they are expected to be stable, or by hashing a unique representation of the schema.
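One possible manual `Hash` based on pointer identity, using a stand-in `Schema` type (the real arrow-rs `Schema` is assumed here; the approach is only valid because the schema `Arc`s are stable for the cache's per-scan lifetime):

```rust
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Arc;

// Stand-in for arrow's Schema, which cannot derive Hash because it
// carries a metadata HashMap.
struct Schema { _fields: Vec<String> }
type SchemaRef = Arc<Schema>;

struct CacheKey { logical: SchemaRef, physical: SchemaRef }

// Pointer identity for both equality and hashing, so Eq and Hash
// stay consistent with each other.
impl PartialEq for CacheKey {
    fn eq(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.logical, &other.logical)
            && Arc::ptr_eq(&self.physical, &other.physical)
    }
}
impl Eq for CacheKey {}
impl Hash for CacheKey {
    fn hash<H: Hasher>(&self, state: &mut H) {
        (Arc::as_ptr(&self.logical) as usize).hash(state);
        (Arc::as_ptr(&self.physical) as usize).hash(state);
    }
}

fn main() {
    let s = Arc::new(Schema { _fields: vec!["a".into()] });
    let mut map = HashMap::new();
    map.insert(CacheKey { logical: Arc::clone(&s), physical: Arc::clone(&s) }, 1u32);
    // A key built from clones of the same Arcs hits the same slot.
    let hit = map.get(&CacheKey { logical: Arc::clone(&s), physical: Arc::clone(&s) });
    assert_eq!(hit, Some(&1));
}
```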

entry.ready.notify_all();
drop(state);
self.entries()?.remove(key);
Err(DataFusionError::from(&shared_error))

critical

The code attempts to return an error using DataFusionError::from(&shared_error), where shared_error is an Arc<DataFusionError>. However, DataFusionError does not implement Clone (as it contains non-clonable Box<dyn Error> variants) and there is no From<&DataFusionError> implementation. This will result in a compilation error. Consider storing a clonable error representation (like a string) in the cache or using a custom error type that can be cloned.
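One clonable-surrogate approach can be sketched as follows (all types here are illustrative miniatures, not DataFusion's actual definitions; depending on the DataFusion version in use, the built-in `DataFusionError::Shared(Arc<DataFusionError>)` variant may serve the same purpose more directly):

```rust
use std::sync::Arc;

// Illustrative stand-in; the real DataFusionError is not Clone.
#[derive(Debug)]
struct DataFusionError(String);

// Cache a clonable surrogate of the failure instead of the error value.
#[derive(Clone)]
enum EntryState {
    Pending,
    Ready(u64),
    Failed(Arc<str>), // captured message; cheap to clone
}

fn rehydrate(state: &EntryState) -> Result<u64, DataFusionError> {
    match state {
        EntryState::Ready(v) => Ok(*v),
        EntryState::Failed(msg) => {
            Err(DataFusionError(format!("cached pruning setup failure: {msg}")))
        }
        EntryState::Pending => unreachable!("callers wait until the entry is terminal"),
    }
}

fn main() {
    let err = rehydrate(&EntryState::Failed(Arc::from("boom"))).unwrap_err();
    assert!(err.0.contains("boom"));
    assert_eq!(rehydrate(&EntryState::Ready(5)).unwrap(), 5);
}
```

The trade-off is that structured error details (backtraces, source chains) are flattened into a string; if callers need to match on error variants, the `Shared` route is preferable.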

return Ok(setup.clone());
}
ParquetPruningSetupCacheEntryState::Failed(e) => {
return Err(DataFusionError::from(e));

critical

Similar to the issue on line 300, DataFusionError::from(e) will fail to compile because e is a reference to an Arc<DataFusionError>, and DataFusionError is not clonable. You need a way to produce a new DataFusionError from the cached one, such as by capturing the error message.

encryption_factory: self.get_encryption_factory_with_config(),
max_predicate_cache_size: self.max_predicate_cache_size(),
reverse_row_groups: self.reverse_row_groups,
pruning_setup_cache: Arc::default(),

medium

The pruning_setup_cache is initialized here with Arc::default() inside create_file_opener. Since create_file_opener is called once per partition, this cache is effectively partition-local rather than scan-level. To achieve true scan-level caching as intended by the PR title, the cache should be owned by ParquetSource (as an Arc) so that it is shared across all partitions of the same DataSourceExec instance.

Suggested change:
-    pruning_setup_cache: Arc::default(),
+    pruning_setup_cache: Arc::clone(&self.pruning_setup_cache),

@augmentcode

augmentcode bot commented Apr 13, 2026

🤖 Augment PR Summary

Summary: This PR adds a scan-local cache for Parquet pruning setup (schema adaptation + pruning predicate construction) so CPU-only work can be reused across multiple files that share the same physical schema.

Changes:

  • Introduced ParquetPruningSetupCache with a per-scan Mutex/Condvar-backed entry state machine (Pending/Ready/Failed).
  • Keyed cached entries by logical schema, physical schema, and identity of scan-level projection/predicate expressions.
  • Refactored filter preparation to use build_or_get_pruning_setup, returning adapted projection/predicate and a pre-built PruningPredicate.
  • Extended PhysicalExprAdapterFactory with supports_reusable_rewrites() and enabled reuse for the default adapter factory.
  • Threaded the cache through Parquet openers (ParquetMorselizer/ParquetSource) as an Arc shared within a scan.
  • Added tests validating reuse for same-schema files, skipping reuse for non-reusable adapter factories, and isolating reuse by physical schema.

Technical Notes: Reuse is intentionally disabled when file-specific literal column replacement occurs, and when adapter factories don’t opt in to reusable rewrites.



@augmentcode augmentcode bot left a comment


Review completed. 1 suggestion posted.


};

if should_compute {
let setup_result = make_setup();

If make_setup() panics after inserting the Pending entry, the state never transitions and subsequent callers can block forever in the Condvar wait loop. Consider ensuring the cache entry is moved to a terminal state (or removed) even when the setup computation unwinds.

Severity: medium


Copy link
Copy Markdown

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


Reviewed by Cursor Bugbot for commit a57b22e.

drop(state);
self.entries()?.remove(key);
Err(DataFusionError::from(&shared_error))
}


Missing panic guard leaves Condvar waiters stuck forever

Medium Severity

In get_or_insert_with, make_setup() is called on line 284 before the entry's inner state lock is acquired. If make_setup() panics, notify_all() is never called, the entry state remains Pending, and the inner Mutex is not poisoned. Any threads blocked on entry.ready.wait(state) will wait indefinitely. A drop guard (or equivalent) that transitions the entry to Failed and calls notify_all() on unwind is needed to prevent this hang.
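A drop guard along these lines keeps waiters from hanging when the computation unwinds. This is a std-only sketch with illustrative types, not the PR's actual code (the demo deliberately panics, so a panic message on stderr is expected):

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

#[derive(Clone, Copy, PartialEq, Debug)]
enum State { Pending, Ready, Failed }

struct Entry { state: Mutex<State>, ready: Condvar }

// Unless the success path disarms it (guard.armed = false), the guard
// flips the entry to Failed and wakes waiters -- including during an
// unwind caused by a panicking setup computation.
struct CompletionGuard<'a> { entry: &'a Entry, armed: bool }
impl Drop for CompletionGuard<'_> {
    fn drop(&mut self) {
        if self.armed {
            *self.entry.state.lock().unwrap() = State::Failed;
            self.entry.ready.notify_all();
        }
    }
}

fn main() {
    let entry = Arc::new(Entry { state: Mutex::new(State::Pending), ready: Condvar::new() });
    let waiter = {
        let entry = Arc::clone(&entry);
        thread::spawn(move || {
            let mut st = entry.state.lock().unwrap();
            while *st == State::Pending {
                st = entry.ready.wait(st).unwrap();
            }
            *st
        })
    };
    // The computing thread panics mid-setup; the guard still wakes the waiter.
    let _ = catch_unwind(AssertUnwindSafe(|| {
        let _guard = CompletionGuard { entry: &*entry, armed: true };
        panic!("setup panicked");
    }));
    assert_eq!(waiter.join().unwrap(), State::Failed); // no hang
}
```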


}
}
}
}

Blocking Condvar wait inside Future::poll risks thread starvation

Low Severity

Condvar::wait is a blocking call that can be reached from ParquetOpenFuture::poll → plan() → transition() → prepare_filters() → build_or_get_pruning_setup() → get_or_insert_with(). Using a blocking synchronization primitive inside Future::poll blocks the async executor's worker thread, which is a well-known anti-pattern in async Rust and can cause thread starvation under concurrent partition processing.


@martin-augment
Owner Author

/coderabbit review

@martin-augment
Owner Author

coderabbit review

@martin-augment
Owner Author

@CodeRabbit review

@coderabbitai

coderabbitai bot commented Apr 13, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
datafusion/datasource-parquet/src/source.rs (1)

545-575: ⚠️ Potential issue | 🟠 Major

Make the cache scan-shared, not opener-local.

Line 573 allocates a fresh ParquetPruningSetupCache inside create_file_opener, so reuse is limited to files handled by that one opener/execution partition. Same-schema files that land in different file groups will miss the cache entirely, which makes this optimization partition-local rather than scan-local. Hoist the Arc<ParquetPruningSetupCache> onto a scan-scoped object and clone it into each opener instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datafusion/datasource-parquet/src/source.rs` around lines 545 - 575, The
ParquetPruningSetupCache is being created inside create_file_opener (as
pruning_setup_cache: Arc::default() in the ParquetMorselizer) which makes the
cache opener-local; move the Arc<ParquetPruningSetupCache> allocation to a
scan-scoped owner (e.g., the struct that represents the whole parquet scan or
the builder that constructs all openers), store it there, and clone that Arc
into each ParquetOpener/ParquetMorselizer when constructing them (keep the field
name pruning_setup_cache and populate it by cloning the scan-scoped Arc instead
of Arc::default() inside create_file_opener).
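The proposed hoisting amounts to this shape (the struct and field names mirror the review's wording but are illustrative stand-ins, not the real DataFusion definitions):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

#[derive(Default)]
struct ParquetPruningSetupCache { _entries: Mutex<HashMap<String, u64>> }

struct ParquetSource { pruning_setup_cache: Arc<ParquetPruningSetupCache> }
struct ParquetMorselizer { pruning_setup_cache: Arc<ParquetPruningSetupCache> }

impl ParquetSource {
    // Called once per partition: clone the scan-scoped Arc instead of
    // calling Arc::default(), so every partition's opener shares one cache.
    fn create_file_opener(&self) -> ParquetMorselizer {
        ParquetMorselizer {
            pruning_setup_cache: Arc::clone(&self.pruning_setup_cache),
        }
    }
}

fn main() {
    let source = ParquetSource { pruning_setup_cache: Arc::default() };
    let opener_a = source.create_file_opener();
    let opener_b = source.create_file_opener();
    // Both partitions see the same cache instance.
    assert!(Arc::ptr_eq(&opener_a.pruning_setup_cache, &opener_b.pruning_setup_cache));
}
```

With `Arc::default()` inside `create_file_opener`, the same `Arc::ptr_eq` check would fail, which is exactly the partition-local behavior the comment flags.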
🧹 Nitpick comments (1)
datafusion/datasource-parquet/src/opener.rs (1)

2327-2460: Add a truly concurrent cache-reuse test.

These tests open file1 and file2 serially, so they only validate reuse after the first setup has already completed. They never hit the new Pending/Condvar path in the cache entry state machine, which is the riskiest part of this change. Please add a case that starts two same-schema opens concurrently and asserts a single adapter creation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datafusion/datasource-parquet/src/opener.rs` around lines 2327 - 2460, The
tests currently open file1 and file2 serially so they never exercise the cache's
Pending/Condvar concurrency path; add a new async test (e.g.,
test_pruning_setup_cache_reuses_adapter_concurrent) that uses the same setup as
test_pruning_setup_cache_reuses_adapter but starts two opener.open(...) futures
concurrently (spawn or join them) for the two PartitionedFile instances so they
race to create the pruning setup and then await both; use the same
CountingPhysicalExprAdapterFactory and assert
create_count.load(Ordering::SeqCst) == 1 to ensure only one adapter is created,
referencing ParquetOpenerBuilder, CountingPhysicalExprAdapterFactory, and
opener.open to locate the code to change.
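The shape of such a concurrent-reuse test can be sketched std-only against a toy compute-once cache (a Condvar entry stands in for the PR's Pending/Ready state machine; a Barrier maximizes the chance both callers hit the cold path together):

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Barrier, Condvar, Mutex};
use std::thread;

enum State { Pending, Ready(u64) }
struct Entry { state: Mutex<State>, ready: Condvar }

#[derive(Default)]
struct Cache { entries: Mutex<HashMap<String, Arc<Entry>>> }

impl Cache {
    fn get_or_compute(&self, key: &str, f: impl FnOnce() -> u64) -> u64 {
        let (entry, compute) = {
            let mut map = self.entries.lock().unwrap();
            match map.get(key) {
                Some(e) => (Arc::clone(e), false),
                None => {
                    let e = Arc::new(Entry {
                        state: Mutex::new(State::Pending),
                        ready: Condvar::new(),
                    });
                    map.insert(key.to_string(), Arc::clone(&e));
                    (e, true)
                }
            }
        };
        if compute {
            let v = f();
            *entry.state.lock().unwrap() = State::Ready(v);
            entry.ready.notify_all();
            v
        } else {
            let mut st = entry.state.lock().unwrap();
            loop {
                match &*st {
                    State::Ready(v) => return *v,
                    State::Pending => st = entry.ready.wait(st).unwrap(),
                }
            }
        }
    }
}

fn main() {
    let cache = Arc::new(Cache::default());
    let setups = Arc::new(AtomicUsize::new(0));
    let barrier = Arc::new(Barrier::new(2));
    let handles: Vec<_> = (0..2).map(|_| {
        let (cache, setups, barrier) =
            (Arc::clone(&cache), Arc::clone(&setups), Arc::clone(&barrier));
        thread::spawn(move || {
            barrier.wait(); // both "file opens" race on the same cold key
            cache.get_or_compute("same-schema", || {
                setups.fetch_add(1, Ordering::SeqCst);
                42
            })
        })
    }).collect();
    for h in handles {
        assert_eq!(h.join().unwrap(), 42);
    }
    // Exactly one setup ran even though both opens raced.
    assert_eq!(setups.load(Ordering::SeqCst), 1);
}
```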

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7125384f-371f-4ad2-8733-728e4024b176

📥 Commits

Reviewing files that changed from the base of the PR and between db7aad0 and a57b22e.

📒 Files selected for processing (5)
  • .cursor/rules.md
  • AGENTS.md
  • datafusion/datasource-parquet/src/opener.rs
  • datafusion/datasource-parquet/src/source.rs
  • datafusion/physical-expr-adapter/src/schema_rewriter.rs
