18688: feat: allow custom caching via logical node#26
18688: feat: allow custom caching via logical node#26martin-augment wants to merge 4 commits intomainfrom
Conversation
WalkthroughThe changes introduce an extensible caching mechanism to DataFusion. A new ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Code Review: Custom Caching via Logical NodeThank you for this contribution! This PR adds a flexible extension point for custom caching strategies. Here's my detailed review: ✅ Strengths
🔍 Potential Issues1. Missing Physical Plan Implementation (Critical)The test only validates the logical plan structure but doesn't verify that the custom cache node can be executed. Important concern: Users implementing Recommendation: Add documentation explaining that users must:
Example needed in docs: // Users need both:
ctx.with_cache_producer(Arc::new(MyCacheProducer {}))
.with_physical_planner(/* custom planner with ExtensionPlanner */)2. Test Coverage Gaps
Recommendation: Add an integration test that: let result = cached_df.collect().await?;
// Verify results match original3. Documentation GapsThe
Recommendation: Expand the doc comment: /// Interface for applying a custom caching strategy.
///
/// # Overview
/// Implement this trait to provide custom caching behavior for DataFrames.
/// When registered via `SessionContext::with_cache_producer`, the `DataFrame::cache()`
/// method will create your custom logical node instead of materializing to a MemTable.
///
/// # Requirements
/// Your implementation must also provide an `ExtensionPlanner` that converts
/// your custom logical node to a physical plan. Register it via:
/// `SessionStateBuilder::with_extension_planner()`
///
/// # Example
/// See `datafusion/core/tests/user_defined/user_defined_plan.rs` for a complete
/// example of implementing custom logical nodes with physical planning.
4. Minor Code Quality IssuesIn
Recommendation: /// Create a test table with a custom cache producer for testing cache behavior.
///
/// Returns a DataFrame from the aggregate_test_100 table with a TestCacheProducer
/// that creates CacheNode instances when `cache()` is called.
pub async fn test_table_with_cache_producer() -> Result<DataFrame> {🎯 Performance Considerations
🔒 Security ConcernsNo security issues identified. The trait is properly bounded with 📝 API Consistency
🧪 Test Improvements NeededAdd these test cases:
📋 SummaryThis is a solid foundation for custom caching! The main concerns are: Must address:
Should address:
Nice to have:
The API design is sound and follows DataFusion patterns well. With better documentation and test coverage, this will be a valuable addition to DataFusion's extensibility. Review generated with Claude Code |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (4)
datafusion/core/src/dataframe/mod.rs (1)
2347-2365: Cache producer path looks correct; consider clarifyingcachesemantics in docsThe new
cache_producerbranch cleanly builds anExtensionplan node and preserves the existing MemTable materialization behavior in theelsebranch, so existing users without a cache producer should see identical behavior.For sessions with a cache producer,
cache()now returns a logical extension node without eagerly materializing into memory; the actual caching semantics are delegated to the custom node. This is reasonable, but the doc comment (“Cache DataFrame as a memory table.”) is now only accurate for the fallback path.I suggest adjusting the doc comment to something like “Cache this DataFrame using the configured cache producer, falling back to an in-memory table when none is set” to avoid confusion for users wiring in custom caching.
datafusion/core/src/test_util/mod.rs (1)
28-28: Test cache node/producer wiring looks good; consider derivingCloneonCacheNodeThe added imports,
CacheNodeimplementation,TestCacheProducer, andtest_table_with_cache_producerhelper cleanly exercise the new cache producer pathway:
CacheNodecorrectly wraps a singleLogicalPlaninput, delegatesschema()to the input, exposes no extra expressions, and formats succinctly for EXPLAIN.with_exprs_and_inputsenforcing exactly one input and rebuilding from that input is appropriate for this simple wrapper node.TestCacheProducerreturns aCacheNodewrapped inArc<dyn UserDefinedLogicalNode>, andtest_table_with_cache_producerwires it into aSessionContextin a minimal, targeted way for tests.One small improvement: many user-defined logical nodes in DataFusion also derive
Cloneso they satisfy any blanket trait implementations and can be easily duplicated in tests. AddingClonetoCacheNode’s derive list would align with that pattern and future‑proof the test node:#[derive(Hash, Eq, PartialEq, PartialOrd, Debug, Clone)] struct CacheNode { input: LogicalPlan, }Functionally everything else here looks solid for exercising the custom cache logical node path.
Also applies to: 40-41, 49-53, 291-336, 338-344
datafusion/core/tests/dataframe/mod.rs (1)
2341-2362: Suggest collecting and verifying results for more comprehensive testing.The test correctly verifies that the optimized plan contains a
CacheNode, but it doesn't verify the actual execution behavior. The existingcache_test(lines 2308-2339) both checks the plan structure AND collects/verifies the cached results. Consider adding result collection and verification to ensure the custom caching mechanism works correctly end-to-end:let df_results = df.collect().await?; let cached_df_results = cached_df.collect().await?; assert_eq!(&df_results, &cached_df_results);This would provide stronger validation that the cache producer integration functions correctly.
datafusion/core/src/execution/context/mod.rs (1)
1893-1900: Trait design is solid; consider enhancing documentation.The
CacheProducertrait follows a clean, focused design pattern. The single-method interface with appropriate trait bounds (Debug + Sync + Send) is well-suited for a pluggable caching strategy.Consider expanding the documentation to include:
- When and how the
createmethod is invoked (e.g., "called by DataFrame::cache when a cache producer is registered")- Expected behavior of the returned
UserDefinedLogicalNode- Example implementation reference (e.g., pointing to test utilities)
Example enhancement:
/// Interface for applying a custom caching strategy. -/// Implement this trait and register via [`SessionState`] -/// to create a custom logical node for caching. +/// +/// Implement this trait and register via [`SessionContext::with_cache_producer`] +/// or [`SessionState::set_cache_producer`] to provide a custom caching implementation. +/// +/// When [`DataFrame::cache`] is called, the registered producer will be invoked +/// to create a custom logical node that wraps the DataFrame's plan. +/// +/// # Example +/// See `test_util::TestCacheProducer` for a reference implementation. pub trait CacheProducer: Debug + Sync + Send { /// Create a custom logical node for caching /// given a logical plan (of DF to cache). fn create(&self, plan: LogicalPlan) -> Result<Arc<dyn UserDefinedLogicalNode>>; }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
datafusion/core/src/dataframe/mod.rs(2 hunks)datafusion/core/src/execution/context/mod.rs(2 hunks)datafusion/core/src/execution/session_state.rs(12 hunks)datafusion/core/src/test_util/mod.rs(4 hunks)datafusion/core/tests/dataframe/mod.rs(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
datafusion/core/src/test_util/mod.rs (1)
datafusion/core/src/execution/context/mod.rs (2)
create(1873-1877)create(1899-1899)
datafusion/core/src/execution/context/mod.rs (2)
datafusion/core/src/execution/session_state.rs (3)
with_cache_producer(1339-1345)cache_producer(368-370)cache_producer(1652-1654)datafusion/core/src/test_util/mod.rs (2)
create(186-195)create(333-335)
datafusion/core/src/execution/session_state.rs (1)
datafusion/core/src/execution/context/mod.rs (1)
with_cache_producer(472-475)
datafusion/core/src/dataframe/mod.rs (3)
datafusion/core/src/execution/session_state.rs (7)
cache_producer(368-370)cache_producer(1652-1654)task_ctx(288-290)task_ctx(833-835)new(973-1002)new(2052-2054)new(2353-2361)datafusion/core/src/execution/context/mod.rs (6)
new_with_state(355-361)schema(313-313)task_ctx(1698-1700)new(298-300)new(1954-1956)new(1988-1990)datafusion/expr/src/logical_plan/plan.rs (4)
schema(323-359)new(4502-4504)new(5127-5131)new(5193-5197)
datafusion/core/tests/dataframe/mod.rs (2)
datafusion/core/src/test_util/mod.rs (1)
test_table_with_cache_producer(339-344)datafusion/core/src/dataframe/mod.rs (1)
into_optimized_plan(1655-1658)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Analyze (rust)
- GitHub Check: claude-review
🔇 Additional comments (5)
datafusion/core/src/dataframe/mod.rs (1)
59-59: Import ofExtensionis appropriate and localizedThe new
use datafusion_expr::Extension;is needed for theLogicalPlan::Extensionconstruction incacheand does not affect other code paths. Looks good.datafusion/core/tests/dataframe/mod.rs (1)
64-64: LGTM: Import added correctly.The import of
test_table_with_cache_produceris properly placed and maintains the existing order in the test_util imports.datafusion/core/src/execution/context/mod.rs (1)
471-475: LGTM! Clean builder pattern integration.The implementation correctly follows DataFusion's builder pattern for registering pluggable components, consistent with
with_function_factoryand other similar methods.datafusion/core/src/execution/session_state.rs (2)
190-190: LGTM! SessionState integration follows established patterns.The
cache_producerfield and accessor methods are implemented consistently with the existingfunction_factorypattern in this file. TheOption<Arc<dyn CacheProducer>>type correctly models an optional, shared component.Also applies to: 362-370
958-958: LGTM! Complete and consistent SessionStateBuilder integration.The builder integration comprehensively handles the
cache_producerfield across all necessary code paths:
- Field declaration and initialization
- Builder method
with_cache_produceracceptingOptionfor flexibility- Propagation in
new_from_existingto preserve state when cloning- Correct threading through the
build()method- Mutable accessor for advanced use cases
- Debug formatting
All changes follow the established patterns in this file (comparable to
function_factoryintegration).Also applies to: 996-996, 1049-1049, 1338-1345, 1410-1410, 1447-1447, 1651-1654, 1694-1694
value:good-to-have; category:bug; feedback:The Claude AI reviewer is correct that the physical execution is not currently tested. Adding a test that the non-cached DataFrame returns the same results as the cached one will make the test complete. |
value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct that the physical execution is not currently tested. Adding a test that the non-cached DataFrame returns the same results as the cached one will make the test complete. |
value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct that the docstring of the DataFrame::cache() method needs to be updated to explain that a custom CacheProducer could be used instead of the eagerly materializing MemTable implementation. |
18688: To review by AI