18657: Implement CatalogProviderList in FFI #17

Open
martin-augment wants to merge 1 commit into main from pr-18657-2025-11-12-20-48-40

Conversation

@martin-augment
Owner

18657: To review by AI

@coderabbitai

coderabbitai bot commented Nov 12, 2025

Walkthrough

This change introduces FFI support for DataFusion's CatalogProviderList functionality. A new stable C-ABI wrapper struct FFI_CatalogProviderList is added with function pointers for core operations and internal state management. Associated conversion and trait implementations bridge between FFI and native types. The ForeignCatalogProvider inner field visibility is adjusted to crate-private. Test infrastructure is extended with a FixedCatalogProviderList implementation, factory function, and integration tests verifying round-trip behavior for catalog registration, listing, and retrieval. A test module exports these via a new create_catalog_list interface, while redundant integration test code is removed.

@claude

claude bot commented Nov 12, 2025

PR Review: Implement CatalogProviderList in FFI

Summary

This PR implements FFI support for CatalogProviderList, following the established patterns from catalog_provider.rs. The implementation is well-structured and consistent with the existing FFI codebase.

Code Quality and Structure ✅

Strengths:

  • Excellent consistency: The implementation closely mirrors FFI_CatalogProvider, maintaining consistency with existing patterns
  • Proper FFI design: Uses abi_stable types (RString, RVec, ROption) correctly for cross-FFI communication
  • Clean separation of concerns: Clear distinction between FFI_CatalogProviderList (producer side) and ForeignCatalogProviderList (consumer side)
  • Good documentation: Inline comments explain the purpose of structs and their usage
  • Proper memory management: Implements Drop trait and manual memory management via release function

Minor Observations:

  • Lines 48-49 in catalog_provider_list.rs: The comment says "execution plan" but should say "catalog provider list" (likely copy-pasted from another FFI module)

Potential Issues and Bugs

1. Dangling Temporary Risk in register_catalog (catalog_provider_list.rs:206) ⚠️

let catalog = match catalog.as_any().downcast_ref::<ForeignCatalogProvider>()
{
    Some(s) => &s.0,
    None => &FFI_CatalogProvider::new(catalog, None),  // ⚠️ Creates temporary!
};

Issue: The None arm constructs a new FFI_CatalogProvider as a temporary and immediately takes a reference to it. The temporary's Drop implementation runs when the temporary goes out of scope, yet the reference to it is what gets passed through FFI. If the temporary is dropped before the call completes, the FFI side receives a reference to released data.

Impact: This could lead to use-after-free if the FFI side tries to use the catalog after the temporary is dropped.

Recommendation: Store the created FFI_CatalogProvider in a variable to extend its lifetime, or restructure to avoid the temporary:

let temp_catalog;
let catalog = match catalog.as_any().downcast_ref::<ForeignCatalogProvider>() {
    Some(s) => &s.0,
    None => {
        temp_catalog = FFI_CatalogProvider::new(catalog, None);
        &temp_catalog
    }
};
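
As a self-contained illustration of the deferred-initialization pattern recommended above, the sketch below uses hypothetical stand-in types (`Wrapper`, `wrapped_name`) rather than the DataFusion ones. Safe Rust rejects a reference that outlives its referent at compile time, so the pattern mainly makes the wrapper's lifetime explicit instead of relying on temporary lifetime rules.

```rust
/// Hypothetical stand-in for a wrapper that is built on demand
/// (plays the role of FFI_CatalogProvider in the review above).
struct Wrapper(String);

impl Wrapper {
    fn new(name: &str) -> Self {
        Wrapper(format!("wrapped:{name}"))
    }
}

/// Reuses `existing` when present; otherwise builds a wrapper whose storage
/// is declared before the `match`, so it lives until the end of the function.
fn wrapped_name(existing: Option<&Wrapper>, name: &str) -> String {
    let temp; // declared first so it outlives the borrow taken below
    let wrapper: &Wrapper = match existing {
        Some(w) => w,
        None => {
            temp = Wrapper::new(name);
            &temp
        }
    };
    // `wrapper` is valid here in both arms.
    wrapper.0.clone()
}
```

The variable is only initialized in the `None` arm, which Rust permits because it is never read on the path where it is left unset.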

2. Inconsistent API with DataFusion CatalogProvider ⚠️

The register_catalog function pointer in FFI_CatalogProviderList returns ROption<FFI_CatalogProvider> (lines 35-39), whereas the existing FFI_CatalogProvider::register_schema returns RResult<ROption<...>, RString> to handle errors.

Looking at the CatalogProvider trait, register_schema returns Result<Option<...>> while CatalogProviderList::register_catalog returns just Option<...>. So the implementation is correct, but this is a design inconsistency in the DataFusion API itself that carries over.

Note: This is actually correct per the trait definition, not a bug.

Performance Considerations ✅

  • Clone efficiency: The clone implementation uses Arc::clone for the provider, which is efficient (only clones the pointer)
  • String conversions: Appropriate use of RString conversions; no unnecessary allocations
  • Runtime handling: Properly clones Option<Handle> for async runtime management

Security Concerns

1. Unsafe Code Audit

The unsafe code is generally well-handled:

  • inner() and runtime() methods properly cast private_data pointers
  • Box::from_raw in release_fn_wrapper correctly pairs with Box::into_raw in constructors
  • ✅ Raw pointer dereferences are protected by FFI boundaries
  • Send and Sync implementations are appropriate given the Arc-wrapped providers

2. Potential Double-Free Risk ⚠️

The Drop implementation calls self.release, which calls Box::from_raw. Need to ensure that:

  1. ✅ Each FFI_CatalogProviderList instance has exactly one owner of the private_data
  2. ✅ Clone creates new ownership properly (it does via Box::into_raw)

Assessment: Appears safe, but deserves careful review.
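
The ownership rule being checked here can be sketched in isolation. The following is a simplified, hypothetical model, not code from the PR: `FfiHandle` and `PrivateData` are illustrative names, and the payload is a plain `Arc<Vec<String>>`. It shows each handle owning exactly one boxed `private_data`, with clone allocating a fresh box and release freeing only its own.

```rust
use std::ffi::c_void;
use std::sync::Arc;

/// Hypothetical, simplified stand-in for an FFI wrapper struct.
#[repr(C)]
pub struct FfiHandle {
    clone: unsafe extern "C" fn(&FfiHandle) -> FfiHandle,
    release: unsafe extern "C" fn(&mut FfiHandle),
    private_data: *mut c_void,
}

/// State hidden behind the opaque pointer.
struct PrivateData {
    names: Arc<Vec<String>>,
}

unsafe extern "C" fn release_fn_wrapper(handle: &mut FfiHandle) {
    // Pairs with the Box::into_raw in `FfiHandle::new`; dropping the Box frees it.
    let data = unsafe { Box::from_raw(handle.private_data as *mut PrivateData) };
    drop(data);
}

unsafe extern "C" fn clone_fn_wrapper(handle: &FfiHandle) -> FfiHandle {
    // Arc::clone only bumps a reference count; `new` boxes a separate PrivateData,
    // so the clone never shares its private_data pointer with the original.
    FfiHandle::new(Arc::clone(handle.inner()))
}

impl FfiHandle {
    pub fn new(names: Arc<Vec<String>>) -> Self {
        let private_data = Box::new(PrivateData { names });
        FfiHandle {
            clone: clone_fn_wrapper,
            release: release_fn_wrapper,
            private_data: Box::into_raw(private_data) as *mut c_void,
        }
    }

    fn inner(&self) -> &Arc<Vec<String>> {
        // Only valid on the side that created private_data.
        unsafe { &(*(self.private_data as *const PrivateData)).names }
    }
}

impl Clone for FfiHandle {
    fn clone(&self) -> Self {
        unsafe { (self.clone)(self) }
    }
}

impl Drop for FfiHandle {
    fn drop(&mut self) {
        // Runs once per handle, so each Box is reclaimed exactly once.
        unsafe { (self.release)(self) }
    }
}
```

Because every handle is created through `new`, there is a one-to-one pairing of `Box::into_raw` and `Box::from_raw`; a double free would require copying the struct bitwise without going through `clone`.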

3. Thread Safety

  • Properly implements Send and Sync for both FFI structs
  • Uses Arc for shared ownership across threads
  • No mutable state that could cause data races

Test Coverage

Unit Tests (catalog_provider_list.rs:235-283) ✅

Good coverage:

  • ✅ Round-trip FFI conversion test
  • ✅ Tests catalog_names() functionality
  • ✅ Tests replacing existing catalog
  • ✅ Tests adding new catalog
  • ✅ Tests retrieving non-existent catalog
  • ✅ Tests retrieving valid catalog

Integration Tests (ffi_catalog.rs) ✅

Good coverage:

  • ✅ Tests catalog provider through SessionContext
  • ✅ Tests catalog provider list through SessionContext
  • ✅ Validates end-to-end query execution

Test Helper (tests/catalog.rs:187-237) ✅

The FixedCatalogProviderList test fixture is well-designed:

  • ✅ Pre-populates with one catalog ("blue")
  • ✅ Restricts catalog names to four colors for testing
  • ✅ Uses log::warn! for violations (good for debugging)

Minor suggestion: Consider testing the validation logic (attempting to register a catalog with an invalid name like "purple") to ensure the restriction works correctly.
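
A rough sketch of what such a validation test could look like, using only the standard library. `FixedList` and `Catalog` are hypothetical stand-ins for FixedCatalogProviderList and `Arc<dyn CatalogProvider>`, and the fixture's warning log is omitted. Since register_catalog returns None both for a rejected name and for a first-time valid registration, the check also inspects catalog_names.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a catalog (the real type would be Arc<dyn CatalogProvider>).
type Catalog = Arc<String>;

/// Hypothetical stand-in for the test fixture; starts empty, unlike the real
/// fixture, which pre-populates "blue".
#[derive(Default)]
struct FixedList {
    inner: Mutex<HashMap<String, Catalog>>,
}

impl FixedList {
    /// Returns the previously registered catalog, or None if the name is new
    /// or is not one of the four allowed colors.
    fn register_catalog(&self, name: String, catalog: Catalog) -> Option<Catalog> {
        if !["blue", "red", "green", "yellow"].contains(&name.as_str()) {
            // The fixture in the PR logs a warning here; this sketch only rejects.
            return None;
        }
        self.inner.lock().unwrap().insert(name, catalog)
    }

    fn catalog_names(&self) -> Vec<String> {
        let mut names: Vec<String> = self.inner.lock().unwrap().keys().cloned().collect();
        names.sort();
        names
    }
}

/// Sketch of the suggested test: "purple" is rejected and nothing is stored.
fn check_rejects_invalid_name() {
    let list = FixedList::default();
    let previous = list.register_catalog("purple".to_string(), Arc::new("c".to_string()));
    assert!(previous.is_none());
    assert!(list.catalog_names().is_empty());
}
```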

Style and Conventions ✅

  • ✅ Follows Apache 2.0 license header
  • ✅ Uses #[allow(non_camel_case_types)] consistently with other FFI modules
  • ✅ Proper use of #[repr(C)] for FFI structs
  • ✅ Consistent naming: FFI_* for producer, Foreign* for consumer
  • ✅ Good use of type aliases and imports

Recommendations

Critical

  1. Fix the memory management issue in register_catalog (catalog_provider_list.rs:206) to avoid use-after-free

Minor

  1. Update comment on line 48 to say "catalog provider list" instead of "execution plan"
  2. Consider adding a test case for the FixedCatalogProviderList validation logic
  3. Verify that the integration tests pass with --features integration-tests

Related Files Changed

  • catalog_provider.rs: Changed the inner field of ForeignCatalogProvider from private to pub(crate) - necessary for cross-module usage
  • lib.rs: Added new module export - correct
  • tests/mod.rs: Added create_catalog_list to test module - correct
  • tests/ffi_integration.rs: Moved catalog test to separate file - good refactoring

Conclusion

This is a well-implemented PR that follows established FFI patterns in the DataFusion codebase. The main concern is the potential memory issue in register_catalog that should be addressed before merging. The test coverage is solid, and the code structure is clean and maintainable.

Recommendation: Request changes to address the memory management issue, then approve after fix.


📝 Review conducted by Claude Code

@augmentcode augmentcode bot left a comment

Review completed. 3 suggestions posted.

Comment augment review to trigger a new review at any time.

    catalog: Arc<dyn CatalogProvider>,
) -> Option<Arc<dyn CatalogProvider>> {
    if !["blue", "red", "green", "yellow"].contains(&name.as_str()) {
        log::warn!("FixedCatalogProviderList only provides four schemas: blue, red, green, yellow");

The warning message mentions "schemas" but this type manages catalogs; consider using "catalogs" for accuracy (also applies to other locations in the PR).

Owner Author

value:good-to-have; category:typo; feedback:The Augment AI reviewer is correct that there is a copy/paste error in the warning message. It may lead to confusion when a developer investigates the logs.

module
    .create_catalog_list()
    .ok_or(DataFusionError::NotImplemented(
        "External catalog provider failed to implement create_catalog"

This error message references create_catalog, but this branch calls create_catalog_list; consider updating the text to match the function.

Owner Author

value:good-to-have; category:typo; feedback:The Augment AI reviewer is correct that there is a copy/paste error in the error message. It may lead to confusion when a developer investigates the logs.

/// This wrapper struct exists on the receiver side of the FFI interface, so it has
/// no guarantees about being able to access the data in `private_data`. Any functions
/// defined on this struct must only use the stable functions provided in
/// FFI_CatalogProviderList to interact with the foreign table provider.

Doc comment says "foreign table provider"; here it should refer to the foreign catalog provider list for consistency.

Owner Author

value:good-to-have; category:typo; feedback:The Augment AI reviewer is correct that there is a copy/paste error in the docstring. Fixing it prevents confusion when a developer reads the struct documentation.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8f6b7e1 and 604ec93.

📒 Files selected for processing (7)
  • datafusion/ffi/src/catalog_provider.rs (1 hunks)
  • datafusion/ffi/src/catalog_provider_list.rs (1 hunks)
  • datafusion/ffi/src/lib.rs (1 hunks)
  • datafusion/ffi/src/tests/catalog.rs (2 hunks)
  • datafusion/ffi/src/tests/mod.rs (3 hunks)
  • datafusion/ffi/tests/ffi_catalog.rs (1 hunks)
  • datafusion/ffi/tests/ffi_integration.rs (0 hunks)
💤 Files with no reviewable changes (1)
  • datafusion/ffi/tests/ffi_integration.rs
🧰 Additional context used
🧬 Code graph analysis (4)
datafusion/ffi/src/tests/mod.rs (2)
datafusion/ffi/src/catalog_provider_list.rs (2)
  • catalog (203-203)
  • catalog (224-232)
datafusion/ffi/src/tests/catalog.rs (2)
  • catalog (216-218)
  • create_catalog_provider_list (234-237)
datafusion/ffi/src/tests/catalog.rs (1)
datafusion/ffi/src/catalog_provider_list.rs (7)
  • catalog (203-203)
  • catalog (224-232)
  • inner (72-75)
  • new (152-167)
  • as_any (193-195)
  • catalog_names (215-222)
  • register_catalog (197-213)
datafusion/ffi/tests/ffi_catalog.rs (4)
datafusion/ffi/src/tests/utils.rs (1)
  • get_module (56-87)
datafusion/ffi/src/tests/catalog.rs (3)
  • default (78-88)
  • default (139-145)
  • default (195-204)
datafusion/ffi/src/catalog_provider.rs (1)
  • new (183-199)
datafusion/ffi/src/catalog_provider_list.rs (1)
  • new (152-167)
datafusion/ffi/src/catalog_provider_list.rs (2)
datafusion/ffi/src/catalog_provider.rs (9)
  • runtime (89-92)
  • inner (84-87)
  • new (183-199)
  • from (213-215)
  • release_fn_wrapper (147-150)
  • drop (176-178)
  • clone_fn_wrapper (152-173)
  • clone (219-221)
  • as_any (225-227)
datafusion/ffi/src/lib.rs (1)
  • version (52-56)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: claude-review
  • GitHub Check: Analyze (rust)
🔇 Additional comments (19)
datafusion/ffi/src/lib.rs (1)

29-29: LGTM!

The module addition follows the existing pattern and is appropriately placed in the module hierarchy.

datafusion/ffi/src/catalog_provider.rs (1)

207-207: LGTM!

The visibility change to pub(crate) is necessary for the new catalog_provider_list module to access the inner FFI_CatalogProvider (used at line 205 in catalog_provider_list.rs). The crate-private visibility appropriately limits exposure while enabling internal FFI bridging.

datafusion/ffi/src/tests/mod.rs (3)

37-38: LGTM!

The imports for the new catalog provider list functionality are correctly added.


67-68: LGTM!

The new create_catalog_list field follows the same pattern as the existing create_catalog field, providing a consistent API for test infrastructure.


131-131: LGTM!

The wiring correctly connects to the create_catalog_provider_list factory function from the catalog module.

datafusion/ffi/tests/ffi_catalog.rs (1)

29-54: LGTM!

The integration test correctly validates the FFI catalog provider by:

  • Creating a ForeignCatalogProvider from the FFI module
  • Registering it in a SessionContext
  • Querying a fully-qualified table path
  • Verifying the result count
datafusion/ffi/src/tests/catalog.rs (5)

31-31: LGTM!

The import for FFI_CatalogProviderList is correctly added to support the new test infrastructure.


34-34: LGTM!

The imports for CatalogProviderList and MemoryCatalogProviderList are correctly added.


187-205: LGTM!

The FixedCatalogProviderList test implementation correctly:

  • Wraps MemoryCatalogProviderList for delegation
  • Provides a default initialization with a "blue" catalog
  • Mirrors the pattern established by FixedCatalogProvider

207-232: LGTM!

The CatalogProviderList implementation correctly:

  • Delegates all methods to the inner MemoryCatalogProviderList
  • Validates catalog names against a whitelist in register_catalog
  • Uses log::warn! and returns None for invalid names (appropriate for an Option-returning method)

234-237: LGTM!

The factory function correctly creates an FFI_CatalogProviderList from the test implementation, following the same pattern as create_catalog_provider.

datafusion/ffi/src/catalog_provider_list.rs (8)

29-61: LGTM!

The FFI_CatalogProviderList struct is well-designed:

  • Uses #[repr(C)] and StableAbi for ABI stability
  • Function pointers follow the established pattern from FFI_CatalogProvider
  • Includes all necessary operations: register_catalog, catalog_names, catalog, clone, release, version
  • private_data pointer enables safe state management across FFI boundary

63-81: LGTM!

The Send/Sync implementations and helper methods correctly:

  • Mark the struct as thread-safe (necessary for FFI but inherently unsafe)
  • Provide safe access to the inner provider and runtime through private helper methods
  • Follow the exact pattern established in catalog_provider.rs

83-115: LGTM!

The catalog operation wrappers correctly:

  • Bridge between FFI types (RString, RVec, ROption) and native Rust types
  • Handle conversion between FFI_CatalogProvider and ForeignCatalogProvider
  • Propagate the runtime Handle for async operation support
  • Follow the established pattern from catalog_provider.rs

117-142: LGTM!

The memory management functions correctly:

  • release_fn_wrapper properly frees the Box-allocated private_data
  • clone_fn_wrapper creates a new instance with Arc::clone (cheap reference counting)
  • Follow the established pattern from catalog_provider.rs

144-168: LGTM!

The Drop implementation and constructor correctly:

  • Drop calls release to ensure cleanup
  • new() properly boxes private_data and leaks it as a raw pointer for FFI
  • All function pointers are initialized consistently
  • Follow the established pattern from catalog_provider.rs

170-190: LGTM!

The ForeignCatalogProviderList wrapper correctly:

  • Provides a safe interface on the receiver side of the FFI boundary
  • Implements Send/Sync (required for DataFusion's threading model)
  • Provides From conversion for ergonomic usage
  • Implements Clone via the function pointer
  • Follows the established pattern from ForeignCatalogProvider

192-233: LGTM!

The CatalogProviderList trait implementation correctly:

  • Bridges all required methods through FFI function pointers
  • Handles bidirectional conversion between ForeignCatalogProvider and FFI_CatalogProvider (lines 203-207)
  • Properly wraps non-FFI CatalogProviders when needed
  • Converts results back to Arc trait objects
  • Follows the same pattern as the CatalogProvider impl in catalog_provider.rs

235-283: LGTM!

The test provides comprehensive coverage:

  • Round-trip FFI conversion
  • Registering and listing catalogs
  • Replacing existing catalogs (returns Some)
  • Adding new catalogs (returns None)
  • Retrieving both existing and non-existent catalogs

This validates the FFI bridge works correctly across the boundary.

Comment on lines +56 to +81
#[tokio::test]
async fn test_catalog_list() -> datafusion_common::Result<()> {
    let module = get_module()?;

    let ffi_catalog_list =
        module
            .create_catalog_list()
            .ok_or(DataFusionError::NotImplemented(
                "External catalog provider failed to implement create_catalog"
                    .to_string(),
            ))?();
    let foreign_catalog_list: ForeignCatalogProviderList = (&ffi_catalog_list).into();

    let ctx = SessionContext::default();
    ctx.register_catalog_list(Arc::new(foreign_catalog_list));

    let df = ctx.table("blue.apple.purchases").await?;

    let results = df.collect().await?;

    assert_eq!(results.len(), 2);
    let num_rows: usize = results.into_iter().map(|rb| rb.num_rows()).sum();
    assert_eq!(num_rows, 5);

    Ok(())
}

⚠️ Potential issue | 🟡 Minor

Fix the error message for catalog list creation.

Line 64 references "create_catalog" in the error message, but it should reference "create_catalog_list" for consistency and clarity.

Apply this diff:

             .create_catalog_list()
             .ok_or(DataFusionError::NotImplemented(
-                    "External catalog provider failed to implement create_catalog"
+                    "External catalog provider failed to implement create_catalog_list"
                         .to_string(),
🤖 Prompt for AI Agents
In datafusion/ffi/tests/ffi_catalog.rs around lines 56 to 81, the error message
raised when create_catalog_list() returns None incorrectly references
"create_catalog" — update the message to reference "create_catalog_list"
instead; modify the DataFusionError::NotImplemented string to say "External
catalog provider failed to implement create_catalog_list" so the log matches the
failing function name and improves clarity.

Owner Author

value:good-to-have; category:typo; feedback:The CodeRabbit AI reviewer is correct that there is a copy/paste error in the error message. It may lead to confusion when a developer investigates the logs.
