Skip to content

18457: Support Arrow IPC Stream Files#10

Open
martin-augment wants to merge 43 commits intomainfrom
pr-18457-2025-11-10-09-28-02
Open

18457: Support Arrow IPC Stream Files#10
martin-augment wants to merge 43 commits intomainfrom
pr-18457-2025-11-10-09-28-02

Conversation

@martin-augment
Copy link
Copy Markdown
Owner

@martin-augment martin-augment commented Nov 10, 2025

18457: To review by AI


Note

Add support for reading Arrow IPC stream files, auto-detecting file vs stream, and update execution sources, planning, and tests.

  • Arrow format & execution:
    • Add ArrowStreamFileSource/ArrowStreamFileOpener for Arrow IPC stream format (sequential-only), and keep ArrowFileSource/ArrowFileOpener for file format (range-parallel).
    • Introduce ArrowSource/ArrowOpener wrappers to unify both formats.
    • Auto-detect format via is_object_in_arrow_ipc_file_format and select source in ArrowFormat::create_physical_plan.
    • Enhance schema inference to handle stream format (infer_stream_schema) and mixed inputs; detect and fallback from file to stream.
    • Expose file_type as arrow or arrow_stream and support projection pushdown.
  • Tests:
    • Add sqllogictests for stream format including scans, filters, projections, joins, aggregates, unions, partitions, and EXPLAIN outputs in arrow_files.slt.
    • Add integration tests for register_arrow to auto-detect and join file/stream.
    • Add unit tests for schema inference, format detection, repartition behavior (not supported for stream), range errors, and projections in new/updated tests under datasource-arrow and core.
  • Misc:
    • Update schema adapter integration test to use ArrowSource::new_file_source naming.

Written by Cursor Bugbot for commit 2ddd690. This will update automatically on new commits. Configure here.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Nov 10, 2025

Walkthrough

The PR adds support for Arrow IPC stream format alongside existing file format support. It introduces runtime format detection, splits ArrowSource into specialized file and stream implementations, refactors schema inference to handle both formats, and adds comprehensive test coverage for both Arrow formats.

Changes

Cohort / File(s) Summary
Test Registration and Execution
datafusion/core/tests/execution/mod.rs, datafusion/core/tests/execution/register_arrow.rs
Added module declaration for register_arrow and introduced two Tokio-based integration tests for the register_arrow API: test_register_arrow_auto_detects_format() and test_register_arrow_join_file_and_stream(). Tests validate schema detection, row counts, and query execution on Arrow file and stream data sources.
Schema Adapter Tests
datafusion/core/tests/schema_adapter/schema_adapter_integration_tests.rs
Updated test to use ArrowSource::new_file_source() instead of ArrowSource::new(), aligning with the new source specialization API.
Arrow Source Specialization
datafusion/datasource-arrow/src/source.rs
Refactored Arrow IPC handling by introducing ArrowFileSource and ArrowStreamFileSource as specialized implementations, replacing the monolithic ArrowSource. Converted ArrowSource into a wrapper delegating to inner concrete sources. Split ArrowOpener into ArrowFileOpener and ArrowStreamFileOpener with tailored behavior. Added constructors new_file_source() and new_stream_file_source(). Updated file type identifiers to "arrow" and "arrow_stream".
Arrow Format Detection and Inference
datafusion/datasource-arrow/src/file_format.rs
Rewrote schema inference to support both Arrow IPC file and stream formats. Added runtime format detection via is_object_in_arrow_ipc_file_format() helper. Introduced infer_stream_schema() with fallback FileReader→StreamReader logic. Updated create_physical_plan() to select appropriate ArrowSource variant based on detected format. Enhanced error handling for metadata parsing.
Module Documentation
datafusion/datasource-arrow/src/mod.rs
Added module-level documentation reference for ArrowFormat.
SQL Logic Tests
datafusion/sqllogictest/test_files/arrow_files.slt
Added comprehensive test suite for Arrow IPC stream format with parity checks against file format, including projection pushdown, filtering, aggregations, joins, unions, GROUP BY, DISTINCT, subqueries, and partitioned tables.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch pr-18457-2025-11-10-09-28-02

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Ok(true) => Arc::new(ArrowSource::new_file_source(table_schema)),
Ok(false) => Arc::new(ArrowSource::new_stream_file_source(table_schema)),
Err(e) => Err(e)?,
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Arrow: Mixed Formats Cause Scan Failures

The create_physical_plan method only checks the first file to determine format (file vs stream), then applies that same source type to all files in the scan. When multiple Arrow files are queried and they have mixed formats (some IPC file format, some IPC stream format), the wrong source type gets used for files after the first one, causing read errors. Each file should be checked individually or all files should be required to use the same format.

Fix in Cursor Fix in Web

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Bugbot AI reviewer is correct that there might be a mix of plain and stream files in the different file groups. The finding prevents wrong behavior when processing an object store location with mixed file types.

@claude
Copy link
Copy Markdown

claude bot commented Nov 10, 2025

Code Review for PR #10: Support Arrow IPC Stream Files

Overview

This PR adds support for reading Arrow IPC stream format files alongside the existing file format support. The implementation includes auto-detection, appropriate execution paths, and comprehensive test coverage.


Strengths

  1. Excellent Documentation

    • Clear module-level docs explaining the confusing naming (ArrowStreamFileSource for stream format vs ArrowFileSource for file format)
    • Well-commented fallback logic in schema inference (file_format.rs:156-165)
    • Comprehensive explanation of why repartitioning isn't supported for stream format (source.rs:211-219)
  2. Comprehensive Test Coverage

    • Unit tests for both formats with various scenarios (ranges, projections, repartitioning)
    • Integration tests for auto-detection and joining file/stream formats
    • SQL logic tests covering scans, filters, projections, joins, aggregates, unions, and partitions
    • Edge case testing (empty files, corrupted files, short streams)
  3. Robust Error Handling

    • Graceful fallback from file format to stream format during schema inference
    • Proper error messages for unsupported operations (e.g., ranges with stream format)
    • Validation of metadata length (file_format.rs:425-429)
  4. Good Architecture

    • Clean separation between file and stream format implementations
    • Wrapper pattern (ArrowSource) provides unified interface
    • Proper use of traits (FileSource, FileOpener)

⚠️ Issues and Concerns

1. Critical: Potential Panic in Production Code

Location: file_format.rs:346, 358

footer_len_buf[..].try_into().unwrap(),
footer_buf[..footer_len].try_into().unwrap(),

Issue: Using unwrap() can cause panics if slice conversion fails. This is particularly risky with potentially malformed files.

Recommendation: Replace with proper error handling:

footer_len_buf[..].try_into()
    .map_err(|_| exec_datafusion_err!("Invalid footer length buffer"))?

2. Bug: Empty File Handling

Location: file_format.rs:479-493, test at line 713

Issue: The test acknowledges that empty files cause errors, but this is left unhandled:

// currently errors because it tries to read 0..6 from an empty file
assert!(result.is_err(), "Empty file should error");

Recommendation: Add explicit validation before reading:

if bytes.len() < 6 {
    return Err(ArrowError::ParseError(
        "File too small to be a valid Arrow IPC file".to_string()
    ).into());
}

Impact: Users will see confusing object_store errors instead of clear Arrow format errors.


3. Performance: Multiple File Reads During Format Detection

Location: file_format.rs:186-212

Issue: Format detection requires reading the first file in the group:

  1. First, is_object_in_arrow_ipc_file_format() reads bytes 0-6
  2. Then during execution, the file is read again

For large distributed systems with many files, this adds latency.

Recommendation: Consider caching the detection result or passing it through the config. For now, this is acceptable but worth noting for future optimization.


4. Code Quality: Unnecessary Clone Pattern

Location: source.rs:103, 113, 191, 201

Arc::new(Self { ..self.clone() })

Issue: The spread operator .. after self.clone() is redundant since you're cloning everything anyway.

Recommendation: Simplify to:

Arc::new(self.clone())

5. Potential Issue: Inconsistent Error Handling in Schema Inference

Location: file_format.rs:156-165

Issue: The fallback from FileReader to StreamReader silently swallows all FileReader errors (using Err(_)). This could hide legitimate errors like I/O failures or permission issues.

Recommendation: Be more selective about which errors trigger fallback:

match FileReader::try_new(&mut file, None) {
    Ok(reader) => reader.schema(),
    Err(e) if is_format_error(&e) => {
        file.seek(SeekFrom::Start(0))?;
        StreamReader::try_new(&mut file, None)?.schema()
    }
    Err(e) => return Err(e.into()),
}

6. Security: Range Validation Missing

Location: source.rs:398-399

Issue: Range filtering doesn't validate that ranges are within file bounds:

block_offset >= range.start as u64 && block_offset < range.end as u64

If a malicious or buggy caller provides ranges beyond file size, this could lead to unexpected behavior.

Recommendation: Add validation earlier in the chain or document that callers are responsible for providing valid ranges.


📝 Minor Suggestions

  1. Consistency: The term "file vs stream" is sometimes abbreviated inconsistently. Consider using consistent terminology throughout.

  2. Magic Number Documentation: The ARROW_MAGIC and CONTINUATION_MARKER constants are well-defined, but consider adding a comment linking to the Arrow IPC spec for reference.

  3. Test File Paths: Tests use relative paths like "../../datafusion/datasource-arrow/tests/data/example.arrow". Consider using a test utility for path resolution to make tests more maintainable.

  4. Unused Import: source.rs:33 imports Cursor but it's only used in one code path. Consider moving it closer to usage.


🔒 Security Assessment

Overall: Low Risk

  • No obvious security vulnerabilities
  • File format validation is generally sound
  • Error messages don't leak sensitive information
  • The panic issues mentioned should be addressed to prevent DoS via malformed files

Performance Assessment

Overall: Good

Positive:

  • Stream format correctly prevents repartitioning (would be inefficient)
  • File format maintains efficient range-based parallel reading
  • Schema inference is lazy (doesn't read entire file)

Concerns:

  • Double file read during format detection (mentioned above)
  • Stream format requires reading entire file sequentially (documented limitation)

🧪 Test Coverage Assessment

Overall: Excellent

  • Unit tests: ✅ Comprehensive
  • Integration tests: ✅ Good coverage
  • SQL logic tests: ✅ Extensive
  • Edge cases: ✅ Well tested
  • Error paths: ✅ Adequately covered

📊 Summary

Recommendation: Approve with Minor Fixes

This is a well-implemented feature with good architecture and excellent test coverage. The issues identified are mostly minor, with the exception of the unwrap() calls which should be addressed before merge.

Priority Fixes:

  1. 🔴 Replace unwrap() calls with proper error handling (file_format.rs:346, 358)
  2. 🟡 Add validation for empty files with clear error message
  3. 🟡 Consider more selective error handling in schema inference fallback

Nice to Have:

  • Simplify clone patterns
  • Add range validation
  • Improve consistency in naming/terminology

Stats:

  • Files changed: 10
  • Lines added: 1,128
  • Lines removed: 171
  • Test files: 3 (unit, integration, sqllogictest)

Great work overall! The feature is well-designed and the documentation clearly explains the complexities involved.

Copy link
Copy Markdown

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 1 suggestions posted.

Comment augment review to trigger a new review at any time.

fn open(&self, partitioned_file: PartitionedFile) -> Result<FileOpenFuture> {
if partitioned_file.range.is_some() {
return Err(exec_datafusion_err!(
"ArrowStreamOpener does not support range-based reading"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message references "ArrowStreamOpener" which doesn’t match the type name here and may confuse users; consider aligning the wording to describe the Arrow IPC stream format rather than the opener (applies to this message).

🤖 Was this useful? React with 👍 or 👎

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:documentation; feedback: The Augment AI reviewer is correct that the name of the opener is wrong in the error message. It should be ArrowStreamFileOpener instead. Prevents confusion when debugging issues caused by this error.

@github-actions
Copy link
Copy Markdown

  • datafusion/core/src/execution/context/mod.rs:1272 — Dropping the ExprSimplifier means execute_prepared now rejects anything that isn’t already an Expr::Literal. Before this PR we simplified the arguments first, so calls like EXECUTE plan(10 + 20) or casts such as '1'::INT worked. With the new code they fail with “Unsupported parameter type” (as reflected in the updated prepare.slt). That’s a user-visible regression unrelated to IPC stream support; please keep the simplifier (or otherwise evaluate the expressions) so parameters that reduce to literals continue to work.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
datafusion/datasource-arrow/src/source.rs (1)

102-114: Propagate batch_size/projection for Arrow IPC file sources

with_batch_size and with_projection on ArrowFileSource / ArrowStreamFileSource currently ignore their parameters (datafusion/datasource-arrow/src/source.rs:102–114) and create_file_opener does not forward batch_size to the opener — unlike Parquet/CSV/JSON/Avro which store and pass batch_size to their openers/readers.

  • Fix: add batch_size: Option to ArrowFileSource and ArrowStreamFileSource, set it in with_batch_size(), forward it into ArrowFileOpener/ArrowStreamFileOpener, and use it to configure the arrow::ipc reader builders so session/file-scan batch_size is honored.
🧹 Nitpick comments (3)
datafusion/datasource-arrow/src/file_format.rs (1)

185-219: Add documentation clarifying format homogeneity assumption.

Format detection reads only the first file's magic bytes (lines 191-199) and applies that source type to all files in the group. While DataFusion file groups typically contain homogeneous files, this assumption should be documented in code. Consider adding a comment explaining that all files within a file group are expected to have the same Arrow format (IPC file or streaming), or add validation to detect and reject mixed formats at plan creation time.

datafusion/datasource-arrow/src/source.rs (2)

439-444: Consider making the inner field private.

The public inner field at line 442 exposes implementation details and allows bypassing the wrapper. Unless direct access is required for a specific use case, consider making it private to maintain better encapsulation.

Apply this diff if direct access is not required:

 pub struct ArrowSource {
-    pub inner: Arc<dyn FileSource>,
+    pub(crate) inner: Arc<dyn FileSource>,
 }

If external access is needed, consider providing a getter method instead:

impl ArrowSource {
    pub fn inner(&self) -> &Arc<dyn FileSource> {
        &self.inner
    }
}

546-549: Consider making the inner field private.

Similar to ArrowSource, the public inner field at line 548 exposes implementation details. Consider making it private or pub(crate) to maintain better encapsulation.

Apply this diff if direct access is not required:

 pub struct ArrowOpener {
-    pub inner: Arc<dyn FileOpener>,
+    pub(crate) inner: Arc<dyn FileOpener>,
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 09ab068 and 2ddd690.

📒 Files selected for processing (7)
  • datafusion/core/tests/execution/mod.rs (1 hunks)
  • datafusion/core/tests/execution/register_arrow.rs (1 hunks)
  • datafusion/core/tests/schema_adapter/schema_adapter_integration_tests.rs (1 hunks)
  • datafusion/datasource-arrow/src/file_format.rs (9 hunks)
  • datafusion/datasource-arrow/src/mod.rs (1 hunks)
  • datafusion/datasource-arrow/src/source.rs (7 hunks)
  • datafusion/sqllogictest/test_files/arrow_files.slt (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
datafusion/core/tests/schema_adapter/schema_adapter_integration_tests.rs (1)
datafusion/datasource-arrow/src/source.rs (8)
  • new (65-72)
  • new (157-164)
  • new (447-449)
  • new (559-561)
  • table_schema (98-100)
  • table_schema (252-254)
  • table_schema (541-543)
  • new_file_source (452-456)
datafusion/core/tests/execution/mod.rs (1)
datafusion/core/src/execution/context/mod.rs (1)
  • register_arrow (1545-1563)
datafusion/datasource-arrow/src/file_format.rs (1)
datafusion/datasource-arrow/src/source.rs (9)
  • table_schema (98-100)
  • table_schema (252-254)
  • table_schema (541-543)
  • new (65-72)
  • new (157-164)
  • new (447-449)
  • new (559-561)
  • new_file_source (452-456)
  • new_stream_file_source (459-463)
datafusion/datasource-arrow/src/source.rs (3)
datafusion/datasource/src/file.rs (13)
  • as_file_source (39-41)
  • table_schema (66-66)
  • create_file_opener (55-60)
  • as_any (62-62)
  • with_batch_size (68-68)
  • with_statistics (72-72)
  • statistics (80-80)
  • with_projection (70-70)
  • repartitioned (94-117)
  • metrics (78-78)
  • file_type (82-82)
  • with_schema_adapter_factory (142-150)
  • schema_adapter_factory (155-157)
datafusion/datasource/src/file_scan_config.rs (10)
  • file (2121-2137)
  • table_schema (309-311)
  • as_any (516-518)
  • with_batch_size (399-402)
  • with_statistics (339-342)
  • with_projection (319-321)
  • repartitioned (556-570)
  • metrics (646-648)
  • open (495-514)
  • projection (1806-1810)
datafusion/datasource/src/table_schema.rs (1)
  • table_schema (169-171)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: codex
  • GitHub Check: Analyze (rust)
🔇 Additional comments (17)
datafusion/core/tests/execution/mod.rs (1)

21-21: LGTM!

The module declaration properly exposes the new register_arrow integration tests.

datafusion/sqllogictest/test_files/arrow_files.slt (1)

132-355: Excellent test coverage for Arrow IPC stream format!

The tests comprehensively validate parity between file and stream formats, including:

  • Schema inference and physical plans
  • Projection pushdown and filtering
  • Aggregations, joins, unions, and subqueries
  • Partitioned tables with partition filters

This ensures both formats behave identically across all major query patterns.

datafusion/core/tests/execution/register_arrow.rs (2)

23-55: LGTM!

The test properly validates that register_arrow auto-detects both file and stream formats, and that both produce equivalent schemas and row counts.


57-90: LGTM!

The test validates that joins between file-backed and stream-backed tables work correctly, confirming cross-format interoperability.

datafusion/datasource-arrow/src/mod.rs (1)

18-19: LGTM!

The documentation comment clarifies the module's purpose.

datafusion/core/tests/schema_adapter/schema_adapter_integration_tests.rs (1)

287-292: LGTM!

The test has been correctly updated to use the new new_file_source constructor, and the comment has been updated to reflect that this tests ArrowFileSource specifically.

datafusion/datasource-arrow/src/file_format.rs (7)

25-25: LGTM!

The new imports support file rewinding, stream format reading, and partial object reads for format detection.

Also applies to: 31-31, 65-67


144-173: LGTM with graceful fallback logic!

The schema inference correctly handles both file and stream formats:

  • For file payloads, it tries FileReader first, then falls back to StreamReader with proper file rewinding
  • For stream payloads, it uses the new infer_stream_schema helper

The fallback ensures compatibility with both Arrow IPC formats.


237-239: LGTM!

The file_source method correctly defaults to the file format source using new_file_source.


385-448: Robust schema inference with version compatibility!

The infer_stream_schema function properly handles multiple Arrow IPC versions:

  • File format with ARROW1 magic (pre and post v0.15.0 with continuation marker)
  • Stream format with and without continuation marker
  • Validates metadata length is non-negative (lines 425-430)

The implementation correctly parses the preamble, extracts metadata length, and constructs the schema.


450-477: LGTM!

The helper function correctly extends a byte vector from a stream:

  • Returns early if sufficient bytes are already available
  • Accumulates bytes until the target length is reached
  • Properly handles truncated streams with a clear error message

479-493: LGTM!

The format detection efficiently reads only the first 6 bytes to check for the Arrow IPC file format magic number (ARROW1). The length check on line 492 properly guards against short reads.


586-725: Excellent test coverage!

The new tests comprehensively validate:

  • Schema inference for both file and stream formats
  • Different chunk sizes to verify streaming logic
  • Truncated stream error handling
  • Format detection for file, stream, corrupted, and empty files

The comment on line 722 noting that empty files "should error" is accurate since reading 6 bytes from an empty object would naturally fail.

datafusion/datasource-arrow/src/source.rs (4)

18-30: LGTM! Clear documentation on naming conventions.

The documentation effectively clarifies the potentially confusing naming where "Stream" refers to the Arrow IPC stream format, not streaming I/O.


204-221: LGTM! Well-documented rationale for disabling repartitioning.

The detailed explanation clearly justifies why stream format cannot support range-based parallel reading, avoiding duplicate I/O or the overhead of recreating footer information.


334-433: LGTM! Complex but well-structured range-based reading.

The implementation properly handles footer parsing, dictionary loading, and range-based record batch filtering for parallel reading of Arrow IPC file format.


607-649: The test data files exist and the relative paths are correct.

The review comment's concern is unfounded. Both example.arrow and example_stream.arrow exist at datafusion/datasource-arrow/tests/data/, and the relative paths used in the test ("tests/data/{filename}") are correct for Rust crate test execution, where cargo test runs with the crate root as the working directory.

Likely an incorrect or invalid review comment.

Comment on lines +265 to +269
if partitioned_file.range.is_some() {
return Err(exec_datafusion_err!(
"ArrowStreamOpener does not support range-based reading"
));
}
Copy link
Copy Markdown

@coderabbitai coderabbitai bot Nov 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix error message to match struct name.

The error message refers to "ArrowStreamOpener" but the struct is named "ArrowStreamFileOpener".

Apply this diff:

-                "ArrowStreamOpener does not support range-based reading"
+                "ArrowStreamFileOpener does not support range-based reading"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if partitioned_file.range.is_some() {
return Err(exec_datafusion_err!(
"ArrowStreamOpener does not support range-based reading"
));
}
if partitioned_file.range.is_some() {
return Err(exec_datafusion_err!(
"ArrowStreamFileOpener does not support range-based reading"
));
}
🤖 Prompt for AI Agents
In datafusion/datasource-arrow/src/source.rs around lines 265 to 269, the error
string incorrectly references "ArrowStreamOpener" instead of the struct name
"ArrowStreamFileOpener"; update the returned error message to exactly
"ArrowStreamFileOpener does not support range-based reading" so it matches the
struct name.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:documentation; feedback: The CodeRabbit AI reviewer is correct that the name of the opener is wrong in the error message. It should be ArrowStreamFileOpener instead. Prevents confusion when debugging issues caused by this error.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martin-augment Thank you for confirming! You're absolutely right—having the correct struct name in the error message is important for debugging. When developers encounter this error, seeing "ArrowStreamFileOpener" will help them quickly identify the exact component that's raising the issue, making troubleshooting much more straightforward.

The fix ensures the error message accurately reflects the code structure.

@martin-augment
Copy link
Copy Markdown
Owner Author

1. Critical: Potential Panic in Production Code

Location: file_format.rs:346, 358

footer_len_buf[..].try_into().unwrap(),
footer_buf[..footer_len].try_into().unwrap(),

Issue: Using unwrap() can cause panics if slice conversion fails. This is particularly risky with potentially malformed files.

value:annoying; category:bug; feedback: The Claude AI reviewer is wrong! There is no such code in file_format.rs! There is similar code in datasource-arrow/src/source.rs but it is not modified in this Pull Request.

@martin-augment
Copy link
Copy Markdown
Owner Author

2. Bug: Empty File Handling

Location: file_format.rs:479-493, test at line 713

Issue: The test acknowledges that empty files cause errors, but this is left unhandled:

// currently errors because it tries to read 0..6 from an empty file
assert!(result.is_err(), "Empty file should error");

value:useful; category:bug; feedback: The Claude AI reviewer is correct that the handling of empty/smaller-than-6-bytes files is not good. Currently reading such files would fail with Error::NotModified (https://docs.rs/object_store/latest/object_store/struct.GetOptions.html#structfield.range) which is really confusing.

@martin-augment
Copy link
Copy Markdown
Owner Author

3. Performance: Multiple File Reads During Format Detection

Location: file_format.rs:186-212

Issue: Format detection requires reading the first file in the group:

  1. First, is_object_in_arrow_ipc_file_format() reads bytes 0-6
  2. Then during execution, the file is read again

For large distributed systems with many files, this adds latency.

value:good-but-wont-fix; category:bug; feedback: The Claude Ai reviewer is partially correct. The new logic reads the first 6 bytes only for the first file in the first file group to detect the type of all files in all groups. The latency is minimal but needed.

@martin-augment
Copy link
Copy Markdown
Owner Author

4. Code Quality: Unnecessary Clone Pattern

Location: source.rs:103, 113, 191, 201

Arc::new(Self { ..self.clone() })

Issue: The spread operator .. after self.clone() is redundant since you're cloning everything anyway.

value:good-to-have; category:bug; feedback: The Claude Ai reviewer is correct that there is no need of the spread operator here since all fields from the clone are used. There is no bug or regression - just more complicated code than needed.

@martin-augment
Copy link
Copy Markdown
Owner Author

5. Potential Issue: Inconsistent Error Handling in Schema Inference

Location: file_format.rs:156-165

Issue: The fallback from FileReader to StreamReader silently swallows all FileReader errors (using Err(_)). This could hide legitimate errors like I/O failures or permission issues.

value:good-to-have; category:bug; feedback: The Claude Ai reviewer is correct that the error from the FileReader attempt is discarded and if the type of the file is not a stream then the wrong error is shown to the user. The finding prevents misleading errors.

@martin-augment
Copy link
Copy Markdown
Owner Author

  • datafusion/core/src/execution/context/mod.rs:1272 — Dropping the ExprSimplifier means execute_prepared now rejects anything that isn’t already an Expr::Literal. Before this PR we simplified the arguments first, so calls like EXECUTE plan(10 + 20) or casts such as '1'::INT worked. With the new code they fail with “Unsupported parameter type” (as reflected in the updated prepare.slt). That’s a user-visible regression unrelated to IPC stream support; please keep the simplifier (or otherwise evaluate the expressions) so parameters that reduce to literals continue to work.

value:annoying; category:bug; feedback:The Codex AI reviewer is wrong! This file is not modified in this Pull Request.

@martin-augment
Copy link
Copy Markdown
Owner Author

439-444: Consider making the inner field private.

The public inner field at line 442 exposes implementation details and allows bypassing the wrapper. Unless direct access is required for a specific use case, consider making it private to maintain better encapsulation

value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct that the inner field should not be exposed to the public API. Prevents exposing internal implementation that would make it hard to be improved later because it may break users of the API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants