Conversation
Walkthrough

The PR adds support for Arrow IPC stream format alongside existing file format support. It introduces runtime format detection, splits ArrowSource into specialized file and stream implementations, refactors schema inference to handle both formats, and adds comprehensive test coverage for both Arrow formats.
```rust
    Ok(true) => Arc::new(ArrowSource::new_file_source(table_schema)),
    Ok(false) => Arc::new(ArrowSource::new_stream_file_source(table_schema)),
    Err(e) => Err(e)?,
};
```
Bug: Arrow: Mixed Formats Cause Scan Failures
The create_physical_plan method only checks the first file to determine format (file vs stream), then applies that same source type to all files in the scan. When multiple Arrow files are queried and they have mixed formats (some IPC file format, some IPC stream format), the wrong source type gets used for files after the first one, causing read errors. Each file should be checked individually or all files should be required to use the same format.
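A per-file check could look like the following sketch. The helper names are hypothetical and the classification runs over in-memory headers for illustration; the real fix would live in `create_physical_plan` and read each object's leading bytes from the object store.

```rust
/// Magic bytes that open an Arrow IPC *file* format payload.
const ARROW_FILE_MAGIC: &[u8; 6] = b"ARROW1";

/// Classify one payload by its leading bytes:
/// true = IPC file format, false = assume IPC stream format.
fn is_ipc_file_format(header: &[u8]) -> bool {
    header.starts_with(ARROW_FILE_MAGIC)
}

/// Check every file instead of only the first one; reject mixed scans.
fn uniform_arrow_format(headers: &[&[u8]]) -> Result<bool, String> {
    let mut formats = headers.iter().map(|h| is_ipc_file_format(h));
    let Some(first) = formats.next() else {
        return Ok(true); // empty scan: nothing can conflict
    };
    if formats.all(|f| f == first) {
        Ok(first)
    } else {
        Err("mixed Arrow IPC file and stream formats in one scan".into())
    }
}
```

Rejecting mixed scans up front trades a small amount of planning-time I/O for a clear error instead of a confusing read failure partway through execution.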
value:useful; category:bug; feedback: The Bugbot AI reviewer is correct that there might be a mix of plain and stream files in the different file groups. The finding prevents wrong behavior when processing an object store location with mixed file types.
Code Review for PR #10: Support Arrow IPC Stream Files

Overview

This PR adds support for reading Arrow IPC stream format files alongside the existing file format support. The implementation includes auto-detection, appropriate execution paths, and comprehensive test coverage.

✅ Strengths
```rust
fn open(&self, partitioned_file: PartitionedFile) -> Result<FileOpenFuture> {
    if partitioned_file.range.is_some() {
        return Err(exec_datafusion_err!(
            "ArrowStreamOpener does not support range-based reading"
```
The error message references "ArrowStreamOpener" which doesn’t match the type name here and may confuse users; consider aligning the wording to describe the Arrow IPC stream format rather than the opener (applies to this message).
value:useful; category:documentation; feedback: The Augment AI reviewer is correct that the name of the opener is wrong in the error message. It should be ArrowStreamFileOpener instead. Prevents confusion when debugging issues caused by this error.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
datafusion/datasource-arrow/src/source.rs (1)
102-114: Propagate batch_size/projection for Arrow IPC file sources

with_batch_size and with_projection on ArrowFileSource / ArrowStreamFileSource currently ignore their parameters (datafusion/datasource-arrow/src/source.rs:102-114), and create_file_opener does not forward batch_size to the opener, unlike Parquet/CSV/JSON/Avro, which store and pass batch_size to their openers/readers.

- Fix: add a batch_size: Option<usize> field to ArrowFileSource and ArrowStreamFileSource, set it in with_batch_size(), forward it into ArrowFileOpener/ArrowStreamFileOpener, and use it to configure the arrow::ipc reader builders so the session/file-scan batch_size is honored.
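The store-and-forward pattern this comment asks for can be sketched in isolation. The type names and the 8192 default below are illustrative stand-ins, not DataFusion's actual API:

```rust
/// Illustrative stand-in for the file source: it stores the configured
/// batch size instead of ignoring it.
#[derive(Clone, Default)]
struct StreamFileSource {
    batch_size: Option<usize>,
}

/// Illustrative stand-in for the opener the source creates.
struct StreamFileOpener {
    batch_size: usize,
}

impl StreamFileSource {
    /// Keep the value instead of dropping it.
    fn with_batch_size(mut self, batch_size: usize) -> Self {
        self.batch_size = Some(batch_size);
        self
    }

    /// Forward the stored value into the opener, falling back to a
    /// default when the session never configured one.
    fn create_file_opener(&self) -> StreamFileOpener {
        StreamFileOpener {
            batch_size: self.batch_size.unwrap_or(8192),
        }
    }
}
```

This mirrors how the other format sources in the comment keep configuration on the source and thread it through to the reader they construct.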
🧹 Nitpick comments (3)
datafusion/datasource-arrow/src/file_format.rs (1)
185-219: Add documentation clarifying format homogeneity assumption.

Format detection reads only the first file's magic bytes (lines 191-199) and applies that source type to all files in the group. While DataFusion file groups typically contain homogeneous files, this assumption should be documented in code. Consider adding a comment explaining that all files within a file group are expected to have the same Arrow format (IPC file or streaming), or add validation to detect and reject mixed formats at plan creation time.
datafusion/datasource-arrow/src/source.rs (2)
439-444: Consider making the inner field private.

The public inner field at line 442 exposes implementation details and allows bypassing the wrapper. Unless direct access is required for a specific use case, consider making it private to maintain better encapsulation.

Apply this diff if direct access is not required:

```diff
 pub struct ArrowSource {
-    pub inner: Arc<dyn FileSource>,
+    pub(crate) inner: Arc<dyn FileSource>,
 }
```

If external access is needed, consider providing a getter method instead:

```rust
impl ArrowSource {
    pub fn inner(&self) -> &Arc<dyn FileSource> {
        &self.inner
    }
}
```
546-549: Consider making the inner field private.

Similar to ArrowSource, the public inner field at line 548 exposes implementation details. Consider making it private or pub(crate) to maintain better encapsulation.

Apply this diff if direct access is not required:

```diff
 pub struct ArrowOpener {
-    pub inner: Arc<dyn FileOpener>,
+    pub(crate) inner: Arc<dyn FileOpener>,
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- datafusion/core/tests/execution/mod.rs (1 hunks)
- datafusion/core/tests/execution/register_arrow.rs (1 hunks)
- datafusion/core/tests/schema_adapter/schema_adapter_integration_tests.rs (1 hunks)
- datafusion/datasource-arrow/src/file_format.rs (9 hunks)
- datafusion/datasource-arrow/src/mod.rs (1 hunks)
- datafusion/datasource-arrow/src/source.rs (7 hunks)
- datafusion/sqllogictest/test_files/arrow_files.slt (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
datafusion/core/tests/schema_adapter/schema_adapter_integration_tests.rs (1)
datafusion/datasource-arrow/src/source.rs (8)
new (65-72), new (157-164), new (447-449), new (559-561), table_schema (98-100), table_schema (252-254), table_schema (541-543), new_file_source (452-456)
datafusion/core/tests/execution/mod.rs (1)
datafusion/core/src/execution/context/mod.rs (1)
register_arrow (1545-1563)
datafusion/datasource-arrow/src/file_format.rs (1)
datafusion/datasource-arrow/src/source.rs (9)
table_schema (98-100), table_schema (252-254), table_schema (541-543), new (65-72), new (157-164), new (447-449), new (559-561), new_file_source (452-456), new_stream_file_source (459-463)
datafusion/datasource-arrow/src/source.rs (3)
datafusion/datasource/src/file.rs (13): as_file_source (39-41), table_schema (66-66), create_file_opener (55-60), as_any (62-62), with_batch_size (68-68), with_statistics (72-72), statistics (80-80), with_projection (70-70), repartitioned (94-117), metrics (78-78), file_type (82-82), with_schema_adapter_factory (142-150), schema_adapter_factory (155-157)
datafusion/datasource/src/file_scan_config.rs (10): file (2121-2137), table_schema (309-311), as_any (516-518), with_batch_size (399-402), with_statistics (339-342), with_projection (319-321), repartitioned (556-570), metrics (646-648), open (495-514), projection (1806-1810)
datafusion/datasource/src/table_schema.rs (1): table_schema (169-171)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: codex
- GitHub Check: Analyze (rust)
🔇 Additional comments (17)
datafusion/core/tests/execution/mod.rs (1)
21-21: LGTM! The module declaration properly exposes the new register_arrow integration tests.
datafusion/sqllogictest/test_files/arrow_files.slt (1)
132-355: Excellent test coverage for Arrow IPC stream format! The tests comprehensively validate parity between file and stream formats, including:
- Schema inference and physical plans
- Projection pushdown and filtering
- Aggregations, joins, unions, and subqueries
- Partitioned tables with partition filters
This ensures both formats behave identically across all major query patterns.
datafusion/core/tests/execution/register_arrow.rs (2)
23-55: LGTM! The test properly validates that register_arrow auto-detects both file and stream formats, and that both produce equivalent schemas and row counts.
57-90: LGTM! The test validates that joins between file-backed and stream-backed tables work correctly, confirming cross-format interoperability.
datafusion/datasource-arrow/src/mod.rs (1)
18-19: LGTM! The documentation comment clarifies the module's purpose.
datafusion/core/tests/schema_adapter/schema_adapter_integration_tests.rs (1)
287-292: LGTM! The test has been correctly updated to use the new new_file_source constructor, and the comment has been updated to reflect that this tests ArrowFileSource specifically.

datafusion/datasource-arrow/src/file_format.rs (7)
25-25: LGTM! The new imports support file rewinding, stream format reading, and partial object reads for format detection.
Also applies to: 31-31, 65-67
144-173: LGTM with graceful fallback logic! The schema inference correctly handles both file and stream formats:

- For file payloads, it tries FileReader first, then falls back to StreamReader with proper file rewinding
- For stream payloads, it uses the new infer_stream_schema helper

The fallback ensures compatibility with both Arrow IPC formats.
237-239: LGTM! The file_source method correctly defaults to the file format source using new_file_source.
385-448: Robust schema inference with version compatibility! The infer_stream_schema function properly handles multiple Arrow IPC versions:

- File format with ARROW1 magic (pre and post v0.15.0 with continuation marker)
- Stream format with and without continuation marker
- Validates metadata length is non-negative (lines 425-430)

The implementation correctly parses the preamble, extracts metadata length, and constructs the schema.
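Under the assumptions stated above (an optional 0xFFFFFFFF continuation marker followed by a little-endian i32 metadata length), the preamble handling can be sketched as follows. This is an illustrative reconstruction, not the PR's actual code:

```rust
/// Sketch of stream-preamble parsing: skip an optional continuation
/// marker (introduced in Arrow 0.15.0), then read the i32 little-endian
/// metadata length, rejecting negative values.
fn metadata_len(preamble: &[u8]) -> Result<usize, String> {
    // Post-0.15.0 streams lead with a 0xFFFFFFFF continuation marker.
    let start = if preamble.starts_with(&[0xff, 0xff, 0xff, 0xff]) {
        4
    } else {
        0
    };
    let bytes: [u8; 4] = preamble
        .get(start..start + 4)
        .ok_or_else(|| "truncated preamble".to_string())?
        .try_into()
        .expect("slice of length 4");
    // A negative length means a corrupt or misdetected payload.
    usize::try_from(i32::from_le_bytes(bytes))
        .map_err(|_| "negative metadata length".to_string())
}
```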
450-477: LGTM! The helper function correctly extends a byte vector from a stream:

- Returns early if sufficient bytes are already available
- Accumulates bytes until the target length is reached
- Properly handles truncated streams with a clear error message
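The behavior those bullets describe can be shown in a standalone sketch over any std::io::Read source (the name and chunk size are illustrative):

```rust
use std::io::{self, Read};

/// Grow `buf` from `reader` until it holds at least `target` bytes.
/// Returns early when enough bytes are already buffered and reports a
/// clear error when the stream ends too soon.
fn extend_to(buf: &mut Vec<u8>, reader: &mut impl Read, target: usize) -> io::Result<()> {
    if buf.len() >= target {
        return Ok(()); // sufficient bytes already available
    }
    let mut chunk = [0u8; 4096];
    while buf.len() < target {
        let n = reader.read(&mut chunk)?;
        if n == 0 {
            // Truncated stream: surface a descriptive error.
            return Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                "stream ended before the expected length was reached",
            ));
        }
        buf.extend_from_slice(&chunk[..n]);
    }
    Ok(())
}
```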
479-493: LGTM! The format detection efficiently reads only the first 6 bytes to check for the Arrow IPC file format magic number (ARROW1). The length check on line 492 properly guards against short reads.
586-725: Excellent test coverage! The new tests comprehensively validate:
- Schema inference for both file and stream formats
- Different chunk sizes to verify streaming logic
- Truncated stream error handling
- Format detection for file, stream, corrupted, and empty files
The comment on line 722 noting that empty files "should error" is accurate since reading 6 bytes from an empty object would naturally fail.
datafusion/datasource-arrow/src/source.rs (4)
18-30: LGTM! Clear documentation on naming conventions. The documentation effectively clarifies the potentially confusing naming where "Stream" refers to the Arrow IPC stream format, not streaming I/O.
204-221: LGTM! Well-documented rationale for disabling repartitioning. The detailed explanation clearly justifies why stream format cannot support range-based parallel reading, avoiding duplicate I/O or the overhead of recreating footer information.
334-433: LGTM! Complex but well-structured range-based reading. The implementation properly handles footer parsing, dictionary loading, and range-based record batch filtering for parallel reading of Arrow IPC file format.
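The range-based batch filtering mentioned here can be illustrated with a small sketch. The offsets stand in for the record-batch byte offsets recorded in the file format's footer; this is not the PR's actual code:

```rust
/// Keep exactly the record batches whose starting byte offset falls
/// inside the scan range [start, end), so concurrent partitions each
/// read a disjoint set of batches and no batch is emitted twice.
fn batches_in_range(batch_offsets: &[u64], start: u64, end: u64) -> Vec<usize> {
    batch_offsets
        .iter()
        .enumerate()
        .filter(|&(_, &offset)| start <= offset && offset < end)
        .map(|(index, _)| index)
        .collect()
}
```

Assigning a batch to the partition whose range contains its starting offset is what makes this scheme safe for parallel readers, and it is only possible because the file format's footer records those offsets up front.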
607-649: The test data files exist and the relative paths are correct. The review comment's concern is unfounded. Both example.arrow and example_stream.arrow exist at datafusion/datasource-arrow/tests/data/, and the relative paths used in the test ("tests/data/(unknown)") are correct for Rust crate test execution, where cargo test runs with the crate root as the working directory.

Likely an incorrect or invalid review comment.
```rust
if partitioned_file.range.is_some() {
    return Err(exec_datafusion_err!(
        "ArrowStreamOpener does not support range-based reading"
    ));
}
```
Fix error message to match struct name.
The error message refers to "ArrowStreamOpener" but the struct is named "ArrowStreamFileOpener".
Apply this diff:

```diff
-        "ArrowStreamOpener does not support range-based reading"
+        "ArrowStreamFileOpener does not support range-based reading"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
 if partitioned_file.range.is_some() {
     return Err(exec_datafusion_err!(
-        "ArrowStreamOpener does not support range-based reading"
+        "ArrowStreamFileOpener does not support range-based reading"
     ));
 }
```
🤖 Prompt for AI Agents
In datafusion/datasource-arrow/src/source.rs around lines 265 to 269, the error
string incorrectly references "ArrowStreamOpener" instead of the struct name
"ArrowStreamFileOpener"; update the returned error message to exactly
"ArrowStreamFileOpener does not support range-based reading" so it matches the
struct name.
value:useful; category:documentation; feedback: The CodeRabbit AI reviewer is correct that the name of the opener is wrong in the error message. It should be ArrowStreamFileOpener instead. Prevents confusion when debugging issues caused by this error.
@martin-augment Thank you for confirming! You're absolutely right—having the correct struct name in the error message is important for debugging. When developers encounter this error, seeing "ArrowStreamFileOpener" will help them quickly identify the exact component that's raising the issue, making troubleshooting much more straightforward.
The fix ensures the error message accurately reflects the code structure.
value:annoying; category:bug; feedback: The Claude AI reviewer is wrong! There is no such code in file_format.rs! There is similar code in datasource-arrow/src/source.rs but it is not modified in this Pull Request.
value:useful; category:bug; feedback: The Claude AI reviewer is correct that the handling of empty/smaller-than-6-bytes files is not good. Currently reading such files would fail with Error::NotModified (https://docs.rs/object_store/latest/object_store/struct.GetOptions.html#structfield.range), which is really confusing.
value:good-but-wont-fix; category:bug; feedback: The Claude AI reviewer is partially correct. The new logic reads the first 6 bytes only for the first file in the first file group to detect the type of all files in all groups. The latency is minimal but needed.
value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct that there is no need for the spread operator here since all fields from the clone are used. There is no bug or regression, just more complicated code than needed.
value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct that the error from the FileReader attempt is discarded, so if the file is not in stream format the wrong error is shown to the user. The finding prevents misleading errors.
value:annoying; category:bug; feedback: The Codex AI reviewer is wrong! This file is not modified in this Pull Request.
value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct that the inner field should not be exposed in the public API. This prevents exposing internal implementation details that would make the code hard to improve later, because changes could break users of the API.
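A guard like the following sketch would turn the confusing short-read failure described in the smaller-than-6-bytes feedback above into an explicit error. The function name and message are illustrative, not the PR's code:

```rust
/// Magic bytes that open an Arrow IPC *file* format payload.
const ARROW_FILE_MAGIC: &[u8; 6] = b"ARROW1";

/// Classify a payload only after confirming the object is long enough to
/// hold the six magic bytes, so tiny or empty objects fail with a clear
/// message instead of an opaque range-request error.
fn detect_ipc_file_format(object_len: usize, header: &[u8]) -> Result<bool, String> {
    if object_len < ARROW_FILE_MAGIC.len() {
        return Err(format!(
            "object is {object_len} bytes, too small to be an Arrow IPC payload"
        ));
    }
    Ok(header.starts_with(ARROW_FILE_MAGIC))
}
```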
18457: To review by AI
Note
Add support for reading Arrow IPC stream files, auto-detecting file vs stream, and update execution sources, planning, and tests.
- ArrowStreamFileSource/ArrowStreamFileOpener for Arrow IPC stream format (sequential-only), and keep ArrowFileSource/ArrowFileOpener for file format (range-parallel).
- ArrowSource/ArrowOpener wrappers to unify both formats.
- is_object_in_arrow_ipc_file_format and source selection in ArrowFormat::create_physical_plan.
- Schema inference refactor (infer_stream_schema) and mixed inputs; detect and fall back from file to stream.
- file_type reported as arrow or arrow_stream, with projection pushdown support.
- Tests in arrow_files.slt.
- register_arrow updated to auto-detect and join file/stream.
- Tests in datasource-arrow and core.
- ArrowSource::new_file_source naming.

Written by Cursor Bugbot for commit 2ddd690. This will update automatically on new commits. Configure here.