Description
Describe the bug, including details regarding any error messages, version, and platform.
Moved from apache/arrow#37069. Original text follows:
I have some C# code using the Apache.Arrow 12.0.1 nuget to write .feather files with ArrowFileWriter.WriteRecordBatch() and its companions. These files are then read in R 4.2 using read_feather() from the arrow 12.0.0 package (Windows 10 22H2, RStudio 2023.06.1). This process works fine for files with a single record batch up to at least 2,129,214,698 bytes (1.98 GB). Much above that—the next file size up the code I'm running produces happens to be 2,176,530,466 bytes, 2.02 GB—and read_feather() fails with
Above that it fails: the next file size up that my code produces happens to be 2,176,530,466 bytes (2.02 GB), and for that file read_feather() errors with

Error: IOError: Invalid IPC message: negative bodyLength

This appears to be coming from CheckMetadataAndGetBodyLength(), which takes bodyLength as an int64_t. So presumably something upstream of CheckMetadataAndGetBodyLength() is handling bodyLength as 32-bit signed and inducing integer rollover.
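As a sanity check on that hypothesis, truncating a 64-bit length just over 2³¹ to a signed 32-bit value does go negative. This illustrates the suspected failure mode only; it is not Arrow's actual code path:

```csharp
// A 64-bit length pushed through a signed 32-bit field wraps negative.
long bodyLength = 2_176_530_466;            // size of the failing 2.02 GB file
int truncated = unchecked((int)bodyLength);
System.Console.WriteLine(truncated);        // prints -2118436830
```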
The error message is a bit cryptic, and from a brief look at the code it's unclear how a block's bodyLength relates to file and record batch size. Since forcing the total (uncompressed) size of a record batch below 2 GB avoids the negative bodyLength error, it appears CheckMetadataAndGetBodyLength() is picking up something controlled by record batch size rather than file size. So far I've tested multi-batch files up to 3.6 GB and read_feather() handles them fine.
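My workaround amounts to capping rows per batch. A sketch, where BuildBatch is a hypothetical helper that materializes rows [start, start + count) of the data:

```csharp
using System;
using System.IO;
using Apache.Arrow;
using Apache.Arrow.Ipc;

long totalRows = 44_000_000;            // per the failing file above
const long rowsPerBatch = 10_000_000;   // at ~49 B/row this stays well under 2 GB

using var stream = File.Create("example.feather");
ArrowFileWriter? writer = null;
for (long start = 0; start < totalRows; start += rowsPerBatch)
{
    long count = Math.Min(rowsPerBatch, totalRows - start);
    RecordBatch batch = BuildBatch(start, count);   // hypothetical helper
    writer ??= new ArrowFileWriter(stream, batch.Schema);
    writer.WriteRecordBatch(batch);
}
writer?.WriteEnd();
writer?.Dispose();
```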
The Arrow spec seems to suggest compliant Arrow implementations support array lengths of up to 2³¹ − 1 elements. The record batch format here is eleven four-byte columns, plus one one-byte column and a couple of two-byte columns, so one reading of the spec is that files (or maybe record batches) up to about 98 GB should be compatible with the recommended limits for multi-language use.
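Spelling that arithmetic out (reading the row as eleven 4-byte columns, one 1-byte column, and two 2-byte columns):

```csharp
// Worked arithmetic for the spec-limit estimate above.
const long bytesPerRow = 11 * 4 + 1 + 2 * 2;      // 49 bytes per row
const long maxElements = int.MaxValue;            // 2^31 - 1 elements
const long maxBytes = bytesPerRow * maxElements;  // ~105.2e9 bytes, i.e. ~98 GiB
```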
I'm including the C++ tag on this as I'm not seeing where C# or R would be inducing rollover, which suggests it may be a lower-level issue, or perhaps one at the managed-unmanaged boundary. ArrowBuffer is built on ReadOnlyMemory<byte> and is therefore constrained by its int32 Length property. I'm not sure whether that's considered to meet the spec, but it doesn't appear it should be causing rollover here: the 2.02 GB file is only 44M records, so not more than about 170 MB per column even when everything is in one RecordBatch. The unchecked int32 conversion in ArrowMemoryReaderImplementation.ReadNextRecordBatch() probably isn't involved either, and since I'm not getting an exception at write time, the checked C# conversions from long can be excluded.
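To make that buffer-size reasoning concrete (rounded figures, ignoring validity bitmaps and padding):

```csharp
// Each column buffer fits easily within ReadOnlyMemory<byte>'s int32 Length,
// but the batch's aggregate body size does not.
const long rows = 44_000_000;            // per the 2.02 GB file
const long perInt32Column = rows * 4;    // ~176e6 bytes per 4-byte column
const long body = 11 * perInt32Column + rows + 2 * rows * 2;
// body ≈ 2.16e9 bytes > int.MaxValue: only an aggregate field like bodyLength
// can roll over here, not any individual column buffer.
```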