Description
Describe the bug, including details regarding any error messages, version, and platform.
Moved from apache/arrow#37069. Original text follows:
I have some C# code using the Apache.Arrow 12.0.1 nuget to write .feather files with ArrowFileWriter.WriteRecordBatch() and its companions. These files are then read in R 4.2 using read_feather() from the arrow 12.0.0 package (Windows 10 22H2, RStudio 2023.06.1). This process works fine for files with a single record batch up to at least 2,129,214,698 bytes (1.98 GB). Much above that—the next file size up the code I'm running produces happens to be 2,176,530,466 bytes, 2.02 GB—and read_feather() fails with
Above that it fails: the next file size up that my code produces happens to be 2,176,530,466 bytes (2.02 GB), and for that file read_feather() errors with

Error: IOError: Invalid IPC message: negative bodyLength

This appears to be coming from CheckMetadataAndGetBodyLength(), which takes bodyLength as an int64_t. So presumably something upstream of CheckMetadataAndGetBodyLength() is handling bodyLength as 32-bit signed and inducing integer rollover.
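As a sanity check on that hypothesis, truncating a 64-bit length just over 2³¹ to a signed 32-bit value does go negative. This illustrates the suspected failure mode only; it is not Arrow's actual code path:

```csharp
// A 64-bit length pushed through a signed 32-bit field wraps negative.
long bodyLength = 2_176_530_466;            // size of the failing 2.02 GB file
int truncated = unchecked((int)bodyLength);
System.Console.WriteLine(truncated);        // prints -2118436830
```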
The error message is a bit cryptic, and from a brief look at the code it's unclear how a block's bodyLength relates to file and record batch size. Since forcing the total (uncompressed) size of a record batch below 2 GB avoids the negative bodyLength error, it appears CheckMetadataAndGetBodyLength() is picking up something controlled by record batch size rather than file size. So far I've tested multi-batch files up to 3.6 GB and read_feather() handles them fine.
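My workaround amounts to capping rows per batch. A sketch, where BuildBatch is a hypothetical helper that materializes rows [start, start + count) of the data:

```csharp
using System;
using System.IO;
using Apache.Arrow;
using Apache.Arrow.Ipc;

long totalRows = 44_000_000;            // per the failing file above
const long rowsPerBatch = 10_000_000;   // at ~49 B/row this stays well under 2 GB

using var stream = File.Create("example.feather");
ArrowFileWriter? writer = null;
for (long start = 0; start < totalRows; start += rowsPerBatch)
{
    long count = Math.Min(rowsPerBatch, totalRows - start);
    RecordBatch batch = BuildBatch(start, count);   // hypothetical helper
    writer ??= new ArrowFileWriter(stream, batch.Schema);
    writer.WriteRecordBatch(batch);
}
writer?.WriteEnd();
writer?.Dispose();
```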
The Arrow spec seems to suggest compliant Arrow implementations support array lengths of up to 2³¹ − 1 elements. The record batch format here is eleven four-byte columns, plus one one-byte column and a couple of two-byte columns, so one reading of the spec is that files (or maybe record batches) up to about 98 GB should be compatible with the recommended limits for multi-language use.
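Spelling that arithmetic out (reading the row as eleven 4-byte columns, one 1-byte column, and two 2-byte columns):

```csharp
// Worked arithmetic for the spec-limit estimate above.
const long bytesPerRow = 11 * 4 + 1 + 2 * 2;      // 49 bytes per row
const long maxElements = int.MaxValue;            // 2^31 - 1 elements
const long maxBytes = bytesPerRow * maxElements;  // ~105.2e9 bytes, i.e. ~98 GiB
```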
I'm including the C++ tag on this as I'm not seeing where C# or R would be inducing rollover, which suggests it may be a lower-level issue, or perhaps one at the managed-unmanaged boundary. ArrowBuffer is built on ReadOnlyMemory<byte> and is therefore constrained by its int32 Length property. I'm not sure whether that's considered to meet the spec, but it doesn't appear it should be causing rollover here: the 2.02 GB file is only 44M records, so not more than about 170 MB per column even when everything is in one RecordBatch. The unchecked int32 conversion in ArrowMemoryReaderImplementation.ReadNextRecordBatch() probably isn't involved either, and since I'm not getting an exception at write time, the checked C# conversions from long can be excluded.
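To make that buffer-size reasoning concrete (rounded figures, ignoring validity bitmaps and padding):

```csharp
// Each column buffer fits easily within ReadOnlyMemory<byte>'s int32 Length,
// but the batch's aggregate body size does not.
const long rows = 44_000_000;            // per the 2.02 GB file
const long perInt32Column = rows * 4;    // ~176e6 bytes per 4-byte column
const long body = 11 * perInt32Column + rows + 2 * rows * 2;
// body ≈ 2.16e9 bytes > int.MaxValue: only an aggregate field like bodyLength
// can roll over here, not any individual column buffer.
```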