Investigate performance of FixedSizeList/List types in Parquet files #40

We use [FixedSizeListArrays](https://arrow.apache.org/docs/format/Columnar.html#fixed-size-list-layout) and [ListArrays](https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout) to represent tensor and variably shaped data, respectively. In the Apache Arrow columnar format, these structures simply establish a view over a flat buffer of values; in the ListArray case, each nested dimension additionally carries its own offsets array.

Arrow's layout doesn't map one-to-one onto Parquet, which means that reading (and possibly writing) these nested structures can be inefficient compared to I/O on primitive types. Relevant issue and comments:

- https://github.com/apache/arrow/issues/34510
- https://github.com/apache/arrow/issues/34510#issuecomment-1464215953
- https://github.com/apache/arrow/issues/34510#issuecomment-1464331411
- https://github.com/apache/arrow/issues/34510#issuecomment-1464387983

> To be honest parquet's tag line could be "It's good enough". You can almost certainly do 2-3x better than parquet for any given workload, but you really need orders of magnitude improvements to overcome ecosystem inertia. I suspect most workloads will also mix in byte arrays and/or object storage or block compression, at which point those will easily be the tall pole in decode performance.

- https://github.com/apache/arrow/issues/34510#issuecomment-1464463384

> Arrow based fixed size lists of primitive values (eg. tensors) shouldn't be converted to nested parquet data, but instead they are better as BYTE_ARRAY in parquet (while I think it'd be important sadly there is no fixed size BYTE_ARRAY in the parquet spec so it'll be still slightly slower than possible). Also some fast paths for never null data - which was not marked as non-nullable when the data was saved - can be useful too, but that's all.

So if optimal Parquet I/O performance is needed for nested, tensor-type data, it sounds as if casting between the List types and fixed-size binary types ([pyarrow.binary](https://arrow.apache.org/docs/python/generated/pyarrow.binary.html#pyarrow.binary)/[Fixed Size Primitives](https://arrow.apache.org/docs/format/Columnar.html#fixed-size-primitive-layout)) might be an easy fix, should performance prove to be a problem.

