Is your feature request related to a problem or challenge? Please describe what you are trying to do.
While working on micro-optimizations, we have seen a common pattern where older parts of the arrow-rs codebase use ArrayData to create new arrays.
An ArrayData has at least one extra allocation (for the Vec that holds Buffers) as well as a bunch of dynamic function calls. While this overhead is small individually, it is paid for every array, so in aggregate it can be substantial.
It also typically requires an unsafe call, which is unnecessary because the new APIs can be checked by the compiler.
Quoting @tustvold
My 2 cents is it would be better to move the codepaths relying on ArrayData over to using the typed arrays directly, this should not only cut down on allocations but unnecessary validation and dispatch overheads.
Describe the solution you'd like
Change the code paths that rely on ArrayData to create the typed arrays directly. This should cut down not only on allocations but also on unnecessary validation and dispatch overhead.
Describe alternatives you've considered
Here are some example PRs
The old, less efficient pattern looks like this (note the vec![buffer], which allocates a Vec just to hold the buffer).
    let data = unsafe {
        ArrayData::new_unchecked(T::DATA_TYPE, len, None, Some(null), 0, vec![buffer], vec![])
    };
    PrimitiveArray::from(data)
or
    let array_data = ArrayDataBuilder::new(arrow_data_type)
        .len(self.record_reader.num_values())
        .add_buffer(record_data)
        .null_bit_buffer(self.record_reader.consume_bitmap_buffer());
    let array_data = unsafe { array_data.build_unchecked() };
The new pattern looks like this (note there is no unsafe and no extra Vec allocation).
    // Create the nulls directly (note the `filter`, which drops the
    // NullBuffer when it contains no nulls)
    let nulls =
        Some(NullBuffer::new(BooleanBuffer::new(null, 0, len))).filter(|n| n.null_count() > 0);
    // Create the PrimitiveArray directly
    PrimitiveArray::new(ScalarBuffer::from(buffer), nulls)
Note: the only tricky thing I have seen is that ArrayDataBuilder automatically checks for and drops NullBuffers that have no nulls. When updating the code, we need to follow a similar pattern (as the `filter` call above does).
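To make the null-dropping rule concrete, here is a minimal std-only sketch of the check. It does not use the arrow-rs types: a plain `Vec<u8>` stands in for the validity bitmap, and the names `count_nulls` and `nulls_if_any` are hypothetical helpers invented for illustration. It assumes the Arrow convention that validity bits are stored least-significant-bit first and that a set bit means "valid".

```rust
/// Count the unset (null) bits among the first `len` bits of an
/// LSB-first validity bitmap. Panics if `validity` holds fewer than
/// `len` bits. Hypothetical helper, not an arrow-rs API.
fn count_nulls(validity: &[u8], len: usize) -> usize {
    (0..len)
        .filter(|&i| (validity[i / 8] & (1 << (i % 8))) == 0)
        .count()
}

/// Keep the validity bitmap only when at least one slot is null,
/// mirroring the `Some(..).filter(|n| n.null_count() > 0)` idiom.
/// Hypothetical helper, not an arrow-rs API.
fn nulls_if_any(validity: Vec<u8>, len: usize) -> Option<Vec<u8>> {
    Some(validity).filter(|v| count_nulls(v, len) > 0)
}
```

For example, with three slots, the bitmap `0b0000_0111` marks every slot valid, so `nulls_if_any` returns `None`, whereas `0b0000_0101` has one null slot and the bitmap is kept. Padding bits beyond `len` are ignored, which is why the count is bounded by `len` rather than taken over whole bytes.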
Additional context
Related: … make_array) #9061
Example PRs:
… PrimitiveArrays directly rather than via ArrayData #9122
… ArrayData (1% improvement) #9120