Skip to content

Commit 90839df

Browse files
authored
[Parquet] perf: Create StructArrays directly rather than via ArrayData (1% improvement) (#9120)
# Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. --> - part of #9061 - Part of #9128 # Rationale for this change I noticed on #9061 that there is non trivial overhead to create struct arrays. I am trying to improve `make_array` in parallel, but @tustvold had an even better idea in #9058 (comment) > My 2 cents is it would be better to move the codepaths relying on ArrayData over to using the typed arrays directly, this should not only cut down on allocations but unnecessary validation and dispatch overheads. # What changes are included in this PR? Update the parquet `StructArray` reader (used for the top level RecordBatch) to directly construct StructArray rather than using ArrayData # Are these changes tested? By existing CI Benchmarks show a small repeatable improvement of a few percent. For example ``` arrow_reader_clickbench/async/Q10 1.00 12.7±0.35ms ? ?/sec 1.02 12.9±0.44ms ? ?/sec ``` I am pretty sure this is because the click bench dataset has more than 100 columns. Creating such a struct array requires cloning 100 `ArrayData` (one for each child) which each has a Vec<Buffers>. So this saves (at least) 100 allocations per batch # Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out. -->
1 parent 4ddaa8c commit 90839df

File tree

2 files changed

+30
-16
lines changed

2 files changed

+30
-16
lines changed

arrow-buffer/src/builder/boolean.rs

Lines changed: 9 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -16,7 +16,7 @@
1616
// under the License.
1717

1818
use crate::bit_util::apply_bitwise_binary_op;
19-
use crate::{BooleanBuffer, Buffer, MutableBuffer, bit_util};
19+
use crate::{BooleanBuffer, Buffer, MutableBuffer, NullBuffer, bit_util};
2020
use std::ops::Range;
2121

2222
/// Builder for [`BooleanBuffer`]
@@ -289,6 +289,14 @@ impl From<BooleanBufferBuilder> for BooleanBuffer {
289289
}
290290
}
291291

292+
impl From<BooleanBufferBuilder> for NullBuffer {
293+
#[inline]
294+
fn from(builder: BooleanBufferBuilder) -> Self {
295+
let boolean_buffer = BooleanBuffer::from(builder);
296+
NullBuffer::new(boolean_buffer)
297+
}
298+
}
299+
292300
#[cfg(test)]
293301
mod tests {
294302
use super::*;

parquet/src/arrow/array_reader/struct_array.rs

Lines changed: 21 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -18,8 +18,8 @@
1818
use crate::arrow::array_reader::ArrayReader;
1919
use crate::errors::{ParquetError, Result};
2020
use arrow_array::{Array, ArrayRef, StructArray, builder::BooleanBufferBuilder};
21-
use arrow_data::{ArrayData, ArrayDataBuilder};
22-
use arrow_schema::DataType as ArrowType;
21+
use arrow_buffer::NullBuffer;
22+
use arrow_schema::{DataType as ArrowType, DataType};
2323
use std::any::Any;
2424
use std::sync::Arc;
2525

@@ -124,16 +124,15 @@ impl ArrayReader for StructArrayReader {
124124
return Err(general_err!("Not all children array length are the same!"));
125125
}
126126

127-
// Now we can build array data
128-
let mut array_data_builder = ArrayDataBuilder::new(self.data_type.clone())
129-
.len(children_array_len)
130-
.child_data(
131-
children_array
132-
.into_iter()
133-
.map(|x| x.into_data())
134-
.collect::<Vec<ArrayData>>(),
135-
);
127+
let DataType::Struct(fields) = &self.data_type else {
128+
return Err(general_err!(
129+
"Internal: StructArrayReader must have struct data type, got {:?}",
130+
self.data_type
131+
));
132+
};
133+
let fields = fields.clone(); // cloning Fields is cheap (Arc internally)
136134

135+
let mut nulls = None;
137136
if self.nullable {
138137
// calculate struct def level data
139138

@@ -168,12 +167,19 @@ impl ArrayReader for StructArrayReader {
168167
if bitmap_builder.len() != children_array_len {
169168
return Err(general_err!("Failed to decode level data for struct array"));
170169
}
171-
172-
array_data_builder = array_data_builder.null_bit_buffer(Some(bitmap_builder.into()));
170+
nulls = Some(NullBuffer::from(bitmap_builder));
173171
}
174172

175-
let array_data = unsafe { array_data_builder.build_unchecked() };
176-
Ok(Arc::new(StructArray::from(array_data)))
173+
// Safety: checked above that all children array data have same
174+
// length and correct type
175+
unsafe {
176+
Ok(Arc::new(StructArray::new_unchecked_with_length(
177+
fields,
178+
children_array,
179+
nulls,
180+
children_array_len,
181+
)))
182+
}
177183
}
178184

179185
fn skip_records(&mut self, num_records: usize) -> Result<usize> {

0 commit comments

Comments
 (0)