Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This has come up several times, most recently on the arrow mailing list:
Discussing how to expose dictionary data may lead to multiple overlapping
considerations, long discussions and perhaps format and API changes. So we
hope that there could be some loopholes or small change that could
potentially unblock such optimization without going into a large design/API
space. For instance:

- Can we introduce a hint to ParquetReader that will produce a
  DictionaryArray for the given column instead of a concrete array
  (StringViewArray in our case)?
- When doing late materialization, maybe we can extend ArrowPredicate
  so that it first instructs the Parquet reader that it wants the encoded
  dictionaries first and, once they are supplied, returns another predicate
  that will be applied to the encoded data. E.g., "x = some_value" translates
  to "x_encoded = index".
What you are requesting is already supported in parquet-rs. In
particular if you request a UTF8 or Binary DictionaryArray for the
column it will decode the column preserving the dictionary encoding. You
can override the embedded arrow schema, if any, using
ArrowReaderOptions::with_schema [1]. Provided you don't read RecordBatch
across row groups and therefore across dictionaries, which the async
reader doesn't, this should never materialize the dictionary. FWIW the
ViewArray decoders will also preserve the dictionary encoding;
however, the dictionary-encoded nature is less explicit in the resulting
arrays.
As for using integer comparisons to optimise dictionary filtering, you
should be able to construct an ArrowPredicate that computes the filter
for the dictionary values, caches this for future use, e.g. using ptr_eq
to detect when the dictionary changes, and then filters based on
dictionary keys.
The full thread is at
https://lists.apache.org/thread/5kg3q0y4cqzl16q6vrvkxlw0yxmk4241

@tustvold pointed out:
The RowFilter API does exist and can evaluate predicates during evaluation, but it has no examples:
https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/type.ParquetRecordBatchReaderBuilder.html#method.with_row_filter
Describe the solution you'd like
I would like these features to be more clearly documented:

- with_row_filter, with a link to the
  https://arrow.apache.org/blog/2025/12/11/parquet-late-materialization-deep-dive/ blog
- Bonus points would be a second example that shows how to evaluate predicates
  on "encoded data" and uses ptr_eq as described above to reuse the dictionary

Describe alternatives you've considered
Additional context