Added capability to fetch dictionary values #9011
DarkWanderer wants to merge 3 commits into apache:main from …
Conversation
Unfortunately dictionary encoding is best effort, and writers will fall back to different encodings if the dictionary gets too large. The result is you need to know if all the pages are dictionary encoded in order to be able to make this optimisation - iirc this information is not encoded anywhere but the page header itself...

Putting this aside, there are likely some challenges around typing with this approach. IMO bloom filters are the recommended way to handle this sort of thing; dictionaries are more of an encoding optimisation.

Edit: I could see a world where users could opt in to have ArrowFilter passed the dictionary in a pre-pass, and for this to then allow the reader to skip decoding dictionary-encoded pages if there are no matches, but unless I'm remembering incorrectly this wouldn't allow skipping the IO...
That is exactly what I am hoping for - to perform a multiple-range fetch of a few MB from
This is why the page encoding stats exist in the column metadata. This will tell you if all pages in a given chunk are dictionary encoded. (See `parquet/src/file/metadata/mod.rs` lines 1072 to 1083 at 116ae12.)

That said, I'm not nuts about duplicating the logic to decode the dictionary page, and just for the async reader. If we're going down this path, then I think all readers should have access. Perhaps this could be in
Thanks for confirming, that was my intuition as well
I would note that there is already a disparity in the available APIs. Specifically, I hear your argument about a consistent API and code duplication; however, I will think of a better place to insert it.
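To make the encoding-stats check discussed above concrete, here is a minimal standalone sketch. The types below (`PageType`, `Encoding`, `PageEncodingStats`) are simplified stand-ins for the real definitions in the `parquet` crate, and the function name is illustrative, not an actual API:

```rust
// Simplified stand-ins for the parquet crate's page metadata types;
// the real ones live in parquet::basic and parquet::file::metadata.
#[derive(PartialEq)]
enum PageType {
    DataPage,
    DictionaryPage,
}

#[derive(PartialEq)]
enum Encoding {
    Plain,
    RleDictionary,
}

struct PageEncodingStats {
    page_type: PageType,
    encoding: Encoding,
    count: i32,
}

/// Returns true only if every *data* page in a column chunk is
/// dictionary encoded, i.e. the writer never fell back to PLAIN.
/// Returns false when stats are absent, since we cannot prove it.
fn all_data_pages_dictionary_encoded(stats: Option<&[PageEncodingStats]>) -> bool {
    match stats {
        None => false,
        Some(s) => s
            .iter()
            .filter(|p| p.page_type == PageType::DataPage)
            .all(|p| p.encoding == Encoding::RleDictionary),
    }
}

fn main() {
    let stats = vec![
        PageEncodingStats {
            page_type: PageType::DictionaryPage,
            encoding: Encoding::Plain,
            count: 1,
        },
        PageEncodingStats {
            page_type: PageType::DataPage,
            encoding: Encoding::RleDictionary,
            count: 10,
        },
    ];
    // All data pages are dictionary encoded, so the optimisation is safe.
    assert!(all_data_pages_dictionary_encoded(Some(&stats)));
    // Without stats we must conservatively assume a fallback happened.
    assert!(!all_data_pages_dictionary_encoded(None));
}
```

The key point of the comment above is the `None => false` branch: if the writer did not record encoding stats, the reader cannot safely treat the dictionary as a complete index of the chunk's values.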
Which issue does this PR close?
Rationale for this change
Some databases, one example being Grafana Tempo, use column dictionaries as makeshift column indexes to improve ad-hoc filtering speed. Checking whether a low-cardinality value is present in the dictionary makes it possible to pre-filter data effectively by skipping whole row groups. This PR adds this capability.
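The row-group pre-filtering idea can be sketched independently of this PR's API. In the sketch below, each row group's decoded dictionary is modelled as a plain `HashSet<String>` (a stand-in for decoded dictionary pages; all names are illustrative), and a row group without a dictionary must be kept because nothing rules it out:

```rust
use std::collections::HashSet;

/// Given the decoded dictionary of each row group, return the indices
/// of row groups that could contain `needle`. Row groups whose
/// dictionary lacks the value are skipped entirely, avoiding their IO.
/// A row group with no dictionary (None) must be kept, since we cannot
/// rule it out without reading its pages.
fn row_groups_to_read(dictionaries: &[Option<HashSet<String>>], needle: &str) -> Vec<usize> {
    dictionaries
        .iter()
        .enumerate()
        .filter(|(_, dict)| match dict {
            Some(d) => d.contains(needle),
            None => true, // not (fully) dictionary encoded: cannot skip
        })
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let dicts = vec![
        Some(HashSet::from(["GET".to_string(), "POST".to_string()])),
        Some(HashSet::from(["PUT".to_string()])),
        None, // e.g. the writer fell back to plain encoding
    ];
    // Only row groups 0 and 2 can possibly contain "GET";
    // row group 1 is skipped without touching its data pages.
    assert_eq!(row_groups_to_read(&dicts, "GET"), vec![0, 2]);
}
```

This is only sound when the dictionary is known to be complete for the chunk, which is why the page encoding stats discussed in the review matter.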
What changes are included in this PR?
Add public `get_row_group_column_dictionary` function to `ParquetRecordBatchStreamBuilder`

Are these changes tested?
Tests have been added
Are there any user-facing changes?
Public API extension for `ParquetRecordBatchStreamBuilder`