Optimized decoding of Parquet Statistics, null_pages and null_counts #9296

@Dandandan

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently, when statistics are used, a lot of time can be spent decoding and summarizing them from the ValueStatistics / Statistics structs, which are large and inefficient.
In DataFusion this can sometimes take as much time as running the query itself (or more, if the query can be answered from statistics directly).

Describe the solution you'd like
We should consider decoding the statistics directly into a columnar format (values plus a null bitmap, if needed). This would avoid converting them later, and might also speed up decoding slightly and reduce memory usage.
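
To make the idea concrete, here is a minimal std-only sketch of what "values + null bitmap" could look like for one statistic (say, the per-page minimum of an i64 column). The `ColumnarStats` type and `decode_columnar` function are hypothetical names for illustration; they are not part of the parquet or arrow crates, and a real implementation would presumably build Arrow arrays instead.

```rust
/// Columnar form of one per-page statistic: a dense value buffer plus a
/// packed validity bitmap (bit set = statistic present for that page).
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
pub struct ColumnarStats {
    pub values: Vec<i64>,
    pub validity: Vec<u8>,
    pub len: usize,
}

/// Builds the columnar form in a single pass over the decoded statistics.
/// A missing statistic gets a placeholder 0 in `values` and a cleared bit
/// in `validity`, so no per-page struct or Option is kept around.
pub fn decode_columnar<I>(stats: I) -> ColumnarStats
where
    I: IntoIterator<Item = Option<i64>>,
{
    let iter = stats.into_iter();
    let (lower, _) = iter.size_hint();
    let mut values = Vec::with_capacity(lower);
    let mut validity = Vec::with_capacity((lower + 7) / 8);
    let mut len = 0usize;
    for stat in iter {
        // Start a new bitmap byte every 8 entries.
        if len % 8 == 0 {
            validity.push(0u8);
        }
        match stat {
            Some(v) => {
                values.push(v);
                // Least-significant-bit-first ordering within each byte.
                validity[len / 8] |= 1u8 << (len % 8);
            }
            None => values.push(0),
        }
        len += 1;
    }
    ColumnarStats { values, validity, len }
}
```

With this layout, a consumer can hand the two buffers to a columnar engine as-is instead of walking a list of per-page structs.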

Looking at ColumnIndex:

pub struct ColumnIndex {
    pub(crate) null_pages: Vec<bool>,
    pub(crate) boundary_order: BoundaryOrder,
    pub(crate) null_counts: Option<Vec<i64>>,
    pub(crate) repetition_level_histograms: Option<Vec<i64>>,
    pub(crate) definition_level_histograms: Option<Vec<i64>>,
}
  • null_pages: this is currently a Vec<bool> (true means the page contains only nulls, false means it has non-null values). It would be better to store this as a NullBuffer or similar, where true means valid and false means invalid. That would make it possible to copy the null bitmap without conversion.
  • null_counts: Option<Vec<i64>>: it would be better to have this as an Int64Array or similar (or preferably even a UInt64Array, if we can do the conversion earlier).
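
As a rough illustration of the two conversions above, the following std-only sketch inverts and packs null_pages into a validity bitmap (bit set = valid, least-significant bit first, which is the layout Arrow uses) and converts null_counts to unsigned values once, at decode time. The function names `null_pages_to_validity` and `null_counts_to_u64` are hypothetical, and treating a negative count as "no usable counts" is an assumption rather than existing behavior.

```rust
/// Packs `null_pages` into a validity bitmap. The sense is inverted:
/// a set bit means the page has at least one non-null value.
pub fn null_pages_to_validity(null_pages: &[bool]) -> Vec<u8> {
    let mut bits = vec![0u8; (null_pages.len() + 7) / 8];
    for (i, &all_null) in null_pages.iter().enumerate() {
        if !all_null {
            bits[i / 8] |= 1u8 << (i % 8);
        }
    }
    bits
}

/// Converts signed null counts to unsigned in one step. Returns None if
/// the counts are absent or (by assumption) if any count is negative.
pub fn null_counts_to_u64(null_counts: Option<&[i64]>) -> Option<Vec<u64>> {
    null_counts?
        .iter()
        .map(|&c| if c < 0 { None } else { Some(c as u64) })
        .collect()
}
```

Doing both conversions while decoding means downstream code can reuse the buffers directly rather than re-walking a Vec<bool> and a Vec<i64> for every query.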

Describe alternatives you've considered

Additional context

Labels: enhancement, performance
