Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently, when using statistics, a lot of time can be spent decoding and summarizing them from the ValueStatistics / Statistics structs (which are large, inefficient structs).
In DataFusion this can sometimes take as much time as running the query itself (or more, if the query can be answered directly from statistics).
Describe the solution you'd like
We should consider decoding the statistics directly into a columnar format (values plus a null bitmap, if needed), which avoids converting them later and may also help decoding speed and memory usage somewhat.
Looking at ColumnIndex:
pub struct ColumnIndex {
    pub(crate) null_pages: Vec<bool>,
    pub(crate) boundary_order: BoundaryOrder,
    pub(crate) null_counts: Option<Vec<i64>>,
    pub(crate) repetition_level_histograms: Option<Vec<i64>>,
    pub(crate) definition_level_histograms: Option<Vec<i64>>,
}
null_pages: this is currently a Vec<bool> (true means null, false means non-null). It would be better to store it as a NullBuffer or similar, where true means valid and false means invalid, so that the null bitmap could be copied without conversion.
null_counts: Option<Vec<i64>>: it would be better to store this as an Int64Array or similar (or preferably even a UInt64Array, if we can do the conversion earlier).
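To make the two conversions concrete, here is a rough, std-only sketch of what decoding into a columnar layout could look like. The helper names null_pages_to_validity and null_counts_to_u64 are hypothetical and not part of the parquet crate. A real implementation would presumably build a NullBuffer and a UInt64Array directly while decoding, rather than going through Vec<u8> / Vec<u64>.

```rust
use std::convert::TryFrom;

/// Hypothetical helper (not a parquet API): pack the per-page "is null" flags
/// into an Arrow-style validity bitmap, where bits are stored LSB-first and a
/// set bit means the page is valid (i.e. NOT a null page).
pub fn null_pages_to_validity(null_pages: &[bool]) -> Vec<u8> {
    let mut bitmap = vec![0u8; (null_pages.len() + 7) / 8];
    for (i, &is_null) in null_pages.iter().enumerate() {
        if !is_null {
            // Invert the meaning: `false` (non-null) becomes a set validity bit.
            bitmap[i / 8] |= 1u8 << (i % 8);
        }
    }
    bitmap
}

/// Hypothetical helper (not a parquet API): convert the signed null counts to
/// unsigned values. Returns None if the counts are absent or if any count is
/// negative (assumed here to indicate invalid metadata).
pub fn null_counts_to_u64(null_counts: Option<&[i64]>) -> Option<Vec<u64>> {
    null_counts?
        .iter()
        .map(|&count| u64::try_from(count).ok())
        .collect()
}
```

For example, null_pages = [false, true, false] would yield the single bitmap byte 0b0000_0101 (pages 0 and 2 valid), which could then be handed to a columnar consumer without a per-query conversion.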
Describe alternatives you've considered
Additional context