Expose public async bloom filter reader (metadata + AsyncFileReader) #9067

@ethe

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I'm integrating Parquet bloom filters into an async pruning pipeline and found a gap in the public API.

Currently:

  • There is a sync API:
    Sbbf::read_from_column_chunk(column_meta, reader)
  • There is an async method, but only on the async Arrow builder:
    ParquetRecordBatchStreamBuilder::get_row_group_column_bloom_filter(...)
  • The helper used internally to parse bloom filter headers is pub(crate):
    chunk_read_bloom_filter_header_and_offset (in parquet::bloom_filter)

If a downstream application only has ParquetMetaData and an AsyncFileReader, it can't read bloom filters without re‑implementing Parquet bloom filter header parsing.

This blocks async metadata‑only pruning libraries (like mine) from using bloom filters safely and efficiently.

Describe the solution you'd like

Expose a public async bloom reader that mirrors the sync API:

pub async fn read_bloom_filter_async<R: AsyncFileReader>(
    column_meta: &ColumnChunkMetaData,
    reader: &mut R
) -> Result<Option<Sbbf>>;

This would:

  • keep internal header parsing private
  • allow async pruning without coupling to Arrow builder
  • avoid duplicate parsing logic in downstream crates
  • be backwards compatible (pure API addition)
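
For illustration, here is roughly how a downstream pruning pass could consume such an API. This is only a self-contained sketch: `ToyFilter`, `ToySource`, `read_filter`, `row_groups_to_scan` and `block_on` are hypothetical stand-ins that I made up for this example, not parquet crate items, and the exact set used in `ToyFilter` only imitates the "false means definitely absent" contract of a real bloom filter.

```rust
use std::collections::HashSet;
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for `Sbbf`. It stores the values exactly, so it
/// only imitates the contract that `check == false` means "definitely absent".
struct ToyFilter {
    values: HashSet<i64>,
}

impl ToyFilter {
    fn check(&self, value: &i64) -> bool {
        self.values.contains(value)
    }
}

/// Hypothetical stand-in for "column chunk metadata + async reader":
/// one optional filter per row group.
struct ToySource {
    filters: Vec<Option<Vec<i64>>>,
}

impl ToySource {
    /// Plays the role of the proposed async reader and returns the same
    /// shape: an error, `Ok(None)` (no filter stored), or `Ok(Some(filter))`.
    async fn read_filter(&mut self, row_group: usize) -> Result<Option<ToyFilter>, String> {
        match self.filters.get(row_group) {
            None => Err(format!("row group {row_group} out of range")),
            Some(None) => Ok(None),
            Some(Some(vals)) => Ok(Some(ToyFilter {
                values: vals.iter().copied().collect(),
            })),
        }
    }
}

/// Returns the row groups that still have to be scanned for `probe`.
async fn row_groups_to_scan(
    source: &mut ToySource,
    num_row_groups: usize,
    probe: i64,
) -> Result<Vec<usize>, String> {
    let mut keep = Vec::new();
    for rg in 0..num_row_groups {
        match source.read_filter(rg).await? {
            // The filter says "definitely absent": the row group is pruned.
            Some(filter) if !filter.check(&probe) => {}
            // No filter, or "maybe present": the row group must be scanned.
            _ => keep.push(rg),
        }
    }
    Ok(keep)
}

/// Minimal busy-polling executor so the sketch needs no async runtime.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

The point of the sketch is the three-way handling of `Result<Option<_>>`: an error propagates, `None` means the row group cannot be pruned, and only a negative `check` allows skipping it. None of this needs the Arrow builder or any knowledge of the header layout.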

Alternative
Make chunk_read_bloom_filter_header_and_offset public, but this is a low‑level parsing helper and would bake in more implementation detail.

Labels: enhancement
