feat: support nested projection#7959
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the projection mechanism by introducing ProjectionInput and ReadColumns to support nested field access in complex columns. It updates ScanRequest and various storage engines to utilize these new structures, enabling more granular data retrieval. Feedback focuses on improving the robustness and performance of a temporary schema-casting fix in flat_projection.rs, specifically by replacing a risky unwrap() with proper error handling and reducing excessive logging in hot paths.
ce9d3f9 to
12212db
Compare
12212db to
8a2c2dd
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces the ReadColumns and ReadColumn structures to support reading specific nested fields from storage, replacing the previous reliance on simple column ID vectors. It updates the scan and SST read paths to propagate these nested paths, allowing for more efficient data retrieval. Additionally, it includes logic in the FlatProjectionMapper to handle struct widening via casting when the storage returns a narrower schema than the output expects. Feedback indicates that the struct widening logic in flat_projection.rs is incorrectly placed in the block for internal columns rather than user-defined columns and notes potential type mismatches and limitations in the current casting implementation.
There was a problem hiding this comment.
Pull request overview
This PR introduces a new “logical read columns” abstraction to support nested (struct-field) projection pushdown through the mito2 read pipeline, allowing Parquet leaf-level selection while preserving existing scan/projection behavior.
Changes:
- Add
ReadColumns(logical root columns + nested paths) and derive read columns from both user projection input and predicates. - Plumb
ReadColumnsthrough scanning, projection mapping, Parquet read format/projection-mask construction, and cache keying. - Extend Parquet projection handling to support nested paths and adjust flat projection conversion to handle narrowed struct columns.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/mito2/src/sst/parquet/reader.rs | Switch Parquet reader builder projection input to ReadColumns and build projection mask from ReadFormat’s parquet read columns. |
| src/mito2/src/sst/parquet/read_columns.rs | Add constructors/merge helpers for nested parquet read columns; add unit tests for merge behavior. |
| src/mito2/src/sst/parquet/prefilter.rs | Update tests to build ReadFormat using ReadColumns. |
| src/mito2/src/sst/parquet/format.rs | Refactor projection computation to produce ParquetReadColumns (including nested paths) and expose iterators/accessors accordingly. |
| src/mito2/src/sst/parquet/flat_format.rs | Plumb ReadColumns through flat read format construction and projection access. |
| src/mito2/src/sst/parquet/file_range.rs | Update last-row reader wiring to use parquet_read_columns() instead of raw indices; adjust tests. |
| src/mito2/src/read/scan_util.rs | Update test helpers for the new ProjectionMapper::new signature. |
| src/mito2/src/read/scan_region.rs | Build ReadColumns from projection + predicate and carry it through ScanInput/SST reads; remove legacy “build read column ids” logic and its tests. |
| src/mito2/src/read/read_columns.rs | New module implementing ReadColumns, merging/normalizing nested paths, and helpers to derive read columns from projection/predicate (with tests). |
| src/mito2/src/read/range_cache.rs | Change scan fingerprinting to store ReadColumns rather than Vec<ColumnId> and adjust sizing/type tracking. |
| src/mito2/src/read/projection.rs | Update mappers to carry ReadColumns and accept IntoIterator projections; adapt tests accordingly. |
| src/mito2/src/read/last_row.rs | Selector result caching now keys schema compatibility off ParquetReadColumns instead of raw projection indices. |
| src/mito2/src/read/flat_projection.rs | Store ReadColumns in the mapper; add struct-casting path to reconcile narrowed struct reads with unpruned output schema expectations. |
| src/mito2/src/read/compat.rs | Update compat logic to use mapper read columns API. |
| src/mito2/src/read/batch_adapter.rs | Adjust tests for ProjectionMapper::new signature changes. |
| src/mito2/src/read.rs | Export the new read_columns module. |
| src/mito2/src/memtable/bulk/part_reader.rs | Update projection mask construction and record batch projection to use projection index iterator. |
| src/mito2/src/memtable/bulk/context.rs | Adapt bulk iter context projection plumbing to ReadColumns/new ReadFormat::new signature. |
| src/mito2/src/cache.rs | Store selector cache schema as ParquetReadColumns; update constructors/tests. |
| src/common/recordbatch/src/error.rs | Add CastColumn error variant used by the new struct casting path. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d07ab9efa1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if !column_id_to_projected_index.contains_key(&col_id) { | ||
| let projected_index = column_id_to_projected_index.len(); | ||
| column_id_to_projected_index.insert(col_id, projected_index); |
There was a problem hiding this comment.
Guard projected index map against empty nested leaf reads
When a nested path is requested but an SST schema lacks that nested leaf (for example, querying a newly added struct field against older files), this code still allocates a column_id_to_projected_index entry before parquet leaf projection is resolved. In that case the parquet reader can return zero leaves for the root, but downstream code still indexes into the projected batch as if the column exists, which misaligns columns and surfaces cast/type errors instead of compat behavior (null/default for missing nested fields). Please only assign projected indices for roots that actually survive leaf selection, or fallback to whole-root projection when no nested leaf matches.
Useful? React with 👍 / 👎.
672fee6 to
2416b00
Compare
I hereby agree to the terms of the GreptimeDB CLA.
Refer to a related PR or issue link (optional)
What's changed and what's your intention?
PR Checklist
Please convert it to a draft if some of the following conditions are not met.