Add Iceberg V3 row lineage read support for _row_id and _last_updated_sequence_number #31
Changes from all commits
1a1bcd8
c0b4a99
8b53584
e110255
bda69ff
8745950
14b5729
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -89,6 +89,17 @@ class IcebergSplitReader : public SplitReader { | |
| /// Column was added to the table schema after this data file was | ||
| /// written. Set as NULL constant since the old file doesn't contain | ||
| /// this column. | ||
| /// c) Row lineage (_last_updated_sequence_number): | ||
| /// For Iceberg V3 row lineage, if the column is not in the file, | ||
| /// inherit the data sequence number from the file's manifest entry | ||
| /// (provided via $data_sequence_number info column). Per the spec, | ||
| /// null values indicate the value should be inherited. | ||
| /// d) Row lineage (_row_id): | ||
| /// Per the spec, null _row_id values are assigned as | ||
| /// first_row_id + _pos. When first_row_id is available from | ||
| /// the split info column $first_row_id, the value is computed | ||
| /// in next(). When first_row_id is not available (e.g., | ||
| /// pre-V3 tables), NULL is returned. | ||
| std::vector<TypePtr> adaptColumns( | ||
| const RowTypePtr& fileType, | ||
| const RowTypePtr& tableSchema) const override; | ||
|
|
@@ -101,5 +112,23 @@ class IcebergSplitReader : public SplitReader { | |
| std::list<std::unique_ptr<PositionalDeleteFileReader>> | ||
| positionalDeleteFileReaders_; | ||
| BufferPtr deleteBitmap_; | ||
|
|
||
| // The child index of _last_updated_sequence_number in readerOutputType_. | ||
| // When set (along with dataSequenceNumber_), null values in this column | ||
| // are replaced with dataSequenceNumber_ during reads per the Iceberg spec. | ||
| std::optional<column_index_t> lastUpdatedSeqNumOutputIndex_; | ||
|
|
||
| // Data sequence number from the file's manifest entry, used to replace null | ||
| // values in _last_updated_sequence_number during reads. | ||
| std::optional<int64_t> dataSequenceNumber_; | ||
|
|
||
| // First row ID from the manifest entry, used to compute _row_id. | ||
| // When available (>= 0), _row_id = first_row_id + _pos for rows not in file. | ||
| std::optional<int64_t> firstRowId_; | ||
|
|
||
| // The child index of _row_id in readerOutputType_. | ||
| // When set (along with firstRowId_), _row_id is computed as | ||
| // first_row_id + _pos for null values during reads per the Iceberg spec. | ||
| std::optional<column_index_t> rowIdOutputIndex_; | ||
|
Owner
@copilot, is it equivalent to `rowIdOutputIndex_.has_value() && firstRowId_.has_value()`?
Author
Yes,
||
| }; | ||
| } // namespace facebook::velox::connector::hive::iceberg | ||
@copilot Is it equivalent to `lastUpdatedSeqNumOutputIndex_.has_value() && dataSequenceNumber_.has_value()`?
Yes, `readLastUpdatedSeqNumFromFile_` was equivalent to `lastUpdatedSeqNumOutputIndex_.has_value() && dataSequenceNumber_.has_value()`. Removed the redundant boolean in 14b5729; the code now uses the optional checks directly.
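
For readers following the review, here is a minimal, standalone sketch of the two null-replacement rules described in the `adaptColumns` doc comment, gated on the optional checks discussed above. It is not the code in this PR: `NullableColumn`, `inheritSequenceNumber`, `assignRowIds`, and `batchStartPos` are hypothetical names used for illustration only, and the sketch assumes that row `i` of a batch sits at file position `batchStartPos + i` (no filtered or deleted rows in between).

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for a nullable BIGINT column; not a Velox type.
using NullableColumn = std::vector<std::optional<int64_t>>;

// Rule (c): null _last_updated_sequence_number values inherit the data
// sequence number from the file's manifest entry. If the sequence number is
// not available, the column is left untouched.
void inheritSequenceNumber(
    NullableColumn& column,
    const std::optional<int64_t>& dataSequenceNumber) {
  if (!dataSequenceNumber.has_value()) {
    return;
  }
  for (auto& value : column) {
    if (!value.has_value()) {
      value = *dataSequenceNumber;
    }
  }
}

// Rule (d): null _row_id values become first_row_id + _pos. If first_row_id
// is not available (e.g., pre-V3 tables), nulls stay null.
// Assumption: row i of this batch is at file position batchStartPos + i.
void assignRowIds(
    NullableColumn& column,
    const std::optional<int64_t>& firstRowId,
    int64_t batchStartPos) {
  if (!firstRowId.has_value()) {
    return;
  }
  for (std::size_t i = 0; i < column.size(); ++i) {
    if (!column[i].has_value()) {
      column[i] = *firstRowId + batchStartPos + static_cast<int64_t>(i);
    }
  }
}
```

In both helpers the early return on the empty optional plays the role of the removed boolean: the replacement only happens when the manifest-provided value is present, and values already stored in the file are never overwritten.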