feat: Add row lineage metadata columns to Iceberg reader #16716

Draft
Joe-Abraham wants to merge 3 commits into facebookincubator:main from Joe-Abraham:row_lineage_c++

@Joe-Abraham (Collaborator) commented Mar 11, 2026

This PR adds read support for Iceberg V3 metadata columns to the Iceberg connector. It exposes the hidden metadata columns _row_id and _last_updated_sequence_number, enabling row lineage and schema evolution handling for Iceberg V3 tables.

Part of

Changes

Iceberg Metadata Columns:

  • Added _row_id and _last_updated_sequence_number as hidden metadata columns via IcebergMetadataColumns.h.
  • Introduced methods to define and instantiate these columns (icebergRowIdColumn and icebergLastUpdatedSequenceNumberColumn).

Reader Enhancements:

  • Updated IcebergSplitReader.cpp to support reading and handling the new metadata columns:
    • _row_id is either read directly from the file or dynamically calculated as first_row_id + _pos.
    • Null values in _last_updated_sequence_number are replaced with the data sequence number.
  • Refactored processing logic to account for these metadata columns during schema adaptation and data file reading.
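
The two fill rules above can be sketched in plain C++. This is a minimal illustration, not the code in IcebergSplitReader.cpp: the helper names computeRowIds and fillNullSequenceNumbers are hypothetical, and std::vector with std::optional stands in for the vector types the reader actually uses.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper (not from the PR): derive _row_id for each row as
// first_row_id + _pos, where _pos is the row's position in the data file.
std::vector<int64_t> computeRowIds(
    const std::vector<int64_t>& positions,
    int64_t firstRowId) {
  std::vector<int64_t> rowIds;
  rowIds.reserve(positions.size());
  for (int64_t pos : positions) {
    assert(pos >= 0); // File positions are never negative.
    rowIds.push_back(firstRowId + pos);
  }
  return rowIds;
}

// Hypothetical helper (not from the PR): replace null
// _last_updated_sequence_number values with the data sequence number.
// Non-null values read from the file are kept unchanged.
std::vector<int64_t> fillNullSequenceNumbers(
    const std::vector<std::optional<int64_t>>& values,
    int64_t dataSequenceNumber) {
  std::vector<int64_t> filled;
  filled.reserve(values.size());
  for (const auto& value : values) {
    filled.push_back(value.value_or(dataSequenceNumber));
  }
  return filled;
}
```

Because positions are passed in explicitly, the computed _row_id stays tied to the row's original file position even when deleted rows leave gaps in the position sequence.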

Test Coverage:

  • Added multiple unit tests to validate:
    • Direct reading and calculation of _row_id.
    • Replacement of null values in _last_updated_sequence_number.
    • Integration with schema evolution and row deletions.

Typo Fixes:

  • Corrected several instances of the typo fileFomat_ to fileFormat_.

netlify bot commented Mar 11, 2026

Deploy Preview for meta-velox canceled.

Latest commit: dd3e703
Latest deploy log: https://app.netlify.com/projects/meta-velox/deploys/69b28f071e922e000704ee92

meta-cla bot added the CLA Signed label Mar 11, 2026

// True if _last_updated_sequence_number is read from the data file (not set
// as a constant). Set in adaptColumns().
bool readLastUpdatedSeqNumFromFile_{false};
Collaborator

Is it equivalent to lastUpdatedSeqNumOutputIndex_.has_value() && dataSequenceNumber_.has_value()?

Collaborator Author

Yes, it is equivalent. I have updated it.

std::optional<int64_t> firstRowId_;

// True if _row_id should be computed as first_row_id + _pos in next().
bool computeRowId_{false};
Collaborator

Is it equivalent to rowIdOutputIndex_.has_value() && firstRowId_.has_value()?
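
The suggestion in both review comments could look roughly like the sketch below, where the stored bool flags become accessors derived from the optionals. The struct RowLineageState and the accessor names are hypothetical, the index types are assumed, and only the member names come from the thread. The author confirmed the equivalence for the sequence number flag; for computeRowId_ it is still an open question here, so that accessor is valid only if the equivalence holds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the reader's row lineage state. Member names
// follow the review thread; the struct and the index types are assumptions.
struct RowLineageState {
  std::optional<std::size_t> rowIdOutputIndex_;
  std::optional<int64_t> firstRowId_;
  std::optional<std::size_t> lastUpdatedSeqNumOutputIndex_;
  std::optional<int64_t> dataSequenceNumber_;

  // Replaces a stored computeRowId_ flag, assuming the reviewer's
  // proposed equivalence holds.
  bool computeRowId() const {
    return rowIdOutputIndex_.has_value() && firstRowId_.has_value();
  }

  // Replaces a stored readLastUpdatedSeqNumFromFile_ flag, using the
  // expression the author confirmed as equivalent.
  bool readLastUpdatedSeqNumFromFile() const {
    return lastUpdatedSeqNumOutputIndex_.has_value() &&
        dataSequenceNumber_.has_value();
  }
};
```

Deriving the value on demand removes the risk of a flag drifting out of sync with the optionals it summarizes.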

@Joe-Abraham Joe-Abraham requested a review from PingLiuPing March 12, 2026 10:03

2 participants