Add Iceberg V3 row lineage read support for _row_id and _last_updated_sequence_number #31
Conversation
@copilot when looking into https://github.com/apache/iceberg/blob/main/format/spec.md#row-lineage, it seems `_row_id` and `_last_updated_sequence_number` are stored as hidden columns in the original file itself. Are we doing that?
Yes, the implementation reads `_row_id` and `_last_updated_sequence_number` from the file when they are present. The extra logic only applies to two cases per the spec:
@copilot check if the rules of row lineage are followed as per https://iceberg.apache.org/spec/#row-lineage and the intent is to read the data from the file.
I reviewed the spec and found an issue: the spec uses null-based inheritance, not 0-based. Fixed in commit 9415758. Key spec rules and how they're now handled:
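The spec's null-based inheritance can be sketched standalone. This is a minimal model in plain C++ (no Velox types; function and parameter names are hypothetical), showing that a null stored value inherits, while any non-null stored value, including 0, is kept:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Null-based inheritance per the spec (sketch): a null _row_id inherits
// first_row_id + position; an existing value (including 0) is kept as-is.
std::vector<int64_t> inheritRowIds(
    const std::vector<std::optional<int64_t>>& fileRowIds,
    int64_t firstRowId) {
  std::vector<int64_t> out;
  out.reserve(fileRowIds.size());
  for (size_t pos = 0; pos < fileRowIds.size(); ++pos) {
    out.push_back(
        fileRowIds[pos].value_or(firstRowId + static_cast<int64_t>(pos)));
  }
  return out;
}

// Same rule for _last_updated_sequence_number: a null inherits the manifest
// entry's data sequence number; a stored value is never overwritten.
int64_t inheritSeqNum(std::optional<int64_t> fileValue, int64_t dataSeqNum) {
  return fileValue.value_or(dataSeqNum);
}
```

With 0-based inheritance, a legitimately stored 0 would be clobbered; `value_or` on the optional only fills genuine nulls, which is the behavior the fix switched to.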
@copilot Query 20260306_060805_00002_kgskq, FAILED, 1 node. Query 20260306_060805_00002_kgskq failed: Field not found: _row_id. Available fields are: id, name, email, ts, region. Split Hive: /Users/joe/Developer/utils/singlenode/warehouse_iceberg_v3_file/joe_test_iceberg_v3/users_iceberg/data/region=eu/00000-15-80f841be-597b-48a4-aa7d-756108e856aa-0-00002.parquet 4 - 1365 Task 20260306_060805_00002_kgskq.2.0.0.0
The error is fixed in 9ff6b86 — the code now tries to read the column from the file and falls back to computing it when the column is absent.
@copilot In Spark I am getting one result, and in Presto with this Velox change I am getting a different one. Are we following the right spec and calculating the values according to the Iceberg spec https://iceberg.apache.org/spec/#row-lineage?
The root cause was that the Parquet reader filters out columns that are not in the schema it is given. Fixed in c9d3ab0 by expanding the file schema so the reader knows about the hidden lineage columns.
Also updated all tests to properly separate the data columns from the reader output type.
presto:joe_test_iceberg_v3> SELECT id, name, _row_id, _last_updated_sequence_number FROM users_iceberg ORDER BY id; Spark produces different values for these columns. The data created by Spark is correct as per https://iceberg.apache.org/spec/?h=row#row-lineage-assignment. Check the code, if the reading of these columns is implemented as per the spec.
The root cause was that the file schema expansion code (which tells the Parquet reader to read `_row_id` and `_last_updated_sequence_number`) was not being applied. Without it, the Parquet reader didn't know these hidden columns existed in the file, so:
Fixed in 0e36812:
The read logic now correctly reads values from the file when present, and only computes them when they are missing.
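The read-or-compute decision above can be modeled with a standalone sketch. This uses plain string lists in place of the real `RowTypePtr` schemas, and all names here are hypothetical:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sketch: partition the requested lineage columns into those physically
// present in the file (the reader must be told to read them) and those
// absent (to be synthesized later, e.g. from first_row_id + position).
struct ReadPlan {
  std::vector<std::string> columnsToRead;    // handed to the file reader
  std::vector<std::string> columnsToCompute; // synthesized during next()
};

ReadPlan planLineageColumns(
    const std::vector<std::string>& fileColumns,
    const std::vector<std::string>& requestedLineageColumns) {
  ReadPlan plan;
  for (const auto& col : requestedLineageColumns) {
    const bool inFile =
        std::find(fileColumns.begin(), fileColumns.end(), col) !=
        fileColumns.end();
    (inFile ? plan.columnsToRead : plan.columnsToCompute).push_back(col);
  }
  return plan;
}
```

Skipping the "in file" branch is exactly the earlier bug: every lineage column would be computed, silently ignoring values the writer already stored.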
With the latest change the _last_updated_sequence_number is null in a few places.
The issue was that the Parquet reader wraps columns in a `LazyVector`, so the null backfill ran against the unloaded wrapper. This explains the pattern you saw:
Fixed in 1907e0a by calling `loadedVectorShared()` to replace the lazy wrapper in the `RowVector`'s child before backfilling.
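A toy model of the lazy-wrapper pitfall, with `std::function` standing in for `LazyVector`'s loader (no Velox types; names hypothetical):

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Toy model: a lazily loaded column must be materialized (analogous to
// loadedVectorShared()) before the null backfill inspects its values;
// probing the unloaded wrapper would see no data at all.
struct LazyColumn {
  std::function<std::vector<std::optional<int64_t>>()> loader;
  std::optional<std::vector<std::optional<int64_t>>> loaded;

  const std::vector<std::optional<int64_t>>& load() {
    if (!loaded) {
      loaded = loader(); // materialize once, then reuse
    }
    return *loaded;
  }
};

std::vector<int64_t> backfillNulls(LazyColumn& col, int64_t fillValue) {
  const auto& values = col.load(); // swap the wrapper for loaded data first
  std::vector<int64_t> out;
  out.reserve(values.size());
  for (const auto& v : values) {
    out.push_back(v.value_or(fillValue));
  }
  return out;
}
```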
…_` member for `_last_updated_sequence_number`.
@copilot,
…gnostic

Two key fixes:
1. Move `dataSequenceNumber_` initialization before `getAdaptedRowType()` so `adaptColumns()` can use it when `_last_updated_sequence_number` is missing from the data file (an ordering bug caused it to always be `std::nullopt` during `adaptColumns()`).
2. Replace encoding-specific null replacement (`asFlatVector` only) with an encoding-agnostic approach using `SimpleVector::valueAt()` and creating a new `FlatVector`. This handles `FlatVector`, `DictionaryVector`, and any other `SimpleVector` subclass returned by the Parquet reader after lazy loading.

Also use `seqNumChild = loadedVectorShared(seqNumChild)` to replace the `LazyVector` wrapper in the `RowVector`'s child directly.

Co-authored-by: Joe-Abraham <53977252+Joe-Abraham@users.noreply.github.com>
Reviewed the code per your checklist. Found and fixed two issues in commit bda69ff: Fix 1 — the `dataSequenceNumber_` ordering bug; Fix 2 — encoding-agnostic null replacement in `next()`. Also removed an unused variable. Test coverage review — 6 test cases covering:
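The idea behind Fix 2, one read path over multiple encodings, in a standalone sketch. Flat and dictionary layouts are modeled with plain structs; in the real code, Velox's `SimpleVector::valueAt()` (or `DecodedVector`) plays the role of the shared `valueAt` here:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Two physical layouts for the same logical column of optional int64s.
struct FlatColumn {
  std::vector<std::optional<int64_t>> values;
};
struct DictColumn {
  std::vector<int32_t> indices; // row -> dictionary slot
  std::vector<std::optional<int64_t>> dictionary;
};

std::optional<int64_t> valueAt(const FlatColumn& c, size_t row) {
  return c.values[row];
}
std::optional<int64_t> valueAt(const DictColumn& c, size_t row) {
  return c.dictionary[c.indices[row]];
}

// Encoding-agnostic null replacement: the same loop works for any layout
// exposing valueAt(), instead of assuming a flat layout up front.
template <typename Column>
std::vector<int64_t> replaceNulls(
    const Column& col, size_t size, int64_t fill) {
  std::vector<int64_t> out;
  out.reserve(size);
  for (size_t row = 0; row < size; ++row) {
    out.push_back(valueAt(col, row).value_or(fill));
  }
  return out;
}
```

A version hard-coded to `FlatColumn` would have to reject (or crash on) dictionary-encoded input, which is the encoding-specific bug the fix removed.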
No unnecessary changes found — all additions directly support reading row lineage per the spec.
/claude-review

@sourcery-ai review
Reviewer's Guide

Adds Iceberg V3 row lineage read support in the Hive Iceberg split reader by expanding file schemas with row lineage metadata columns, computing _row_id from first_row_id + position when missing, inheriting and backfilling null _last_updated_sequence_number values from the manifest's data sequence number, and updating tests and metadata column definitions accordingly, along with a minor GitHub Actions allowlist change.

Updated class diagram for IcebergSplitReader and IcebergMetadataColumn row lineage support:

    classDiagram
        class IcebergSplitReader {
            +void prepareSplit(std::shared_ptr~ConnectorTableHandle~ tableHandle, std::shared_ptr~ConnectorSplit~ split, std::shared_ptr~const RowType~ readerOutputType, std::shared_ptr~common::MetadataFilter~ metadataFilter, RuntimeStatistics& runtimeStats, F14FastMap~string, string~ fileReadOps)
            +uint64_t next(uint64_t size, VectorPtr& output)
            +std::vector~TypePtr~ adaptColumns(RowTypePtr fileType, RowTypePtr tableSchema) const
            -std::list~PositionalDeleteFileReaderPtr~ positionalDeleteFileReaders_
            -BufferPtr deleteBitmap_
            -bool readLastUpdatedSeqNumFromFile_
            -std::optional~column_index_t~ lastUpdatedSeqNumOutputIndex_
            -std::optional~int64_t~ dataSequenceNumber_
            -std::optional~int64_t~ firstRowId_
            -bool computeRowId_
            -std::optional~column_index_t~ rowIdOutputIndex_
        }
        class IcebergMetadataColumn {
            +static constexpr const char* kRowIdColumnName
            +static constexpr const char* kLastUpdatedSequenceNumberColumnName
            +static std::shared_ptr~IcebergMetadataColumn~ icebergRowIdColumn()
            +static std::shared_ptr~IcebergMetadataColumn~ icebergLastUpdatedSequenceNumberColumn()
            -int id
            -std::string name
            -TypePtr type
            -std::string comment
        }
        class BaseRowReader {
            +uint64_t next(uint64_t size, VectorPtr& output, Mutation* mutation)
        }
        class RowVector {
            +VectorPtr childAt(column_index_t index)
            +void setChildAt(column_index_t index, VectorPtr child)
        }
        class ConstantVector~int64_t~ {
            +ConstantVector(MemoryPool* pool, vector_size_t size, bool isNull, TypePtr type, int64_t value)
        }
        class FlatVector~int64_t~ {
            +void set(vector_size_t index, int64_t value)
        }
        class SimpleVector~int64_t~ {
            +bool isNullAt(vector_size_t index) const
            +int64_t valueAt(vector_size_t index) const
        }
        IcebergSplitReader --> BaseRowReader : uses
        IcebergSplitReader --> IcebergMetadataColumn : references constants
        IcebergSplitReader --> RowVector : produces output
        IcebergSplitReader --> ConstantVector~int64_t~ : creates for constants
        IcebergSplitReader --> FlatVector~int64_t~ : creates for rewritten columns
        IcebergSplitReader --> SimpleVector~int64_t~ : reads existing values
        IcebergMetadataColumn <.. IcebergSplitReader : metadata column definitions used in file schema expansion
File-Level Changes
Hey - I've left some high level feedback:
- In the `next()` null-replacement paths for `_last_updated_sequence_number` and `_row_id`, you unconditionally cast to `SimpleVector<int64_t>` and always allocate a new flat vector; consider using `DecodedVector` (or at least guarding the cast and avoiding allocation when there are no nulls) to better support all encodings and reduce unnecessary allocations.
- The repeated `folly::SingletonVault::singleton()->registrationComplete();` calls in the new tests could be factored into a small helper or a test fixture setup hook to avoid duplication and make the intent clearer.
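The second suggestion can be sketched with a once-only helper, with `std::call_once` standing in for the `folly::SingletonVault` call (the counter exists only to make the once-ness observable; names hypothetical):

```cpp
#include <mutex>

// Deduplicating the repeated one-time registration: every test calls this
// helper (or a fixture's SetUp does), but the body runs exactly once.
int registrationRuns = 0; // illustration only

void ensureSingletonsRegistered() {
  static std::once_flag flag;
  std::call_once(flag, [] {
    // In the real tests this would be:
    // folly::SingletonVault::singleton()->registrationComplete();
    ++registrationRuns;
  });
}
```

In a gtest setup the same call could live in a fixture's `SetUpTestSuite()` instead, which is the "test fixture setup hook" the reviewer suggests.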
Fixed in 4ba18c3:
…and last updated sequence number
    // First row ID from the manifest entry, used to compute _row_id.
    // When available (>= 0), _row_id = first_row_id + _pos for rows not in file.
    std::optional<int64_t> firstRowId_;
@copilot Is it equivalent to lastUpdatedSeqNumOutputIndex_.has_value() && dataSequenceNumber_.has_value()?
Yes, readLastUpdatedSeqNumFromFile_ was equivalent to lastUpdatedSeqNumOutputIndex_.has_value() && dataSequenceNumber_.has_value(). Removed the redundant boolean in 14b5729 — now using the optional checks directly.
    bool computeRowId_{false};

    // The child index of _row_id in readerOutputType_.
    std::optional<column_index_t> rowIdOutputIndex_;
@copilot , Is it equivalent to rowIdOutputIndex_.has_value() && firstRowId_.has_value()?
Yes, computeRowId_ was equivalent to rowIdOutputIndex_.has_value() && firstRowId_.has_value(). Removed the redundant boolean in 14b5729 — now using the optional checks directly.
Replace readLastUpdatedSeqNumFromFile_ with lastUpdatedSeqNumOutputIndex_.has_value() && dataSequenceNumber_.has_value(), and computeRowId_ with rowIdOutputIndex_.has_value() && firstRowId_.has_value(). The boolean flags were always equivalent to checking whether both corresponding optionals had values, so they were redundant. Co-authored-by: Joe-Abraham <53977252+Joe-Abraham@users.noreply.github.com>
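The refactor's equivalence in miniature, as a standalone sketch (member names shortened, everything hypothetical):

```cpp
#include <cstdint>
#include <optional>

// Before: a bool flag maintained alongside two optionals. After: derive
// the condition from the optionals, so it cannot drift out of sync.
struct LineageState {
  std::optional<uint32_t> lastUpdatedSeqNumOutputIndex;
  std::optional<int64_t> dataSequenceNumber;

  // Replaces the old readLastUpdatedSeqNumFromFile_-style boolean.
  bool shouldBackfillSeqNum() const {
    return lastUpdatedSeqNumOutputIndex.has_value() &&
        dataSequenceNumber.has_value();
  }
};
```

The derived check is true exactly when both pieces of state are set, which is the invariant the removed flags were duplicating by hand.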
- `_last_updated_sequence_number` in `next()`: replace null values (not 0 values) per spec
- `_last_updated_sequence_number` tests updated to use null values instead of 0
- `adaptColumns()`: use `readerOutputType_` for type lookup when metadata columns like `_row_id` are not in `tableSchema`
- `_row_id` computation (`first_row_id + _pos`) when the column is missing from the file
- `prepareSplit()`: `dataColumns` set up to match real-world behavior
- `next()`: `dataSequenceNumber_` ordering bug fixed by moving initialization before `getAdaptedRowType()` so `adaptColumns()` can use it
- `DecodedVector` for null replacement: standard Velox pattern for encoding-agnostic value access (handles flat, dictionary, etc.)
- `decoded.mayHaveNulls()` check skips `FlatVector` allocation when the null buffer exists but has no nulls
- Replaced `readLastUpdatedSeqNumFromFile_` and `computeRowId_` with equivalent `.has_value()` checks on `lastUpdatedSeqNumOutputIndex_`/`dataSequenceNumber_` and `rowIdOutputIndex_`/`firstRowId_`
Summary by Sourcery
Add Iceberg V3 row lineage read support for _row_id and _last_updated_sequence_number in the Hive Iceberg split reader, including correct handling of missing columns, null inheritance semantics, and positional deletes.
New Features:
Enhancements:
CI:
Tests: