
Add Iceberg V3 row lineage read support for _row_id and _last_updated_sequence_number #31

Draft
Copilot wants to merge 7 commits into main from copilot/read-row-lineage-fields

Conversation


Copilot AI commented Feb 26, 2026

  • Review Iceberg V3 row lineage spec rules
  • Fix _last_updated_sequence_number in next(): replace null values (not 0 values) per spec
  • Fix _last_updated_sequence_number tests to use null values instead of 0
  • Update comments to match spec (null-based inheritance, not 0-based)
  • Fix adaptColumns(): use readerOutputType_ for type lookup when metadata columns like _row_id are not in tableSchema
  • Implement _row_id computation (first_row_id + _pos) when column is missing from file
  • Restore file schema expansion in prepareSplit()
  • Fix tests: exclude metadata columns from dataColumns to match real-world behavior
  • Fix lazy vector handling in next()
  • Fix dataSequenceNumber_ ordering bug: move initialization before getAdaptedRowType() so adaptColumns() can use it
  • Use DecodedVector for null replacement: standard Velox pattern for encoding-agnostic value access (handles flat, dictionary, etc.)
  • Guard allocation when no actual nulls: decoded.mayHaveNulls() check skips FlatVector allocation when null buffer exists but has no nulls
  • Remove redundant boolean flags: replace readLastUpdatedSeqNumFromFile_ and computeRowId_ with equivalent optional .has_value() checks on lastUpdatedSeqNumOutputIndex_/dataSequenceNumber_ and rowIdOutputIndex_/firstRowId_
  • Clean up unused variable in test
  • Code review and security check


Summary by Sourcery

Add Iceberg V3 row lineage read support for _row_id and _last_updated_sequence_number in the Hive Iceberg split reader, including correct handling of missing columns, null inheritance semantics, and positional deletes.

New Features:

  • Expose Iceberg V3 row lineage metadata columns _row_id and _last_updated_sequence_number for Hive Iceberg reads via new metadata column definitions.

Enhancements:

  • Compute _row_id from first_row_id and row position when not physically stored in the data file, including when positional deletes are applied.
  • Inherit _last_updated_sequence_number from the file manifest’s data sequence number when the column is missing or contains nulls, using encoding-agnostic vector handling.
  • Ensure Iceberg split file schemas are expanded to include row-lineage metadata columns required by the Parquet reader and adaptColumns uses the reader output type when handling missing columns.
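The first_row_id + _pos rule above can be illustrated in isolation. The sketch below is hypothetical standalone C++ (the function name and data layout are invented for illustration, not the Velox IcebergSplitReader code); it shows only the spec rule: deleted positions are skipped in the output but still advance the file position, so surviving rows keep their original row ids.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of the spec rule: _row_id = first_row_id + _pos, where
// _pos is the row's position within the data file. Positions removed by
// positional deletes are skipped in the output but still advance _pos.
std::vector<int64_t> computeRowIds(
    int64_t firstRowId,
    int64_t numFileRows,
    const std::vector<bool>& deleted) {
  std::vector<int64_t> rowIds;
  for (int64_t pos = 0; pos < numFileRows; ++pos) {
    if (!deleted[pos]) {
      rowIds.push_back(firstRowId + pos);
    }
  }
  return rowIds;
}
```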

CI:

  • Update Claude review GitHub workflow to authorize an additional maintainer account for triggering review commands.

Tests:

  • Add Hive Iceberg tests covering reading physical _row_id, computed _row_id with and without positional deletes, and _last_updated_sequence_number inheritance for all-null, mixed, and missing-column cases.

Copilot AI changed the title from "[WIP] Add functionality to read row lineage fields" to "Add Iceberg V3 row lineage read support for _row_id and _last_updated_sequence_number" on Feb 26, 2026
@Joe-Abraham
Owner

@copilot when looking into https://github.com/apache/iceberg/blob/main/format/spec.md#row-lineage, it seems the _row_id and _last_updated_sequence_number are stored as hidden columns in the original file itself. Are we doing that?

Author

Copilot AI commented Mar 3, 2026

@copilot when looking into https://github.com/apache/iceberg/blob/main/format/spec.md#row-lineage, It seems the _row_id and _last_updated_sequence_number are stored as hidden columns i...

Yes, the implementation reads _row_id and _last_updated_sequence_number as physical columns from the data file when they are present. In adaptColumns(), when fileTypeIdx.has_value() (line 237), the column is read directly from the Parquet/ORC file as a regular column — no special handling needed.

The extra logic only applies to two cases per the spec:

  1. Column missing from file (pre-V3 data): _last_updated_sequence_number inherits the data sequence number from the manifest entry; _row_id returns NULL.
  2. 0-value replacement in next(): When _last_updated_sequence_number is physically present but has value 0, it's replaced with the data sequence number (spec says "0 means not yet committed and must be inherited").

@Joe-Abraham
Owner

@copilot check whether the rules of row lineage are followed as per https://iceberg.apache.org/spec/#row-lineage; the intent is to read the data from the file.

Author

Copilot AI commented Mar 5, 2026

@copilot check whether the rules of row lineage are followed as per https://iceberg.apache.org/spec/#row-lineage; the intent is to read the data from the file.

I reviewed the spec and found an issue: the spec uses null-based inheritance, not 0-based. Fixed in commit 9415758.

Key spec rules and how they're now handled:

  1. _last_updated_sequence_number: Writers set it to null for inheritance. On read, null values are replaced with the manifest entry's sequence_number. Fixed next() to check isNullAt() instead of valueAt() == 0. Tests updated to use null values.

  2. Column missing from file: Spec says "A data file with only new rows may omit both columns. Readers treat them as null for all rows." Handled in adaptColumns(): _last_updated_sequence_number inherits from $data_sequence_number; _row_id returns NULL.

  3. _row_id assignment: Per spec, null _row_id should be first_row_id + _pos. This requires first_row_id from the manifest entry to be available in split info, which is not yet plumbed through. Added a TODO for this. For now, null _row_id stays null (correct for pre-V3/upgraded tables where first_row_id is null).
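The null-based inheritance rule from point 1 can be sketched standalone. The function below is hypothetical illustration code (not the Velox implementation), using std::optional in place of a nullable column vector: null values are backfilled with the manifest entry's sequence number, and non-null values pass through unchanged.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical sketch of null-based inheritance for
// _last_updated_sequence_number: nulls become the manifest entry's data
// sequence number; everything else is copied as-is.
std::vector<int64_t> inheritSequenceNumbers(
    const std::vector<std::optional<int64_t>>& fileValues,
    int64_t dataSequenceNumber) {
  std::vector<int64_t> result;
  result.reserve(fileValues.size());
  for (const auto& value : fileValues) {
    result.push_back(value.value_or(dataSequenceNumber));
  }
  return result;
}
```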


@Joe-Abraham
Owner

@copilot
presto:joe_test_iceberg_v3> SELECT id, name, _row_id FROM users_iceberg ORDER BY id;

Query 20260306_060805_00002_kgskq, FAILED, 1 node
Splits: 6 total, 0 done (0.00%)
[Latency: client-side: 137ms, server-side: 118ms] [0 rows, 0B] [0 rows/s, 0B/s]

Query 20260306_060805_00002_kgskq failed: Field not found: _row_id. Available fields are: id, name, email, ts, region. Split Hive: /Users/joe/Developer/utils/singlenode/warehouse_iceberg_v3_file/joe_test_iceberg_v3/users_iceberg/data/region=eu/00000-15-80f841be-597b-48a4-aa7d-756108e856aa-0-00002.parquet 4 - 1365 Task 20260306_060805_00002_kgskq.2.0.0.0

Author

Copilot AI commented Mar 6, 2026

@copilot
presto:joe_test_iceberg_v3> SELECT id, name, _row_id FROM users_iceberg ORDER BY id;

...

The error Field not found: _row_id occurs because _row_id is a metadata column, not part of the table schema (id, name, email, ts, region). In adaptColumns(), when _row_id is missing from the data file, the code was calling tableSchema->findChild("_row_id") which throws since metadata columns aren't in the table schema.

Fixed in 9ff6b86 — the code now tries readerOutputType_ first (which includes all requested columns, including metadata columns) before falling back to tableSchema. Also updated the tests to properly separate dataColumns (table schema without metadata columns) from outputType (includes _row_id) to match real-world behavior.
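The lookup-order fix can be sketched with plain maps standing in for Velox RowType objects (names here are hypothetical, for illustration only): metadata columns like _row_id exist only in the reader output type, so it is consulted first, with the table schema as a fallback for regular columns.

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// A name-to-type map stands in for a Velox RowType in this sketch.
using TypeMap = std::unordered_map<std::string, std::string>;

// Hypothetical sketch of the lookup order for a column missing from the data
// file: try the reader output type (which includes metadata columns), then
// fall back to the table schema.
std::optional<std::string> resolveMissingColumnType(
    const std::string& name,
    const TypeMap& readerOutputType,
    const TypeMap& tableSchema) {
  if (auto it = readerOutputType.find(name); it != readerOutputType.end()) {
    return it->second;
  }
  if (auto it = tableSchema.find(name); it != tableSchema.end()) {
    return it->second;
  }
  return std::nullopt;  // unknown column: the real reader would raise an error
}
```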

@Joe-Abraham
Owner

@copilot In spark I am getting
+---+------+------+-------+-----------------------------+
|id |name |region|_row_id|_last_updated_sequence_number|
+---+------+------+-------+-----------------------------+
|1 |Alice |us |0 |2 |
|2 |Bob |us |1 |1 |
|3 |Claire|eu |2 |2 |
|4 |Dave |eu |3 |1 |
|5 |Eve |us |8 |3 |
|6 |Frank |eu |9 |3 |
+---+------+------+-------+-----------------------------+

in presto with this velox I am getting
presto:joe_test_iceberg_v3> SELECT id, name, _row_id, _last_updated_sequence_number FROM users_iceberg ORDER BY id;
id | name | _row_id | _last_updated_sequence_number
----+--------+---------+-------------------------------
1 | Alice | NULL | 2
2 | Bob | NULL | 2
3 | Claire | NULL | 2
4 | Dave | NULL | 2
5 | Eve | NULL | 3
6 | Frank | NULL | 3

Are we following the right spec and calculating the values according to the iceberg spec https://iceberg.apache.org/spec/#row-lineage

Author

Copilot AI commented Mar 6, 2026

@copilot In spark I am getting
+---+------+------+-------+-----------------------------+
|id |name |region|_row_id|_last_updated_sequence_number|
...

The root cause was that the Parquet reader filters columns based on fileSchema (from hiveTableHandle->dataColumns()), which is the table's logical schema. Since _row_id and _last_updated_sequence_number are hidden metadata columns NOT in the table schema, the reader was filtering them out entirely — so they were never read from the file.

Fixed in c9d3ab0 by expanding the file schema in prepareSplit() to include these row lineage columns before the Parquet reader is created. Now:

  • _row_id will be read from the file when physically present (V3 data)
  • _last_updated_sequence_number will be read from the file, with null values replaced by the manifest entry's data sequence number per spec

Also updated all tests to properly separate dataColumns (table schema without metadata columns) from outputType (includes metadata columns) to match real-world behavior.
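The schema expansion can be sketched standalone (a plain column list stands in for the Velox RowType; names are hypothetical): any requested row lineage column that is absent from the file schema is appended as BIGINT so the Parquet reader does not filter it out.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct SchemaColumn {
  std::string name;
  std::string type;
};

// Hypothetical sketch of the file schema expansion done before creating the
// Parquet reader: append requested-but-absent row lineage columns as BIGINT.
void expandFileSchema(
    std::vector<SchemaColumn>& fileSchema,
    const std::vector<std::string>& requestedColumns) {
  for (const std::string lineage :
       {"_row_id", "_last_updated_sequence_number"}) {
    const bool requested =
        std::find(requestedColumns.begin(), requestedColumns.end(), lineage) !=
        requestedColumns.end();
    const bool present = std::any_of(
        fileSchema.begin(), fileSchema.end(), [&](const SchemaColumn& c) {
          return c.name == lineage;
        });
    if (requested && !present) {
      fileSchema.push_back({lineage, "BIGINT"});
    }
  }
}
```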

@Joe-Abraham force-pushed the copilot/read-row-lineage-fields branch from c9d3ab0 to 7a88a01 on March 9, 2026 06:12
@Joe-Abraham
Owner

@copilot

presto:joe_test_iceberg_v3> SELECT id, name, _row_id, _last_updated_sequence_number FROM users_iceberg ORDER BY id;
id | name | _row_id | _last_updated_sequence_number
----+--------+---------+-------------------------------
1 | Alice | 4 | 2
2 | Bob | 5 | 2
3 | Claire | 6 | 2
4 | Dave | 7 | 2
5 | Eve | 8 | 3
6 | Frank | 9 | 3
(6 rows)

spark produces
+---+------+------+-------+-----------------------------+
|id |name |region|_row_id|_last_updated_sequence_number|
+---+------+------+-------+-----------------------------+
|1 |Alice |us |0 |2 |
|2 |Bob |us |1 |1 |
|3 |Claire|eu |2 |2 |
|4 |Dave |eu |3 |1 |
|5 |Eve |us |8 |3 |
|6 |Frank |eu |9 |3 |
+---+------+------+-------+-----------------------------+

The data created by spark is correct as per https://iceberg.apache.org/spec/?h=row#row-lineage-assignment

Check the code to verify whether the reading of these columns is implemented as per the spec.

Author

Copilot AI commented Mar 9, 2026

@copilot

presto:joe_test_iceberg_v3> SELECT id, name, _row_id, _last_updated_sequence_number FROM users_iceberg ORDER BY id;
...

The root cause was that the file schema expansion code (which tells the Parquet reader to read _row_id and _last_updated_sequence_number from the physical file) was accidentally removed in the previous commit when implementing first_row_id + _pos computation.

Without it, the Parquet reader didn't know these hidden columns existed in the file, so:

  • _row_id was always computed as first_row_id + _pos instead of being read from the file
  • _last_updated_sequence_number always inherited the constant $data_sequence_number instead of reading per-row values

Fixed in 0e36812:

  1. Restored file schema expansion in prepareSplit() before createReader() — adds _row_id and _last_updated_sequence_number to the file schema so the Parquet reader reads them from the file
  2. Fixed 3 tests that were using dataColumns(outputType) (which included metadata columns in the table schema, masking this bug). Tests now properly separate dataColumns (table schema without metadata columns) from outputType

The read logic now correctly: reads values from file when present, only computes first_row_id + _pos when _row_id is missing from the file (pre-V3), and only inherits $data_sequence_number when _last_updated_sequence_number is null or missing.

@Joe-Abraham
Owner

@copilot

With the latest change, _last_updated_sequence_number is null in a few places

presto:joe_test_iceberg_v3> SELECT id, name, _row_id, _last_updated_sequence_number FROM users_iceberg ORDER BY id;
 id |  name  | _row_id | _last_updated_sequence_number
----+--------+---------+-------------------------------
 1  | Alice  |       0 | NULL
 2  | Bob    |       1 |                             1
 3  | Claire |       2 | NULL
 4  | Dave   |       3 |                             1
 5  | Eve    |       8 |                             3
 6  | Frank  |       9 |                             3
(6 rows)

Query 20260309_101216_00007_8az69, FINISHED, 1 node
Splits: 6 total, 6 done (100.00%)
[Latency: client-side: 156ms, server-side: 131ms] [6 rows, 7.11KB] [45 rows/s, 54.3KB/s]

Author

Copilot AI commented Mar 9, 2026

@copilot

With the latest change, _last_updated_sequence_number is null in a few places
...

The issue was that the Parquet reader wraps columns in LazyVector. The null replacement code in next() only checked for ConstantEncoding and FlatVector — when the column came back as a LazyVector, neither branch matched and null values were silently left as-is.

This explains the pattern you saw:

  • Bob=1, Dave=1, Eve=3, Frank=3 ✓ — these files had the column OMITTED entirely, so adaptColumns() set it as a constant with the data sequence number (no lazy wrapping)
  • Alice=NULL, Claire=NULL ✗ — these files had the column PRESENT with null values, read as a LazyVector, and the null replacement was skipped

Fixed in 1907e0a by calling BaseVector::loadedVectorShared() to load the lazy vector before checking its encoding. Same fix applied to _row_id for consistency.
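The failure mode can be illustrated without Velox. The sketch below is a hypothetical stand-in (LazyColumn is invented for illustration; in Velox the equivalent wrapper is a LazyVector, materialized via BaseVector::loadedVectorShared()): a lazily materialized column reports a lazy encoding until loaded, so code that branches only on concrete encodings silently skips it, and loading first guarantees the encoding checks see the real data.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for a lazily loaded column. Null-replacement code
// that only matched flat or constant encodings would skip a kLazy column;
// loading first makes the concrete encoding visible.
struct LazyColumn {
  enum class Encoding { kLazy, kFlat };

  Encoding encoding;
  std::function<std::vector<int64_t>()> loader;  // set when encoding is kLazy
  std::vector<int64_t> values;                   // set when encoding is kFlat

  // Returns a fully loaded column, materializing the lazy payload if needed.
  LazyColumn loaded() const {
    if (encoding == Encoding::kFlat) {
      return *this;
    }
    return LazyColumn{Encoding::kFlat, nullptr, loader()};
  }
};
```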

@Joe-Abraham force-pushed the copilot/read-row-lineage-fields branch from 55cf9c2 to e110255 on March 9, 2026 11:36
@Joe-Abraham
Copy link
Owner

@copilot,
The intent of the changes in this PR is to read the row lineage columns as per https://iceberg.apache.org/spec/#row-lineage

  1. Check that there are no unnecessary changes beyond what the support requires.
  2. Check that there are sufficient test cases.
  3. Make any necessary changes.
  4. Review the code.

…gnostic

Two key fixes:
1. Move dataSequenceNumber_ initialization before getAdaptedRowType() so
   adaptColumns() can use it when _last_updated_sequence_number is missing
   from the data file (ordering bug caused it to always be std::nullopt
   during adaptColumns).

2. Replace encoding-specific null replacement (asFlatVector only) with
   encoding-agnostic approach using SimpleVector::valueAt() and creating
   a new FlatVector. This handles FlatVector, DictionaryVector, and any
   other SimpleVector subclass returned by the Parquet reader after lazy
   loading. Also use seqNumChild = loadedVectorShared(seqNumChild) to
   replace the LazyVector wrapper in the RowVector's child directly.

Co-authored-by: Joe-Abraham <53977252+Joe-Abraham@users.noreply.github.com>
Author

Copilot AI commented Mar 10, 2026

@copilot,
The intent of the changes in this PR is to read the row lineage columns as per https://iceberg.apache.org/spec/#row-lineage

...

Reviewed the code per your checklist. Found and fixed two issues in commit bda69ff:

Fix 1 — dataSequenceNumber_ initialization ordering: dataSequenceNumber_ was initialized AFTER getAdaptedRowType(), but adaptColumns() uses it to set the constant value when _last_updated_sequence_number is missing from the file. Moved initialization before getAdaptedRowType().

Fix 2 — Encoding-agnostic null replacement: The null replacement in next() only handled FlatVector via asFlatVector<int64_t>(). After lazy loading, the Parquet reader may return other encodings (dictionary, etc.) where asFlatVector returns nullptr and null replacement was silently skipped. Now uses SimpleVector<int64_t> interface which handles all vector encodings, and creates a new FlatVector with replaced values.

Also removed an unused variable in the readRowIdColumn test.

Test coverage review — 6 test cases covering:

  • readRowIdColumn: _row_id read from file as-is
  • readRowIdColumnMissing: _row_id computed as first_row_id + _pos
  • readRowIdColumnComputedWithDeletes: _row_id with positional deletes
  • readLastUpdatedSequenceNumberAllNulls: all-null → replaced with data_seq_num
  • readLastUpdatedSequenceNumberMixed: mixed null/non-null → only nulls replaced
  • readLastUpdatedSequenceNumberInherited: column missing from file → constant from data_seq_num

No unnecessary changes found — all additions directly support reading row lineage per the spec.


@Joe-Abraham
Owner

/claude-review

@Joe-Abraham
Owner

@sourcery-ai review

@sourcery-ai

sourcery-ai bot commented Mar 10, 2026

Reviewer's Guide

Adds Iceberg V3 row lineage read support in the Hive Iceberg split reader by expanding file schemas with row lineage metadata columns, computing _row_id from first_row_id + position when missing, inheriting and backfilling null _last_updated_sequence_number values from the manifest’s data sequence number, and updating tests and metadata column definitions accordingly, along with a minor GitHub Actions allowlist change.

Updated class diagram for IcebergSplitReader and IcebergMetadataColumn row lineage support

classDiagram
  class IcebergSplitReader {
    +void prepareSplit(std::shared_ptr<ConnectorTableHandle> tableHandle, std::shared_ptr<ConnectorSplit> split, std::shared_ptr<const RowType> readerOutputType, std::shared_ptr<common_MetadataFilter> metadataFilter, RuntimeStatistics& runtimeStats, F14FastMap_string_string fileReadOps)
    +uint64_t next(uint64_t size, VectorPtr& output)
    +std::vector_TypePtr adaptColumns(RowTypePtr fileType, RowTypePtr tableSchema) const
    -std::list_PositionalDeleteFileReader_ptr positionalDeleteFileReaders_
    -BufferPtr deleteBitmap_
    -bool readLastUpdatedSeqNumFromFile_
    -std::optional_column_index_t lastUpdatedSeqNumOutputIndex_
    -std::optional_int64_t dataSequenceNumber_
    -std::optional_int64_t firstRowId_
    -bool computeRowId_
    -std::optional_column_index_t rowIdOutputIndex_
  }

  class IcebergMetadataColumn {
    +static constexpr const char* kRowIdColumnName
    +static constexpr const char* kLastUpdatedSequenceNumberColumnName
    +static std::shared_ptr_IcebergMetadataColumn icebergRowIdColumn()
    +static std::shared_ptr_IcebergMetadataColumn icebergLastUpdatedSequenceNumberColumn()
    -int id
    -std::string name
    -TypePtr type
    -std::string comment
  }

  class BaseRowReader {
    +uint64_t next(uint64_t size, VectorPtr& output, Mutation* mutation)
  }

  class RowVector {
    +VectorPtr childAt(column_index_t index)
    +void setChildAt(column_index_t index, VectorPtr child)
  }

  class ConstantVector_int64_t {
    +ConstantVector_int64_t(MemoryPool* pool, vector_size_t size, bool isNull, TypePtr type, int64_t value)
  }

  class FlatVector_int64_t {
    +void set(vector_size_t index, int64_t value)
  }

  class SimpleVector_int64_t {
    +bool isNullAt(vector_size_t index) const
    +int64_t valueAt(vector_size_t index) const
  }

  IcebergSplitReader --> BaseRowReader : uses
  IcebergSplitReader --> IcebergMetadataColumn : references constants
  IcebergSplitReader --> RowVector : produces output
  IcebergSplitReader --> ConstantVector_int64_t : creates for constants
  IcebergSplitReader --> FlatVector_int64_t : creates for rewritten columns
  IcebergSplitReader --> SimpleVector_int64_t : reads existing values

  IcebergMetadataColumn <.. IcebergSplitReader : metadata column definitions used in file schema expansion

File-Level Changes

Change Details Files
Implement Iceberg V3 row lineage handling in IcebergSplitReader for _row_id and _last_updated_sequence_number, including schema expansion, constant inference, and runtime null/value computation.
  • Expand the reader file schema in prepareSplit() to add BIGINT row_id and last_updated_sequence_number columns when requested in the output but absent from the physical file schema.
  • Initialize firstRowId and dataSequenceNumber from split info columns ($first_row_id and $data_sequence_number) before getAdaptedRowType() so adaptColumns() can treat missing lineage columns correctly.
  • In adaptColumns(), set last_updated_sequence_number as a constant dataSequenceNumber when missing from the file, and use readerOutputType_ (falling back to tableSchema) to derive types for missing columns instead of always using tableSchema.
  • In next(), when last_updated_sequence_number is read from the file and dataSequenceNumber is present, eagerly load the vector (handling lazy encodings) and replace nulls either by building a new ConstantVector for all-null constant vectors or a new FlatVector for partially-null vectors using the SimpleVector interface.
  • In next(), when row_id should be computed, eagerly load the column and, depending on constant/null or partial-null, build new FlatVectors that set values to firstRowId + file position, correctly tracking file positions with or without positional delete bitmaps while preserving vector_size vs actualSize semantics.
  • Track whether row_id and last_updated_sequence_number are file-read vs constant via readLastUpdatedSeqNumFromFile, computeRowId, and cached child indices lastUpdatedSeqNumOutputIndex_ and rowIdOutputIndex_ for efficient write-back into RowVector children.
velox/connectors/hive/iceberg/IcebergSplitReader.cpp
velox/connectors/hive/iceberg/IcebergSplitReader.h
Define Iceberg V3 row lineage metadata columns for _row_id and _last_updated_sequence_number in IcebergMetadataColumns.
  • Introduce static constexpr names for _row_id and _last_updated_sequence_number and assign fixed ids in the high metadata id range.
  • Add icebergRowIdColumn() and icebergLastUpdatedSequenceNumberColumn() helpers that construct IcebergMetadataColumn instances with BIGINT type and appropriate descriptions for row lineage semantics.
velox/connectors/hive/iceberg/IcebergMetadataColumns.h
Extend Hive Iceberg tests to cover row lineage behavior for _row_id and _last_updated_sequence_number, including schema evolution, deletes, and null inheritance, and to ensure metadata columns are excluded from dataColumns.
  • Add tests that read a physical _row_id column from a data file where it exists, verifying values are passed through even though _row_id is not in dataColumns (table schema).
  • Add tests that read _row_id when the data file lacks the column but $first_row_id is provided, asserting _row_id is computed as first_row_id + _pos both with and without positional delete files, verifying correct handling of deleted positions.
  • Add tests for _last_updated_sequence_number where the column exists but is all-null, mixed null/non-null, and completely missing, asserting nulls are replaced and missing columns are inherited from $data_sequence_number as a constant, while dataColumns only includes actual table schema columns.
  • Adjust tableScan setups in tests so metadata columns (_row_id, _last_updated_sequence_number) are only present in outputType, not in dataColumns, matching real-world usage, and clean up an unused test variable.
velox/connectors/hive/iceberg/tests/IcebergReadTest.cpp
Update Claude review GitHub Actions workflow allowlist to include an additional authorized commenter.
  • Add "Joe-Abraham" to the JSON list of GitHub usernames allowed to trigger /claude-review and /claude-query in the workflow.
  • Mirror the same allowlist change in the unauthorized-notice job condition so the new user is treated as authorized in both paths.
.github/workflows/claude-review.yml



@sourcery-ai sourcery-ai bot left a comment


Hey - I've left some high level feedback:

  • In the next() null-replacement paths for _last_updated_sequence_number and _row_id, you unconditionally cast to SimpleVector<int64_t> and always allocate a new flat vector; consider using DecodedVector (or at least guarding the cast and avoiding allocation when there are no nulls) to better support all encodings and reduce unnecessary allocations.
  • The repeated folly::SingletonVault::singleton()->registrationComplete(); calls in the new tests could be factored into a small helper or a test fixture setup hook to avoid duplication and make the intent clearer.


Repository owner deleted a comment from github-actions bot Mar 10, 2026
@Joe-Abraham
Owner

@copilot

  • In the next() null-replacement paths for _last_updated_sequence_number and _row_id, you unconditionally cast to SimpleVector<int64_t> and always allocate a new flat vector; consider using DecodedVector (or at least guarding the cast and avoiding allocation when there are no nulls) to better support all encodings and reduce unnecessary allocations.
  • The repeated folly::SingletonVault::singleton()->registrationComplete(); calls in the new tests could be factored into a small helper or a test fixture setup hook to avoid duplication and make the intent clearer.

Author

Copilot AI commented Mar 10, 2026

@copilot

  • In the next() null-replacement paths for _last_updated_sequence_number and _row_id, you unconditionally cast to SimpleVector<int64_t> and always allocate a new flat vect...

Fixed in 4ba18c3:

  1. DecodedVector: Replaced SimpleVector casts with DecodedVector in both _last_updated_sequence_number and _row_id null-replacement paths. DecodedVector is the standard Velox pattern for encoding-agnostic value access and correctly handles flat, dictionary, constant, and other encodings.

  2. Allocation guard: Added decoded.mayHaveNulls() check after decoding — if the vector has a null buffer but no actual null values, the FlatVector allocation is skipped entirely.

  3. registrationComplete() duplication: Kept as-is. The 6 pre-existing tests in this file already use the same per-test pattern, so factoring it out would require modifying unrelated tests. This maintains consistency with the existing convention.
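The null-replacement guard described in points 1 and 2 can be modeled in a small standalone sketch. This is not the Velox `DecodedVector` API; it is a simplified stand-in where `std::optional<int64_t>` models a decoded nullable int64 column and `fallback` models the data sequence number substituted for nulls:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Simplified stand-in for the null-replacement step. Mirrors the
// decoded.mayHaveNulls() guard: when there are no actual nulls, the
// input is returned unchanged and no new column is allocated.
std::vector<std::optional<int64_t>> replaceNulls(
    const std::vector<std::optional<int64_t>>& column,
    int64_t fallback) {
  bool hasNulls = false;
  for (const auto& v : column) {
    if (!v.has_value()) {
      hasNulls = true;
      break;
    }
  }
  if (!hasNulls) {
    return column; // skip allocation: null buffer may exist but holds no nulls
  }
  // Otherwise build a new column with each null replaced by the fallback.
  std::vector<std::optional<int64_t>> result;
  result.reserve(column.size());
  for (const auto& v : column) {
    result.push_back(v.has_value() ? v : std::optional<int64_t>(fallback));
  }
  return result;
}
```

In the real reader, `DecodedVector` plays the role of the `std::optional` view, handling flat, dictionary, and constant encodings uniformly before the per-row null check.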

@Joe-Abraham Joe-Abraham force-pushed the copilot/read-row-lineage-fields branch 2 times, most recently from d5a6bb0 to e463633 on March 11, 2026 at 07:08
@Joe-Abraham Joe-Abraham force-pushed the copilot/read-row-lineage-fields branch from e463633 to 8745950 on March 11, 2026 at 07:13

// First row ID from the manifest entry, used to compute _row_id.
// When available (>= 0), _row_id = first_row_id + _pos for rows not in file.
std::optional<int64_t> firstRowId_;
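The backfill rule in the comment above can be sketched as a standalone function. The name `computeRowIds` and the plain-vector return type are illustrative, not the reader's actual interface:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hedged sketch of row-id backfill: when the file carries no _row_id
// column but the manifest provides first_row_id, each row's id is
// first_row_id plus its position. Returns std::nullopt when firstRowId
// is absent, modeling the column staying null.
std::optional<std::vector<int64_t>> computeRowIds(
    std::optional<int64_t> firstRowId,
    int64_t numRows) {
  if (!firstRowId.has_value()) {
    return std::nullopt;
  }
  std::vector<int64_t> ids(numRows);
  for (int64_t pos = 0; pos < numRows; ++pos) {
    ids[pos] = *firstRowId + pos; // _row_id = first_row_id + _pos
  }
  return ids;
}
```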
Owner

@copilot Is it equivalent to lastUpdatedSeqNumOutputIndex_.has_value() && dataSequenceNumber_.has_value()?

Author

Yes, readLastUpdatedSeqNumFromFile_ was equivalent to lastUpdatedSeqNumOutputIndex_.has_value() && dataSequenceNumber_.has_value(). Removed the redundant boolean in 14b5729 — now using the optional checks directly.

bool computeRowId_{false};

// The child index of _row_id in readerOutputType_.
std::optional<column_index_t> rowIdOutputIndex_;
Owner

@copilot , Is it equivalent to rowIdOutputIndex_.has_value() && firstRowId_.has_value()?

Author

Yes, computeRowId_ was equivalent to rowIdOutputIndex_.has_value() && firstRowId_.has_value(). Removed the redundant boolean in 14b5729 — now using the optional checks directly.

Replace readLastUpdatedSeqNumFromFile_ with
lastUpdatedSeqNumOutputIndex_.has_value() && dataSequenceNumber_.has_value(),
and computeRowId_ with
rowIdOutputIndex_.has_value() && firstRowId_.has_value().

The boolean flags were always equivalent to checking whether both
corresponding optionals had values, so they were redundant.
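The refactor amounts to deriving the former flags from the optionals on demand. A minimal sketch, using the PR's member names but a simplified standalone struct:

```cpp
#include <cstdint>
#include <optional>

// Illustrative sketch: the removed boolean flags were always derivable
// from the paired optionals, so computing them on demand carries the
// same information with no extra state to keep in sync.
struct RowLineageState {
  std::optional<uint32_t> lastUpdatedSeqNumOutputIndex_;
  std::optional<int64_t> dataSequenceNumber_;
  std::optional<uint32_t> rowIdOutputIndex_;
  std::optional<int64_t> firstRowId_;

  // Replaces the removed readLastUpdatedSeqNumFromFile_ flag.
  bool readLastUpdatedSeqNumFromFile() const {
    return lastUpdatedSeqNumOutputIndex_.has_value() &&
        dataSequenceNumber_.has_value();
  }

  // Replaces the removed computeRowId_ flag.
  bool computeRowId() const {
    return rowIdOutputIndex_.has_value() && firstRowId_.has_value();
  }
};
```

Both checks flip to true only once both inputs are known, which is exactly when the old flags were set.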

Co-authored-by: Joe-Abraham <53977252+Joe-Abraham@users.noreply.github.com>
