
[GLUTEN-10568] [VL] Pass the table schema to the HiveTableHandle #10569

Closed

kevinwilfong wants to merge 1 commit into apache:main from kevinwilfong:data_schema

Conversation

@kevinwilfong
Collaborator

What changes are proposed in this pull request?

This PR changes the SubstraitToVeloxPlanConverter to pass the table schema rather than just the schema of the columns we
read from the files to the HiveTableHandle.

This is a necessary prerequisite to supporting index-based column resolution, whether that means using column positions rather than names to map between the file schema and the table schema, or reading from file formats, like Text, that do not carry their own schema information.
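To make the distinction concrete, here is a minimal, self-contained Python sketch (illustrative only, not Gluten code; all names are hypothetical) contrasting name-based and position-based column resolution between a file schema and a table schema:

```python
def resolve_by_name(file_schema, table_schema):
    """Map each file column to the index of the table column with the same name."""
    table_index = {name: i for i, name in enumerate(table_schema)}
    return [table_index.get(name) for name in file_schema]

def resolve_by_position(file_schema, table_schema):
    """Map each file column to the table column at the same ordinal position."""
    return [i if i < len(table_schema) else None
            for i in range(len(file_schema))]

# A file written before a column was renamed: "c1" later became "user_id".
file_schema = ["c0", "c1"]
table_schema = ["c0", "user_id", "added_later"]

print(resolve_by_name(file_schema, table_schema))      # [0, None]
print(resolve_by_position(file_schema, table_schema))  # [0, 1]
```

Name-based resolution loses the renamed column, while position-based resolution recovers it, but only if the resolver can see the full table schema rather than just the columns being read.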

To do this I updated VeloxIteratorApi to set the file schema of LocalFilesNodes to the data schema of the Scan (when
present). This is similar to what's already done in the Iterator API for ClickHouse. I then parse that schema and pass it to the
SplitInfo in VeloxPlanConverter. Finally, I extract it from the SplitInfo in SubstraitToVeloxPlanConverter and pass it to the
HiveTableHandle constructor in place of the base schema.

This should not produce any noticeable effect for the existing code paths and file formats, as the table schema is a superset of the base schema and file columns are currently mapped to table columns exclusively by name.
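A small Python sketch (illustrative only, not Gluten code) of why widening the schema passed to the handle is harmless under name-based mapping: looking columns up by name gives the same answer whether the handle sees only the read columns or the full table schema, so long as the former is a subset of the latter:

```python
def lookup(schema, wanted):
    """Return the (name, type) entry for each wanted column, resolved by name."""
    by_name = {name: typ for name, typ in schema}
    return [(w, by_name[w]) for w in wanted]

read_schema  = [("a", "bigint"), ("b", "varchar")]
table_schema = read_schema + [("c", "double")]  # superset: one extra column "c"

wanted = ["a", "b"]
# Name-based lookup is unaffected by the extra columns in the superset.
assert lookup(read_schema, wanted) == lookup(table_schema, wanted)
```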

How was this patch tested?

Ran the existing unit tests. This change should not change any existing behavior, but should enable future changes.

@kevinwilfong kevinwilfong marked this pull request as draft August 27, 2025 23:10
@github-actions github-actions bot added the CORE (works for Gluten Core) and VELOX labels Aug 27, 2025
@github-actions

#10568

@github-actions

Run Gluten Clickhouse CI on x86

@github-actions github-actions bot removed the CORE (works for Gluten Core) label Aug 27, 2025
```diff
 splitInfos.zipWithIndex.map {
-  case (splitInfos, index) =>
+  case (splits, index) =>
     val splitsByteArray = splits.zipWithIndex.map {
```
Contributor

Could we inject the table schema into the Substrait plan? The draft implementation will bring more GC pressure on the driver.

Collaborator Author

@kevinwilfong kevinwilfong Aug 28, 2025

When you say inject it into the Substrait plan, do you mean in the ReadRel?

If so, definitely. This was something I was already considering; I just wasn't sure how open folks are to updating the Substrait protobufs, so I borrowed this approach from what's already done in the code for ClickHouse.

We could probably deprecate the schema field in FileOrFiles. The only place it's used is that ClickHouse code path, and it appears to be set there only when the file is in the TextFile format; we could change that logic to consume the field only for TextFile, if that isn't already the case.

Contributor

> When you say inject it into the substrait plan do you mean in the ReadRel?

Yes, you can try modifying gluten-substrait/src/main/resources/substrait/proto/substrait/algebra.proto.

> We could probably deprecate the schema field in FileOrFiles

+1

Collaborator Author

@Yohahaha I see now why they originally added it at the split level for ClickHouse.

They only wanted to add the table schema when necessary, which today means only when a file is in the TextFileFormat. Since this is a property of the partition/split, we don't know it at the time we're constructing the plan.

If we always set the table schema, it's possible the table schema has a column type that Gluten doesn't support (e.g. TimestampNTZ), which causes an exception serializing the ReadRelNode.

We can optionally add it to the Substrait plan for the case where we want to support index-based column resolution, which can be determined at plan-generation time. However, to support file formats like Text that always depend on knowing the schema, I think we'll need to keep it at the split level.

Collaborator Author

See the failure in the test "SPARK-36726: test incorrect Parquet row group file offset" in GlutenParquetIOSuite in gluten-ut/spark35 when I add it at the plan level.

@github-actions github-actions bot added the CORE (works for Gluten Core) and CLICKHOUSE labels Aug 29, 2025

@kevinwilfong kevinwilfong marked this pull request as ready for review August 29, 2025 23:02
@kevinwilfong kevinwilfong marked this pull request as draft August 29, 2025 23:04
@github-actions

github-actions bot commented Sep 2, 2025

Run Gluten Clickhouse CI on x86

@github-actions

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the stale label Oct 18, 2025
@github-actions

This PR was auto-closed because it has been stalled for 10 days with no activity. Please feel free to reopen if it is still valid. Thanks.

@github-actions github-actions bot closed this Oct 28, 2025