WIP: Enable GlutenParquetTypeWideningSuite for Spark 4.0 and 4.1 #11670
Draft
baibaichen wants to merge 5 commits into apache:main from
Conversation
…rtedException
Add exception translation in Gluten's iterator chain so that Velox native
reader type conversion errors are properly translated to Spark's expected
SchemaColumnConvertNotSupportedException.
Changes:
- ClosableIterator.java: Extract translateException() virtual method (default returns GlutenException, preserving existing behavior)
- ColumnarBatchOutIterator.java: Override translateException() to detect Velox type mapping errors ("not allowed for requested type" or "Not a valid type for") and wrap them as SchemaColumnConvertNotSupportedException
This enables Spark's ParquetTypeWideningSuite error-path tests to pass
when using Gluten's Velox native reader.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
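The commit above can be sketched as a minimal, self-contained Java example. The class and method names mirror the commit description, but the bodies are simplified stand-ins: the real ClosableIterator and ColumnarBatchOutIterator live in Gluten and wrap native Velox calls, and the real SchemaColumnConvertNotSupportedException is Spark's.

```java
// Stand-in for Gluten's generic native-error wrapper.
class GlutenException extends RuntimeException {
  GlutenException(Throwable cause) { super(cause); }
}

// Stand-in for Spark's SchemaColumnConvertNotSupportedException.
class SchemaColumnConvertNotSupportedException extends RuntimeException {
  SchemaColumnConvertNotSupportedException(String message, Throwable cause) {
    super(message, cause);
  }
}

abstract class ClosableIterator {
  // Virtual hook: by default every native failure becomes a GlutenException,
  // preserving the existing behavior for all other iterators.
  protected RuntimeException translateException(Throwable t) {
    return new GlutenException(t);
  }

  // Funnel point through which native errors are surfaced to callers.
  public final RuntimeException toRuntimeException(Throwable t) {
    return translateException(t);
  }
}

class ColumnarBatchOutIterator extends ClosableIterator {
  @Override
  protected RuntimeException translateException(Throwable t) {
    String msg = t.getMessage();
    // Velox reports unsupported Parquet type mappings with these phrases;
    // rewrap them as the exception Spark's error-path tests expect.
    if (msg != null
        && (msg.contains("not allowed for requested type")
            || msg.contains("Not a valid type for"))) {
      return new SchemaColumnConvertNotSupportedException(msg, t);
    }
    return super.translateException(t);
  }
}
```

Because the hook has a default implementation in the base class, only iterators that sit on the Parquet read path need to opt in; everything else keeps returning GlutenException unchanged.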
Enable the previously disabled GlutenParquetTypeWideningSuite with Velox backend fixes for Parquet type widening (SPARK-40876). Test suite: 81 pass, 0 fail, 38 ignored (from 74 failures).

Changes:
- VeloxTestSettings.scala (spark40+41): Enable suite with targeted excludes for DELTA_BYTE_ARRAY encoding limitation (2) and parquet-mr overflow (1)
- GlutenParquetTypeWideningSuite.scala (spark40+41): Override test class to disable native writer (test read-path only) and override 35 tests that need expectError=true for both reader configs (Velox always uses the native reader regardless of the vectorized setting)
- get-velox.sh: Point to Velox branch with type widening support

Velox fixes (in baibaichen/velox feature/enable-parquet-type-widening-suite):
1. Revert OAP commit that over-relaxed convertType() type checks
2. Support INT->DOUBLE/REAL/DECIMAL widening + decimal precision check
3. Support Decimal->Decimal widening (same-scale + scale rescaling)
4. Fix SPARK-16632: Allow reading INT32 as ByteType/ShortType

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
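The decimal precision check named in Velox fix 2 can be illustrated with a short sketch. The method name follows the commit description, but this is an assumption-laden stand-in: the real check is C++ inside Velox's Parquet reader. Per Spark's rule, reading an integer column as Decimal(p, s) requires p - s >= 10 for INT32 and p - s >= 20 for INT64.

```java
// Illustrative sketch of the hasEnoughDecimalPrecision rule described in
// the commit message; not Velox's actual (C++) implementation.
final class DecimalPrecisionCheck {
  enum ParquetPhysicalType { INT32, INT64 }

  // Reading an integer column as Decimal(precision, scale) is only safe
  // when the decimal's integer part (precision - scale) is wide enough:
  // >= 10 digits for INT32, >= 20 for INT64 (Spark's rule).
  static boolean hasEnoughDecimalPrecision(ParquetPhysicalType type, int precision, int scale) {
    int integerDigits = precision - scale;
    switch (type) {
      case INT32:
        return integerDigits >= 10;
      case INT64:
        return integerDigits >= 20;
      default:
        return false;
    }
  }
}
```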
…widening overrides

Spark 4.1 adds 6 new tests for decimal precision+scale widening where precisionIncrease >= scaleIncrease >= 0. Velox already supports these conversions, so they should NOT be in the 'expect error' override list.

Remove these 6 cases from the spark41 override:
- Decimal(5,2) -> Decimal(7,4)
- Decimal(5,2) -> Decimal(10,7)
- Decimal(5,2) -> Decimal(20,17)
- Decimal(10,2) -> Decimal(12,4)
- Decimal(10,2) -> Decimal(20,12)
- Decimal(20,2) -> Decimal(22,4)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
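As a sanity check, the precisionIncrease >= scaleIncrease >= 0 rule above can be written as a tiny predicate. This is a hypothetical helper for illustration, not Gluten's or Velox's actual code.

```java
// Hypothetical helper expressing the rule above: Decimal(p1,s1) can widen
// to Decimal(p2,s2) only when the scale does not shrink and the precision
// grows at least as much as the scale, so the integer part never loses digits.
final class DecimalWideningRule {
  static boolean isSupported(int p1, int s1, int p2, int s2) {
    int precisionIncrease = p2 - p1;
    int scaleIncrease = s2 - s1;
    return scaleIncrease >= 0 && precisionIncrease >= scaleIncrease;
  }
}
```

All six removed cases satisfy the predicate, e.g. Decimal(5,2) -> Decimal(7,4) has precisionIncrease = scaleIncrease = 2.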
Remove 5 excludes for Decimal->Decimal same-scale precision widening tests that are now supported by Velox commit 3. These tests were previously excluded with the comment 'Velox reads wrong data', but the Decimal->Decimal widening fix resolved the issue.

Un-excluded tests:
- Decimal(5,2) -> Decimal(7,2)
- Decimal(5,2) -> Decimal(10,2)
- Decimal(5,2) -> Decimal(20,2)
- Decimal(10,2) -> Decimal(12,2)
- Decimal(10,2) -> Decimal(20,2)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…olumns

When Gluten creates HiveTableHandle, it was passing all columns (including partition columns) as dataColumns. This caused Velox's convertType() to validate partition column types against the Parquet file's physical types, failing when they differ (e.g., LongType in the file vs IntegerType from partition inference).

Fix: build dataColumns excluding partition columns (ColumnType::kPartitionKey). Partition column values come from the partition path, not from the file.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
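A rough sketch of the fix, using simplified stand-in types: the real change is in Gluten's native layer, filtering on Velox's ColumnType::kPartitionKey when building the HiveTableHandle.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the column metadata involved; illustrative only.
final class HiveColumns {
  enum ColumnType { REGULAR, PARTITION_KEY }

  static final class Column {
    final String name;
    final ColumnType type;
    Column(String name, ColumnType type) { this.name = name; this.type = type; }
  }

  // Build dataColumns excluding partition columns: their values come from
  // the partition path, so they must never be validated against the
  // Parquet file's physical schema.
  static List<Column> dataColumns(List<Column> allColumns) {
    List<Column> result = new ArrayList<>();
    for (Column c : allColumns) {
      if (c.type != ColumnType.PARTITION_KEY) {
        result.add(c);
      }
    }
    return result;
  }
}
```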
What changes were proposed in this pull request?
WIP: Enable GlutenParquetTypeWideningSuite for Spark 4.0 and 4.1

This PR enables the previously disabled GlutenParquetTypeWideningSuite test suite, which validates Parquet type widening support (SPARK-40876) when using Gluten's Velox native reader.

Background
Parquet reading involves two types of type conversions:

1. Physical→Logical restoration: Parquet uses wide physical containers (INT32, INT64, etc.) with logical annotations. Reading int32 + INT(8) as TINYINT is safe — the writer guarantees values fit within the annotated range.
2. Schema evolution widening: Reading old data with a wider type (e.g., IntegerType → DoubleType, Decimal(5,2) → Decimal(7,2)). This is engine-specific — SPARK-40876 introduced it in Spark 4.0.

The original Velox Parquet reader (following Presto's behavior) did not support schema evolution widening for integer→float/double/decimal or decimal precision/scale widening, causing 74 out of 84 tests to fail.
Changes
Velox C++ fixes (in baibaichen/velox feature/enable-parquet-type-widening-suite):

1. Revert OAP commit 16732b4f5, whose convertType() type checks were over-relaxed (allowed INT64→INTEGER narrowing, commented out UTF8/ENUM validation)
2. Extend convertType() to allow REAL/DOUBLE/Decimal for INT_8/16/32/64. Add hasEnoughDecimalPrecision matching Spark's rule (INT32: p-s≥10, INT64: p-s≥20). Add DOUBLE/REAL cases to getIntValues() with decimal scale adjustment.
3. Support Decimal→Decimal widening (same-scale widening via IntegerColumnReader). Support scale rescaling with the precisionIncrease ≥ scaleIncrease rule.
4. Fix SPARK-16632: allow reading INT32 as ByteType/ShortType.

Gluten changes (this PR):
- Add translateException() in ClosableIterator + ColumnarBatchOutIterator to convert Velox type errors to SchemaColumnConvertNotSupportedException
- Override tests to set expectError=true where Velox correctly rejects unsupported conversions

Test Results
The 3 truly excluded tests cover the DELTA_BYTE_ARRAY encoding limitation (2 tests) and a parquet-mr overflow (1 test).
How was this patch tested?
Ran GlutenParquetTypeWideningSuite locally for Spark 4.0, achieving 81 pass / 0 fail / 38 ignored.