WIP: Enable GlutenParquetTypeWideningSuite for Spark 4.0 and 4.1 #11670

Draft
baibaichen wants to merge 5 commits into apache:main from baibaichen:feature/enable-parquet-type-widening-suite
@baibaichen
Contributor

What changes were proposed in this pull request?

WIP: Enable GlutenParquetTypeWideningSuite for Spark 4.0 and 4.1

This PR enables the previously disabled GlutenParquetTypeWideningSuite test suite, which validates Parquet type widening support (SPARK-40876) when using Gluten's Velox native reader.

Background

Parquet reading involves two types of type conversions:

  1. Physical→Logical restoration: Parquet uses wide physical containers (INT32, INT64, etc.) with logical annotations. Reading int32 + INT(8) as TINYINT is safe — the writer guarantees values fit within the annotated range.

  2. Schema evolution widening: Reading old data with a wider type (e.g., IntegerType → DoubleType, Decimal(5,2) → Decimal(7,2)). This is engine-specific — SPARK-40876 introduced this in Spark 4.0.
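The difference between the two conversion kinds can be contrasted in a short Java sketch. The class and method names below are hypothetical and do not come from Velox, Gluten, or Spark. The first method restores a narrow logical value from its wide physical container. The second widens a value to a type the writer never declared.

```java
// Hypothetical illustration only; not Velox, Gluten, or Spark code.
final class ConversionKindsSketch {

    // Physical-to-logical restoration: an INT32 container annotated INT(8)
    // is read back as a byte. The writer is expected to keep values in
    // range, so the guard below should never fire for a well-formed file.
    static byte restoreTinyInt(int physical) {
        if (physical < Byte.MIN_VALUE || physical > Byte.MAX_VALUE) {
            throw new IllegalArgumentException(
                "value outside INT(8) range: " + physical);
        }
        return (byte) physical;
    }

    // Schema evolution widening: the file holds INT32 but the reader asks
    // for a double. Every 32-bit integer is exactly representable as a
    // double, so the conversion loses no information.
    static double widenIntToDouble(int physical) {
        return physical;
    }
}
```

Restoration only undoes a storage choice made by the writer, whereas widening is a policy decision the reading engine has to opt into.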

The original Velox Parquet reader (following Presto's behavior) did not support schema evolution widening for integer→float/double/decimal or decimal precision/scale widening, causing 74 out of 84 tests to fail.

Changes

Velox C++ fixes (in baibaichen/velox feature/enable-parquet-type-widening-suite):

| Commit | Description |
|---|---|
| 1. Revert OAP 16732b4f5 | Restore strict convertType() type checks that were over-relaxed (allowed INT64→INTEGER narrowing, commented out UTF8/ENUM validation) |
| 2. INT type widening + precision check | Extend convertType() to allow REAL/DOUBLE/Decimal for INT_8/16/32/64. Add hasEnoughDecimalPrecision matching Spark's rule (INT32: p-s≥10, INT64: p-s≥20). Add DOUBLE/REAL cases to getIntValues() with decimal scale adjustment. |
| 3. Decimal→Decimal widening | Fix same-scale precision widening (skip double-scaling in IntegerColumnReader). Support scale rescaling with precisionIncrease ≥ scaleIncrease rule. |
| 4. SPARK-16632 fix | Allow reading INT32 as ByteType/ShortType in INT_16/INT_32/Physical INT32 cases |
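The two acceptance rules named in commits 2 and 3 can be written as small predicates. This is a sketch in Java rather than the Velox C++ code. The name hasEnoughDecimalPrecision is taken from the description above, but its signature here, the class name, and canWidenDecimal are assumptions made for illustration.

```java
// Hypothetical sketch of the widening rules; signatures are assumed.
final class WideningRulesSketch {

    // Integer to decimal: the target must keep enough digits left of the
    // decimal point, i.e. precision - scale >= 10 for INT32 and >= 20 for
    // INT64 (the thresholds quoted in the PR description).
    static boolean hasEnoughDecimalPrecision(int physicalBits, int precision, int scale) {
        int requiredIntegerDigits;
        if (physicalBits == 32) {
            requiredIntegerDigits = 10;
        } else if (physicalBits == 64) {
            requiredIntegerDigits = 20;
        } else {
            throw new IllegalArgumentException("unsupported width: " + physicalBits);
        }
        return precision - scale >= requiredIntegerDigits;
    }

    // Decimal to decimal: the scale may not shrink, and precision must grow
    // at least as much as scale so no integer digits are lost.
    static boolean canWidenDecimal(int fromPrecision, int fromScale,
                                   int toPrecision, int toScale) {
        int precisionIncrease = toPrecision - fromPrecision;
        int scaleIncrease = toScale - fromScale;
        return scaleIncrease >= 0 && precisionIncrease >= scaleIncrease;
    }
}
```

Under these predicates, Decimal(5,2) → Decimal(7,2) and Decimal(5,2) → Decimal(7,4) are accepted, while Decimal(5,2) → Decimal(6,4) is rejected because scale grows by 2 but precision only by 1.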

Gluten changes (this PR):

| Commit | Description |
|---|---|
| 1. Exception translation | Add translateException() in ClosableIterator + ColumnarBatchOutIterator to convert Velox type errors to SchemaColumnConvertNotSupportedException |
| 2. Enable TypeWideningSuite | Enable suite in VeloxTestSettings (spark40+41), override GlutenParquetTypeWideningSuite to disable native writer and set expectError=true for tests where Velox correctly rejects unsupported conversions |
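The exception translation step might look roughly like the following. This is a self-contained sketch, not the code in this PR: the two exception classes are stand-ins for GlutenException and Spark's SchemaColumnConvertNotSupportedException, and the iterator class names are hypothetical. Only the method name translateException() and the two message fragments come from the commit message.

```java
// Stand-in for GlutenException (assumed shape, for illustration only).
final class NativeFailureStandIn extends RuntimeException {
    NativeFailureStandIn(Throwable cause) { super(cause); }
}

// Stand-in for Spark's SchemaColumnConvertNotSupportedException.
final class SchemaConvertStandIn extends RuntimeException {
    SchemaConvertStandIn(String message, Throwable cause) { super(message, cause); }
}

// Hypothetical base iterator: the default translation preserves the
// existing behavior by wrapping every native error the same way.
class BaseIteratorSketch {
    public RuntimeException translateException(Exception e) {
        return new NativeFailureStandIn(e);
    }
}

// Hypothetical batch iterator: recognizes Velox type-mapping errors by
// message fragment and maps them to the schema conversion exception.
class BatchIteratorSketch extends BaseIteratorSketch {
    @Override
    public RuntimeException translateException(Exception e) {
        String message = e.getMessage();
        if (message != null
                && (message.contains("not allowed for requested type")
                    || message.contains("Not a valid type for"))) {
            return new SchemaConvertStandIn(message, e);
        }
        return super.translateException(e);
    }
}
```

Matching on message text is brittle if the native error wording changes, so a design like this relies on the test suite to catch drift.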

Test Results

| Status | Count | Details |
|---|---|---|
| Pass | 81 | 10 original + 13 INT widening + 10 Decimal widening + 2 SPARK-16632 + 11 error-path + 35 overrides |
| Ignored | 38 | 35 overrides (actually passing with different error assertion) + 3 truly excluded |
| Fail | 0 | |

The 3 truly excluded tests:

  • 2× DELTA_BYTE_ARRAY encoding: Velox doesn't support this encoding for FIXED_LEN_BYTE_ARRAY (orthogonal to type widening)
  • 1× parquet-mr decimal narrowing overflow→null: Cannot reproduce with Velox native reader

How was this patch tested?

Ran GlutenParquetTypeWideningSuite locally for Spark 4.0, achieving 81 pass / 0 fail / 38 ignored.

@github-actions bot added the CORE (works for Gluten Core), BUILD, and VELOX labels Feb 27, 2026
@github-actions

Run Gluten Clickhouse CI on x86

@baibaichen baibaichen force-pushed the feature/enable-parquet-type-widening-suite branch from 4e3ce3c to ab1d6ad Compare February 28, 2026 11:40
@baibaichen baibaichen force-pushed the feature/enable-parquet-type-widening-suite branch from ab1d6ad to ee0c919 Compare February 28, 2026 13:36
@baibaichen baibaichen force-pushed the feature/enable-parquet-type-widening-suite branch from ee0c919 to 9e79ca7 Compare February 28, 2026 14:10
@baibaichen baibaichen force-pushed the feature/enable-parquet-type-widening-suite branch from 83b9bb6 to 7250a63 Compare March 2, 2026 02:52

baibaichen and others added 5 commits March 2, 2026 06:07
…rtedException

Add exception translation in Gluten's iterator chain so that Velox native
reader type conversion errors are properly translated to Spark's expected
SchemaColumnConvertNotSupportedException.

Changes:
- ClosableIterator.java: Extract translateException() virtual method
  (default returns GlutenException, preserving existing behavior)
- ColumnarBatchOutIterator.java: Override translateException() to detect
  Velox type mapping errors ('not allowed for requested type' or
  'Not a valid type for') and wrap them as
  SchemaColumnConvertNotSupportedException

This enables Spark's ParquetTypeWideningSuite error-path tests to pass
when using Gluten's Velox native reader.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Enable the previously disabled GlutenParquetTypeWideningSuite with Velox
backend fixes for Parquet type widening (SPARK-40876).

Test suite: 81 pass, 0 fail, 38 ignored (from 74 failures)

Changes:
- VeloxTestSettings.scala (spark40+41): Enable suite with targeted excludes
  for DELTA_BYTE_ARRAY encoding limitation (2) and parquet-mr overflow (1)
- GlutenParquetTypeWideningSuite.scala (spark40+41): Override test class to
  disable native writer (test read-path only) and override 35 tests that
  need expectError=true for both reader configs (Velox always uses native
  reader regardless of vectorized setting)
- get-velox.sh: Point to Velox branch with type widening support

Velox fixes (in baibaichen/velox feature/enable-parquet-type-widening-suite):
1. Revert OAP commit that over-relaxed convertType() type checks
2. Support INT->DOUBLE/REAL/DECIMAL widening + decimal precision check
3. Support Decimal->Decimal widening (same-scale + scale rescaling)
4. Fix SPARK-16632: Allow reading INT32 as ByteType/ShortType

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…widening overrides

Spark 4.1 adds 6 new tests for decimal precision+scale widening where
precisionIncrease >= scaleIncrease >= 0. Velox already supports these
conversions, so they should NOT be in the 'expect error' override list.

Remove these 6 cases from the spark41 override:
- Decimal(5,2) -> Decimal(7,4)
- Decimal(5,2) -> Decimal(10,7)
- Decimal(5,2) -> Decimal(20,17)
- Decimal(10,2) -> Decimal(12,4)
- Decimal(10,2) -> Decimal(20,12)
- Decimal(20,2) -> Decimal(22,4)
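
For conversions like those listed, the stored unscaled integer has to be multiplied by a power of ten equal to the scale increase. The Java sketch below illustrates that arithmetic only. The class and method names are hypothetical, and it is not the Velox reader code.

```java
import java.math.BigInteger;

// Hypothetical sketch of decimal scale widening on unscaled values.
final class DecimalRescaleSketch {

    // Converts an unscaled value stored at fromScale to the equivalent
    // unscaled value at toScale. A scale increase of 0 (same-scale
    // precision widening) leaves the value unchanged.
    static BigInteger rescaleUnscaled(BigInteger unscaled, int fromScale, int toScale) {
        int scaleIncrease = toScale - fromScale;
        if (scaleIncrease < 0) {
            throw new IllegalArgumentException("scale narrowing is not a widening");
        }
        return unscaled.multiply(BigInteger.TEN.pow(scaleIncrease));
    }
}
```

For example, 123.45 stored as Decimal(5,2) has unscaled value 12345. Read as Decimal(7,4), it becomes 1234500, which still denotes 123.4500.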

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove 5 excludes for Decimal->Decimal same-scale precision widening
tests that are now supported by Velox commit 3. These tests were
previously excluded with comment 'Velox reads wrong data' but the
Decimal->Decimal widening fix resolved the issue.

Un-excluded tests:
- Decimal(5,2) -> Decimal(7,2)
- Decimal(5,2) -> Decimal(10,2)
- Decimal(5,2) -> Decimal(20,2)
- Decimal(10,2) -> Decimal(12,2)
- Decimal(10,2) -> Decimal(20,2)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…olumns

When Gluten creates HiveTableHandle, it was passing all columns (including
partition columns) as dataColumns. This caused Velox's convertType() to
validate partition column types against the Parquet file's physical types,
failing when they differ (e.g., LongType in file vs IntegerType from
partition inference).

Fix: build dataColumns excluding partition columns (ColumnType::kPartitionKey).
Partition column values come from the partition path, not from the file.
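
The filtering idea can be sketched in Java as follows. The actual fix lives in Gluten's native code that builds the HiveTableHandle. The types below are hypothetical stand-ins, with the enum constant PARTITION_KEY playing the role of ColumnType::kPartitionKey.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: keep only columns that are physically in the file.
final class DataColumnsSketch {

    enum ColumnKind { REGULAR, PARTITION_KEY }

    static final class Column {
        final String name;
        final ColumnKind kind;

        Column(String name, ColumnKind kind) {
            this.name = name;
            this.kind = kind;
        }
    }

    // Returns the names of columns to validate against the file schema.
    // Partition keys are skipped because their values come from the
    // partition path rather than from the Parquet file.
    static List<String> dataColumnNames(List<Column> allColumns) {
        List<String> names = new ArrayList<>();
        for (Column column : allColumns) {
            if (column.kind != ColumnKind.PARTITION_KEY) {
                names.add(column.name);
            }
        }
        return names;
    }
}
```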

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@baibaichen baibaichen force-pushed the feature/enable-parquet-type-widening-suite branch from 7250a63 to 5d22ba0 Compare March 2, 2026 06:07