[GLUTEN-11683][VL] Add Parquet type widening support #11719

Draft
baibaichen wants to merge 8 commits into apache:main from baibaichen:pr3/parquet-type-widening
Conversation

@baibaichen
Contributor

What changes were proposed in this pull request?

Add Parquet type widening support to Velox and enable 80 of 84 tests in GlutenParquetTypeWideningSuite.

Changes

1. Point Velox to the type-widening branch (`get-velox.sh`):
   Use the `baibaichen/pr3/parquet-type-widening` Velox branch, which adds INT→Decimal, INT→Double, and Float→Double widening support.

2. Update `VeloxTestSettings` (spark40 + spark41):
   Remove 15 excludes for widening tests that now pass.

3. Disable the native writer (`GlutenParquetTypeWideningSuite.scala`):
   This suite tests the READ path only. Disabling the native writer lets Spark's writer produce the correct V2 encodings (DELTA_BINARY_PACKED/DELTA_BYTE_ARRAY). Removes 10 more excludes.

4. Fall back to the vanilla reader when vectorized=false (`BasicScanExecTransformer.scala`):
   When `PARQUET_VECTORIZED_READER_ENABLED=false`, fall back to Spark's vanilla parquet-mr reader instead of the Velox native reader. This preserves parquet-mr's behavior (decimal precision narrowing, null on overflow). Removes 34 more excludes.
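The widening matrix enabled by the Velox branch can be sketched as a compatibility check. This is illustrative only — the real logic lives in Velox's C++ Parquet reader, and the type names here are simplified stand-ins, not Gluten's actual classes:

```scala
// Sketch of the Parquet type widenings this PR enables (INT -> Decimal,
// INT -> Double, Float -> Double). Anything else is either already the
// same type or not a widening handled by the native reader.
def canWiden(from: String, to: String): Boolean = (from, to) match {
  case ("int", "decimal") => true
  case ("int", "double")  => true
  case ("float", "double") => true
  // Narrowing (e.g. int -> short) is not widening; those cases go through
  // the vanilla-reader fallback described in change 4 instead.
  case _ => false
}
```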

Test Results

|             | PR2 | PR3      |
|-------------|-----|----------|
| ✅ Passed   | 21  | 80 (+59) |
| ❌ Excluded | 63  | 4 (-59)  |

Remaining 4 excludes: Velox does not support DELTA_BYTE_ARRAY encoding for FIXED_LEN_BYTE_ARRAY decimals.

Depends on #11689 (PR2).
Fixes #11683

How was this patch tested?

Local tests: TypeWideningSuite 80 pass / 4 ignored (spark40 and spark41).

Was this patch authored or co-authored using generative AI tooling?

Yes, co-authored with GitHub Copilot.

@github-actions github-actions bot added the BUILD, CORE (works for Gluten Core), and VELOX labels on Mar 8, 2026
@github-actions

github-actions bot commented Mar 8, 2026

Run Gluten Clickhouse CI on x86

@github-actions

github-actions bot commented Mar 9, 2026

Run Gluten Clickhouse CI on x86

baibaichen and others added 3 commits March 10, 2026 05:34
Replace OAP commit [15173][15343] (INT narrowing) with upstream Velox
PR #15173 (fix reading array of row) to fix parquet-thrift compatibility.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…olumns

When Gluten creates HiveTableHandle, it was passing all columns (including
partition columns) as dataColumns. This caused Velox's convertType() to
validate partition column types against the Parquet file's physical types,
failing when they differ (e.g., LongType in file vs IntegerType from
partition inference).

Fix: build dataColumns excluding partition columns (ColumnType::kPartitionKey).
Partition column values come from the partition path, not from the file.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
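The fix described above amounts to filtering partition columns out of the data-column list before handing it to Velox. A minimal sketch, with hypothetical names (the real change sits in Gluten's HiveTableHandle construction):

```scala
// Partition column values come from the partition path, not the Parquet
// file, so they must not be validated against the file's physical types.
sealed trait ColumnType
case object Regular extends ColumnType
case object PartitionKey extends ColumnType // Velox's ColumnType::kPartitionKey

final case class Column(name: String, columnType: ColumnType)

// Build the dataColumns list by excluding partition-key columns.
def dataColumns(all: Seq[Column]): Seq[Column] =
  all.filter(_.columnType != PartitionKey)
```

With this filter in place, a file storing `LongType` no longer fails validation against an `IntegerType` inferred from the partition path, because the partition column never reaches `convertType()`.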
With the OAP INT-narrowing commit replaced by upstream Velox PR #15173:
- Remove 2 excludes that now pass: LongType->IntegerType, LongType->DateType
- Add 2 excludes for new failures (IntegerType->ShortType), since the OAP commit is removed

Excludes stay at 63 (net unchanged: -2, +2). Test results: 21 pass / 63 ignored.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@baibaichen baibaichen force-pushed the pr3/parquet-type-widening branch from e3c259f to 4f80267 on March 10, 2026 07:37
These tests regress after skipping OAP commit 8c2bd0849 (Allow reading
integers into smaller-range types). They will be re-enabled in PR3 when
Velox widening commits are applied.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@baibaichen baibaichen force-pushed the pr3/parquet-type-widening branch from 4f80267 to abbc057 on March 10, 2026 09:44
baibaichen and others added 4 commits March 10, 2026 14:34
With Velox PR3 type widening (INT->Decimal, INT->Double, Float->Double):
- Remove 15 excludes for widening tests now passing

Remaining 48 excludes.
Test results: 36 pass / 48 ignored.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This suite tests the READ path only. Disable native writer so Spark's
writer produces correct V2 encodings (DELTA_BINARY_PACKED/DELTA_BYTE_ARRAY).
- Remove 10 excludes for decimal widening tests now passing

Remaining 38 excludes:
- 34: Velox native reader rejects incompatible decimal conversions
  regardless of reader config (no parquet-mr fallback)
- 4: Velox does not support DELTA_BYTE_ARRAY encoding

Test results: 46 pass / 38 ignored.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
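For context on why the writer matters here, a rough sketch of which V2 encoding parquet-mr's writer typically picks per Parquet physical type (simplified; the actual selection logic lives inside parquet-mr and depends on writer version and data):

```scala
// Simplified mapping from Parquet physical type to a typical V2 encoding.
// FIXED_LEN_BYTE_ARRAY decimals get DELTA_BYTE_ARRAY, which Velox's reader
// does not yet support -- hence the 4 remaining excludes.
def v2Encoding(physicalType: String): String = physicalType match {
  case "INT32" | "INT64"                     => "DELTA_BINARY_PACKED"
  case "BYTE_ARRAY" | "FIXED_LEN_BYTE_ARRAY" => "DELTA_BYTE_ARRAY"
  case _                                     => "PLAIN" // catch-all for this sketch
}
```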
When PARQUET_VECTORIZED_READER_ENABLED=false, fall back to Spark's vanilla
parquet-mr reader instead of the Velox native reader. This preserves
parquet-mr's behavior (e.g., allowing decimal precision narrowing and returning
null on overflow), which differs from the vectorized reader.

- Remove 34 excludes from GlutenParquetTypeWideningSuite that now pass
  via vanilla reader fallback

Remaining 4 excludes: Velox does not support DELTA_BYTE_ARRAY encoding.

Test results: 80 pass / 4 ignored.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
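The behavioral difference the fallback preserves can be modeled directly: parquet-mr narrows a decimal to a smaller precision and returns null when the value overflows, whereas the native reader rejects such conversions outright. An illustrative sketch using `BigDecimal` (not Spark's actual code path):

```scala
// Model of parquet-mr's decimal-narrowing behavior: rescale, then return
// None ("null on overflow") if the value no longer fits the target precision.
def narrowDecimal(v: BigDecimal, precision: Int, scale: Int): Option[BigDecimal] = {
  val rescaled = v.setScale(scale, BigDecimal.RoundingMode.HALF_UP)
  if (rescaled.precision <= precision) Some(rescaled) else None
}
```

Because the Velox native reader has no equivalent null-on-overflow path, routing the `vectorized=false` case back to the vanilla reader is what lets the 34 narrowing tests pass unchanged.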
@baibaichen baibaichen force-pushed the pr3/parquet-type-widening branch from abbc057 to c2d50e1 on March 10, 2026 14:47
Linked issue: [VL] Support type widening in Parquet reader (SPARK-40876)