Skip to content

Backport: Add a config to disable R2C Host retry (#14373)#14409

Merged
GaryShen2008 merged 1 commit intoNVIDIA:release/26.02from
thirtiseven:backport_r2c_config
Mar 13, 2026
Merged

Backport: Add a config to disable R2C Host retry (#14373)#14409
GaryShen2008 merged 1 commit intoNVIDIA:release/26.02from
thirtiseven:backport_r2c_config

Conversation

@thirtiseven
Copy link
Collaborator

Contributes to #14368

Description

This pr add a config spark.rapids.sql.rowToColumnar.retry.enabled to disable the per-row retry introduced in #13842, which caused performance issues in some cases.

Also added a test.

Checklists

  • This PR has added documentation for new or modified features or behaviors.
  • This PR has added new tests or modified existing tests to cover new code paths.
    (Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
  • Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

Contributes to NVIDIA#14368

### Description

This pr add a config `spark.rapids.sql.rowToColumnar.retry.enabled` to
disable the per-row retry introduced in NVIDIA#13842, which caused performance
issues in some cases.

Also added a test.

### Checklists

- [ ] This PR has added documentation for new or modified features or
behaviors.
- [x] This PR has added new tests or modified existing tests to cover
new code paths.
(Please explain in the PR description how the new code paths are tested,
such as names of the new/existing tests that cover them.)
- [ ] Performance testing has been performed and its results are added
in the PR description. Or, an issue has been filed with a link in the PR
description.

---------

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>
@thirtiseven thirtiseven self-assigned this Mar 12, 2026
@thirtiseven
Copy link
Collaborator Author

build

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 12, 2026

Greptile Summary

This PR adds an internal config spark.rapids.sql.rowToColumnar.retry.enabled (defaulting to false) that disables the per-row retry wrapper introduced in #13842 for RowToColumnarIterator, addressing a reported performance regression. When disabled, each row is converted directly without a withRetryNoSplit/withRestoreOnRetry wrapper, removing per-row overhead at the cost of propagating host OOM exceptions directly to the task level. The builders.tryBuild(rowCount) call remains wrapped in withRetryNoSplit in both paths, so GPU OOM at the build step is still retried.

  • RowToColumnarIterator gains an enableRetry: Boolean = false constructor parameter that gates the two conversion branches.
  • Both production call sites (GpuRowToColumnarExec and ParquetCachedBatchSerializer) now read rapidsConf.isR2cRetryEnabled and forward it to the iterator.
  • Four new tests are added: CPU OOM without retry (propagates), CPU OOM with retry (recovers), multi-batch retry correctness, and a single-batch recovery case using a custom rowThatFailsOnceWithCpuRetryOOM helper that throws exactly once.
  • Existing retry tests are updated to pass enableRetry = true explicitly to preserve original behaviour coverage.

Confidence Score: 5/5

  • Safe to merge — the change is gated behind an internal config that defaults to the new (retry-disabled) behaviour, and all production call sites have been updated.
  • The implementation is logically correct: the non-retry path removes only the per-row OOM wrapper while keeping the final builders.tryBuild retry intact. Both production callers read and forward the new config. The new tests directly validate all four scenarios (retry on/off × OOM recovery/propagation). No pre-existing retry tests have been broken; they were updated to pass enableRetry = true explicitly. The config is marked .internal() so no user-facing documentation update is strictly required.
  • No files require special attention.

Important Files Changed

Filename Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuRowToColumnarExec.scala Adds enableRetry: Boolean = false parameter to RowToColumnarIterator, splitting buildBatch() into two branches: the existing per-row retry path and a new direct-conversion path. Removed a leftover TODO comment. Logic is sound; the builders.tryBuild(rowCount) remains wrapped in withRetryNoSplit in both paths.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala Adds ENABLE_R2C_RETRY internal config spark.rapids.sql.rowToColumnar.retry.enabled defaulting to false, and the corresponding isR2cRetryEnabled lazy accessor. Documentation in the .doc() call correctly describes the trade-off.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/parquet/ParquetCachedBatchSerializer.scala Reads enableR2cRetry from rapidsConf and forwards it to RowToColumnarIterator. The change correctly threads the new config through the Parquet batch-serialization path.
tests/src/test/scala/com/nvidia/spark/rapids/RowToColumnarIteratorRetrySuite.scala Updates existing tests to pass enableRetry = true explicitly, and adds four new tests covering CPU OOM with and without retry (single-batch and multi-batch). The rowThatFailsOnceWithCpuRetryOOM helper correctly uses a per-instance var failed so each returned row has independent throw-once semantics.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["RowToColumnarIterator.buildBatch()"] --> B{enableRetry?}

    B -- true --> C["Create RetryableRowConverter\n(with UnsafeProjection)"]
    C --> D["converter.attempt(rowIter.next())"]
    D --> E["withRetryNoSplit {\n  withRestoreOnRetry(converter) {\n    converters.convert(currentRow, builders)\n  }\n}"]
    E -- OOM --> F["converter.restore()\nRollback builders + copy row"]
    F --> E
    E -- success --> G["byteCount += bytesWritten\nrowCount += 1"]

    B -- false --> H["val row = rowIter.next()"]
    H --> I["converters.convert(row, builders)"]
    I -- OOM --> J["CpuRetryOOM propagates\n(task may fail)"]
    I -- success --> K["byteCount += bytes\nrowCount += 1"]

    G --> L{more rows\nwithin target?}
    K --> L
    L -- yes --> D
    L -- yes --> H
    L -- no --> M["withRetryNoSplit {\n  builders.tryBuild(rowCount)\n}"]
    M --> N["return ColumnarBatch"]
Loading

Last reviewed commit: a3f34e9

@sameerz sameerz added the performance A performance related task/issue label Mar 12, 2026
@GaryShen2008 GaryShen2008 merged commit cb70faf into NVIDIA:release/26.02 Mar 13, 2026
47 checks passed
@thirtiseven thirtiseven deleted the backport_r2c_config branch March 13, 2026 01:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance A performance related task/issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants