Spark: Add streaming-overwrite-mode option for handling OVERWRITE snapshots #15152
+1,491
−37
Context
This PR addresses a long-standing feature request for handling OVERWRITE snapshots in Spark Structured Streaming.
Related issues and PRs:
Issue #2788 - Original feature request by @SreeramGarlapati
PR #2944 - Format version-aware approach by @tprelle
PR #7295 - Enum-based approach by @karim-ramadan
This implementation builds on the ideas from both previous PRs, adopting the enum-based design from #7295 while maintaining backward compatibility with the existing streaming-skip-overwrite-snapshots option.
Summary
This PR adds a new streaming-overwrite-mode option that provides more flexibility for handling OVERWRITE snapshots during Spark Structured Streaming reads. While users today typically use streaming-skip-overwrite-snapshots=true to skip these snapshots entirely, this PR introduces an added-files-only mode that allows processing the added files from OVERWRITE snapshots instead of skipping them.
Motivation
Tables frequently undergo operations that produce OVERWRITE snapshots:
INSERT OVERWRITE to specific partitions
MERGE INTO / UPDATE / DELETE operations
Today, users handle this by setting streaming-skip-overwrite-snapshots=true, which skips these snapshots entirely. However, this means any new data added during these operations is missed by the stream.
This PR gives users a third option: process only the added files from OVERWRITE snapshots, allowing streams to capture new data from these operations.
Changes
New option: streaming-overwrite-mode with three modes:
fail
skip
added-files-only
Backward compatibility: streaming-skip-overwrite-snapshots=true maps to streaming-overwrite-mode=skip
Usage
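The original usage snippet did not survive extraction, so the following is only a minimal PySpark-style sketch of how the option described above might be set on a streaming read. The option names and the three mode values come from this PR. The helper names (`resolve_overwrite_mode`, `read_iceberg_stream`), the default of `fail` when nothing is set, and the rule that an explicit mode takes precedence over the legacy flag are assumptions made for illustration, not confirmed behavior of the implementation.

```python
# Mode values as listed in this PR description.
VALID_MODES = ("fail", "skip", "added-files-only")

MODE_OPTION = "streaming-overwrite-mode"
LEGACY_OPTION = "streaming-skip-overwrite-snapshots"


def resolve_overwrite_mode(options):
    """Return the effective overwrite mode for a dict of read options.

    Assumptions (not confirmed by the PR text): an explicit mode wins
    over the legacy flag, and the default is "fail".
    """
    mode = options.get(MODE_OPTION)
    if mode is not None:
        if mode not in VALID_MODES:
            raise ValueError(f"Unknown {MODE_OPTION}: {mode!r}")
        return mode
    # Backward compatibility stated in the PR: legacy flag true -> skip.
    if str(options.get(LEGACY_OPTION, "false")).lower() == "true":
        return "skip"
    return "fail"


def read_iceberg_stream(spark, table, mode):
    """Start a streaming read of an Iceberg table with the given mode.

    `spark` is an existing SparkSession; `table` is a table identifier
    such as "db.events" (a hypothetical name).
    """
    if mode not in VALID_MODES:
        raise ValueError(f"Unknown {MODE_OPTION}: {mode!r}")
    return (
        spark.readStream.format("iceberg")
        .option(MODE_OPTION, mode)
        .load(table)
    )
```

With a live session, `read_iceberg_stream(spark, "db.events", "added-files-only")` would return a streaming DataFrame that can be passed to `writeStream` as usual.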
Warning for added-files-only mode
This mode may produce duplicate records when overwrites rewrite existing data (e.g., MERGE, UPDATE, DELETE). Downstream processing must handle duplicates (e.g., idempotent writes, deduplication).
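One way to picture the idempotent-write advice is a keyed upsert: if the sink stores rows by a stable key, replaying a row that an overwrite rewrote into a new file replaces the earlier copy instead of appending a second one. The sketch below uses a plain dictionary as a stand-in sink and assumes each record carries a unique `id` field. Both the sink and the field name are hypothetical and not part of this PR.

```python
def apply_batch(sink, rows, key="id"):
    """Upsert each row into `sink`, keyed by `row[key]`.

    Writing the same key twice overwrites the earlier value, so a
    record re-emitted by an OVERWRITE snapshot does not create a
    duplicate entry. `sink` is a dict standing in for a real store.
    """
    for row in rows:
        sink[row[key]] = row
    return sink


sink = {}
# First micro-batch: two new rows from an append.
apply_batch(sink, [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}])
# Second micro-batch: an overwrite rewrote the file holding id=2 and
# added id=3, so the stream re-emits id=2 alongside the new row.
apply_batch(sink, [{"id": 2, "v": "b"}, {"id": 3, "v": "c"}])
```

After both batches the sink holds exactly three entries. An append-only sink would hold four in the same scenario, which is the duplication the warning describes.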
Testing