Skip to content

Conversation

@sergiomartinswhg
Copy link

Context

This PR addresses a long-standing feature request for handling OVERWRITE snapshots in Spark Structured Streaming.

Related issues and PRs:

Issue #2788 - Original feature request by @SreeramGarlapati
PR #2944 - Format version-aware approach by @tprelle
PR #7295 - Enum-based approach by @karim-ramadan

This implementation builds on the ideas from both previous PRs, adopting the enum-based design from #7295 while maintaining backward compatibility with the existing streaming-skip-overwrite-snapshots option.

Summary

This PR adds a new streaming-overwrite-mode option that provides more flexibility for handling OVERWRITE snapshots during Spark Structured Streaming reads. While users today typically use streaming-skip-overwrite-snapshots=true to skip these snapshots entirely, this PR introduces an added-files-only mode that allows processing the added files from OVERWRITE snapshots instead of skipping them.

Motivation

Tables frequently undergo operations that produce OVERWRITE snapshots:

  • INSERT OVERWRITE to specific partitions
  • MERGE INTO / UPDATE / DELETE operations

Today, users handle this by setting streaming-skip-overwrite-snapshots=true, which skips these snapshots entirely. However, this means any new data added during these operations is missed by the stream.

This PR gives users a third option: process only the added files from OVERWRITE snapshots, allowing streams to capture new data from these operations.

Changes

New option: streaming-overwrite-mode with three modes:

Mode Behavior
fail Throws exception on OVERWRITE snapshots (default)
skip Ignores OVERWRITE snapshots entirely
added-files-only Processes only added files from OVERWRITE snapshots

Backward compatibility:

  • streaming-skip-overwrite-snapshots=true maps to streaming-overwrite-mode=skip
  • New option takes precedence when both are specified
  • Deprecation warning logged when legacy option is used

Usage

spark.readStream()
    .format("iceberg")
    .option("streaming-overwrite-mode", "added-files-only")
    .load("catalog.db.table")

Warning for added-files-only mode

This mode may produce duplicate records when overwrites rewrite existing data (e.g., MERGE, UPDATE, DELETE). Downstream processing must handle duplicates (e.g., idempotent writes, deduplication).

Testing

  • Unit tests for StreamingOverwriteMode enum parsing
  • Integration tests for all three modes across Spark 3.4, 3.5, 4.0, and 4.1
  • Tests verify backward compatibility with legacy option

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant