Spark 4.1: Optimize ExpireSnapshotsSparkAction with manifest-level filtering #15154

joyhaldar · 2026-01-27T14:32:19Z

This PR optimizes ExpireSnapshotsSparkAction by replacing driver-side collection with distributed Spark operations for manifest filtering.

The previous implementation read content files from ALL manifests in expired snapshots. This change filters at the manifest level first, reading content files only from orphaned manifests, similar to the approach used in ReachableFileCleanup but implemented with distributed Spark operations.

Optimizations

Early exit when no snapshots expired or no orphaned manifests.
Join-based filtering to identify orphaned manifests.
Read content files only from orphaned manifests instead of all expired manifests.

Code Changes

Added emptyFileInfoDS() helper to BaseSparkAction
Changed ReadManifest visibility to protected in BaseSparkAction
Added contentFilesFromManifestDF() method for targeted manifest reading

Before

All Expired Files ──────┐
                        ├──► Except ──► Orphaned Files
All Live Files ─────────┘
     (reads all manifests)

After

                    ┌─► No expired snapshots? ──► Return empty (Exit Early)
                    │
Expired Snapshots ──┤
                    │
                    └─► Find orphaned manifest Paths via except
                              │
                              ├─► No orphaned manifests? ──► Return manifest lists + stats (Exit Early)
                              │
                              └─► Join to get orphaned manifest details
                                        │
                                        └─► Read content files only from orphaned manifests
                                                    │
                                                    └─► Except with live content files ──► Orphaned Files

All existing TestExpireSnapshotsAction tests pass.

Co-authored-by: Joy Haldar <joy.haldar@target.com>

…oin-based filtering

Copilot

Pull request overview

This PR optimizes ExpireSnapshotsSparkAction by replacing driver-side collection with distributed Spark operations for manifest filtering. Instead of reading content files from all manifests in expired snapshots, the implementation now filters at the manifest level first using join-based operations, then reads content files only from orphaned manifests.

Changes:

Added early exit paths when no snapshots are expired or no orphaned manifests exist
Implemented distributed join-based filtering to identify orphaned manifests before reading their content files
Refactored helper methods in BaseSparkAction to support the new distributed approach

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
ExpireSnapshotsSparkAction.java	Replaced driver-side collection logic with distributed Spark operations for manifest-level filtering and added `contentFilesFromManifestDF()` method
BaseSparkAction.java	Added `emptyFileInfoDS()` helper method and changed `ReadManifest` visibility to `protected`
TestExpireSnapshotsAction.java	Updated expected job count in `testUseLocalIterator()` test from 4 to 12

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-27T16:28:56Z

spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/actions/TestExpireSnapshotsAction.java

              .as(
                  "Expected total number of jobs with stream-results should match the expected number")
-              .isEqualTo(4L);
+              .isEqualTo(12L);


The expected job count increased from 4 to 12 due to the new distributed operations. Consider adding a comment explaining why this specific count is expected, or add a test case that validates the optimization logic (e.g., verifying early exits when no orphaned manifests exist).

Copilot · 2026-01-27T16:28:56Z

spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/actions/ExpireSnapshotsSparkAction.java

+      Dataset<FileInfo> liveStats = statisticsFileDS(updatedTable, null);
+      Dataset<FileInfo> orphanedStats = expiredStats.except(liveStats);
+
+      if (orphanedManifestPaths.isEmpty()) {


Using isEmpty() on a Dataset triggers a Spark action that collects data to the driver. Consider using first() wrapped in a try-catch or take(1).length == 0 to avoid potentially expensive operations when checking if a dataset is empty.

Suggested change

if (orphanedManifestPaths.isEmpty()) {

boolean hasOrphanedManifestPaths = orphanedManifestPaths.limit(1).toLocalIterator().hasNext();

if (!hasOrphanedManifestPaths) {

joyhaldar and others added 3 commits January 26, 2026 19:07

Spark: Optimize ExpireSnapshotsSparkAction with manifest-level pruning

987b454

Co-authored-by: Joy Haldar <joy.haldar@target.com>

Optimize ExpireSnapshotsSparkAction by replacing collectAsList with j…

6fcf4c3

…oin-based filtering

Optimize ExpireSnapshotsSparkAction by replacing collectAsList with j…

323b7ac

…oin-based filtering

github-actions bot added the spark label Jan 27, 2026

manuzhang changed the title ~~Spark: Optimize ExpireSnapshotsSparkAction with manifest-level filtering~~ Spark 4.1: Optimize ExpireSnapshotsSparkAction with manifest-level filtering Jan 27, 2026

manuzhang requested a review from Copilot January 27, 2026 16:27

Copilot AI reviewed Jan 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spark 4.1: Optimize ExpireSnapshotsSparkAction with manifest-level filtering #15154

Spark 4.1: Optimize ExpireSnapshotsSparkAction with manifest-level filtering #15154

joyhaldar commented Jan 27, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Jan 27, 2026

Uh oh!

Copilot AI Jan 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	if (orphanedManifestPaths.isEmpty()) {
	boolean hasOrphanedManifestPaths = orphanedManifestPaths.limit(1).toLocalIterator().hasNext();
	if (!hasOrphanedManifestPaths) {

Spark 4.1: Optimize ExpireSnapshotsSparkAction with manifest-level filtering #15154

Are you sure you want to change the base?

Spark 4.1: Optimize ExpireSnapshotsSparkAction with manifest-level filtering #15154

Conversation

joyhaldar commented Jan 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

joyhaldar commented Jan 27, 2026 •

edited

Loading