Skip to content

Conversation

@joyhaldar
Copy link
Contributor

@joyhaldar joyhaldar commented Jan 27, 2026

This PR optimizes ExpireSnapshotsSparkAction by replacing driver-side collection with distributed Spark operations for manifest filtering.

The previous implementation read content files from ALL manifests in expired snapshots. This change filters at the manifest level first, reading content files only from orphaned manifests, similar to the approach used in ReachableFileCleanup but implemented with distributed Spark operations.

Optimizations

  1. Early exit when no snapshots expired or no orphaned manifests.
  2. Join-based filtering to identify orphaned manifests.
  3. Read content files only from orphaned manifests instead of all expired manifests.

Code Changes

  • Added emptyFileInfoDS() helper to BaseSparkAction
  • Changed ReadManifest visibility to protected in BaseSparkAction
  • Added contentFilesFromManifestDF() method for targeted manifest reading

Before

All Expired Files ──────┐
                        ├──► Except ──► Orphaned Files
All Live Files ─────────┘
     (reads all manifests)

After

                    ┌─► No expired snapshots? ──► Return empty (Exit Early)
                    │
Expired Snapshots ──┤
                    │
                    └─► Find orphaned manifest Paths via except
                              │
                              ├─► No orphaned manifests? ──► Return manifest lists + stats (Exit Early)
                              │
                              └─► Join to get orphaned manifest details
                                        │
                                        └─► Read content files only from orphaned manifests
                                                    │
                                                    └─► Except with live content files ──► Orphaned Files

All existing TestExpireSnapshotsAction tests pass.

@github-actions github-actions bot added the spark label Jan 27, 2026
@manuzhang manuzhang changed the title Spark: Optimize ExpireSnapshotsSparkAction with manifest-level filtering Spark 4.1: Optimize ExpireSnapshotsSparkAction with manifest-level filtering Jan 27, 2026
@manuzhang manuzhang requested a review from Copilot January 27, 2026 16:27
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes ExpireSnapshotsSparkAction by replacing driver-side collection with distributed Spark operations for manifest filtering. Instead of reading content files from all manifests in expired snapshots, the implementation now filters at the manifest level first using join-based operations, then reads content files only from orphaned manifests.

Changes:

  • Added early exit paths when no snapshots are expired or no orphaned manifests exist
  • Implemented distributed join-based filtering to identify orphaned manifests before reading their content files
  • Refactored helper methods in BaseSparkAction to support the new distributed approach

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
ExpireSnapshotsSparkAction.java Replaced driver-side collection logic with distributed Spark operations for manifest-level filtering and added contentFilesFromManifestDF() method
BaseSparkAction.java Added emptyFileInfoDS() helper method and changed ReadManifest visibility to protected
TestExpireSnapshotsAction.java Updated expected job count in testUseLocalIterator() test from 4 to 12

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

.as(
"Expected total number of jobs with stream-results should match the expected number")
.isEqualTo(4L);
.isEqualTo(12L);
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expected job count increased from 4 to 12 due to the new distributed operations. Consider adding a comment explaining why this specific count is expected, or add a test case that validates the optimization logic (e.g., verifying early exits when no orphaned manifests exist).

Copilot uses AI. Check for mistakes.
Dataset<FileInfo> liveStats = statisticsFileDS(updatedTable, null);
Dataset<FileInfo> orphanedStats = expiredStats.except(liveStats);

if (orphanedManifestPaths.isEmpty()) {
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using isEmpty() on a Dataset triggers a Spark action that collects data to the driver. Consider using first() wrapped in a try-catch or take(1).length == 0 to avoid potentially expensive operations when checking if a dataset is empty.

Suggested change
if (orphanedManifestPaths.isEmpty()) {
boolean hasOrphanedManifestPaths = orphanedManifestPaths.limit(1).toLocalIterator().hasNext();
if (!hasOrphanedManifestPaths) {

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant