Spark 4.1: Optimize ExpireSnapshotsSparkAction with manifest-level filtering #15154
base: main
Co-authored-by: Joy Haldar <joy.haldar@target.com>
…oin-based filtering
Pull request overview
This PR optimizes ExpireSnapshotsSparkAction by replacing driver-side collection with distributed Spark operations for manifest filtering. Instead of reading content files from all manifests in expired snapshots, the implementation now filters at the manifest level first using join-based operations, then reads content files only from orphaned manifests.
Changes:
- Added early exit paths when no snapshots are expired or no orphaned manifests exist
- Implemented distributed join-based filtering to identify orphaned manifests before reading their content files
- Refactored helper methods in BaseSparkAction to support the new distributed approach
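As a rough illustration of the manifest-level filtering described above, the following plain-Java sketch computes orphaned manifests as a set difference and exits early when none exist. It is not the PR's code: the PR operates on Spark Datasets with joins, whereas this version runs in memory, and the class name, method name, and manifest paths are all hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// In-memory illustration only; the actual change uses distributed Spark joins.
final class OrphanedManifestSketch {

  // Hypothetical helper: manifests referenced by expired snapshots, minus
  // manifests still referenced by snapshots that remain after expiration.
  static Set<String> orphanedManifests(
      List<String> expiredManifestPaths, Set<String> liveManifestPaths) {
    Set<String> orphaned = new LinkedHashSet<>(expiredManifestPaths);
    orphaned.removeAll(liveManifestPaths);
    return orphaned;
  }

  public static void main(String[] args) {
    // Made-up paths for demonstration.
    List<String> expired = List.of("m1.avro", "m2.avro", "m3.avro");
    Set<String> live = Set.of("m2.avro");

    Set<String> orphaned = orphanedManifests(expired, live);
    if (orphaned.isEmpty()) {
      // Early exit: no manifest needs to be opened at all.
      System.out.println("No orphaned manifests; nothing to read.");
      return;
    }
    // Only these manifests would need their content files read.
    System.out.println("Manifests to read: " + orphaned);
  }
}
```

The point of the ordering is that the comparatively small list of manifest paths is filtered first, so content files are read only from manifests that no live snapshot references.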
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| ExpireSnapshotsSparkAction.java | Replaced driver-side collection logic with distributed Spark operations for manifest-level filtering and added contentFilesFromManifestDF() method |
| BaseSparkAction.java | Added emptyFileInfoDS() helper method and changed ReadManifest visibility to protected |
| TestExpireSnapshotsAction.java | Updated expected job count in testUseLocalIterator() test from 4 to 12 |
| .as( | ||
| "Expected total number of jobs with stream-results should match the expected number") | ||
| .isEqualTo(4L); | ||
| .isEqualTo(12L); |
Copilot AI, Jan 27, 2026
The expected job count increased from 4 to 12 due to the new distributed operations. Consider adding a comment explaining why this specific count is expected, or add a test case that validates the optimization logic (e.g., verifying early exits when no orphaned manifests exist).
| Dataset<FileInfo> liveStats = statisticsFileDS(updatedTable, null); | ||
| Dataset<FileInfo> orphanedStats = expiredStats.except(liveStats); | ||
|
|
||
| if (orphanedManifestPaths.isEmpty()) { |
Copilot AI, Jan 27, 2026
Using isEmpty() on a Dataset triggers a Spark action that collects data to the driver. Consider using first() wrapped in a try-catch or take(1).length == 0 to avoid potentially expensive operations when checking if a dataset is empty.
| if (orphanedManifestPaths.isEmpty()) { | |
| boolean hasOrphanedManifestPaths = orphanedManifestPaths.limit(1).toLocalIterator().hasNext(); | |
| if (!hasOrphanedManifestPaths) { |
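The suggestion above amounts to asking for at most one element instead of materializing the whole result. A plain-Java analogy using lazy streams (not Spark, and with hypothetical class and method names) shows the difference in how many elements get evaluated. It says nothing about how Spark's own isEmpty() is implemented.

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

// Analogy only: java.util.stream stands in for a lazily evaluated dataset.
final class EmptinessCheckSketch {

  // Pulls from the source only until one element is available.
  static boolean hasAnyLazily(List<String> paths, AtomicInteger evaluated) {
    Iterator<String> it =
        paths.stream()
            .peek(p -> evaluated.incrementAndGet()) // count evaluated elements
            .limit(1)
            .iterator();
    return it.hasNext();
  }

  // Materializes every element before checking, for comparison.
  static boolean hasAnyEagerly(List<String> paths, AtomicInteger evaluated) {
    List<String> all =
        paths.stream()
            .peek(p -> evaluated.incrementAndGet())
            .collect(Collectors.toList());
    return !all.isEmpty();
  }

  public static void main(String[] args) {
    List<String> paths = List.of("m1.avro", "m2.avro", "m3.avro");
    AtomicInteger lazyCount = new AtomicInteger();
    AtomicInteger eagerCount = new AtomicInteger();
    System.out.println(hasAnyLazily(paths, lazyCount) + ", evaluated " + lazyCount.get());
    System.out.println(hasAnyEagerly(paths, eagerCount) + ", evaluated " + eagerCount.get());
  }
}
```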
This PR optimizes ExpireSnapshotsSparkAction by replacing driver-side collection with distributed Spark operations for manifest filtering.
The previous implementation read content files from all manifests in expired snapshots. This change filters at the manifest level first, reading content files only from orphaned manifests, similar to the approach used in ReachableFileCleanup but implemented with distributed Spark operations.
Optimizations
Code Changes
- Added emptyFileInfoDS() helper to BaseSparkAction
- Changed ReadManifest visibility to protected in BaseSparkAction
- Added contentFilesFromManifestDF() method for targeted manifest reading
Before
After
All existing TestExpireSnapshotsAction tests pass.