@@ -230,6 +230,10 @@ private Dataset<FileInfo> toFileInfoDS(List<String> paths, String type) {
return spark.createDataset(fileInfoList, FileInfo.ENCODER);
}

protected Dataset<FileInfo> emptyFileInfoDS() {
return spark.emptyDataset(FileInfo.ENCODER);
}

/**
* Deletes files and keeps track of how many files were removed for each file type.
*
@@ -405,7 +409,7 @@ public long totalFilesCount() {
}
}

private static class ReadManifest implements FlatMapFunction<ManifestFileBean, FileInfo> {
protected static class ReadManifest implements FlatMapFunction<ManifestFileBean, FileInfo> {
private final Broadcast<Table> table;

ReadManifest(Broadcast<Table> table) {
@@ -20,15 +20,18 @@

import static org.apache.iceberg.TableProperties.GC_ENABLED;
import static org.apache.iceberg.TableProperties.GC_ENABLED_DEFAULT;
import static org.apache.spark.sql.functions.col;

import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import org.apache.iceberg.AllManifestsTable;
import org.apache.iceberg.ExpireSnapshots.CleanupLevel;
import org.apache.iceberg.HasTableOperations;
import org.apache.iceberg.MetadataTableType;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableMetadata;
@@ -41,8 +44,13 @@
import org.apache.iceberg.relocated.com.google.common.collect.Lists;
import org.apache.iceberg.relocated.com.google.common.collect.Sets;
import org.apache.iceberg.spark.JobGroupInfo;
import org.apache.iceberg.spark.source.SerializableTableWithSize;
import org.apache.iceberg.util.PropertyUtil;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@@ -174,14 +182,71 @@ public Dataset<FileInfo> expireFiles() {

// fetch valid files after expiration
TableMetadata updatedMetadata = ops.refresh();
Dataset<FileInfo> validFileDS = fileDS(updatedMetadata);

// fetch files referenced by expired snapshots
// find IDs of expired snapshots
Set<Long> deletedSnapshotIds = findExpiredSnapshotIds(originalMetadata, updatedMetadata);
Dataset<FileInfo> deleteCandidateFileDS = fileDS(originalMetadata, deletedSnapshotIds);
if (deletedSnapshotIds.isEmpty()) {
this.expiredFileDS = emptyFileInfoDS();
return expiredFileDS;
}

Table originalTable = newStaticTable(originalMetadata, table.io());
Table updatedTable = newStaticTable(updatedMetadata, table.io());

Dataset<Row> expiredManifestDF =
loadMetadataTable(originalTable, MetadataTableType.ALL_MANIFESTS)
.filter(
col(AllManifestsTable.REF_SNAPSHOT_ID.name()).isInCollection(deletedSnapshotIds));

Dataset<Row> liveManifestDF =
loadMetadataTable(updatedTable, MetadataTableType.ALL_MANIFESTS);

Dataset<String> expiredManifestPaths =
expiredManifestDF.select(col("path")).distinct().as(Encoders.STRING());

Dataset<String> liveManifestPaths =
liveManifestDF.select(col("path")).distinct().as(Encoders.STRING());

Dataset<String> orphanedManifestPaths = expiredManifestPaths.except(liveManifestPaths);

Dataset<FileInfo> expiredManifestLists = manifestListDS(originalTable, deletedSnapshotIds);
Dataset<FileInfo> liveManifestLists = manifestListDS(updatedTable, null);
Dataset<FileInfo> orphanedManifestLists = expiredManifestLists.except(liveManifestLists);

Dataset<FileInfo> expiredStats = statisticsFileDS(originalTable, deletedSnapshotIds);
Dataset<FileInfo> liveStats = statisticsFileDS(updatedTable, null);
Dataset<FileInfo> orphanedStats = expiredStats.except(liveStats);

if (orphanedManifestPaths.isEmpty()) {
Copilot AI Jan 27, 2026

Using isEmpty() on a Dataset triggers a Spark action that collects data to the driver. Consider using first() wrapped in a try-catch or take(1).length == 0 to avoid potentially expensive operations when checking if a dataset is empty.

Suggested change
if (orphanedManifestPaths.isEmpty()) {
boolean hasOrphanedManifestPaths = orphanedManifestPaths.limit(1).toLocalIterator().hasNext();
if (!hasOrphanedManifestPaths) {

@joyhaldar (Contributor, Author) Jan 28, 2026

Thanks for the review. Dataset.isEmpty() uses limit(1) and executeTake(1); it only fetches a single row to check emptiness, not the full dataset.

Source: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/classic/Dataset.scala#L557-L560
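
For reference, a minimal standalone sketch (hypothetical class name, plain local SparkSession, not part of this PR) showing that both Dataset.isEmpty() and the suggested limit(1)-based check pull at most one row to the driver:

```java
import java.util.Collections;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class EmptyCheckSketch {
  public static void main(String[] args) {
    SparkSession spark =
        SparkSession.builder().master("local[*]").appName("empty-check-sketch").getOrCreate();

    // Stand-in for orphanedManifestPaths: an empty Dataset<String>.
    Dataset<String> paths =
        spark.createDataset(Collections.<String>emptyList(), Encoders.STRING());

    // Dataset.isEmpty() plans a limit(1) and takes at most one row,
    // so it does not collect the whole dataset to the driver.
    boolean emptyViaIsEmpty = paths.isEmpty();

    // The alternative from the review suggestion: also bounded by limit(1).
    boolean hasRows = paths.limit(1).toLocalIterator().hasNext();

    System.out.println("isEmpty=" + emptyViaIsEmpty + ", hasRows=" + hasRows);
    spark.stop();
  }
}
```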

this.expiredFileDS = orphanedManifestLists.union(orphanedStats);
return expiredFileDS;
}

// determine expired files
this.expiredFileDS = deleteCandidateFileDS.except(validFileDS);
Dataset<Row> orphanedManifestDF =
expiredManifestDF
.join(
orphanedManifestPaths.toDF("orphaned_path"),
expiredManifestDF.col("path").equalTo(col("orphaned_path")),
"inner")
.drop("orphaned_path");

Dataset<FileInfo> candidateContentFiles =
contentFilesFromManifestDF(originalTable, orphanedManifestDF);

Dataset<FileInfo> liveContentFiles = contentFilesFromManifestDF(updatedTable, liveManifestDF);

Dataset<FileInfo> orphanedContentFiles = candidateContentFiles.except(liveContentFiles);

Dataset<FileInfo> orphanedManifestsDS =
orphanedManifestPaths.map(
(MapFunction<String, FileInfo>) path -> new FileInfo(path, MANIFEST),
FileInfo.ENCODER);

this.expiredFileDS =
orphanedContentFiles
.union(orphanedManifestsDS)
.union(orphanedManifestLists)
.union(orphanedStats);
}

return expiredFileDS;
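
Editorial note: the core of the new expireFiles() path is set difference on Spark Datasets, that is, paths referenced by expired snapshots minus paths still referenced by live snapshots. A minimal sketch of that except() pattern, using hypothetical manifest path strings rather than Iceberg metadata tables:

```java
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class OrphanDiffSketch {
  public static void main(String[] args) {
    SparkSession spark =
        SparkSession.builder().master("local[*]").appName("orphan-diff-sketch").getOrCreate();

    // Hypothetical manifest paths referenced by expired and live snapshots.
    Dataset<String> expired =
        spark.createDataset(Arrays.asList("m1.avro", "m2.avro", "m3.avro"), Encoders.STRING());
    Dataset<String> live =
        spark.createDataset(Arrays.asList("m2.avro", "m3.avro"), Encoders.STRING());

    // except() keeps rows of `expired` that do not appear in `live`,
    // i.e. manifests no longer reachable after expiration.
    Dataset<String> orphaned = expired.except(live);
    orphaned.show(); // m1.avro

    spark.stop();
  }
}
```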
@@ -233,18 +298,6 @@ private boolean streamResults() {
return PropertyUtil.propertyAsBoolean(options(), STREAM_RESULTS, STREAM_RESULTS_DEFAULT);
}

private Dataset<FileInfo> fileDS(TableMetadata metadata) {
return fileDS(metadata, null);
}

private Dataset<FileInfo> fileDS(TableMetadata metadata, Set<Long> snapshotIds) {
Table staticTable = newStaticTable(metadata, table.io());
return contentFileDS(staticTable, snapshotIds)
.union(manifestDS(staticTable, snapshotIds))
.union(manifestListDS(staticTable, snapshotIds))
.union(statisticsFileDS(staticTable, snapshotIds));
}

private Set<Long> findExpiredSnapshotIds(
TableMetadata originalMetadata, TableMetadata updatedMetadata) {
Set<Long> retainedSnapshots =
@@ -283,4 +336,26 @@ private ExpireSnapshots.Result deleteFiles(Iterator<FileInfo> files) {
.deletedStatisticsFilesCount(summary.statisticsFilesCount())
.build();
}

private Dataset<FileInfo> contentFilesFromManifestDF(Table staticTable, Dataset<Row> manifestDF) {
Table serializableTable = SerializableTableWithSize.copyOf(staticTable);
Broadcast<Table> tableBroadcast = sparkContext().broadcast(serializableTable);
int numShufflePartitions = spark().sessionState().conf().numShufflePartitions();

Dataset<ManifestFileBean> manifestBeanDS =
manifestDF
.selectExpr(
"content",
"path",
"length",
"0 as sequenceNumber",
"partition_spec_id as partitionSpecId",
"added_snapshot_id as addedSnapshotId",
"key_metadata as keyMetadata")
.dropDuplicates("path")
.repartition(numShufflePartitions)
.as(ManifestFileBean.ENCODER);

return manifestBeanDS.flatMap(new ReadManifest(tableBroadcast), FileInfo.ENCODER);
}
}
@@ -1200,10 +1200,12 @@ public void testUseLocalIterator() {

checkExpirationResults(1L, 0L, 0L, 1L, 2L, results);

// Job count reflects distributed operations for manifest path filtering,
// early exit checks, and join-based filtering
assertThat(jobsRunDuringStreamResults)
.as(
"Expected total number of jobs with stream-results should match the expected number")
.isEqualTo(4L);
.isEqualTo(12L);
Copilot AI Jan 27, 2026

The expected job count increased from 4 to 12 due to the new distributed operations. Consider adding a comment explaining why this specific count is expected, or adding a test case that validates the optimization logic (e.g., verifying early exits when no orphaned manifests exist).

@joyhaldar (Contributor, Author)

Added a comment explaining the job count.
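
For background, a minimal sketch (hypothetical class, plain local SparkSession, not this project's test harness) of how a counter like jobsRunDuringStreamResults can be gathered with a SparkListener:

```java
import java.util.concurrent.atomic.AtomicLong;
import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerJobStart;
import org.apache.spark.sql.SparkSession;

public class JobCountSketch {
  public static void main(String[] args) {
    SparkSession spark =
        SparkSession.builder().master("local[*]").appName("job-count-sketch").getOrCreate();

    AtomicLong jobCount = new AtomicLong();
    spark
        .sparkContext()
        .addSparkListener(
            new SparkListener() {
              @Override
              public void onJobStart(SparkListenerJobStart jobStart) {
                jobCount.incrementAndGet();
              }
            });

    // Each action below submits at least one Spark job that the listener observes.
    spark.range(100).count();
    spark.range(100).isEmpty();

    // Listener events are delivered asynchronously; a real test would wait
    // for the listener bus to drain before asserting on the count.
    System.out.println("jobs observed: " + jobCount.get());
    spark.stop();
  }
}
```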

});
}
