Spark 4.1: Set data file sort_order_id in manifest for writes from Spark #15150
+392
−37
This PR updates all writes from Spark 4.0 & 3.5 to set the `sort_order_id` of data files when applicable per the Iceberg Table Spec. I've opened this PR to:

This is a successor to my initial PRs #13636 & #14683. The first has since been closed for being stale, but I later revived it for Spark 3.5 & 4.0 in #14683 at the request of @anuragmantri, to complement a follow-up change that we both converged on: using this optimization in conjunction with some Spark DSv2 APIs to report sort order when possible and optimize downstream query plans. After some discussion with @anuragmantri & @RussellSpitzer, I've forward-ported the changes to Spark 4.1, as the underlying "base" Spark version has since changed. I've re-opened this PR because there has been some increased interest in this lately:

`sort_order_id` in manifest when writing/compacting data files #13636 (comment)

So it appears that there is value in these changes being upstreamed instead of confined to a fork.
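For readers less familiar with the spec rule referenced above, here is a rough sketch of it. This is an illustrative Python model, not the code in this PR: `FileContent` and `sort_order_id_for` are hypothetical names made up for the example. It assumes the spec's rules that only data files and equality delete files carry a non-null order id, and that position delete files leave it null.

```python
from enum import Enum
from typing import Optional


class FileContent(Enum):
    # Content type codes as listed in the Iceberg table spec.
    DATA = 0
    POSITION_DELETES = 1
    EQUALITY_DELETES = 2


def sort_order_id_for(content: FileContent,
                      written_order_id: Optional[int]) -> Optional[int]:
    """Hypothetical helper: choose the sort_order_id to record for a new file.

    written_order_id is the id of the table sort order the writer actually
    applied to the rows, or None when the ordering is unknown.
    """
    if content is FileContent.POSITION_DELETES:
        # Position deletes are sorted by file and position, not by a table
        # sort order, so the field stays null for them.
        return None
    # Data files and equality deletes record the order they were written in;
    # None means "missing or unknown", which readers treat as unsorted.
    return written_order_id
```

In this model, a data file written by a sorted compaction would record the table's current order id, while a write whose ordering is unknown would leave the field unset.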
Testing
I've added tests for the newly added utility functions and updated existing tests that write and compact data files in a sorted manner to verify that we're setting the `sort_order_id` entry in the manifests to the correct value. Additionally, I've used this patch on an internal fork and verified that it correctly sets this field during compaction and normal writes.

Issue: #13634
cc @anuragmantri & @RussellSpitzer