feat: Add maintain_order parameter to merge_sorted#27263
feat: Add maintain_order parameter to merge_sorted#27263jonathanchang31 wants to merge 7 commits intopola-rs:mainfrom
maintain_order parameter to merge_sorted#27263Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #27263 +/- ##
==========================================
- Coverage 81.58% 81.43% -0.15%
==========================================
Files 1820 1829 +9
Lines 251036 252688 +1652
Branches 3149 3146 -3
==========================================
+ Hits 204808 205785 +977
- Misses 45420 46104 +684
+ Partials 808 799 -9 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
The implementation looks good, However, could you redo the tests a bit?
-
We already test the streaming engine separately from the in-memory engine, so
@pytest.mark.parametrize("streaming", [False, True])is not needed. -
Could you add one parametric test (with hypothesis), that roughly does the following?
- Get two input dataframes (with both a
keycolumn that has matching dtypes; the dtype can just bepl.Int32or something) - Eagerly add row-indexes to each dataframe (you'll need to set different column names)
- Add a
dfcolumn to each dataframe where one gets only 0 and the other only 1. - Set
actualis themaintain_order=Truemerge_sorted result - Set
expectedto be a result of a concat, and then sort by(key, df).
This is quite contrived, but you get the idea. Something simpler would of course be better. 😉
- Get two input dataframes (with both a
That test would cover all of the current tests, so you can probably remove those.
PS You can just use collect(). We automatically test both the in-memory engine and the streaming engine in CI.
|
It's okay if the coverage jobs do not pass. We know about the disk space issue. ;) |
|
@dsprenkels all the checks passed |
| # Coverage data comes from instrumentation, not DWARF symbols. | ||
| # Keep macOS test binaries smaller to avoid linker/disk exhaustion. | ||
| CARGO_PROFILE_DEV_DEBUG: 0 | ||
| CARGO_PROFILE_TEST_DEBUG: 0 |
There was a problem hiding this comment.
Please fix: Please revert this. This is a separate issue. We are looking into this issue, and we do not require this CI job as a passing requirement for merging.
| .into_series() | ||
| }, | ||
| #[cfg(feature = "dtype-array")] | ||
| Array(_, _) => { |
There was a problem hiding this comment.
Please fix: Can you please revert this, as this a somewhat a separate concern. Or leave the TODO that it should be optimized (row-encoding is very slow).
| assert_frame_equal(df_chained_from_right, df_full) | ||
|
|
||
|
|
||
| @pytest.mark.parametrize("streaming", [False, True]) |
There was a problem hiding this comment.
Can you still address the previous comment I had about the parametric test?
Summary
Closes #27114.
Adds a
maintain_order: boolparameter tomerge_sorted(). When set toTrue, the output is guaranteed to have left-biased ordering for equal keys: rows from the left frame always appear before rows from the right frame when their keys match.maintain_orderthrough the full stack: Python API → PyO3 bindings → DSL/IR plan → streaming engine → in-memory enginefind_mergeable()holds back right-side rows at chunk boundaries whose keys equal the left's maximum (usesgt_eqinstead ofgt), preventing right-side ties from being emitted before left-side ties arriving in later morselsFalsefor backward compatibilityTest plan
DataFrame.merge_sorted()API pathmerge_sortedtests pass with no regressionscargo check --features merge_sortedpasses