feat: Add maintain_order parameter to merge_sorted#27263

Open
jonathanchang31 wants to merge 7 commits into pola-rs:main from
jonathanchang31:feat/merge-sorted-maintain-order
@jonathanchang31

Summary

Closes #27114.

Adds a maintain_order: bool parameter to merge_sorted(). When set to True, the output is guaranteed to have left-biased ordering for equal keys: rows from the left frame always appear before rows from the right frame when their keys match.

  • Threads maintain_order through the full stack: Python API → PyO3 bindings → DSL/IR plan → streaming engine → in-memory engine
  • Core streaming fix: find_mergeable() holds back right-side rows at chunk boundaries whose keys equal the left's maximum (uses gt_eq instead of gt), preventing right-side ties from being emitted before left-side ties arriving in later morsels
  • The in-memory engine already produces left-biased output, so the flag is only load-bearing in the streaming path
  • Defaults to False for backward compatibility
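
The chunk-boundary rule in the second bullet can be illustrated with a minimal pure-Python sketch. It is not the Polars implementation: `right_rows_safe_to_emit` is a hypothetical helper, and it assumes both chunks are plain sorted lists of keys. It only shows why switching the hold-back comparison from "greater than" to "greater than or equal" keeps right-side ties behind left-side ties that may arrive in a later morsel.

```python
from bisect import bisect_left, bisect_right


def right_rows_safe_to_emit(right_keys, left_max, maintain_order):
    """Return how many leading rows of the sorted right chunk can be merged now.

    Hypothetical helper for illustration only, not the Polars find_mergeable().
    """
    if maintain_order:
        # Hold back right rows with key >= left_max: a later left morsel may
        # still carry rows with key == left_max, and those must come first.
        return bisect_left(right_keys, left_max)
    # Hold back only right rows with key > left_max; ties are emitted now.
    return bisect_right(right_keys, left_max)


left_chunk = [1, 2, 3]      # more left rows with key 3 may still follow
right_chunk = [2, 3, 3, 4]
left_max = max(left_chunk)

print(right_rows_safe_to_emit(right_chunk, left_max, maintain_order=False))  # 3
print(right_rows_safe_to_emit(right_chunk, left_max, maintain_order=True))   # 1
```

With `maintain_order=False` the two right rows with key 3 are released immediately, so a left row with key 3 in the next morsel would land after them. With `maintain_order=True` they wait until the left side has moved past key 3.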

Test plan

  • 3 new test functions (6 cases) covering both streaming and in-memory engines:
    • Basic left-biased ordering with overlapping keys
    • All keys identical
    • DataFrame.merge_sorted() API path
  • All 41 existing merge_sorted tests pass with no regressions
  • Rust cargo check --features merge_sorted passes

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars first-contribution First contribution by user labels Apr 10, 2026
@github-actions github-actions bot added the changes-dsl Do not merge if this label is present and red. label Apr 10, 2026
@codecov

codecov bot commented Apr 10, 2026

Codecov Report

❌ Patch coverage is 48.38710% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.43%. Comparing base (880651f) to head (5bc3ca0).
⚠️ Report is 26 commits behind head on main.

Files with missing lines                                  Patch %   Lines
crates/polars-ops/src/frame/join/merge_sorted.rs          0.00%     16 Missing ⚠️
crates/polars-plan/src/plans/ir/tree_format.rs            0.00%     8 Missing ⚠️
crates/polars-plan/src/plans/ir/dot.rs                    0.00%     4 Missing ⚠️
crates/polars-plan/src/plans/visitor/hash.rs              0.00%     2 Missing ⚠️
...rates/polars-python/src/lazyframe/visitor/nodes.rs     0.00%     2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #27263      +/-   ##
==========================================
- Coverage   81.58%   81.43%   -0.15%     
==========================================
  Files        1820     1829       +9     
  Lines      251036   252688    +1652     
  Branches     3149     3146       -3     
==========================================
+ Hits       204808   205785     +977     
- Misses      45420    46104     +684     
+ Partials      808      799       -9     

@dsprenkels dsprenkels self-requested a review April 13, 2026 07:37
Collaborator

@dsprenkels dsprenkels left a comment

The implementation looks good. However, could you redo the tests a bit?

  • We already test the streaming engine separately from the in-memory engine, so @pytest.mark.parametrize("streaming", [False, True]) is not needed.

  • Could you add one parametric test (with hypothesis), that roughly does the following?

    1. Get two input dataframes (both with a key column of matching dtype; the dtype can just be pl.Int32 or something)
    2. Eagerly add row-indexes to each dataframe (you'll need to set different column names)
    3. Add a df column to each dataframe where one gets only 0 and the other only 1.
    4. Set actual to the maintain_order=True merge_sorted result.
    5. Set expected to the result of a concat, sorted by (key, df).

    This is quite contrived, but you get the idea. Something simpler would of course be better. 😉

That test would cover all of the current tests, so you can probably remove those.
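
The property described in the numbered steps can be sketched with the standard library alone. This is a stand-in, not the requested Polars test: `heapq.merge` plays the role of `merge_sorted(..., maintain_order=True)`, seeded `random` inputs replace hypothesis strategies, and the names `left_biased_merge` and `check_once` are hypothetical. Each row is a `(key, df, row_index)` tuple mirroring the key column, the 0/1 `df` marker, and the per-frame row index.

```python
import heapq
import random


def left_biased_merge(left, right):
    # heapq.merge is stable across its inputs: on equal keys, items from the
    # earlier iterable come first, which is the left-biased order.
    return list(heapq.merge(left, right, key=lambda row: row[0]))


def check_once(rng):
    # Steps 1-3: two key-sorted inputs, tagged with a frame marker (0 = left,
    # 1 = right) and a per-frame row index.
    left_keys = sorted(rng.randint(0, 5) for _ in range(rng.randint(0, 20)))
    right_keys = sorted(rng.randint(0, 5) for _ in range(rng.randint(0, 20)))
    left = [(k, 0, i) for i, k in enumerate(left_keys)]
    right = [(k, 1, i) for i, k in enumerate(right_keys)]

    # Step 4: the merge under test.
    actual = left_biased_merge(left, right)

    # Step 5: concat, then sort by (key, df). sorted() is stable, so the
    # original row order inside each frame is preserved.
    expected = sorted(left + right, key=lambda row: (row[0], row[1]))
    return actual == expected


rng = random.Random(0)
print(all(check_once(rng) for _ in range(200)))  # True
```

The narrow key range (0 to 5) is chosen so that ties between the two inputs occur in almost every generated case, which is where the ordering guarantee matters.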

PS You can just use collect(). We automatically test both the in-memory engine and the streaming engine in CI.

Comment thread crates/polars-mem-engine/src/planner/lp.rs
Comment thread crates/polars-plan/src/plans/ir/tree_format.rs Outdated
@dsprenkels
Collaborator

dsprenkels commented Apr 16, 2026

It's okay if the coverage jobs do not pass. We know about the disk space issue. ;)

@jonathanchang31
Author

@dsprenkels all the checks passed

# Coverage data comes from instrumentation, not DWARF symbols.
# Keep macOS test binaries smaller to avoid linker/disk exhaustion.
CARGO_PROFILE_DEV_DEBUG: 0
CARGO_PROFILE_TEST_DEBUG: 0
Collaborator

Please fix: Please revert this. This is a separate issue. We are looking into this issue, and we do not require this CI job as a passing requirement for merging.

.into_series()
},
#[cfg(feature = "dtype-array")]
Array(_, _) => {
Collaborator

Please fix: Can you please revert this, as this is a somewhat separate concern? Or leave a TODO noting that it should be optimized (row-encoding is very slow).

assert_frame_equal(df_chained_from_right, df_full)


@pytest.mark.parametrize("streaming", [False, True])
Collaborator

@dsprenkels dsprenkels Apr 16, 2026

Can you still address the previous comment I had about the parametric test?

Labels

changes-dsl Do not merge if this label is present and red. enhancement New feature or an improvement of an existing feature first-contribution First contribution by user python Related to Python Polars rust Related to Rust Polars

Development

Successfully merging this pull request may close these issues.

Add maintain_order: bool to merge_sorted()

2 participants