feat: Add `maintain_order` parameter to `merge_sorted` by jonathanchang31 · Pull Request #27263 · pola-rs/polars

jonathanchang31 · 2026-04-10T13:07:49Z

Summary

Adds a maintain_order: bool parameter to merge_sorted(). When set to True, the output is guaranteed to have left-biased ordering for equal keys: rows from the left frame always appear before rows from the right frame when their keys match.
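To make "left-biased ordering for equal keys" concrete, here is a minimal pure-Python sketch of a two-way merge that breaks ties in favour of the left input. It is an illustration only, not the Polars implementation; the function name `merge_left_biased` and the `(key, tag)` row layout are hypothetical.

```python
def merge_left_biased(left, right, key=lambda row: row):
    """Merge two key-sorted lists; on equal keys, left rows come first."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        # Using `<=` rather than `<` is what makes ties left-biased.
        if key(left[i]) <= key(right[j]):
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    # At most one of these tails is non-empty.
    out.extend(left[i:])
    out.extend(right[j:])
    return out


# Hypothetical rows: (key, tag), where the tag records the source frame.
left = [(1, "L0"), (2, "L1"), (2, "L2")]
right = [(2, "R0"), (3, "R1")]
merged = merge_left_biased(left, right, key=lambda row: row[0])
```

With these inputs, both left rows with key 2 precede the right row with key 2.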

- Threads maintain_order through the full stack: Python API → PyO3 bindings → DSL/IR plan → streaming engine → in-memory engine
- Core streaming fix: find_mergeable() holds back right-side rows at chunk boundaries whose keys equal the left's maximum (uses gt_eq instead of gt), preventing right-side ties from being emitted before left-side ties arriving in later morsels
- The in-memory engine already produces left-biased output, so the flag is only load-bearing in the streaming path
- Defaults to False for backward compatibility
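The chunk-boundary rule in the streaming fix can be sketched in plain Python. This is a simplified model under stated assumptions, not the actual Rust `find_mergeable()`: it works on bare keys, and `split_right_at_boundary` and its parameters are hypothetical names. The idea is that right-side keys equal to the left's current maximum are only safe to emit when tie order does not matter.

```python
from bisect import bisect_left, bisect_right


def split_right_at_boundary(left_max, right_keys, maintain_order):
    """Split sorted right-side keys into (emit_now, hold_back).

    Assumption: `left_max` is the largest key seen so far on the left,
    and later left morsels may still contain rows with that same key.
    """
    if maintain_order:
        # Hold back keys >= left_max: a right-side tie has to wait for
        # left-side ties that may not have arrived yet.
        cut = bisect_left(right_keys, left_max)
    else:
        # Hold back only keys > left_max: tie order is unspecified.
        cut = bisect_right(right_keys, left_max)
    return right_keys[:cut], right_keys[cut:]


right_keys = [1, 2, 3, 3, 4]
relaxed = split_right_at_boundary(3, right_keys, maintain_order=False)
strict = split_right_at_boundary(3, right_keys, maintain_order=True)
```

In this model the only difference between the two modes is whether the rows tied with the left maximum (the two 3s here) are emitted now or held for the next morsel.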

Test plan

- 3 new test functions (6 cases) covering both streaming and in-memory engines:
  - Basic left-biased ordering with overlapping keys
  - All keys identical
  - DataFrame.merge_sorted() API path
- All 41 existing merge_sorted tests pass with no regressions
- Rust cargo check --features merge_sorted passes

codecov · 2026-04-10T14:45:16Z

Codecov Report

❌ Patch coverage is 48.38710% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.43%. Comparing base (880651f) to head (5bc3ca0).
⚠️ Report is 26 commits behind head on main.

Files with missing lines	Patch %	Lines
crates/polars-ops/src/frame/join/merge_sorted.rs	0.00%	16 Missing ⚠️
crates/polars-plan/src/plans/ir/tree_format.rs	0.00%	8 Missing ⚠️
crates/polars-plan/src/plans/ir/dot.rs	0.00%	4 Missing ⚠️
crates/polars-plan/src/plans/visitor/hash.rs	0.00%	2 Missing ⚠️
...rates/polars-python/src/lazyframe/visitor/nodes.rs	0.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #27263      +/-   ##
==========================================
- Coverage   81.58%   81.43%   -0.15%     
==========================================
  Files        1820     1829       +9     
  Lines      251036   252688    +1652     
  Branches     3149     3146       -3     
==========================================
+ Hits       204808   205785     +977     
- Misses      45420    46104     +684     
+ Partials      808      799       -9

dsprenkels

The implementation looks good. However, could you redo the tests a bit?

We already test the streaming engine separately from the in-memory engine, so @pytest.mark.parametrize("streaming", [False, True]) is not needed.
Could you add one parametric test (with hypothesis), that roughly does the following?
1. Get two input dataframes (both with a key column of matching dtype; the dtype can just be pl.Int32 or something)
2. Eagerly add row-indexes to each dataframe (you'll need to set different column names)
3. Add a df column to each dataframe where one gets only 0 and the other only 1.
4. Set actual to the maintain_order=True merge_sorted result
5. Set expected to be a result of a concat, and then sort by (key, df).
This is quite contrived, but you get the idea. Something simpler would of course be better. 😉

That test would cover all of the current tests, so you can probably remove those.

PS You can just use collect(). We automatically test both the in-memory engine and the streaming engine in CI.
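The property described in the steps above can be sketched without Polars or hypothesis, using only the standard library. Everything here is a stand-in: `heapq.merge` plays the role of `merge_sorted(..., maintain_order=True)` (it yields ties from earlier inputs first), seeded `random` replaces hypothesis strategies, and the `(key, frame_id, row_index)` tuples and helper names are hypothetical.

```python
import heapq
import random


def make_frame(rng, frame_id, n_rows):
    """Build key-sorted rows of (key, frame_id, row_index)."""
    keys = sorted(rng.randint(0, 5) for _ in range(n_rows))
    return [(key, frame_id, idx) for idx, key in enumerate(keys)]


def check_left_biased_merge(seed):
    rng = random.Random(seed)
    left = make_frame(rng, 0, rng.randint(0, 20))
    right = make_frame(rng, 1, rng.randint(0, 20))

    # Stand-in for the merge under test: merge on the key only, with
    # ties resolved in favour of the first (left) input.
    actual = list(heapq.merge(left, right, key=lambda row: row[0]))

    # Expected: concatenate, then stable-sort by (key, frame_id).
    expected = sorted(left + right, key=lambda row: (row[0], row[1]))
    return actual == expected


results = [check_left_biased_merge(seed) for seed in range(200)]
```

The row index is carried along so that a mismatch would also reveal reordering within a single frame, not just between frames. A real test would replace the stand-in with the Polars call and the seeded loop with hypothesis-generated frames.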

dsprenkels · 2026-04-16T07:23:21Z

It's okay if the coverage jobs do not pass. We know about the disk space issue. ;)

jonathanchang31 · 2026-04-16T11:13:14Z

@dsprenkels all the checks passed

dsprenkels · 2026-04-16T07:13:21Z

+  # Coverage data comes from instrumentation, not DWARF symbols.
+  # Keep macOS test binaries smaller to avoid linker/disk exhaustion.
+  CARGO_PROFILE_DEV_DEBUG: 0
+  CARGO_PROFILE_TEST_DEBUG: 0


Please fix: Please revert this. This is a separate issue. We are looking into this issue, and we do not require this CI job as a passing requirement for merging.

dsprenkels · 2026-04-16T07:20:07Z

                .into_series()
        },
        #[cfg(feature = "dtype-array")]
-        Array(_, _) => {


Please fix: Can you please revert this, as this is a somewhat separate concern? Or leave the TODO noting that it should be optimized (row-encoding is very slow).

dsprenkels · 2026-04-16T11:40:26Z

        assert_frame_equal(df_chained_from_right, df_full)
+
+
+@pytest.mark.parametrize("streaming", [False, True])


Can you still address the previous comment I had about the parametric test?

feat: Add `maintain_order` parameter to `merge_sorted`

6983ece

jonathanchang31 requested review from MarcoGorelli, alexander-beedie, c-peters, orlp, ritchie46 and wence- as code owners April 10, 2026 13:07

github-actions bot added the enhancement, python, rust, and first-contribution labels Apr 10, 2026

jonathanchang31 added 2 commits April 10, 2026 15:14

fix: lint

7b632c8

update DSL schema hashes for maintain_order field

ef3dafe

github-actions bot added the changes-dsl label ("Do not merge if this label is present and red.") Apr 10, 2026

dsprenkels self-requested a review April 13, 2026 07:37

dsprenkels requested changes Apr 13, 2026

Comment thread crates/polars-mem-engine/src/planner/lp.rs

Comment thread crates/polars-plan/src/plans/ir/tree_format.rs Outdated

jonathanchang31 added 4 commits April 14, 2026 17:28

fix: include merge_sorted maintain_order in IR tree format

03fb0da

fix: gate merge_sorted nested and categorical paths

59c47af

fix: include merge_sorted maintain_order in plan displays

198c706

ci: reduce coverage debuginfo on macOS

5bc3ca0

dsprenkels requested changes Apr 16, 2026

jonathanchang31 wants to merge 7 commits into pola-rs:main from jonathanchang31:feat/merge-sorted-maintain-order
