Commit d8e323c

Update docs based on DataFusion TPC-H observations.
1 parent c9792cc commit d8e323c

1 file changed: +16 -7 lines changed

arrow-row/src/radix.rs

Lines changed: 16 additions & 7 deletions
@@ -24,23 +24,32 @@
 //!
 //! # When to use this
 //!
-//! Radix sort on row-encoded keys is the fastest sort strategy for most
-//! multi-column sorts, including:
-//! - **Primitive columns** (integers, floats)
-//! - **String columns**, especially multiple string columns
+//! Radix sort is the fastest strategy when sorting by **two or more columns**,
+//! especially as N grows. It benefits from:
+//! - **Multi-column schemas** where comparison sort must traverse columns
+//! per comparison, while radix sort pays a fixed cost per byte position
+//! - **String columns**, where the row encoding produces compact,
+//! high-entropy byte sequences that radix passes discriminate quickly
 //! - **Mixed column types** (primitives, strings, dicts, lists)
 //!
-//! The advantage over [`lexsort_to_indices`] grows with N and with the
-//! number of columns.
-//!
 //! # When NOT to use this
 //!
 //! Prefer [`lexsort_to_indices`] when:
+//! - **Sorting by a single column.** The row encoding overhead (allocation,
+//! encoding, indirection through `Rows`) outweighs the radix advantage.
+//! Single-column sorts are faster with direct comparison sort on the
+//! columnar array, which avoids encoding entirely.
 //! - **All sort columns are low-cardinality dictionaries** with no
 //! high-cardinality column to break ties. The row encoding for
 //! dictionary values produces long shared prefixes, and radix sort
 //! gains little from its first few byte passes before falling back
 //! to comparison sort.
+//! - **Columns with low-entropy leading bytes**, such as `Decimal128` or
+//! `Decimal256`. These types are encoded as 16- or 32-byte big-endian
+//! integers, but real-world values occupy a tiny fraction of the range.
+//! The leading bytes are nearly identical across rows (e.g., `0x80` for
+//! small positive values), so radix passes burn through the max depth
+//! without discriminating rows, then fall back to comparison sort.
 //! - **A leading primitive column discriminates most rows and a trailing
 //! column is expensive to encode** (e.g., lists). [`lexsort_to_indices`]
 //! avoids encoding the trailing column for rows already resolved by
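
The low-entropy observation for `Decimal128` in the added docs can be illustrated with a small standalone sketch. It assumes the row format encodes a signed 128-bit value as 16 big-endian bytes with the sign bit flipped, which matches the `0x80` leading byte mentioned in the diff. `encode_i128` and `shared_prefix_len` are hypothetical helpers written for this illustration, not functions from `arrow-row`.

```rust
/// Hypothetical helper (not part of arrow-row): order-preserving encoding
/// of an i128 as 16 big-endian bytes with the sign bit flipped, so that
/// byte-wise comparison matches numeric comparison. Assumed to mirror how
/// the row format treats Decimal128 values.
fn encode_i128(v: i128) -> [u8; 16] {
    let mut bytes = v.to_be_bytes();
    bytes[0] ^= 0x80; // flip the sign bit so negatives sort before positives
    bytes
}

/// Number of leading byte positions at which every key holds the same value.
/// A radix pass over any of these positions cannot separate the rows.
fn shared_prefix_len(keys: &[[u8; 16]]) -> usize {
    let Some(first) = keys.first() else { return 0 };
    (0..16)
        .take_while(|&i| keys.iter().all(|k| k[i] == first[i]))
        .count()
}

fn main() {
    // Illustrative small positive values, as might appear in a price column.
    let values: [i128; 4] = [12_345, 99_999, 1, 250_000];
    let keys: Vec<[u8; 16]> = values.iter().map(|&v| encode_i128(v)).collect();

    println!("leading byte: {:#04x}", keys[0][0]);
    println!("shared prefix: {} of 16 bytes", shared_prefix_len(&keys));
}
```

With values this small, only the last few bytes of each key differ, so a most-significant-byte-first radix sort would spend its early passes on positions that place every row in the same bucket.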
