|
24 | 24 | //! |
25 | 25 | //! # When to use this |
26 | 26 | //! |
27 | | -//! Radix sort on row-encoded keys is the fastest sort strategy for most |
28 | | -//! multi-column sorts, including: |
29 | | -//! - **Primitive columns** (integers, floats) |
30 | | -//! - **String columns**, especially multiple string columns |
| 27 | +//! Radix sort is the fastest strategy when sorting by **two or more columns**, |
| 28 | +//! especially as N grows. It benefits from: |
| 29 | +//! - **Multi-column schemas** where comparison sort must traverse columns |
| 30 | +//! per comparison, while radix sort pays a fixed cost per byte position |
| 31 | +//! - **String columns**, where the row encoding produces compact, |
| 32 | +//! high-entropy byte sequences that radix passes discriminate quickly |
31 | 33 | //! - **Mixed column types** (primitives, strings, dicts, lists) |
32 | 34 | //! |
33 | | -//! The advantage over [`lexsort_to_indices`] grows with N and with the |
34 | | -//! number of columns. |
35 | | -//! |
36 | 35 | //! # When NOT to use this |
37 | 36 | //! |
38 | 37 | //! Prefer [`lexsort_to_indices`] when: |
| 38 | +//! - **Sorting by a single column.** The row encoding overhead (allocation, |
| 39 | +//! encoding, indirection through `Rows`) outweighs the radix advantage. |
| 40 | +//! Single-column sorts are faster with direct comparison sort on the |
| 41 | +//! columnar array, which avoids encoding entirely. |
39 | 42 | //! - **All sort columns are low-cardinality dictionaries** with no |
40 | 43 | //! high-cardinality column to break ties. The row encoding for |
41 | 44 | //! dictionary values produces long shared prefixes, and radix sort |
42 | 45 | //! gains little from its first few byte passes before falling back |
43 | 46 | //! to comparison sort. |
| 47 | +//! - **Columns with low-entropy leading bytes**, such as `Decimal128` or |
| 48 | +//! `Decimal256`. These types are encoded as 16- or 32-byte big-endian |
| 49 | +//! integers, but real-world values occupy a tiny fraction of the range. |
| 50 | +//! The leading bytes are nearly identical across rows (e.g., `0x80` for |
| 51 | +//! small positive values), so radix passes burn through the max depth |
| 52 | +//! without discriminating rows, then fall back to comparison sort. |
44 | 53 | //! - **A leading primitive column discriminates most rows and a trailing |
45 | 54 | //! column is expensive to encode** (e.g., lists). [`lexsort_to_indices`] |
46 | 55 | //! avoids encoding the trailing column for rows already resolved by |
|
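The `Decimal128` bullet in the diff above can be illustrated with a small standalone sketch. It assumes a simplified order-preserving encoding (big-endian `i128` with the sign bit flipped); `encode_i128` and `shared_prefix_len` are hypothetical helpers written for this illustration, not functions from the crate, and the real row encoder may add further bytes such as null markers.

```rust
/// Hypothetical helper: a 16-byte key whose bytewise order matches the
/// numeric order of `v` (big-endian, sign bit flipped). This is an assumed
/// simplification of the row encoding, not the crate's implementation.
fn encode_i128(v: i128) -> [u8; 16] {
    let mut bytes = v.to_be_bytes();
    // Flipping the sign bit makes negatives compare below positives bytewise.
    bytes[0] ^= 0x80;
    bytes
}

/// Hypothetical helper: number of leading bytes that are identical in every key.
fn shared_prefix_len(keys: &[[u8; 16]]) -> usize {
    let Some(first) = keys.first() else { return 0 };
    (0..16)
        .take_while(|&i| keys.iter().all(|k| k[i] == first[i]))
        .count()
}

fn main() {
    // Small positive values, as is typical for real-world decimals.
    let values: [i128; 4] = [1, 250, 12_345, 9_999_999];
    let keys: Vec<[u8; 16]> = values.iter().map(|&v| encode_i128(v)).collect();

    for (v, k) in values.iter().zip(&keys) {
        println!("{v:>8} -> {:02x?}", k);
    }
    // A radix pass over any byte inside the shared prefix separates no rows.
    println!("shared prefix: {} of 16 bytes", shared_prefix_len(&keys));
}
```

Under this assumed encoding, every key starts with `0x80` followed by a run of zero bytes, so the first radix passes see a single bucket and do no useful work, which is the behaviour the bullet describes.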