Add hsm_mode: Half-Sample Mode for continuous data#984
Add hsm_mode: Half-Sample Mode for continuous data#984anurag-mds wants to merge 12 commits intoJuliaStats:masterfrom
Conversation
…JuliaStats#957) Adds hsm_mode() to provide a density-based alternative to mode() for continuous distributions. Handles NaN/Inf filtering, single-element and empty vectors, and edge cases with outliers. Type-stable and zero-allocation where possible. Includes 17 tests covering standard, extreme, and boundary conditions. Based on Bickel & Fruehwirth (2006)
The hsm_mode function was added in the previous commit but wasn't included in the documentation, causing the CI build to fail with a missing docstring error. This commit adds hsm_mode to the Mode and Modes section of the scalar statistics documentation.
|
Was this PR AI-generated? |
Changed window size calculation from floor(n/2) to ceil(n/2) to match the Robertson-Cryer algorithm specification. The half-sample mode should examine windows containing at least half the observations. Updated test expectations to match the corrected algorithm: - [1,2,3] now returns 1.5 (was 1.0) - [-5,-3,-1,0,1,3,5] now returns -0.5 (was -1.0) - [1.0097, 1.0054, ...] now returns ~1.00431 (was ~1.00003) All tests pass. Type preservation (Float32 -> Float32, Int -> Float64) works correctly.
|
The implementation, test and commits are mine. I reviewed existing StatsBase prs to match project conventions, and I used AI assistance only for wording clarity in the pr description and for general performance review and bugs made by me. it was not used for algorithm or code . since in ci pipeline documentation issues which I totally forgot was missing I am actively fixing that like using ceil(len/2) as per the definition. I am eager to explain any design or choices in detail |
I feel like the existing behavior (for floating-point numbers) is a bug. If your data are not integers, we must assume that they come from a continuous distribution and use the HSM algorithm or another estimator for continuous data. Or perhaps introduce a keyword argument: mode(x::AbstractVector{<:AbstractFloat}; discrete::Bool=false) =
discrete ? mode_discrete(x) : mode_hsm(x)
mode(x::AbstractVector) = mode_discrete(x)Here, I'm not a contributor to Distributions.jl, though (although I'd like to become a contributor; my complete PR with hyperbolic distributions seems to have joined its peers, unfortunately), that's just my opinion. |
|
I think using frequency-based mode for floats can be misleading. But making HSM the default for AbstractFloat might break cases where floats are actually categories or rounded values. Maybe a keyword like discrete=false or a dispatch mode(x, HSM()) works better. I can change the API like that if the maintainers think it’s right. |
|
@ForceBru Here's the evidence for why
As you can see Design:
Users can explicitly choose the right tool. No silent behavior changes. So are you okay with this separate function approach or shall we explore the keyword argument variant further ? |
|
All tests pass, edge cases handled, documentation added. @devmotion are there any remaining technical concerns with hsm_mode() or its API before approval? |
|
Thanks for the PR! I tend to agree that a keyword argument would make sense here. The mode is a statistical concept which can be estimated in various ways. The one currently used by The API could look like |
|
Thanks for your detailed review |
Fix half-sample window size calculation Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
|
Apologies for the back-and-forth and the CI noise I'm currently away from my main development setup, but I’ve noted all changes precisely and will push a consolidated update shortly once I’m back. I’ll comment again once the updates are in. |
…dress all review feedback and add comprehensive tests
|
@nalimilan Does everything seems good? |
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
|
Thanks @nalimilan for the thorough review. I agree with the remaining points (doc wording, method naming, references, and test adjustments). I’m addressing them now and will push a final cleanup commit so that |
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
…dback, and update tests accordingly.
|
@nalimilan I have made necessary changes if anything to fix, modify or add do let me know! |

Summary
This PR adds hsm_mode(), an implementation of the half-sample mode (HSM) which is a robust estimator of the mode for continuous distributions.It is introduced as a separate function ( not an overload of mode() ) to preserve existing behaviour while providing a statistically meaningful alternative for continuous data
This addresses and closes issue #957.
Motivation
StatsBase.mode() is frequency-based and works well for discrete data. For continuous distributions, however, samples are usually unique, which makes frequency counts unstable and highly variable in practiceIssue #957 documents this behaviour, particularly for heavy-tailed distributions, where mode() can show extreme variance.
This PR provides an estimator designed specifically for continuous data
Approach
hsm_mode() implements the standard half-sample method described in the literature:Non-finite values (NaN, Inf) are filtered
The data are sorted
The algorithm repeatedly selects the contiguous half-sample with the smallest width
Once ≤ 2 points remain, the midpoint of the final interval is returned
The midpoint may not be a sample value, but provides a stable estimate of the location of highest density.
Time complexity is dominated by sorting (O(n log n)); space complexity is O(n).
After sorting, the contraction loop operates on SubArray views to avoid allocations.
API Design
The estimator is exposed as a new function:
hsm_mode(x::AbstractVector{T}) where T<:Real
It is NOT added as an overload of mode() in order to:
avoid changing existing semantics
clearly distinguish frequency-based and density-based estimation
let users choose the appropriate method explicitly
The return type is an AbstractFloat, promoted from the input element type (e.g. integers → Float64, Float32 → Float32).
Testing and Documentation
Tests cover basic correctness, edge cases, robustness to outliers, handling of non-finite values, and type behavior. All tests pass.
The docstring explains intended use cases, compares with mode(), documents complexity, provides examples, and cites the relevant literature.
References
Robertson & Cryer (1974), JASA
Bickel & Fruehwirth (2006), CSDA
Notes
This PR is intentionally small and focused. Extensions such as weighted HSM or support for missing values are left for future work.
I have attached images showing that the tests are passing, and I would like to know whether I should address the existing warnings in the codebase.
Images:






Feedback on naming or API placement is welcome.