Add hsm_mode: Half-Sample Mode for continuous data by anurag-mds · Pull Request #984 · JuliaStats/StatsBase.jl

anurag-mds · 2025-12-27T09:12:39Z

Summary

This PR adds hsm_mode(), an implementation of the half-sample mode (HSM), a robust estimator of the mode for continuous distributions.

It is introduced as a separate function (not an overload of mode()) to preserve existing behaviour while providing a statistically meaningful alternative for continuous data.

This addresses and closes issue #957.

Motivation

StatsBase.mode() is frequency-based and works well for discrete data. For continuous distributions, however, samples are usually unique, so every value has a count of one and the value returned is effectively arbitrary and highly variable in practice.

Issue #957 documents this behaviour, particularly for heavy-tailed distributions, where mode() can show extreme variance.

This PR provides an estimator designed specifically for continuous data.

Approach

hsm_mode() implements the standard half-sample method described in the literature:

- Non-finite values (NaN, Inf) are filtered out.
- The data are sorted.
- The algorithm repeatedly selects the contiguous half-sample with the smallest width.
- Once ≤ 2 points remain, the midpoint of the final interval is returned.

The midpoint may not be a sample value, but it provides a stable estimate of the location of highest density.
Time complexity is dominated by sorting (O(n log n)); space complexity is O(n).
After sorting, the contraction loop operates on SubArray views to avoid allocations.
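
The steps above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the name hsm_sketch is hypothetical, throwing an ArgumentError on input with no finite values is an assumption, and ties between equally narrow windows are resolved by keeping the leftmost one (also an assumption). The window size uses ceil(n/2), the rule adopted later in this thread.

```julia
# Sketch of the half-sample mode. `hsm_sketch` is a hypothetical name,
# not the function added by this PR.
function hsm_sketch(x::AbstractVector{<:Real})
    # Drop NaN and ±Inf, convert to a floating-point element type, then sort.
    data = sort!(float.(filter(isfinite, x)))
    # Assumption: input without finite values is treated as an error.
    isempty(data) && throw(ArgumentError("no finite values in input"))

    # Contract a view onto the sorted data until at most 2 points remain.
    v = view(data, 1:length(data))
    while length(v) > 2
        n = length(v)
        h = cld(n, 2)                 # half-sample size, ceil(n/2)
        best_start = 1
        best_width = v[h] - v[1]
        for i in 2:(n - h + 1)
            w = v[i + h - 1] - v[i]   # width of the window starting at i
            if w < best_width         # strict <, so ties keep the leftmost window
                best_width = w
                best_start = i
            end
        end
        v = view(v, best_start:(best_start + h - 1))
    end
    # Midpoint of the final interval (or the single remaining point).
    return (first(v) + last(v)) / 2
end
```

With this sketch, hsm_sketch([1, 2, 3]) gives 1.5, which agrees with the updated test expectation quoted later in the thread.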

API Design

The estimator is exposed as a new function:

hsm_mode(x::AbstractVector{T}) where T<:Real

It is NOT added as an overload of mode() in order to:

- avoid changing existing semantics
- clearly distinguish frequency-based and density-based estimation
- let users choose the appropriate method explicitly

The return type is an AbstractFloat, promoted from the input element type (e.g. integers → Float64, Float32 → Float32).
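
One way to get this promotion behaviour is Base's float, which maps a real element type to its floating-point counterpart. The snippet below is only an illustration of that rule. The helper midpoint_sketch is hypothetical, and whether the PR uses float internally is an assumption.

```julia
# `float(T)` returns the floating-point type that values of type T convert to.
int_target = float(Int)       # Float64
f32_target = float(Float32)   # Float32 (already a float type, so unchanged)

# Hypothetical helper: midpoint of two values, computed in the promoted type.
midpoint_sketch(a::T, b::T) where {T<:Real} = (float(a) + float(b)) / 2

m_int = midpoint_sketch(1, 2)        # Float64 result from Int inputs
m_f32 = midpoint_sketch(1f0, 2f0)    # Float32 result from Float32 inputs
```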

Testing and Documentation

Tests cover basic correctness, edge cases, robustness to outliers, handling of non-finite values, and type behavior. All tests pass.

The docstring explains intended use cases, compares with mode(), documents complexity, provides examples, and cites the relevant literature.

References

Robertson & Cryer (1974), JASA

Bickel & Fruehwirth (2006), CSDA

Notes

This PR is intentionally small and focused. Extensions such as weighted HSM or support for missing values are left for future work.

I have attached images showing that the tests are passing, and I would like to know whether I should address the existing warnings in the codebase.

Feedback on naming or API placement is welcome.

…JuliaStats#957) Adds hsm_mode() to provide a density-based alternative to mode() for continuous distributions. Handles NaN/Inf filtering, single-element and empty vectors, and edge cases with outliers. Type-stable and zero-allocation where possible. Includes 17 tests covering standard, extreme, and boundary conditions. Based on Bickel & Fruehwirth (2006)

devmotion · 2025-12-27T11:01:55Z

Was this PR AI-generated?

Changed window size calculation from floor(n/2) to ceil(n/2) to match the Robertson-Cryer algorithm specification. The half-sample mode should examine windows containing at least half the observations.

Updated test expectations to match the corrected algorithm:
- [1,2,3] now returns 1.5 (was 1.0)
- [-5,-3,-1,0,1,3,5] now returns -0.5 (was -1.0)
- [1.0097, 1.0054, ...] now returns ~1.00431 (was ~1.00003)

All tests pass. Type preservation (Float32 -> Float32, Int -> Float64) works correctly.
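
To see how the window-size rule changes the first two results above, here is a small sketch with the rule passed in as a function. The helper contract_sketch is hypothetical and is not the PR's code. It assumes already sorted, finite input and resolves ties by keeping the leftmost narrowest window.

```julia
# Hypothetical helper: half-sample contraction with a pluggable window-size rule.
# Assumes `sorted` is already sorted and contains only finite values.
function contract_sketch(sorted::Vector{Float64}, halfsize)
    v = sorted
    while length(v) > 2
        n = length(v)
        h = halfsize(n)
        # Widths of all windows of h consecutive points; argmin returns the
        # first minimum, so ties keep the leftmost window.
        widths = [v[j + h - 1] - v[j] for j in 1:(n - h + 1)]
        i = argmin(widths)
        v = v[i:(i + h - 1)]
    end
    return (first(v) + last(v)) / 2
end

ceil_rule(n) = cld(n, 2)    # ceil(n/2), the corrected rule
floor_rule(n) = fld(n, 2)   # floor(n/2), the earlier rule

small = [1.0, 2.0, 3.0]
wide = [-5.0, -3.0, -1.0, 0.0, 1.0, 3.0, 5.0]

new_small = contract_sketch(small, ceil_rule)    # 1.5
old_small = contract_sketch(small, floor_rule)   # 1.0
new_wide = contract_sketch(wide, ceil_rule)      # -0.5
old_wide = contract_sketch(wide, floor_rule)     # -1.0
```

With floor(n/2), a three-point sample shrinks to a single-point window of width zero, so the first point is returned; with ceil(n/2) the window keeps two points and their midpoint is returned.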

anurag-mds · 2025-12-27T12:10:38Z

The implementation, tests, and commits are mine. I reviewed existing StatsBase PRs to match project conventions, and I used AI assistance only for wording clarity in the PR description and for a general review of performance and of bugs I had introduced. It was not used for the algorithm or the code. The CI pipeline flagged documentation that I had completely forgotten to add, and I am actively fixing that, along with issues such as using ceil(len/2) as per the definition. I am happy to explain any design choices in detail.

ForceBru · 2025-12-28T00:06:10Z

introduced as a separate function ( not an overload of mode() ) to preserve existing behaviour

I feel like the existing behavior (for floating-point numbers) is a bug. If your data are not integers, we must assume that they come from a continuous distribution and use the HSM algorithm or another estimator for continuous data.

Or perhaps introduce a keyword argument:

mode(x::AbstractVector{<:AbstractFloat}; discrete::Bool=false) =
  discrete ? mode_discrete(x) : mode_hsm(x)

mode(x::AbstractVector) = mode_discrete(x)

Here, mode_discrete is what's currently called mode. The keyword argument lets one use the high-variance estimator for AbstractFloat.

I'm not a contributor to Distributions.jl, though (although I'd like to become a contributor; my complete PR with hyperbolic distributions seems to have joined its peers, unfortunately), that's just my opinion.

anurag-mds · 2025-12-28T06:08:38Z

I think using frequency-based mode for floats can be misleading.

But making HSM the default for AbstractFloat might break cases where floats are actually categories or rounded values.

Maybe a keyword like discrete=false or a dispatch mode(x, HSM()) works better. I can change the API like that if the maintainers think it’s right.

anurag-mds · 2025-12-28T07:14:36Z

@ForceBru
You have made an excellent point about API design consistency. You're right that distinguishing frequency-based vs density-based estimation matters.

Here's the evidence for why hsm_mode() should remain separate:

As you can see
Insight: frequency-based mode() counts repeated values, but on continuous data samples are usually unique. The "most frequent" element then becomes arbitrary and order-dependent. HSM instead estimates the location of the density peak.

Design:

mode(): Frequency-based, correct for discrete data [keep as-is]
hsm_mode(): Density-based, essential for continuous data [separate]

Users can explicitly choose the right tool. No silent behavior changes.

So are you okay with this separate-function approach, or shall we explore the keyword argument variant further?

anurag-mds · 2025-12-29T20:58:20Z

All tests pass, edge cases handled, documentation added. @devmotion are there any remaining technical concerns with hsm_mode() or its API before approval?

nalimilan · 2026-01-05T21:12:16Z

Thanks for the PR!

I tend to agree that a keyword argument would make sense here. The mode is a statistical concept which can be estimated in various ways. The one currently used by mode is the simplest, and of course it's bad for continuous samples unless you apply some binning. But mode(Normal()) returns zero already so the concept makes sense in general.

The API could look like mode(x, method=:halfsample).
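
A keyword-based API along these lines might look roughly like the sketch below. All names here (mode_sketch, most_frequent, halfsample) are hypothetical stand-ins and not StatsBase functions, and the default value :frequency is an assumption; only method=:halfsample comes from the suggestion above.

```julia
# Hypothetical frequency-based mode: the first value to reach the highest count.
function most_frequent(x)
    counts = Dict{eltype(x),Int}()
    best, best_count = first(x), 0
    for v in x
        c = counts[v] = get(counts, v, 0) + 1
        if c > best_count
            best, best_count = v, c
        end
    end
    return best
end

# Hypothetical compact half-sample mode (ceil(n/2) windows, leftmost on ties).
function halfsample(x)
    v = sort!(float.(filter(isfinite, x)))
    while length(v) > 2
        h = cld(length(v), 2)
        i = argmin([v[j + h - 1] - v[j] for j in 1:(length(v) - h + 1)])
        v = v[i:(i + h - 1)]
    end
    return (first(v) + last(v)) / 2
end

# Stand-in for `mode` with a `method` keyword; `:frequency` as default is assumed.
function mode_sketch(x::AbstractVector{<:Real}; method::Symbol = :frequency)
    if method === :frequency
        return most_frequent(x)
    elseif method === :halfsample
        return halfsample(x)
    else
        throw(ArgumentError("unknown method: $method"))
    end
end
```

Keeping the frequency-based estimator as the default would leave existing calls unchanged, while mode_sketch(x; method=:halfsample) opts in to the density-based estimate.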

src/scalarstats.jl

test/scalarstats.jl

anurag-mds · 2026-01-06T08:53:05Z

Thanks for your detailed review. This is very helpful. I'll address the points you raised (API via keyword, iterator support, middle, non-finite handling, test adjustments) and fix the CI failures.

anurag-mds · 2026-01-06T23:54:28Z

Apologies for the back-and-forth and the CI noise. I'm currently away from my main development setup, but I've noted all changes precisely and will push a consolidated update once I'm back. I'll comment again once the updates are in.
Thanks again for the detailed guidance.

anurag-mds · 2026-01-25T16:24:19Z

@nalimilan Does everything seem good?

nalimilan

Thanks!

src/scalarstats.jl

test/scalarstats.jl

anurag-mds · 2026-02-03T17:28:01Z

Thanks @nalimilan for the thorough review. I agree with the remaining points (doc wording, method naming, references, and test adjustments). I'm addressing them now and will push a final cleanup commit so that everything is consolidated.

anurag-mds · 2026-02-03T18:20:05Z

@nalimilan I have made the necessary changes. If there is anything else to fix, modify, or add, do let me know!
Thanks a lot for the guidance.

= added 3 commits December 27, 2025 11:35

Fixed some discrepancies by fixing typos and removing temporary files…

3de6546

… made by me.

Add hsm_mode to documentation

727c530

The hsm_mode function was added in the previous commit but wasn't included in the documentation, causing the CI build to fail with a missing docstring error. This commit adds hsm_mode to the Mode and Modes section of the scalar statistics documentation.

anurag-mds closed this Dec 27, 2025

anurag-mds reopened this Dec 27, 2025

Merge branch 'master' into hsm-mode

e086cc3

nalimilan reviewed Jan 5, 2026

Update src/scalarstats.jl

393acd5

Fix half-sample window size calculation Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

anurag-mds added 2 commits January 25, 2026 21:10

Implement half-sample mode (HSM) with method=:halfsample for mode; ad…

98e39db

…dress all review feedback and add comprehensive tests

Merge branch 'master' into hsm-mode

9e8cf98

anurag-mds marked this pull request as draft January 25, 2026 15:49

Fix documentation issues and remove obsolete file

5d9a333

anurag-mds marked this pull request as ready for review January 25, 2026 16:21

nalimilan reviewed Jan 31, 2026

Update test/scalarstats.jl

068e968

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

anurag-mds and others added 2 commits February 3, 2026 23:16

Update src/scalarstats.jl

502c458

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

Implement half-sample mode (HSM) for mode(), address all reviewer fee…

72d867c

…dback, and update tests accordingly.

anurag-mds marked this pull request as draft February 3, 2026 18:11

anurag-mds marked this pull request as ready for review February 3, 2026 18:19


4 participants
