Faster algorithm for ordered sampling with replacement by ameligrana · Pull Request #913 · JuliaStats/StatsBase.jl

ameligrana · 2024-01-01T21:27:52Z

This is based on a classical result, described for example here: https://stats.stackexchange.com/questions/348358/a-fast-uniform-order-statistic-generator (and in the reference cited by the most upvoted answer). I wasn't able to find a reference describing its modification for sampling from a finite population, so I adapted it to that case myself.

The performance increase is substantial: more than 5 times faster once the speed-up stabilizes, e.g.
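As a rough illustration of the idea (a sketch under stated assumptions, not the code in this PR): the normalized partial sums of k + 1 exponential spacings are distributed like the sorted values of k uniform draws, and each sorted uniform can then be mapped to an index of the population. The function name `ordered_sample_sketch` and the index mapping `floor(u * n) + 1` are assumptions made for this example.

```julia
using Random

# Hypothetical helper, not the PR's implementation: draw k items from `a`
# with replacement, returned in the order in which they appear in `a`.
function ordered_sample_sketch(rng::AbstractRNG, a::AbstractVector, k::Integer)
    n = length(a)
    x = Vector{eltype(a)}(undef, k)
    k == 0 && return x
    n > 0 || throw(ArgumentError("cannot sample from an empty collection"))
    # k + 1 exponential spacings; their normalized partial sums are
    # distributed as the order statistics of k Uniform(0, 1) draws.
    s = cumsum(randexp(rng, k + 1))
    total = s[end]
    for i in 1:k
        u = s[i] / total                    # i-th smallest uniform
        j = min(floor(Int, u * n) + 1, n)   # index in 1:n; `min` guards against rounding to 1.0
        x[i] = a[firstindex(a) + j - 1]
    end
    return x
end

x = ordered_sample_sketch(MersenneTwister(1), [1:1000;], 100)
```

Because the uniforms come out already sorted, the selected indices are non-decreasing and no sorting pass over the output is needed, which is presumably where the speed-up comes from.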

This PR:

julia> using StatsBase, BenchmarkTools

julia> a = [1:1000;];

julia> @btime sample($a, 10^2, ordered=true);
  461.274 ns (3 allocations: 2.62 KiB)

julia> @btime sample($a, 10^6, ordered=true);
  3.527 ms (6 allocations: 22.89 MiB)

Main:

julia> using StatsBase, BenchmarkTools

julia> a = [1:1000;];

julia> @btime sample($a, 10^2, ordered=true);
  1.564 μs (2 allocations: 1.75 KiB)

julia> @btime sample($a, 10^6, ordered=true);
  21.273 ms (4 allocations: 15.26 MiB)

The switching point between this algorithm and the one implemented in main is set at k=10, because I found empirically that the timings are almost equal at that point.

Numerically it should be stable enough, but let me know what you think.

ameligrana · 2024-01-01T22:39:19Z

The failing test uses the previous algorithm, at lines 83-84 of sampling.jl:

83    aa = Int.(sample(r, 10; ordered=true))
84    check_sample_wrep(aa, (3, 12), 0; ordered=true, rev=rev)

so I think it is unrelated, right? It may be due to the difference in how many random numbers the new algorithm consumes.

ameligrana · 2024-01-01T23:33:28Z

Actually, it seems to me that the tests are not written very well, because no rng is set: running those tests 100 times sometimes produces a failure anyway (both before and after this PR). Should we set an rng?

edit: I tried doing that for those tests and it works, but the same thing happens in other parts of the sampling tests. If I loop 100 times over e.g.

direct_sample!([11:20;], zeros(Int, n, 3))
check_sample_wrep(a, (11, 20), 5.0e-3; ordered=false)

at lines 69-70, the tests fail, so I guess setting an rng for everything would be a good idea (actually, some confidence-interval testing might be good practice in this case).
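A minimal sketch of what a seeded test with a statistically motivated tolerance could look like. It is not the existing `check_sample_wrep` helper: the name `proportions_within`, the `z = 5` threshold, and the use of `rand` as a stand-in for the sampler under test are all assumptions made for this example.

```julia
using Random

# Hypothetical check: every value in `vals` should appear in `x` with a
# frequency within `z` binomial standard errors of 1/length(vals).
function proportions_within(x, vals; z = 5.0)
    n = length(x)
    p = 1 / length(vals)
    tol = z * sqrt(p * (1 - p) / n)
    return all(v -> abs(count(==(v), x) / n - p) <= tol, vals)
end

rng = MersenneTwister(1)       # fixed seed, so reruns see the same draws
x = rand(rng, 11:20, 10^5)     # stand-in for the sampler under test
ok = proportions_within(x, 11:20)
```

The tolerance scales with the sample size instead of being a fixed constant, so a correct sampler should fail only very rarely even without a seed. The streams of the standard-library generators are not guaranteed to be identical across Julia versions, which may be why one of the commits in this PR tries a `StableRNG(1)` instead.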

ameligrana · 2024-01-09T15:15:36Z

Closing because I want to do some more experimentation with the algorithm before proposing it; you can find the experiments here: https://github.com/Tortar/SortedRands.jl

edit: I conducted some local tests and everything seems good to me; waiting to hear someone else's opinion :-)

ameligrana · 2024-04-12T23:43:18Z

gentle bump, since #927 was merged, what about taking a look at another speed-up? :D

ameligrana · 2024-04-18T10:05:41Z

do you have any more review comments @devmotion? :-)

ameligrana · 2024-06-18T12:16:01Z

Bump

ameligrana · 2024-07-12T23:51:59Z

Bump2

ameligrana added 4 commits January 1, 2024 22:16

Faster algorithm for ordered sampling with replacement

dba8ba3

Update sampling.jl

c922c90

let's keep only uniform_orderstat_sample! to see if it is numerically stable

a7ef972

previous methodology then

b424eaf

ameligrana added 5 commits January 1, 2024 23:46

Update sampling.jl

3ae19bf

use better test for small ordered sampling

c5d90b3

Update sampling.jl

e032965

Update sampling.jl

397b8b3

Update sampling.jl

1e9c9ec

try stablerng(1)

2a4683c

ameligrana closed this Jan 9, 2024

ameligrana deleted the patch-1 branch January 9, 2024 15:40

ameligrana restored the patch-1 branch January 14, 2024 01:40

ameligrana reopened this Jan 14, 2024

devmotion reviewed Apr 13, 2024

ameligrana added 2 commits April 13, 2024 12:06

use cumsum

5367e69

Merge branch 'JuliaStats:master' into patch-1

c0ec10d

ameligrana requested a review from devmotion April 15, 2024 10:58

ameligrana wants to merge 12 commits into JuliaStats:master from ameligrana:patch-1

2 participants
