Skip to content

Faster algorithm for ordered sampling with replacement#913

Open
ameligrana wants to merge 12 commits intoJuliaStats:masterfrom
ameligrana:patch-1
Open

Faster algorithm for ordered sampling with replacement#913
ameligrana wants to merge 12 commits intoJuliaStats:masterfrom
ameligrana:patch-1

Conversation

@ameligrana
Copy link

@ameligrana ameligrana commented Jan 1, 2024

This is based on a classical result for example described here https://stats.stackexchange.com/questions/348358/a-fast-uniform-order-statistic-generator (and in the reference of the most upvoted answer). I wasn't able to find a reference describing its modification for sampling a finite population, but I adapted it to such a case.

In particular, the performance increase is substantial, more than 5 times faster when the performance increase stabilize, e.g.

This PR:

julia> using StatsBase, BenchmarkTools

julia> a = [1:1000;];

julia> @btime sample($a, 10^2, ordered=true);
  461.274 ns (3 allocations: 2.62 KiB)

julia> @btime sample($a, 10^6, ordered=true);
  3.527 ms (6 allocations: 22.89 MiB)

Main:

julia> using StatsBase, BenchmarkTools

julia> a = [1:1000;];

julia> @btime sample($a, 10^2, ordered=true);
  1.564 μs (2 allocations: 1.75 KiB)

julia> @btime sample($a, 10^6, ordered=true);
  21.273 ms (4 allocations: 15.26 MiB)

the switching point between this algorithm and the one implemented in main is set at k=10 because I found that empirically at that point the timings were almost equal.

Numerically it should be stable enough, but let me know what you think

@ameligrana
Copy link
Author

ameligrana commented Jan 1, 2024

Test that is failing is using the previous algorithm at lines 83-84 of sampling.jl

83    aa = Int.(sample(r, 10; ordered=true))
84    check_sample_wrep(aa, (3, 12), 0; ordered=true, rev=rev)

so I think it is unrelated right? It is maybe due to the difference in the random number consumed by the new algorithm

@ameligrana
Copy link
Author

ameligrana commented Jan 1, 2024

Actually it seems to me that the way tests are written is not very good because no rng is set, indeed trying to run those tests 100 times, sometimes gets to failure anyway (before and after this pr), should we set an rng?

edit: I tried to do it on those tests and it works, but it happens also with other parts of the sampling tests, if I loop over 100 times e.g.

direct_sample!([11:20;], zeros(Int, n, 3))
check_sample_wrep(a, (11, 20), 5.0e-3; ordered=false)

at lines 69-70 tests fails, so I guess establishing a rng for everything should be a good idea (actually maybe some confidence interval testing could be good practice in this case)

@ameligrana
Copy link
Author

ameligrana commented Jan 9, 2024

closing because I want to do same more experimentation with the algorithm before proposing it, you can find them here:https://github.com/Tortar/SortedRands.jl

edit: I conducted some local tests, everything seems good to me, wait to hear the opinion of someone else :-)

@ameligrana ameligrana closed this Jan 9, 2024
@ameligrana ameligrana deleted the patch-1 branch January 9, 2024 15:40
@ameligrana ameligrana restored the patch-1 branch January 14, 2024 01:40
@ameligrana ameligrana reopened this Jan 14, 2024
@ameligrana
Copy link
Author

gentle bump, since #927 was merged, what about taking a look at another speed-up? :D

@ameligrana ameligrana requested a review from devmotion April 15, 2024 10:58
@ameligrana
Copy link
Author

do you have any more review comments @devmotion? :-)

@ameligrana
Copy link
Author

Bump

@ameligrana
Copy link
Author

Bump2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants