@svteichman svteichman commented Dec 19, 2025

This PR updates the discrete null fitting algorithm implemented in PR #175, integrates it into the code base, and tests it against `fit_null_symmetric()`. By default, the discrete algorithm is used automatically for discrete designs when $J < 150$; otherwise the symmetric algorithm is used (the argument `null_fit_alg` can be used to override these defaults). The $J = 150$ threshold is a heuristic, chosen by comparing runtimes and log likelihoods between the two approaches across a range of $n$, $J$, and $p$ in two real datasets. While the discrete approach typically achieves a higher log likelihood (often only slightly higher, but occasionally much higher), it becomes slower than the symmetric approach somewhere between $J = 100$ and $J = 200$, and quite a bit slower for $J > 500$.
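
For readers skimming the PR, the default selection rule described above can be sketched roughly as follows. This is an illustration only, not the package's actual code: the helper name `choose_null_fit_alg`, the `design_is_discrete` flag, the `J_threshold` parameter, and the string values `"discrete"` and `"symmetric"` are all assumptions made for the example.

```r
# Hypothetical helper illustrating the default described in this PR.
# Names and accepted values are assumptions, not radEmu's real interface.
choose_null_fit_alg <- function(J, design_is_discrete,
                                null_fit_alg = NULL, J_threshold = 150) {
  # An explicit choice by the user overrides the default.
  if (!is.null(null_fit_alg)) {
    return(match.arg(null_fit_alg, c("discrete", "symmetric")))
  }
  # Otherwise use the discrete algorithm only for discrete designs with small J.
  if (design_is_discrete && J < J_threshold) "discrete" else "symmetric"
}

choose_null_fit_alg(J = 128, design_is_discrete = TRUE)
choose_null_fit_alg(J = 430, design_is_discrete = TRUE)
choose_null_fit_alg(J = 430, design_is_discrete = TRUE, null_fit_alg = "discrete")
```

Keeping the threshold as a parameter in the sketch reflects that $J = 150$ is a heuristic and may need revisiting.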

This is a subset of runtime results from tests that are skipped or commented out in "test-null_fit_discrete":

# the next set of tests compares the timing of fit_null_discrete to fit_null_symmetric
# for a variety of n, J, and p, using the soil dataset included in `corncob` and the
# wirbel dataset included in `radEmu`. Each example runs either 10, 20, or 30 robust
# score tests and compares across the two methods.
# Different sized datasets are generated by filtering samples, considering taxa at either the 
# species or genus level, and in some cases subsetting to one phylum or another 

# tldr:

# wirbel
# n = 126, J = 128, p = 2, sandwich 42 seconds, discrete 16 seconds
# n = 566, J = 133, p = 5, sandwich 306 seconds, discrete 334 seconds
# n = 126, J = 430, p = 2, sandwich 143 seconds, discrete 421 seconds
# n = 126, J = 758, p = 2, sandwich 8 minutes, discrete 60 minutes

# soil
# n = 119, J = 109, p = 3, sandwich 35 seconds, discrete 12 seconds
# n = 119, J = 147, p = 3, sandwich 50 seconds, discrete 49 seconds
# n = 64, J = 234, p = 2, sandwich 121 seconds, discrete 165 seconds
# n = 64, J = 242, p = 2, sandwich 101 seconds, discrete 95 seconds
# n = 64, J = 479, p = 2, sandwich 140 seconds, discrete 534 seconds
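
To make the crossover easier to see, the timings listed above can be put in a data frame and expressed as a ratio of discrete to sandwich runtime. This is only a convenience sketch: the column names are made up for the example, and the two values reported in minutes are converted to seconds as if they were exact.

```r
# Timings copied from the summary above; column names are illustrative.
timings <- data.frame(
  dataset      = c(rep("wirbel", 4), rep("soil", 5)),
  n            = c(126, 566, 126, 126, 119, 119, 64, 64, 64),
  J            = c(128, 133, 430, 758, 109, 147, 234, 242, 479),
  p            = c(2, 5, 2, 2, 3, 3, 2, 2, 2),
  # 8 and 60 minutes are converted to seconds (reported values are rounded).
  sandwich_sec = c(42, 306, 143, 8 * 60, 35, 50, 121, 101, 140),
  discrete_sec = c(16, 334, 421, 60 * 60, 12, 49, 165, 95, 534)
)

# Ratio > 1 means the discrete algorithm was slower in that example.
timings$ratio <- round(timings$discrete_sec / timings$sandwich_sec, 2)

# Sort by J to see where the ratio crosses 1.
timings[order(timings$J), ]
```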

One additional note: I also experimented with increasing the discrete algorithm's root mean score norm tolerance, but this did not reduce runtime much, especially for large $J$.
