@lkirk lkirk commented Sep 15, 2025

Precompute A/B counts for each sample set. Previously, we recomputed them redundantly for each site pair in our results matrix. The precomputation happens in a function called `get_mutation_sample_sets`, which takes the list of sets (`tsk_bitset_t`) holding the samples that carry each mutation and intersects each of them with the sample sets passed in by the user. The result is an expanded list of sets with one set per mutation per sample set. During this operation, we also count the samples carrying the given allele for each mutation, so the counts no longer need to be recomputed for every site pair.
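To make the precomputation concrete, here is a minimal sketch of the idea, not the code in this PR. It assumes a hypothetical fixed-width `bitset_t` in place of `tsk_bitset_t`, and the names `popcount64` and `precompute_mutation_sample_sets` and the `k * num_mutations + m` output layout are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for tsk_bitset_t: a fixed number of 64-bit words. */
#define NUM_WORDS 2

typedef struct {
    uint64_t words[NUM_WORDS];
} bitset_t;

/* Count the set bits in one word (Kernighan's method, portable C). */
uint32_t
popcount64(uint64_t x)
{
    uint32_t n = 0;
    while (x != 0) {
        x &= x - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* For every (sample set k, mutation m) pair, store the intersection of the
 * mutation's samples with the sample set at index k * num_mutations + m,
 * and store the size of that intersection at the same index in out_counts.
 * Both output arrays must hold num_sample_sets * num_mutations entries. */
void
precompute_mutation_sample_sets(const bitset_t *mutation_samples,
    size_t num_mutations, const bitset_t *sample_sets, size_t num_sample_sets,
    bitset_t *out_sets, uint32_t *out_counts)
{
    size_t k, m, w, idx;

    assert(mutation_samples != NULL && sample_sets != NULL);
    assert(out_sets != NULL && out_counts != NULL);
    for (k = 0; k < num_sample_sets; k++) {
        for (m = 0; m < num_mutations; m++) {
            idx = k * num_mutations + m;
            out_counts[idx] = 0;
            for (w = 0; w < NUM_WORDS; w++) {
                out_sets[idx].words[w]
                    = mutation_samples[m].words[w] & sample_sets[k].words[w];
                out_counts[idx] += popcount64(out_sets[idx].words[w]);
            }
        }
    }
}
```

With the counts stored alongside the intersected sets, the per-site-pair loop only needs to intersect two precomputed sets to get the joint (AB) count; the marginal A and B counts become array lookups.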

In addition to precomputation, we add a non-normalized version of `compute_general_two_site_stat_result` for computing stats from biallelic loci. We dispatch the computation of the result based on the number of alleles at the two loci being compared. If both loci have exactly 2 alleles, we simply perform a single LD computation on the derived alleles of the two loci. This removes the need to compute a matrix of LD values and then take a weighted sum. It is much more efficient, and it means we only run the full multiallelic LD routine on sites that are actually multiallelic.
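The dispatch can be sketched as follows. This is an illustration under stated assumptions, not the implementation in this PR: the statistic is D, allele 0 is taken to be ancestral, the input is a hypothetical row-major table of joint allele counts, and the multiallelic path weights each derived-allele pair equally by `1 / ((num_a - 1) * (num_b - 1))`. The names `ld_d` and `two_site_d` are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* D = p_AB - p_A * p_B, computed from counts out of n samples. */
double
ld_d(double n_AB, double n_A, double n_B, double n)
{
    assert(n > 0);
    return n_AB / n - (n_A / n) * (n_B / n);
}

/* hap_counts is a row-major [num_a][num_b] table of joint allele counts,
 * with allele 0 assumed ancestral at both sites (an assumption of this
 * sketch). Biallelic pairs take the fast path; all others take the
 * weighted sum over derived-allele pairs. */
double
two_site_d(const double *hap_counts, size_t num_a, size_t num_b, double n)
{
    size_t a, b, i, j;
    double n_A, n_B, weight, sum = 0;

    assert(hap_counts != NULL && num_a >= 2 && num_b >= 2);
    if (num_a == 2 && num_b == 2) {
        /* Fast path: one computation on the derived alleles, no weighting. */
        n_A = hap_counts[2] + hap_counts[3];
        n_B = hap_counts[1] + hap_counts[3];
        return ld_d(hap_counts[3], n_A, n_B, n);
    }
    /* General path: one D value per derived-allele pair, equally weighted. */
    weight = 1.0 / (double) ((num_a - 1) * (num_b - 1));
    for (a = 1; a < num_a; a++) {
        n_A = 0;
        for (j = 0; j < num_b; j++) {
            n_A += hap_counts[a * num_b + j];
        }
        for (b = 1; b < num_b; b++) {
            n_B = 0;
            for (i = 0; i < num_a; i++) {
                n_B += hap_counts[i * num_b + b];
            }
            sum += weight * ld_d(hap_counts[a * num_b + b], n_A, n_B, n);
        }
    }
    return sum;
}
```

Under this weighting the general path would give the same answer for a biallelic pair, since there is a single derived-allele pair with weight 1; the fast path just skips the loop and the weight computation.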

@lkirk lkirk force-pushed the two-locus-precompute-counts branch 2 times, most recently from 97db517 to 1b7c523 Compare September 15, 2025 21:11
@lkirk lkirk force-pushed the two-locus-precompute-counts branch from 1b7c523 to eb56840 Compare September 15, 2025 21:11