cattle example, LD calculator cleaned up #51
Conversation
Great, thanks Yuxin! Could you please rebase this onto main and fix the merge conflicts? If you haven't done this before, then from the command line:

```sh
git checkout main
git pull upstream main  # this assumes that this repo is called "upstream" in `git remote -v`
git checkout processor_cleanup
git rebase main
```

and then you'll have to edit the files that git lists, to resolve the

```
<<<<<<< HEAD
old code
=======
new code
>>>>>>> commit
```

inserts added by git. Once that is done,

```sh
git add path/to/modified/file
git rebase --continue
```

to move to the next commit (or finish the rebase). Let me know if you have questions!
hmm this still has merge conflicts @ningyuxin1999

I was only updating some comments yesterday, and now the conflicts should be resolved :)
nspope left a comment:
Thanks @ningyuxin1999 -- it looks like there's a few things to fix that I've flagged here.
Also, it'd be good if you could try running the training pipeline through on this example (the number of sims can be set to something low, like 100) just to make sure it doesn't error out.
@andrewkern can you take a quick look as well?
workflow/config/cattle_21Gen.yaml
```yaml
max_time: 130000
time_rate: 0.06
samples:
  pop: 25
```
the population name should be "pop0" to match the msprime demography, I think?
edit: now I see, you're doing samples={"pop0": self.samples["pop"]} in the msprime invocation below. But, it'd be easier/nicer to just have the correct name here, so that you can do samples=self.samples
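For illustration, a minimal sketch of what this would look like (hypothetical values; the demography construction is assumed -- the point is just that the config key matches the msprime population name, so the dict can be passed straight through):

```python
import msprime

# Assumed setup: a demography whose single population is named "pop0",
# matching the key used in the YAML config.
demography = msprime.Demography()
demography.add_population(name="pop0", initial_size=10_000)

samples = {"pop0": 25}  # config value, already keyed by the msprime population name

ts = msprime.sim_ancestry(
    samples=samples,          # i.e. samples=self.samples, no renaming needed
    demography=demography,
    sequence_length=2e6,
    recombination_rate=1e-8,  # illustrative value
)
```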
Done
```yaml
packed_sequence: False

simulator:
  class_name: "VariablePopulationSize"
```
The class name should be "Cattle21Gen", I think
yep, agreed
I changed this, but in the long run I think I would like to use the same class for the cattle example and the VariablePopulationSize example. In my opinion the dependent population sizes make much more sense as a prior and could be used in any case.
workflow/scripts/ts_simulators.py
```python
geno = ts.genotype_matrix().T
num_sample = geno.shape[0]
if (geno == 2).any():
    num_sample *= 2
```
you can just use num_sample = ts.num_samples here
workflow/scripts/ts_simulators.py
```python
row_sum != 0,
row_sum != num_sample,
row_sum > num_sample * 0.05,
num_sample - row_sum > num_sample * 0.05
```
just to make this clearer to read (and settable), put "maf": 0.05 under the "FIXED PARAMETER" heading in the default_config dict above, and change these lines to:
row_sum > num_sample * self.maf
etc
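Putting the suggestions above together, an excerpt-style sketch of the filter as it might sit inside the simulator (`ts` is the simulated tskit tree sequence; `self.maf` would come from `"maf": 0.05` in `default_config`):

```python
import numpy as np

num_sample = ts.num_samples                 # no need to infer this from the genotype matrix
geno = ts.genotype_matrix().T               # shape: (samples, sites)
row_sum = np.sum(geno, axis=0)              # derived-allele count per site

keep = np.logical_and(
    row_sum > num_sample * self.maf,               # derived allele above the MAF threshold
    num_sample - row_sum > num_sample * self.maf,  # ancestral allele above the MAF threshold
)
geno = geno[:, keep]
```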
Done
workflow/scripts/ts_simulators.py
```python
row_sum = np.sum(geno, axis=0)
keep = np.logical_and.reduce([
    row_sum != 0,
    row_sum != num_sample,
```
I think that row_sum != 0 , row_sum != num_sample isn't needed with the MAF filter directly below, so delete these two lines
Yes, I deleted them.
workflow/scripts/ts_simulators.py
```python
for i in range(1, self.num_time_windows):
    new_value = 10 ** self.pop_sizes[0] - 1
    while new_value > self.pop_sizes[1] or new_value < self.pop_sizes[0]:
        new_value = log_values[i - 1] * 10 ** np.random.uniform(-1, 1)
```
I'm having trouble getting my head around this -- we're multiplying log10 Ne by U[0.1, 10] -- is this what's intended? e.g. this isn't intended to be natural units Ne multiplied by some scalar?
once I understand what the intended process is here, maybe we could reparameterize in terms of iid uniforms (e.g. without rejection sampling), as this is most definitely not a box uniform any more
i'm a bit confused here too because the vector named log_values will initially be populated by the BoxUniform draws
I think log_values is confusing here. What this should actually do (and currently does not) is draw the first population size between 10 and 100000, but uniformly on a log10 scale. So if we draw alpha from a BoxUniform between 1 and 5, the first population size should be 10^alpha. In each subsequent time window the population size is then N_{i+1} = N_i * 10^beta, where beta is uniform in [-1, 1].
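In code, a minimal sketch of that intended process (illustrative names and seed; how to handle draws that leave the allowed range is discussed just below):

```python
import numpy as np

rng = np.random.default_rng(42)
num_time_windows = 21
log10_min, log10_max = 1.0, 5.0              # population sizes between 10 and 100000

alpha = rng.uniform(log10_min, log10_max)    # first size, uniform on the log10 scale
pop_sizes = [10 ** alpha]
for _ in range(1, num_time_windows):
    beta = rng.uniform(-1.0, 1.0)            # log10 fold-change for this window
    pop_sizes.append(pop_sizes[-1] * 10 ** beta)   # N_{i+1} = N_i * 10**beta
```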
This is to prevent the posterior distribution from focusing too much on the larger population sizes, and thus probably makes sense for the VariablePopulationSize class as well. I think the performance for the more realistic population size scenarios should also improve using this prior with dependent population sizes, as it prevents completely weird population size changes. Basically, if we drew the population sizes uniformly and independently between 10 and 100000, too much of the training data would contain large ancestral population sizes.
This is a bit of a tangent, but worth discussing ... I think we should try to keep the posterior and prior parameterization consistent. For example (simplifying things for exposition), say I have a model parameterized by the random walk Ne[i] = Ne[i - 1] * U[lower, upper]. Then the values of theta that the simulator returns -- the targets for the NN -- should be the U[lower, upper] values, not the Ne values, because we're assuming a uniform prior downstream.
Though this is a little cumbersome (in that the user has to then manually reparameterize posterior samples, to get what they actually want), it's crucial if we want to try to combine posteriors across chunks of sequence
Also I think using np.random.uniform here won't respect the random seed, as this has only been set for torch
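For example, a sketch of two ways to keep this draw reproducible (whether to route it through torch or give numpy its own seeded generator is a design choice, not something settled in this thread):

```python
import numpy as np
import torch

torch.manual_seed(42)  # the seed the pipeline already sets

# Option A (sketch): draw the increment through torch, so the existing seed applies.
beta = (torch.rand(1) * 2.0 - 1.0).item()   # Uniform on [-1, 1)

# Option B (sketch): keep numpy, but use an explicitly seeded Generator.
rng = np.random.default_rng(42)
beta_np = rng.uniform(-1.0, 1.0)
```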
Ok, we could certainly use the U[lower, upper] values instead of the actual population sizes. But we would have to change a small detail in the generation of the population sizes then. Currently, we redraw U[lower, upper] if Ne[i-1] * U[lower, upper] is larger than the maximal population size allowed (or smaller than the minimal). Instead of redrawing we could simply set the population size to the max or min value allowed in this case.
I have now implemented the new version, where the population sizes will be a bit more sticky at the population size range borders.
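To make the agreed-on scheme concrete, a hedged sketch (function and variable names are illustrative, not the actual implementation): the simulator's targets are the iid uniform draws, and steps that would leave the allowed range are clipped to the border rather than redrawn.

```python
import numpy as np

def draw_theta(rng, num_time_windows=21):
    """theta = (alpha, beta_1, ..., beta_{T-1}), each iid uniform -- the NN targets."""
    alpha = rng.uniform(1.0, 5.0)                               # log10 of the first Ne
    betas = rng.uniform(-1.0, 1.0, size=num_time_windows - 1)   # log10 fold-changes
    return np.concatenate([[alpha], betas])

def theta_to_pop_sizes(theta, log10_min=1.0, log10_max=5.0):
    """Deterministic map from theta to per-window Ne, sticky at the range borders."""
    log10_N = [theta[0]]
    for beta in theta[1:]:
        log10_N.append(np.clip(log10_N[-1] + beta, log10_min, log10_max))
    return 10.0 ** np.asarray(log10_N)

rng = np.random.default_rng(0)
theta = draw_theta(rng)                 # returned as targets, consistent with a box-uniform prior
pop_sizes = theta_to_pop_sizes(theta)   # used to build the msprime demography
```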
workflow/scripts/ts_simulators.py
```python
def __init__(self, config: dict):
    super().__init__(config, self.default_config)
    self.parameters = ["recombination_rate"]
```
some stuff got accidentally deleted in this class, probably while fixing merge conflicts -- can you please set it back to how it was?
workflow/scripts/ts_processors.py
```python
-------
DataFrame
    Table with the distance_bins as index, and the mean value of
Compute LD for a subset of SNPs and return a DataFrame with only the mean r2.
```
i think it's useful to keep some of the details about the parameters that are needed in here
workflow/scripts/ts_processors.py
```python
snp_pairs = np.unique([((snps[i] + i) % n_SNP, (snps[i + 1] + i) % n_SNP) for i in range(len(snps) - 1)], axis=0)
snp_pairs = np.unique([((snps[i] + i) % n_SNP, (snps[i + 1] + i) % n_SNP)
                       for i in range(len(snps) - 1)], axis=0)
```
for my eyes, it's way easier to have the list comprehension on a single line, splitting list comprehensions across lines makes things harder to read
workflow/scripts/ts_processors.py
```python
ld = pd.concat([ld, sd])

ld2 = ld.dropna().groupby("dist_group", observed=False).agg(agg_bins)
ld2 = ld.dropna().groupby("dist_group", observed=True).agg(agg_bins)
```
i believe that setting observed=True here will throw an error, at least it did for me when first reimplementing this routine
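For context, a small self-contained example (hypothetical bins and values) of why the flag matters when grouping by a categorical column produced by `pd.cut`: with `observed=False` every distance bin appears in the output even if it received no SNP pairs, so the summary has a fixed length, whereas `observed=True` silently drops empty bins.

```python
import pandas as pd

bins = [0, 10, 100, 1000]
ld = pd.DataFrame({"dist": [5, 7, 300], "r2": [0.9, 0.8, 0.1]})
ld["dist_group"] = pd.cut(ld["dist"], bins=bins)   # categorical bins; (10, 100] is empty

print(ld.groupby("dist_group", observed=False)["r2"].mean())  # 3 rows, empty bin is NaN
print(ld.groupby("dist_group", observed=True)["r2"].mean())   # 2 rows, empty bin dropped
```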
added some additional comments
I started to fix the issues in this branch. Currently, I am no longer including the LD statistics and am just using the SFS. I will look at the LD stuff again later. The new prior is done and seems to work fine.

Hm ... we're batching the sampling during plot diagnostics, so the per-batch memory requirement should be pretty tightly constrained (as much as it would be during training). But maybe I need to be transferring results to the CPU at the end of each batch. Any chance you could drop some print statements into …? I'm not sure what could be happening with the prior -- do you mean that it's stuck in an endless while loop during the simulation? Regardless, this shouldn't be related to GPU usage (simulation is done on the CPU), but is maybe an issue with parallelization on your workstation? Could you try running snakemake with --jobs 1 to see if the issue persists?
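On the CPU-transfer point, a hedged sketch of the kind of change being floated (the loop structure, `posterior`, and `observations` are placeholders for however the diagnostics code actually iterates; `posterior.sample` is the usual sbi call):

```python
import torch

all_draws = []
for x_obs in observations:
    with torch.no_grad():
        draws = posterior.sample((1000,), x=x_obs)   # sampled on the GPU
    all_draws.append(draws.detach().cpu())           # move off the GPU before the next batch
draws = torch.stack(all_draws)
```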
i've seen this behavior once before with the posterior sampling being stuck in an infinite loop. it had to do with not properly defining the prior for use with SBI. looking at your code i don't see anything obvious, but i'd go back and compare it to the …
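For reference, one way to sanity-check a custom prior against sbi's expectations is to run it through `process_prior`, which exercises `.sample()`/`.log_prob()` and checks the batch shapes (a sketch with illustrative bounds, not the project's actual prior):

```python
import torch
from sbi.utils import BoxUniform, process_prior

low = torch.full((21,), 1.0)    # e.g. log10(Ne) lower bounds -- illustrative
high = torch.full((21,), 5.0)   # e.g. log10(Ne) upper bounds -- illustrative
prior = BoxUniform(low=low, high=high)
prior, *_ = process_prior(prior)   # fails loudly if the prior is not usable with sbi
```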
I already tried that and it did not change the outcome. I will look into the other points you mentioned.

Thanks for the hint @andrewkern. Took me a while, but I found the bug. I was passing on the transformed values instead of the prior samples. However, I learned a lot while searching for this, and found another unrelated bug.

Sorry, I'm not following -- do you mean, have the stored prior be uniform but the actual prior be something more complex, with the same support?

Yes, that is what I was thinking of. As long as the support is the same this should in principle work, but I do not know how this would affect the predictions. So maybe this is not the route to go?

I don't think it's an issue with what's in there now, as we're using the prior samples (not the log density) for everything currently implemented, so the correct non-uniform prior will get used implicitly. Where it gets a little hairy, conceptually, is when we try to aggregate posteriors across empirical windows, which is what I'm working on now (because then we effectively duplicate the prior). But maybe we just accept that this sort of aggregation is not sensible to use with a non-uniform prior.
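For reference, the "duplicated prior" issue can be written out explicitly (standard Bayes algebra, assuming the windows $x_1,\dots,x_K$ are conditionally independent given $\theta$):

$$
p(\theta \mid x_1,\dots,x_K) \;\propto\; p(\theta)\prod_{k=1}^{K} p(x_k \mid \theta)
\;\propto\; \frac{\prod_{k=1}^{K} p(\theta \mid x_k)}{p(\theta)^{K-1}},
$$

so multiplying per-window posteriors without dividing by $p(\theta)^{K-1}$ effectively raises the prior to the $K$-th power. With a flat prior that factor is constant on the support and harmless; with a non-uniform prior it is not.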
checking in on this PR -- @fbaumdicker what's the current status here?

We fixed the (hopefully) last bug today. So I am now cleaning up the code and will then rebase.
if this is the new analysis, we want this file to be named something else so it doesn't get confused with the other variable-popn-size experiment we have done. maybe 'dependent-popn-size'?
also, rather than put this here, this should go in its own experiment directory
```yaml
gpu_resources:
  runtime: "4h"
  mem_mb: 50000
  gpus: 20
```
20 GPUs?
```yaml
  mem_mb: 50000
  gpus: 20
  slurm_partition: "kerngpu,gpu"
  slurm_extra: "--gres=gpu:20 --constraint=a100"
```
requesting 20 A100s per job is huge -- i don't believe anyone has access to that
oops I think that's a weird typo
| "samples": {"pop0": 25}, | ||
| "sequence_length": 2e6, | ||
| "mutation_rate": 1e-8, | ||
| "num_time_windows": 21, | ||
| "maf": 0.05, |
please revert these changes -- you shouldn't have to change the defaults on VariablePopulationSize. instead, just set them in your config for your experiment
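A hedged illustration of the pattern being asked for (the import path and exact keys are illustrative, taken from snippets in this thread; the constructor signature matches the `__init__` shown above, which merges the passed config with `default_config`):

```python
from ts_simulators import VariablePopulationSize  # as in workflow/scripts/ts_simulators.py

# Experiment-specific values live in the experiment's own config, not in the class defaults.
experiment_config = {
    "samples": {"pop0": 25},
    "num_time_windows": 21,
    "max_time": 130000,
    "maf": 0.05,
}
simulator = VariablePopulationSize(config=experiment_config)
```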
| "recomb_rate": [1e-9, 2e-8], # Range for recombination rate | ||
| # TIME PARAMETERS | ||
| "max_time": 100000, # Maximum time for population events | ||
| "max_time": 130000, # Maximum time for population events |
| "max_time": 130000, # Maximum time for population events | |
| "max_time": 100000, # Maximum time for population events |
| samples={"pop0": self.samples["pop"]}, | ||
| samples=self.samples, |
unclear to me why this is being changed, but this should be correct. again my advice is to revert all changes to this class
Hey @ningyuxin1999, thanks for putting up the code. As Andy says, we want to keep things as compartmentalized as possible to avoid impacting the other experiments. I think the easiest thing to do here would be to make a new PR, where you do the following:

…

Then we'll have the code organized in the same way as for the other parts of the paper, and there shouldn't be edits to classes other than those that are unique to this experiment. The commit history will also be a bit cleaner, as there won't be a merge from main into this feature branch. Let me know if you have questions or if you want to talk through the code organization. Thanks!!!
Thanks for your reviews and suggestions! @andrewkern @nspope 🙏 I will open the new PR tomorrow, or on Wednesday at the latest, with the dependent prior and the new experiments using that prior.
Sounds good, sorry for the hassle! And just to be totally clear, in my suggestions above "new PR" == "new feature branch", so we start with a clean commit history.
This is the example cattle configuration. Each population size is dependent on the previous one instead of being drawn independently at random.