cattle example, LD calculator cleaned up #51
Conversation
Great, thanks Yuxin! Could you please rebase this onto main and fix the merge conflicts? If you haven't done this before, then from the command line:

```sh
git checkout main
git pull upstream main  # this assumes that this repo is called "upstream" in `git remote -v`
git checkout processor_cleanup
git rebase main
```

and then you'll have to edit the files that git lists, to resolve the

```
<<<<<<< HEAD
old code
=======
new code
>>>>>>> commit
```

inserts added by git. Once that is done,

```sh
git add path/to/modified/file
git rebase --continue
```

to move to the next commit (or finish the rebase). Let me know if you have questions!
hmm this still has merge conflicts @ningyuxin1999

I was only updating some comments yesterday, and now the conflicts should be resolved :)
nspope left a comment:
Thanks @ningyuxin1999 -- it looks like there's a few things to fix that I've flagged here.
Also, it'd be good if you could try running the training pipeline through on this example (the number of sims can be set to something low, like 100) just to make sure it doesn't error out.
@andrewkern can you take a quick look as well?
workflow/config/cattle_21Gen.yaml
```yaml
max_time: 130000
time_rate: 0.06
samples:
  pop: 25
```
the population name should be "pop0" to match the msprime demography, I think?
edit: now I see, you're doing samples={"pop0": self.samples["pop"]} in the msprime invocation below. But, it'd be easier/nicer to just have the correct name here, so that you can do samples=self.samples
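For illustration, a minimal sketch of what this would look like (hypothetical values; the demography construction is assumed -- the point is just that the config key matches the msprime population name, so the dict can be passed straight through):

```python
import msprime

# Assumed setup: a demography whose single population is named "pop0",
# matching the key used in the YAML config.
demography = msprime.Demography()
demography.add_population(name="pop0", initial_size=10_000)

samples = {"pop0": 25}  # config value, already keyed by the msprime population name

ts = msprime.sim_ancestry(
    samples=samples,          # i.e. samples=self.samples, no renaming needed
    demography=demography,
    sequence_length=2e6,
    recombination_rate=1e-8,  # illustrative value
)
```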
Done
```yaml
packed_sequence: False

simulator:
  class_name: "VariablePopulationSize"
```
The class name should be "Cattle21Gen", I think
yep, agreed
I changed this, but in the long run I think I would like to use the same class for the cattle example and the VariablePopulationSize example. In my opinion the dependent population sizes make much more sense as a prior and could be used in any case.
workflow/scripts/ts_simulators.py
```python
geno = ts.genotype_matrix().T
num_sample = geno.shape[0]
if (geno == 2).any():
    num_sample *= 2
```
you can just use num_sample = ts.num_samples here
workflow/scripts/ts_simulators.py
```python
row_sum != 0,
row_sum != num_sample,
row_sum > num_sample * 0.05,
num_sample - row_sum > num_sample * 0.05
```
just to make this clearer to read (and settable), put "maf": 0.05 under the "FIXED PARAMETER" heading in the default_config dict above, and change these lines to:
row_sum > num_sample * self.maf
etc
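Putting the suggestions above together, an excerpt-style sketch of the filter as it might sit inside the simulator (`ts` is the simulated tskit tree sequence; `self.maf` would come from `"maf": 0.05` in `default_config`):

```python
import numpy as np

num_sample = ts.num_samples                 # no need to infer this from the genotype matrix
geno = ts.genotype_matrix().T               # shape: (samples, sites)
row_sum = np.sum(geno, axis=0)              # derived-allele count per site

keep = np.logical_and(
    row_sum > num_sample * self.maf,               # derived allele above the MAF threshold
    num_sample - row_sum > num_sample * self.maf,  # ancestral allele above the MAF threshold
)
geno = geno[:, keep]
```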
Done
workflow/scripts/ts_simulators.py
```python
row_sum = np.sum(geno, axis=0)
keep = np.logical_and.reduce([
    row_sum != 0,
    row_sum != num_sample,
```
I think that row_sum != 0 , row_sum != num_sample isn't needed with the MAF filter directly below, so delete these two lines
Yes, I deleted them.
workflow/scripts/ts_simulators.py
```python
for i in range(1, self.num_time_windows):
    new_value = 10 ** self.pop_sizes[0] - 1
    while new_value > self.pop_sizes[1] or new_value < self.pop_sizes[0]:
        new_value = log_values[i - 1] * 10 ** np.random.uniform(-1, 1)
```
I'm having trouble getting my head around this -- we're multiplying log10 Ne by U[0.1, 10] -- is this what's intended? e.g. this isn't intended to be natural units Ne multiplied by some scalar?
once I understand what the intended process is here, maybe we could reparameterize in terms of iid uniforms (e.g. without rejection sampling), as this is most definitely not a box uniform any more
i'm a bit confused here too because the vector named log_values will initially be populated by the BoxUniform draws
I think log_values is confusing here. What this should actually do (and currently does not) is draw the first population size between 10 and 100000, but uniformly on a log10 scale. So if we draw alpha from a BoxUniform between 1 and 5, the first population size should be 10^alpha. In each subsequent time window the population size is then N_{i+1} = N_i * 10^beta, where beta is uniform in [-1, 1].
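In code, a minimal sketch of that intended process (illustrative names and seed; how to handle draws that leave the allowed range is discussed just below):

```python
import numpy as np

rng = np.random.default_rng(42)
num_time_windows = 21
log10_min, log10_max = 1.0, 5.0              # population sizes between 10 and 100000

alpha = rng.uniform(log10_min, log10_max)    # first size, uniform on the log10 scale
pop_sizes = [10 ** alpha]
for _ in range(1, num_time_windows):
    beta = rng.uniform(-1.0, 1.0)            # log10 fold-change for this window
    pop_sizes.append(pop_sizes[-1] * 10 ** beta)   # N_{i+1} = N_i * 10**beta
```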
This is to prevent the posterior distribution from focusing too much on the larger population sizes, and thus probably makes sense for the VariablePopulationSize class as well. I think the performance for the more realistic population size scenarios should also improve using this prior with dependent population sizes, as it prevents completely weird population size changes. Basically, if we drew the population sizes uniformly and independently between 10 and 100000, too much of the training data would contain large ancestral population sizes.
This is a bit of a tangent, but worth discussing ... I think we should try to keep the posterior and prior parameterization consistent. For example (simplifying things for exposition), say I have a model parameterized by the random walk Ne[i] = Ne[i - 1] * U[lower, upper]. Then the values of theta that the simulator returns -- the targets for the NN -- should be the U[lower, upper] values, not the Ne values, because we're assuming a uniform prior downstream.
Though this is a little cumbersome (in that the user has to then manually reparameterize posterior samples, to get what they actually want), it's crucial if we want to try to combine posteriors across chunks of sequence
Also I think using np.random.uniform here won't respect the random seed, as this has only been set for torch
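For example, a sketch of two ways to keep this draw reproducible (whether to route it through torch or give numpy its own seeded generator is a design choice, not something settled in this thread):

```python
import numpy as np
import torch

torch.manual_seed(42)  # the seed the pipeline already sets

# Option A (sketch): draw the increment through torch, so the existing seed applies.
beta = (torch.rand(1) * 2.0 - 1.0).item()   # Uniform on [-1, 1)

# Option B (sketch): keep numpy, but use an explicitly seeded Generator.
rng = np.random.default_rng(42)
beta_np = rng.uniform(-1.0, 1.0)
```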
Ok, we could certainly use the U[lower, upper] values instead of the actual population sizes. But we would have to change a small detail in the generation of the population sizes then. Currently, we redraw U[lower, upper] if Ne[i-1] * U[lower, upper] is larger than the maximal population size allowed (or smaller than the minimal). Instead of redrawing we could simply set the population size to the max or min value allowed in this case.
I have now implemented the new version, where the population sizes will be a bit more sticky at the population size range borders.
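To make the agreed-on scheme concrete, a hedged sketch (function and variable names are illustrative, not the actual implementation): the simulator's targets are the iid uniform draws, and steps that would leave the allowed range are clipped to the border rather than redrawn.

```python
import numpy as np

def draw_theta(rng, num_time_windows=21):
    """theta = (alpha, beta_1, ..., beta_{T-1}), each iid uniform -- the NN targets."""
    alpha = rng.uniform(1.0, 5.0)                               # log10 of the first Ne
    betas = rng.uniform(-1.0, 1.0, size=num_time_windows - 1)   # log10 fold-changes
    return np.concatenate([[alpha], betas])

def theta_to_pop_sizes(theta, log10_min=1.0, log10_max=5.0):
    """Deterministic map from theta to per-window Ne, sticky at the range borders."""
    log10_N = [theta[0]]
    for beta in theta[1:]:
        log10_N.append(np.clip(log10_N[-1] + beta, log10_min, log10_max))
    return 10.0 ** np.asarray(log10_N)

rng = np.random.default_rng(0)
theta = draw_theta(rng)                 # returned as targets, consistent with a box-uniform prior
pop_sizes = theta_to_pop_sizes(theta)   # used to build the msprime demography
```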
workflow/scripts/ts_simulators.py
```python
def __init__(self, config: dict):
    super().__init__(config, self.default_config)
    self.parameters = ["recombination_rate"]
```
some stuff got accidentally deleted in this class, probably while fixing merge conflicts -- can you please set it back to how it was?
workflow/scripts/ts_processors.py
```python
-------
DataFrame
    Table with the distance_bins as index, and the mean value of
Compute LD for a subset of SNPs and return a DataFrame with only the mean r2.
```
i think it's useful to keep some of the details about the parameters that are needed in here
workflow/scripts/ts_processors.py
```python
snp_pairs = np.unique([((snps[i] + i) % n_SNP, (snps[i + 1] + i) % n_SNP) for i in range(len(snps) - 1)], axis=0)
snp_pairs = np.unique([((snps[i] + i) % n_SNP, (snps[i + 1] + i) % n_SNP)
                       for i in range(len(snps) - 1)], axis=0)
```
for my eyes, it's way easier to have the list comprehension on a single line, splitting list comprehensions across lines makes things harder to read
workflow/scripts/ts_processors.py
```python
ld = pd.concat([ld, sd])

ld2 = ld.dropna().groupby("dist_group", observed=False).agg(agg_bins)
ld2 = ld.dropna().groupby("dist_group", observed=True).agg(agg_bins)
```
i believe that setting observed=True here will throw an error, at least it did for me when first reimplementing this routine
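For context, a small self-contained example (hypothetical bins and values) of why the flag matters when grouping by a categorical column produced by `pd.cut`: with `observed=False` every distance bin appears in the output even if it received no SNP pairs, so the summary has a fixed length, whereas `observed=True` silently drops empty bins.

```python
import pandas as pd

bins = [0, 10, 100, 1000]
ld = pd.DataFrame({"dist": [5, 7, 300], "r2": [0.9, 0.8, 0.1]})
ld["dist_group"] = pd.cut(ld["dist"], bins=bins)   # categorical bins; (10, 100] is empty

print(ld.groupby("dist_group", observed=False)["r2"].mean())  # 3 rows, empty bin is NaN
print(ld.groupby("dist_group", observed=True)["r2"].mean())   # 2 rows, empty bin dropped
```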
added some additional comments
I started to fix the issues in this branch. Currently, I am no longer including the LD statistics and am just using the SFS. I will look at the LD stuff again later. The new prior is done and seems to work fine.

Hm ... we're batching the sampling during plot diagnostics, so the per-batch memory requirement should be pretty tightly constrained (as much as it would be during training). But maybe I need to be transferring results to the CPU at the end of each batch. Any chance you could drop some print statements into …? I'm not sure what could be happening with the prior -- do you mean that it's stuck in an endless while loop during the simulation? Regardless, this shouldn't be related to GPU usage (simulation is done on the CPU), but is maybe an issue with parallelization on your workstation? Could you try running snakemake with --jobs 1 to see if the issue persists?
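On the CPU-transfer point, a hedged sketch of the kind of change being floated (the loop structure, `posterior`, and `observations` are placeholders for however the diagnostics code actually iterates; `posterior.sample` is the usual sbi call):

```python
import torch

all_draws = []
for x_obs in observations:
    with torch.no_grad():
        draws = posterior.sample((1000,), x=x_obs)   # sampled on the GPU
    all_draws.append(draws.detach().cpu())           # move off the GPU before the next batch
draws = torch.stack(all_draws)
```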
i've seen this behavior once before with the posterior sampling being stuck in an infinite loop. it had to do with not properly defining the prior for use with SBI. looking at your code i don't see anything obvious, but i'd go back and compare it to the …
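For reference, one way to sanity-check a custom prior against sbi's expectations is to run it through `process_prior`, which exercises `.sample()`/`.log_prob()` and checks the batch shapes (a sketch with illustrative bounds, not the project's actual prior):

```python
import torch
from sbi.utils import BoxUniform, process_prior

low = torch.full((21,), 1.0)    # e.g. log10(Ne) lower bounds -- illustrative
high = torch.full((21,), 5.0)   # e.g. log10(Ne) upper bounds -- illustrative
prior = BoxUniform(low=low, high=high)
prior, *_ = process_prior(prior)   # fails loudly if the prior is not usable with sbi
```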
I already tried that and it did not change the outcome. I will look into the other points you mentioned.

Thanks for the hint @andrewkern. Took me a while, but I found the bug. I was passing on the transformed values instead of the prior samples. However, I learned a lot while searching for this, and found another unrelated bug.

Sorry, I'm not following -- do you mean, have the stored prior be uniform but the actual prior be something more complex, with the same support?

Yes, that is what I was thinking of. As long as the support is the same this should in principle work, but I do not know how this would affect the predictions. So maybe this is not the route to go?

I don't think it's an issue with what's in there now, as we're using the prior samples (not the log density) for everything currently implemented, so the correct non-uniform prior will get used implicitly. Where it gets a little hairy, conceptually, is when we try to aggregate posteriors across empirical windows, which is what I'm working on now (because then we effectively duplicate the prior). But maybe we just accept that this sort of aggregation is not sensible to use with a non-uniform prior.
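For reference, the "duplicated prior" issue can be written out explicitly (standard Bayes algebra, assuming the windows $x_1,\dots,x_K$ are conditionally independent given $\theta$):

$$
p(\theta \mid x_1,\dots,x_K) \;\propto\; p(\theta)\prod_{k=1}^{K} p(x_k \mid \theta)
\;\propto\; \frac{\prod_{k=1}^{K} p(\theta \mid x_k)}{p(\theta)^{K-1}},
$$

so multiplying per-window posteriors without dividing by $p(\theta)^{K-1}$ effectively raises the prior to the $K$-th power. With a flat prior that factor is constant on the support and harmless; with a non-uniform prior it is not.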
checking in on this PR -- @fbaumdicker what's the current status here?

We fixed the (hopefully) last bug today. So I am now cleaning up the code and will then rebase.
if this is the new analysis, we want this file to be named something else so it doesn't get confused with the other variable-popn-size experiment we have done. maybe 'dependent-popn-size'?
also, rather than put this here, this should go in its own experiment directory
```yaml
gpu_resources:
  runtime: "4h"
  mem_mb: 50000
  gpus: 20
```
20 GPUs?
```yaml
  mem_mb: 50000
  gpus: 20
  slurm_partition: "kerngpu,gpu"
  slurm_extra: "--gres=gpu:20 --constraint=a100"
```
requesting 20 A100s per job is huge -- i don't believe anyone has access to that
oops I think that's a weird typo
| "samples": {"pop0": 25}, | ||
| "sequence_length": 2e6, | ||
| "mutation_rate": 1e-8, | ||
| "num_time_windows": 21, | ||
| "maf": 0.05, |
please revert these changes -- you shouldn't have to change the defaults on VariablePopulationSize. instead, just set them in your config for your experiment
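A hedged illustration of the pattern being asked for (the import path and exact keys are illustrative, taken from snippets in this thread; the constructor signature matches the `__init__` shown above, which merges the passed config with `default_config`):

```python
from ts_simulators import VariablePopulationSize  # as in workflow/scripts/ts_simulators.py

# Experiment-specific values live in the experiment's own config, not in the class defaults.
experiment_config = {
    "samples": {"pop0": 25},
    "num_time_windows": 21,
    "max_time": 130000,
    "maf": 0.05,
}
simulator = VariablePopulationSize(config=experiment_config)
```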
| "recomb_rate": [1e-9, 2e-8], # Range for recombination rate | ||
| # TIME PARAMETERS | ||
| "max_time": 100000, # Maximum time for population events | ||
| "max_time": 130000, # Maximum time for population events |
| "max_time": 130000, # Maximum time for population events | |
| "max_time": 100000, # Maximum time for population events |
| samples={"pop0": self.samples["pop"]}, | ||
| samples=self.samples, |
unclear to me why this is being changed, but this should be correct. again my advice is to revert all changes to this class
Hey @ningyuxin1999, thanks for putting up the code. As Andy says, we want to keep things as compartmentalized as possible to avoid impacting the other experiments. I think the easiest thing to do here would be to make a new PR, where you do the following:

…

Then we'll have the code organized in the same way as for the other parts of the paper, and there shouldn't be edits to classes other than those that are unique to this experiment. The commit history will also be a bit cleaner, as there won't be a merge from main into this feature branch. Let me know if you have questions or if you want to talk through the code organization. Thanks!!!
Thanks for your reviews and suggestions! @andrewkern @nspope 🙏 I will open the new PR tomorrow, or on Wednesday at the latest, with the dependent prior and the new experiments using that prior.
Sounds good, sorry for the hassle! And just to be totally clear, in my suggestions above "new PR" == "new feature branch", so we start with a clean commit history.
This is the example cattle configuration. Each population size is dependent on the previous one instead of being drawn independently at random.