Spidna #46
one decision I punted on here: what to do about the summary stats embeddings that @ningyuxin1999 did
I just discussed this with @ningyuxin1999 today. We can write the summary stats processor for the summary statistics we used: a combination of the site frequency spectrum (SFS) and mean linkage disequilibrium (LD) in different distance bins.
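For reference, here is a minimal sketch of what those two summaries could look like, assuming a 0/1 haploid genotype matrix of shape (sites, haploids) and r^2 as the LD measure. The function names, the choice of r^2, and the bin edges are illustrative assumptions, not the processor from this PR or the exact statistics used in the original pipeline.

```python
import numpy as np


def site_frequency_spectrum(genotypes):
    """Unfolded SFS from a (num_sites, num_haploids) 0/1 matrix.

    Entry k counts the sites carrying exactly k derived alleles.
    """
    num_haploids = genotypes.shape[1]
    derived_counts = genotypes.sum(axis=1).astype(int)
    return np.bincount(derived_counts, minlength=num_haploids + 1)


def mean_ld_by_distance(genotypes, positions, bin_edges):
    """Mean r^2 over pairs of sites, grouped by physical distance."""
    # Monomorphic sites have zero variance, so r^2 is undefined for them.
    counts = genotypes.sum(axis=1)
    polymorphic = (counts > 0) & (counts < genotypes.shape[1])
    geno = genotypes[polymorphic].astype(float)
    pos = np.asarray(positions, dtype=float)[polymorphic]

    num_bins = len(bin_edges) - 1
    means = np.full(num_bins, np.nan)  # NaN marks bins with no pairs
    if geno.shape[0] < 2:
        return means

    r2 = np.corrcoef(geno) ** 2  # rows (sites) are the variables
    i, j = np.triu_indices(geno.shape[0], k=1)  # each pair once
    pair_r2 = r2[i, j]
    pair_bin = np.digitize(np.abs(pos[j] - pos[i]), bin_edges) - 1
    for b in range(num_bins):
        in_bin = pair_bin == b
        if in_bin.any():
            means[b] = pair_r2[in_bin].mean()
    return means


# Toy data (hypothetical): 4 sites, 4 haploid samples.
genotypes = np.array([[1, 0, 0, 0],
                      [1, 1, 0, 0],
                      [1, 1, 0, 0],
                      [0, 0, 1, 1]])
positions = [10, 20, 120, 130]
bin_edges = [0, 50, 200]

sfs = site_frequency_spectrum(genotypes)
ld = mean_ld_by_distance(genotypes, positions, bin_edges)
summary = np.concatenate([sfs, ld])  # one fixed-length vector per window
```

Because the result is a fixed-length vector, vectors from several windows or replicates can be stacked and averaged, which is one way a mean aggregation step could slot in.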
the big question to me here is whether we do mean aggregation of those summaries as you were doing originally, and if so, how, given our workflow. a few alternatives to consider:
nspope left a comment
LGTM!
workflow/scripts/ts_processors.py (outdated)
    # MAF filtering
    if self.maf != 0:
        num_sample = snp.shape[1]  # Now using shape[1] since matrix isn't transposed
        if (snp==2).any():
This is detecting ploidy, right? It's a little messy to do this on the fly (e.g. would fail if there were no homozygous derived alleles, which probably won't happen in practice, but still ...)
A better way is to do this directly from the individuals in the ts, e.g.
ploidy = [ind.nodes.size for ind in ts.individuals()]
or if we only need the number of haploids
samples = ts.num_samples
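To make the suggestion concrete, here is a minimal sketch of a MAF filter that takes the haploid sample count as an explicit argument instead of inferring ploidy from the genotype values. The function name `maf_filter` and the toy matrix are hypothetical; in the real processor the count would come from `ts.num_samples`.

```python
import numpy as np


def maf_filter(snp, num_haploid_samples, maf):
    """Keep sites whose minor allele frequency is at least `maf`.

    snp: (num_sites, num_columns) matrix of derived-allele counts.
    num_haploid_samples: total haploid genomes, e.g. ts.num_samples.
    """
    if maf == 0:
        return snp
    derived_freq = snp.sum(axis=1) / num_haploid_samples
    minor_freq = np.minimum(derived_freq, 1 - derived_freq)
    return snp[minor_freq >= maf]


# Two diploid individuals (4 haploid genomes), no homozygous-derived
# entries, so checking for a 2 would not reveal that the data is diploid.
snp = np.array([[1, 0],
                [1, 1],
                [0, 1]])
kept = maf_filter(snp, num_haploid_samples=4, maf=0.3)
```

With the correct denominator of 4, only the middle site (derived frequency 0.5) passes the 0.3 threshold. Dividing by the 2 columns instead would keep the other two sites, which is the failure mode described above.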
this MAF filtering step was part of Yuxin's pipeline that I just copied over. I agree it's a bit messy to do it at this step. let me push a change to use samples from ts.num_samples.
okay, pushed a change @nspope and it still passes tests
PR for the SPIDNA network and associated simulator / processor for genotypes. Also includes testing.
Here are representative results from a 3-epoch model where I trained the embedder and normalizing flow separately.
Posterior at prior high:

Posterior at prior low:

Posterior at prior mean:

Calibration (looks great!):

Concentration:

Posterior Expectation:

Overall, I think this is ready to go! We still need to decide on the parameterization for a 'production' run, but this run was done with the parameters specified in workflow/config/variable_popnSize_spidna.yaml