Some existing CNV callers (e.g. CANOES, ExomeDepth, CLAMMS) construct a pooled reference for each test sample from a subset of the available control samples, selected for similar read depth profiles.
There are some practical problems with the approach of selecting control samples on the fly:
- The number of control samples to include in the subset is either arbitrary or computationally expensive to optimize by simulation.
- The read depths of all control samples at all bins must be retained and examined. With a large initial pool, this could bloat both the `cnv_reference.cnn` file size and the in-memory footprint.
- For validation purposes, given limited resources, not all possible combinations of control samples can be fully tested: a combinatorial number of references could be generated from a fixed pool, even with a fixed subset size.
Strategy:
- In `reference`, optionally (with `--cluster`) cluster the input samples to create additional 'log2_*', 'spread_*', and maybe 'depth_*' columns for distinct subsets of control samples, alongside the usual all-sample 'log2', 'spread', and 'depth' columns. Clustering can be done with `scipy.cluster.hierarchy`, probably using Pearson correlation as the metric and cutting the tree to yield distinct clusters, with a minimum cluster size of e.g. 5 or 10. Consider something like a quick bootstrap or cross-validation of the tree to determine well-supported clusters. Or use MCL.
- In `fix`, check the given test sample against the available global 'log2' and cluster-specific 'log2_*' columns to select the most appropriate profile. Log which profile is chosen, and use it for normalization. The metric between the sample and each profile can again be Pearson correlation.
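The clustering step in `reference` could be sketched roughly as follows, using `scipy.cluster.hierarchy` with a Pearson-correlation distance as suggested above. This is an illustrative sketch, not CNVkit's actual code: the function name, the correlation cutoff, and the shape of the input matrix are assumptions.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

def cluster_samples(log2_matrix, min_cluster_size=5, correlation_cutoff=0.9):
    """Group control samples by similarity of their log2 read-depth profiles.

    log2_matrix: array of shape (n_samples, n_bins), one row per sample.
    Returns a list of index arrays, one per cluster that meets the minimum size.
    (Hypothetical helper; parameter defaults are illustrative.)
    """
    # Pairwise Pearson correlation between sample profiles
    corr = np.corrcoef(log2_matrix)
    # Convert to a distance matrix (0 = identical, 2 = anti-correlated)
    dist = 1.0 - corr
    np.fill_diagonal(dist, 0.0)
    # Average-linkage hierarchical clustering on the condensed distances
    linkage = hierarchy.linkage(squareform(dist, checks=False), method="average")
    # Cut the tree where within-cluster correlation drops below the cutoff
    labels = hierarchy.fcluster(linkage, t=1.0 - correlation_cutoff,
                                criterion="distance")
    clusters = [np.flatnonzero(labels == k) for k in np.unique(labels)]
    # Drop clusters below the minimum size; their members fall back to
    # the global all-sample profile
    return [c for c in clusters if len(c) >= min_cluster_size]
```

Each returned cluster would then contribute one set of 'log2_*'/'spread_*' columns to the reference, computed from just that cluster's members.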
This approach keeps the reference size and computation reasonable, and the small number of resulting profiles can be validated individually if desired.
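The profile-selection step in `fix` reduces to picking the reference column most correlated with the test sample. A minimal sketch, again with Pearson correlation; the function name and the dict-of-columns representation are assumptions, not CNVkit's API:

```python
import numpy as np

def choose_profile(sample_log2, reference_profiles):
    """Pick the reference profile most correlated with the test sample.

    sample_log2: 1-D array of the test sample's per-bin log2 depths.
    reference_profiles: dict mapping column name (e.g. 'log2', 'log2_1')
        to a 1-D array of per-bin reference log2 values.
    Returns (best_name, best_correlation). (Hypothetical helper.)
    """
    best_name, best_corr = None, -np.inf
    for name, profile in reference_profiles.items():
        r = np.corrcoef(sample_log2, profile)[0, 1]
        if r > best_corr:
            best_name, best_corr = name, r
    # The issue proposes logging which profile was chosen
    print(f"Normalizing with reference profile {best_name!r} (r={best_corr:.3f})")
    return best_name, best_corr
```

The chosen column's values are then used in place of the global 'log2' column when normalizing the test sample.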