Some existing CNV callers (e.g. CANOES, ExomeDepth, CLAMMS) construct a pooled reference for each test sample from a subset of the available control samples, selected for similar read depth profiles.
There are some practical problems with the approach of selecting control samples on the fly:
- The number of control samples to include in the subset is either arbitrary or computationally expensive to optimize by simulation.
- The read depths of all control samples at all bins must be retained and examined. With a large initial pool, this could bloat both the `cnv_reference.cnn` file size and the in-memory footprint.
- For validation purposes, given limited resources, not all possible combinations of control samples can be fully tested: a combinatorial number of references could be generated from a fixed pool, even with a fixed subset size.
Strategy:
- In `reference`, optionally (with `--cluster`) cluster the input samples to create additional 'log2_*', 'spread_*', and maybe 'depth_*' columns for distinct subsets of control samples, alongside the usual all-sample 'log2', 'spread', and 'depth' columns. Clustering can be done with `scipy.cluster.hierarchy`, probably using Pearson correlation as the metric and cutting the tree to yield distinct clusters, with a minimum cluster size of e.g. 5 or 10. Consider something like a quick bootstrap or cross-validation of the tree to determine well-supported clusters. Or use MCL.
- In `fix`, check the given test sample against the available global 'log2' and cluster-specific 'log2_*' columns to select the most appropriate profile. Log which profile is chosen, and use it for normalization. The metric between the sample and each profile can again be Pearson correlation.
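The clustering step in `reference` could be sketched roughly as follows, using `scipy.cluster.hierarchy` with a Pearson-correlation distance as suggested above. This is an illustrative sketch, not CNVkit's actual code: the function name, the correlation cutoff, and the shape of the input matrix are assumptions.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

def cluster_samples(log2_matrix, min_cluster_size=5, correlation_cutoff=0.9):
    """Group control samples by similarity of their log2 read-depth profiles.

    log2_matrix: array of shape (n_samples, n_bins), one row per sample.
    Returns a list of index arrays, one per cluster that meets the minimum size.
    (Hypothetical helper; parameter defaults are illustrative.)
    """
    # Pairwise Pearson correlation between sample profiles
    corr = np.corrcoef(log2_matrix)
    # Convert to a distance matrix (0 = identical, 2 = anti-correlated)
    dist = 1.0 - corr
    np.fill_diagonal(dist, 0.0)
    # Average-linkage hierarchical clustering on the condensed distances
    linkage = hierarchy.linkage(squareform(dist, checks=False), method="average")
    # Cut the tree where within-cluster correlation drops below the cutoff
    labels = hierarchy.fcluster(linkage, t=1.0 - correlation_cutoff,
                                criterion="distance")
    clusters = [np.flatnonzero(labels == k) for k in np.unique(labels)]
    # Drop clusters below the minimum size; their members fall back to
    # the global all-sample profile
    return [c for c in clusters if len(c) >= min_cluster_size]
```

Each returned cluster would then contribute one set of 'log2_*'/'spread_*' columns to the reference, computed from just that cluster's members.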
This approach keeps the reference size and computation reasonable, and the small number of resulting profiles can be validated individually if desired.
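The profile-selection step in `fix` reduces to picking the reference column most correlated with the test sample. A minimal sketch, again with Pearson correlation; the function name and the dict-of-columns representation are assumptions, not CNVkit's API:

```python
import numpy as np

def choose_profile(sample_log2, reference_profiles):
    """Pick the reference profile most correlated with the test sample.

    sample_log2: 1-D array of the test sample's per-bin log2 depths.
    reference_profiles: dict mapping column name (e.g. 'log2', 'log2_1')
        to a 1-D array of per-bin reference log2 values.
    Returns (best_name, best_correlation). (Hypothetical helper.)
    """
    best_name, best_corr = None, -np.inf
    for name, profile in reference_profiles.items():
        r = np.corrcoef(sample_log2, profile)[0, 1]
        if r > best_corr:
            best_name, best_corr = name, r
    # The issue proposes logging which profile was chosen
    print(f"Normalizing with reference profile {best_name!r} (r={best_corr:.3f})")
    return best_name, best_corr
```

The chosen column's values are then used in place of the global 'log2' column when normalizing the test sample.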