Description
I am exploring the use of CNVkit for broad-scale copy number analysis (in WGS mode) and would like to configure fixed windows of approximately 1 Mb across the genome (there are justifying reasons for this). At the same time, I need to ensure that these windows do not overlap centromeric regions, which currently happens.
I am currently using centromere exclusion files and ENCODE exclusion regions to mask these regions during the binning step. However, I have observed that the final segments sometimes span centromeres, which is problematic. Splitting such a segment at the centromere boundary after the fact invalidates the associated statistics (CI, dispersion, bin counts, etc.), which I need for downstream event filtering.
Ideally, I need all segments to be entirely contained within the span of each chromosome arm, avoiding centromeric crossings altogether.
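To make the requirement concrete, here is a minimal sketch of the check I have in mind. It is plain Python, not part of CNVkit; the function name and all coordinates are hypothetical and only illustrate the arm-containment rule.

```python
def contained_in_arm(seg_start: int, seg_end: int,
                     cen_start: int, cen_end: int) -> bool:
    """Return True if a half-open segment [seg_start, seg_end) lies
    entirely on one chromosome arm, i.e. it neither overlaps nor
    crosses the centromere interval [cen_start, cen_end)."""
    on_p_arm = seg_end <= cen_start   # ends before the centromere begins
    on_q_arm = seg_start >= cen_end   # starts after the centromere ends
    return on_p_arm or on_q_arm


# Hypothetical centromere coordinates, for illustration only.
CEN_START, CEN_END = 4_200_000, 5_300_000

# A segment confined to the p-arm is acceptable.
ok_segment = contained_in_arm(1_000_000, 4_000_000, CEN_START, CEN_END)

# A segment running from the p-arm into the q-arm is what I want to avoid.
bad_segment = contained_in_arm(3_000_000, 7_000_000, CEN_START, CEN_END)
```

Any segment for which this check fails is one I currently cannot use downstream.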
Currently, my approach is to set --target-min-size/--target-max-size to 1_000_000 +/- 100_000. However, this still produces segments that span centromeres. Setting the bin size with the -b option of autobin produces bins shorter than 1 Mb, probably because of the abundance of small exclusion regions in ENCODE.
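For reference, the binning I am aiming for could be sketched roughly as follows. This is a standalone Python illustration and not CNVkit code; the function name, chromosome length, and centromere coordinates are assumptions made up for the example. Each arm is divided into equal-width windows as close to 1 Mb as possible, so that no window touches the centromere.

```python
def make_arm_bins(chrom, chrom_len, cen_start, cen_end, bin_size=1_000_000):
    """Tile each chromosome arm with roughly bin_size windows.

    Returns BED-style half-open (chrom, start, end) tuples. The centromere
    interval [cen_start, cen_end) is skipped entirely, so no window
    overlaps or crosses it.
    """
    bins = []
    # p-arm runs up to the centromere, q-arm starts after it.
    for arm_start, arm_end in ((0, cen_start), (cen_end, chrom_len)):
        arm_len = arm_end - arm_start
        if arm_len <= 0:
            continue  # e.g. acrocentric chromosome with no usable p-arm
        # Number of windows that brings the window width closest to bin_size.
        n_bins = max(1, round(arm_len / bin_size))
        for i in range(n_bins):
            start = arm_start + (i * arm_len) // n_bins
            end = arm_start + ((i + 1) * arm_len) // n_bins
            bins.append((chrom, start, end))
    return bins


# Hypothetical chromosome, for illustration only.
example_bins = make_arm_bins("chrN", 10_500_000, 4_200_000, 5_300_000)

# Emit the windows as BED lines.
for chrom, start, end in example_bins:
    print(f"{chrom}\t{start}\t{end}")
```

Whether a BED file built this way can be used directly as the bin definition in CNVkit's WGS workflow is part of what I am asking.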
Could you advise on:
- The most effective way to specify 1 Mb bins within CNVkit’s workflow (e.g., through autobin or another recommended approach)?
- Whether there is a way to enforce segmentation boundaries that respect masked regions such as centromeres.
- Any other potential caveats I should be aware of when working with bin sizes this large.
Thank you for your assistance.