Avoiding centromere-bridging segments

I am exploring CNVkit for broad-scale copy number analysis (in WGS mode) and would like to configure fixed windows of approximately 1 Mb across the genome (there are good reasons for this size). At the same time, I need to ensure that these windows do not overlap centromeric regions, which is currently happening.

I am currently using centromere exclusion files and the ENCODE exclusion regions to mask these regions during the binning step. However, I have observed that the final segments sometimes span centromeres, which is problematic. Splitting such a segment at the centromere boundary after the fact invalidates the associated statistics (CI, dispersion, bin counts, etc.), which I need for downstream event filtering.

Ideally, I need all segments to be entirely contained within the span of each chromosome arm, avoiding centromeric crossings altogether.
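To make the requirement concrete, here is a minimal sketch of the check I have in mind, in plain Python and independent of CNVkit. The function names and the coordinates are hypothetical and purely illustrative; intervals are assumed to be half-open (BED-style).

```python
def bridges_centromere(seg_start, seg_end, cen_start, cen_end):
    """Return True if a segment starts on the p arm and ends on the q arm.

    Intervals are half-open [start, end), as in BED files.
    """
    return seg_start < cen_start and seg_end > cen_end


def flag_bridging_segments(segments, centromeres):
    """List the segments that cross the centromere of their chromosome.

    segments:    iterable of (chrom, start, end) tuples
    centromeres: dict mapping chrom -> (cen_start, cen_end)
    Chromosomes without a centromere entry are skipped.
    """
    flagged = []
    for chrom, start, end in segments:
        if chrom not in centromeres:
            continue
        cen_start, cen_end = centromeres[chrom]
        if bridges_centromere(start, end, cen_start, cen_end):
            flagged.append((chrom, start, end))
    return flagged


# Hypothetical coordinates, for illustration only (not real centromere positions).
example_centromeres = {"chrA": (4_200_000, 5_000_000)}
example_segments = [
    ("chrA", 0, 4_200_000),           # p arm only
    ("chrA", 3_000_000, 7_000_000),   # bridges the centromere
    ("chrA", 5_000_000, 10_500_000),  # q arm only
]
bridging = flag_bridging_segments(example_segments, example_centromeres)
```

Any segment this reports is one I would like the segmentation step itself to avoid, rather than splitting it afterwards.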

Currently, my approach is to set `--target-min-size`/`--target-max-size` to `1_000_000 +/- 100_000`. However, this still leads to centromere-spanning segments. Setting the bin size with the `-b` option of `autobin` produces bins shorter than 1 Mb, probably because of the abundance of small exclusion regions in ENCODE.

Could you advise on:

- The most effective way to specify 1 Mb bins within CNVkit’s workflow (e.g., through `autobin` or another recommended approach).
- Whether there is a way to enforce segmentation boundaries that respect masked regions such as centromeres.
- Any other potential caveats I should be aware of when working with bin sizes this large.


Thank you for your assistance.

Avoiding centromere-bridging segments #969

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Avoiding centromere-bridging segments #969

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions