Avoiding centromere-bridging segments #969

@ilykos


I am exploring the use of CNVkit for broad-scale copy number analysis (in WGS mode) and would like to configure fixed windows of approximately 1 Mb across the genome (there are good reasons for this choice). At the same time, I need to ensure that these windows do not overlap centromeric regions, which is what currently happens.

I currently use centromere exclusion files and the ENCODE exclusion regions to mask these areas during the binning step. However, I have observed that the final segments sometimes span centromeres, which is problematic. Splitting such a segment at the centromere boundary after the fact invalidates its associated statistics (CI, dispersion, bin counts, etc.), which I need for downstream event filtering.

Ideally, I need all segments to be entirely contained within the span of each chromosome arm, avoiding centromeric crossings altogether.

Currently, my approach is to set --target-min-size/--target-max-size to 1_000_000 +/- 100_000. However, this still produces centromere-spanning segments. Setting the bin size with the -b option of autobin produces bins shorter than 1 Mb, probably because of the many small exclusion regions in ENCODE.
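To make the goal concrete, here is a minimal sketch of the arm-restricted windows I have in mind. It is plain Python and does not use CNVkit; the function name tile_arms, the chromosome name chrT, and all coordinates are hypothetical values chosen for illustration, not real centromere positions.

```python
from typing import Iterator, Tuple

# BED-style row: (chromosome, start, end, name)
Window = Tuple[str, int, int, str]


def tile_arms(chrom: str, chrom_len: int, cen_start: int, cen_end: int,
              size: int = 1_000_000, min_size: int = 900_000) -> Iterator[Window]:
    """Yield fixed-size windows per chromosome arm; none crosses the centromere.

    Hypothetical helper for illustration only, not part of CNVkit.
    """
    arms = (("p", 0, cen_start), ("q", cen_end, chrom_len))
    for arm, arm_start, arm_end in arms:
        start = arm_start
        while start < arm_end:
            end = min(start + size, arm_end)
            # Drop a short trailing remainder rather than emit an undersized bin.
            if end - start >= min_size:
                yield chrom, start, end, f"{chrom}{arm}"
            start = end


# Made-up example: a 10.5 Mb chromosome with a centromere at 3.2-4.0 Mb.
windows = list(tile_arms("chrT", 10_500_000, 3_200_000, 4_000_000))
for chrom, start, end, name in windows:
    print(f"{chrom}\t{start}\t{end}\t{name}")
```

Each window is labelled with its arm, so segments could in principle be checked against arm membership afterwards. Whether CNVkit can be made to keep bins like these intact, and to stop segmentation at the arm boundary, is part of what I am asking below.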

Could you advise on:

  • The most effective way to specify 1 Mb bins within CNVkit’s workflow (e.g., through autobin or another recommended approach).
  • Whether there is a way to enforce segmentation boundaries that respect masked regions such as centromeres.
  • Any other caveats I should be aware of when working with bin sizes this large.

Thank you for your assistance.
