-
Notifications
You must be signed in to change notification settings - Fork 33
Description
Describe the issue
To prepare proper denoised read counts from GATK4, it is necessary to process and annotate target intervals with GC content and other parameters. However GATK4 mandates to merge overlapping intervals when doing this.
This is a problem when one wants to use not only the mapping bias from GenomicsDB (this will work even if the files are not aligned) but a NormalDB as well (or some files, like amplification_pvalues.csv) will cause an error.
Even when using the BED file generated by IntervalFile.R, the intervals will likely differ slightly enough to cause errors (files attached to this issue).
I have no idea about the cause
The code says
if (length(tumor) != length(normal) ||
length(tumor) != length(log.ratio)) {
.stopRuntimeError("Mis-aligned coverage data.")
But I have no idea what tumor, normal or log.ratio (which I doubt is the cause, it only happens when I use a normalDB) are.
To Reproduce
PureCN.R --out sample1/sample1 --sampleid sample1 --tumor ../tumor_hdf5/sample1.hdf5 --log-ratio-file ../tumor_denoised/sample1_denoisedCR.txt --seg-file ../segmented/sample1.modelFinal.seg --mapping-bias-file ../mapping_bias_agilent_v8_hg38.rds --vcf ../../somatic_analysis/somatic_analysis_2/final/2023-10-12_somatic_analysis_2/sample1-mutect2-annotated.vcf.gz --snp-blacklist /home/incalci/shared/references/ENCODE-hg38-blacklist.bed --genome hg38 --fun-segmentation none --force --post-optimize --seed 123 --min-base-quality 20 --cores 3 --intervals ../intervals/agilent_exon.noX.txt --normaldb ../normalDB_agilent_v8_hg38.rds
Expected behavior
As I used IntervalFile.R to generate an optmized BED, I did not expect intervals not to align.
Log file
INFO [2025-04-17 11:12:14] PureCN 2.12.0 [156/9156]INFO [2025-04-17 11:12:14] ------------------------------------------------------------
INFO [2025-04-17 11:12:14] Arguments: -seg.file ../segmented/sample1.modelFinal.seg -vcf.file ../../somatic_analysis/somatic_analysis_2/final/2023-10-12_somatic_analysis_2/sample1-annotated.vcf.gz -genome hg38 -sex ? -args.setPriorVcf 6 -args.setMappingBiasVcf ../mapping_bias_agilent_v8_hg38.rds -args.filterInt
ervals 100,0.05 -args.segmentation 0.005,NULL, -sampleid sample1 -min.ploidy 1.4 -max.ploidy 6 -max.non.clonal 0.2 -max.homozygous.loss 0.05,1e+07 -log.ratio.calibration 0.1 -model.homozygous FALSE -error 0.001 -interval.file ../intervals/agilent_exon.txt -min.logr.sdev 0.15 -max.segments 300 -plot.cnv TRUE -vcf.field.prefix PureCN. -cosmic.vcf.file -DB.info.flag DB -POPAF.info.field POP_AF -Cosmic.CNT.info.field Cosmic.CNT -model beta -post.optimize TRUE -log.file sample1/sample1.log -normal.coverage.file <data> -tumor.coverage.file <data> -log.ratio <data> -normalDB <data> -args.filterVcf <data> -fun.segmentation <data> -test.num.copy <data> -test.purity <data> -speedup.heuristics <data> -BPPARAM <data>
INFO [2025-04-17 11:12:14] Using BiocParallel for parallel optimization. INFO [2025-04-17 11:12:14] Loading coverage files... WARN [2025-04-17 11:12:14] Allosome coverage missing, cannot determine sex.
WARN [2025-04-17 11:12:15] Allosome coverage missing, cannot determine sex. WARN [2025-04-17 11:12:18] tumor.coverage.file and interval.file do not align.
FATAL [2025-04-17 11:12:28] Mis-aligned coverage data. FATAL [2025-04-17 11:12:28]
FATAL [2025-04-17 11:12:28] This runtime error might be caused by invalid input data or parameters.
FATAL [2025-04-17 11:12:28] Please report bug (PureCN 2.12.0).
Error: Mis-aligned coverage data.
This runtime error might be caused by invalid input data or parameters. Please report bug (PureCN 2.12.0).
Session Info
R version 4.4.3 (2025-02-28)
Platform: x86_64-conda-linux-gnu
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS/LAPACK: /opt/conda/lib/libopenblasp-r0.3.29.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: NA
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] PureCN_2.12.0 VariantAnnotation_1.52.0
[3] Rsamtools_2.22.0 Biostrings_2.74.0
[5] XVector_0.46.0 SummarizedExperiment_1.36.0
[7] Biobase_2.66.0 GenomicRanges_1.58.0
[9] GenomeInfoDb_1.42.0 IRanges_2.40.0
[11] S4Vectors_0.44.0 MatrixGenerics_1.18.0
[13] matrixStats_1.5.0 BiocGenerics_0.52.0
[15] DNAcopy_1.80.0
loaded via a namespace (and not attached):
[1] KEGGREST_1.46.0 gtable_0.3.6 rjson_0.2.23
[4] ggplot2_3.5.1 rhdf5_2.50.0 lattice_0.22-6
[7] rhdf5filters_1.18.0 vctrs_0.6.5 tools_4.4.3
[10] bitops_1.0-9 curl_6.2.1 parallel_4.4.3
[13] tibble_3.2.1 AnnotationDbi_1.68.0 RSQLite_2.3.9
[16] blob_1.2.4 pkgconfig_2.0.3 Matrix_1.7-3
[19] data.table_1.17.0 BSgenome_1.74.0 RColorBrewer_1.1-3
[22] lifecycle_1.0.4 GenomeInfoDbData_1.2.13 compiler_4.4.3
[25] munsell_0.5.1 codetools_0.2-20 RCurl_1.98-1.16
[28] yaml_2.3.10 pillar_1.10.1 crayon_1.5.3
[31] BiocParallel_1.40.0 DelayedArray_0.32.0 cachem_1.1.0
[34] mclust_6.1.1 abind_1.4-5 restfulr_0.0.15
[37] splines_4.4.3 fastmap_1.2.0 grid_4.4.3
[40] colorspace_2.1-1 cli_3.6.4 SparseArray_1.6.0
[43] magrittr_2.0.3 S4Arrays_1.6.0 GenomicFeatures_1.58.0
[46] XML_3.99-0.17 scales_1.3.0 UCSC.utils_1.2.0
[49] bit64_4.6.0-1 lambda.r_1.2.4 httr_1.4.7
[52] bit_4.6.0 gridExtra_2.3 futile.logger_1.4.3
[55] png_0.1-8 VGAM_1.1-13 memoise_2.0.1
[58] BiocIO_1.16.0 rtracklayer_1.66.0 rlang_1.1.5
[61] futile.options_1.0.1 glue_1.8.0 DBI_1.2.3
[64] formatR_1.14 jsonlite_1.9.1 Rhdf5lib_1.28.0
[67] R6_2.6.1 GenomicAlignments_1.42.0 zlibbioc_1.52.0