Vcfdist requires too much memory when using VCF with SVs included #34

@rickymagner

Hi, I've been trying to use vcfdist to compare a SNP + SV callset against a truth set containing both, and it seems to require a prohibitive amount of memory when the SVs are included (it runs relatively quickly when I subset to just the SNPs + small INDELs). Here is the full command used:

vcfdist comp-svs.vcf.gz /cromwell_root/fc-36a7fa15-2513-416e-bcbf-0322bb0e076a/nist_q100/GRCh38_HG2-T2TQ100-V1.1_stvar.vcf.gz /cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta -b /cromwell_root/fc-36a7fa15-2513-416e-bcbf-0322bb0e076a/nist_q100/GRCh38_HG2-T2TQ100-V1.1_stvar.benchmark.bed -v 1 --max-threads 8 --max-ram 64 --largest-variant 100000 --credit-threshold 0.7 --phasing-threshold 0.6
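For reference, the SNP + small INDEL subsetting mentioned above can be sketched roughly as follows. This is a minimal illustration, not the exact filtering I used: the function names and the 50 bp cutoff are assumptions (50 bp is a common convention for separating small variants from SVs), and symbolic alleles such as `<DEL>` are simply treated as SVs.

```python
def is_small_variant(ref, alt, max_len=50):
    """Return True if every ALT allele differs from REF by fewer than max_len bases.

    max_len=50 is an assumed cutoff, not something vcfdist defines.
    """
    for allele in alt.split(","):
        # Symbolic alleles (<DEL>, <INS>) and breakends carry no literal
        # sequence, so their size cannot be read from the ALT string.
        if allele.startswith("<") or "[" in allele or "]" in allele:
            return False
        if abs(len(allele) - len(ref)) >= max_len:
            return False
    return True


def split_vcf_lines(lines, max_len=50):
    """Split uncompressed VCF lines into (small, large); header lines go to both."""
    small, large = [], []
    for line in lines:
        if line.startswith("#"):
            small.append(line)
            large.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alt = fields[3], fields[4]  # VCF columns 4 and 5
        target = small if is_small_variant(ref, alt, max_len) else large
        target.append(line)
    return small, large
```

Running the comparison separately on the two halves is one way to confirm that the memory growth comes from the large variants rather than from the total record count.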

In this example I took a fully phased NIST-Q100 truth set (e.g. file GRCh38_HG2-T2TQ100-V1.1.vcf.gz) and ran it against itself as both the truth and the query VCF. With 64GB of RAM and 8 CPUs it crashes (the log below ends with the process being killed during the wavefront clustering step), and with 256GB or so and many cores it runs for a prohibitively long time.

Given how fast it is with just the SNPs, it seems there might be something unoptimized when comparing SVs, either in the algorithm or in the implementation. I wonder whether running a profiler would help show if there's a memory leak or an unreasonable combinatorial explosion in this case that could be managed better.
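Short of a full profiler, a quick way to quantify the problem is to record the peak resident memory of the run. The sketch below is only an illustration using Python's standard `resource` and `subprocess` modules on Linux or macOS; the helper name `run_and_report_peak_rss` is hypothetical, and it says nothing about where inside vcfdist the memory is allocated.

```python
import resource
import subprocess
import sys


def run_and_report_peak_rss(cmd):
    """Run cmd (an argv list) and return (returncode, peak RSS in bytes).

    RUSAGE_CHILDREN reports the largest peak RSS among the child
    processes that have been waited for so far.
    """
    proc = subprocess.run(cmd)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    scale = 1 if sys.platform == "darwin" else 1024
    return proc.returncode, usage.ru_maxrss * scale


if __name__ == "__main__":
    # Example: pass the command to measure as arguments to this script.
    code, peak = run_and_report_peak_rss(sys.argv[1:])
    print(f"exit code {code}, peak RSS {peak / 2**30:.2f} GiB")
```

If the process is killed by the kernel's OOM killer, `subprocess` reports a negative return code (-9 for SIGKILL), so the exit code also distinguishes an out-of-memory kill from a normal failure.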

Here are the full logs for the command:

[INFO vcfdist 22:27:29] Command: 'vcfdist comp-svs.vcf.gz /cromwell_root/fc-36a7fa15-2513-416e-bcbf-0322bb0e076a/nist_q100/GRCh38_HG2-T2TQ100-V1.1_stvar.vcf.gz /cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta -b /cromwell_root/fc-36a7fa15-2513-416e-bcbf-0322bb0e076a/nist_q100/GRCh38_HG2-T2TQ100-V1.1_stvar.benchmark.bed -v 1 --max-threads 8 --max-ram 64 --largest-variant 100000 --credit-threshold 0.7 --phasing-threshold 0.6'
[INFO vcfdist 22:27:29]
[INFO vcfdist 22:27:29] [0/8] Loading reference FASTA '/cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta'
[INFO vcfdist 22:27:45]
[INFO vcfdist 22:27:45] [Q 0/8] Parsing QUERY VCF 'comp-svs.vcf.gz'
[WARN vcfdist 22:27:45] 'PS' tag not defined in QUERY VCF header, assuming one phase set per contig
[INFO vcfdist 22:27:46] Genotypes:
[INFO vcfdist 22:27:46] 0: 667
[INFO vcfdist 22:27:46] 1: 966
[INFO vcfdist 22:27:46] 0|1: 13558
[INFO vcfdist 22:27:46] 1|0: 13671
[INFO vcfdist 22:27:46] 1|1: 6085
[INFO vcfdist 22:27:46] .|.: 11557
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] 6085 homozygous and multi-allelic variants in QUERY VCF, split for evaluation
[WARN vcfdist 22:27:46] 18014 variants with unknown (./.) alleles in QUERY VCF, skipped
[WARN vcfdist 22:27:46] 1335 variants spanned by deletion in QUERY VCF, skipped
[INFO vcfdist 22:27:46] 6952 variants outside selected regions in QUERY VCF, skipped
[INFO vcfdist 22:27:46] 39 variants on border of selected regions in QUERY VCF, skipped
[WARN vcfdist 22:27:46] 4 large (size > 100000) variants in QUERY VCF, skipped
[WARN vcfdist 22:27:46] 10 complex (CPX) variants in QUERY VCF, split into INS + DEL
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] Variant types:
[INFO vcfdist 22:27:46] REF: 1335
[INFO vcfdist 22:27:46] INS: 20761
[INFO vcfdist 22:27:46] DEL: 12787
[INFO vcfdist 22:27:46] CPX: 10
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] Contigs:
[INFO vcfdist 22:27:46] [ 0] chr1: 1219 | 1182 variants
[INFO vcfdist 22:27:46] [ 1] chr2: 1210 | 1249 variants
[INFO vcfdist 22:27:46] [ 2] chr3: 857 | 881 variants
[INFO vcfdist 22:27:46] [ 3] chr4: 1066 | 1102 variants
[INFO vcfdist 22:27:46] [ 4] chr5: 881 | 874 variants
[INFO vcfdist 22:27:46] [ 5] chr6: 1104 | 1105 variants
[INFO vcfdist 22:27:46] [ 6] chr7: 1023 | 1053 variants
[INFO vcfdist 22:27:46] [ 7] chr8: 816 | 792 variants
[INFO vcfdist 22:27:46] [ 8] chr9: 640 | 626 variants
[INFO vcfdist 22:27:46] [ 9] chr10: 918 | 881 variants
[INFO vcfdist 22:27:46] [10] chr11: 790 | 798 variants
[INFO vcfdist 22:27:46] [11] chr12: 812 | 784 variants
[INFO vcfdist 22:27:46] [12] chr13: 661 | 645 variants
[INFO vcfdist 22:27:46] [13] chr14: 415 | 371 variants
[INFO vcfdist 22:27:46] [14] chr15: 406 | 395 variants
[INFO vcfdist 22:27:46] [15] chr16: 585 | 564 variants
[INFO vcfdist 22:27:46] [16] chr17: 653 | 623 variants
[INFO vcfdist 22:27:46] [17] chr18: 511 | 516 variants
[INFO vcfdist 22:27:46] [18] chr19: 605 | 581 variants
[INFO vcfdist 22:27:46] [19] chr20: 469 | 492 variants
[INFO vcfdist 22:27:46] [20] chr21: 363 | 364 variants
[INFO vcfdist 22:27:46] [21] chr22: 334 | 328 variants
[INFO vcfdist 22:27:46] [22] chrX: 797 | 202 variants
[INFO vcfdist 22:27:46] [23] chrY: 25 | 0 variants
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] QUERY VCF overview:
[INFO vcfdist 22:27:46] TOTAL: 52599
[INFO vcfdist 22:27:46] KEPT : 33568
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] [T 0/8] Parsing TRUTH VCF '/cromwell_root/fc-36a7fa15-2513-416e-bcbf-0322bb0e076a/nist_q100/GRCh38_HG2-T2TQ100-V1.1_stvar.vcf.gz'
[WARN vcfdist 22:27:46] 'PS' tag not defined in TRUTH VCF header, assuming one phase set per contig
[INFO vcfdist 22:27:46] Genotypes:
[INFO vcfdist 22:27:46] 0: 667
[INFO vcfdist 22:27:46] 1: 966
[INFO vcfdist 22:27:46] 0|1: 13558
[INFO vcfdist 22:27:46] 1|0: 13671
[INFO vcfdist 22:27:46] 1|1: 6085
[INFO vcfdist 22:27:46] .|.: 11557
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] 6085 homozygous and multi-allelic variants in TRUTH VCF, split for evaluation
[WARN vcfdist 22:27:46] 18014 variants with unknown (./.) alleles in TRUTH VCF, skipped
[WARN vcfdist 22:27:46] 1335 variants spanned by deletion in TRUTH VCF, skipped
[INFO vcfdist 22:27:46] 6952 variants outside selected regions in TRUTH VCF, skipped
[INFO vcfdist 22:27:46] 39 variants on border of selected regions in TRUTH VCF, skipped
[WARN vcfdist 22:27:46] 4 large (size > 100000) variants in TRUTH VCF, skipped
[WARN vcfdist 22:27:46] 10 complex (CPX) variants in TRUTH VCF, split into INS + DEL
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] Variant types:
[INFO vcfdist 22:27:46] REF: 1335
[INFO vcfdist 22:27:46] INS: 20761
[INFO vcfdist 22:27:46] DEL: 12787
[INFO vcfdist 22:27:46] CPX: 10
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] Contigs:
[INFO vcfdist 22:27:46] [ 0] chr1: 1219 | 1182 variants
[INFO vcfdist 22:27:46] [ 1] chr2: 1210 | 1249 variants
[INFO vcfdist 22:27:46] [ 2] chr3: 857 | 881 variants
[INFO vcfdist 22:27:46] [ 3] chr4: 1066 | 1102 variants
[INFO vcfdist 22:27:46] [ 4] chr5: 881 | 874 variants
[INFO vcfdist 22:27:46] [ 5] chr6: 1104 | 1105 variants
[INFO vcfdist 22:27:46] [ 6] chr7: 1023 | 1053 variants
[INFO vcfdist 22:27:46] [ 7] chr8: 816 | 792 variants
[INFO vcfdist 22:27:46] [ 8] chr9: 640 | 626 variants
[INFO vcfdist 22:27:46] [ 9] chr10: 918 | 881 variants
[INFO vcfdist 22:27:46] [10] chr11: 790 | 798 variants
[INFO vcfdist 22:27:46] [11] chr12: 812 | 784 variants
[INFO vcfdist 22:27:46] [12] chr13: 661 | 645 variants
[INFO vcfdist 22:27:46] [13] chr14: 415 | 371 variants
[INFO vcfdist 22:27:46] [14] chr15: 406 | 395 variants
[INFO vcfdist 22:27:46] [15] chr16: 585 | 564 variants
[INFO vcfdist 22:27:46] [16] chr17: 653 | 623 variants
[INFO vcfdist 22:27:46] [17] chr18: 511 | 516 variants
[INFO vcfdist 22:27:46] [18] chr19: 605 | 581 variants
[INFO vcfdist 22:27:46] [19] chr20: 469 | 492 variants
[INFO vcfdist 22:27:46] [20] chr21: 363 | 364 variants
[INFO vcfdist 22:27:46] [21] chr22: 334 | 328 variants
[INFO vcfdist 22:27:46] [22] chrX: 797 | 202 variants
[INFO vcfdist 22:27:46] [23] chrY: 25 | 0 variants
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] TRUTH VCF overview:
[INFO vcfdist 22:27:46] TOTAL: 52599
[INFO vcfdist 22:27:46] KEPT : 33568
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] Checking contigs:
[INFO vcfdist 22:27:46] All contig checks passed!
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] [Q 3/8] Wavefront clustering QUERY VCF 'comp-svs.vcf.gz'
/cromwell_root/script: line 46: 17 Killed vcfdist comp-svs.vcf.gz /cromwell_root/fc-36a7fa15-2513-416e-bcbf-0322bb0e076a/nist_q100/GRCh38_HG2-T2TQ100-V1.1_stvar.vcf.gz /cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta -b /cromwell_root/fc-36a7fa15-2513-416e-bcbf-0322bb0e076a/nist_q100/GRCh38_HG2-T2TQ100-V1.1_stvar.benchmark.bed -v 1 --max-threads 8 --max-ram 64 --largest-variant 100000 --credit-threshold 0.7 --phasing-threshold 0.6
