Strelka VCF missing GT field causes error during "call" command #943

@brandon-hastings

I am using the call command to generate BAF and allele-specific copy numbers, and ran into the negative BAF values described in #601. Following the guidance there, I reran call, specifying the tumor and normal samples from a Strelka VCF, and got the following error:

Selected test sample TUMOR and control sample NORMAL
Skipping NC_072790.1:221367 G @ TUMOR; 'invalid FORMAT: GT'
Traceback (most recent call last):
  File "/Users/brandonhastings/opt/miniconda3/envs/cnvkit/bin/cnvkit.py", line 10, in <module>
    sys.exit(main())
  File "/Users/brandonhastings/opt/miniconda3/envs/cnvkit/lib/python3.10/site-packages/cnvlib/cnvkit.py", line 10, in main
    args.func(args)
  File "/Users/brandonhastings/opt/miniconda3/envs/cnvkit/lib/python3.10/site-packages/cnvlib/commands.py", line 1178, in _cmd_call
    varr = load_het_snps(
  File "/Users/brandonhastings/opt/miniconda3/envs/cnvkit/lib/python3.10/site-packages/cnvlib/cmdutil.py", line 30, in load_het_snps
    varr = tabio.read(
  File "/Users/brandonhastings/opt/miniconda3/envs/cnvkit/lib/python3.10/site-packages/skgenome/tabio/__init__.py", line 75, in read
    dframe = reader(infile, **kwargs)
  File "/Users/brandonhastings/opt/miniconda3/envs/cnvkit/lib/python3.10/site-packages/skgenome/tabio/vcfio.py", line 62, in read_vcf
    table = pd.DataFrame.from_records(rows, columns=columns)
  File "/Users/brandonhastings/opt/miniconda3/envs/cnvkit/lib/python3.10/site-packages/pandas/core/frame.py", line 2450, in from_records
    first_row = next(data)
  File "/Users/brandonhastings/opt/miniconda3/envs/cnvkit/lib/python3.10/site-packages/skgenome/tabio/vcfio.py", line 233, in _parse_records
    depth, zygosity, alt_count = _extract_genotype(sample, record)
  File "/Users/brandonhastings/opt/miniconda3/envs/cnvkit/lib/python3.10/site-packages/skgenome/tabio/vcfio.py", line 303, in _extract_genotype
    gts = set(sample["GT"])
  File "pysam/libcbcf.pyx", line 3541, in pysam.libcbcf.VariantRecordSample.__getitem__
  File "pysam/libcbcf.pyx", line 813, in pysam.libcbcf.bcf_format_get_value
KeyError: 'invalid FORMAT: GT'

After examining the Strelka VCF, I found that it has no GT field, which appears to be a deliberate choice by Strelka (Illumina/strelka#16). I have pasted the FILTER, FORMAT, and INFO header lines of my VCF below, along with the first record. Could support for Strelka VCFs be added?

##FILTER=<ID=LowDepth,Description="Tumor or normal sample read depth at this locus is below 2">
##FILTER=<ID=LowEVS,Description="Somatic Empirical Variant Score (SomaticEVS) is below threshold">
##FORMAT=<ID=AU,Number=2,Type=Integer,Description="Number of 'A' alleles used in tiers 1,2">
##FORMAT=<ID=CU,Number=2,Type=Integer,Description="Number of 'C' alleles used in tiers 1,2">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth for tier1 (used+filtered)">
##FORMAT=<ID=FDP,Number=1,Type=Integer,Description="Number of basecalls filtered from original read depth for tier1">
##FORMAT=<ID=GU,Number=2,Type=Integer,Description="Number of 'G' alleles used in tiers 1,2">
##FORMAT=<ID=SDP,Number=1,Type=Integer,Description="Number of reads with deletions spanning this site at tier1">
##FORMAT=<ID=SUBDP,Number=1,Type=Integer,Description="Number of reads below tier1 mapping quality threshold aligned across this site">
##FORMAT=<ID=TU,Number=2,Type=Integer,Description="Number of 'T' alleles used in tiers 1,2">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Combined depth across samples">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=NT,Number=1,Type=String,Description="Genotype of the normal in all data tiers, as used to classify somatic variants. One of {ref,het,hom,conflict}.">
##INFO=<ID=PNOISE,Number=1,Type=Float,Description="Fraction of panel containing non-reference noise at this site">
##INFO=<ID=PNOISE2,Number=1,Type=Float,Description="Fraction of panel containing more than one non-reference noise obs at this site">
##INFO=<ID=QSS,Number=1,Type=Integer,Description="Quality score for any somatic snv, ie. for the ALT allele to be present at a significantly different frequency in the tumor and normal">
##INFO=<ID=QSS_NT,Number=1,Type=Integer,Description="Quality score reflecting the joint probability of a somatic variant and NT">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref read-position in the tumor">
##INFO=<ID=SGT,Number=1,Type=String,Description="Most likely somatic genotype excluding normal noise states">
##INFO=<ID=SNVSB,Number=1,Type=Float,Description="Somatic SNV site strand bias">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Somatic mutation">
##INFO=<ID=SomaticEVS,Number=1,Type=Float,Description="Somatic Empirical Variant Score (EVS) expressing the phred-scaled probability of the call being a false positive observation.">
##INFO=<ID=TQSS,Number=1,Type=Integer,Description="Data tier used to compute QSS">
##INFO=<ID=TQSS_NT,Number=1,Type=Integer,Description="Data tier used to compute QSS_NT">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NORMAL  TUMOR
NC_072790.1     221367  .       G       C       .       LowEVS  DP=49;MQ=30.81;MQ0=15;NT=ref;QSS=1;QSS_NT=1;ReadPosRankSum=-0.16;SGT=CG->CG;SNVSB=0.00;SOMATIC;SomaticEVS=0.11;TQSS=1;TQSS_NT=1 DP:FDP:SDP:SUBDP:AU:CU:GU:TU    5:0:0:0:0,0:1,2:4,16:0,0              17:1:0:0:0,0:2,2:14,29:0,0
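
As a possible stopgap, the values CNVkit reads from GT and the allele depths might be recoverable from the fields Strelka does emit. The sketch below assumes the convention described in the Strelka user guide for somatic SNVs: the tier-1 count for a base is the first value of its <BASE>U field, so the REF and ALT counts come from the fields matching the REF and ALT bases. The helper names and the mapping from INFO/NT to a genotype string are my own illustrative assumptions, not part of CNVkit or Strelka, and the sketch handles only single-base, biallelic records.

```python
# Sketch only: recover tier-1 REF/ALT counts and a provisional normal genotype
# from a Strelka somatic SNV record. All helper names are hypothetical.

# Assumed mapping from Strelka's INFO/NT value to a diploid genotype string.
NT_TO_GT = {"ref": "0/0", "het": "0/1", "hom": "1/1"}


def parse_sample(format_field, sample_field):
    """Pair the FORMAT keys with one sample column's colon-separated values."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def tier1_counts(ref, alt, sample):
    """Return (ref_count, alt_count), using the first value of each <BASE>U field."""
    ref_count = int(sample[ref + "U"].split(",")[0])
    alt_count = int(sample[alt + "U"].split(",")[0])
    return ref_count, alt_count


def alt_freq(ref_count, alt_count):
    """Tier-1 ALT allele frequency, or None when there are no usable reads."""
    total = ref_count + alt_count
    return alt_count / total if total else None


def normal_genotype(info_field):
    """Look up INFO/NT; returns None for 'conflict' or when NT is absent."""
    info = dict(kv.split("=", 1) for kv in info_field.split(";") if "=" in kv)
    return NT_TO_GT.get(info.get("NT"))


# Values copied from the record above.
ref, alt = "G", "C"
info = ("DP=49;MQ=30.81;MQ0=15;NT=ref;QSS=1;QSS_NT=1;ReadPosRankSum=-0.16;"
        "SGT=CG->CG;SNVSB=0.00;SOMATIC;SomaticEVS=0.11;TQSS=1;TQSS_NT=1")
fmt = "DP:FDP:SDP:SUBDP:AU:CU:GU:TU"
normal = parse_sample(fmt, "5:0:0:0:0,0:1,2:4,16:0,0")
tumor = parse_sample(fmt, "17:1:0:0:0,0:2,2:14,29:0,0")

print(tier1_counts(ref, alt, normal))            # (4, 1)
print(tier1_counts(ref, alt, tumor))             # (14, 2)
print(alt_freq(*tier1_counts(ref, alt, tumor)))  # 0.125
print(normal_genotype(info))                     # 0/0
```

For this record the normal sample is called reference (NT=ref), so it would not count as a heterozygous SNP in any case; only sites with NT=het would be informative for BAF under this mapping.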
