Skip to content

Zilong-Li/SVUPP

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

45 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Structural Variants Genotyping Using Pre-Phased Reads

https://github.com/Zilong-Li/SVUPP/actions/workflows/main.yml/badge.svg https://zenodo.org/badge/DOI/10.5281/zenodo.17227287.svg

Please cite our paper with the following BibTex template:

@article{li2025SVUPP,
  title = {Pre-Phasing Long Reads Improves Structural Variant Genotyping},
  author = {Li, Zilong and St{\ae}ger, Frederik Filip and Davies, Robert W and Moltke, Ida and Albrechtsen, Anders},
  year = {2025},
  month = oct,
  journal = {Bioinformatics},
  pages = {btaf587},
  issn = {1367-4811},
  doi = {10.1093/bioinformatics/btaf587}
}

Quick Start

# Download
git clone https://github.com/Zilong-Li/SVUPP
cd SVUPP

# Download example data from 1KG
bash ./scripts/download-examples.sh 

# Run after preparing the sample sheet and reference panel
nextflow run main.nf \
         -profile conda \                 # or docker/singularity
         --refpanel tests/refpanel.csv \  # for phased reference panel
         --samples tests/samples.csv \    # samplesheet with long reads
         --svfile tests/delins.sniffles.hg38.liftedT2T.13Nov2023.nygc.vcf.gz  # a known SV list for genotyping

# Output Structure
# results
# β”œβ”€β”€ cutesv2
# β”‚Β Β  β”œβ”€β”€ NA12878.vcf.gz            # Final VCF with SV genotypes
# β”‚Β Β  β”œβ”€β”€ NA12878.vcf.gz.tbi
# β”‚Β Β  └── versions.yml
# β”œβ”€β”€ prepared_reference_rdata.csv  
# β”œβ”€β”€ quilt2_impute
# β”‚Β Β  β”œβ”€β”€ batch1
# β”‚Β Β  └── versions.yml
# β”œβ”€β”€ quilt2_phase
# β”‚Β Β  β”œβ”€β”€ batch1
# β”‚Β Β  └── versions.yml
# β”œβ”€β”€ quilt2_prepare_chunk
# β”‚Β Β  β”œβ”€β”€ chr21.csv
# β”‚Β Β  β”œβ”€β”€ chr22.csv
# β”‚Β Β  └── versions.yml
# β”œβ”€β”€ quilt2_prepare_reference
# β”‚Β Β  β”œβ”€β”€ RData
# β”‚Β Β  └── versions.yml
# └── samples_read_labels.csv

# Read the nextflow.config about advanced and Customization parameters

Table of Contents

Introduction

SVUPP is a pipeline that improves SVs genotyping accuracy by incorporating per-read phasing information into genotype likelihoods. Currently, we first used QUILT2 to phase long reads with a SNP reference panel. Then, we used a forked version of cuteSV2 (aka cuteFC) for assigning SV signals to each read followed by our genotyping formula that incorporates the haplotype probability of reads.

$$ \text{P}(X| H=(H_1,H_2)) \propto ∏_{r=1}^{D} βˆ‘_{h∈\{1,2\}} \text{P}(X_r| H_h) \text{P}(\text{hap}_r=h) $$

Usage

Step 0: install Nextflow

Please follow the official guideline to install the latest Nextflow with DSL2 support.

Step 1: configure the workflow

There are two main CSV files you need to prepare, e.g. tests/samples.csv and tests/refpanel.csv. Check out the README there. In addition, to configure the parameters of the workflow, modify the nextflow.config or use Nextflow command options (for Nextflow experts).

Step 2: choose a container

SVUPP supports Docker, Singularity and Conda containers technology. Therefore, you can choose to use one of the 3 profiles in the nextflow.config namely docker, singularity and conda. NB, if you use either singularity or docker profile, you have to set the params.container to the local image path. Check out the README there for building your local container images or download one from https://doi.org/10.5281/zenodo.17227286. If you use conda profile, you are recommenced to activate a conda environment first before running SVUPP. Also, it may take a while for conda to resolve the environment for the first time depending on the conda version and internet connection.

Step 3: run the workflow

You can git clone this workflow to a customized path, and run without cd into SVUPP

nextflow run SVUPP/main.nf \
         -profile conda \                 # or docker/singularity
         --refpanel SVUPP/tests/refpanel.csv \  # for phased reference panel
         --samples SVUPP/tests/samples.csv \    # samplesheet with long reads
         --svfile SVUPP/tests/delins.sniffles.hg38.liftedT2T.13Nov2023.nygc.vcf.gz  # a list of known SVs

If you are new to Nextflow, here is a quick guide.

FunctionalityNextflow CommandImportant Note
Run job in the backgroundrun -bgDO NOT use nohup or &
Resume from the cached tasksrun -resumeCan work with specific hash
Data cache directoryrun -w dirDefaults β€˜work’
Output directoryrun –ourdirDefaults β€˜results’
Max parallel processesrun -qsDefaults None
Logging historylogFind the status of past runs

Output

All output files are saved in the folder that you specified when running Nextflow command with defaults to results. Here are the details:

Genotyped VCF:results/cuteSV2/$sampleid.vcf.gz
Read labels:results/samples_read_labels.csv
Prepared reference:results/prepared_reference_rdata.csv

Evaluation

For benchmarking studies, It is important to evaluate the results by stratifying the SV complexity and the call rate, which is controled by the GQ thresholding. You can achieve this easily using the latest vcfppR package (version >= 0.8.2).

#remotes::install_github("Zilong-Li/vcfppR") ## use the latest github version
library(vcfppR)
svvcf <- system.file("extdata", "platinum.sv.vcf.gz", package="vcfppR")
svuppvcf <- system.file("extdata", "svupp.call.vcf.gz", package="vcfppR")
truth <- vcftable(svvcf)
truth$neighbors <-as.integer(sub(".*NumNeighbors=([^;]+).*", "\\1", truth$info))
truth <- subset(truth, neighbors == 0) ## subset biallelic SVs
res <- vcfcomp(svuppvcf, truth, stats = "gtgq")
vcfplot(res, col = 2,cex = 2, lwd = 3, type = "l", bty = 'l')

Q&A

Which reference panel should I use?

In principle, choose the one with matched ancestry or a large one with multiple admixed populations, e.g., the UK Biobank. However, in our benchmarking with the Platinum data, we found there was no difference in accuracy between using the UK Biobank and the 1000 Genomes Project. You can download the prepared 1000 Genomes reference panel in RData format for QUILT2 here http://popgen.dk/zilong/datahub/1KGP/quilt2_refpanel_hg38/RData/. See the next section on how to use it directly.

What if I already have the prepared reference panel, i.e the RData, from QUILT2?

  1. Prepare a sheet with two columns named β€˜chunk_id’ and β€˜refpanel_rdata’, such as http://popgen.dk/zilong/datahub/1KGP/quilt2_refpanel_hg38/prepared_reference_rdata.csv.
    chunk_id,refpanel_rdata
    chr22.48718618.55783303,/home/zilong/Projects/SVUPP/work/f2/f9b51191685bdf2fa893e394a834af/RData/QUILT_prepared_reference.chr22.48718618.55783303.RData
    chr22.38068017.44734586,/home/zilong/Projects/SVUPP/work/9b/6e3c921ecb41b2ebe01c8f0d4935ab/RData/QUILT_prepared_reference.chr22.38068017.44734586.RData
    chr22.30094765.34092463,/home/zilong/Projects/SVUPP/work/89/b4676a75daf1e493c82e90d8bf1bdd/RData/QUILT_prepared_reference.chr22.30094765.34092463.RData
        
  2. Run the nextflow
    nextflow run main.nf \
             -profile conda \    # or docker/singularity
             --refdata prepared_reference_rdata.csv \  # the sheet with prepared RData for reference panel
             --samples tests/samples.csv \       # samplesheet with long reads
             --svfile /path/to/vcf/with/sv.vcf   # for SV genotyping
        

Speedup QUILT2 for a large reference panel

QUILT2 can run much faster if only imputing common variants in a large reference panel where the major SNPs are rare. With that in mind, SVUPP runs QUILT2 with --impute_rare_common=FALSE in default, which disables rare variants imputation. To enable it, you should modify the nextflow.config file to set quilt_extra_args to '--impute_rare_common=TRUE'.

What if I already have read labels either from QUILT2 or other read phasing program?

First, Prepare a sheet with two columns named β€˜sample’ and β€˜label’, such as:

sample,label
NA12877,/home/zilong/Projects/SVUPP/work/6c/f6daadafa1fdf4e90c6c8de4c39181/1/NA12877.haptag.tsv
NA12878,/home/zilong/Projects/SVUPP/work/6c/f6daadafa1fdf4e90c6c8de4c39181/1/NA12878.haptag.tsv

The label column stores the path to a space-separated file with no header and the first three columns being qname,phasing_prob,hap. An example:

A00217:76:HFLT3DSXX:4:1457:26015:159840.9991
A00296:43:HCLHLDSXX:2:2502:19642:312190.9992
A00217:76:HFLT3DSXX:1:1336:4616:233590.5000251476585191

Second, run the nextflow

nextflow run main.nf \
  -profile conda \                 # or docker/singularity
  --read_labels samples_read_labels.csv \  # the sheet associate each sample with its read label file 
  --samples tests/samples.csv \    # samplesheet with long reads
  --svfile /path/to/vcf/with/svs   # for SV genotyping

What’s the advantages of QUILT2 vs WhatsHap?

There are two main reasons why QUILT2 is chosen.

  • QUILT2 is better than the alternatives at low-to-medium coverage (<10x) reads phasing.
  • Users only need to have the aligned long reads of the target samples and a public available SNP reference panel, which are easy to obtain (at least for human projects).

However, for some non-human projects, where a public reference panel is rarely available, WhatHap may be a good alternative with the cost of obtaining high quality called SNPs, which are normally generated with high-coverage short reads sequencing of the target samples.

Will this pipeline support WhatsHap?

It would be nice to have, time permitting. Welcome PRs!

About

genotype Structural Variants Using Pre-Phased Reads

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors