Please cite our paper with the following BibTeX template:
@article{li2025SVUPP,
title = {Pre-Phasing Long Reads Improves Structural Variant Genotyping},
author = {Li, Zilong and St{\ae}ger, Frederik Filip and Davies, Robert W and Moltke, Ida and Albrechtsen, Anders},
year = {2025},
month = oct,
journal = {Bioinformatics},
pages = {btaf587},
issn = {1367-4811},
doi = {10.1093/bioinformatics/btaf587}
}
# Download
git clone https://github.com/Zilong-Li/SVUPP
cd SVUPP
# Download example data from 1KG
bash ./scripts/download-examples.sh
# Run after preparing the sample sheet and reference panel
#   -profile:   conda, docker, or singularity
#   --refpanel: phased reference panel
#   --samples:  sample sheet with long reads
#   --svfile:   list of known SVs to genotype
nextflow run main.nf \
    -profile conda \
    --refpanel tests/refpanel.csv \
    --samples tests/samples.csv \
    --svfile tests/delins.sniffles.hg38.liftedT2T.13Nov2023.nygc.vcf.gz
# Output Structure
# results
# ├── cutesv2
# │   ├── NA12878.vcf.gz # Final VCF with SV genotypes
# │   ├── NA12878.vcf.gz.tbi
# │   └── versions.yml
# ├── prepared_reference_rdata.csv
# ├── quilt2_impute
# │   ├── batch1
# │   └── versions.yml
# ├── quilt2_phase
# │   ├── batch1
# │   └── versions.yml
# ├── quilt2_prepare_chunk
# │   ├── chr21.csv
# │   ├── chr22.csv
# │   └── versions.yml
# ├── quilt2_prepare_reference
# │   ├── RData
# │   └── versions.yml
# └── samples_read_labels.csv
# See nextflow.config for advanced and customization parameters
- Quick Start
- Introduction
- Usage
- Output
- Evaluation
- Q&A
- Which reference panel should I use?
- What if I already have the prepared reference panel, i.e., the RData, from QUILT2?
- Speed up QUILT2 for a large reference panel
- What if I already have read labels, either from QUILT2 or another read-phasing program?
- What are the advantages of QUILT2 over WhatsHap?
- Will this pipeline support WhatsHap?
SVUPP is a pipeline that improves SV genotyping accuracy by incorporating per-read phasing information into genotype likelihoods. Currently, it first uses QUILT2 to phase long reads against a SNP reference panel. It then uses a forked version of cuteSV2 (aka cuteFC) to assign SV signals to each read, followed by our genotyping formula, which incorporates the haplotype probability of each read.
Please follow the official guideline to install the latest Nextflow with DSL2 support.
There are two main CSV files you need to prepare, e.g. tests/samples.csv and tests/refpanel.csv. Check out the README there. In addition, to configure the parameters of the workflow, modify the nextflow.config or use Nextflow command options (for Nextflow experts).
SVUPP supports Docker, Singularity, and Conda. You can therefore choose one of the three profiles in nextflow.config: docker, singularity, or conda. Note that if you use the singularity or docker profile, you have to set params.container to the local image path. Check out the README there for building your local container images, or download one from https://doi.org/10.5281/zenodo.17227286. If you use the conda profile, we recommend activating a conda environment before running SVUPP. Also, conda may take a while to resolve the environment the first time, depending on the conda version and internet connection.
You can also clone this workflow to any path and run it without changing into the SVUPP directory:
#   -profile:   conda, docker, or singularity
#   --refpanel: phased reference panel
#   --samples:  sample sheet with long reads
#   --svfile:   list of known SVs
nextflow run SVUPP/main.nf \
    -profile conda \
    --refpanel SVUPP/tests/refpanel.csv \
    --samples SVUPP/tests/samples.csv \
    --svfile SVUPP/tests/delins.sniffles.hg38.liftedT2T.13Nov2023.nygc.vcf.gz
If you are new to Nextflow, here is a quick guide.
| Functionality | Nextflow Command | Important Note |
|---|---|---|
| Run job in the background | run -bg | DO NOT use nohup or & |
| Resume from the cached tasks | run -resume | Can work with specific hash |
| Data cache directory | run -w dir | Defaults to "work" |
| Output directory | run --outdir | Defaults to "results" |
| Max parallel processes | run -qs | Defaults None |
| Logging history | log | Find the status of past runs |
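
As a rough illustration of how the options in the table combine, the sketch below wraps a background, resumable run in a small shell function. The function name `svupp_run_bg`, its two arguments, and the default queue size of 8 are hypothetical choices made for this example; `-bg`, `-resume`, `-w`, and `-qs` are the options listed in the table, and the input files are those from the Quick Start.

```shell
# Hypothetical wrapper: launch SVUPP in the background and resume cached tasks.
#   $1 = work (cache) directory, defaults to "work"
#   $2 = maximum number of parallel processes, defaults to 8 (arbitrary)
svupp_run_bg() {
    workdir=${1:-work}
    queue_size=${2:-8}
    nextflow run main.nf \
        -profile conda \
        -bg \
        -resume \
        -w "$workdir" \
        -qs "$queue_size" \
        --refpanel tests/refpanel.csv \
        --samples tests/samples.csv \
        --svfile tests/delins.sniffles.hg38.liftedT2T.13Nov2023.nygc.vcf.gz
}

# Example (not executed here):
#   svupp_run_bg /scratch/svupp_work 4
#   nextflow log    # check the status of past runs
```

Because `-bg` already detaches the run, the wrapper does not add `nohup` or `&`.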
All output files are saved in the output folder you specify when running the Nextflow command, which defaults to results. Here are the details:
| Output | Path |
|---|---|
| Genotyped VCF | results/cutesv2/$sampleid.vcf.gz |
| Read labels | results/samples_read_labels.csv |
| Prepared reference | results/prepared_reference_rdata.csv |
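
A quick way to confirm that a default run produced these files is sketched below. The function name `check_svupp_outputs` is hypothetical, and the `cutesv2` subfolder name is taken from the directory listing in the Quick Start; adjust it if your output tree differs. Runs that supply their own prepared reference or read labels may not produce every file checked here.

```shell
# Hypothetical check: report missing or empty SVUPP outputs.
#   $1 = output directory, remaining arguments = sample IDs
check_svupp_outputs() {
    outdir=$1
    shift
    missing=0
    # Run-level sheets written at the top of the output directory
    for f in "$outdir/samples_read_labels.csv" "$outdir/prepared_reference_rdata.csv"; do
        [ -s "$f" ] || { echo "missing: $f"; missing=1; }
    done
    # Per-sample genotyped VCF and its index
    for id in "$@"; do
        for f in "$outdir/cutesv2/$id.vcf.gz" "$outdir/cutesv2/$id.vcf.gz.tbi"; do
            [ -s "$f" ] || { echo "missing: $f"; missing=1; }
        done
    done
    return "$missing"
}

# Example (not executed here):
#   check_svupp_outputs results NA12878
```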
For benchmarking studies, it is important to evaluate the results by stratifying on SV complexity and on call rate, which is controlled by the GQ threshold. You can do this easily with the latest vcfppR package (version >= 0.8.2).
# remotes::install_github("Zilong-Li/vcfppR")  # use the latest GitHub version
library(vcfppR)
# Example truth set and SVUPP calls shipped with vcfppR
svvcf <- system.file("extdata", "platinum.sv.vcf.gz", package = "vcfppR")
svuppvcf <- system.file("extdata", "svupp.call.vcf.gz", package = "vcfppR")
truth <- vcftable(svvcf)
# Extract the NumNeighbors annotation from the INFO field
truth$neighbors <- as.integer(sub(".*NumNeighbors=([^;]+).*", "\\1", truth$info))
truth <- subset(truth, neighbors == 0) ## subset biallelic SVs
# Compare the SVUPP genotypes with the truth set using GQ
res <- vcfcomp(svuppvcf, truth, stats = "gtgq")
vcfplot(res, col = 2, cex = 2, lwd = 3, type = "l", bty = "l")
In principle, choose a reference panel with matched ancestry, or a large one with multiple admixed populations, e.g., the UK Biobank. However, in our benchmarking with the Platinum data, we found no difference in accuracy between the UK Biobank and the 1000 Genomes Project. You can download the prepared 1000 Genomes reference panel in RData format for QUILT2 from http://popgen.dk/zilong/datahub/1KGP/quilt2_refpanel_hg38/RData/. See the next section on how to use it directly.
- If you already have the prepared reference panel (the RData) from QUILT2, prepare a sheet with two columns named "chunk_id" and "refpanel_rdata", such as http://popgen.dk/zilong/datahub/1KGP/quilt2_refpanel_hg38/prepared_reference_rdata.csv.
chunk_id,refpanel_rdata
chr22.48718618.55783303,/home/zilong/Projects/SVUPP/work/f2/f9b51191685bdf2fa893e394a834af/RData/QUILT_prepared_reference.chr22.48718618.55783303.RData
chr22.38068017.44734586,/home/zilong/Projects/SVUPP/work/9b/6e3c921ecb41b2ebe01c8f0d4935ab/RData/QUILT_prepared_reference.chr22.38068017.44734586.RData
chr22.30094765.34092463,/home/zilong/Projects/SVUPP/work/89/b4676a75daf1e493c82e90d8bf1bdd/RData/QUILT_prepared_reference.chr22.30094765.34092463.RData
- Run Nextflow
#   -profile:  conda, docker, or singularity
#   --refdata: the sheet with prepared RData for the reference panel
#   --samples: sample sheet with long reads
#   --svfile:  VCF with the SVs to genotype
nextflow run main.nf \
    -profile conda \
    --refdata prepared_reference_rdata.csv \
    --samples tests/samples.csv \
    --svfile /path/to/vcf/with/sv.vcf
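
If the RData files sit in one directory, the sheet can be generated rather than written by hand. The following is a minimal sketch, assuming the files follow the `QUILT_prepared_reference.<chunk_id>.RData` naming seen in the example above; the function name `make_refdata_sheet` is hypothetical.

```shell
# Hypothetical helper: print a chunk_id,refpanel_rdata sheet for every
# QUILT_prepared_reference.<chunk_id>.RData file found in directory $1.
make_refdata_sheet() {
    rdata_dir=$1
    echo "chunk_id,refpanel_rdata"
    for f in "$rdata_dir"/QUILT_prepared_reference.*.RData; do
        [ -e "$f" ] || continue                   # the glob matched nothing
        base=${f##*/}                             # drop the directory part
        chunk=${base#QUILT_prepared_reference.}   # drop the fixed prefix
        chunk=${chunk%.RData}                     # drop the extension
        echo "$chunk,$f"
    done
}

# Example (not executed here):
#   make_refdata_sheet /path/to/RData > prepared_reference_rdata.csv
```

Passing an absolute directory keeps the paths in the sheet absolute, as in the example sheet.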
QUILT2 runs much faster when it imputes only common variants, which matters for a large reference panel where the majority of SNPs are rare. With that in mind, SVUPP runs QUILT2 with --impute_rare_common=FALSE by default, which disables rare-variant imputation. To enable it, modify the nextflow.config file to set quilt_extra_args to '--impute_rare_common=TRUE'.
If you already have read labels, either from QUILT2 or from another read-phasing program, first prepare a sheet with two columns named "sample" and "label", such as:
sample,label
NA12877,/home/zilong/Projects/SVUPP/work/6c/f6daadafa1fdf4e90c6c8de4c39181/1/NA12877.haptag.tsv
NA12878,/home/zilong/Projects/SVUPP/work/6c/f6daadafa1fdf4e90c6c8de4c39181/1/NA12878.haptag.tsv
The label column stores the path to a space-separated file with no header whose first three columns are qname, phasing_prob, and hap. An example:
| A00217:76:HFLT3DSXX:4:1457:26015:15984 | 0.999 | 1 |
| A00296:43:HCLHLDSXX:2:2502:19642:31219 | 0.999 | 2 |
| A00217:76:HFLT3DSXX:1:1336:4616:23359 | 0.500025147658519 | 1 |
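
Label files produced by other phasing programs can be sanity-checked before they are passed to the pipeline. The sketch below is one possible check, not part of SVUPP: the function name `check_read_labels` is hypothetical, and it assumes that phasing_prob is a probability in [0, 1] and that hap is coded as 1 or 2, as in the example rows above.

```shell
# Hypothetical check: succeed only if every line of label file $1 has at
# least three whitespace-separated columns, a phasing_prob within [0, 1],
# and a hap value of 1 or 2 (assumed coding, based on the example rows).
check_read_labels() {
    awk '
        NF < 3 { print "line " NR ": fewer than 3 columns"; bad = 1; next }
        $2 !~ /^[0-9.eE+-]+$/ || $2 + 0 < 0 || $2 + 0 > 1 {
            print "line " NR ": phasing_prob not in [0, 1]: " $2; bad = 1
        }
        $3 != 1 && $3 != 2 { print "line " NR ": hap is not 1 or 2: " $3; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}

# Example (not executed here):
#   check_read_labels NA12878.haptag.tsv && echo "labels look fine"
```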
Second, run Nextflow:
#   -profile:      conda, docker, or singularity
#   --read_labels: the sheet associating each sample with its read label file
#   --samples:     sample sheet with long reads
#   --svfile:      VCF with the SVs to genotype
nextflow run main.nf \
    -profile conda \
    --read_labels samples_read_labels.csv \
    --samples tests/samples.csv \
    --svfile /path/to/vcf/with/svs
There are two main reasons why QUILT2 is chosen over WhatsHap:
- QUILT2 is better than the alternatives at phasing reads at low-to-medium coverage (<10x).
- Users only need the aligned long reads of the target samples and a publicly available SNP reference panel, both of which are easy to obtain (at least for human projects).
However, for some non-human projects, where a public reference panel is rarely available, WhatsHap may be a good alternative, at the cost of obtaining high-quality called SNPs, which are normally generated by high-coverage short-read sequencing of the target samples.
WhatsHap support would be nice to have, time permitting. PRs are welcome!