Structural Variants Genotyping Using Pre-Phased Reads

Please cite our paper with the following BibTex template:

@article{li2025SVUPP,
  title = {Pre-Phasing Long Reads Improves Structural Variant Genotyping},
  author = {Li, Zilong and St{\ae}ger, Frederik Filip and Davies, Robert W and Moltke, Ida and Albrechtsen, Anders},
  year = {2025},
  month = oct,
  journal = {Bioinformatics},
  pages = {btaf587},
  issn = {1367-4811},
  doi = {10.1093/bioinformatics/btaf587}
}

Quick Start

# Download
git clone https://github.com/Zilong-Li/SVUPP
cd SVUPP

# Download example data from 1KG
bash ./scripts/download-examples.sh 

# Run after preparing the sample sheet and reference panel
nextflow run main.nf \
         -profile conda \                 # or docker/singularity
         --refpanel tests/refpanel.csv \  # for phased reference panel
         --samples tests/samples.csv \    # samplesheet with long reads
         --svfile tests/delins.sniffles.hg38.liftedT2T.13Nov2023.nygc.vcf.gz  # a known SV list for genotyping

# Output Structure
# results
# ├── cutesv2
# │   ├── NA12878.vcf.gz            # Final VCF with SV genotypes
# │   ├── NA12878.vcf.gz.tbi
# │   └── versions.yml
# ├── prepared_reference_rdata.csv  
# ├── quilt2_impute
# │   ├── batch1
# │   └── versions.yml
# ├── quilt2_phase
# │   ├── batch1
# │   └── versions.yml
# ├── quilt2_prepare_chunk
# │   ├── chr21.csv
# │   ├── chr22.csv
# │   └── versions.yml
# ├── quilt2_prepare_reference
# │   ├── RData
# │   └── versions.yml
# └── samples_read_labels.csv

# Read the nextflow.config about advanced and Customization parameters

Quick Start

Introduction

Usage

Step 0: install Nextflow

Step 1: configure the workflow

Step 2: choose a container

Step 3: run the workflow

Output

Evaluation

Q&A

Which reference panel should I use?

What if I already have the prepared reference panel, i.e the RData, from QUILT2?

Speedup QUILT2 for a large reference panel

What if I already have read labels either from QUILT2 or other read phasing program?

What’s the advantages of QUILT2 vs WhatsHap?

Will this pipeline support WhatsHap?

Introduction

SVUPP is a pipeline that improves SVs genotyping accuracy by incorporating per-read phasing information into genotype likelihoods. Currently, we first used QUILT2 to phase long reads with a SNP reference panel. Then, we used a forked version of cuteSV2 (aka cuteFC) for assigning SV signals to each read followed by our genotyping formula that incorporates the haplotype probability of reads.

$$ \text{P}(X| H=(H_1,H_2)) \propto ∏_{r=1}^{D} ∑_{h∈\{1,2\}} \text{P}(X_r| H_h) \text{P}(\text{hap}_r=h) $$

Usage

Step 0: install Nextflow

Please follow the official guideline to install the latest Nextflow with DSL2 support.

Step 1: configure the workflow

There are two main CSV files you need to prepare, e.g. tests/samples.csv and tests/refpanel.csv. Check out the README there. In addition, to configure the parameters of the workflow, modify the nextflow.config or use Nextflow command options (for Nextflow experts).

Step 2: choose a container

SVUPP supports Docker, Singularity and Conda containers technology. Therefore, you can choose to use one of the 3 profiles in the nextflow.config namely docker, singularity and conda. NB, if you use either singularity or docker profile, you have to set the params.container to the local image path. Check out the README there for building your local container images or download one from https://doi.org/10.5281/zenodo.17227286. If you use conda profile, you are recommenced to activate a conda environment first before running SVUPP. Also, it may take a while for conda to resolve the environment for the first time depending on the conda version and internet connection.

Step 3: run the workflow

You can git clone this workflow to a customized path, and run without cd into SVUPP

nextflow run SVUPP/main.nf \
         -profile conda \                 # or docker/singularity
         --refpanel SVUPP/tests/refpanel.csv \  # for phased reference panel
         --samples SVUPP/tests/samples.csv \    # samplesheet with long reads
         --svfile SVUPP/tests/delins.sniffles.hg38.liftedT2T.13Nov2023.nygc.vcf.gz  # a list of known SVs

If you are new to Nextflow, here is a quick guide.

Functionality	Nextflow Command	Important Note
Run job in the background	run -bg	DO NOT use nohup or &
Resume from the cached tasks	run -resume	Can work with specific hash
Data cache directory	run -w dir	Defaults ‘work’
Output directory	run –ourdir	Defaults ‘results’
Max parallel processes	run -qs	Defaults None
Logging history	log	Find the status of past runs

Output

All output files are saved in the folder that you specified when running Nextflow command with defaults to results. Here are the details:

Genotyped VCF:	results/cuteSV2/$sampleid.vcf.gz
Read labels:	results/samples_read_labels.csv
Prepared reference:	results/prepared_reference_rdata.csv

Evaluation

For benchmarking studies, It is important to evaluate the results by stratifying the SV complexity and the call rate, which is controled by the GQ thresholding. You can achieve this easily using the latest vcfppR package (version >= 0.8.2).

#remotes::install_github("Zilong-Li/vcfppR") ## use the latest github version
library(vcfppR)
svvcf <- system.file("extdata", "platinum.sv.vcf.gz", package="vcfppR")
svuppvcf <- system.file("extdata", "svupp.call.vcf.gz", package="vcfppR")
truth <- vcftable(svvcf)
truth$neighbors <-as.integer(sub(".*NumNeighbors=([^;]+).*", "\\1", truth$info))
truth <- subset(truth, neighbors == 0) ## subset biallelic SVs
res <- vcfcomp(svuppvcf, truth, stats = "gtgq")
vcfplot(res, col = 2,cex = 2, lwd = 3, type = "l", bty = 'l')

Q&A

Which reference panel should I use?

In principle, choose the one with matched ancestry or a large one with multiple admixed populations, e.g., the UK Biobank. However, in our benchmarking with the Platinum data, we found there was no difference in accuracy between using the UK Biobank and the 1000 Genomes Project. You can download the prepared 1000 Genomes reference panel in RData format for QUILT2 here http://popgen.dk/zilong/datahub/1KGP/quilt2_refpanel_hg38/RData/. See the next section on how to use it directly.

What if I already have the prepared reference panel, i.e the RData, from QUILT2?

Prepare a sheet with two columns named ‘chunk_id’ and ‘refpanel_rdata’, such as http://popgen.dk/zilong/datahub/1KGP/quilt2_refpanel_hg38/prepared_reference_rdata.csv.

chunk_id,refpanel_rdata
chr22.48718618.55783303,/home/zilong/Projects/SVUPP/work/f2/f9b51191685bdf2fa893e394a834af/RData/QUILT_prepared_reference.chr22.48718618.55783303.RData
chr22.38068017.44734586,/home/zilong/Projects/SVUPP/work/9b/6e3c921ecb41b2ebe01c8f0d4935ab/RData/QUILT_prepared_reference.chr22.38068017.44734586.RData
chr22.30094765.34092463,/home/zilong/Projects/SVUPP/work/89/b4676a75daf1e493c82e90d8bf1bdd/RData/QUILT_prepared_reference.chr22.30094765.34092463.RData

Run the nextflow

nextflow run main.nf \
         -profile conda \    # or docker/singularity
         --refdata prepared_reference_rdata.csv \  # the sheet with prepared RData for reference panel
         --samples tests/samples.csv \       # samplesheet with long reads
         --svfile /path/to/vcf/with/sv.vcf   # for SV genotyping

Speedup QUILT2 for a large reference panel

QUILT2 can run much faster if only imputing common variants in a large reference panel where the major SNPs are rare. With that in mind, SVUPP runs QUILT2 with --impute_rare_common=FALSE in default, which disables rare variants imputation. To enable it, you should modify the nextflow.config file to set quilt_extra_args to '--impute_rare_common=TRUE'.

What if I already have read labels either from QUILT2 or other read phasing program?

First, Prepare a sheet with two columns named ‘sample’ and ‘label’, such as:

sample,label
NA12877,/home/zilong/Projects/SVUPP/work/6c/f6daadafa1fdf4e90c6c8de4c39181/1/NA12877.haptag.tsv
NA12878,/home/zilong/Projects/SVUPP/work/6c/f6daadafa1fdf4e90c6c8de4c39181/1/NA12878.haptag.tsv

The label column stores the path to a space-separated file with no header and the first three columns being qname,phasing_prob,hap. An example:

A00217:76:HFLT3DSXX:4:1457:26015:15984	0.999	1
A00296:43:HCLHLDSXX:2:2502:19642:31219	0.999	2
A00217:76:HFLT3DSXX:1:1336:4616:23359	0.500025147658519	1

Second, run the nextflow

nextflow run main.nf \
  -profile conda \                 # or docker/singularity
  --read_labels samples_read_labels.csv \  # the sheet associate each sample with its read label file 
  --samples tests/samples.csv \    # samplesheet with long reads
  --svfile /path/to/vcf/with/svs   # for SV genotyping

What’s the advantages of QUILT2 vs WhatsHap?

There are two main reasons why QUILT2 is chosen.

QUILT2 is better than the alternatives at low-to-medium coverage (<10x) reads phasing.
Users only need to have the aligned long reads of the target samples and a public available SNP reference panel, which are easy to obtain (at least for human projects).

However, for some non-human projects, where a public reference panel is rarely available, WhatHap may be a good alternative with the cost of obtaining high quality called SNPs, which are normally generated with high-coverage short reads sequencing of the target samples.

Will this pipeline support WhatsHap?

It would be nice to have, time permitting. Welcome PRs!

Name		Name	Last commit message	Last commit date
Latest commit History 45 Commits
.github/workflows		.github/workflows
containers		containers
modules		modules
scripts		scripts
subworkflows		subworkflows
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.org		README.org
main.nf		main.nf
nextflow.config		nextflow.config

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Structural Variants Genotyping Using Pre-Phased Reads

Quick Start

Table of Contents

Introduction

Usage

Step 0: install Nextflow

Step 1: configure the workflow

Step 2: choose a container

Step 3: run the workflow

Output

Evaluation

Q&A

Which reference panel should I use?

What if I already have the prepared reference panel, i.e the RData, from QUILT2?

Speedup QUILT2 for a large reference panel

What if I already have read labels either from QUILT2 or other read phasing program?

What’s the advantages of QUILT2 vs WhatsHap?

Will this pipeline support WhatsHap?

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Structural Variants Genotyping Using Pre-Phased Reads

Quick Start

Table of Contents

Introduction

Usage

Step 0: install Nextflow

Step 1: configure the workflow

Step 2: choose a container

Step 3: run the workflow

Output

Evaluation

Q&A

Which reference panel should I use?

What if I already have the prepared reference panel, i.e the RData, from QUILT2?

Speedup QUILT2 for a large reference panel

What if I already have read labels either from QUILT2 or other read phasing program?

What’s the advantages of QUILT2 vs WhatsHap?

Will this pipeline support WhatsHap?

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages