pavoa-nf is a comprehensive Nextflow pipeline for variant detection from whole genome or whole exome sequencing data. The pipeline performs preprocessing, alignment, variant calling, optimization, and annotation steps to produce high-quality variant calls with comprehensive annotations.
The pipeline integrates industry-standard tools and follows GATK best practices for variant detection, providing a streamlined workflow from raw sequencing reads to annotated variants.
Containers are available with all the tools needed to run the pipeline (see nextflow.config and Usage section). Only Annovar require a local installation.
-
This pipeline is based on nextflow. As we have several nextflow pipelines, we have centralized the common information in the IARC-nf repository. Please read it carefully as it contains essential information for the installation, basic usage and configuration of nextflow and our pipelines.
-
External software:
- Trim-galore read trimming
- bwa-mem2 fast alignment
- samblaster fast and flexible program for marking duplicates
- sambamba fast processing of NGS alignment
- GATK4 GATK tools : MarkDuplicates, BaseRecalibrator
- Dupcaller UMI trimming and Caller for UDseq
- FastQC read quality assessment
- Qualimap alignment quality control
- MultiQC quality control reports
- Annovar Variant annotation
-
Optional external software for advanced features:
-
Reference files:
- You can generate indexes for bwa-mem2, GATK and dupcaller, and store them with the reference fasta file.
- Or you can let pavoa-nf generate them for you. They will be available in the output folder.
-
VCF files :
- Lists of indels and SNVs, vcf files and corresponding tabix indexes (.tbi)
- Recommended: af-only-gnomad.hg38.vcf.gz, Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.
| Type | Description |
|---|---|
| --input_file | Input file (comma-separated) with 5 columns: SM (sample name), RG (read group ID), pair1 (first fastq of the pair), pair2 (second fastq of the pair), normal (SM of the normal sample for somatic calling) |
| Type | Description |
|---|---|
| --ref | genome reference with its index files for mapping (.fai, .ann, .amb, .pac, .bwt.2bit.64, .0123, .dict and optionnaly .alt), and for mapping (.hp.h5, .ref.h5 and .tn.h5); in the same directory) |
Some regions in the human reference genome (e.g., GRCh38) have alternative haplotypes, represented as ALT contigs. These may cause ambiguous mappings when reads align equally well to both primary and ALT regions.
If a .alt index is provided, the pipeline automatically runs bwa-postalt.js to adjust mapping qualities and resolve ambiguities by linking ALT contigs to their primary counterparts.
1- Install annovar executables (table_annovar.pl, ...).
Add --annovarDBpath /path/to/annovarbin/ to your pavoa-nf command.
2- Download the databases (eg. hg38db).
Add --annovarDBpath /path/to/hg38db to your pavoa-nf command.
3- To activate annotation step, provide a DBlist file (eg. hg38_listAVDB.txt )
Add --annovarDBlist hg38_listAVDB.txt to your pavoa-nf command.
| Name | Default value | Description |
|---|---|---|
| --output_folder | pavoa_output | Output folder for results |
π‘ Tip: this parameters is mandatory if you run a --recall analysis.
| Name | Default value | Description |
|---|---|---|
| --cpu | 8 | number of CPUs |
| --mem | 32 | memory (GB) |
| --cpu_bqsr | 2 | number of CPUs for GATK base quality score recalibration |
| --mem_bqsr | 10 | memory for GATK base quality score recalibration (GB) |
| --cpu_dupcaller | 64 | CPUs for DupCaller |
| --mem_dupcaller | 32 | Memory for DupCaller (GB) |
| Name | Default value | Description |
|---|---|---|
| --umi | NNNNNNNN | Enable UMI-aware duplicate marking. You may pass a TAG or keep default if used as a flag |
| --trim | true | enable adapter sequence trimming |
| --adapter | illumina | adapter type (illumina, nextera, etc.) |
| --length | 30 | Minimum read length after trimming |
| --quality | 30 | Minimum read quality after trimming |
| Name | Default value | Description |
|---|---|---|
| --bwa_option_m | true | Use -M option in BWA and Samblaster (for Picard compatibility) |
| --pl | ILLUMINA | Plateforme name for RG group |
| --bqsr | true | Enable base quality score recalibration |
| --known_sites | none | VCF file, known variants to filter |
| --snp_contam | none | For contamination estimation with mutect |
| --recall | false | Run only calling (bam files must be available in output_folder ) |
π‘ Tip: if --umi is used, then --bqsr automatically is set to false.
| Name | Description |
|---|---|
| --feature_file | Feature file (bed) for Qualimap |
| --multiqc_config | MultiQC configuration file |
| Name | Description |
|---|---|
| --mask | BED file, regions to ignore during calling |
| Name | Description |
|---|---|
| --strelka | true |
| --strelka_bin | |
| --strelka_config | |
| --exome | false |
π‘ Tip: If you are using apptainer profil, --strelka_bin and --strelka_config are set up by default.
π‘ Tip: with --umi, strelka2 is desactivated.
| Name | Description |
|---|---|
| --mutect2 | true |
| --mutect_args | none |
| --nsplit | 1000 |
π‘ Tip: with --umi, mutect2 is desactivated.
###Β Annotation
| Name | Default value | Description |
|---|---|---|
| --annovarDBlist | File with two columns : protocols and operations see example | |
| --annovarDBpath | /data/databases/annovar/hg38db/ | Path to your annovarDB |
| --annovarBinPath | ~/bin/annovar/ | Path to table_annovar.pl |
| --pass | 'PASS' | filter flags, as a comma separated list |
π‘ Tip: No container for annovar, you have to install it.
| Name | Default value | with --umi | Description |
|---|---|---|---|
| --cov_n_thresh | 10 | 1 | Minimum coverage in the normal sample for at given position |
| --cov_t_thresh | 10 | 1 | Minimum coverage in the tumor sample for at givien position |
| --min_vaf_t_thresh | 0.1 | 0 | Minimum Variant Allele Frequency in tumor sample |
| --max_vaf_t_thresh | 1 | 1 | Maximum Variant Allele Frequency in tumor sample |
| --cov_alt_t_thresh | 3 | 1 | Minimum number of read that support the alternative allele in tumor |
π‘ Tip: Default values change with --umi tag.
To run the pipeline on a series of fastq files listed in input.txt and a fasta reference file hg38.fasta, one can type:
nextflow run iarcbioinfo/pavoa-nf -profile singularity --input_file input.txt --ref hg38.fastaFor a complete variant calling workflow with preprocessing, alignment, variant calling with mutect and strelka, and annotation:
nextflow run iarcbioinfo/pavoa-nf -profile apptainer \
--input_file input.txt \
--ref hg38.fasta \
--trim \
--known_sites dbsnp_138.hg38.vcf.gz \
--known_sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
--annovarDBlist hg38_listAVDB.txtFor dual index sequencing (UDseq)
nextflow run iarcbioinfo/pavoa-nf -profile apptainer \
--input_file input.txt \
--ref hg38.fasta \
--umi \
--trim \
--known_sites dbsnp_138.hg38.vcf.gz \
--known_sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
--annovarDBlist hg38_listAVDB.txtTo run only the calling from a previous analysis located in pavoa_output folder.
nextflow run IARCbioinfo/pavoa-nf -entry dupcaller -profile apptainer \
--input_file input.txt \
--output_folder pavoa_output
--ref hg38.fasta \
--recall \
--known_sites dbsnp_138.hg38.vcf.gz \
--known_sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
--annovarDBlist hg38_listAVDB.txt| Type | Description |
|---|---|
| / | Annotated variants in vcf and tabular format |
| BAM/ | folder with BAM and BAI files of alignments |
| QC/fastq/ | read quality reports after trimming |
| QC/BAM/ | alignment quality control reports |
| QC/duplicates/ | variant calling quality control reports |
| QC/multiqc_report.html | comprehensive MultiQC report |
| mutect2/ | Mutect2 outputs |
| mutect2/raw_calls/ | unflagged vcf |
| mutect2/stats | info files from mutect calling |
| Β strelka2/ | Strelka2 outputs |
| strelka2/CallablRegions | Callable regions (bed) |
| dupcaller/ | Dupcaller outputs |
| */calls/ | final vcf |
| */calls/annovar/ | raw annovar annotations |
| */calls/annotations/ | filtered annovar annotations +context +strand |
| */calls/filtered.1 | final filtered vcf |
| index | index files if they have been generated during process |
| nf-pipeline_info/ | log files from all pipeline steps |
π‘ Tip: Save the index files in the same directory as the reference FASTA to avoid regenerating them in future runs.
output/
βββ BAM/
βββ dupcaller/
β βββ calls/
β β βββ annovar/
β β βββ annotations/
β β βββ filtered.1/
βββ mutetc2/
β βββ raw_calls/
β βββ calls/
β β βββ annovar/
β β βββ annotations/
β β βββ filtered.1/
βββ strelka2/
β βββ CallableRegions/
β βββ calls/
β β βββ annovar/
β β βββ annotations/
β β βββ filtered.1/
βββ QC/
β βββ fastq
β βββ BAM/
β βββ duplicates/
β βββ multiqc_report.html
β βββ multiqc_data/
βββ nf-pipeline_info/
The pavoa-nf pipeline performs the following major steps:
-
Preprocessing (P)
- UMI trimming (optional, dupcaller trim)
- Adapter trimming (optional, AdapterRemoval)
- Quality control of reads (FastQC)
-
Alignment (A)
- Read alignment to reference genome (BWA-MEM2)
- Duplicate marking (MarkDuplicate)
- Sorting and indexing (sambamba)
- Base quality score recalibration (optional, GATK)
- Alignment quality control (Qualimap)
-
Variant calling (V)
- Variant calling using selected caller:
- Dupcaller
- Mutect2
- Strelka2
- Variant calling using selected caller:
-
Optimization (O)
-
Annotation (A)
- Variant annotation (Annovar)
- Context annotation (Gama annot)
-
Quality Control and Reporting
- Comprehensive MultiQC report
- Minimum recommended resources: 8 CPUs, 32GB RAM
- For whole genome data: 16+ CPUs, 64GB+ RAM
- Temporary disk space: ~3x input file size
If you use this pipeline, please cite:
- The pipeline:
pavoa-nf: Pipeline for variant detection. Preprocessing Alignment, Variant calling, Optimization, Annotation. https://github.com/IARCbioinfo/pavoa-nf - Nextflow:
Paolo Di Tommaso, et al. Nextflow enables reproducible computational workflows. Nature Biotechnology 35, 316β319 (2017). doi:10.1038/nbt.3820
Please also cite the individual tools used by the pipeline.
| Name | Description | |
|---|---|---|
| Cahais Vincent* | cahaisv@iarc.who.int | Developer to contact for support |
