VIOLA is a patient-specific pipeline for variant prioritization in mitochondrial diseases, integrating genomics, transcriptomics, and phenotype data with machine learning.
- R (≥ 4.2)
- Python (≥ 3.9)
- VEP (≥ 99)
- CADD (≥ 1.6)
VIOLA requires variants annotated with VEP (Variant Effect Predictor), CADD (Combined Annotation Dependent Depletion), and ClinVar to run.
Please run VEP with the following options:
vep -i input_file.vcf -o output_file.vcf --cache --offline --assembly GRCh38 --vcf \
--check_existing --af --af_1kg --af_gnomad --af_esp
where input_file.vcf is the raw VCF file and output_file.vcf is the annotated output VCF file.
Then, run filter_vep to select only rare variants with the following options:
filter_vep -i input_file.vcf -o output_file.vcf -filter "SYMBOL and ((AF < 0.01 or gnomAD_AF < 0.01) or (not AF and not gnomAD_AF and not EUR_AF))"
where input_file.vcf is the output VCF file of the previous step and output_file.vcf is the filtered output VCF file.
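To make the filter expression easier to read, here is a minimal Python sketch of the same logic applied to one annotated variant. It is only an illustration: the function names are hypothetical, and it assumes each frequency field holds a single numeric value or is absent, which may not hold for every VEP output.

```python
def _frequency(record, field):
    """Return the field as a float, or None when it is absent or empty."""
    raw = record.get(field)
    if raw in (None, "", "-", "."):
        return None
    return float(raw)


def passes_rare_filter(record, threshold=0.01):
    """Mirror: SYMBOL and ((AF < t or gnomAD_AF < t) or (no AF, gnomAD_AF, EUR_AF))."""
    # The variant must be assigned to a gene symbol.
    if not record.get("SYMBOL"):
        return False
    af = _frequency(record, "AF")
    gnomad_af = _frequency(record, "gnomAD_AF")
    eur_af = _frequency(record, "EUR_AF")
    # Rare in at least one of the two reference frequency fields.
    below_threshold = (af is not None and af < threshold) or (
        gnomad_af is not None and gnomad_af < threshold
    )
    # Never observed: none of the three frequency fields is defined.
    unobserved = af is None and gnomad_af is None and eur_af is None
    return below_threshold or unobserved


# Hypothetical records, for illustration only.
examples = [
    {"SYMBOL": "POLG", "AF": "0.002"},
    {"SYMBOL": "POLG", "AF": "0.2", "gnomAD_AF": "0.3"},
    {"SYMBOL": "POLG"},
    {"AF": "0.001"},
]
results = [passes_rare_filter(r) for r in examples]
```

Note that, because the two frequency tests are joined with "or", a variant that is common in one field but rare in the other still passes.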
Finally, run bcftools with the plugin split-vep to obtain a tabulated file.
echo -e "CHROM\tPOS\tREF\tALT\tQUAL\tGT\tAD\tDP\t$(bcftools +split-vep -l input_file.vcf | cut -f 2 | tr '\n' '\t' | sed 's/\t$//')" > output_file.tsv ; \
bcftools +split-vep -f '%CHROM\t%POS\t%REF\t%ALT\t%QUAL\t[%GT]\t[%AD]\t[%DP]\t%CSQ\n' -d -A tab input_file.vcf >> output_file.tsv
where input_file.vcf is the output file of the previous step and output_file.tsv is the resulting TSV file, used as input for viola_step1_merge.R.
Please run CADD with the following options:
CADD.sh -a -g GRCh38 -o output.tsv.gz input_file.vcf
where input_file.vcf is the raw VCF file containing the variants and output.tsv.gz is the output TSV file, used as input for viola_step1_merge.R.
A preprocessed ClinVar file is provided in the repository resources: clinvar_210125_hg38_cleaned.tsv.
The following script, located in the scripts directory, wraps the whole VIOLA pipeline in a single command. Run it from that directory.
bash run_viola.sh --help will give you the following help message:
Usage: run_viola.sh -v VEP_INPUT -c CADD_INPUT -h HPO_TABLE -f UNIQUE_VARIANTS_VCF -r RESOURCES [-o OUTPUT_DIR]
Arguments:
-v, --vep Path to VEP input file
-c, --cadd Path to CADD input file
-h, --hpo Path to patient HPO table
-f, --vcf Path to VCF file of unique rare variants
-r, --resources Path to resources folder (contains ClinVar file)
-o, --out Output directory (default: results/viola_run)
--help Show this help message
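If you prefer to launch the wrapper from Python, the call can be assembled as below. This is a sketch under assumptions: the helper name and all file paths are hypothetical placeholders, and only the long options listed in the help message above are used.

```python
import shlex


def build_viola_command(vep, cadd, hpo, vcf, resources, out=None):
    """Assemble the argument list for run_viola.sh from the documented options."""
    cmd = [
        "bash", "run_viola.sh",
        "--vep", vep,
        "--cadd", cadd,
        "--hpo", hpo,
        "--vcf", vcf,
        "--resources", resources,
    ]
    # --out is optional; the script falls back to its default when omitted.
    if out is not None:
        cmd += ["--out", out]
    return cmd


# Placeholder paths, for illustration only.
cmd = build_viola_command(
    vep="patient1_vep.tsv",
    cadd="patient1_cadd.tsv.gz",
    hpo="hpo_table.csv",
    vcf="patient1_unique.vcf",
    resources="../resources",
)
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.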
Alternatively, the VIOLA pipeline is composed of 3 scripts that you can run independently following the instructions below:
- viola_step1_merge.R
- viola_step2_cluster.py
- viola_step3_rank.R
viola_step1_merge.R merges annotation datasets (e.g. VEP and CADD annotations) into a unified input table for downstream analysis.
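Conceptually, such a merge joins the two tables on the variant coordinates. The Python sketch below illustrates the idea only and is not the R script's actual code: the function names are hypothetical, the CADD column names (#Chrom, Pos, Ref, Alt, PHRED) and the "chr" prefix handling are assumptions about the inputs, and it keeps a single CADD row per variant.

```python
def variant_key(chrom, pos, ref, alt):
    """Build a join key; strip a leading 'chr' so both tables name chromosomes alike."""
    chrom = str(chrom)
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    return (chrom, int(pos), ref, alt)


def merge_annotations(vep_rows, cadd_rows):
    """Left-join CADD PHRED scores onto VEP rows by (chrom, pos, ref, alt)."""
    cadd_by_key = {
        variant_key(r["#Chrom"], r["Pos"], r["Ref"], r["Alt"]): r for r in cadd_rows
    }
    merged = []
    for row in vep_rows:
        key = variant_key(row["CHROM"], row["POS"], row["REF"], row["ALT"])
        cadd = cadd_by_key.get(key, {})
        # Variants without a CADD match keep their VEP fields and get None.
        merged.append({**row, "CADD_PHRED": cadd.get("PHRED")})
    return merged


# Hypothetical rows, for illustration only.
vep_rows = [
    {"CHROM": "chr1", "POS": "100", "REF": "A", "ALT": "G", "SYMBOL": "GENE1"},
    {"CHROM": "chr2", "POS": "200", "REF": "C", "ALT": "T", "SYMBOL": "GENE2"},
]
cadd_rows = [{"#Chrom": "1", "Pos": "100", "Ref": "A", "Alt": "G", "PHRED": "23.1"}]
merged = merge_annotations(vep_rows, cadd_rows)
```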
The following R libraries are required:
- dplyr
- tidyr
Rscript viola_step1_merge.R -h will give you the following help message:
Usage: viola_step1_merge.R [options]
Options:
-v VEP_INPUT, --vep_input=VEP_INPUT
Path to VEP input file
-c CADD_INPUT, --cadd_input=CADD_INPUT
Path to CADD input file
-l CLINVAR_INPUT, --clinvar_input=CLINVAR_INPUT
Path to ClinVar input file
-o OUTPUT_PATH, --output_path=OUTPUT_PATH
Path to output directory
-h, --help
Show this help message and exit
viola_step2_cluster.py runs the Variational Autoencoder (VAE) for dimensionality reduction and applies DBSCAN clustering to group outlier variants.
The following Python libraries are required:
- tensorflow
- sklearn
- pandas
- numpy
The input file is the output file of the viola_step1_merge.R script.
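To illustrate what DBSCAN does with points in a low-dimensional space, here is a dependency-free Python sketch of the algorithm. It is a teaching stand-in, not the code in viola_step2_cluster.py, which relies on the libraries listed above; the 2-D points stand for hypothetical latent coordinates, and the eps and min_samples values are arbitrary.

```python
from math import dist


def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # A point counts as its own neighbor.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


# Two tight groups and one isolated point (hypothetical coordinates).
points = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (20, 20)]
labels = dbscan(points, eps=0.5, min_samples=3)
```

In this toy example the isolated point receives the noise label -1, while each dense group gets its own cluster label.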
python viola_step2_cluster.py -h will give you the following help message:
usage: viola_step2_cluster.py [-h] -f FILE_PATH -o OUTPUT_FOLDER_PATH
optional arguments:
-h, --help show this help message and exit
-f FILE_PATH, --file_path FILE_PATH
Path to the file to process
-o OUTPUT_FOLDER_PATH, --output_folder_path OUTPUT_FOLDER_PATH
Path to the output folder
viola_step3_rank.R applies filtering (quality, biotype, and Variant Allele Frequency), integrates HPO terms, and generates the final variant ranking.
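A variant allele frequency is commonly derived from the AD (allelic depth) field, such as the one exported during the bcftools step. The Python sketch below shows that calculation under assumptions: the function name is hypothetical, AD is taken to be a comma-separated "ref,alt" string, and the way the R script computes and thresholds the value may differ.

```python
def vaf_from_ad(ad, alt_index=1):
    """Return alt depth / total depth from an AD string, or None if unavailable."""
    tokens = ad.strip().split(",")
    # Missing depths are written as "." in VCF-derived tables.
    if any(t in ("", ".") for t in tokens) or alt_index >= len(tokens):
        return None
    depths = [int(t) for t in tokens]
    total = sum(depths)
    if total == 0:
        return None  # no reads: the ratio is undefined
    return depths[alt_index] / total


# 12 reference reads and 8 alternate reads give 8 / 20.
example_vaf = vaf_from_ad("12,8")
```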
The following R libraries are required:
- dplyr
- tidyr
- stringr
- data.table
- ontologyIndex
The input file is one of the outputs of the viola_step2_cluster.py script; its name ends with the suffix "res_dbscan.csv".
The HPO table must be provided by the user. It is a two-column CSV file such as:
| patient_id | hpo_id |
|---|---|
| patient1 | HP:0001250 |
| patient1 | HP:0000518 |
| patient2 | HP:0001638 |
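Before running the ranking step, it can help to check that the HPO table is well formed. The following Python sketch reads such a table and groups HPO terms by patient; it assumes a comma-delimited file with the header names patient_id and hpo_id shown above, and the function name is hypothetical.

```python
import csv
import io
import re
from collections import defaultdict

HPO_ID = re.compile(r"HP:\d{7}")


def load_hpo_table(handle):
    """Group HPO identifiers by patient, rejecting malformed identifiers."""
    terms = defaultdict(list)
    for row in csv.DictReader(handle):
        patient = row["patient_id"].strip()
        hpo_id = row["hpo_id"].strip()
        if not HPO_ID.fullmatch(hpo_id):
            raise ValueError(f"Malformed HPO identifier: {hpo_id!r}")
        terms[patient].append(hpo_id)
    return dict(terms)


# In-memory copy of the example table above.
example = io.StringIO(
    "patient_id,hpo_id\n"
    "patient1,HP:0001250\n"
    "patient1,HP:0000518\n"
    "patient2,HP:0001638\n"
)
hpo_by_patient = load_hpo_table(example)
```

For a real file, pass an open handle instead, for example open("hpo_table.csv", newline="").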
The VCF file of unique rare variants can be obtained by filtering the original VCF using bcftools. If no cohort is available to determine whether a variant is unique, please provide a VCF file containing only rare variants.
The resources folder contains reference files for transcriptomic co-expression matrices and mitochondrial gene lists. The repository already includes the necessary files in resources.
Rscript viola_step3_rank.R -h will give you the following help message:
Usage: viola_step3_rank.R [options]
Options:
-f INPUT, --input=INPUT
Path to input file
-o OUTPUT_PATH, --output_path=OUTPUT_PATH
Path to output directory
-t HPO_TABLE, --hpo_table=HPO_TABLE
Path to patient HPO table
-p PATH_UNIQUE_VAR, --path_unique_var=PATH_UNIQUE_VAR
Path to VCF of unique rare variants
-r RESOURCES, --resources=RESOURCES
Path to resources folder containing all required files
-h, --help
Show this help message and exit
- Clone this repository
- Install dependencies (Python, R, etc.)
- Prepare input VCF and configuration files