cryVar

A Nextflow pipeline for detecting cryptic variants in SARS-CoV-2 sequencing data.

Overview

The pipeline processes SARS-CoV-2 sequencing data through the following steps:

Alignment: Aligns reads to the reference genome using minimap2
Sorting & Indexing: Sorts and indexes BAM files
Primer Trimming: Removes primer sequences using ivar trim
Variant Calling: Calls physically linked variants using covar
Cryptic Variant Detection: Queries the outbreak.info clinical API for cryptic variants
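
The first three steps map onto widely used command-line tools. As a rough sketch, the snippet below only assembles the command lines (it does not run them). It assumes short-read data, hypothetical intermediate file names, and commonly used flags for each tool; the flags used inside the pipeline's own modules are not shown in this README and may differ. The covar and outbreak.info steps are left out because their invocations are not documented here.

```python
import shlex
from typing import List


def build_commands(sample_id: str, fastq: str, reference: str,
                   primer_bed: str, min_read_len: int = 80) -> List[List[str]]:
    """Assemble (but do not run) commands for alignment, sorting/indexing, and primer trimming.

    Assumption: these are common flags for each tool, not necessarily the
    ones used by the pipeline's modules. File names are hypothetical.
    """
    sam = f"{sample_id}.sam"
    sorted_bam = f"{sample_id}.sorted.bam"
    trimmed_prefix = f"{sample_id}.trimmed"
    return [
        # Alignment: short-read preset, SAM output written to a file
        ["minimap2", "-a", "-x", "sr", "-o", sam, reference, fastq],
        # Sorting and indexing
        ["samtools", "sort", "-o", sorted_bam, sam],
        ["samtools", "index", sorted_bam],
        # Primer trimming: -m is the minimum read length kept after trimming
        ["ivar", "trim", "-i", sorted_bam, "-b", primer_bed,
         "-p", trimmed_prefix, "-m", str(min_read_len)],
    ]


# Print the commands for one sample so they can be inspected before running.
for command in build_commands(
    "SRR36112015",
    "data/fastq/SRR36112015.fastq.gz",
    "reference/reference_genome/NC_045512_Hu-1.fasta",
    "reference/primer/ARTICv5.3.2.bed",
):
    print(shlex.join(command))
```

Returning argument lists rather than shell strings keeps file names with spaces or special characters safe if the commands are later passed to `subprocess.run`.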

Prerequisites

Nextflow (>= 25.10.0)
Conda/Mamba (for conda installation)
GISAID account for accessing outbreak.info clinical data

Installation

Option 1: Conda/Mamba Environment

Create the conda environment from the provided environment.yml:

conda env create -f environment.yml
conda activate cryvar

Or with mamba (faster):

mamba env create -f environment.yml
mamba activate cryvar

Input Files

Required Files

FASTQ files: Sequencing reads (single-end or paired-end)
- Place in data/fastq/ directory (default)
- Naming convention: {sample_id}.fastq.gz or {sample_id}_{R1,R2}.fastq.gz for paired-end
Metadata CSV file: Required with the following columns:
- sample_id: Sample identifier (must match FASTQ file names)
- primer_scheme: Primer scheme name (must match a .bed file in reference/primer/)
- Additional columns, such as collection date and location, are optional

Example data/metadata/sample_metadata.csv:
```
sample_id,primer_scheme,collection_date
SRR36112015,ARTICv5.3.2,2023-01-15
```
Reference files:
- Reference genome: reference/reference_genome/NC_045512_Hu-1.fasta
- Annotation file: reference/annotation/NC_045512_Hu-1.gff
- Primer BED files: reference/primer/{primer_scheme}.bed
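
The input requirements above can be checked before launching a run. The following is a minimal sketch, not part of the pipeline: `sample_id_from_fastq` and `check_metadata` are hypothetical helper names, and the paired-end suffix is assumed to be `_R1`/`_R2` directly before `.fastq.gz`.

```python
import csv
import re
from pathlib import Path
from typing import List

REQUIRED_COLUMNS = ("sample_id", "primer_scheme")


def sample_id_from_fastq(filename: str) -> str:
    """Strip .fastq.gz and an optional _R1/_R2 suffix to recover the sample ID."""
    name = Path(filename).name
    return re.sub(r"(_R[12])?\.fastq\.gz$", "", name)


def check_metadata(metadata_csv: str, fastq_dir: str, primer_dir: str) -> List[str]:
    """Return a list of problems found; an empty list means the inputs look consistent."""
    with open(metadata_csv, newline="") as handle:
        reader = csv.DictReader(handle)
        missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
        if missing:
            return [f"missing required column(s): {', '.join(missing)}"]
        rows = list(reader)

    # Sample IDs that have at least one FASTQ file in the reads directory
    fastq_ids = {sample_id_from_fastq(p.name) for p in Path(fastq_dir).glob("*.fastq.gz")}

    problems = []
    for row in rows:
        sample_id, scheme = row["sample_id"], row["primer_scheme"]
        if sample_id not in fastq_ids:
            problems.append(f"{sample_id}: no FASTQ file found in {fastq_dir}")
        if not (Path(primer_dir) / f"{scheme}.bed").is_file():
            problems.append(f"{sample_id}: no primer file {scheme}.bed in {primer_dir}")
    return problems
```

With the default layout, `check_metadata("data/metadata/sample_metadata.csv", "data/fastq", "reference/primer")` returns one message per mismatch, which can be printed before starting Nextflow.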

Usage

GISAID Authentication

In order to access the outbreak.info clinical API, you must authenticate with GISAID.

Run authentication:

python gisaid_authentication.py

This opens a web browser and prompts you for your GISAID credentials. When running the pipeline locally, the auth token is stored in your environment, so authentication only needs to be done once. When running the pipeline in Docker, you must also set the GISAID token path as an environment variable:

export GISAID_TOKEN_PATH=$(bash find_gisaid_token.sh)
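
Because a missing token only surfaces once the outbreak.info query starts, it can help to check the variable up front. The sketch below is a hypothetical pre-flight check (`resolve_gisaid_token` is not part of this repository); it assumes only that `GISAID_TOKEN_PATH` should point to an existing path, without assuming anything about the token's format.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_gisaid_token(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the path stored in GISAID_TOKEN_PATH, or raise a descriptive error."""
    env = os.environ if env is None else env
    raw = env.get("GISAID_TOKEN_PATH", "").strip()
    if not raw:
        raise RuntimeError(
            "GISAID_TOKEN_PATH is not set; run gisaid_authentication.py and export it first"
        )
    path = Path(raw).expanduser()
    # Assumption: the token may be a file or a directory, so only existence is checked.
    if not path.exists():
        raise RuntimeError(f"GISAID_TOKEN_PATH points to a missing path: {path}")
    return path
```

Calling `resolve_gisaid_token()` with no arguments reads the current process environment; passing a dictionary makes the check easy to exercise without touching real credentials.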

With Custom Parameters

nextflow run main.nf \
  --metadata data/metadata/sample_metadata.csv \
  --reads "data/fastq/*.fastq.gz" \
  --outdir results \
  --minreadlen 80

With Docker

Build the Docker image:

docker build -t cryvar .

Set the GISAID token path:

export GISAID_TOKEN_PATH=$(bash find_gisaid_token.sh)

Run the pipeline:

nextflow run main.nf \
  -profile docker \
  --metadata data/metadata/sample_metadata.csv

Note: The DETECT_CRYPTIC process can take up to several hours to complete, so running the pipeline in the background is recommended.
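
One way to run a long job in the background is to start it detached from the terminal and send its output to a log file. The helper below is an illustrative sketch (`launch_in_background` is a hypothetical name, and the approach is POSIX-only); it is not how the pipeline itself is launched.

```python
import subprocess
from typing import List


def launch_in_background(command: List[str], log_path: str) -> subprocess.Popen:
    """Start a command detached from the terminal, appending stdout and stderr to a log file."""
    log = open(log_path, "ab")
    try:
        return subprocess.Popen(
            command,
            stdout=log,
            stderr=subprocess.STDOUT,
            stdin=subprocess.DEVNULL,
            start_new_session=True,  # POSIX only: the child survives the terminal closing
        )
    finally:
        # The child process keeps its own copy of the file descriptor.
        log.close()
```

For example, `launch_in_background(["nextflow", "run", "main.nf", "-profile", "docker", "--metadata", "data/metadata/sample_metadata.csv"], "cryvar.log")` returns immediately, and progress can be followed with `tail -f cryvar.log`.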

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| `--metadata` | `null` | Required. Path to metadata CSV file |
| `--reads` | `"data/fastq/*.fastq.gz"` | Glob pattern for input FASTQ files |
| `--reference` | `"reference/reference_genome/NC_045512_Hu-1.fasta"` | Path to reference genome FASTA |
| `--annotation` | `"reference/annotation/NC_045512_Hu-1.gff"` | Path to annotation GFF file |
| `--primer_dir` | `"reference/primer"` | Directory containing primer BED files |
| `--outdir` | `"results"` | Output directory |
| `--minreadlen` | `80` | Minimum read length after trimming |
| `--ivar_options` | `""` | Additional options for ivar trim |
| `--covar_options` | `""` | Additional options for covar |
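
When runs are scripted, the parameters above can be assembled programmatically. The sketch below is a hypothetical wrapper (`nextflow_command` is not part of this repository) that copies the defaults from the table, enforces the required `--metadata` argument, and only passes values that differ from the defaults. The optional `profile` argument corresponds to the `-profile docker` usage shown earlier.

```python
from typing import Dict, List, Optional

# Defaults copied from the parameter table; --metadata has no default.
DEFAULTS: Dict[str, object] = {
    "reads": "data/fastq/*.fastq.gz",
    "reference": "reference/reference_genome/NC_045512_Hu-1.fasta",
    "annotation": "reference/annotation/NC_045512_Hu-1.gff",
    "primer_dir": "reference/primer",
    "outdir": "results",
    "minreadlen": 80,
    "ivar_options": "",
    "covar_options": "",
}


def nextflow_command(metadata: str, profile: Optional[str] = None, **overrides) -> List[str]:
    """Build the argument list for `nextflow run main.nf` from keyword overrides."""
    if not metadata:
        raise ValueError("--metadata is required")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameter(s): {sorted(unknown)}")

    cmd = ["nextflow", "run", "main.nf"]
    if profile:
        cmd += ["-profile", profile]
    cmd += ["--metadata", metadata]
    for name, value in overrides.items():
        # Skip values that match the documented default to keep the command short.
        if value != DEFAULTS[name]:
            cmd += [f"--{name}", str(value)]
    return cmd
```

Rejecting unknown parameter names catches typos such as `min_read_len` before Nextflow starts.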

Output Structure

Pipeline outputs

results/
├── sorted_post_trim/       # Final sorted BAM files
│   ├── {sample_id}.final.sorted.bam
│   └── {sample_id}.final.sorted.bam.bai
├── covar/                  # Variant calling results
│   └── {sample_id}.covar.tsv
└── detect_cryptic/
    ├── covar_clinical_detections.tsv   # All physically linked mutations with number of clinical detections
    └── cryptic_variants.tsv            # Same as above, but filtered for cryptic variants (<= 10 clinical detections, additional QC filters)

covar_clinical_detections.tsv

| Column | Description |
| --- | --- |
| `nt_mutations` | Nucleotide mutations for this cluster |
| `aa_mutations` | Corresponding amino acid translations (where possible*) |
| `cluster_depth` | Total number of read pairs with this cluster of mutations |
| `total_depth` | Total number of reads spanning this cluster |
| `frequency` | Mutation frequency (cluster depth / total depth) |
| `coverage_start` | Maximum read start site for which this cluster was detected |
| `coverage_end` | Minimum read end site for which this cluster was detected |
| `query` | Raw mutation query submitted to outbreak.info API |
| `num_clinical_detections` | Number of clinical detections for this mutation cluster |
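
The columns above are enough to explore the results with the standard library. The sketch below is illustrative only: it applies just the `<= 10` clinical detections threshold mentioned under Pipeline outputs, not the additional QC filters (which are not described in this README), and it assumes the depth and detection columns hold plain numbers. The function names are hypothetical.

```python
import csv
from typing import Dict, Iterable, List, TextIO

MAX_CLINICAL_DETECTIONS = 10  # threshold quoted for cryptic_variants.tsv


def read_tsv(handle: TextIO) -> List[Dict[str, str]]:
    """Read a tab-separated file with a header row into a list of dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


def low_detection_rows(rows: Iterable[Dict[str, str]],
                       max_detections: int = MAX_CLINICAL_DETECTIONS) -> List[Dict[str, str]]:
    """Keep clusters with at most `max_detections` clinical detections."""
    return [r for r in rows if int(r["num_clinical_detections"]) <= max_detections]


def recomputed_frequency(row: Dict[str, str]) -> float:
    """Recompute frequency as cluster depth divided by total depth."""
    return float(row["cluster_depth"]) / float(row["total_depth"])
```

A typical use is to open `results/detect_cryptic/covar_clinical_detections.tsv`, pass the handle to `read_tsv`, and compare `recomputed_frequency(row)` against the reported `frequency` column as a sanity check.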


dylanpilz/cryVar
