WGS pipeline parallelization #202

# Issue

The WGS pipeline is noticeably time-consuming because it processes deep sequencing data covering the entire genome (~2 billion reads). It would be great to parallelize the post-alignment and variant calling steps wherever possible.

# Approach

Both the post-alignment and the variant calling steps can be parallelized:

- **Post-alignment step**: multi-threading approach, using the Spark versions of the `GATK` tools (`MarkDuplicatesSpark` + `BaseRecalibratorSpark` + `ApplyBQSRSpark`).
- **Variant calling step**: scatter-gather approach, i.e. splitting the reference into intervals, calling variants on each interval in parallel, and merging the results.
  E.g. the **Mutect2** tool can be run with interval lists to restrict it to a subset of genomic regions.
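
The scatter-gather idea for the variant calling step could be sketched roughly as below. This is only a dry-run illustration that prints the commands instead of running them: it assumes one shard per contig taken from a `.fai` index, and the helper names (`list_contigs`, `mutect2_cmd`, `gather_cmd`) as well as the file names `reference.fasta`, `output_bqsr.bam`, `shard_*.vcf.gz` and `merged.vcf.gz` are placeholders, not part of the pipeline.

```bash
# Scatter-gather sketch (dry run): the functions only print commands.
# All file names are placeholders chosen for illustration.

# Print one contig name per line from a samtools-style .fai index
# (the first tab-separated column of the index is the contig name).
list_contigs() {
    cut -f1 "$1"
}

# Scatter: print the Mutect2 command for a single contig.
mutect2_cmd() {
    echo "gatk Mutect2 -R reference.fasta -I output_bqsr.bam -L $1 -O shard_$1.vcf.gz"
}

# Gather: print the MergeVcfs command for all shards, in the order given.
gather_cmd() {
    cmd="gatk MergeVcfs"
    for contig in "$@"; do
        cmd="$cmd -I shard_$contig.vcf.gz"
    done
    echo "$cmd -O merged.vcf.gz"
}
```

For example, `list_contigs reference.fasta.fai | while read -r c; do mutect2_cmd "$c"; done` prints one Mutect2 command per contig, which could then be handed to a job scheduler or run in parallel. Note that Mutect2 also writes a `.stats` file next to each shard; these would likely need to be combined as well (e.g. with `MergeMutectStats`) before filtering.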

### Spark-enabled GATK tools

- **MarkDuplicatesSpark**
```
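# -M (duplication metrics file) is optional for MarkDuplicatesSpark.
# Spark arguments go after the `--` separator.
gatk MarkDuplicatesSpark \
   -I sorted_with_readgroup.bam \
   -O output_marked_duplicates.bam \
   -M marked_dup_metrics.txt \
   -- \
   --spark-runner SPARK \
   --spark-master MASTER_URL
```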
- **BaseRecalibratorSpark**
```
# Spark arguments go after the `--` separator.
gatk BaseRecalibratorSpark \
   -I output_marked_duplicates.bam \
   -R reference.fasta \
   --known-sites sites_of_variation.vcf \
   --known-sites setOfSitesToMask.vcf \
   -O output_recal.table \
   -- \
   --spark-runner SPARK \
   --spark-master MASTER_URL
```
- **ApplyBQSRSpark**
```
# Spark arguments go after the `--` separator.
gatk ApplyBQSRSpark \
   -I output_marked_duplicates.bam \
   -bqsr output_recal.table \
   -O output_bqsr.bam \
   -- \
   --spark-runner SPARK \
   --spark-master MASTER_URL
```
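
If no Spark cluster is available, the same three tools can presumably be run multi-threaded on a single machine by pointing `--spark-master` at a local master such as `local[8]`. The sketch below only prints the chained commands (dry run); the function name `post_alignment_plan` is hypothetical, and the file names are simply reused from the examples above.

```bash
# Print the three post-alignment commands for a local Spark run.
# $1 = number of local cores. Dry run only: nothing is executed.
post_alignment_plan() {
    # Quoted in the printed commands because [ ] are shell glob characters.
    master="local[$1]"
    echo "gatk MarkDuplicatesSpark -I sorted_with_readgroup.bam -O output_marked_duplicates.bam --spark-master '$master'"
    echo "gatk BaseRecalibratorSpark -I output_marked_duplicates.bam -R reference.fasta --known-sites sites_of_variation.vcf -O output_recal.table --spark-master '$master'"
    echo "gatk ApplyBQSRSpark -I output_marked_duplicates.bam -bqsr output_recal.table -O output_bqsr.bam --spark-master '$master'"
}
```

Calling `post_alignment_plan 8` prints the commands in execution order; each step consumes the output of the previous one, so they have to run sequentially even though each tool is parallel internally.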
