WGS pipeline parallelization #202
Issue
The WGS pipeline can be noticeably time-consuming because the entire genome is sequenced deeply (~2 billion reads). It would be great to parallelize the post-alignment and variant calling steps wherever possible.
Approach
The parallelization can be done for the post-alignment and variant calling processes:
- Post-alignment step: multi-threading approach - MarkDuplicates + BaseRecalibrator + ApplyBQSR, using the Spark versions of the GATK tools.
- Variant calling step: scatter-gather approach - splitting the reference into pieces. E.g. the Mutect2 tool can be run with lists of intervals so that each run operates only on a subset of genomic regions.
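The scatter step described above could be sketched roughly as follows. This is a minimal dry-run illustration, not a tested pipeline: the helper names `make_intervals` and `print_scatter_commands`, the fixed-size windowing, the `shard_N.vcf.gz` output names, and the contig name and length are all assumptions made for the example. It only prints one Mutect2 command per interval (using the standard `-R`, `-I`, `-L`, `-O` arguments) and does not run GATK.

```shell
#!/bin/sh
# Dry-run sketch of the scatter step: nothing here calls GATK.

# make_intervals CONTIG LENGTH CHUNK
# Prints 1-based, inclusive CONTIG:START-END windows covering the contig.
# The ( ... ) body runs in a subshell, so the variables stay local.
make_intervals() (
  contig=$1; length=$2; chunk=$3
  start=1
  while [ "$start" -le "$length" ]; do
    end=$(( start + chunk - 1 ))
    # Clamp the last window to the end of the contig.
    if [ "$end" -gt "$length" ]; then end=$length; fi
    echo "${contig}:${start}-${end}"
    start=$(( end + 1 ))
  done
)

# print_scatter_commands CONTIG LENGTH CHUNK
# Prints one Mutect2 command per interval (hypothetical file names).
print_scatter_commands() (
  i=0
  make_intervals "$1" "$2" "$3" | while read -r interval; do
    echo "gatk Mutect2 -R reference.fasta -I output_bqsr.bam -L ${interval} -O shard_${i}.vcf.gz"
    i=$(( i + 1 ))
  done
)

# Example: a hypothetical 250 bp contig split into 100 bp windows.
print_scatter_commands chr1 250 100
```

Each printed command could then be submitted as an independent job. For the gather step, the per-shard VCFs would need to be combined afterwards (for instance with GATK's MergeVcfs, and MergeMutectStats for the Mutect2 stats files); that part is not shown here.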
Spark-enabled GATK tools
- MarkDuplicatesSpark
# -M (duplication metrics file) is optional
gatk MarkDuplicatesSpark \
-I sorted_with_readgroup.bam \
-O output_marked_duplicates.bam \
-M marked_dup_metrics.txt \
-- \
--spark-runner SPARK \
--spark-master MASTER_URL
- BaseRecalibratorSpark
gatk BaseRecalibratorSpark \
-I output_marked_duplicates.bam \
-R reference.fasta \
--known-sites sites_of_variation.vcf \
--known-sites setOfSitesToMask.vcf \
-O output_recal.table \
-- \
--spark-runner SPARK \
--spark-master MASTER_URL
- ApplyBQSRSpark
gatk ApplyBQSRSpark \
-I output_marked_duplicates.bam \
-bqsr output_recal.table \
-O output_bqsr.bam \
-- \
--spark-runner SPARK \
--spark-master MASTER_URL