WGS pipeline parallelization #202

@kamyshova

Issue

The WGS pipeline can be noticeably time-consuming because of deep sequencing over the entire genome (~2 billion reads). It would be great to parallelize the post-alignment and variant calling steps wherever possible.

Approach

Parallelization can be applied to both the post-alignment and the variant calling steps:

  • Post-alignment step: multi-threading approach, using the Spark versions of the GATK tools (MarkDuplicatesSpark + BaseRecalibratorSpark + ApplyBQSRSpark).
  • Variant calling step: scatter-gather approach, splitting the reference into pieces and calling each piece independently.
    E.g. the Mutect2 tool can be run with a list of intervals to restrict it to a subset of genomic regions.
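A minimal dry-run sketch of the scatter-gather idea, assuming one Mutect2 job per contig. It only prints the commands, it does not run GATK. The contig list, the `unfiltered.*.vcf.gz` file names, and the helper functions `scatter_cmds` and `gather_cmd` are hypothetical and used for illustration only; `-L` (intervals) and `MergeVcfs` are existing GATK options and tools.

```shell
#!/bin/sh
# Dry-run sketch: print one Mutect2 command per interval, then the gather command.
# REF, BAM and CONTIGS are placeholders, not values taken from the pipeline.
REF="reference.fasta"
BAM="output_bqsr.bam"
CONTIGS="chr1 chr2 chr3"   # hypothetical subset of the reference contigs

# Scatter: one independent Mutect2 call per contig, restricted with -L.
scatter_cmds() {
  for contig in $CONTIGS; do
    printf '%s\n' "gatk Mutect2 -R $REF -I $BAM -L $contig -O unfiltered.$contig.vcf.gz"
  done
}

# Gather: merge the per-contig VCFs back into a single file.
gather_cmd() {
  inputs=""
  for contig in $CONTIGS; do
    inputs="$inputs -I unfiltered.$contig.vcf.gz"
  done
  printf '%s\n' "gatk MergeVcfs$inputs -O unfiltered.vcf.gz"
}

scatter_cmds
gather_cmd
```

The printed scatter commands are independent of each other, so they could be handed to a job scheduler or run in the background. Each Mutect2 shard also writes its own `.stats` file; those would probably need to be combined (e.g. with `MergeMutectStats`) before filtering, which this sketch leaves out.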

Spark-enabled GATK tools

  • MarkDuplicatesSpark
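# -M (duplication metrics file) is optional and can be omitted.
# Spark arguments are passed after the bare "--" separator.
gatk MarkDuplicatesSpark \
   -I sorted_with_readgroup.bam \
   -O output_marked_duplicates.bam \
   -M marked_dup_metrics.txt \
   -- \
   --spark-runner SPARK \
   --spark-master MASTER_URL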
  • BaseRecalibratorSpark
# Spark arguments are passed after the bare "--" separator.
gatk BaseRecalibratorSpark \
   -I output_marked_duplicates.bam \
   -R reference.fasta \
   --known-sites sites_of_variation.vcf \
   --known-sites setOfSitesToMask.vcf \
   -O output_recal.table \
   -- \
   --spark-runner SPARK \
   --spark-master MASTER_URL
  • ApplyBQSRSpark
# Spark arguments are passed after the bare "--" separator.
gatk ApplyBQSRSpark \
   -I output_marked_duplicates.bam \
   -bqsr output_recal.table \
   -O output_bqsr.bam \
   -- \
   --spark-runner SPARK \
   --spark-master MASTER_URL
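
For a single machine without a Spark cluster, the same three steps can probably be run with the local Spark runner. The sketch below is a dry run that only prints the commands in order. The `THREADS` value and the helper functions `spark_args` and `pipeline_cmds` are assumptions for illustration; the input and output file names are the ones used above.

```shell
#!/bin/sh
# Dry-run sketch: print the three post-alignment Spark steps for local execution.
THREADS=8   # assumed number of cores to give Spark

# Build the Spark arguments; the gatk wrapper expects them after a bare "--".
spark_args() {
  printf '%s\n' "-- --spark-runner LOCAL --spark-master local[$1]"
}

# Print the steps in dependency order: mark duplicates, build the
# recalibration table, then apply it to the duplicate-marked BAM.
pipeline_cmds() {
  sa="$(spark_args "$THREADS")"
  printf '%s\n' "gatk MarkDuplicatesSpark -I sorted_with_readgroup.bam -O output_marked_duplicates.bam $sa"
  printf '%s\n' "gatk BaseRecalibratorSpark -I output_marked_duplicates.bam -R reference.fasta --known-sites sites_of_variation.vcf -O output_recal.table $sa"
  printf '%s\n' "gatk ApplyBQSRSpark -I output_marked_duplicates.bam -bqsr output_recal.table -O output_bqsr.bam $sa"
}

pipeline_cmds
```

When the printed commands are actually executed, `local[8]` should be quoted (`'local[8]'`) so the shell does not treat the brackets as a glob pattern.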
