Downloading disambiguate reference files and alternative solutions #34

@skchronicles

About
Currently, the pipeline's cache subcommand does not download disambiguate's reference files, i.e. the bwa indices for each of the supported reference genomes. As such, these reference files must exist on the host's filesystem prior to execution. They already exist on BigSky and Biowulf; however, if the pipeline were set up on another cluster, they would need to be downloaded separately, outside of the cache subcommand.

Here is an example command to download disambiguate's reference files from helix/biowulf:

rsync -rav -e ssh helix.nih.gov:/data/OpenOmics/references/genomes .
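
Since the pipeline expects these bwa indices to exist before it runs, it may help to check that a download is complete. Below is a minimal sketch, assuming the indices were built with classic `bwa index`, which writes five files per index prefix (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`). The function name `has_bwa_index` and the example prefix are hypothetical; the actual layout under `genomes/` is not described in this issue.

```sh
#!/bin/sh
# Sketch only: has_bwa_index is a hypothetical helper, not part of the pipeline.

# Return 0 if all five files written by `bwa index` exist and are
# non-empty for the given index prefix; otherwise report the first
# missing file on stderr and return 1.
has_bwa_index() {
    prefix="$1"
    for ext in amb ann bwt pac sa; do
        if [ ! -s "${prefix}.${ext}" ]; then
            echo "missing: ${prefix}.${ext}" >&2
            return 1
        fi
    done
    return 0
}

# Example usage (hypothetical prefix; adjust to the real directory layout):
# has_bwa_index genomes/some_genome/genome.fa || echo "index incomplete" >&2
```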

Road map
Here are some proposed long-term solutions:

  1. Move the reference files into our data-share directory for easy downloads, and update the cache subcommand to pull from this location.
  2. Build the alignment indices on the fly in the output directory and remove them in a post-processing hook. This should not be a rate-limiting step of the pipeline: it can start running during the bcl2fastq conversion and should finish well before trimming completes. The only downside is a slight increase in disk usage while the pipeline is running; since the pipeline would clean up these files after the run completes, this is a minor cost.
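
The second option could look roughly like the sketch below. It is only an illustration: the function names `build_bwa_index` and `cleanup_bwa_indices` and the `bwa_index/` subdirectory are assumptions, not existing pipeline code. The only real CLI call used is `bwa index -p PREFIX in.fasta`, which writes the index files under the given prefix.

```sh
#!/bin/sh
# Sketch only: function names and the bwa_index/ layout are hypothetical.

# Build a bwa index for one FASTA file under <outdir>/bwa_index/<name>/
# and print the resulting index prefix on stdout.
build_bwa_index() {
    fasta="$1"
    outdir="$2"
    name=$(basename "$fasta")
    idxdir="${outdir}/bwa_index/${name%.*}"
    mkdir -p "$idxdir" || return 1
    # `bwa index -p PREFIX` writes PREFIX.{amb,ann,bwt,pac,sa};
    # send any bwa output to stderr so stdout carries only the prefix.
    bwa index -p "${idxdir}/${name}" "$fasta" >&2 || return 1
    echo "${idxdir}/${name}"
}

# Post-processing hook: remove all temporary indices for this run.
cleanup_bwa_indices() {
    outdir="$1"
    rm -rf "${outdir}/bwa_index"
}

# Example usage (hypothetical paths):
# prefix=$(build_bwa_index /path/to/genome.fa /path/to/output) || exit 1
# ... run alignment against "$prefix" ...
# cleanup_bwa_indices /path/to/output
```

In a real workflow the build step would presumably be scheduled alongside the bcl2fastq conversion, and the cleanup would run only after every job that reads the indices has finished.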
