Shell scripts and workflows for working with Nanopore data. Most scripts use qsub; the GPU basecalling script does not.
Started by @lskatz, contributions from @kapsakcj and potentially YOU!
This is a collection of scripts that do one thing at a time. For example, demultiplexing or basecalling [except for 01_basecall-w-gpu.sh which does both :) ].
Each script should start with np_ to indicate the nanopore workflow. Then,
each script should be named after one of these namespaces, to indicate which stage of the process it belongs to.
Separate each namespace with an underscore. Namespaces may not have underscores
in their names (e.g., a namespace of de_multiplex would be invalid.).
- basecalling: `basecall_` (can be combined with basecalling: `basecall-demux`)
- demultiplexing: `demux_` (can be combined with basecalling: `basecall-demux`)
- preparing the data in each barcode: `prepSample_`
- assembly: `assemble_`
- polishing: `polish_`
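The naming convention above can be checked mechanically. The following is a minimal sketch, not part of the repository: the function name `np_namespace` is hypothetical, and the list of accepted namespaces is copied from the list above.

```shell
# Hypothetical helper: print the namespace (stage) of a script name,
# or fail if the name does not follow the np_<namespace>_... convention.
np_namespace() {
  local name stage
  name=$(basename "$1")
  # Every script name must start with the np_ prefix.
  case "$name" in
    np_*) ;;
    *) echo "error: $name does not start with np_" >&2; return 1 ;;
  esac
  # The namespace is the second underscore-separated field.
  stage=$(printf '%s' "$name" | cut -d_ -f2)
  case "$stage" in
    basecall|basecall-demux|demux|prepSample|assemble|polish)
      printf '%s\n' "$stage" ;;
    *) echo "error: unknown namespace '$stage' in $name" >&2; return 1 ;;
  esac
}
```

For example, `np_namespace np_assemble_flye.sh` prints `assemble`, while a name using `de_multiplex` is rejected because the underscore splits the namespace. Note that this strict check would also reject a name such as np_consensus_racon.sh, since `consensus` is not in the list above.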
This is a collection of workflows in the form of shell scripts. They qsub the scripts individually (except for run_01_basecall-w-gpu.sh since the GPUs aren't available through qsub yet).
For workflow.sh the first positional parameter must be the project folder. Both input and output go to the project folder.
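To illustrate what "qsub the scripts individually" can look like, here is a dry-run sketch that only prints one submission command per barcode directory. It assumes a Grid Engine style `qsub` (the `-N`, `-o` and `-j y` options); the function name `print_qsub_commands` and the job naming scheme are hypothetical and not taken from the actual workflows.

```shell
# Hypothetical dry run: print (do not submit) one qsub command per barcode.
# $1 = project folder, $2 = script to submit for each barcode directory.
print_qsub_commands() {
  local project=$1 script=$2 dir sample job
  for dir in "$project"/demux/barcode*/; do
    # Skip the literal glob pattern when no barcode directory exists.
    [ -d "$dir" ] || continue
    sample=$(basename "$dir")
    job="$(basename "$script" .sh)_$sample"
    # -N names the job, -o sets the log location, -j y merges stderr into stdout.
    printf 'qsub -N %s -o %s -j y %s %s\n' \
      "$job" "$project/log" "$script" "$dir"
  done
}
```

Calling `print_qsub_commands myproject np_assemble_flye.sh` lists the commands so they can be inspected before anything is sent to the scheduler.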
run_01_basecall-w-gpu.sh - guppy GPU basecalling, demultiplexing, and adapter/barcode trimming with guppy_basecaller
run_01_basecall-w-gpu.sh is the runner/driver script for 01_basecall-w-gpu.sh
- Must be run while logged directly into node98 (Tesla V100 GPUs are available through `qsub`, but do not have flash-based storage available yet. This script is set up for node98 to take advantage of its SSD)
- No one else must be running stuff on the node
  - check CPU usage with `htop` and GPU usage with `nvtop` before running the script
- Must be MinION data, generated with an R9.4.1 flowcell (FLO-MIN106) and ligation sequencing kit (SQK-LSK109)
- Must be Native Barcodes 1-24 (NBD103/104/114)
- We'd like to add options for other flowcells and sequencing kits when we come across data from those!
- Takes in 3 arguments (in this order):
  - `$OUTDIR` - an output directory
  - `$FAST5DIR` - a directory containing raw fast5 files
  - `$MODE` - basecalling mode/configuration - either `fast` or `hac` (high accuracy, recommended mode)
- Copies fast5s from `$FAST5DIR` to `/scratch/$USER/guppy.gpu.XXXXXX`
- Runs `guppy_basecaller` in either `fast` or `hac` mode
- Demultiplexes using `guppy_basecaller` and additionally trims adapter and barcode sequences (using the `--trim_barcodes` and `--barcode_kits "EXP-NBD103"` options)
- Compresses (gzip) the demultiplexed reads (`--compress_fastq` option)
- Copies demultiplexed, trimmed, compressed reads into subdirectories in `$OUTDIR/demux/barcodeXX`
- Logs STDOUT from the last time the script was run in `$OUTDIR/log/logfile-gpu-basecalling.txt` and all previous times in `$OUTDIR/log/logfile-gpu-basecalling_prev.txt`
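The argument handling described above could be sketched roughly as follows. This is not the actual script: the function names are hypothetical, the config file names (`dna_r9.4.1_450bps_fast.cfg`, `dna_r9.4.1_450bps_hac.cfg`) and the `-x auto` device flag are assumptions about the installed Guppy version, and the command is only printed, never run. The real script also works on a scratch copy of the fast5 files, which is left out here.

```shell
# Hypothetical helper: map $MODE to a Guppy config file name (assumed names).
guppy_config_for_mode() {
  case "$1" in
    fast) echo "dna_r9.4.1_450bps_fast.cfg" ;;
    hac)  echo "dna_r9.4.1_450bps_hac.cfg" ;;
    *)    echo "error: MODE must be 'fast' or 'hac', got '$1'" >&2; return 1 ;;
  esac
}

# Hypothetical helper: check OUTDIR/, FAST5DIR/ and MODE.
check_basecall_args() {
  local d
  if [ "$#" -ne 3 ]; then
    echo "usage: run_01_basecall-w-gpu.sh OUTDIR/ FAST5DIR/ MODE" >&2
    return 1
  fi
  # Both directories must end with a '/'.
  for d in "$1" "$2"; do
    case "$d" in
      */) ;;
      *) echo "error: directory '$d' must end with '/'" >&2; return 1 ;;
    esac
  done
  guppy_config_for_mode "$3" >/dev/null
}

# Print (do not run) a guppy_basecaller command for the given arguments.
print_guppy_command() {
  local config
  check_basecall_args "$@" || return 1
  config=$(guppy_config_for_mode "$3") || return 1
  printf 'guppy_basecaller -i %s -s %s -c %s -x auto --barcode_kits "EXP-NBD103" --trim_barcodes --compress_fastq\n' \
    "$2" "$1" "$config"
}
```

Running `print_guppy_command outdir/ fast5dir/ hac` shows the command that would be used for high accuracy mode; an unknown mode or a directory without a trailing `/` produces an error instead.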
cd ~/
# download the scripts
# TODO - CHANGE THIS TO DL A SPECIFIC RELEASE
git clone https://github.com/lskatz/nanoporeWorkflow.git
# Specified dirs MUST end with a '/'
Usage:
# high accuracy mode (highly recommend this mode over fast mode, it's worth waiting the extra runtime)
~/nanoporeWorkflow/workflows/run_01_basecall-w-gpu.sh outdir/ fast5dir/ hac
# fast mode
~/nanoporeWorkflow/workflows/run_01_basecall-w-gpu.sh outdir/ fast5dir/ fast
# OUTPUT
$OUTDIR
├── demux
│   ├── barcode06
│   │   └── barcode06.fastq.gz (there will be many .fastq.gz files per barcode)
│   ├── barcode10
│   │   └── barcode10.fastq.gz
│   ├── barcode12
│   │   └── barcode12.fastq.gz
│   ├── none.fastq.gz
│   └── sequencing_summary.txt
└── log
    ├── logfile-gpu-basecalling_prev.txt # only present if you ran the script more than once
    └── logfile-gpu-basecalling.txt
workflow-after-gpu-basecalling.sh - sample preparation, assembly, and polishing of the demultiplexed reads
- Takes in 1 argument:
  - `$outdir` - The output directory from running `run_01_basecall-w-gpu.sh`, containing `demux/barcodeXX/` directories
- Prepares a barcoded sample - concatenates all fastq files into one, compresses, and counts read lengths
- Runs `filtlong` via singularity to remove reads <1000bp and downsample reads to 600 Mb (roughly 120X for a 5 Mb genome)
- Assembles downsampled/filtered reads using `Flye` via singularity (`--plasmids` and `-g 5M` options used)
- Polishes flye draft assembly using racon 4 times
- Polishes racon polished assembly using Medaka via singularity (specific to r9.4.1 flowcell and high accuracy basecaller, `--m r941_min_high` option used)
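The downsampling figure above is simple arithmetic: 600 Mb of reads divided by a 5 Mb genome gives 120X. The sketch below makes that explicit and prints a matching `filtlong` command. The function names are hypothetical; `--min_length` and `--target_bases` are filtlong options, but the singularity invocation used by the workflow is omitted and the command is only printed.

```shell
# Hypothetical helpers for the coverage arithmetic (integer math, in bases).
coverage_for() {
  # $1 = total bases kept, $2 = genome size in bases
  echo $(( $1 / $2 ))
}

target_bases_for() {
  # $1 = desired coverage, $2 = genome size in bases
  echo $(( $1 * $2 ))
}

# Print (do not run) a filtlong command that keeps reads >= 1000 bp
# and downsamples to the requested number of bases.
print_filtlong_command() {
  # $1 = input fastq, $2 = output fastq.gz, $3 = target bases
  printf 'filtlong --min_length 1000 --target_bases %s %s | gzip > %s\n' \
    "$3" "$1" "$2"
}
```

For a genome of a different size, `target_bases_for 120 2000000` would give the number of bases needed for the same 120X depth on a 2 Mb genome.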
- Must have previously run the above script that basecalls reads on a GPU via node98.
- Not necessary to be on node98. Any server with the ability to `qsub` will work.
- `outdir` argument must be the same directory as the `OUTDIR` from the gpu-basecalling script
  - Recommend `cd`'ing to one directory above and use `.` as the `outdir` argument (see USAGE below)
Usage:
# example - if you are one directory above the output directory from the gpu-basecalling script
~/nanoporeWorkflow/workflows/workflow-after-gpu-basecalling.sh outdir/
# OUTPUT - only showing one barcode for brevity, not all files included
$OUTDIR
demux/
├── barcode07
│   ├── all.fastq.gz
│   ├── flye
│   ├── medaka
│   ├── racon
│   ├── readlengths.txt.gz
│   └── reads.minlen1000.600Mb.fastq.gz
└── log
log/
├── assemble-13f6870a-e7ab-4475-8acc-6762e57e5d55.log # one of each of these logs for each barcode
├── polish-medaka-3d22f12c-8a50-4dd7-9cc6-7c1bc5098b48.log
├── polish-racon-6c22aa55-5a95-4d17-9a01-abeade24b431.log
└── prepSample-157680be-7f14-4a32-8a74-4bfe5de0b624.log
- It will check for the following files, to determine if it should skip any of the steps. Helps if one part doesn't run correctly and you don't want to repeat a certain step, e.g. re-assembling.
  - `03_prepSample-w-gpu.sh` looks for `./demux/barcodeXX/reads.minlen1000.600Mb.fastq.gz`
  - `np_assemble_flye.sh` looks for `./demux/barcodeXX/flye/assembly.fasta`
  - `np_consensus_racon.sh` looks for `./demux/barcodeXX/racon/ctg.consensus.iteration4.fasta`
  - `np_polish_medaka.sh` looks for `./demux/barcodeXX/medaka/polished.fasta`
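The skip logic described above amounts to "run a step only if its expected output file is missing". A minimal sketch of that pattern follows; the function name `run_unless_done` is hypothetical, and whether the real scripts test for mere existence or for a non-empty file is an assumption (this sketch tests existence only).

```shell
# Hypothetical helper: run a command only if its expected output is missing.
# $1 = expected output file, remaining arguments = command to run.
run_unless_done() {
  local expected=$1
  shift
  if [ -e "$expected" ]; then
    echo "skipping: found $expected"
    return 0
  fi
  # Output not found, so run the step and return its exit status.
  "$@"
}
```

A call such as `run_unless_done ./demux/barcode07/flye/assembly.fasta some_assembly_command` would then skip re-assembling a barcode whose assembly already exists, where `some_assembly_command` stands in for whatever the workflow actually submits.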
If you are interested in contributing to nanoporeWorkflow, please take a look at the contribution guidelines. We welcome issues or pull requests!
- add flags/options for other sequencing kits, barcoding kits, flowcells (direct RNAseq?)
- rapid barcoding kit RBK-004 is next
- Add option for Medaka polishing with r941_min_fast model, if reads were basecalled with the fast Guppy model
- https://github.com/fenderglass/Flye
- https://github.com/isovic/racon
- https://github.com/nanoporetech/medaka
- https://github.com/nanoporetech/qcat
- How to set Guppy parameters (requires Nanopore Community login credentials) https://community.nanoporetech.com/protocols/Guppy-protocol/v/gpb_2003_v1_revl_14dec2018/how-to-configure-guppy-parameters