PathSeq is a pipeline for identifying microbial content (bacteria or viruses) from metagenomic sequencing reads.
Specifically, PathSeq:
- takes paired-end shotgun sequencing reads,
- filters out reads that are likely to be contamination (human, primate, mammalian),
- assembles the remaining reads into contigs,
- filters out contigs that are likely to be contamination,
- taxonomically classifies and quantifies the remaining contigs, and
- searches the remaining unclassified contigs for RNA viral motifs (to find possible novel viruses).
Currently, PathSeq is built to run on our HPC "Skyline", a Linux cluster (Rocky Linux v9.5) that uses the Slurm workload manager (version 23.02.6) and an environment module system.
Plans are in development to containerize the rules so that the pipeline is HPC-agnostic; however, it is currently untested on other systems.
PathSeq can run either on a single computer, or it can take advantage of a cluster configuration.
To run on a non-Skyline cluster in cluster mode, you will likely need to change the "module load" statements in the shell commands of the Snakefile rules, or otherwise ensure the correct dependencies are available to each rule. You will also need to define a "profile" for your cluster. A profile maps the concept of launching a rule as a cluster job to concrete commands that can be understood by your system.
The profile settings can be given to the snakemake command directly as flags, but it is recommended to persist the settings in a profile configuration file (which is passed to the snakemake command as the argument of the --profile flag). More detail is given below in "What the cluster profile means".
To run PathSeq on a single (non-Skyline) computer, no profile configuration file is necessary, and you must remove the --profile flag from the "runSnakemake.sh" script when running. You will still need to change the "module load" statements, however.
To run on Skyline, you should only need to change the input parameters; more information is below in "Input and output formatting" and "Running PathSeq on your own data".
PathSeq takes its metagenomic sequencing data as paired-end reads in fastq/fasta or fastq.gz format. For each sample, the paired-end reads must be given as one fasta file of forward reads and one fasta file of backward reads. There are no requirements for what constitutes a "sample". Samples are analyzed separately, so anything you wish to analyze separately should be given as separate (forward and backward) fasta files.
PathSeq must also be given a configuration file to run. This file specifies all arguments, including the metagenomic sequencing reads you wish to analyze; the one exception is the cluster profile, which must be given in the run script (see "Running PathSeq on your own data" and "What the cluster profile means" below).
More details about arguments required are below:
The config file requires the following parameters to be filled in:
- "inputlist"
The full path to a text file containing a list of the file names of all the forward-read input fasta files of each sample you wish to analyze. Each file name should be on its own line. PathSeq treats each fasta file pair given as its own separate sample, and will run the full pipeline analysis separately for each sample. In other words, the output would be the same as if you ran the pipeline separately for each input file (i.e. you can differentiate which sample any identified microbes originate from), but giving them to PathSeq together allows PathSeq to run them in parallel if enough cluster nodes are given (the number of cluster nodes to use is specified in the profile; more detail in "What the cluster profile means").
- "inputdir"
The full path to a directory containing all the forward-read fasta files listed in "inputlist", as well as all their backward-read counterparts. The directory can contain other files, including the "inputlist" file, but "inputlist" does not have to be kept inside "inputdir".
- "origin"
Put "RNA" or "DNA" depending on whether your input sequencing reads are from RNA or DNA.
- "minContigLen"
The minimum contig length for contig assembly (can stay the same as your read lengths, or be longer).
- "codepath"
The full path to the rule scripts, which in this repository are found inside "DouekLabPathSeq/SlurmBaseCode/". These scripts are run as part of each snakemake rule and correspond to each step of the pipeline.
- "outputpath"
The full path to the directory where you would like to house the outputs of the scripts. Within this directory, PathSeq will create separate sub-directories for the output of each sample you give it. More detail below in "Output format".
- Reference databases
The full paths to certain reference database locations that are required by some of the bioinformatics software used. More detail in "Other arguments needed".
- External programs
The full paths to certain executable files that need to be given to certain steps of the pipeline. More detail in "Other arguments needed".
- Resource allocation
If running in cluster mode, then for each rule you must specify the maximum time allowed for the rule ("runtime", specified in minutes) and the maximum memory allowed for the rule ("mem_mb", in megabytes). This is so the cluster can properly launch batch jobs for each rule. In "DouekLabPathSeq/configTemplate/config.yaml", resource allocations have been given which we found generally worked well for a microbiome dataset with input fasta files generally under 2-3 GB. You may need to adjust these resource allocations based on your dataset (larger input files, higher coverage, and higher microbial diversity may all increase the amount of resources needed).
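To show how these parameters fit together, here is a minimal sketch of a config file. The paths and values are placeholders, and the key names for the reference databases, external programs, and per-rule resource allocations are illustrative only; use the exact names and structure found in "DouekLabPathSeq/configTemplate/config.yaml".

    # Hypothetical config.yaml sketch -- paths and values are placeholders.
    inputlist: /data/my_project/input_list.txt    # one forward-read file name per line
    inputdir: /data/my_project/reads/             # holds forward- and backward-read files
    origin: "RNA"                                 # "RNA" or "DNA"
    minContigLen: 150                             # at least as long as your read length
    codepath: /home/me/DouekLabPathSeq/SlurmBaseCode/
    outputpath: /data/my_project/pathseq_output/
    # Reference database and external program paths go here; the real key names are
    # defined in DouekLabPathSeq/configTemplate/config.yaml (e.g. the bowtie2 human and
    # primate indexes, STAR index, BLAST mammal database, kaiju nodes/fmi/taxonomy, and
    # the Picard, Prodigal, Kaiju, and in-house executables).
    # Per-rule resource allocations (the structure shown here is illustrative only):
    # Trinity:
    #   runtime: 720      # minutes
    #   mem_mb: 64000     # megabytes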
The config file's "outputpath" specifies the full path to the directory where you want the pipeline's output to be written.
- For each separate sample given, a separate subdirectory within the output directory will be created.
- Within each subdirectory, a directory for each pipeline step (roughly corresponding to each snakefile rule) is created. An examination of the "output" section of each rule in the Snakefile will show what Snakemake expects to get back from each rule. An overview is given in "Details on each rule and its dependencies".
- The final output for each sample is contained in
"[outputpath]/[sample name]/RNA_merge_TaxAndQuant/".
This directory contains two file types for each taxonomic level (species, genus, class, family, phylum, order, kingdom).
Both file types have a first column showing the taxonomic classification found at the given taxonomic level (i.e. in the species files, a species name is listed if PathSeq believes it found an instance of that species; in the kingdom files, a kingdom name is listed if PathSeq believes it found an instance of that kingdom).
The "tpm" type files have a second column showing the TPM (transcripts per million, i.e. normalized count) value that PathSeq believes corresponds to the taxonomic classification in the first column.
The "pseudocounts" type files have a second column showing the pseudocount (raw count plus a small, non-zero value to "smooth" data) corresponding to the taxonomic classification in the first column.
Some software must be given to rules in the Snakefile as paths to executables, rather than as "module load" statements or as scripts in the "SlurmBaseCode" folder.
There is no technical reason for this; it is mostly for legacy reasons. These executables are provided in "DouekLabPathSeq/PathseqExternalPrograms/". The path to each one must be supplied in the config.yaml file.
In-house executables:
- PathSeqRemoveHostForKaiju
- PathSeqKaijuConcensusSplitter2
- PathseqmergeQIIME2TaxAndSalmon
- PathseqSplitOutputTableByTaxonomy
Other executables:
- Picard (2.18.14)
- Prodigal (2.6.3)
- Kaiju (1.9.0)
Additionally, the following reference databases are required:
- ERCC92 spike-in controls for bowtie
- human genome reference dataset index for bowtie
- human genome reference dataset for star
- primate genome reference dataset for bowtie
- mammal genome database for blast
- nodes and fmi for kaiju
- reference taxonomy for kaiju
- After installing Pathseq
- In the folder "Tests/TestInput/", there is a set of left and right paired-end fastq files, containing about 125 reads.
- There is a folder, "Tests/Skyline_Full_Test_00/". This folder contains:
  a. A "config.yaml" file. This file already contains all input arguments required to run PathSeq on the "Tests/TestInput/" input data.
  b. A "runSnakemake.sh" bash script. This script uses the "config.yaml" file in the "Skyline_Full_Test_00" directory and runs PathSeq on it by running snakemake with "SlurmBaseCode/Snakefile" (the central snakefile).
- Run the following command from within the "Tests/Skyline_Full_Test_00/" directory to start the pipeline on your current compute node:
./runSnakemake.sh
Or, run it as a batch job on your cluster however you normally submit bash scripts as batch jobs (you may need to change the resource allocation comments at the top to match your system, and the profile contents).
sbatch runSnakemake.sh
- Once the code has finished running, check that the outputs in "Tests/Skyline_Full_Test_00/Output/" match the contents of "Tests/Skyline_Full_Test_00/ExpectedOutput/".
There are four files which may be customized in order to run on your own data.
- A configuration file specifying all arguments to the run.
There is a template at "DouekLabPathSeq/configTemplate/config.yaml".
You must make sure all paths are valid full paths in your system. Full paths allow you to house (potentially very large) input data, output data, and reference databases outside your code area. ALL runs require a configuration file.
In the configuration file, you can also set resource usage for each rule's batch job: "runtime" is the allowed job time in minutes, and "mem_mb" is the allowed job memory usage in megabytes.
- A run script.
There is a template at "DouekLabPathSeq/configTemplate/runSnakemake.sh".
This script takes a path to the Snakefile ("DouekLabPathSeq/SlurmBaseCode/Snakefile") and an optional path to the profile ("DouekLabPathSeq/skyline_profile"). Depending on where you have stored the DouekLabPathSeq repo, you may need to customize these paths; they can be relative or full paths. The profile flag and path must be included if you wish to run in cluster mode, and must be removed to run without cluster mode.
- A profile file (or files), if you are running in cluster mode. An example profile specific to the Skyline HPC is located at "DouekLabPathSeq/skyline_profile/".
This defines what commands on your system are used to launch a batch job (i.e. how your system should translate the contents of the "resources" section of each rule in the Snakefile into a command that can be understood by your cluster's workload management software). This file is only required if you want to run the pipeline in cluster mode.
More details are above in the System requirements section.
- A snakemake unlock script. A template is in "DouekLabPathSeq/configTemplate/unlock.sh".
This script should invoke snakemake on the same Snakefile as the command in "runSnakemake.sh" does (typically with snakemake's --unlock flag). It is only needed if your snakemake run is interrupted partway through (e.g. a timeout); in that case it "unlocks" the working directory (snakemake places a lock on the working directory at the start of a run, and this lock may become stale if the process is interrupted).
Rocky Linux v9.5
slurm 23.02.6
snakemake 7.22.0
python 3.11
-
BowtieUnmasked
Purpose:
- To decontaminate input by retaining only input reads which do not align to the human genome.
Dependencies:
- openjdk (java) v17.0.11
- bowtie2 v2.5.1
- Samtools v1.21
- picard v2.18.14
Output:
- files in "[outputpath]/[sample name]/Generated_Data_1st_Bowtie_Alignment_Unmasked_Genome/". This directory contains ERCC spike-in controls output (alignment counts and rates, just to check if bowtie2 is working as intended).
- files in "[outputpath]/[sample name]/Generated_Data_2nd_Bowtie_Alignment_Unmasked_Genome/". This directory contains:
- A file showing alignment rates
- A .sam file showing alignment results
- Two fq files (one of forward reads, one of backward reads) of reads that remain unaligned. The unaligned files form the input into the next rule.
-
StarAfterBowtie
Purpose:
- To decontaminate input by retaining only input reads which do not align to the human genome.
Dependencies:
- Star v2.7.10
- Samtools v1.21
Output:
- files in "[outputpath]/[sample name]/Generated_Data_Star_Alignment/". This directory contains a file showing alignment statistics, a file showing an alignment summary (which inputs aligned), two bam files showing the inputs that ended up aligning, and two fq files (one of forward reads, one of backward reads) containing those input reads which have still not aligned to anything. These unaligned files form the input into the next rule.
-
BowtiePrimate
Purpose:
- To decontaminate input by retaining only input reads which do not align to primate genomes.
Dependencies:
- Bowtie2 v2.5.1
- Samtools v1.21
Output:
- Files in "[outputpath]/[sample name]/primate_alignment_rates/". This directory contains:
- A file showing the alignment rates
- A sam file showing the alignment results
- Two fq files (one of forward reads, one of backward reads) containing those input reads which have still not aligned to anything. These unaligned files form the input into the next rule.
-
Trinity
Purpose:
- To de novo assemble individual reads into larger contigs.
Dependencies:
- Trinity 2.15.1
- Fastp v0.23.4
- Fastx-toolkit v0.0.14
Output:
- files in "[outputpath]/[sample name]/RNA_trinity_output/". This directory contains a .Trinity.fasta file showing the contigs that Trinity was able to form from the input reads. This file forms the input into the next rule
-
FilterHostBlast
Purpose:
- To decontaminate the contigs by removing any contig which constitutes a blast match with mammalian genomes.
Dependencies:
- Blast-plus v2.16.0
- Openjdk (java) v17.0.11
Output:
- files in "[outputpath]/[sample name]/RNA_trinity_filtered/". This directory contains:
- A file showing the blast results for all input contigs
- A fasta file of all contigs given that did not constitute a blast match with mammalian genomes. This fasta file forms the input into the next rule.
-
kaiju
Purpose:
- To identify genes using prodigal.
- To taxonomically classify contigs based on matches of identified genes to a protein sequence database.
Dependencies:
- Fastx-toolkit v0.0.14
- prodigal v2.6.3
- kaiju v1.9.0
Output:
- files in "[outputpath]/[sample_name]/RNA_kaiju_output/". This directory contains:
- A .tab file showing, for each input contig, whether it was classified or unclassified ("C" or "U" in the first column) and what it was classified as.
- A sorted (formatted) .fa file of the nucleotide sequences of all contigs classified as non-host
- A sorted (formatted) .faa file of the protein sequences of the same non-host contigs.
- These three .tab, .fa, and .faa files form the input into the next rule.
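For illustration only (the contig names below are invented, and the exact columns depend on the kaiju options used in this pipeline), kaiju's classification output generally lists a C/U status, the query name, and the assigned NCBI taxon ID (0 for unclassified), with taxon names optionally appended:

    C	contig_1	562
    U	contig_2	0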
-
BuildSalmon
Purpose:
- To build a salmon index for each taxonomic level, and facilitate the following salmon quantification step.
Dependencies:
- openjdk (java) v 17.0.11
- salmon v1.10.2
Output:
- files in "[outputpath]/[sample name]/RNA_salmon_quant/salmon/". This directory contains;
- A [tax level]_sequences.fa file and a [tax level]_table.csv file for each taxonomic level.
- The [tax level]_table.csv files contain tables showing the contig ID of the contig classified in one column, and the taxonomic assignment in the second column. These .csv taxonomic classification summaries are used as an input two steps later, in the "mergeTaxAndQuant" rule.
- This directory also contains a "[tax leve]_salmon/" subdirectory for each taxonomic level. These subdirectories contain indexing information used in the next step "salmon".
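As an illustration only (the contig IDs and taxa are invented, and the exact delimiter follows whatever the pipeline writes), a species-level "[tax level]_table.csv" pairs each classified contig with its assignment:

    TRINITY_DN21_c0_g1_i1,Escherichia coli
    TRINITY_DN35_c0_g2_i1,Staphylococcus aureus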
-
SalmonQuant
Purpose:
- To quantify the taxonomic classifications given by kaiju (i.e. to assign counts, either tpm or pseudocounts, to each microbe found in the sample).
Dependencies:
- salmon v1.10.2
Output:
- Files in "[outputpath]/[sample name]/RNA_salmon_quant/[tax level]quant[sample name]/". One such directory exists for each taxonomic level.
- Each directory contains a .sf quantification summary file for that taxonomic level, containing the quantifications of the classifications for entries of the original input files. The first column shows the contig ID of the quantified contig; subsequent columns show the contig length, effective length, abundance in TPM, and the number of reads assigned to that contig. This .sf file is used in the following step, "MergeTaxAndQuant".
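For reference, salmon's quant.sf output is tab-separated with a header row of Name, Length, EffectiveLength, TPM, and NumReads; a made-up example row is shown below (the contig ID and numbers are invented):

    Name	Length	EffectiveLength	TPM	NumReads
    TRINITY_DN21_c0_g1_i1	1240	1104.3	1523.7	212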
-
MergeTaxAndQuant
Purpose:
- To combine and format information about both the taxonomic classification of inputs and the quantification of those taxonomic classifications.
Dependencies:
- openjdk (java) v17.0.11
Output:
- files in "[outputpath]/[sample name]/RNA_merge_TaxAndQuant/". Directory contains two file types for each taxonomic level (species, genus, class, family, phylum, order, kingdom). Both the types of files have a first column showing the taxonomic classification found fitting the given taxonomic level (i.e. in the species files, the species name is listed if PathSeq believes it found an instance of a given species. in the kingdom files, the kingdom name is listed if PathSeq believes it found an instance of a given kingdom).
- The "tpm" type files have a second column showing the tpm (transcripts per million, i.e. normalized count) count that Pathseq believes corresponds to the taxonomic classification specified in the first column.
- The "pseudocounts" type files have a second column showing the pseudocount (raw count plus a small, non-zero value to "smooth" data) corresponding to the taxonomic classification in the first column.
More information on snakemake 7 cluster execution and profiles: https://snakemake.readthedocs.io/en/v7.22.0/executing/cluster.html
A profile maps the concept of launching a rule as a cluster job to concrete commands that can be understood by your system. It can take resource usage parameters such as those specified in our "configTemplate/config.yaml", and convert them into the correct commands for requesting jobs according to those resource usage limits.
We provide a working profile configuration file in "skyline_profile/config.yaml" (specific to our HPC).
- The core file is "skyline_profile/config.yaml".
This file contains snakemake command parameters in a yaml format (such that the name of each entry corresponds to a snakemake flag, and the value is the argument). Here you can specify running options such as the number of times you'd like to retry a failed rule, frequency of status checks, the command for canceling a job on your cluster (scancel, in the case of Slurm), and the maximum number of jobs you would like to allow snakemake to spawn.
The most important option here is the "cluster" field. You can simply give this field the name of the command for spawning jobs on your cluster (sbatch, in the case of slurm). However, in our skyline_profile example, we have given it the name of a file in the same directory, "bw_submit.py". This python script allows for more complex job launching commands. In our case, it can parse the "runtime" and "mem_mb" fields in the "resources" field of each snakemake rule, and convert these to runtime and memory limits of the requested job launch. It can also handle many other resource usage parameters that we do not currently use in our test. More detail on this file is below.
- The bw_submit.py file is given as the argument to the "cluster" flag of snakemake.
This python script contains a main function. Snakemake handles giving this main function the resource usage parameters specified in our "configTemplate/config.yaml". The main function calls the function "make_sbatch_cmd(props)", which parses the input parameters, and ultimately puts them together into a list of strings, where each string is a correctly formatted combination of a flag and argument. This will be used by snakemake as the command for launching a job for each rule.
In our skyline_profile/bw_submit.py, you can see that the following job launching parameters can be given from the resources section of the snakemake config file:
- tasks/ntasks (number of separate processes/threads to be run in parallel, either on one node or across many nodes)
- nodes (number of nodes requested from cluster)
- mem_mb (amount of memory requested from cluster)
- runtime (time limit requested from cluster)
- gpu (whether or not to ask cluster for gpu resources)
- slurm_partition (which partition to request, e.g. quick, priority, etc.)
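As a rough guide, here is a minimal sketch of what such a profile config.yaml can look like. The option names are standard snakemake 7 flags, but the specific values shown (and the exact contents of "skyline_profile/config.yaml") are illustrative, not authoritative.

    # Hypothetical profile config.yaml sketch (values are placeholders).
    cluster: "bw_submit.py"              # or simply "sbatch" for a plain Slurm setup
    cluster-cancel: "scancel"            # command used to cancel jobs
    jobs: 50                             # maximum number of jobs snakemake may have in flight
    restart-times: 2                     # how many times to retry a failed rule
    max-status-checks-per-second: 1      # how often snakemake polls job status
    latency-wait: 60                     # seconds to wait for output files on shared filesystems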