filtering and identifying non-host DNA pipeline for 03-713 Bioinformatics Practicum at Carnegie Mellon University, Spring 2021 by Jeremy Fisher, Simon Levine, Sid Reed and Tomas Matteson
fainhD does the following
- Filter RNA-seq data to remove host sequences
- Assemble unknown sequences into contigs
- BLAST virus sequences against known viruses
- Predict functional ORFs in viral sequences
- Search for structural elements in virus sequences
It takes an Illumina paired-read sequencing files and report these into a json-lines file with the following columns:
query_name: arbitrary name assigned by the contig assembly softwarecontiq_seq: contiguous DNA sequence itselfrfam_e_value: expect value for a query into the rfam structure databaseblast_results: paired loci and expect values from BLAST, with keyssseqidfor the loci andevaluefor the expect value
This report, once completed, is deposited in report/{sample}.json, where {sample} is the name of library.
Useful intermediary files are available in the data/processed directory, including:
{sample}_blast_results.csv: comma-delimited table of BLAST alignments to determine the identity of viral sequences{sample}_orf.fasta: open reading frame predictions{sample}_structural_predictions.cmscanvisualized alignments against the rfam database, demonstrating RNA secondary structure (if any is found){sample}_structural_predictions.tbloutalignments against the rfam database, summarized
We organize the pipeline using Snakemake.
Please refer to Snakemake's installation instructions, as Snakemake is the only dependency. It will, in turn, handle the pipelines software and data dependencies (besides the fastq files themselves).
Note that you may need to run pip install ftputil if you plan to use remote files (as is the default).
Then, clone this repository and run as follows:
After cloning, cd into this repository. Change the input.yaml to refer to the proper dataset (as described below). There are two pipelines: one for single-ended sequencing (workflow/pe.smk) and one for pair-ended sequencing (workflow/se.smk). Depending on the sequencing reaction, invoke as follows (with, for example, twelve cores):
snakemake --use-conda -j12 -s workflow/pe.smk
# or...
snakemake --use-conda -j12 -s workflow/se.smkTo use an HPC, an example job script (for pair-ended datasets) is supplied at run.job:
sbatch run.jobOnly paired-end read are supported. If those data are available locally (e.g., if they were downloaded before-hand), put them in the data/raw/reads folder and specify the library name in the input.yaml file. For example, if the read files were "foo_R1.fastq" and "foo_R2.fastq.fastq" and specify the sample name as:
sample: fooWe facilitate using any paired-end RNAseq dataset from SRA. Simply specify the accession number as the sample:
sample: SRR8250770 This would run the pipeline on the HPV-positive experiment under the accession number SRR8250770
Snakemake is built to produce target files from source files. By specifying a sample, internally the pipeline requests the target file report/{sample}.json. This, in turn, produces the other dependencies in the pipeline like, for example, the contigous reads fasta file.
This makes it simple to analyze multiple libraries simultaneously. For example, to analyze the SRA experiments SRR8250770 and SRR8250765 (which are HPV-positive and HPV-negative, respectively):
snakemake --use-conda -j12 -s workflow/pe.smk reports/{SRR8250770,SRR8250765}.jsonIt shall be noted that this is simply a bash expansion. The shell actually runs the following
snakemake --use-conda -j12 workflow/pe.smk reports/SRR8250770.json reports/SRR8250765.json