SPNtypeID

SPNtypeID is a Nextflow pipeline used for genome assembly and serotyping of Streptococcus pneumoniae.

License: GPL-3.0

Table of Contents:

Using the workflow
Input
Parameters
Workflow outline
Read trimming and quality assessment
Genome assembly
Genome assembly quality assessment
Genome coverage quality assessment
Genome length assessment
Contamination detection
Serotyping
Output files
Results file explanation
Credits
Contributions and Support
Citations

Using the workflow

The pipeline is designed to start from raw, paired-end Illumina reads. Start the pipeline using:

nextflow run SPNtypeID/main.nf --input [path-to-samplesheet] --outdir [path-to-outdir] --ntc_regex [what-expression-to-look-for-in-ntcs] --runname [what-to-call-run] -profile [docker,singularity,aws]

or from GitHub using:

nextflow run wslh-bio/SPNtypeID -r [version] --input [path-to-samplesheet] --outdir [path-to-outdir] --ntc_regex [what-expression-to-look-for-in-ntcs] --runname [what-to-call-run] -profile [docker,singularity,aws]

You can also test the pipeline with example data using -profile test or -profile test_full:

nextflow run SPNtypeID/main.nf --outdir [path-to-outdir] -profile test[_full],[docker,singularity]

Input

SPNtypeID's inputs are paired Illumina FASTQ files for each sample and a comma-separated sample sheet that lists, for each sample, the sample name, the path to the forward reads file, and the path to the reverse reads file. A sample sheet can be created using the fastq_dir_to_samplesheet.py script or by hand. An example of the sample sheet's format is shown in the table below and can also be found here.

sample fastq_1 fastq_2
sample_name /path/to/sample_name_R1.fastq.gz /path/to/sample_name_R2.fastq.gz
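For readers who want to build the sample sheet programmatically, the layout above can be sketched in Python. This is a minimal illustration and not the pipeline's own fastq_dir_to_samplesheet.py script: the function name `build_samplesheet` is hypothetical, and it assumes that read files are named with `_R1` and `_R2` tokens and end in `.fastq.gz`.

```python
import csv
from pathlib import Path


def build_samplesheet(fastq_dir, out_csv):
    """Pair *_R1/*_R2 FASTQ files and write a sample,fastq_1,fastq_2 sheet."""
    fastq_dir = Path(fastq_dir)
    rows = []
    for r1 in sorted(fastq_dir.glob("*_R1*.fastq.gz")):
        # Assumption: the reverse file has the same name with _R1 replaced by _R2.
        r2 = r1.with_name(r1.name.replace("_R1", "_R2", 1))
        if not r2.exists():
            continue  # skip forward files that have no reverse mate
        # Assumption: the sample name is everything before the _R1 token.
        sample = r1.name.split("_R1", 1)[0]
        rows.append(
            {
                "sample": sample,
                "fastq_1": str(r1.resolve()),
                "fastq_2": str(r2.resolve()),
            }
        )
    with open(out_csv, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["sample", "fastq_1", "fastq_2"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Calling `build_samplesheet("reads/", "samplesheet.csv")` would write one row per complete read pair; the resulting file can then be passed to `--input`.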

Parameters

SPNtypeID's main parameters and their defaults are shown in the table below:

Parameter Parameter description and default
contaminants Path to fasta of contaminants for removal, defaults to BBDuk's adapters fasta
maxcontigs Sets the maximum number of contigs allowed in an assembly (default: 300)
maxpctother Sets the maximum percentage of reads from other organisms (default: 1.0)
minavgreadq Sets the minimum average read quality score (default: 30)
mincoverage Sets the minimum coverage (default: 40)
minlength Minimum read length for trimming (default: 10)
minpctspn Sets the minimum percentage of reads that must be S. pneumoniae (default: 60.0)
minpctstrep Sets the minimum percentage of reads that must be Streptococcus (default: 80.0)
ntc_regex Regex pattern for identifying no template control (NTC) files. This is a mandatory parameter if a run has an NTC. (default: null)
qualitytrimscore Sets the BBDuk trimming quality score value (default: 10)
trimdirection Sets the BBDuk trimming direction (default: 'lr')
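To illustrate how a pattern passed to ntc_regex could separate no template controls from regular samples, here is a small Python sketch. The function name `split_ntc_samples`, the sample names, and the pattern `^NTC` are all hypothetical, and the pipeline may match against file names or use different matching semantics than `re.search`.

```python
import re


def split_ntc_samples(sample_names, ntc_regex):
    """Separate sample names into NTCs and regular samples using a regex."""
    if ntc_regex is None:
        # Mirrors the documented default (null): nothing is treated as an NTC.
        return [], list(sample_names)
    pattern = re.compile(ntc_regex)
    ntcs = [name for name in sample_names if pattern.search(name)]
    samples = [name for name in sample_names if not pattern.search(name)]
    return ntcs, samples


# Hypothetical sample names and pattern, for illustration only.
ntcs, samples = split_ntc_samples(["SPN001", "NTC_run1", "SPN002"], r"^NTC")
print(ntcs)     # ['NTC_run1']
print(samples)  # ['SPN001', 'SPN002']
```

Testing a pattern this way against the names in a sample sheet before a run is a quick check that it catches every NTC and nothing else.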

Workflow outline

Read trimming and quality assessment

Read repair, trimming, and cleaning are performed using BBTools v38.76 to repair FASTQ files with mismatched read numbers, trim low-quality bases from reads, and remove PhiX contamination. FastQC v0.11.8 is then used to assess the quality of the raw and cleaned reads. Bioawk v1.0 is used to calculate the mean and median quality of the cleaned reads.
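The mean and median quality calculation can be sketched in Python as follows. This is not the Bioawk command the pipeline runs; it assumes Phred+33 encoded, four-line FASTQ records, and it assumes the statistics are taken over per-read mean quality scores, which may differ from the pipeline's exact definition. The function name `read_quality_stats` is hypothetical.

```python
import gzip
import statistics


def read_quality_stats(fastq_path):
    """Return (mean, median) of per-read mean Phred scores in a FASTQ file."""
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    per_read_means = []
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 3:  # every fourth line holds the quality string
                quals = [ord(char) - 33 for char in line.rstrip("\n")]  # Phred+33
                if quals:
                    per_read_means.append(sum(quals) / len(quals))
    # statistics.mean raises StatisticsError if the file contains no reads.
    return statistics.mean(per_read_means), statistics.median(per_read_means)
```

A result can be compared against the minavgreadq threshold (default: 30) described in the parameters table.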

Genome assembly

Assembly of the cleaned and trimmed reads is performed using Shovill v1.1.0.

Genome assembly quality assessment

Quality assessment of the assemblies is performed using QUAST v5.0.2.

Genome coverage quality assessment

Mean and median genome coverage are determined by mapping the cleaned reads back to their assembly using BWA v0.7.17-r1188 and calculating depth using Samtools v1.10.
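As a rough illustration of the depth summary, the Python sketch below computes mean and median coverage from a per-position depth table. It assumes the three-column, tab-separated layout that `samtools depth` writes (reference name, position, depth); the function name `coverage_stats` is hypothetical and the pipeline's own summary code may differ.

```python
import statistics


def coverage_stats(depth_tsv):
    """Return (mean, median) depth from a three-column depth table."""
    depths = []
    with open(depth_tsv) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip an optional header line
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            depths.append(int(fields[2]))  # columns: reference, position, depth
    return statistics.mean(depths), statistics.median(depths)
```

Note that `samtools depth` omits zero-coverage positions unless it is run with `-a`, so a mean computed from its default output can overestimate coverage for assemblies with uncovered regions.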

Genome length assessment

Genome length is assessed by comparing the observed genome length to the expected S. pneumoniae genome length and calculating a Z score. The expected-length statistics (which can be found here) were obtained from the PHoeNIx pipeline, which calculated them from 9266 publicly available S. pneumoniae genomes.
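The comparison can be sketched in Python using the standard Z score definition, z = (observed - expected) / standard deviation. The function name `genome_length_stats` is hypothetical, and the expected length and standard deviation in the example are illustrative placeholders, not the PHoeNIx statistics.

```python
def genome_length_stats(observed_length, expected_length, expected_stdev):
    """Return (ratio, z_score) for an observed assembly length."""
    ratio = observed_length / expected_length
    z_score = (observed_length - expected_length) / expected_stdev
    return ratio, z_score


# Illustrative placeholder values only; these are NOT the PHoeNIx statistics.
ratio, z_score = genome_length_stats(2_150_000, 2_100_000, 50_000)
print(round(ratio, 3), z_score)  # 1.024 1.0
```

A ratio near 1 and a Z score near 0 indicate an assembly close to the expected size; large positive or negative Z scores can point to contamination or an incomplete assembly.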

Contamination detection

Contamination is detected by classifying reads using Kraken v1.0.0.
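As an illustration of how genus-level and species-level percentages can be read from the classification results, the Python sketch below parses a Kraken report. It assumes the standard six-column, tab-separated Kraken report layout (percentage, clade reads, direct reads, rank code, taxonomy ID, indented name); whether the pipeline's *.kraken.txt files use exactly this layout is an assumption, and the function name `clade_percentages` is hypothetical.

```python
def clade_percentages(report_path, names=("Streptococcus", "Streptococcus pneumoniae")):
    """Return the clade percentage reported for each requested taxon name."""
    wanted = set(names)
    found = {name: 0.0 for name in names}  # taxa absent from the report stay at 0.0
    with open(report_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 6:
                continue  # skip blank or malformed lines
            name = fields[5].strip()  # names are indented by taxonomic depth
            if name in wanted:
                found[name] = float(fields[0])
    return found
```

The two values returned correspond to the quantities that the minpctstrep (default: 80.0) and minpctspn (default: 60.0) parameters are described as thresholding.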

Serotyping

Serotyping is performed using SeroBA v2.0.4.

Output files

Example of pipeline output:

outdir
├── assembly_stats_summary
│   └── assembly_stats_results_summary.tsv
├── bbduk
│   ├── *.adapter.stats.txt
│   ├── *.bbduk.log
│   ├── *_repaired_1.fastq.gz
│   ├── *_repaired_2.fastq.gz
│   ├── *.repair.log
│   ├── *_singletons.fastq.gz
│   ├── *_trimmed_1.fastq.gz
│   ├── *_trimmed_2.fastq.gz
│   └── *.trim.txt
├── bbduk_summary
│   └── bbduk_results.tsv
├── bioawk
│   └── *.qual.tsv
├── calculate_assembly_stats
│   └── *_Assembly_ratio_20240124.tsv
├── coverage_stats
│   └── coverage_stats.tsv
├── fastqc
│   ├── *_fastqc.html
│   └── *_fastqc.zip
├── fastqc_summary
│   └── fastqc_summary.tsv
├── kraken_ntc
│   └── *.kraken.txt ***
├── kraken_sample
│   └── *.kraken.txt
├── kraken_summary
│   └── kraken_results.tsv
├── multiqc
│   ├── multiqc_data
│   ├── multiqc_plots
│   └── multiqc_report.html
├── percent_strep_summary
│   └── percent_strep_results.tsv
├── pipeline_info
├── quality_stats
│   └── quality_stats.tsv
├── quast
│   ├── *.quast.report.tsv
│   └── *.transposed.quast.report.tsv
├── quast_summary
│   └── quast_results.tsv
├── rejected_samples
│   └── Empty_samples.csv ***
├── report_*_ntc
│   └── *_spntypeid_report.csv
├── samtools
│   ├── *.bam
│   ├── *.depth.tsv
│   └── *.stats.txt
├── seroba
│   ├── seroba.log
│   └── *.pred.csv
├── seroba_summary
│   └── seroba_results.tsv
└── shovill
    ├── *.contigs.fa
    ├── *.sam
    └── *_shovill_output
        ├── contigs.gfa
        ├── shovill.corrections
        ├── shovill.log
        └── spades.fasta

*** = Optional output

Notable result files:
<runname>_spntypeid_report.csv - Summary table of each step in SPNtypeID
multiqc_report.html - HTML report generated by MultiQC
Empty_samples.csv - Lists any samples that were empty and removed from the pipeline. If no samples were empty, this file will be absent from the output directory.

Results file explanation

Output header Purpose
Sample Unique sample identifier
Run Which run the sample is on, set by the --runname parameter
Total Reads Number of reads identified in the sample
Reads Removed Number of reads removed from the total reads
Median Read Quality The median value of Phred scores
Average Read Quality The mean value of Phred scores
Contigs (#) Number of contigs in the assembly
N50 Length of the shortest contig among the longest contigs that together cover at least 50% of the total assembly length
Assembly Length (bp) Total size of the assembly
Ratio of Actual:Expected Genome Length Ratio of the actual genome length to the expected S. pneumoniae genome length
z-score Number of standard deviations by which the isolate's genome length differs from the mean S. pneumoniae genome length
Median Coverage Median number of times each base was sequenced
Average Coverage Mean number of times each base was sequenced
Percent Strep Percentage of reads from Streptococcus
Percent SPN Percentage of reads specifically from S. pneumoniae
SecondGenus Other genus present, if detected
Percent SecondGenus Percentage of other genus, if detected
Serotype Serotype determined for each sample
Kraken Database Version Version of the Kraken database utilized
All NTC reads List of all reads detected in all no template controls, if present. If '999999' in column, no NTC was provided
All NTC SPN reads List of all S. pneumoniae reads in all no template controls, if present. If '999999' in column, no NTC was provided
Max NTC read Highest number of reads found in any no template control. If '999999' in column, no NTC was provided
Max NTC SPN read Highest number of S. pneumoniae reads found in any no template control. If '999999' in column, no NTC was provided
SPNtypeID Version Version of the SPNtypeID pipeline used for analysis
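Because the N50 column is often misread, a short Python sketch of the calculation may help. The pipeline reports the value computed by QUAST; the function below, with the hypothetical name `n50` and made-up contig lengths, only illustrates the definition.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running sum to half the total is the N50.
        if running >= half_total:
            return length
    return 0


# Made-up contig lengths: total is 280, so half is 140.
print(n50([100, 80, 50, 30, 20]))  # 80
```

Here the two longest contigs (100 and 80) together reach 180 of 280 bases, so the N50 is the length of the second one.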

Credits

SPNtypeID was written by Dr. Kelsey Florek, Dr. Abigail C. Shockey, and Eva Gunawan, MS.

We thank the bioinformatics group at the Wisconsin State Laboratory of Hygiene for all of their contributions.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

If you use SPNtypeID for your analysis, please cite it using the following:

K. Florek, E. Gunawan, & A.C. Shockey (2025). SPNtypeID (Version 1.10.0) [https://github.com/wslh-bio/SPNtypeID/tree/main].

An extensive list of references for the tools used by SPNtypeID can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
