SPNtypeID

SPNtypeID is a Nextflow pipeline used for genome assembly and serotyping of Streptococcus pneumoniae.

Using the workflow

The pipeline is designed to start from raw, paired-end Illumina reads. Start the pipeline using:

nextflow run SPNtypeID/main.nf --input [path-to-samplesheet] --outdir [path-to-outdir] --ntc_regex [what-expression-to-look-for-in-ntcs] --runname [what-to-call-run] -profile [docker,singularity,aws]

or from GitHub using:

nextflow run wslh-bio/SPNtypeID -r [version] --input [path-to-samplesheet] --outdir [path-to-outdir] --ntc_regex [what-expression-to-look-for-in-ntcs] --runname [what-to-call-run] -profile [docker,singularity,aws]

You can also test the pipeline with example data using -profile test or -profile test_full:

nextflow run SPNtypeID/main.nf --outdir [path-to-outdir] -profile test[_full],[docker,singularity]

Input

SPNtypeID's inputs are paired-end Illumina FASTQ files for each sample and a comma-separated sample sheet that lists, for each sample, the sample name, the path to the forward reads file, and the path to the reverse reads file. A sample sheet can be created using the fastq_dir_to_samplesheet.py script or by hand. An example of the sample sheet's format can be seen in the table below and found here.

sample	fastq_1	fastq_2
sample_name	/path/to/sample_name_R1.fastq.gz	/path/to/sample_name_R2.fastq.gz
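For readers who want to build the sample sheet themselves, the layout above can be produced with a few lines of Python. This is a minimal sketch, not the repository's fastq_dir_to_samplesheet.py script. It assumes files are named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz as in the example row, and the function names pair_fastqs and write_samplesheet are hypothetical.

```python
import csv
import re
from pathlib import Path

# Assumed naming convention: <sample>_R1.fastq.gz / <sample>_R2.fastq.gz,
# matching the example row above. Other layouts need a different pattern.
READ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.fastq\.gz$")


def pair_fastqs(fastq_dir):
    """Return {sample: {"1": path, "2": path}} for files matching READ_PATTERN."""
    pairs = {}
    for path in sorted(Path(fastq_dir).iterdir()):
        match = READ_PATTERN.match(path.name)
        if match:
            reads = pairs.setdefault(match["sample"], {})
            reads[match["read"]] = str(path.resolve())
    return pairs


def write_samplesheet(fastq_dir, out_csv):
    """Write a sample,fastq_1,fastq_2 CSV, skipping samples that lack a mate."""
    pairs = pair_fastqs(fastq_dir)
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample", "fastq_1", "fastq_2"])
        for sample, reads in sorted(pairs.items()):
            if "1" in reads and "2" in reads:
                writer.writerow([sample, reads["1"], reads["2"]])
```

Calling write_samplesheet("/path/to/fastqs", "samplesheet.csv") would then give a file that can be passed to --input. Samples with only one read file are left out rather than guessed at.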

Parameters

SPNtypeID's main parameters and their defaults are shown in the table below:

Parameter	Parameter description and default
contaminants	Path to fasta of contaminants for removal, defaults to BBDuk's adapters fasta
maxcontigs	Sets the maximum number of contigs allowed in an assembly (default: 300)
maxpctother	Sets the maximum percentage of reads from other organisms (default: 1.0)
minavgreadq	Sets the minimum average read quality score (default: 30)
mincoverage	Sets the minimum coverage (default: 40)
minlength	Minimum read length for trimming (default: 10)
minpctspn	Sets the minimum percentage of reads that must be S. pneumoniae (default: 60.0)
minpctstrep	Sets the minimum percentage of reads that must be Streptococcus (default: 80.0)
ntc_regex	Regex pattern for identifying no template control (NTC) files. This is a mandatory parameter if a run has an NTC. (default: null)
qualitytrimscore	Sets the BBDuk trimming quality score value (default: 10)
trimdirection	Sets the BBDuk trimming direction (default: 'lr')
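To make the role of the threshold parameters concrete, the sketch below compares one sample's metrics against the defaults from the table. It illustrates how such limits can be applied and is not the pipeline's own code. The metric key names, the inclusive comparisons, and the choice of which coverage value to test are all assumptions made for this example.

```python
# Default thresholds copied from the parameter table above.
DEFAULTS = {
    "maxcontigs": 300,
    "maxpctother": 1.0,
    "minavgreadq": 30,
    "mincoverage": 40,
    "minpctspn": 60.0,
    "minpctstrep": 80.0,
}


def failed_checks(metrics, params=None):
    """Return the names of the thresholds that a sample does not meet.

    `metrics` uses hypothetical keys chosen for this sketch: contigs,
    pct_other, avg_read_quality, coverage, pct_spn, pct_strep.
    Comparisons are inclusive, which is an assumption.
    """
    params = {**DEFAULTS, **(params or {})}
    checks = {
        "maxcontigs": metrics["contigs"] <= params["maxcontigs"],
        "maxpctother": metrics["pct_other"] <= params["maxpctother"],
        "minavgreadq": metrics["avg_read_quality"] >= params["minavgreadq"],
        "mincoverage": metrics["coverage"] >= params["mincoverage"],
        "minpctspn": metrics["pct_spn"] >= params["minpctspn"],
        "minpctstrep": metrics["pct_strep"] >= params["minpctstrep"],
    }
    return [name for name, passed in checks.items() if not passed]
```

Passing a dictionary such as {"mincoverage": 30} as params overrides a single default, mirroring what supplying --mincoverage 30 on the command line is meant to do.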

Workflow outline

Read trimming and quality assessment

Read repair, trimming, and cleaning are performed using BBTools v38.76 to repair FASTQ files with mismatched read numbers, trim low-quality bases from reads, and remove PhiX contamination. FastQC v0.11.8 is then used to assess the quality of the raw and cleaned reads. Bioawk v1.0 is used to calculate the mean and median quality of the cleaned reads.
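As an illustration of what a mean and median read quality can look like in practice, the following sketch computes them in plain Python. It is not the Bioawk command the pipeline runs. It assumes four-line FASTQ records with Phred+33 encoding, and it takes the mean Phred score of each read before summarizing across reads, which is one plausible definition among several. The function names are hypothetical.

```python
import gzip
from statistics import mean, median


def read_mean_qualities(fastq_path):
    """Yield the mean Phred score of each read (Phred+33 encoding assumed)."""
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # In a four-line FASTQ record the fourth line is the quality string.
            if line_number % 4 == 3:
                quals = [ord(char) - 33 for char in line.rstrip("\r\n")]
                if quals:
                    yield mean(quals)


def summarize_quality(fastq_path):
    """Return (mean, median) of the per-read mean Phred scores."""
    per_read = list(read_mean_qualities(fastq_path))
    if not per_read:
        raise ValueError("no reads found in " + str(fastq_path))
    return mean(per_read), median(per_read)
```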

Genome assembly

Assembly of the cleaned and trimmed reads is performed using Shovill v1.1.0.

Genome assembly quality assessment

Quality assessment of the assemblies is performed using QUAST v5.0.2.

Genome coverage quality assessment

Mean and median genome coverage are determined by mapping the cleaned reads back to their assembly using BWA v0.7.17-r1188 and calculating depth using Samtools v1.10.
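The depth summary can be sketched as follows. The example assumes the three-column, tab-separated output of samtools depth for a single BAM file (contig, position, depth); whether the pipeline summarizes its *.depth.tsv files in exactly this way is an assumption, and coverage_stats is a hypothetical name.

```python
from statistics import mean, median


def coverage_stats(depth_lines):
    """Return (mean, median) depth from `samtools depth` style lines.

    Each line is expected to hold contig, position, and depth separated by
    tabs. Positions absent from the input (for example zero-depth positions
    when samtools depth is run without -a) are not counted.
    """
    depths = [
        int(line.rstrip("\r\n").split("\t")[2])
        for line in depth_lines
        if line.strip()
    ]
    if not depths:
        return 0.0, 0.0
    return mean(depths), median(depths)
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example coverage_stats(open("sample.depth.tsv")).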

Genome length assessment

Genome length is assessed by comparing the observed genome length to the expected S. pneumoniae genome length and calculating a Z score. The expected-length statistics (which can be found here) were obtained from the PHoeNIx pipeline, which calculated them from 9266 publicly available S. pneumoniae genomes.
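The comparison amounts to a ratio and a standard Z score, sketched below. The function name is hypothetical, and the numbers in the usage note are placeholders chosen for easy arithmetic, not the PHoeNIx reference statistics.

```python
def genome_length_stats(observed_bp, expected_bp, stdev_bp):
    """Return (ratio, z_score) for an assembly length.

    ratio   = observed length / expected (mean) length
    z_score = (observed length - expected length) / standard deviation
    """
    if expected_bp <= 0 or stdev_bp <= 0:
        raise ValueError("expected length and standard deviation must be positive")
    ratio = observed_bp / expected_bp
    z_score = (observed_bp - expected_bp) / stdev_bp
    return ratio, z_score
```

With placeholder inputs of an observed length of 2,200,000 bp, an expected length of 2,100,000 bp, and a standard deviation of 50,000 bp, the assembly is 100,000 bp longer than expected, which is two standard deviations above the mean.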

Contamination detection

Contamination is detected by classifying reads using Kraken v1.0.0.

Serotyping

Serotyping is performed using SeroBA v2.0.4.

Output files

Example of pipeline output:

outdir
├── assembly_stats_summary
│   └── assembly_stats_results_summary.tsv
├── bbduk
│   ├── *.adapter.stats.txt
│   ├── *.bbduk.log
│   ├── *_repaired_1.fastq.gz
│   ├── *_repaired_2.fastq.gz
│   ├── *.repair.log
│   ├── *_singletons.fastq.gz
│   ├── *_trimmed_1.fastq.gz
│   ├── *_trimmed_2.fastq.gz
│   └── *.trim.txt
├── bbduk_summary
│   └── bbduk_results.tsv
├── bioawk
│   └── *.qual.tsv
├── calculate_assembly_stats
│   └── *_Assembly_ratio_20240124.tsv
├── coverage_stats
│   └── coverage_stats.tsv
├── fastqc
│   ├── *_fastqc.html
│   └── *_fastqc.zip
├── fastqc_summary
│   └── fastqc_summary.tsv
├── kraken_ntc
│   └── *.kraken.txt ***
├── kraken_sample
│   └── *.kraken.txt
├── kraken_summary
│   └── kraken_results.tsv
├── multiqc
│   ├── multiqc_data
│   ├── multiqc_plots
│   └── multiqc_report.html
├── percent_strep_summary
│   └── percent_strep_results.tsv
├── pipeline_info
├── quality_stats
│   └── quality_stats.tsv
├── quast
│   ├── *.quast.report.tsv
│   └── *.transposed.quast.report.tsv
├── quast_summary
│   └── quast_results.tsv
├── rejected_samples
│   └── Empty_samples.csv ***
├── report_*_ntc
│   └── *_spntypeid_report.csv
├── samtools
│   ├── *.bam
│   ├── *.depth.tsv
│   └── *.stats.txt
├── seroba
│   ├── seroba.log
│   └── *.pred.csv
├── seroba_summary
│   └── seroba_results.tsv
└── shovill
    ├── *.contigs.fa
    ├── *.sam
    └── *_shovill_output
        ├── contigs.gfa
        ├── shovill.corrections
        ├── shovill.log
        └── spades.fasta

*** = Optional output

Notable result files:
<runname>_spntypeid_report.csv - Summary table of each step in SPNtypeID
multiqc_report.html - HTML report generated by MultiQC
Empty_samples.csv - Lists any samples that were empty and were removed from the pipeline. If no samples were empty, this file will be absent from the output directory.

Results file explanation

Output header	Purpose
Sample	Unique sample identifier
Run	Which run the sample is on, dictated by the `--runname` param
Total Reads	Number of reads identified in the sample
Reads Removed	Number of reads removed from the total reads
Median Read Quality	The median value of Phred scores
Average Read Quality	The mean value of Phred scores
Contigs (#)	Number of contigs in the sample's assembly
N50	Length of the shortest contig in the smallest set of longest contigs that together contain at least 50% of the total assembly bases
Assembly Length (bp)	Total size of the assembly
Ratio of Actual:Expected Genome Length	Ratio of the actual genome length to the expected S. pneumoniae genome length
z-score	Number of standard deviations by which the isolate's genome length differs from the mean S. pneumoniae genome length
Median Coverage	Median number of times each base was sequenced
Average Coverage	Mean number of times each base was sequenced
Percent Strep	Percentage of reads from Streptococcus
Percent SPN	Percentage of reads specifically from S. pneumoniae
SecondGenus	Other genus present, if detected
Percent SecondGenus	Percentage of other genus, if detected
Serotype	Serotype determined for each sample
Kraken Database Version	Version of the Kraken database utilized
All NTC reads	List of all reads detected in all no template controls, if present. A value of '999999' means no NTC was provided
All NTC SPN reads	List of all S. pneumoniae reads in all no template controls, if present. A value of '999999' means no NTC was provided
Max NTC read	Highest number of reads found across all no template controls. A value of '999999' means no NTC was provided
Max NTC SPN read	Highest number of S. pneumoniae reads found across all no template controls. A value of '999999' means no NTC was provided
SPNtypeID Version	Version of the SPNtypeID pipeline used for analysis
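The N50 column is the least intuitive entry in the table, so a small worked sketch may help. The pipeline reports the value calculated by QUAST; the function below is only an illustration of the standard definition, and its name is hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the
    assembly size.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input
```

For contigs of 10, 8, 5, 3, and 2 bases the assembly totals 28 bases. The two longest contigs together hold 18 bases, which is more than half, so the N50 is 8.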

Credits

SPNtypeID was written by Dr. Kelsey Florek, Dr. Abigail C. Shockey, and Eva Gunawan, MS.

We thank the bioinformatics group at the Wisconsin State Laboratory of Hygiene for all of their contributions.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

If you use SPNtypeID for your analysis, please cite it using the following:

K. Florek, E. Gunawan, & A.C. Shockey (2025). SPNtypeID (Version 1.10.0) [https://github.com/wslh-bio/SPNtypeID/tree/main].

An extensive list of references for the tools used by SPNtypeID can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
