A high-performance bioinformatics pipeline for extracting and annotating Small Subunit (SSU) rRNA sequences from genomic assemblies using covariance models.
Nextflow Pipeline: Complete conversion from Snakemake to Nextflow for improved scalability and cloud compatibility. This is now the main branch (v1.0.0).
SSUextract identifies SSU rRNA sequences in genomic data through the following workflow:
- Search: Identifies SSU rRNA regions using Infernal's cmsearch with curated covariance models
- Extract: Extracts high-quality SSU sequences based on configurable length thresholds
- Annotate: Assigns taxonomic classification via BLAST against reference databases
- Report: Generates comprehensive summary tables with sequence metadata and taxonomy
Install the pixi package manager:
curl -fsSL https://pixi.sh/install.sh | sh# Clone the repository
git clone https://github.com/NeLLi-team/ssuextract.git
cd ssuextract
# Install dependencies and download the default reference database (SILVA/PR2)
pixi run setup# Run on example data
pixi run run
# Results will be in: results/example/SSUextract supports two reference databases for taxonomic annotation:
A combined database containing:
- SILVA 138.1: Curated bacterial and archaeal 16S/18S rRNA sequences
- PR2 4.12: Protist Ribosomal Reference database for eukaryotic sequences
This database is automatically downloaded during setup.
Taxonomy format: Sequences are annotated with semicolon-delimited taxonomy strings:
>REF_SILVA_AB001445.1.1538;Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;...
The EukCensus 2025 database contains SSU sequences derived from IMG/JGI metagenome assemblies, providing broader coverage of environmental microbial diversity.
Contents:
eukcensus_2025_16S.fna: Bacterial/Archaeal 16S sequences (~1.3M sequences)eukcensus_2025_18S.fna: Eukaryotic 18S sequences (~400K sequences)- Cluster information and metadata files
Taxonomy formats:
- 16S sequences: SILVA-style semicolon-delimited taxonomy
- 18S sequences: Underscore-delimited taxonomy paths
>REF_SPR_AB041247.1.1790_Eukaryota_Amorphea_Obazoa_Opisthokonta_Nucletmycea_Fungi_...
-
Obtain the EukCensus files and place them in the
eukcensus/directory:eukcensus/ ├── eukcensus_2025_16S.fna ├── eukcensus_2025_18S.fna ├── eukcensus_2025_16S_clusters.tsv ├── eukcensus_2025_18S_clusters.tsv └── eukcensus_2025_img_metadata.tsv -
Run the setup command:
pixi run setup-eukcensus
-
Run the pipeline with the EukCensus database:
nextflow run main.nf --querydir data/my_samples --database eukcensus
The cluster files provide pre-computed sequence clustering information:
| Column | Description |
|---|---|
| cluster_id | Unique cluster identifier |
| centroid | Representative sequence ID |
| size | Number of sequences in cluster |
| tax_type | Prokaryote or Eukaryote |
| superg | Supergroup classification |
| phylum | Phylum-level taxonomy |
| class | Class-level taxonomy |
| order | Order-level taxonomy |
| family | Family-level taxonomy |
| genus | Genus-level taxonomy |
The eukcensus_2025_img_metadata.tsv file contains sample metadata from IMG/JGI including:
- Sequencing information (platform, coverage)
- Geographic coordinates
- Environmental parameters (pH, salinity)
- Ecosystem classification
- Assembly statistics
graph TD
A[Input FASTA files] --> B[Header validation]
B --> C[cmsearch<br/>RF00177 & RF01960]
C --> D[Extract coordinates]
D --> E[Extract sequences]
E --> F[BLAST annotation]
F --> G[Merge results]
G --> H[Summary table]
style A fill:#e1f5fe
style H fill:#c8e6c9
style C fill:#fff3e0
style F fill:#fff3e0
ssuextract/
├── main.nf # Nextflow pipeline definition
├── nextflow.config # Nextflow configuration
├── pixi.toml # Dependencies and tasks
├── config/ # Configuration files
│ └── default.yaml # Pipeline configuration
├── scripts/ # Pipeline scripts
│ ├── cmprocessing.py # BLAST result processing
│ ├── get_cmsequences.py# Sequence extraction
│ ├── get_cmstats.py # Alignment statistics
│ ├── get_table.py # Summary table generation
│ └── rename_fnaheaders.py # Header validation
├── resources/ # Static resources
│ ├── models/ # Covariance models
│ │ ├── RF00177.cm # Bacterial/Archaeal SSU
│ │ └── RF01960.cm # Eukaryotic SSU
│ └── database/ # Reference databases
│ ├── silva-138-1_pr2-4-12.* # Default SILVA/PR2
│ └── eukcensus/ # Optional EukCensus database
├── data/ # Input data
│ └── example/ # Example test data
└── results/ # Output directory
└── {dataset}/ # Dataset-specific results
# Run with custom parameters
nextflow run main.nf \
--querydir data/my_dataset \
--threads_per_job 4 \
--min_extract_length 1000 \
--database silva-pr2| Parameter | Default | Description |
|---|---|---|
--querydir |
data/example |
Path to input FASTA files |
--modeldir |
resources/models |
Path to covariance models |
--outdir |
results/{dataset} |
Output directory |
--database |
silva-pr2 |
Database: silva-pr2 or eukcensus |
--database_path |
resources/database |
Path to database files |
--min_extract_length |
500 |
Minimum sequence length (bp) |
--threads_per_job |
2 |
CPU threads per process |
Note on sequence length: The default minimum length is 500 bp. For better taxonomic resolution, we recommend using
--min_extract_length 1000or higher. Longer sequences provide more phylogenetic signal and more accurate taxonomic assignments.
# Place your .fna files in data/your_dataset/
mkdir data/your_dataset
cp /path/to/*.fna data/your_dataset/
# Run pipeline on custom data
nextflow run main.nf --querydir data/your_dataset --threads_per_job 4A comprehensive table containing all extracted SSU sequences:
| Column | Description |
|---|---|
| name | Sequence identifier |
| sample | Source sample name |
| model | CM model used (RF00177/RF01960) |
| length | Sequence length (bp) |
| coordinates | Genomic coordinates |
| strand | DNA strand (+/-) |
| sequence_type | Hit type (simple/complex) |
| contig_name | Source contig |
| blast_sseqid | Best BLAST hit with taxonomy |
| blast_pident | Percent identity (0-100) |
| blast_length | Alignment length (bp) |
| blast_bitscore | BLAST bit score |
| is_assembled | Assembly status |
Counts of SSU types per sample:
| Category | Description |
|---|---|
| BacteriaSSU | Bacterial 16S rRNA |
| ArchaeaSSU | Archaeal 16S rRNA |
| EukaryotaSSU | Eukaryotic 18S rRNA |
| MitochondriaSSU | Mitochondrial rRNA |
| PlastidSSU | Plastid/Chloroplast rRNA |
| CyanobacteriaSSU | Cyanobacterial 16S rRNA |
| RickettsialesSSU | Rickettsiales 16S rRNA |
| PatescibacteriaSSU | Patescibacteria 16S rRNA |
# Setup and installation
pixi run setup # Install deps + download SILVA/PR2 database
pixi run setup-eukcensus # Setup EukCensus database
pixi run setup-all # Setup both databases
# Pipeline execution
pixi run run # Run full pipeline (default config)
pixi run dryrun # Preview what will be executed
# Custom execution
nextflow run main.nf --querydir data/my_dataset
nextflow run main.nf --querydir data/my_dataset --database eukcensus
# Database management
pixi run download-db # Download SILVA/PR2 database
pixi run download-eukcensus # Setup EukCensus database
# Cleanup
pixi run clean # Clean pipeline work directories
pixi run clean-results # Remove all results directories
# Development
pixi run -e dev run-verbose # Verbose output with traces
pixi run -e dev report # Generate HTML report
pixi run -e dev dag # Generate pipeline DAG visualization- Validates FASTA headers using
scripts/rename_fnaheaders.py - Ensures compatible formatting for downstream processing
- Uses Infernal's cmsearch with
--anytruncflag for truncated sequences - Searches for bacterial/archaeal SSU (RF00177) and eukaryotic SSU (RF01960)
- Extracts alignment coordinates using
scripts/get_cmstats.py - Handles truncated alignments at sequence boundaries
- Filters by minimum length (default: 500 bp; recommend 1000+ bp for better resolution)
- Extracts sequences based on coordinates using
scripts/get_cmsequences.py - Handles reverse complement for minus strand hits
- BLAST search against selected reference database
- Processes results with
scripts/cmprocessing.py - Maps hits to taxonomic categories
- Merges all results with
scripts/get_table.py - Creates summary statistics per sample
- Generates final TSV reports
- Verify the database was downloaded correctly
- Check minimum length setting (sequences shorter than threshold are excluded)
- Ensure extracted sequences meet length requirements
pixi run clean
# or
nextflow clean -f- Reduce
threads_per_jobparameter - Use profile configurations:
nextflow run main.nf -profile local - Adjust
max_memoryinnextflow.config
- Ensure input files have
.fna,.fa, or.fastaextensions - Verify
querydirpath exists and contains sequence files - Check file permissions
# For SILVA/PR2
pixi run download-db
# For EukCensus
pixi run setup-eukcensus- Major Release: Complete migration from Snakemake to Nextflow
- Enhanced workflow structure with improved modularity
- Updated covariance models (RF00177.cm and RF01960.cm)
- Comprehensive documentation updates
- Legacy Snakemake version preserved in
ssuextract-snkbranch
- Added support for EukCensus 2025 database
- New
--databaseparameter for database selection - Setup commands for multiple database configurations
- v0.9.0 - Last Snakemake version (in
ssuextract-snkbranch) - View all releases
If you use SSUextract in your research, please cite:
SSUextract: A Nextflow pipeline for SSU rRNA extraction
[Publication details pending]
Contributions are welcome. Please:
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Infernal for covariance model searches
- SILVA and PR2 databases
- IMG/JGI for metagenome data (EukCensus)
- Nextflow workflow engine
- pixi package manager
Developed by the NeLLi Team