This document provides a comprehensive specification of the computational pipelines established for the cumulative dissertation "Genomic Surveillance of Aerial and Aquatic Microbiomes by Nanopore Sequencing." It consolidates the bioinformatic workflows validated in the constituent studies Reska et al. (2024) and Perlas et al. (2025). All processing steps are presented as executable command-line instructions to ensure full reproducibility of these real-time genomic surveillance frameworks across diverse environmental monitoring contexts.
The following table details the specific software versions employed in each study.
| Tool | Publication I (Air) | Publication II (Wetlands) | Purpose |
|---|---|---|---|
| Basecalling | | | |
| MinKNOW | v23.04.3 / v23.04.5 | v24.11.10 | Data acquisition & device control |
| Guppy | v6.3.2 | — | High-accuracy basecalling (R10.4.1) |
| Dorado | v4.3.0 | v5.0.0 | Super-accuracy basecalling (R10.4.1) |
| Pre-processing | | | |
| Porechop | v0.2.3 | v0.2.4 | Adapter and barcode trimming |
| NanoFilt | v2.8.0 | v2.8.0 | Read quality and length filtering |
| SeqKit | v2.8.2 | v2.3.0 | Read sampling, sorting, and formatting |
| Taxonomy | | | |
| Kraken2 | v2.0.7 | v2.1.2 | Metagenomic taxonomic classification |
| MEGAN-CE | — | v6.21.1 | Lowest Common Ancestor (LCA) analysis |
| OBITools4 | — | v1.3.1 | Metabarcoding demultiplexing |
| VSEARCH | — | v2.21 | OTU clustering and chimera removal |
| Cutadapt | — | v4.2 | Primer trimming (amplicon data) |
| Assembly | | | |
| MetaFlye | v2.9.1 | v2.9.6 | Long-read de novo assembly |
| nanoMDBG | — | v1.1 | De Bruijn graph assembly (low biomass) |
| Minimap2 | v2.17 | v2.28 | Read mapping and polishing alignment |
| Racon | v1.5 | v1.5.0 | Assembly polishing (consensus) |
| Medaka | — | v1.7.2 | Assembly polishing (neural network) |
| Downstream | | | |
| MetaWRAP | v1.3 | — | Metagenomic binning wrapper |
| CheckM | v1.2.2 | — | MAG quality assessment |
| AMRFinderPlus | v3.12.8 | v4.0.23 | Antimicrobial resistance gene detection |
| ABRicate | v1.0.1 | — | Mass screening of contigs |
| DIAMOND | v2.1.11 | v2.1.13 | Protein alignment (virulence/viral) |
| Prokka | — | v1.14.5 | Prokaryotic genome annotation |
| Prodigal | v2.6.1 | v2.6.3 | Gene prediction |
| PlasmidFinder | — | v2.1.6 | Plasmid detection |
| MAFFT | — | v7.526 | Multiple sequence alignment |
| IQ-TREE2 | — | v2.3.4 | Phylogeny inference |
| SAMtools | v1.17 | v1.17 | Alignment file processing |
| BCFtools | — | v1.17 | Variant calling and consensus generation |
Overview. A pipeline optimized for ultra-low-biomass bioaerosol samples, addressing the challenges of high DNA fragmentation and low input yields through sensitive basecalling, rigorous assembly, and genome binning.
1.1 Controlled and Natural Environments (Guppy)
Model: High Accuracy (HAC) for R10.4.1 flow cells.
guppy_basecaller \
-i [input_raw_data_dir] \
-r \
-s [output_dir] \
--detect_barcodes \
-c dna_r10.4.1_e8.2_400bps_hac.cfg \
-x "cuda:0"

1.2 Urban Environment (Dorado)
Model: High Accuracy (HAC) dna_r10.4.1_e8.2_400bps_hac, used for validation.
dorado basecaller \
dna_r10.4.1_e8.2_400bps_hac@v4.3.0 \
[input_pod5_dir] \
-r \
--kit-name SQK-RBK114-24 \
--no-trim \
--emit-fastq > [basecalled.fastq]

1.3 Demultiplexing (Dorado)
dorado demux \
--output-dir [output_demux_dir] \
--kit-name SQK-RBK114-24 \
[basecalled.fastq]

2.1 Adapter Trimming (Porechop)
porechop \
-i [input_barcode.fastq] \
-o [output_trimmed.fastq] \
-t 10

2.2 Quality and Length Filtering (NanoFilt)
Thresholds: Minimum length 100 bp; minimum average Q-score 8.
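To make the two thresholds concrete, the following is a minimal sketch of a length and mean-quality filter in plain awk, run on a three-read toy FASTQ. It is not NanoFilt's implementation: the read names, the toy data, the `rep` helper, and the inclusive comparisons are assumptions made for illustration. The mean Q-score is derived from the average per-base error probability rather than from the raw Phred values.

```shell
#!/usr/bin/env bash
set -euo pipefail

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# rep CHAR N: print CHAR repeated N times (hypothetical helper for toy reads).
rep() { printf "%${2}s" "" | tr ' ' "$1"; }

# Toy FASTQ with made-up reads. Phred+33 encoding: 'I' is Q40, '#' is Q2.
{
    printf '@too_short\n%s\n+\n%s\n'   "$(rep A 50)"  "$(rep I 50)"
    printf '@low_quality\n%s\n+\n%s\n' "$(rep C 120)" "$(rep '#' 120)"
    printf '@passes\n%s\n+\n%s\n'      "$(rep G 120)" "$(rep I 120)"
} > "$tmp/toy.fastq"

# Keep reads with length >= min_len and mean Q >= min_q (assumed inclusive).
kept_ids=$(awk -v min_len=100 -v min_q=8 '
    BEGIN { for (i = 33; i <= 126; i++) phred[sprintf("%c", i)] = i - 33 }
    NR % 4 == 1 { id = substr($0, 2) }
    NR % 4 == 2 { len = length($0) }
    NR % 4 == 0 {
        err = 0
        for (i = 1; i <= length($0); i++)
            err += 10 ^ (-phred[substr($0, i, 1)] / 10)
        mean_q = -10 * log(err / length($0)) / log(10)
        if (len >= min_len && mean_q >= min_q) print id
    }
' "$tmp/toy.fastq")

echo "$kept_ids"
```

Only the read named `passes` survives: the first read is shorter than 100 bp and the second has a mean Q-score of about 2.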
cat [input_trimmed.fastq] | NanoFilt -l 100 -q 8 > [output_filtered.fastq]

2.3 Normalization (SeqKit)
Purpose: Down-sampling reads for comparable taxonomic assessments (e.g., 30,000 reads for urban samples).
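Because comparisons between samples depend on equal read depth, it can be worth confirming the read count of each down-sampled file. The sketch below counts records in a toy FASTQ and compares the result with a target; the file contents, the target of 3 (standing in for 30,000), and the `count_reads` helper are illustrative assumptions, and the count assumes four lines per record.

```shell
#!/usr/bin/env bash
set -euo pipefail

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

target=3   # stand-in for the real target, e.g. 30000

# Toy FASTQ with three made-up reads.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nACGT\n+\nIIII\n' > "$tmp/normalized.fastq"

# count_reads FILE: number of records, assuming 4 lines per FASTQ record.
count_reads() { awk 'END { print NR / 4 }' "$1"; }

n=$(count_reads "$tmp/normalized.fastq")
if [ "$n" -eq "$target" ]; then status="ok"; else status="mismatch"; fi
echo "reads=$n target=$target status=$status"
```

A mismatch would indicate that the sample needs to be redrawn or that the input held fewer reads than the target.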
seqkit sample -n 30000 -s 100 [output_filtered.fastq] > [output_normalized.fastq]

3.1 De Novo Assembly (MetaFlye)
Strategy: Long-read metagenomic assembly using Nano-HQ mode.
flye --meta \
--nano-hq [input_filtered.fastq] \
--threads [threads] \
-o [output_assembly_dir]

3.2 Read Mapping for Polishing (Minimap2)
minimap2 -ax map-ont \
-t [threads] \
[assembly.fasta] \
[input_filtered.fastq] > [alignment.sam]

3.3 Assembly Polishing (Racon)
Iterative consensus correction — repeat for 3 rounds.
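The three rounds can be scripted as a loop in which each round maps the reads against the previous round's output and polishes that output. The sketch below shows one possible arrangement, not the exact script used in the study: the file names, thread count, and the `dry_run` switch are assumptions, and with `dry_run=1` the commands are only printed, so the loop can be inspected without minimap2 or Racon installed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical file names; substitute the real filtered reads and Flye assembly.
reads="filtered.fastq"
draft="assembly.fasta"
threads=8
rounds=3
dry_run=1   # 1 = only print the commands; 0 = run minimap2 and racon

plan=""
for i in $(seq 1 "$rounds"); do
    sam="round_${i}.sam"
    out="racon_${i}.fasta"
    if [ "$dry_run" -eq 1 ]; then
        plan+="minimap2 -ax map-ont -t $threads $draft $reads > $sam"$'\n'
        plan+="racon -t $threads $reads $sam $draft > $out"$'\n'
    else
        minimap2 -ax map-ont -t "$threads" "$draft" "$reads" > "$sam"
        racon -t "$threads" "$reads" "$sam" "$draft" > "$out"
    fi
    draft="$out"   # the next round polishes this round's output
done

printf '%s' "$plan"
```

The key point is the reassignment of `draft` at the end of each iteration: remapping against the freshly polished contigs is what makes the rounds iterative rather than three repeats of the same correction. The single-round form of the command follows.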
racon -t [threads] \
[input_filtered.fastq] \
[alignment.sam] \
[assembly.fasta] > [polished_assembly.fasta]

4.1 Read-Level Classification (Kraken2)
Database: NCBI nt with memory mapping enabled.
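Once the classification command given next has written its report file, genus-level totals can be pulled out with standard text tools. The sketch below assumes the standard six-column, tab-separated Kraken2 report layout (percentage, clade reads, direct reads, rank code, taxid, indented name) and runs on a toy report whose counts are made up for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Toy report with made-up counts in the six-column Kraken2 report layout.
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    ' 10.00' 10 10 U 0    'unclassified' \
    ' 90.00' 90 0  R 1    'root' \
    ' 60.00' 60 20 G 1386 '          Bacillus' \
    ' 40.00' 40 40 S 1423 '            Bacillus subtilis' \
    ' 30.00' 30 30 G 286  '          Pseudomonas' > "$tmp/report.txt"

# Keep genus rows (rank code "G"), strip name indentation, sort by clade reads.
top_genera=$(awk -F'\t' '
    $4 == "G" { name = $6; sub(/^ +/, "", name); print $2 "\t" name }
' "$tmp/report.txt" | sort -k1,1nr)

echo "$top_genera"
```

Using the clade-level count (column 2) rather than the direct count (column 3) includes reads assigned to species within each genus.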
kraken2 \
--db [kraken_db_path] \
--use-names \
--report [report_read.txt] \
--output [output_read.txt] \
--memory-mapping \
--threads 28 \
[output_normalized.fastq]

4.2 Contig-Level Classification (Kraken2)
Applied to bins or assembled contigs.
kraken2 \
--db [kraken_db_path] \
--use-names \
--report [report_contig.txt] \
--output [output_contig.txt] \
--memory-mapping \
--threads 28 \
[polished_assembly.fasta]

5.1 Metagenomic Binning (MetaWRAP)
Integrates results from MetaBAT2, MaxBin2, and CONCOCT.
metawrap binning \
-o [output_dir] \
-t [threads] \
-a [assembly.fasta] \
--metabat2 --maxbin2 --concoct \
[clean_reads.fastq]

5.2 Quality Assessment (CheckM)
Thresholds: ≥30 % completeness, ≤10 % contamination.
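After CheckM has run, these thresholds can be applied to its results with a short filter. The sketch below assumes the results are available as a tab-separated table whose header contains columns named `Bin Id`, `Completeness`, and `Contamination` (for example from CheckM's tabular output option); the bins and values are made up, and columns are located by header name so that the filter does not depend on column order.

```shell
#!/usr/bin/env bash
set -euo pipefail

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Toy table with made-up bins; only the three columns used here are included.
{
    printf 'Bin Id\tCompleteness\tContamination\n'
    printf 'bin.1\t85.2\t1.3\n'
    printf 'bin.2\t22.0\t0.5\n'
    printf 'bin.3\t45.0\t14.8\n'
    printf 'bin.4\t30.0\t10.0\n'
} > "$tmp/checkm.tsv"

# Keep bins with completeness >= 30 and contamination <= 10.
passing=$(awk -F'\t' -v min_comp=30 -v max_cont=10 '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    $(col["Completeness"]) >= min_comp && $(col["Contamination"]) <= max_cont {
        print $(col["Bin Id"])
    }
' "$tmp/checkm.tsv")

echo "$passing"
```

Both bounds are inclusive, so `bin.4`, which sits exactly on both thresholds, is retained alongside `bin.1`.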
checkm lineage_wf \
-t [threads] \
-x fa \
[bin_directory] \
[output_directory]

6.1 Resistance and Virulence Gene Screening (ABRicate / AMRFinderPlus)
# ABRicate — mass screening against multiple databases
abricate --db [ncbi|card|vfdb] [input.fasta] > [output.tab]
# AMRFinderPlus — comprehensive AMR and stress-response detection
amrfinder -n [input.fasta] --plus --threads [threads] > [output.amr]

Overview. An integrated multi-omic framework for processing shotgun metagenomics, RNA viromics, and targeted amplicons (eDNA / AIV) from passive water samplers.
1.1 Basecalling (Dorado)
Model: Super Accuracy (SUP) v5.0.0.
dorado basecaller \
dna_r10.4.1_e8.2_400bps_sup@v5.0.0 \
[input_pod5_dir] \
-r \
--kit-name SQK-RBK114-24 \
--no-trim \
--emit-fastq > [basecalled.fastq]

1.2 Adapter Trimming (Porechop)
porechop \
-i [basecalled.fastq] \
-o [trimmed.fastq] \
--threads [threads]

1.3 Quality Filtering (NanoFilt)
Standard metagenomics / virome: minimum length 100 bp, minimum average Q-score 9.
cat [trimmed.fastq] | NanoFilt -l 100 -q 9 > [filtered_metagenomics.fastq]

Targeted AIV sequencing: relaxed quality threshold (minimum average Q-score 8) with a minimum length of 150 bp.
cat [trimmed_aiv.fastq] | NanoFilt -l 150 -q 8 > [filtered_aiv.fastq]

2.1 Metagenomic Profiling (Kraken2)
Database: NCBI nt_core.
kraken2 \
--db [nt_core_db] \
--threads [threads] \
--output [output.kraken] \
--report [report.txt] \
[filtered_metagenomics.fastq]

2.2 Normalization (SeqKit)
Down-sampling to 87,000 reads for comparative analysis.
seqkit sample -n 87000 -s 100 [filtered_metagenomics.fastq] > [normalized.fastq]

3.1 Workflow A: MetaFlye + Hybrid Polishing
Pipeline: MetaFlye → Racon (×3) → Medaka.
# 1. Assembly
flye --nano-hq [filtered_metagenomics.fastq] \
--out-dir [flye_out] \
--threads [threads] \
--meta
# 2. Racon polishing (rounds 1–3)
minimap2 -ax map-ont -t [threads] \
[flye_out/assembly.fasta] \
[filtered_metagenomics.fastq] > [aln.sam]
racon -t [threads] \
[filtered_metagenomics.fastq] \
[aln.sam] \
[flye_out/assembly.fasta] > [racon_1.fasta]
# Repeat mapping + Racon two more times (racon_2.fasta → racon_3.fasta)
# 3. Medaka polishing (final round)
medaka_consensus \
-i [filtered_metagenomics.fastq] \
-d [racon_3.fasta] \
-o [final_flye_dir] \
-m r1041_e82_400bps_sup_v5.0.0

3.2 Workflow B: nanoMDBG + Medaka Polishing
Pipeline: nanoMDBG → Medaka.
# 1. Assembly
nanoMDBG [filtered_metagenomics.fastq] [k-mer_size] [output_prefix]
# 2. Medaka polishing
medaka_consensus \
-i [filtered_metagenomics.fastq] \
-d [mdbg_contigs.fasta] \
-o [final_mdbg_dir] \
-m r1041_e82_400bps_sup_v5.0.0

4.1 Viral Assembly (nanoMDBG)
nanoMDBG was used for viral de novo assembly followed by Medaka polishing. See B.3.2 for commands.
4.2 Viral Taxonomy Assignment (DIAMOND BLASTx)
Database: NCBI non-redundant protein database (NR). Threshold: Contigs with > 80 % identity to kingdom Viruses (taxid: 10239).
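The identity threshold can be applied to the tabular output of the search command given next. The sketch below covers only that part of the rule: it keeps hits above 80 % identity and reports the best-scoring hit per contig, assuming the twelve-column layout requested by that command (`pident` in column 3, `bitscore` in column 12). The contig names, accessions, and scores are made up, and the restriction to Viruses (taxid 10239) would need taxonomy information that this toy table does not carry.

```shell
#!/usr/bin/env bash
set -euo pipefail

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Toy hits with made-up accessions, in the 12-column tabular layout.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    contig_1 protA 92.5 120 9  0 1 360 1 120 1e-60 210 \
    contig_1 protB 81.0 100 19 0 1 300 1 100 1e-40 150 \
    contig_2 protC 65.0 110 38 1 1 330 5 114 1e-20 90  \
    contig_3 protD 80.0 90  18 0 1 270 1 90  1e-30 120 > "$tmp/hits.tsv"

# Keep hits with identity > 80 %, then the highest-bitscore hit per contig.
best=$(awk -F'\t' -v min_id=80 '
    $3 > min_id && (!($1 in score) || $12 > score[$1]) {
        score[$1] = $12
        line[$1] = $0
    }
    END { for (q in line) print line[q] }
' "$tmp/hits.tsv" | sort -k1,1)

echo "$best" | cut -f1-3
```

With a strict comparison, `contig_3` at exactly 80.0 % is excluded, which matches the "> 80 %" wording of the threshold.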
diamond blastx \
-d [nr_db.dmnd] \
-q [viral_contigs.fasta] \
-o [viral_matches.tsv] \
-f 6 qseqid sseqid pident length mismatch gapopen \
qstart qend sstart send evalue bitscore \
--sensitive

5.1 Antimicrobial Resistance Detection (AMRFinderPlus)
Mode: --plus enabled for stress-response and virulence genes; nucleotide and protein analysis.
amrfinder -n [input.fasta] --plus --threads [threads] > [amr_report.tsv]

5.2 Virulence Factor Detection (DIAMOND)
Target: VFDB core proteins (e.g., ctxA/B).
diamond blastx \
-d [vfdb_core.dmnd] \
-q [input_reads.fasta] \
-o [matches.tsv] \
-f 6 qseqid sseqid pident length mismatch gapopen \
qstart qend sstart send evalue bitscore

5.3 General Functional Annotation (Prokka)
prokka --force --quiet \
--outdir [output_dir] \
--prefix [sample_id] \
--cpus [threads] \
[polished_assembly.fasta]

6.1 Vertebrate Metabarcoding (OBITools / VSEARCH)
Pipeline: Demultiplexing → primer trimming → OTU clustering.
# Demultiplexing
obimultiplex -t [tag_file] -u [unidentified.fastq] [input.fastq] > [demultiplexed.fastq]
# Primer trimming
cutadapt -g [F_primer] -a [R_primer] -o [trimmed.fastq] [demultiplexed.fastq]
# OTU clustering and chimera removal
vsearch --cluster_size [trimmed.fasta] --id 0.97 --centroids [otus.fasta]
vsearch --uchime_denovo [otus.fasta] --nonchimeras [otus_clean.fasta]

6.2 AIV Consensus Generation
Alignment to NCBI Influenza Virus Database segments.
# Align and sort
minimap2 -ax map-ont [reference_segment.fasta] [filtered_aiv.fastq] \
| samtools sort > [aligned.bam]
# Variant calling and consensus
bcftools mpileup -f [reference_segment.fasta] [aligned.bam] \
| bcftools call -c \
| vcfutils.pl vcf2fq > [consensus.fastq]

7.1 FASTQ → FASTA Conversion (Seqtk)
seqtk seq -a [input.fastq] > [raw.fasta]

7.2 Read-Level Alignment (Minimap2)
Mapping to the NCBI-NT MMI index with high stringency.
minimap2 -ax map-ont \
-k 19 -w 10 -I 10G -g 5000 -r 2000 -N 100 \
--lj-min-ratio 0.5 -A 2 -B 5 -O 5,56 -E 4,1 -z 400,50 \
--sam-hit-only \
-t [threads] \
--split-prefix [temp_idx] \
[minimap2_db_mmi] \
[sorted.fasta] > [aligned.sam]

7.3 SAM → RMA Conversion (MEGAN6)
Lowest Common Ancestor (LCA) assignment. Taxonomic calls accepted only if > 50 % of near-best alignments match the same genus.
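The acceptance rule can be illustrated independently of MEGAN. The sketch below is not MEGAN's long-read LCA algorithm; it only demonstrates the majority criterion on a toy table with one row per near-best alignment (read ID and the genus of the matched reference). The table layout, read names, and the `unassigned` label are assumptions made for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Toy input: one row per near-best alignment (read ID, genus of the reference).
printf '%s\t%s\n' \
    read_1 Vibrio \
    read_1 Vibrio \
    read_1 Aeromonas \
    read_2 Vibrio \
    read_2 Aeromonas \
    read_3 Escherichia > "$tmp/alignments.tsv"

# Accept a genus for a read only if it holds > 50 % of that read's alignments.
calls=$(awk -F'\t' '
    { total[$1]++; n[$1 FS $2]++ }
    END {
        for (key in n) {
            split(key, part, FS)
            if (n[key] / total[part[1]] > 0.5) call[part[1]] = part[2]
        }
        for (read in total)
            print read "\t" ((read in call) ? call[read] : "unassigned")
    }
' "$tmp/alignments.tsv" | sort -k1,1)

echo "$calls"
```

Here `read_1` is called as Vibrio (two of three alignments), `read_3` as Escherichia, and `read_2` stays unassigned because a 50/50 split does not exceed the threshold.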
sam2rma \
-i [aligned.sam] \
-r [sorted.fasta] \
-o [filtered.rma] \
-lg -alg longReads \
-t [threads] \
-mdb [megan_db_nucl] \
-ram readCount \
--minSupportPercent 0.01

7.4 Taxonomic Information Extraction (rma2info)
# Read-to-taxon mapping
rma2info -i [filtered.rma] -o [taxonomy.r2c.txt] -r2c Taxonomy -n
# Class-to-count summary
rma2info -i [filtered.rma] -o [taxonomy.c2c.txt] -c2c Taxonomy -n -r

8.1 Multiple Sequence Alignment (MAFFT)
Aligning the consensus H4 HA sequence with GISAID reference sequences.
mafft --auto --thread [threads] [combined_sequences.fasta] > [alignment.aln]

8.2 Phylogeny Inference (IQ-TREE2)
Maximum-likelihood tree with ultrafast bootstrap support.
iqtree2 -s [alignment.aln] -m MFP -bb 1000 -alrt 1000 -nt [threads]

9.1 Plasmid Detection (PlasmidFinder)
python3 plasmidfinder.py \
-i [input_assembly.fasta] \
-o [output_dir] \
-p [database_path]

In addition to command-line processing, the following web-based platforms were integral to the methodology:
| Platform | Application |
|---|---|
| CZID (Chan Zuckerberg ID) | Hybrid taxonomic classification benchmarking (Publication I) and stringent pathogen species cross-referencing (Publication II). |
| GISAID BLAST & FluSurver | Subtyping and mutation analysis of assembled Avian Influenza Virus (AIV) sequences. |
| iTOL (Interactive Tree of Life) | Visualization and annotation of AIV phylogenetic trees. |