SeqForge emphasizes clarity, flexibility, and scale. Each module is standalone but interoperable with others, allowing researchers to plug in just what they need or use the full pipeline. Designed for power users but accessible enough for those just entering the field, SeqForge aims to grow as the field of microbial genomics and bioinformatics expands.
All FASTA-handling modules support the following inputs:
- Single FASTA file (.fa, .fna, .faa, .ffn, .fas, .fasta)
- Single compressed FASTA file (.gz, .zip)
- Directory of FASTA files
- Archive of FASTA files (.zip, .tar, .tar.gz, .tgz)
Purpose: Rapid database creation and high-throughput querying
- makedb
Create BLAST-compatible databases (makeblastdb). Supports both nucleotide and protein databases, with multiprocessing support to boost performance on large datasets.
- query
A parallelized BLAST wrapper that runs a set of query sequences (nucleotide or protein) against one or many databases in batch. Includes:
- Support for blastn, tblastn, and blastp based on input types (auto-detected).
- Optional reporting of only strongest match per query per genome.
- Automatic filtering based on identity, coverage and/or e-value thresholds.
- Output includes both full and filtered results tables, plus alignment files if desired.
- Motif mining for amino acid queries.
- Visualization of gene hits and sequence matches (hi-res PNG or PDF)
- Motif support in query:
When using blastp (amino acid query against protein database), users may specify one or more amino acid motifs using the --motif flag, either as a single motif (e.g., WXWXIP) or as a space-separated list of motifs (e.g., XAXH GHXXGE). This performs a regex-based search across all BLAST hits, independent of the default or user-defined percent identity, query coverage, and e-value thresholds, ensuring detection of conserved motifs even in low-identity or heterologous alignments that might otherwise be filtered out. This is particularly useful for detecting signature domains (e.g., catalytic triads, DNA-binding motifs) in diverse sequence families. Whole query matches or just the motif string may be exported to FASTA using --motif-fasta-out, with or without --motif-only.
- For focused motif investigations across multiple queries, each motif's associated query file may be linked in the command string. Only results associated with that query file are then parsed for motif matches.
- Example:
- Query files: AT_domain.faa, KS_domain.faa
- Motifs: GHXXGE, TAXXSS
- Unlinked Query motif search:
seqforge query -d <DB_DIR> -q <Query_DIR> -o <RESULTS_DIR> -f <FASTA_DIR> --motif GHXXGE TAXXSS
- This results in all queries being searched for each motif.
- Linked Query motif search:
seqforge query -d <DB_DIR> -q <Query_DIR> -o <RESULTS_DIR> -f <FASTA_DIR> --motif GHXXGE{AT_domain} TAXXSS{KS_domain}
- This results in only AT_domain.faa being searched for GHXXGE and only KS_domain.faa being searched for TAXXSS.
- The exact query file base name should be placed within curly braces as part of the motif string.
- BLAST hits and motif matches may be visualized using the --visualize flag. BLAST hits across genomes are represented by a heatmap, where the color intensity of each cell reflects that query's percent identity within a specific genome. For queries that return more than one hit per genome, only the strongest hit is used to construct the heatmap. If --visualize is combined with --motif, motif matches are illustrated as a sequence logo representing amino acid frequencies. Plots may be saved as either a high-resolution PNG or a PDF (recommended).
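As a rough illustration of the motif syntax described above, the following sketch converts a motif into a regular expression (treating X as a wildcard) and separates an optional {query} link from the motif token. This is not SeqForge's internal code: the function names motif_to_regex and parse_linked_motif are hypothetical, and matching X to exactly one arbitrary character is an assumption.

```python
import re
from typing import Optional, Tuple


def motif_to_regex(motif: str) -> "re.Pattern[str]":
    """Compile a motif such as 'GHXXGE' into a regex; X matches any one residue."""
    # Assumption: X is the only wildcard and stands for exactly one character.
    pattern = "".join("." if ch == "X" else re.escape(ch) for ch in motif.upper())
    return re.compile(pattern)


def parse_linked_motif(token: str) -> Tuple[str, Optional[str]]:
    """Split 'GHXXGE{AT_domain}' into ('GHXXGE', 'AT_domain').

    Unlinked motifs such as 'TAXXSS' return None as the query name.
    """
    match = re.fullmatch(r"([A-Za-z]+)(?:\{([^{}]+)\})?", token)
    if match is None:
        raise ValueError(f"Unrecognized motif token: {token!r}")
    return match.group(1).upper(), match.group(2)


if __name__ == "__main__":
    motif, linked_query = parse_linked_motif("GHXXGE{AT_domain}")
    hit = motif_to_regex(motif).search("MKLGHSAGEYAA")
    print(motif, linked_query, hit.group(0) if hit else None)
```

On the command line, motif tokens containing braces may be safest wrapped in quotes, since some shells give braces special meaning.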
Purpose: Extract meaningful biological context from BLAST hits
- extract
Extract aligned sequences identified via the Genome Search Query pipeline from original FASTA files. Features:
- Optional translation of nucleotide hits to protein for full gene alignments.
- This should only be used with full gene alignments.
- Upstream/downstream padding for context-based analysis.
- Filtering using percent identity, query coverage, and/or e-value.
- extract-contig
Extract entire contigs from reference assemblies based on where BLAST hits occurred. Ideal for identifying genomic context of hits in metagenomic assemblies too large to open via a genome browser.
- For .faa files containing coding sequences, extract-contig will extract the full protein sequence from the source file, not just the aligned region identified via query, when query coverage is < 100%.
Purpose: General-use tools for various genomic workflows
- sanitize
- Remove special characters from input files—required for all Genome Search analyses.
- Choose to rename files using --in-place (strongly recommended).
- Or copy files to a new directory and rename the copies, leaving the original files untouched.
- split-fasta
Split multi-FASTA files into smaller chunks for downstream processing.
- Choose fixed sequence count per chunk or split one sequence per file.
- Optionally compress output for storage/transfer efficiency.
- search
Extract isolation metadata from GenBank or JSON files.
- User-defined or comprehensive field extraction to CSV or TSV.
- fasta-metrics
Compute common assembly metrics from an input FASTA file or all FASTA files within a directory.
- The following metrics are calculated:
- Genome size (bp)
- Number of contigs
- Longest contig
- Shortest contig
- Basic contig size distribution
- Total length of contigs ≥ 1 kb
- GC content (%)
- N count
- auN
- N50, N90
- L50, L90
- Lengths of each contig (CSV only)
- unique-headers
Append unique identifiers to each FASTA header within a multi-FASTA file.
- Submit a single FASTA file or a directory of FASTA files.
- Appends source file name + unique alphanumeric barcode to header.
- Example:
- Source file: GCA_000346795_1.faa
- Example header: >hypothetical
- unique-headers modification: >hypothetical_GCA_000346795_1_54uMe
SeqForge accepts the following FASTA formats: .fasta, .fa, .fas, .fna, .ffn, .faa
graph TD;
Z{{FASTA}}-->A{Sanitize};
A{Sanitize}-->B{{Database-Creation}};
B{{Database-Creation}}-->C[Query];
C[Query]-->D[Extract];
C[Query]-->E[Extract-Contig];
A{Sanitize}-->F[FASTA-Metrics];
A{Sanitize}-->H[Split-FASTA];
I[JSON/GenBank]-->J[Search];
%% Custom style for Sanitize node
style Z fill:#11d393,stroke:#fff5ee,stroke-width:4,font-size:30,color:#454545
style A fill:#d48074,stroke:#fff5ee,stroke-width:3,font-size:26px
style B fill:#d40078,stroke:#FCF5E5,stroke-width:2,font-size:22px
style C fill:#920075,stroke:#333,stroke-width:1,font-size:18px
style D fill:#650D89,stroke:#333,stroke-width:1,font-size:18px
style E fill:#023788,stroke:#333,stroke-width:1,font-size:18px
style F fill:#fd1d53,stroke:#333,stroke-width:1,font-size:18px
style H fill:#2e2157,stroke:#333,stroke-width:1,font-size:18px
style I fill:#11c9d3,stroke:#333,stroke-width:1,font-size:22px,color:#454545
style J fill:#6e1515,stroke:#333,stroke-width:1,font-size:18px
All pipelines should start with a check for special characters in input FASTA file names. Special characters are defined as any character aside from a-z, A-Z, 0-9 and underscores. The BLAST+ architecture cannot parse special characters; inclusion of these characters in a filename (e.g., GCA_900638215.1.fna, the non-extension period is pipeline-breaking) will result in database creation failure.
To scan your FASTA population, simply run: seqforge sanitize -i /path/to/FASTA/file(s) -e fasta --dry-run
This will print any problematic filenames to the console without making changes. If changes are needed, re-run without --dry-run using either --in-place (recommended) or --sanitize-outdir <dir>. See Module 3: Utilities > seqforge sanitize for more details.
Periods, hyphens, colons, semicolons, and whitespace are replaced by '_'. Parentheses and quotation marks are deleted. For example, GCA_900638215.1.fna (non-compliant) becomes GCA_900638215_1.fna (compliant).
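The renaming rules above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not the sanitize module itself: the function names are hypothetical, the final extension is assumed to be everything after the last period, and characters not named in the rules are left untouched.

```python
import re
from pathlib import Path

# Characters replaced by an underscore, per the rules described above.
REPLACE_WITH_UNDERSCORE = re.compile(r"[.\-:;\s]")
# Characters removed outright (parentheses and quotation marks).
DELETE = re.compile(r"[()\"']")


def sanitize_filename(filename: str) -> str:
    """Return a cleaned file name, keeping the final extension unchanged."""
    path = Path(filename)
    stem = DELETE.sub("", path.stem)
    stem = REPLACE_WITH_UNDERSCORE.sub("_", stem)
    return stem + path.suffix


def is_compliant(filename: str) -> bool:
    """True if the name (without extension) uses only a-z, A-Z, 0-9, and '_'."""
    return re.fullmatch(r"[A-Za-z0-9_]+", Path(filename).stem) is not None


if __name__ == "__main__":
    for name in ["GCA_900638215.1.fna", "my genome (v2).fa", "clean_name.fasta"]:
        print(name, "->", sanitize_filename(name), is_compliant(name))
```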
Suggested Workspace:
For organization, we suggest generating individual directories for each data type prior to the start of any SeqForge module:
cd /path/to/working/directory
mkdir -p FASTA DBs Query Results
where:
DBs: Output directory to store BLAST databases generated via seqforge makedb
FASTA: Directory of genomes to parse, used to create BLAST databases
Query: Directory containing query file(s) to use with seqforge query
Results: Directory where Genome Search results will be stored
We additionally recommend generating output directories for any additional SeqForge modules used.
Author:
Elijah R. Bring Horvath, PhD (https://github.com/ERBringHorvath)
License:
This program is shared under MIT License, which allows for modification and redistribution with attribution.
Install through Conda:
- Install Conda miniforge if not already installed
- Create a new Conda environment
conda create -y -n seqforge python=3.10
- Install SeqForge
conda install -y bioconda::seqforge
For the most up-to-date version, clone from GitHub:
SeqForge uses NCBI BLAST+
- Install NCBI BLAST+
- Download latest version of BLAST+
Or using Conda:
- Install Conda miniforge if not already installed
- Create Conda environment
conda create -y -n seqforge python=3.10
- Activate Conda environment
source activate seqforge
- Install BLAST+
conda install -y bioconda::blast
Verify BLAST Installation
makeblastdb --help
blastn --help
If these commands run without error, BLAST is correctly installed. If an error occurs, refer to the BLAST+ documentation
If not already installed, install Git
Linux/macOS systems may have this installed by default
To test installation, open the terminal and type git --version
For macOS users, you should see something like git version 2.37.1 (Apple Git-137.1)
- Setup:
- We suggest installing SeqForge within your Home folder, such as /home/user
- Change directory to desired installation path
cd /home/user
- Clone SeqForge from the repository
git clone https://github.com/ERBringHorvath/SeqForge
- Add SeqForge to your PATH
- Open your profile in a text editor. This might be ~/.bash_profile, ~/.bashrc, or ~/.zshrc
- Add the following line to the end of the file:
export PATH=$PATH:/home/user/SeqForge/seqforge
Replace /home/user/SeqForge/seqforge with the actual path to the directory containing the executable.
Whatever the initial directory, this path should end with /SeqForge/seqforge
- Save the file and restart your terminal, or run source ~/.bashrc or source ~/.bash_profile (Linux/macOS) or source ~/.zshrc (macOS)
Or run:
echo 'export PATH="$PATH:$HOME/SeqForge/seqforge"' >> ~/.bashrc
source ~/.bashrc
Install Dependencies
Navigate to the SeqForge/ directory:
cd ~/SeqForge
pip install .
Verify SeqForge Installation
seqforge --module-health
You should see:
Module Status Report:
- makedb: Available
- query: Available
- split_fasta: Available
- extract_sequences: Available
- extract_contigs: Available
- mask: Available
- search: Available
- sanitize: Available
- fasta_metrics: Available
- generate_unqiue_fasta_headers: Available
If there is a problem, the status will read 'Broken or Missing'
Submit issues here
Help menu and version:
seqforge --help
seqforge --version
Individual module help:
seqforge <module> -h
NOTE: Permissions should automatically be applied during installation. If you get a permission denied message when running seqforge, permissions may need to be changed manually. To do this, you can use the following command (requires administrator privileges):
chmod +x ~/SeqForge/seqforge/seqforge
seqforge makedb:
Required arguments:
-f, --fasta-directory: path to the directory containing input files in FASTA format
-o, --out: path to directory where you want to store your databases
Optional arguments:
-T, --threads: number of cores to dedicate for multiprocessing (default = 4)
-s, --sanitize: remove pipeline-breaking special characters from file names (renames in-place)
--temp-dir: specify a temporary directory for archive extraction (default = /tmp/)
--keep-temp-files: for archive submission; retains temporary directory generated at /tmp/seqforge_fasta_extract_*
-p, --progress: progress reporting mode; choices = 'bar', 'verbose', 'none'. --progress verbose prints a line per item.
Example:
seqforge makedb -f /path/to/FASTA/files -o /path/to/results/folder -T 8
Database type is automatically detected during database creation following standard FASTA extension practices
- .fasta, .fa, .fas, .ffn, .fna == nucleotide
- .faa == protein
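The extension-based detection described above can be pictured with the sketch below, which maps a file extension to a makeblastdb -dbtype value and assembles the corresponding command. The helper names and the choice to name each database after the FASTA basename are assumptions for illustration; -in, -dbtype, and -out are standard makeblastdb options.

```python
from pathlib import Path
from typing import List

NUCLEOTIDE_EXTENSIONS = {".fasta", ".fa", ".fas", ".ffn", ".fna"}
PROTEIN_EXTENSIONS = {".faa"}


def detect_dbtype(fasta_path: str) -> str:
    """Return 'nucl' or 'prot' based on the FASTA file extension."""
    extension = Path(fasta_path).suffix.lower()
    if extension in PROTEIN_EXTENSIONS:
        return "prot"
    if extension in NUCLEOTIDE_EXTENSIONS:
        return "nucl"
    raise ValueError(f"Unsupported FASTA extension: {extension!r}")


def makeblastdb_command(fasta_path: str, out_dir: str) -> List[str]:
    """Build (but do not run) a makeblastdb command for one FASTA file."""
    # Assumption: the database is named after the FASTA basename.
    db_path = Path(out_dir) / Path(fasta_path).stem
    return [
        "makeblastdb",
        "-in", fasta_path,
        "-dbtype", detect_dbtype(fasta_path),
        "-out", str(db_path),
    ]


if __name__ == "__main__":
    print(" ".join(makeblastdb_command("FASTA/genome1.fna", "DBs")))
```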
seqforge query:
Required arguments:
-d, --database: path to directory containing BLAST+ databases
-q, --query-files: path to directory containing query files in amino acid FASTA format
-o, --output-dir: path to directory to store results
Required for nucleotide queries (blastn):
-N, --nucleotide-query: use blastn for queries in nucleotide FASTA format
Optional arguments:
-T, --threads: number of cores to dedicate for multiprocessing (default = 4)
-R, --report-strongest-match: report only the single strongest match for each query
--min-perc: define minimum percent identity threshold. Default = 90
--min-cov: define minimum query coverage threshold. Default = 75
-e, --evalue: maximum e-value cutoff, default 0.00001
--min-seq-len: define minimum sequence length for short nucleotide sequence queries (use with caution and only with --nucleotide-query)
-a, --no-alignment-files: suppress alignment file creation
--temp-dir: specify a temporary directory for archive extraction (default = /tmp/)
--keep-temp-files: retain individual *_results.txt files in output directory
--motif: amino acid motif (e.g., WXWXIP or space-separated list) to search within blastp hits. X is treated as a wildcard. Only for use with blastp queries
--motif-fasta-out: export motif query alignments to FASTA
--motif-only: for use with --motif-fasta-out; export only motif string to FASTA
-f, --fasta-directory: path to FASTA file(s) used to create BLAST databases. Required if using --motif
-V, --visualize: generate heatmap of BLAST hits and sequence logo of motif hits if --motif returns matches
-P, --pdf: override PNG output of visualize and instead generate a PDF (use in combination with --visualize)
-p, --progress: progress reporting mode; choices = 'bar', 'verbose', 'none'. --progress verbose prints a line per item.
Basic example:
seqforge query -T 8 -d /path/to/blast/database/files -q /path/to/query/files/
-o /path/to/results/folder
Witches' Brew:
seqforge query -T 8 \
-d /path/to/blast/databases \
-q /path/to/query/files \
-o /path/to/results/directory \
--min-perc 75 --min-cov 70 --evalue 0.001 \
--report-strongest-match \
--motif TAXXSS{query1} GHXXGE{query2} -f /path/to/FASTA/files \
--motif-fasta-out --visualize --pdf
All Query results are concatenated to all_results.csv and either all_filtered_results.csv or filtered_results.csv.
All plots and files are saved to the output directory specified by --output-dir.
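To make the filtering behavior concrete, here is a minimal sketch of threshold filtering followed by selection of the strongest hit per query per genome. The field names (query, genome, pident, qcov, evalue, bitscore) are hypothetical and may not match the columns in all_results.csv, and ranking "strongest" by bitscore is an assumption; the default thresholds are the ones listed above.

```python
from typing import Dict, List, Tuple

Hit = Dict[str, object]


def passes_thresholds(hit: Hit, min_perc: float = 90.0,
                      min_cov: float = 75.0, max_evalue: float = 1e-5) -> bool:
    """Apply percent identity, query coverage, and e-value cutoffs to one hit."""
    return (hit["pident"] >= min_perc
            and hit["qcov"] >= min_cov
            and hit["evalue"] <= max_evalue)


def strongest_per_query_per_genome(hits: List[Hit]) -> List[Hit]:
    """Keep one hit per (query, genome) pair, ranked by bitscore (assumption)."""
    best: Dict[Tuple[object, object], Hit] = {}
    for hit in hits:
        key = (hit["query"], hit["genome"])
        if key not in best or hit["bitscore"] > best[key]["bitscore"]:
            best[key] = hit
    return list(best.values())


if __name__ == "__main__":
    hits = [
        {"query": "q1", "genome": "g1", "pident": 98.0, "qcov": 100.0, "evalue": 1e-50, "bitscore": 400.0},
        {"query": "q1", "genome": "g1", "pident": 92.0, "qcov": 80.0, "evalue": 1e-20, "bitscore": 150.0},
        {"query": "q1", "genome": "g2", "pident": 70.0, "qcov": 90.0, "evalue": 1e-10, "bitscore": 120.0},
    ]
    filtered = [h for h in hits if passes_thresholds(h)]
    for hit in strongest_per_query_per_genome(filtered):
        print(hit["query"], hit["genome"], hit["pident"])
```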
seqforge extract:
Required arguments:
-c, --csv-path: path to results csv file from seqforge query
-f, --fasta-directory: path to reference FASTA assemblies
These should be the FASTA files the BLAST databases were created from and should have the same basename as the query results files
-o, --output-fasta: output file to contain sequences, defaults to current working directory
Optional arguments:
-T, --threads: number of cores to dedicate for multiprocessing (default = 1)
-e, --evalue: maximum e-value threshold, default = 0.00001
--min-perc: minimum percent identity threshold. Default = 90
--min-cov: minimum query coverage threshold. Default = 75
--translate: translates extracted nucleotide sequence(s)
--up: extract additional basepairs upstream of aligned sequence
--down: extract additional basepairs downstream of aligned sequence
--temp-dir: specify a temporary directory for archive extraction (default = /tmp/)
--keep-temp-files: for archive submission; retains temporary directory generated at /tmp/seqforge_fasta_extract_*
NOTE: Translation of sequences is optional, however care should be used when translating extracted nucleotide sequences, as BLAST results may not always contain a full CDS. To allow for this, when the --translate argument is called, extracted sequences will be trimmed to only include complete codons, which may affect interpretation of results. Generally, BLAST+ will return sequence coordinates in the correct frame; this has been tested and no mistranslations have been logged, but there are always exceptions.
NOTE: --up and --down flags are incompatible with --translate, as 6-frame translation is not currently supported.
NOTE: Results files and FASTA reference assemblies must share the same basename:
Example basename: 'FILE'
Example FASTA: FILE.fasta
Example results file: FILE_results.txt
If SeqForge is used for database creation and queries, matching basenames will be generated automatically
Example usage:
seqforge extract -c /path/to/results/file -f /path/to/reference/FASTA/files -T 8 -o sequences.fa
Witches' Brew:
seqforge extract -T 8 \
-c /path/to/results/file -f /path/to/reference/FASTA/files \
-o sequences.fa \
--min-perc 75 --min-cov 70 --evalue 0.001 \
--up 1200 --down 1200
NOTE:
Results file needs to be all_results.csv, all_filtered_results.csv, or filtered_results.csv, which are automatically generated using seqforge query
If percent identity and query coverage were set manually during seqforge query, these values will need to be reflected when running seqforge extract using --min-perc and/or --min-cov
For instance, if seqforge query was called using --min-perc 75, but the seqforge extract minimum percent identity is left at its default value (90), the appropriate sequences may not be extracted, as they may fall beneath the default --min-perc threshold.
seqforge extract will generate a multi-FASTA file of all sequences identified by seqforge query based on the default or user-defined e-value cutoff.
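The two extraction behaviors noted above, upstream/downstream padding and trimming to complete codons before translation, can be sketched as follows. This is an illustration under stated assumptions rather than the module's code: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive on the plus strand, and the standard genetic code is used.

```python
from itertools import product

# Standard genetic code, built in TCAG order.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def padded_region(sequence: str, start: int, end: int, up: int = 0, down: int = 0) -> str:
    """Return the aligned region plus padding, clamped to the contig boundaries."""
    # Assumption: start/end are 1-based, inclusive, plus-strand coordinates.
    start_index = max(0, start - 1 - up)
    end_index = min(len(sequence), end + down)
    return sequence[start_index:end_index]


def translate_complete_codons(nucleotides: str) -> str:
    """Trim to a multiple of three bases, then translate in frame 1."""
    nucleotides = nucleotides.upper()
    usable = len(nucleotides) - (len(nucleotides) % 3)
    codons = (nucleotides[i:i + 3] for i in range(0, usable, 3))
    # Codons containing ambiguous bases (e.g., N) are reported as X.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


if __name__ == "__main__":
    contig = "AAACCCGGGTTT"
    print(padded_region(contig, 4, 6, up=2, down=2))
    print(translate_complete_codons("ATGGCCTAAGG"))
```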
seqforge extract-contig:
Required arguments:
-c, --csv-path: path to csv results file from seqforge query
-f, --fasta-directory: path to reference FASTA assemblies
These should be the FASTA files the BLAST databases were created from and should have the same basename as the query results files
-o, --output-fasta: output file to contain sequences, defaults to current working directory
Optional arguments:
-T, --threads: number of cores to dedicate, default is 1
-e, --evalue: maximum e-value threshold, default = 0.00001
--min-perc: minimum percent identity threshold. Default = 90
--min-cov: minimum query coverage threshold. Default = 75
--temp-dir: specify a temporary directory for archive extraction (default = /tmp/)
--keep-temp-files: for archive submission; retains temporary directory generated at /tmp/seqforge_fasta_extract_*
NOTE: Results files and FASTA reference assemblies must share the same basename for both seqforge extract and seqforge extract-contig:
Example basename: 'FILE'
Example FASTA: FILE.fasta
Example results file: FILE_results.txt
If SeqForge is used for database creation and queries, matching basenames are handled automatically
Example usage:
seqforge extract-contig -c /path/to/results/file -f /path/to/reference/FASTA/files -T 8 -o contigs.fa
seqforge extract-contig will generate a multi-FASTA file of all contigs harboring a matching
sequence identified by seqforge query based on the default or user-defined thresholds.
This program was designed for use with metagenome mining, as metagenomic assemblies are often too large to explore using a genome browser. If short-read assembly methods are used, contigs harboring genes of interest may be extracted; the extracted contigs will likely be more tractable to inspect in a genome browser if manual annotation is needed.
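Conceptually, contig extraction amounts to streaming through an assembly and keeping only the records whose identifiers appear among the BLAST hits. A minimal sketch of that idea is shown below; the function names are hypothetical, and taking the first whitespace-delimited token of the header as the contig ID is an assumption.

```python
from typing import Iterable, Iterator, List, Optional, Set, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def contigs_with_hits(lines: Iterable[str], hit_contig_ids: Set[str]) -> Iterator[Tuple[str, str]]:
    """Yield whole contigs whose ID (first header token) is in hit_contig_ids."""
    for header, sequence in read_fasta(lines):
        contig_id = header.split(maxsplit=1)[0] if header else ""
        if contig_id in hit_contig_ids:
            yield header, sequence


if __name__ == "__main__":
    fasta = [">contig_1 len=8", "ACGT", "ACGT", ">contig_2", "GGGG", ">contig_3", "TTTT"]
    for header, sequence in contigs_with_hits(fasta, {"contig_1", "contig_3"}):
        print(f">{header}\n{sequence}")
```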
seqforge sanitize:
-i, --input: FASTA file or directory of FASTA files to be cleaned
-e, --extension: extension of files to sanitize; for all standard FASTA formats, simply pass -e fasta
-I, --in-place: overwrite problematic file names with names containing no special characters (recommended)
-S, --sanitize-outdir: path to new directory for files. This option copies file(s) to a new directory and removes special characters from
the copied file names, leaving the original file names intact (storage intensive)
--dry-run: preview changes without committing
Example usage (sanitize in-place):
seqforge sanitize -i /path/to/file(s) -e fasta -I
Example usage (copy files to new directory and rename only copied files):
seqforge sanitize -i /path/to/file(s) -e fasta -S /path/to/output/directory
seqforge fasta-metrics:
-f, --fasta-directory: path to FASTA file or directory of FASTA files to be analyzed
-o, --output: optional name for CSV summary (default: fasta_metrics_summary.csv)
-M, --min-contig-size: minimum contig size to include for all calculations (default = 500)
--temp-dir: specify a temporary directory for archive extraction (default = /tmp/)
--keep-temp-files: for archive submission; retains temporary directory generated at /tmp/seqforge_fasta_extract_*
Example usage:
seqforge fasta-metrics -f /path/to/FASTA/file(s) -o /path/to/output/directory
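For readers less familiar with the contiguity metrics reported by this module, the sketch below shows how N50/L50, N90/L90, auN, and GC content are conventionally computed from a list of contig lengths. The function names are hypothetical, and details such as whether N bases count toward the GC denominator are assumptions; the module's own handling may differ.

```python
from typing import List, Tuple


def nx_lx(lengths: List[int], fraction: float) -> Tuple[int, int]:
    """Return (Nx, Lx): e.g., fraction=0.5 gives N50 and L50.

    Nx is the length of the contig at which the running total of lengths,
    sorted longest first, reaches the given fraction of the assembly size;
    Lx is the number of contigs needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    target = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    raise ValueError("No contig lengths supplied")


def au_n(lengths: List[int]) -> float:
    """Area under the Nx curve: sum of squared lengths over total length."""
    return sum(length * length for length in lengths) / sum(lengths)


def gc_percent(sequence: str) -> float:
    """GC content as a percentage of all bases (assumption: N bases included)."""
    sequence = sequence.upper()
    return 100.0 * (sequence.count("G") + sequence.count("C")) / len(sequence)


if __name__ == "__main__":
    contig_lengths = [100, 80, 60, 40, 20]
    print("N50, L50:", nx_lx(contig_lengths, 0.5))
    print("N90, L90:", nx_lx(contig_lengths, 0.9))
    print("auN:", round(au_n(contig_lengths), 2))
    print("GC%:", gc_percent("GGCCAATT"))
```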
seqforge split-fasta:
-f, --fasta: input multi-FASTA file
-o, --output-dir: output directory for split FASTA files
-F, --fragment: split multi-FASTA file into defined chunks of n sequences each
-C, --compress: compress output files as .gz
Example usage:
seqforge split-fasta -f /path/to/multi-FASTA/file -o /path/to/output/directory
If any FASTA headers contain a pipe symbol (|), we suggest removing it. seqforge split-fasta renames output FASTA files based on the associated FASTA header; in edge cases, inclusion of a pipe in headers can cause issues, as the shell may interpret the symbol as part of a command string, leading to errors.
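The chunking behavior of --fragment, and one way to avoid pipe characters ending up in output file names, can be sketched as below. Both helpers are hypothetical: in particular, deriving the file name from the first header token and replacing every character outside a-z, A-Z, 0-9, and '_' is an assumption, not a description of how split-fasta names its files.

```python
import re
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def chunk_records(records: Iterable[Record], size: int) -> Iterator[List[Record]]:
    """Group records into consecutive chunks of at most `size` sequences."""
    if size < 1:
        raise ValueError("Chunk size must be at least 1")
    chunk: List[Record] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly shorter, chunk
        yield chunk


def filename_from_header(header: str) -> str:
    """Build a shell-safe file name from the first token of a FASTA header."""
    tokens = header.split()
    first_token = tokens[0] if tokens else "unnamed"
    return re.sub(r"[^A-Za-z0-9_]", "_", first_token) + ".fasta"


if __name__ == "__main__":
    records = [("s1", "A"), ("s2", "C"), ("s3", "G")]
    for chunk in chunk_records(records, 2):
        print([header for header, _ in chunk])
    print(filename_from_header("sp|P12345|ABC_ECOLI some protein"))
```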
seqforge search:
-i, --input: input file(s) (.json or .gb/.gbk)
-o, --output: output file (e.g., .csv, .tsv, .json)
--json: parse only JSON files in an input directory
--gb: parse only GenBank files in an input directory
--all: extract all available metadata
--fields: space-separated list of metadata fields to extract
Available fields:
accession, organism, strain, isolation_source, host, region, lat_lon, collection_date,
collected_by, tax_id, comment, keywords, sequencing_tech, release_date
Example usage:
seqforge search -i /path/to/input.(json | gbk) -o metadata.csv --fields accession isolation_source host region
seqforge search -i /path/to/input/files -o metadata.csv --all --json
Appends unique barcode to each FASTA header line
seqforge unique-headers:
-f, --fasta-directory: path to FASTA file(s)
-o, --output-dir: directory for output FASTA files (unless using --in-place)
-I, --in-place: modify input files in-place (uses temporary files for safety)
-p, --progress: progress reporting mode; choices = 'bar', 'verbose', 'none'. --progress verbose prints a line per record.
-D, --deterministic: use a stable MD5-based suffix derived from the sequence and header instead of a random alphanumeric code (default). Ensures reproducible IDs across runs
We suggest using this module for any CDS prediction .faa output file(s) prior to Query-Motif to mitigate potential file name collisions.
Example usage:
seqforge unique-headers -f /path/to/FASTA/file(s) -o /path/to/output/directory -Dp bar
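The difference between the default random barcode and the --deterministic option can be illustrated with the sketch below. It is an assumption-laden example, not the module's implementation: the function names, the five-character suffix length, and the way the header and sequence are combined before hashing are all hypothetical.

```python
import hashlib
import secrets
import string

ALPHANUMERIC = string.ascii_letters + string.digits


def deterministic_suffix(header: str, sequence: str, length: int = 5) -> str:
    """Derive a stable suffix from an MD5 digest of the header and sequence."""
    digest = hashlib.md5(f"{header}\n{sequence}".encode("utf-8")).hexdigest()
    return digest[:length]


def random_suffix(length: int = 5) -> str:
    """Generate a random alphanumeric suffix; differs between runs."""
    return "".join(secrets.choice(ALPHANUMERIC) for _ in range(length))


def unique_header(header: str, source_stem: str, suffix: str) -> str:
    """Append the source file name and a suffix to a FASTA header."""
    return f">{header.lstrip('>')}_{source_stem}_{suffix}"


if __name__ == "__main__":
    stem = "GCA_000346795_1"
    print(unique_header(">hypothetical", stem, random_suffix()))
    print(unique_header(">hypothetical", stem, deterministic_suffix("hypothetical", "MKLV")))
```

The deterministic variant yields the same suffix whenever the header and sequence are unchanged, which is what makes IDs reproducible across runs.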
Cite SeqForge:
Bring Horvath, ER, Winter, JM, 2025.
SeqForge: A scalable platform for alignment-based searches, motif detection, and sequence curation across meta/genomic datasets
BMC Bioinformatics 26, 280. doi.org/10.1186/s12859-025-06297-9
Cite NCBI BLAST+:
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL, 2009.
BLAST+: architecture and applications. BMC Bioinformatics, 10, 421. doi:10.1186/1471-2105-10-421
Cite Logomaker:
Tareen, A, Kinney, JB, 2020
Logomaker: beautiful sequence logos in Python. Bioinformatics, 36, 7, 2272–2274. doi.org/10.1093/bioinformatics/btz921
