SeqForge


Design Philosophy:

SeqForge emphasizes clarity, flexibility, and scale. Each module is standalone but interoperable with others, allowing researchers to plug in just what they need or use the full pipeline. Designed for power users but accessible enough for those just entering the field, SeqForge aims to grow as the field of microbial genomics and bioinformatics expands.

Overview

Note on FASTA file submission:

All FASTA-handling modules support the following inputs:

  • Single FASTA file (.fa, .fna, .faa, .ffn, .fas, .fasta)
  • Single compressed FASTA file (.gz, .zip)
  • Directory of FASTA files
  • Archive of FASTA files (.zip, .tar, .tar.gz, .tgz)

SeqForge is currently divided into three modules:

Module 1: Genome Search

Purpose: Rapid database creation and high-throughput querying

  • makedb
    Create BLAST-compatible databases (makeblastdb). Supports both nucleotide and protein databases, with multiprocessing support to boost performance on large datasets.
  • query
    A parallelized BLAST wrapper that allows you to run a set of query sequences (nucleotide or protein) against one or many databases in batch. Includes:
    • Support for blastn, tblastn, and blastp based on input types (auto-detected).
    • Optional reporting of only strongest match per query per genome.
    • Automatic filtering based on identity, coverage and/or e-value thresholds.
    • Output includes both full and filtered results tables, plus alignment files if desired.
    • Motif mining for amino acid queries.
    • Visualization of gene hits and sequence matches (hi-res PNG or PDF)
  • Motif support in query:
    When using blastp (amino acid query against protein database), users may specify one or more amino acid motif(s) (e.g., WXWXIP (single motif) | XAXH GHXXGE (multiple motifs, space-separated list)) using the --motif flag. This performs a regex-based search across all BLAST hits, independent from internally-curated or user-defined percent identity, query coverage, and e-value thresholds, ensuring detection of conserved motifs even in low-identity or heterologous alignments that might otherwise be filtered out. This is particularly useful for detecting signature domains (e.g., catalytic triads, DNA-binding motifs) in diverse sequence families. Whole query matches or just the motif string may be exported to FASTA using --motif-fasta-out with or without --motif-only.
    • For focused motif investigations across multiple queries, each motif's associated query file may be linked in the command string. This will result in only results associated with that query file being parsed for motif matches.
    • Example:
      • Query files: AT_domain.faa, KS_domain.faa
      • Motifs: GHXXGE, TAXXSS
      • Unlinked Query motif search:
        seqforge query -d <DB_DIR> -q <Query_DIR> -o <RESULTS_DIR> -f <FASTA_DIR> --motif GHXXGE TAXXSS
        • This results in all queries being searched for each motif
      • Linked Query motif search:
        seqforge query -d <DB_DIR> -q <Query_DIR> -o <RESULTS_DIR> -f <FASTA_DIR> --motif GHXXGE{AT_domain} TAXXSS{KS_domain}
        • This results in only AT_domain.faa being searched for GHXXGE and only KS_domain.faa being searched for TAXXSS.
        • The exact query file base name should be placed within curly braces as part of the motif string.
  • BLAST hits and motif matches may be visualized using the --visualize flag. BLAST hits across genomes are represented by a heatmap, where the color intensity of each individual cell reflects that query's percent identity within a specific genome. For queries that return > 1 hit per genome, only the strongest hit will be used to construct the heatmap. If --visualize is combined with --motif, motif matches will be illustrated as a sequence logo representing amino acid frequencies. Plots may be saved as either a high resolution PNG or in PDF format (recommended).
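To illustrate the motif syntax described above, here is a minimal Python sketch. It is not SeqForge's actual implementation: it assumes that X matches any single residue and that a linked motif takes the form MOTIF{query_basename}. The function names parse_motif and motif_to_regex, and the test sequence, are hypothetical.

```python
import re

def parse_motif(token):
    """Split a token such as 'GHXXGE{AT_domain}' into (motif, linked_query)."""
    match = re.fullmatch(r"([A-Za-z]+)(?:\{([^{}]+)\})?", token)
    if match is None:
        raise ValueError(f"Unrecognized motif token: {token!r}")
    motif, linked_query = match.groups()
    # linked_query is None when the motif is not tied to a query file
    return motif.upper(), linked_query

def motif_to_regex(motif):
    """Compile a motif into a regex, treating X as a single-residue wildcard (assumption)."""
    pattern = "".join(
        "[A-Z]" if residue == "X" else re.escape(residue) for residue in motif
    )
    return re.compile(pattern)

if __name__ == "__main__":
    motif, linked_query = parse_motif("GHXXGE{AT_domain}")
    regex = motif_to_regex(motif)
    hit = regex.search("MKVLGHSLGEYAA")  # made-up sequence for illustration only
    print(motif, linked_query, hit.group(0), hit.start())
```

In this sketch an unlinked token such as WXWXIP simply returns None for the linked query, which a caller could interpret as "search every query file".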

Example Plots:

[Figures: query_identity_heatmap, motif_1_logo]

Module 2: Sequence Investigation

Purpose: Extract meaningful biological context from BLAST hits

  • extract
    Extract aligned sequences identified via the Genome Search Query pipeline from original FASTA files. Features:
    • Optional translation of nucleotide hits to protein for full gene alignments.
      • This should only be used with full gene alignments.
    • Upstream/downstream padding for context-based analysis.
    • Filtering using percent identity, query coverage, and/or e-value.
  • extract-contig
    Extract entire contigs from reference assemblies based on where BLAST hits occurred. Ideal for identifying genomic context of hits in metagenomic assemblies too large to open via a genome browser.
    • For .faa files containing coding sequences, extract-contig will extract the full protein sequence from the source file, not just the aligned region identified via query, even when query coverage is < 100%.

Module 3: Utilities

Purpose: General-use tools for various genomic workflows

  • sanitize
    • Remove special characters from input file names (required for all Genome Search analyses).
    • Choose to rename files using --in-place (strongly recommended).
    • Or copy files to a new directory and rename the copies, leaving the original files untouched.
  • split-fasta
    Split multi-FASTA files into smaller chunks for downstream processing.
    • Choose fixed sequence count per chunk or split one sequence per file.
    • Optionally compress output for storage/transfer efficiency.
  • search
    Extract isolation metadata from GenBank or JSON files.
    • User-defined or comprehensive field extraction to CSV or TSV.
  • fasta-metrics
    Compute common assembly metrics from an input FASTA file or all FASTA files within a directory
    • The following metrics are calculated
      • Genome size (bp)
      • Number of contigs
      • Longest contig
      • Shortest contig
      • Basic contig size distribution
      • Total length of contigs ≥ 1 kb
      • GC content (%)
      • N count
      • auN
      • N50, N90
      • L50, L90
      • Lengths of each contig (CSV only)
  • unique-headers
    Append unique identifiers to each FASTA header within a multi-FASTA file
    • Submit a single FASTA file or a directory of FASTA files
    • Appends source file name + unique alphanumeric barcode to header
    • Example:
      • Source file: GCA_000346795_1.faa
      • Example header: >hypothetical
      • unique-headers modification: >hypothetical_GCA_000346795_1_54uMe

Note:

SeqForge accepts the following FASTA formats: .fasta, .fa, .fas, .fna, .ffn, .faa



Recommended Workflow:

graph TD;
    Z{{FASTA}}-->A{Sanitize};
    A{Sanitize}-->B{{Database-Creation}};
    B{{Database-Creation}}-->C[Query];
    C[Query]-->D[Extract];
    C[Query]-->E[Extract-Contig];
    A{Sanitize}-->F[FASTA-Metrics];
    A{Sanitize}-->H[Split-FASTA];
    I[JSON/GenBank]-->J[Search];

    %% Custom style for Sanitize node
    style Z fill:#11d393,stroke:#fff5ee,stroke-width:4,font-size:30,color:#454545
    style A fill:#d48074,stroke:#fff5ee,stroke-width:3,font-size:26px
    style B fill:#d40078,stroke:#FCF5E5,stroke-width:2,font-size:22px
    style C fill:#920075,stroke:#333,stroke-width:1,font-size:18px
    style D fill:#650D89,stroke:#333,stroke-width:1,font-size:18px
    style E fill:#023788,stroke:#333,stroke-width:1,font-size:18px
    style F fill:#fd1d53,stroke:#333,stroke-width:1,font-size:18px
    style H fill:#2e2157,stroke:#333,stroke-width:1,font-size:18px
    style I fill:#11c9d3,stroke:#333,stroke-width:1,font-size:22px,color:#454545
    style J fill:#6e1515,stroke:#333,stroke-width:1,font-size:18px

All pipelines should start with a check for special characters in input FASTA file names. Special characters are defined as any character aside from a-z, A-Z, 0-9 and underscores. The BLAST+ architecture cannot parse special characters; inclusion of these characters in a filename (e.g., GCA_900638215.1.fna, the non-extension period is pipeline-breaking) will result in database creation failure.

To scan your FASTA files, simply run: seqforge sanitize -i /path/to/FASTA/file(s) -e fasta --dry-run. This will print any problematic filenames to the console without making changes. If changes are needed, re-run without --dry-run using either --in-place (recommended) or --sanitize-outdir <dir>. See Module 3: Utilities > seqforge sanitize for more details.

Periods, hyphens, colons, semicolons, and whitespace are replaced by '_'. Parentheses and quotation marks are deleted. GCA_900638215.1.fna (non-compliant) becomes GCA_900638215_1.fna (compliant).
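The renaming rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code SeqForge runs: the helper names split_name, is_compliant, and sanitize_name are hypothetical, and the handling of characters not listed above is left undefined.

```python
import re
from pathlib import Path

# Extensions listed in this README as accepted FASTA formats.
FASTA_EXTENSIONS = {".fasta", ".fa", ".fas", ".fna", ".ffn", ".faa"}

def split_name(filename):
    """Split a file name into (stem, extension), keeping only a recognized FASTA extension."""
    path = Path(filename)
    if path.suffix.lower() in FASTA_EXTENSIONS:
        return path.stem, path.suffix
    return path.name, ""

def is_compliant(filename):
    """True if the stem contains only a-z, A-Z, 0-9, and underscores."""
    stem, _ = split_name(filename)
    return re.fullmatch(r"[A-Za-z0-9_]+", stem) is not None

def sanitize_name(filename):
    """Apply the documented rules: delete parentheses and quotes, replace separators with '_'."""
    stem, extension = split_name(filename)
    stem = re.sub(r"[()\"']", "", stem)       # parentheses and quotation marks are deleted
    stem = re.sub(r"[.\-:;\s]", "_", stem)    # periods, hyphens, colons, semicolons, whitespace
    return stem + extension

if __name__ == "__main__":
    for name in ["GCA_900638215.1.fna", "GCA_900638215_1.fna"]:
        print(name, is_compliant(name), sanitize_name(name))
```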


Suggested Workspace:
For organization, we suggest generating individual directories for each data type prior to the start of any SeqForge module:

cd /path/to/working/directory
mkdir -p FASTA DBs Query Results

where:

DBs: Output directory to store BLAST databases generated via seqforge makedb
FASTA: Directory of genomes to parse, used to create BLAST databases
Query: Directory containing query file(s) to use with seqforge query
Results: Directory where Genome Search results will be stored

We additionally recommend generating output directories for any additional SeqForge modules used.



Author:
Elijah R. Bring Horvath, PhD (https://github.com/ERBringHorvath)

License:
This program is shared under MIT License, which allows for modification and redistribution with attribution.


Installation

SeqForge Conda Installation

Install through Conda:

  1. Install Conda miniforge if not already installed
  2. Create a new Conda environment
    • conda create -y -n seqforge python=3.10
  3. Install SeqForge
    • conda install -y bioconda::seqforge

Standalone Installation

For the most up-to-date version, clone from GitHub:

SeqForge uses NCBI BLAST+

  1. Install NCBI BLAST+
    • Download latest version of BLAST+

Or using Conda:

  1. Install Conda miniforge if not already installed
  2. Create Conda environment
    • conda create -y -n seqforge python=3.10
  3. Activate Conda environment
    • source activate seqforge
  4. Install BLAST+
    • conda install -y bioconda::blast

Verify BLAST Installation

makeblastdb --help
blastn --help

If these commands run without error, BLAST is correctly installed. If an error occurs, refer to the BLAST+ documentation.

If not already installed, install Git
Linux/macOS systems may have this installed by default
To test installation, open the terminal and type git --version
For macOS users, you should see something like git version 2.37.1 (Apple Git-137.1)

  1. Setup:
    • We suggest installing SeqForge within your Home folder, such as /home/user
  2. Change directory to desired installation path
    • cd /home/user
  3. Clone SeqForge from the repository
    • git clone https://github.com/ERBringHorvath/SeqForge
  4. Add SeqForge to your PATH
    • Open your profile in a text editor. This might be ~/.bash_profile, ~/.bashrc, or ~/.zshrc
  5. Add the following line to the end of the file:
    • export PATH=$PATH:/home/user/SeqForge/seqforge

Replace /home/user/SeqForge/seqforge with the actual path to the directory containing the executable.
Whatever the initial directory, this path should end with /SeqForge/seqforge

  6. Save the file, then restart your terminal or run source ~/.bashrc or source ~/.bash_profile (Linux/macOS) or source ~/.zshrc (macOS)

Or run:

echo 'export PATH="$PATH:$HOME/SeqForge/seqforge"' >> ~/.bashrc
source ~/.bashrc

Install Dependencies

Navigate to the SeqForge/ directory:

  1. cd ~/SeqForge
  2. pip install .

Verify SeqForge Installation

seqforge --module-health

You should see:

Module Status Report:
 - makedb: Available
 - query: Available
 - split_fasta: Available
 - extract_sequences: Available
 - extract_contigs: Available
 - mask: Available
 - search: Available
 - sanitize: Available
 - fasta_metrics: Available
 - generate_unqiue_fasta_headers: Available

If there is a problem, the status will read 'Broken or Missing'
Submit issues via the SeqForge GitHub repository.

Help menu and version:
seqforge --help
seqforge --version

Individual module help:
seqforge <module> -h

NOTE: Permissions should be applied automatically during installation. If you get a permission denied message when running seqforge, permissions may need to be changed manually. To do this, you can use the following command (requires administrator privileges):

chmod +x ~/SeqForge/seqforge/seqforge



Usage

Module 1: Genome Search

Building a BLAST Database

seqforge makedb:
Required arguments:
-f, --fasta-directory: path to the directory containing input files in FASTA format
-o, --out: path to directory where you want to store your databases

Optional arguments:
-T, --threads: number of cores to dedicate for multiprocessing (default = 4)
-s, --sanitize: remove pipeline-breaking special characters from file names (renames in-place)
--temp-dir: specify a temporary directory for archive extraction (default = /tmp/)
--keep-temp-files: for archive submission; retains temporary directory generated at /tmp/seqforge_fasta_extract_*
-p, --progress: progress reporting mode; choices = 'bar', 'verbose', 'none'. --progress verbose prints a line per item.

Example:
seqforge makedb -f /path/to/FASTA/files -o /path/to/results/folder -T 8

Database type is automatically detected during database creation following standard FASTA extension practices

  • .fasta, .fa, .fas, .ffn, .fna == nucleotide
  • .faa == protein
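As a rough sketch of the extension-based detection described above, the following Python maps a FASTA extension to a makeblastdb -dbtype value (nucl or prot) and assembles a command line. The function names detect_dbtype and makeblastdb_command are hypothetical, the output naming scheme is an assumption, and SeqForge's own logic (for example, for compressed inputs) may differ.

```python
from pathlib import Path

# Extension-to-type mapping as stated in this README.
NUCLEOTIDE_EXTENSIONS = {".fasta", ".fa", ".fas", ".ffn", ".fna"}
PROTEIN_EXTENSIONS = {".faa"}

def detect_dbtype(fasta_path):
    """Return 'nucl' or 'prot' based on the file extension."""
    suffix = Path(fasta_path).suffix.lower()
    if suffix in NUCLEOTIDE_EXTENSIONS:
        return "nucl"
    if suffix in PROTEIN_EXTENSIONS:
        return "prot"
    raise ValueError(f"Unrecognized FASTA extension: {suffix!r}")

def makeblastdb_command(fasta_path, out_dir):
    """Build (but do not run) a makeblastdb command for one FASTA file."""
    path = Path(fasta_path)
    # Assumption: the database is named after the FASTA basename.
    out_prefix = Path(out_dir) / path.stem
    return [
        "makeblastdb",
        "-in", str(path),
        "-dbtype", detect_dbtype(path),
        "-out", str(out_prefix),
    ]

if __name__ == "__main__":
    print(" ".join(makeblastdb_command("FASTA/sample_1.faa", "DBs")))
```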

Querying a Database

seqforge query:
Required arguments:
-d, --database: path to directory containing BLAST+ databases
-q, --query-files: path to directory containing query files in amino acid FASTA format
-o, --output-dir: path to directory to store results

Required for nucleotide queries (blastn):
-N, --nucleotide-query: use blastn for queries in nucleotide FASTA format

Optional arguments:
-T, --threads: number of cores to dedicate for multiprocessing (default = 4)
-R, --report-strongest-match: report only the single strongest match for each query
--min-perc: define minimum percent identity threshold. Default = 90
--min-cov: define minimum query coverage threshold. Default = 75
-e, --evalue: maximum e-value cutoff, default 0.00001
--min-seq-len: define minimum sequence length for short nucleotide sequence queries (use with caution and only with --nucleotide-query)
-a, --no-alignment-files: suppress alignment file creation
--temp-dir: specify a temporary directory for archive extraction (default = /tmp/)
--keep-temp-files: retain individual *_results.txt files in output directory
--motif: amino acid motif (e.g., WXWXIP or space-separated list) to search within blastp hits. X is treated as a wildcard. Only for use with blastp queries
--motif-fasta-out: export motif query alignments to FASTA
--motif-only: for use with --motif-fasta-out; export only motif string to FASTA
-f, --fasta-directory: path to FASTA file(s) used to create BLAST databases. Required if using --motif
-V, --visualize: generate heatmap of BLAST hits and sequence logo of motif hits if --motif returns matches
-P, --pdf: override PNG output of visualize and instead generate a PDF (use in combination with --visualize)
-p, --progress: progress reporting mode; choices = 'bar', 'verbose', 'none'. --progress verbose prints a line per item.

Basic example:
seqforge query -T 8 -d /path/to/blast/database/files -q /path/to/query/files/ \
-o /path/to/results/folder

Witches' Brew:
seqforge query -T 8 \
-d /path/to/blast/databases \
-q /path/to/query/files \
-o /path/to/results/directory \
--min-perc 75 --min-cov 70 --evalue 0.001 \
--report-strongest-match \
--motif TAXXSS{query1} GHXXGE{query2} -f /path/to/FASTA/files \
--motif-fasta-out --visualize --pdf

All Query results are concatenated to all_results.csv and either all_filtered_results.csv or filtered_results.csv. All plots and files are saved to the output directory specified by --output-dir.



Module 2: Sequence Investigation

Extract Sequences from a SeqForge Query

seqforge extract:
Required arguments:
-c, --csv-path: path to results csv file from seqforge query
-f, --fasta-directory: path to reference FASTA assemblies
These should be the FASTA files the BLAST databases were created from and should have the same basename as the query results files
-o, --output-fasta: output file to contain sequences, defaults to current working directory

Optional arguments:
-T, --threads: number of cores to dedicate for multiprocessing (default = 1)
-e, --evalue: maximum e-value threshold, default = 0.00001
--min-perc: minimum percent identity threshold. Default = 90
--min-cov: minimum query coverage threshold. Default = 75
--translate: translates extracted nucleotide sequence(s)
--up: extract additional basepairs upstream of aligned sequence
--down: extract additional basepairs downstream of aligned sequence
--temp-dir: specify a temporary directory for archive extraction (default = /tmp/)
--keep-temp-files: for archive submission; retains temporary directory generated at /tmp/seqforge_fasta_extract_*

NOTE: Translation of sequences is optional, however care should be used when translating extracted nucleotide sequences, as BLAST results may not always contain a full CDS. To allow for this, when the --translate argument is called, extracted sequences will be trimmed to only include complete codons, which may affect interpretation of results. Generally, BLAST+ will return sequence coordinates in the correct frame; this has been tested and no mistranslations have been logged, but there are always exceptions.
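The codon trimming mentioned in the note above can be illustrated with a short Python sketch. It assumes the standard genetic code and trims from the 3' end only; the function names are hypothetical and this is not the translation routine SeqForge itself uses.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def trim_to_complete_codons(sequence):
    """Drop trailing bases so the length is a multiple of three."""
    remainder = len(sequence) % 3
    return sequence[: len(sequence) - remainder] if remainder else sequence

def translate(sequence):
    """Translate frame 1 of a nucleotide sequence; unknown codons become 'X'."""
    trimmed = trim_to_complete_codons(sequence.upper())
    return "".join(
        CODON_TABLE.get(trimmed[i : i + 3], "X") for i in range(0, len(trimmed), 3)
    )

if __name__ == "__main__":
    hit = "ATGGCCAAATG"  # 11 bases: the final 'TG' is an incomplete codon
    print(trim_to_complete_codons(hit), translate(hit))
```

Because only the first frame is translated, a hit that does not begin on a codon boundary would be mistranslated by this sketch, which is why translation is best reserved for full gene alignments.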

NOTE: --up and --down flags are incompatible with --translate, as 6-frame translation is not currently supported.

NOTE: Results files and FASTA reference assemblies must share the same basename:

Example basename: 'FILE'
Example FASTA: FILE.fasta
Example results file: FILE_results.txt

If SeqForge is used for database creation and queries, matching basenames will be generated automatically
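The basename convention above can be sketched as a small lookup. This assumes results files end in _results.txt, as in the example; the helper names and the file list are hypothetical, and SeqForge's internal matching may be more permissive.

```python
from pathlib import Path

FASTA_EXTENSIONS = {".fasta", ".fa", ".fas", ".fna", ".ffn", ".faa"}
RESULTS_SUFFIX = "_results.txt"  # naming taken from the example above

def basename_from_results(results_file):
    """Return 'FILE' for 'FILE_results.txt'."""
    name = Path(results_file).name
    if not name.endswith(RESULTS_SUFFIX):
        raise ValueError(f"Not a results file: {name!r}")
    return name[: -len(RESULTS_SUFFIX)]

def find_matching_fasta(results_file, fasta_names):
    """Return the first FASTA file name sharing the results file's basename, else None."""
    base = basename_from_results(results_file)
    for name in fasta_names:
        path = Path(name)
        if path.stem == base and path.suffix.lower() in FASTA_EXTENSIONS:
            return name
    return None

if __name__ == "__main__":
    print(find_matching_fasta("FILE_results.txt", ["OTHER.fasta", "FILE.fasta"]))
```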

Example usage:
seqforge extract -c /path/to/results/file -f /path/to/reference/FASTA/files -T 8 -o sequences.fa

Witches' Brew:
seqforge extract -T 8 \
-c /path/to/results/file -f /path/to/reference/FASTA/files \
-o sequences.fa \
--min-perc 75 --min-cov 70 --evalue 0.001 \
--up 1200 --down 1200

NOTE:
Results file needs to be all_results.csv, all_filtered_results.csv, or filtered_results.csv, which are automatically generated using seqforge query

If percent identity and query coverage were set manually during seqforge query, these values must also be passed to seqforge extract using --min-perc and/or --min-cov.
For instance, if seqforge query was called using --min-perc 75, but the seqforge extract minimum percent identity is left at its default value (90), the appropriate sequences may not be extracted, as they may fall beneath the default --min-perc threshold.

seqforge extract will generate a multi-FASTA file of all sequences identified by seqforge query based on the default or user-defined e-value cutoff.


Extract Entire Contig

seqforge extract-contig:
Required arguments:
-c, --csv-path: path to csv results file from seqforge query
-f, --fasta-directory: path to reference FASTA assemblies
These should be the FASTA files the BLAST databases were created from and should have the same basename as the query results files
-o, --output-fasta: output file to contain sequences, defaults to current working directory

Optional arguments:
-T, --threads: number of cores to dedicate, default is 1
-e, --evalue: maximum e-value threshold, default = 0.00001
--min-perc: minimum percent identity threshold. Default = 90
--min-cov: minimum query coverage threshold. Default = 75
--temp-dir: specify a temporary directory for archive extraction (default = /tmp/)
--keep-temp-files: for archive submission; retains temporary directory generated at /tmp/seqforge_fasta_extract_*

NOTE: Results files and FASTA reference assemblies must share the same basename for both seqforge extract and seqforge extract-contig:

Example basename: 'FILE'
Example FASTA: FILE.fasta
Example results file: FILE_results.txt

If SeqForge is used for database creation and queries, matching basenames are handled automatically

Example usage:
seqforge extract-contig -c /path/to/results/file -f /path/to/reference/FASTA/files -T 8 -o contigs.fa

seqforge extract-contig will generate a multi-FASTA file of all contigs harboring a matching sequence identified by seqforge query based on the default or user-defined thresholds.

This program was designed for use with metagenome mining, as metagenomic assemblies are often too large to explore using a genome browser. If short-read assembly methods are used, contigs harboring genes of interest may be extracted; individual contigs will likely be more tractable to explore in a genome browser if manual annotation is needed.



Module 3: Utilities

Sanitize File Names

seqforge sanitize:
-i, --input: FASTA file or directory of FASTA files to be cleaned
-e, --extension: extension of files to sanitize; for all standard FASTA formats, simply pass -e fasta
-I, --in-place: overwrite problematic file names with names containing no special characters (recommended)
-S, --sanitize-outdir: path to new directory for files. This option copies file(s) to a new directory and removes special characters from the copied file names, leaving the original file names intact (storage intensive)
--dry-run: preview changes without committing

Example usage (sanitize in-place):
seqforge sanitize -i /path/to/file(s) -e fasta -I
Example usage (copy files to new directory and rename only copied files):
seqforge sanitize -i /path/to/file(s) -e fasta -S /path/to/output/directory


FASTA File Metrics

seqforge fasta-metrics:
-f, --fasta-directory: path to FASTA file or directory of FASTA files to be analyzed
-o, --output: optional name for CSV summary (default: fasta_metrics_summary.csv)
-M, --min-contig-size: minimum contig size to include for all calculations (default = 500)
--temp-dir: specify a temporary directory for archive extraction (default = /tmp/)
--keep-temp-files: for archive submission; retains temporary directory generated at /tmp/seqforge_fasta_extract_*

Example usage:
seqforge fasta-metrics -f /path/to/FASTA/file(s) -o /path/to/output/directory
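For readers unfamiliar with the contiguity statistics that fasta-metrics reports, here is a minimal Python sketch of N50/L50, N90/L90, auN, and GC content using their common definitions. It works on a plain list of contig lengths, does not apply the --min-contig-size filter, and is not SeqForge's own code; the function names are hypothetical.

```python
def nx_lx(lengths, fraction):
    """Return (Nx, Lx): the contig length and contig count at which the
    cumulative length of contigs, sorted longest first, reaches `fraction`
    of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("lengths must be non-empty")

def au_n(lengths):
    """Area under the Nx curve: sum of squared lengths divided by total length."""
    return sum(length * length for length in lengths) / sum(lengths)

def gc_percent(sequence):
    """Percent G+C among all characters of the sequence."""
    sequence = sequence.upper()
    return 100.0 * (sequence.count("G") + sequence.count("C")) / len(sequence)

if __name__ == "__main__":
    contig_lengths = [100, 80, 60, 40, 20]  # toy values for illustration
    print("N50, L50:", nx_lx(contig_lengths, 0.5))
    print("N90, L90:", nx_lx(contig_lengths, 0.9))
    print("auN:", round(au_n(contig_lengths), 2))
```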


Split Multi-FASTA files

seqforge split-fasta:
-f, --fasta: input multi-FASTA file
-o, --output-dir: output directory for split FASTA files
-F, --fragment: split multi-FASTA file into defined chunks of n sequences each
-C, --compress: compress output files as .gz

Example usage:
seqforge split-fasta -f /path/to/multi-FASTA/file -o /path/to/output/directory

NOTE:

If any FASTA headers contain a pipe symbol (|), we suggest removing it. seqforge split-fasta renames output FASTA files based on the associated FASTA header; in edge cases, inclusion of a pipe in headers can cause issues, as the shell may interpret the symbol as part of a command string, leading to errors.
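The chunking behavior behind --fragment can be approximated with the following Python sketch, which groups records into batches of n sequences. The parser and the function names read_fasta and chunk_records are illustrative assumptions, not SeqForge internals, and no files are written.

```python
from itertools import islice

def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)

def chunk_records(records, size):
    """Yield lists of up to `size` records; size=1 gives one sequence per chunk."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

if __name__ == "__main__":
    fasta_text = ">s1\nACGT\n>s2\nGG\nCC\n>s3\nTTTT\n>s4\nAAAA\n>s5\nCGCG\n"
    for index, batch in enumerate(chunk_records(read_fasta(fasta_text.splitlines()), 2), start=1):
        print(index, [header for header, _ in batch])
```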


Extract Sequence Metadata from JSON or GenBank Files

seqforge search:
-i, --input: input file(s) (.json or .gb/.gbk)
-o, --output: output file (e.g., .csv, .tsv, .json)
--json: parse only JSON files in an input directory
--gb: parse only GenBank files in an input directory
--all: extract all available metadata
--fields: space-separated list of metadata fields to extract

Available fields:
accession, organism, strain, isolation_source, host, region, lat_lon, collection_date,
collected_by, tax_id, comment, keywords, sequencing_tech, release_date

Example usage:
seqforge search -i /path/to/input.(json | gbk) -o metadata.csv --fields accession isolation_source host region
seqforge search -i /path/to/input/files -o metadata.csv --all --json


Generate Unique FASTA Headers

Appends unique barcode to each FASTA header line

seqforge unique-headers:
-f, --fasta-directory: path to FASTA file(s)
-o, --output-dir: directory for output FASTA files (unless using --in-place)
-I, --in-place: modify input files in-place (uses temporary files for safety)
-p, --progress: progress reporting mode; choices = 'bar', 'verbose', 'none'. --progress verbose prints a line per record.
-D, --deterministic: use a stable MD5-based suffix derived from the sequence and header instead of the default random alphanumeric code. Ensures reproducible IDs across runs

We suggest using this module for any CDS prediction .faa output file(s) prior to motif searches with seqforge query to mitigate potential name collisions.

Example usage:
seqforge unique-headers -f /path/to/FASTA/file(s) -o /path/to/output/directory -Dp bar
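To show why a deterministic suffix is reproducible, here is a hedged Python sketch that derives a short code from an MD5 digest of the header and sequence. The suffix length, the exact input to the hash, and the function names are assumptions chosen for illustration; the suffixes SeqForge produces will not match these.

```python
import hashlib

def deterministic_suffix(header, sequence, length=5):
    """Return the first `length` hex characters of an MD5 digest of header + sequence."""
    digest = hashlib.md5(f"{header}\n{sequence}".encode("utf-8")).hexdigest()
    return digest[:length]

def unique_header(header, source_basename, sequence):
    """Build '<header>_<source basename>_<suffix>', mirroring the README example layout."""
    suffix = deterministic_suffix(header, sequence)
    return f"{header}_{source_basename}_{suffix}"

if __name__ == "__main__":
    # Hypothetical record; running this twice prints the same header both times.
    print(unique_header("hypothetical", "GCA_000346795_1", "MKVLAAGHSLGEY"))
```

A random code, by contrast, changes on every run, so deterministic IDs are the better fit when results need to be compared across repeated analyses.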



Citations

Cite SeqForge:
Bring Horvath, ER, Winter, JM, 2025.
SeqForge: A scalable platform for alignment-based searches, motif detection, and sequence curation across meta/genomic datasets BMC Bioinformatics 26, 280. doi.org/10.1186/s12859-025-06297-9

Cite NCBI BLAST+:
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL, 2009.
BLAST+: architecture and applications. BMC Bioinformatics, 10, 421. doi:10.1186/1471-2105-10-421

Cite Logomaker:
Tareen, A, Kinney, JB, 2020
Logomaker: beautiful sequence logos in Python. Bioinformatics, 36, 7, 2272–2274. doi.org/10.1093/bioinformatics/btz921
