A comprehensive tool for identifying and classifying virulence factors in bacterial genomes using BLAST searches against the Virulence Factor Database (VFDB), with automated visualization and interactive reporting.
Author: Bruno Gomez-Gil (bruno@ciad.mx). CIAD AC, Mazatlán Unit for Aquaculture
This script identifies virulence factors in bacterial genome FASTA sequences by comparing them against the VFDB database with BLASTN. It parses VFDB headers to extract gene symbols, virulence factor categories, and originating organisms, then reports annotated results that include gene symbols, specific virulence factor names, broad categories, and alignment statistics. It also generates summary plots and interactive Krona charts, and is designed for researchers studying bacterial pathogenicity and virulence gene content.
- Automated VFDB Integration: Automatically detects and parses VFDB database headers
- Comprehensive Annotation: Extracts gene symbols, virulence factor categories, and organism information
- BLAST-based Identification: Uses NCBI BLAST+ for accurate sequence similarity searches
- Dual Output Format: Generates both complete results and filtered best hits automatically
- Enhanced BLAST Output: Includes strand information and alignment coordinates
- Automated Visualization: Generates summary plots showing virulence factor distribution by category
- Interactive Krona Charts: Creates interactive HTML charts for hierarchical data exploration
- Flexible Input Options: Supports custom VFDB FASTA file locations
- Compressed File Support: Automatically handles compressed (.gz) FASTA files
- Multi-threaded Processing: Configurable thread count for faster BLAST searches
- Database Validation: Checks for database existence before running analysis
- Advanced Comparative Heatmaps: Generate high-quality, publication-ready heatmaps from multiple samples using R and ComplexHeatmap
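To make the header-parsing feature listed above more concrete, here is a minimal sketch of how a VFDB header might be split into gene symbol, virulence factor name, category, and organism. The header layout in the comment, the regular expression, and the function name `parse_vfdb_header` are assumptions for illustration; they are not taken from the script itself and may not cover every header in the database.

```python
import re

# Assumed VFDB setB-style header layout (an assumption, not the script's code):
# >VFG000001(gb|XX_000001) (geneA) product [VF name (VF0000) - Category (VFC0000)] [Organism]
HEADER_RE = re.compile(
    r"^>?(?P<vfg>VFG\d+)\S*\s+"                      # sequence ID plus optional accession
    r"\((?P<gene>[^)]+)\)\s+"                        # gene symbol in parentheses
    r"(?P<product>.*?)\s+"                           # free-text product description
    r"\[(?P<vf_name>.+?)\s+\((?P<vf_id>VF\d+)\)\s+-\s+"    # specific VF name and VF ID
    r"(?P<category>.+?)\s+\((?P<vfc>VFC\d+)\)\]\s+"  # broad category and category ID
    r"\[(?P<organism>[^\]]+)\]\s*$"                  # originating organism
)


def parse_vfdb_header(header):
    """Return a dict of header fields, or None if the header does not match."""
    match = HEADER_RE.match(header.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Made-up header used only to exercise the pattern.
    example = (">VFG000001(gb|XX_000001) (geneA) example product "
               "[Example factor (VF0279) - Exotoxin (VFC0001)] [Example organism]")
    print(parse_vfdb_header(example))
```

Returning `None` for non-matching headers lets a caller count and report unparsed entries instead of failing on the first irregular line.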
- Git (optional, for cloning/downloading the repository)
- Mamba (optional but highly recommended, for faster conda package installation)
- NCBI BLAST+ (version 2.12.0 or later)
- Python 3.7+
- Pandas
- Biopython
- Matplotlib (optional, for visualization)
- Seaborn (optional, for visualization)
- Krona Tools (optional, for interactive charts) - Install from Krona GitHub
- VFDB Database - Download from VFDB Official Site
- R (optional, for advanced heatmap generation)
- ComplexHeatmap (installed automatically by the script)
- RColorBrewer
- dplyr
- tidyr
- Krona is optional - the script will work without it, but won't generate interactive HTML charts
- If Krona is not installed, the script will gracefully skip HTML generation with a warning
The script is cross-platform and supports Linux, macOS, and Windows:
- Linux/macOS: Best performance and easy installation (Native support for all tools).
- Windows: Fully supported via Mamba/Conda or WSL2 (recommended). Some features, such as Krona, may require additional configuration when installed manually on native Windows.
A functioning installation is composed of three main steps:
- Clone the repository (or download the zip file)
- Install the dependencies (using mamba or conda)
- Download the VFDB fasta file and create the BLAST database
If you do not have mamba already installed in your conda environment, it is highly recommended to install it:
conda install -c conda-forge mamba
# 1. Clone the repository
git clone https://github.com/GenomicaMicrob/VF_classifier.git
cd VF_classifier
# Make scripts executable if you want to run them directly
chmod +x *.py *.R
# 2. Create an environment from the provided environment.yml file
mamba env create -f environment.yml
# Activate the environment
conda activate VF_classifier
That's it! The environment.yml file includes all necessary dependencies, including BLAST, Krona, R, and Python packages. Be patient: installation may take a while, since the dependencies total about 460 MB.
Need more options? See INSTALLATION.md for alternative installation methods and detailed setup instructions.
Important: The VFDB data is freely available for research and educational purposes only under the Creative Commons Attribution-NonCommercial (CC BY-NC) license version 4.0. For commercial use, contact the authors directly. Visit https://www.mgc.ac.cn/VFs/ for more information.
# 3. Download and create BLAST database automatically
python VF_classifier.v0.1.0.py --setup
If a "permission error" appears, download the database manually from https://www.mgc.ac.cn/VFs/download.htm (select DNA sequences of full dataset) and proceed as follows:
# Optional: Use a pre-downloaded fasta file instead of downloading
python VF_classifier.v0.1.0.py --setup --input-fasta VFDB_setB_nt.fas.gz
# Optional: Skip automated curation and use original headers
python VF_classifier.v0.1.0.py --setup --no-curation
Note: As a safety measure, the script will not overwrite an existing VF_database/ directory. If you wish to re-download or re-curate the database, you must use the --force flag.
The script includes an automated curation step during database indexing to ensure data integrity and consistency.
Discrepancies can occur in the source VFDB descriptors due to truncated names or non-translated symbols (e.g., "Beta-haemolysin" vs "-haemolysin"). This can lead to redundant rows in the final analysis or heatmap.
- VF ID Tracking: The script extracts the internal VF ID (e.g., VF0279) from every header.
- Consensus Selection: For each VF ID, it identifies all naming variations and automatically selects the longest (most complete) version found across the entire database.
- Normalization: All hits in your results are mapped to these Gold Standard names.
A detailed record of every modification (Original vs Curated) is stored in:
VF_database/curation_log.tsv
Be aware that automated curation can itself introduce errors into the database, so check the log if naming matters for your analysis. If you prefer to keep the original headers, use the --no-curation flag, but some genes may then appear duplicated under variant names.
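The consensus selection described above can be sketched as follows. This is only an illustration of the "longest name per VF ID" rule under stated assumptions: the function names, the `(vf_id, name)` input shape, and the alphabetical tie-break are hypothetical and are not taken from the script.

```python
from collections import defaultdict


def build_consensus_names(records):
    """Map each VF ID to the longest name variant seen for it.

    records: a list of (vf_id, name) pairs, one per database header.
    """
    variants = defaultdict(set)
    for vf_id, name in records:
        variants[vf_id].add(name.strip())
    # Longest variant wins; sorting first makes ties resolve alphabetically
    # (an assumption made here so the result is deterministic).
    return {vf_id: max(sorted(names), key=len) for vf_id, names in variants.items()}


def curation_log(records, consensus):
    """Yield (vf_id, original, curated) once for every name that would change."""
    seen = set()
    for vf_id, name in records:
        name = name.strip()
        curated = consensus[vf_id]
        if name != curated and (vf_id, name) not in seen:
            seen.add((vf_id, name))
            yield vf_id, name, curated
```

Rows produced by `curation_log` correspond in spirit to the Original vs Curated record kept in the curation log, which makes it easy to review whether the longest variant was really the right choice.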
cd VF_classifier
chmod +x VF_classifier.v0.1.0.py # if not already executable
python VF_classifier.v0.1.0.py -i your_genome.fasta
The script can handle both compressed and uncompressed FASTA files:
- .fna (uncompressed)
- .fna.gz (compressed) - automatically decompressed
- .fasta (uncompressed)
- .fasta.gz (compressed) - automatically decompressed
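For readers who want to pre-check their own inputs, transparent handling of .gz files can be sketched with the standard library alone. The helper names `open_fasta` and `count_sequences` are hypothetical; how the script itself decompresses files is not shown in this README.

```python
import gzip


def open_fasta(path):
    """Open a FASTA file as text, decompressing on the fly if it ends in .gz."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_sequences(path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open_fasta(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

A quick `count_sequences("your_genome.fasta.gz")` before a long run is a cheap way to catch an empty or corrupted archive.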
# Process multiple genomes at once:
python VF_classifier.v0.1.0.py -i Sample1.fna Sample2.fna.gz Sample3.fasta -t 8
# Process multiple genomes with the same extension:
python VF_classifier.v0.1.0.py -i *.gz

| Flag | Long | Required | Description |
|---|---|---|---|
| --setup | | No | Download VFDB, curate it, and create BLAST database automatically |
| --input-fasta | | No | Use a local VFDB FASTA file for setup instead of downloading |
| --no-curation | | No | Skip automated curation and use original VFDB headers |
| --force | | No | Force database re-download OR overwrite existing sample results |
| --merge | [files] | No | Consolidate results recursively from VF_results/. |
| --out-matrix | | No | Custom output path (default: VF_results/multiple_samples_best_hits.csv) |
| -i | --input | Yes | Path to input genome FASTA file(s) (supports multiple) |
| -db | --database | No | Path to BLAST database prefix (default: VF_database/VFDB_db) |
| -v | --vfdb_fasta | No | Path to VFDB FASTA file (auto-detected if not specified) |
| -t | --threads | No | Number of threads for BLAST (default: 4) |
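To show how the database and thread options could map onto a BLAST+ call, here is a hedged sketch that only builds the command line. The `blastn` options used (`-query`, `-db`, `-outfmt`, `-evalue`, `-num_threads`) and the tabular field names are standard BLAST+; the selection of fields, the e-value of `1e-5`, and the function name are assumptions and may differ from what the script actually runs.

```python
def build_blastn_command(query, database="VF_database/VFDB_db", threads=4, evalue="1e-5"):
    """Return a blastn argument list (nothing is executed here)."""
    # Tabular output (format 6) with coordinates and subject strand.
    # This field list is an assumption chosen to match the reported columns.
    fields = "qseqid sseqid pident length qstart qend sstart send evalue bitscore sstrand"
    return [
        "blastn",
        "-query", str(query),
        "-db", str(database),
        "-outfmt", f"6 {fields}",
        "-evalue", str(evalue),        # assumed threshold, not confirmed by the README
        "-num_threads", str(threads),
    ]


if __name__ == "__main__":
    print(" ".join(build_blastn_command("your_genome.fasta", threads=8)))
```

The returned list can be passed to `subprocess.run(cmd, check=True, capture_output=True, text=True)`. Keeping it as a list rather than a shell string avoids quoting problems with paths that contain spaces.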
The script creates a VF_results/ directory. For each input sample, a dedicated subfolder is created:
VF_results/[SampleName]/
Contains all individual analysis files for that specific sample.
Safety Note: To prevent accidental data loss, the script will skip any sample that already has an existing folder in VF_results/. Use the --force flag to overwrite existing results.
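The skip-unless-forced behavior can be pictured with a small sketch. The way the sample name is derived from the file name here (dropping `.gz` and then `.fna` or `.fasta`) is an assumption based on the example outputs, and both helper names are hypothetical.

```python
from pathlib import Path

FASTA_SUFFIXES = (".fna", ".fasta")


def sample_name(path):
    """Derive a sample name from a FASTA path, e.g. 'NC_004116.fasta.gz' -> 'NC_004116'."""
    name = Path(path).name
    if name.endswith(".gz"):
        name = name[:-3]
    for suffix in FASTA_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def should_process(path, results_dir="VF_results", force=False):
    """Process a sample if forced, or if it has no results folder yet."""
    return force or not (Path(results_dir) / sample_name(path)).exists()
```

With this logic, rerunning the same command over a directory of genomes only analyzes the samples that are new, unless `force=True` is passed.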
Inside each sample folder:
[input_basename]_all_hits.tsv
Contains all BLAST matches, which may include hits to the same gene from multiple genomes in the database, with complete annotation information.
[input_basename]_best_hits.tsv
Contains filtered results with only the best match per gene per contig (highest bitscore).
[input_basename]_summary.png
A bar chart showing the distribution of virulence factors by broad category (requires matplotlib/seaborn).
[input_basename]_VF.html
An interactive HTML chart for hierarchical exploration of virulence factors (requires Krona tools).
[input_basename]_data4plot.tsv
Tab-separated file with headers containing aggregated virulence factor data for custom visualizations. You might want to plot a Sunburst chart or a Sankey diagram using this file.
Columns: Count, VF_Broad_Category, VF_Specific_Name
VF_results/multiple_samples_best_hits.csv (optional)
Automatically created when processing multiple files, or by running with --merge. Consolidates identity percentages for all detected genes across all samples.
vf_heatmap.pdf (optional)
A high-quality PDF heatmap generated by the R script, showing clustered samples and genes grouped by broad and specific virulence categories.
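As an illustration of how per-sample best hits could be consolidated into one identity matrix, consider the sketch below. The layout (genes as rows, samples as columns), the fill value of 0.0 for absent genes, and keeping the highest identity when a gene occurs more than once in a sample are all assumptions; the actual layout of multiple_samples_best_hits.csv may differ.

```python
def build_identity_matrix(samples):
    """Collect the highest identity per gene for every sample.

    samples: dict mapping sample name -> list of best-hit rows (dicts with
    'VF_Gene_Symbol' and 'Identity_Pct' keys).
    Returns (sorted gene list, {sample: {gene: identity}}).
    """
    matrix = {}
    genes = set()
    for sample, rows in samples.items():
        per_gene = matrix.setdefault(sample, {})
        for row in rows:
            gene = row["VF_Gene_Symbol"]
            identity = float(row["Identity_Pct"])
            per_gene[gene] = max(identity, per_gene.get(gene, 0.0))
            genes.add(gene)
    return sorted(genes), matrix


def matrix_rows(genes, matrix, missing=0.0):
    """Yield one row per gene: [gene, identity in sample 1, identity in sample 2, ...]."""
    for gene in genes:
        yield [gene] + [matrix[sample].get(gene, missing) for sample in matrix]
```

Each yielded row can be written with `csv.writer`, after a header row of `["Gene"] + list(matrix)`.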
- Query_Sequence: ID of the query sequence from your genome
- Query_Start: Start position of the alignment in the query sequence
- Query_End: End position of the alignment in the query sequence
- Strand: Alignment strand (plus/minus)
- VF_Broad_Category: Broad category of virulence factor
- VF_Specific_Name: Specific virulence factor name
- VF_Gene_Symbol: Gene symbol of the virulence factor
- Identity_Pct: Percentage identity of the alignment
- E_Value: E-value of the alignment
- VF_Origin_Organism: Organism from which the virulence factor originates
- VF_ID: VFDB database identifier
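Using the columns listed above, the "best match per gene per contig" filter behind the best-hits file can be sketched like this. A `Bitscore` column is assumed here because the filter is described as using the highest bitscore; that column name does not appear in the list above and is hypothetical, as is the function name.

```python
def best_hits(rows):
    """Keep the highest-bitscore row for each (contig, gene) pair.

    rows: iterable of dicts with 'Query_Sequence', 'VF_Gene_Symbol' and an
    assumed 'Bitscore' key. On ties, the first row seen is kept.
    """
    best = {}
    for row in rows:
        key = (row["Query_Sequence"], row["VF_Gene_Symbol"])
        if key not in best or float(row["Bitscore"]) > float(best[key]["Bitscore"]):
            best[key] = row
    return list(best.values())
```

Rows can be read from an all-hits file with `csv.DictReader(handle, delimiter="\t")`, provided the file really contains a bitscore column under that name.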
[input_basename]_data4plot.tsv
Aggregated data file with headers for custom analysis and plotting.
Columns:
- Count: Number of hits for each virulence factor
- VF_Broad_Category: Broad category of virulence factor
- VF_Specific_Name: Specific virulence factor name
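The aggregation behind this file amounts to counting hits per category and name pair. A minimal sketch follows, assuming the input rows carry the `VF_Broad_Category` and `VF_Specific_Name` columns described earlier; the function names and the sort order (most frequent first) are assumptions.

```python
import csv
from collections import Counter

FIELDS = ["Count", "VF_Broad_Category", "VF_Specific_Name"]


def aggregate_for_plot(rows):
    """Count hits per (broad category, specific name) pair, most frequent first."""
    counts = Counter((r["VF_Broad_Category"], r["VF_Specific_Name"]) for r in rows)
    return [
        {"Count": n, "VF_Broad_Category": category, "VF_Specific_Name": name}
        for (category, name), n in counts.most_common()
    ]


def write_data4plot(table, handle):
    """Write the aggregated table as tab-separated text with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(table)
```

The same three-column shape (count, parent, child) is what most sunburst and Sankey plotting libraries expect, so the file can usually be loaded without further reshaping.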
Note:
- TSV files are saved with tab separation for better bioinformatics compatibility
- Visualization files are generated only if the required dependencies are available
- All files are saved in the VF_results/ directory to keep your workspace organized
- The _data4plot.tsv file is preserved for custom analysis and contains headers
After downloading and making the database, you can test the script with the included Streptococcus agalactiae genomes:
python VF_classifier.v0.1.0.py -i examples/NC_004116.fasta.gz
python VF_classifier.v0.1.0.py -i examples/*.fasta.gz
# Generate an advanced comparative heatmap
./plot_vf_heatmap.R
The plot_vf_heatmap.R script will automatically search for the VF_results/multiple_samples_best_hits.csv file and generate a PDF heatmap.
The script provides a clean, color-coded terminal experience:
Auto-detected VFDB FASTA: VF_database/VFDB_setB_nt.fas
Running BLASTN (Genome: examples/NC_004116.fasta vs DB: VF_database/VFDB_db)...
BLAST finished.
Processing results...
Saved All Hits: VF_results/NC_004116_all_hits.tsv
Saved Best Hits: VF_results/NC_004116_best_hits.tsv
Generating summary plot: VF_results/NC_004116_summary.png...
Generating Krona chart: VF_results/NC_004116_VF.html...
Krona HTML saved: VF_results/NC_004116_VF.html
[Figures: examples of the summary plot, the Krona chart, and the heatmap generated by the script]
- BLAST command not found: Ensure BLAST+ is installed and in your PATH
- Database not found: Verify the database path and that makeblastdb was run successfully
- VFDB FASTA not found: Keep VFDB_setB_nt.fas in the same directory as the database or use the -v parameter
- No results found: Check that your genome file contains nucleotide sequences and try adjusting the e-value threshold
- Plot not generated: Install matplotlib and seaborn with pip install matplotlib seaborn
- Krona chart not generated: Install Krona tools or ensure ktImportText is in your PATH
- Permission errors: Ensure you have write permissions for creating the VF_results/ directory
- Compressed file error: Ensure the .gz file is not corrupted, and the script has read permissions

- Error: The provided FASTA file was not found: Check the path specified with -v
- Error: Could not find 'VFDB_setB_nt.fas': Place the file in the database directory or use -v
- BLAST Error: Ensure the database path is correct: Verify database files exist and are accessible
- [WARNING] matplotlib/seaborn not installed: Install with pip install matplotlib seaborn
- [WARNING] 'ktImportText' not found: Install Krona tools or add to PATH
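Several of the problems above can be caught before a run with a small preflight check. The sketch below is an assumption-laden helper, not part of the tool: the function name `preflight` is hypothetical, and it assumes a single-volume nucleotide database whose files end in .nhr, .nin, and .nsq.

```python
import shutil
from pathlib import Path


def preflight(db_prefix="VF_database/VFDB_db", which=shutil.which):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if which("blastn") is None:
        problems.append("blastn not found in PATH")
    # Core files of a single-volume nucleotide BLAST database (assumed layout).
    missing = [ext for ext in (".nhr", ".nin", ".nsq")
               if not Path(str(db_prefix) + ext).exists()]
    if missing:
        problems.append("database files missing for %s: %s" % (db_prefix, ", ".join(missing)))
    if which("ktImportText") is None:
        problems.append("ktImportText not found (Krona chart would be skipped)")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("[WARNING]", problem)
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` searches the current PATH.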
VF_classifier/
├── VF_classifier.v0.1.0.py # Main script (now handles downloads)
├── plot_vf_heatmap.R # Advanced heatmap generation script (R)
├── assets/ # Project assets (logos, etc.)
│ └── logo.png # Project logo
├── requirements.txt # Python dependencies for pip installation
├── environment.yml # Conda environment specification
├── README.md # This file
├── VF_results/ # Output directory (created automatically)
│ ├── [genome]_all_hits.tsv # All BLAST matches
│ ├── [genome]_best_hits.tsv # Filtered best matches
│ ├── [genome]_summary.png # Summary plot
│ ├── [genome]_VF.html # Interactive Krona chart
│ └── [genome]_data4plot.tsv # Data for custom plotting (with headers)
└── VF_database/ # Recommended directory for database files
├── VFDB_setB_nt.fas # VFDB nucleotide sequences
├── VFDB_db.nhr # BLAST database files
├── VFDB_db.nin
├── VFDB_db.nsq
├── VFDB_metadata.txt # Database metadata and timestamps
└── ...
If you use this tool in your research, please cite:
- VFDB Database: Liu B, Zheng D, Jin Q, et al. VFDB 2019: a comparative pathogenomic platform with an interactive web interface. Nucleic Acids Res. 2019;47(D1):D687-D692. doi:10.1093/nar/gky1080
- BLAST: Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool. J Mol Biol. 1990;215(3):403-410. doi:10.1016/S0022-2836(05)80360-2
- Complex Heatmaps: Gu Z, Eils R, Schlesner M. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics. 2016;32(18):2847-2849. doi:10.1093/bioinformatics/btw313
This project is licensed under the MIT License - see the parent directory's License.md file for details.
Contributions are welcome! Please feel free to submit issues and enhancement requests.



