NNGeneTree is a phylogenetic analysis and taxonomic classification pipeline for protein sequences. It builds gene trees and finds the nearest neighbors of query sequences in the phylogenetic context, assigning taxonomy information for comprehensive evolutionary analysis.
Built with Nextflow - a dataflow-oriented workflow engine with built-in resume and reporting capabilities.
- Overview
- Features
- Requirements
- Installation
- First-Time Setup
- Usage
- OrthoFinder Preprocessing
- Pipeline Workflow
- Output Description
- Configuration
- Scripts Documentation
- License
- Contact
NNGeneTree leverages the power of phylogenetic analysis to place query protein sequences in an evolutionary context and identify their closest neighbors in sequence space. This approach provides valuable insights into the functional and evolutionary relationships between proteins, complementing traditional similarity-based annotation methods.
- Automated workflow from protein sequences to annotated phylogenetic trees
- Smart neighbor selection based on phylogenetic distance
- NCBI taxonomy integration for comprehensive classification
- Statistical analysis of phylogenetic relationships
- Tree visualizations with taxonomic annotations
- Detailed logging and reports for each analysis step
- Local and HPC support (SLURM cluster deployment)
- Modular design with Pixi package management
The pipeline automatically manages all required tools through Pixi:
| Tool | Purpose |
|---|---|
| Nextflow | Workflow management |
| DIAMOND | Fast protein similarity search |
| BLAST+ | Sequence extraction |
| MAFFT | Multiple sequence alignment |
| TrimAl | Alignment trimming |
| IQ-TREE | Phylogenetic tree construction |
| ETE Toolkit | Tree manipulation |
| BioPython | Sequence analysis and taxonomy retrieval |
| OpenJDK | Required for Nextflow |
```bash
# Clone the repository
git clone https://github.com/username/nngenetree.git
cd nngenetree

# Install Pixi (if not already installed)
curl -fsSL https://pixi.sh/install.sh | bash

# Install all dependencies
pixi install
```

All dependencies are now installed and managed by Pixi.
After installation, you must configure the database path for your system.
```bash
# Copy the template
cp conf/local.config.template conf/local.config

# Edit with your settings
nano conf/local.config
```

Edit conf/local.config and set:

- `blast_db`: Path to your DIAMOND-formatted NR database (without the `.dmnd` extension)
- `entrez_email`: Your email for NCBI Entrez API
Example:

```groovy
params {
    blast_db = '/path/to/nr/database'
    entrez_email = 'your.email@example.com'
}
```

Alternatively, set environment variables:

```bash
# Add to your ~/.bashrc or ~/.zshrc
export NR_DATABASE=/path/to/your/nr/database
export ENTREZ_EMAIL=your.email@example.com
```

If running on a SLURM cluster, add queue settings to conf/local.config:
```groovy
process {
    queue = 'your_queue_name'
    clusterOptions = '--qos=your_qos --account=your_account'
}
```

To run nngenetree from any directory:
```bash
# From the nngenetree repository directory
mkdir -p ~/bin
ln -s $(pwd)/nngenetree ~/bin/nngenetree

# Add ~/bin to PATH (if not already)
echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

# Test from any directory
cd /tmp && nngenetree test
```

```bash
# Test mode with small built-in database
nngenetree test

# Run on your data locally
nngenetree my_input_dir local

# Run on SLURM cluster (default)
nngenetree my_input_dir slurm
```

Note: Use `nngenetree` directly if installed to PATH; otherwise use `bash nngenetree`.
- Automatic resume on failure (`-resume`)
- HTML execution reports with resource usage
- Built-in timeline and DAG visualizations
- Cloud-ready (AWS, Azure, Google Cloud)
NNGeneTree includes an optional preprocessing script for OrthoFinder integration. This runs separately before the main pipeline and allows you to:
- Identify orthogroups across multiple genomes using OrthoFinder
- Filter orthogroups by target protein IDs
- Create FASTA files for each orthogroup
- Use orthogroup FASTA files as input to NNGeneTree
Genome files must follow this header format:

```
>{genome_id}|{contig_id}_{protein_id}
```

Example:

```
>Hype|contig_50_1
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTA...
```
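A header in this format can be split programmatically. Below is a minimal sketch (the `parse_header` helper is hypothetical, not part of the pipeline), assuming the protein ID is the last underscore-separated field:

```python
def parse_header(header: str) -> dict:
    """Split a '>{genome_id}|{contig_id}_{protein_id}' FASTA header.

    Assumes the protein ID is the final underscore-separated field,
    as in the expected header format above.
    """
    name = header.lstrip(">").strip()
    genome_id, rest = name.split("|", 1)
    contig_id, protein_id = rest.rsplit("_", 1)
    return {"genome": genome_id, "contig": contig_id, "protein": protein_id}

# parse_header(">Hype|contig_50_1")
# -> {"genome": "Hype", "contig": "contig_50", "protein": "1"}
```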
```bash
# Activate the pixi environment
pixi shell

# Basic usage - process all orthogroups
python bin/orthofinder_preprocess.py \
    --genomes_faa_dir path/to/genomes \
    --output_dir path/to/output

# Filter for specific proteins
python bin/orthofinder_preprocess.py \
    --genomes_faa_dir path/to/genomes \
    --output_dir path/to/output \
    --target "target_substring"
```

```bash
# Step 1: Run OrthoFinder preprocessing
pixi shell
python bin/orthofinder_preprocess.py \
    --genomes_faa_dir my_genomes/ \
    --output_dir my_orthogroups/ \
    --target "species1|contig_10_" \
    --threads 16
exit

# Step 2: Run NNGeneTree on the orthogroups
nngenetree my_orthogroups local
```

| Option | Description |
|---|---|
| `--genomes_faa_dir` | Directory containing genome FASTA files |
| `--output_dir` | Output directory for orthogroup FASTA files |
| `--target` | Comma-separated substrings to filter orthogroups |
| `--orthofinder_results` | Path to existing OrthoFinder results (skip re-running) |
| `--threads` | Number of threads for OrthoFinder (default: 16) |
| `--force` | Overwrite existing output directory |
```
INPUT FASTA FILES (.faa)
          |
          v
+---------------------+
| DIAMOND BLASTP      |  Fast protein similarity search (default: 20 hits/query)
+---------------------+
          |
          v
+---------------------+
| PROCESS & VALIDATE  |  Extract unique subjects and validate output
+---------------------+
          |
          v
+---------------------+
| EXTRACT SEQUENCES   |  Retrieve hit sequences using blastdbcmd
+---------------------+
          |
          v
+---------------------+
| COMBINE SEQUENCES   |  Merge query + hit sequences
+---------------------+
          |
          v
+---------------------+
| MAFFT ALIGNMENT     |  Multiple sequence alignment
+---------------------+
          |
          v
+---------------------+
| TRIMAL TRIMMING     |  Remove poorly aligned regions (gap threshold: 0.1)
+---------------------+
          |
          v
+---------------------+
| IQTREE              |  Build phylogenetic tree (LG+G4 model)
+---------------------+
          |
          +---------------------------+
          |                           |
          v                           v
+-----------------+        +------------------------+
| EXTRACT         |        | PHYLOGENETIC           |
| NEIGHBORS       |        | PLACEMENT              |
| (N=10 default)  |        +------------------------+
+-----------------+
          |
          v
+---------------------+
| ASSIGN TAXONOMY     |  Fetch NCBI taxonomy via Entrez API
+---------------------+
          |
          v
+---------------------+
| DECORATE TREE       |  Generate PNG visualizations with taxonomy
+---------------------+
          |
          v
+---------------------+
| TREE STATISTICS     |  Calculate phylogenetic statistics
+---------------------+
          |
          v
+---------------------+
| COMBINE RESULTS     |  Aggregate placement results to JSON
+---------------------+
          |
          v
    FINAL OUTPUT
```
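For illustration, the PROCESS & VALIDATE step amounts to collecting the unique subject accessions from DIAMOND's tabular output (BLAST outfmt-6 style, with the subject accession in column 2). A minimal sketch with a hypothetical `unique_subjects` helper, not the pipeline's actual code:

```python
def unique_subjects(m8_lines):
    """Collect unique subject accessions (column 2) from DIAMOND
    tabular output, preserving first-seen order."""
    seen = []
    for line in m8_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:  # skip empty or malformed lines
            continue
        subject = fields[1]
        if subject not in seen:
            seen.append(subject)
    return seen

# Illustrative rows in the standard 12-column tabular format
m8 = [
    "q1\tWP_000001.1\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t200",
    "q1\tWP_000002.1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-40\t180",
    "q2\tWP_000001.1\t88.0\t100\t12\t0\t1\t100\t1\t100\t1e-35\t170",
]
# unique_subjects(m8) -> ["WP_000001.1", "WP_000002.1"]
```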
Results are saved in <input_dir>_output/. For each input FASTA file:
```
<sample>/
├── blast_results.m8                      # DIAMOND BLAST tabular output
├── unique_subjects.txt                   # List of unique hit accessions
├── check_blast_output.done               # Validation checkpoint
├── extracted_hits.faa                    # Sequences of BLAST hits
├── combined_sequences.faa                # Combined query and hit sequences
├── aln/
│   ├── aligned_sequences.msa             # Raw MAFFT alignment
│   └── trimmed_alignment.msa             # TrimAl-trimmed alignment
├── tree/
│   ├── final_tree.treefile               # Newick tree file
│   ├── final_tree.iqtree                 # IQ-TREE log file
│   ├── decorated_tree.png                # Visualization with taxonomy
│   └── tree_stats.tab                    # Tree statistics
├── closest_neighbors.csv                 # Neighbors with phylogenetic distances
├── closest_neighbors_with_taxonomy.csv   # Enhanced CSV with NCBI taxonomy
├── taxonomy_assignments.txt              # Taxonomy information (tabular)
├── placement_results.json                # Detailed neighbor info
├── placement_results.csv                 # Placement results (CSV format)
└── itol/                                 # Interactive Tree of Life files
    ├── itol_labels.txt
    ├── itol_colors.txt
    └── itol_ranges.txt
```

At the top level, `combined_placement_results.json` aggregates the placement results across all samples.
A log file (<input_dir>_output_completion.log) contains:
- Pipeline version and runtime information
- BLAST hit counts for each sample
- Taxonomy distribution statistics (domains, phyla, classes, orders, families, genera)
- Tree generation status
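The taxonomy distribution tallies can be reproduced with a simple counter over taxonomy records. A sketch with a hypothetical `rank_distribution` helper, assuming each record is a dict with one key per rank (the real taxonomy_assignments.txt columns may differ):

```python
from collections import Counter

def rank_distribution(records, rank="phylum"):
    """Count how many taxonomy assignments fall into each taxon
    at the given rank; missing values are tallied as 'unknown'."""
    return Counter(r.get(rank, "unknown") for r in records)

records = [
    {"accession": "WP_1", "phylum": "Proteobacteria"},
    {"accession": "WP_2", "phylum": "Firmicutes"},
    {"accession": "WP_3", "phylum": "Proteobacteria"},
]
# rank_distribution(records)["Proteobacteria"] -> 2
```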
| Parameter | Description | Default |
|---|---|---|
| `input_dir` | Directory containing input .faa files | test |
| `output_dir` | Override default output directory | {input_dir}_output |
| `blast_db` | Path to BLAST/DIAMOND database | (from local.config) |
| `blast_hits` | Number of BLAST hits per query | 5 |
| `closest_neighbors` | Number of closest neighbors to extract | 5 |
| `query_filter` | Comma-separated query prefixes to filter | - |
| `query_prefixes` | Prefixes for phylogenetic placement | Hype,Klos |
| `num_neighbors_placement` | Neighbors for placement | 5 |
| `itol_tax_level` | Taxonomy level for iTOL | class |
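The comma-separated prefix matching used by `query_filter` and `query_prefixes` can be illustrated with a small sketch (the `filter_by_prefix` helper is hypothetical, not the pipeline's implementation):

```python
def filter_by_prefix(ids, prefixes_csv):
    """Keep sequence IDs that start with any of the comma-separated
    prefixes, mirroring how query_filter / query_prefixes are written."""
    prefixes = tuple(p.strip() for p in prefixes_csv.split(",") if p.strip())
    return [i for i in ids if i.startswith(prefixes)]

ids = ["Hype|contig_50_1", "Klos|contig_3_2", "WP_000001.1"]
# filter_by_prefix(ids, "Hype,Klos")
# -> ["Hype|contig_50_1", "Klos|contig_3_2"]
```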
```groovy
params {
    resources {
        run_diamond_blastp {
            threads = 4
            mem_mb = 8000
            time = '10m'
        }
        // Additional resources in nextflow.config
    }
}
```

| Profile | Description |
|---|---|
| `standard` | Default (base configuration) |
| `local` | Local execution with 16 cores |
| `slurm` | SLURM cluster execution |
| `test` | Test profile with small database |
```bash
# Use custom config file
nextflow run main.nf -c my_custom_config.txt

# Override specific parameters
nextflow run main.nf --input_dir mydata --blast_hits 50

# Override multiple parameters
nextflow run main.nf \
    --input_dir mydata \
    --blast_db /path/to/custom/db \
    --closest_neighbors 20 \
    --output_dir custom_output
```

View all tasks with `pixi task list`:

| Task | Description |
|---|---|
| `test` | Run test pipeline with verification |
| `clean` | Clean test output and logs |
| `clean-all` | Clean all output directories |
| `shell` | Start interactive shell |
| `lint` | Lint Python scripts (dev env) |
| `format` | Format Python scripts (dev env) |
All scripts are in `bin/` and available on PATH when using `pixi shell`.
Process closest neighbors CSV files and add NCBI taxonomy:

```bash
python bin/parse_closest_neighbors.py -d <directory> -o <output_file>
```

Extract closest neighbors from a phylogenetic tree:

```bash
python bin/extract_closest_neighbors.py \
    --tree <tree_file> \
    --query <query_file> \
    --subjects <subjects_file> \
    --output <output_file> \
    --num_neighbors <N>
```

Extract phylogenetic neighbors with taxonomy for specific query prefixes:

```bash
python bin/extract_phylogenetic_neighbors.py \
    --tree <tree_file> \
    --query-prefixes <prefixes> \
    --output-json <json_file> \
    --output-csv <csv_file> \
    --num-neighbors <N>
```

| Option | Description |
|---|---|
| `--tree` | Path to tree file |
| `--query-prefixes` | Comma-separated query prefixes (e.g., "Hype,Klos") |
| `--output-json` | Output JSON file |
| `--output-csv` | Output CSV file |
| `--num-neighbors` | Neighbors per query (default: 5) |
| `--self-hit-threshold` | Distance threshold for self-hits (default: 0.001) |
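The interplay of `--num-neighbors` and `--self-hit-threshold` can be sketched as: drop leaves whose distance to the query falls below the threshold (the query's own tip in the tree), then keep the N nearest. A hypothetical `nearest_neighbors` helper illustrating this, not the script's actual code:

```python
def nearest_neighbors(distances, n=5, self_hit_threshold=0.001):
    """Pick the n nearest leaves by phylogenetic distance,
    skipping self-hits below the distance threshold.

    `distances` maps leaf name -> distance from the query.
    """
    candidates = [
        (name, d) for name, d in distances.items() if d >= self_hit_threshold
    ]
    candidates.sort(key=lambda item: item[1])
    return candidates[:n]

dists = {"query_copy": 0.0, "WP_A": 0.12, "WP_B": 0.05, "WP_C": 0.40}
# nearest_neighbors(dists, n=2) -> [("WP_B", 0.05), ("WP_A", 0.12)]
```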
Create tree visualizations with taxonomy:

```bash
python bin/decorate_tree.py <tree_file> <taxonomy_file> <query_file> <output_png> <itol_prefix>
```

Calculate phylogenetic statistics:

```bash
python bin/tree_stats.py <tree_file> <taxonomy_file> <query_file> <output_file>
```

This project is licensed under the MIT License - see the LICENSE file for details.
For questions, issues, or contributions, please open an issue on the GitHub repository.
Developed at Joint Genome Institute (JGI)