NNGeneTree


NNGeneTree is a phylogenetic analysis and taxonomic classification pipeline for protein sequences. It builds gene trees, identifies the nearest phylogenetic neighbors of each query sequence, and assigns NCBI taxonomy for comprehensive evolutionary analysis.

Built with Nextflow - a dataflow-oriented workflow engine with built-in resume and reporting capabilities.


Table of Contents

  • Overview
  • Features
  • Requirements
  • Installation
  • Usage
  • Pipeline Workflow
  • Output Description
  • Configuration
  • Pixi Tasks
  • Scripts Documentation
  • License
  • Contact


Overview

NNGeneTree leverages the power of phylogenetic analysis to place query protein sequences in an evolutionary context and identify their closest neighbors in sequence space. This approach provides valuable insights into the functional and evolutionary relationships between proteins, complementing traditional similarity-based annotation methods.


Features

  • Automated workflow from protein sequences to annotated phylogenetic trees
  • Smart neighbor selection based on phylogenetic distance
  • NCBI taxonomy integration for comprehensive classification
  • Statistical analysis of phylogenetic relationships
  • Tree visualizations with taxonomic annotations
  • Detailed logging and reports for each analysis step
  • Local and HPC support (SLURM cluster deployment)
  • Modular design with Pixi package management

Requirements

  • Pixi (for environment and dependency management)
  • SLURM (optional, for cluster execution)

The pipeline automatically manages all required tools through Pixi:

| Tool | Purpose |
| --- | --- |
| Nextflow | Workflow management |
| DIAMOND | Fast protein similarity search |
| BLAST+ | Sequence extraction |
| MAFFT | Multiple sequence alignment |
| TrimAl | Alignment trimming |
| IQ-TREE | Phylogenetic tree construction |
| ETE Toolkit | Tree manipulation |
| BioPython | Sequence analysis and taxonomy retrieval |
| OpenJDK | Required for Nextflow |

Installation

Quick Start

# Clone the repository
git clone https://github.com/NeLLi-team/nngenetree.git
cd nngenetree

# Install Pixi (if not already installed)
curl -fsSL https://pixi.sh/install.sh | bash

# Install all dependencies
pixi install

All dependencies are now installed and managed by Pixi.


First-Time Setup

After installation, you must configure the database path for your system.

Option 1: Configuration File (Recommended)

# Copy the template
cp conf/local.config.template conf/local.config

# Edit with your settings
nano conf/local.config

Edit conf/local.config and set:

  • blast_db: Path to your DIAMOND-formatted NR database (without .dmnd extension)
  • entrez_email: Your email for NCBI Entrez API

Example:

params {
    blast_db = '/path/to/nr/database'
    entrez_email = 'your.email@example.com'
}

Option 2: Environment Variables

# Add to your ~/.bashrc or ~/.zshrc
export NR_DATABASE=/path/to/your/nr/database
export ENTREZ_EMAIL=your.email@example.com

SLURM Configuration (Optional)

If running on a SLURM cluster, add queue settings to conf/local.config:

process {
    queue = 'your_queue_name'
    clusterOptions = '--qos=your_qos --account=your_account'
}

Execution from Anywhere (Optional)

To run nngenetree from any directory:

# From the nngenetree repository directory
mkdir -p ~/bin
ln -s $(pwd)/nngenetree ~/bin/nngenetree

# Add ~/bin to PATH (if not already)
echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

# Test from any directory
cd /tmp && nngenetree test

Usage

Running the Pipeline

# Test mode with small built-in database
nngenetree test

# Run on your data locally
nngenetree my_input_dir local

# Run on SLURM cluster (default)
nngenetree my_input_dir slurm

Note: Use nngenetree directly if it is on your PATH; otherwise run bash nngenetree from the repository directory.

Nextflow Features

  • Automatic resume on failure (-resume)
  • HTML execution reports with resource usage
  • Built-in timeline and DAG visualizations
  • Cloud-ready (AWS, Azure, Google Cloud)

OrthoFinder Preprocessing (Optional)

NNGeneTree includes an optional preprocessing script for OrthoFinder integration. This runs separately before the main pipeline and allows you to:

  1. Identify orthogroups across multiple genomes using OrthoFinder
  2. Filter orthogroups by target protein IDs
  3. Create FASTA files for each orthogroup
  4. Use orthogroup FASTA files as input to NNGeneTree

Prerequisites

Genome files must follow this header format:

>{genome_id}|{contig_id}_{protein_id}

Example:

>Hype|contig_50_1
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTA...
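
This naming scheme can be decomposed programmatically. A minimal sketch, assuming contig IDs may themselves contain underscores (the split_header helper below is illustrative, not part of the pipeline):

```python
def split_header(header: str):
    """Split a '>{genome_id}|{contig_id}_{protein_id}' FASTA header."""
    genome_id, rest = header.lstrip(">").split("|", 1)
    # Contig IDs may contain underscores themselves, so split on the last one
    contig_id, protein_id = rest.rsplit("_", 1)
    return genome_id, contig_id, protein_id

print(split_header(">Hype|contig_50_1"))  # ('Hype', 'contig_50', '1')
```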

Running OrthoFinder Preprocessing

# Activate the pixi environment
pixi shell

# Basic usage - process all orthogroups
python bin/orthofinder_preprocess.py \
  --genomes_faa_dir path/to/genomes \
  --output_dir path/to/output

# Filter for specific proteins
python bin/orthofinder_preprocess.py \
  --genomes_faa_dir path/to/genomes \
  --output_dir path/to/output \
  --target "target_substring"

Complete Workflow Example

# Step 1: Run OrthoFinder preprocessing
pixi shell
python bin/orthofinder_preprocess.py \
  --genomes_faa_dir my_genomes/ \
  --output_dir my_orthogroups/ \
  --target "species1|contig_10_" \
  --threads 16
exit

# Step 2: Run NNGeneTree on the orthogroups
nngenetree my_orthogroups local

OrthoFinder Script Options

| Option | Description |
| --- | --- |
| --genomes_faa_dir | Directory containing genome FASTA files |
| --output_dir | Output directory for orthogroup FASTA files |
| --target | Comma-separated substrings to filter orthogroups |
| --orthofinder_results | Path to existing OrthoFinder results (skip re-running) |
| --threads | Number of threads for OrthoFinder (default: 16) |
| --force | Overwrite existing output directory |

Pipeline Workflow

INPUT FASTA FILES (.faa)
         |
         v
+---------------------+
| DIAMOND BLASTP      |  Fast protein similarity search (default: 20 hits/query)
+---------------------+
         |
         v
+---------------------+
| PROCESS & VALIDATE  |  Extract unique subjects and validate output
+---------------------+
         |
         v
+---------------------+
| EXTRACT SEQUENCES   |  Retrieve hit sequences using blastdbcmd
+---------------------+
         |
         v
+---------------------+
| COMBINE SEQUENCES   |  Merge query + hit sequences
+---------------------+
         |
         v
+---------------------+
| MAFFT ALIGNMENT     |  Multiple sequence alignment
+---------------------+
         |
         v
+---------------------+
| TRIMAL TRIMMING     |  Remove poorly aligned regions (gap threshold: 0.1)
+---------------------+
         |
         v
+---------------------+
| IQTREE              |  Build phylogenetic tree (LG+G4 model)
+---------------------+
         |
         +---------------------------+
         |                           |
         v                           v
+-----------------+     +------------------------+
| EXTRACT         |     | PHYLOGENETIC           |
| NEIGHBORS       |     | PLACEMENT              |
| (N=10 default)  |     +------------------------+
+-----------------+
         |
         v
+---------------------+
| ASSIGN TAXONOMY     |  Fetch NCBI taxonomy via Entrez API
+---------------------+
         |
         v
+---------------------+
| DECORATE TREE       |  Generate PNG visualizations with taxonomy
+---------------------+
         |
         v
+---------------------+
| TREE STATISTICS     |  Calculate phylogenetic statistics
+---------------------+
         |
         v
+---------------------+
| COMBINE RESULTS     |  Aggregate placement results to JSON
+---------------------+
         |
         v
     FINAL OUTPUT

Output Description

Results are saved in <input_dir>_output/. For each input FASTA file:

Sample Directory Structure

<sample>/
├── blast_results.m8              # DIAMOND BLAST tabular output
├── unique_subjects.txt           # List of unique hit accessions
├── check_blast_output.done       # Validation checkpoint
├── extracted_hits.faa            # Sequences of BLAST hits
├── combined_sequences.faa        # Combined query and hit sequences
├── aln/
│   ├── aligned_sequences.msa     # Raw MAFFT alignment
│   └── trimmed_alignment.msa     # TrimAl-trimmed alignment
├── tree/
│   ├── final_tree.treefile       # Newick tree file
│   ├── final_tree.iqtree         # IQ-TREE log file
│   ├── decorated_tree.png        # Visualization with taxonomy
│   └── tree_stats.tab            # Tree statistics
├── closest_neighbors.csv         # Neighbors with phylogenetic distances
├── closest_neighbors_with_taxonomy.csv  # Enhanced CSV with NCBI taxonomy
├── taxonomy_assignments.txt      # Taxonomy information (tabular)
├── placement_results.json        # Detailed neighbor info
├── placement_results.csv         # Placement results (CSV format)
└── itol/                         # Interactive Tree of Life files
    ├── itol_labels.txt
    ├── itol_colors.txt
    └── itol_ranges.txt
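
The blast_results.m8 file can be inspected directly. A minimal parsing sketch, assuming the default 12-column BLAST/DIAMOND tabular format (outfmt 6); if your DIAMOND invocation customizes the output fields, adjust the column list accordingly:

```python
# Standard BLAST tabular (outfmt 6) columns — an assumption; verify against
# the actual DIAMOND command used by the pipeline.
M8_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
              "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def parse_m8_line(line):
    """Turn one tab-separated hit line into a column-name -> value dict."""
    return dict(zip(M8_COLUMNS, line.rstrip("\n").split("\t")))

hit = parse_m8_line("Hype|contig_50_1\tWP_000001.1\t98.5\t200\t3\t0"
                    "\t1\t200\t1\t200\t1e-100\t350\n")
print(hit["sseqid"], hit["evalue"])  # WP_000001.1 1e-100
```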

Aggregated Output

  • combined_placement_results.json: All placement results across samples

Completion Log

A log file (<input_dir>_output_completion.log) contains:

  • Pipeline version and runtime information
  • BLAST hit counts for each sample
  • Taxonomy distribution statistics (domains, phyla, classes, orders, families, genera)
  • Tree generation status

Configuration

Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| input_dir | Directory containing input .faa files | test |
| output_dir | Override default output directory | {input_dir}_output |
| blast_db | Path to BLAST/DIAMOND database | (from local.config) |
| blast_hits | Number of BLAST hits per query | 5 |
| closest_neighbors | Number of closest neighbors to extract | 5 |
| query_filter | Comma-separated query prefixes to filter | - |
| query_prefixes | Prefixes for phylogenetic placement | Hype,Klos |
| num_neighbors_placement | Neighbors for placement | 5 |
| itol_tax_level | Taxonomy level for iTOL | class |

Resource Configuration

params {
  resources {
    run_diamond_blastp {
      threads = 4
      mem_mb = 8000
      time = '10m'
    }
    // Additional resources in nextflow.config
  }
}

Execution Profiles

| Profile | Description |
| --- | --- |
| standard | Default (base configuration) |
| local | Local execution with 16 cores |
| slurm | SLURM cluster execution |
| test | Test profile with small database |

Override Configuration

# Use custom config file
nextflow run main.nf -c my_custom_config.txt

# Override specific parameters
nextflow run main.nf --input_dir mydata --blast_hits 50

# Override multiple parameters
nextflow run main.nf \
  --input_dir mydata \
  --blast_db /path/to/custom/db \
  --closest_neighbors 20 \
  --output_dir custom_output

Pixi Tasks

View all tasks with pixi task list:

| Task | Description |
| --- | --- |
| test | Run test pipeline with verification |
| clean | Clean test output and logs |
| clean-all | Clean all output directories |
| shell | Start interactive shell |
| lint | Lint Python scripts (dev env) |
| format | Format Python scripts (dev env) |

Scripts Documentation

All scripts are in bin/ and available in PATH when using pixi shell.

parse_closest_neighbors.py

Process closest neighbors CSV files and add NCBI taxonomy:

python bin/parse_closest_neighbors.py -d <directory> -o <output_file>

extract_closest_neighbors.py

Extract closest neighbors from a phylogenetic tree:

python bin/extract_closest_neighbors.py \
  --tree <tree_file> \
  --query <query_file> \
  --subjects <subjects_file> \
  --output <output_file> \
  --num_neighbors <N>

extract_phylogenetic_neighbors.py

Extract phylogenetic neighbors with taxonomy for specific query prefixes:

python bin/extract_phylogenetic_neighbors.py \
  --tree <tree_file> \
  --query-prefixes <prefixes> \
  --output-json <json_file> \
  --output-csv <csv_file> \
  --num-neighbors <N>

| Option | Description |
| --- | --- |
| --tree | Path to tree file |
| --query-prefixes | Comma-separated query prefixes (e.g., "Hype,Klos") |
| --output-json | Output JSON file |
| --output-csv | Output CSV file |
| --num-neighbors | Neighbors per query (default: 5) |
| --self-hit-threshold | Distance threshold for self-hits (default: 0.001) |
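
The self-hit threshold exists because a query's closest leaf in the tree is often the query itself (or an identical sequence) at near-zero branch distance. A minimal sketch of the filtering idea, using a hypothetical list of (leaf name, distance) pairs rather than the script's actual internals:

```python
SELF_HIT_THRESHOLD = 0.001  # matches the documented --self-hit-threshold default

def drop_self_hits(neighbors, threshold=SELF_HIT_THRESHOLD):
    """Discard neighbors closer than the threshold; those are assumed to be
    the query itself or an exact duplicate of it."""
    return [(name, dist) for name, dist in neighbors if dist >= threshold]

neighbors = [("Hype|contig_50_1", 0.0), ("WP_000001.1", 0.42), ("WP_000002.1", 0.57)]
print(drop_self_hits(neighbors))  # the zero-distance self match is removed
```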

decorate_tree.py

Create tree visualizations with taxonomy:

python bin/decorate_tree.py <tree_file> <taxonomy_file> <query_file> <output_png> <itol_prefix>

tree_stats.py

Calculate phylogenetic statistics:

python bin/tree_stats.py <tree_file> <taxonomy_file> <query_file> <output_file>

License

This project is licensed under the MIT License - see the LICENSE file for details.


Contact

For questions, issues, or contributions, please open an issue on the GitHub repository.


Developed at Joint Genome Institute (JGI)
