This pipeline clusters gene neighborhoods derived from EFI-GNT (Enzyme Function Initiative - Genome Neighborhood Tool) data. It identifies functionally similar or distinct genes and their genomic contexts, supporting comparative genomics and phylogenetic enzyme homology studies. Working from the SQLite data generated by EFI-GNT (often from EFI-EST BLAST searches), the pipeline classifies genes based on their own attributes and those of their neighbors, then groups the resulting neighborhoods by average-linkage hierarchical clustering on Jaccard distances.
- Comprehensive Feature Extraction: Extracts functional annotations (descriptions, PFAM, InterPro IDs) from both target (hit) genes and their genomic neighbors.
- Weighted Feature Representation: Emphasizes the contribution of target genes and direct neighbors to the overall neighborhood similarity score.
- Jaccard Distance Calculation: Quantifies dissimilarity between gene neighborhoods using a custom, parallelized Jaccard distance metric for efficiency.
- Hierarchical Clustering: Applies average linkage hierarchical clustering to group similar gene neighborhoods.
- SSN Cluster Differentiation: Optionally performs clustering independently for each Sequence Similarity Network (SSN) cluster derived from EFI-GNT.
- Neighborhood Collapsing: Reduces redundancy by identifying and collapsing highly similar or identical gene neighborhoods, simplifying complex visualizations.
- Dynamic Dendrogram Visualization: Generates customizable dendrogram plots with various labeling options and highlighting for specific input sequences.
- Detailed Reporting: Produces a comprehensive text report summarizing the pipeline configuration, clustering parameters, and detailed results.
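The weighting and distance features listed above can be illustrated with a small sketch. It is not the notebook's implementation: the weights (3 for the hit gene, 2 for direct neighbors, 1 for all others), the helper names `weighted_features` and `jaccard_distance`, the use of a gene offset to decide what counts as a direct neighbor, and the weighted (min/max) form of the Jaccard index are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical weights; the notebook's actual values may differ.
HIT_WEIGHT = 3
DIRECT_NEIGHBOR_WEIGHT = 2
OTHER_NEIGHBOR_WEIGHT = 1


def weighted_features(hit_annotations, neighbors):
    """Build a weighted bag of annotation tokens for one neighborhood.

    hit_annotations: iterable of annotation strings (e.g. PFAM or InterPro IDs).
    neighbors: iterable of (offset, annotations) pairs, where offset is the
        gene's position relative to the hit (-1 and +1 are direct neighbors).
    """
    bag = Counter()
    for token in hit_annotations:
        bag[token] += HIT_WEIGHT
    for offset, annotations in neighbors:
        weight = DIRECT_NEIGHBOR_WEIGHT if abs(offset) == 1 else OTHER_NEIGHBOR_WEIGHT
        for token in annotations:
            bag[token] += weight
    return bag


def jaccard_distance(bag_a, bag_b):
    """Weighted Jaccard distance: 1 - sum(min weights) / sum(max weights)."""
    tokens = set(bag_a) | set(bag_b)
    if not tokens:
        return 0.0  # two empty neighborhoods are treated as identical
    shared = sum(min(bag_a[t], bag_b[t]) for t in tokens)
    total = sum(max(bag_a[t], bag_b[t]) for t in tokens)
    return 1.0 - shared / total


# Two toy neighborhoods that share the hit and one direct neighbor.
a = weighted_features(["PF00001"], [(-1, ["PF00002"]), (2, ["PF00003"])])
b = weighted_features(["PF00001"], [(1, ["PF00002"]), (3, ["PF00004"])])
print(jaccard_distance(a, b))
```

With these toy inputs the shared weight is 5 and the combined weight is 7, so the distance is 2/7. Because the hit and its direct neighbors carry more weight, disagreement among distant neighbors moves the distance less than disagreement near the hit.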
The pipeline requires an SQLite database file generated by EFI-GNT's "Genome Neighborhood Diagrams for GNN" module. This database typically contains:
- A main `attributes` table for hit gene (target protein) information.
- A `neighbors` table for genes found in the genomic context of the hit genes.
Key columns expected include gene IDs, organism names, functional descriptions (desc), PFAM IDs (family), InterPro IDs (ipro_family), relative positions (rel_start, rel_stop), accession IDs (accession), and SSN cluster IDs (cluster_num).
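As a rough sketch of how these two tables might be read with Python's standard `sqlite3` module, the example below builds a tiny in-memory database and groups neighbor rows under their hit gene. The table names and the `desc`, `family`, `ipro_family`, `rel_start`, `rel_stop`, `accession`, and `cluster_num` columns come from the description above. The linking columns `sort_key` and `gene_key`, the function name `load_neighborhoods`, and all row values are assumptions for illustration, so check them against your own database. Note that `desc` is an SQL keyword and has to be quoted.

```python
import sqlite3


def load_neighborhoods(conn):
    """Return {hit key: {"hit": row dict, "neighbors": [row dicts]}}."""
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    neighborhoods = {}
    for hit in conn.execute(
        'SELECT sort_key, accession, "desc", family, ipro_family, cluster_num '
        "FROM attributes"
    ):
        neighborhoods[hit["sort_key"]] = {"hit": dict(hit), "neighbors": []}
    for nb in conn.execute(
        'SELECT gene_key, accession, "desc", family, ipro_family, rel_start, rel_stop '
        "FROM neighbors ORDER BY gene_key, rel_start"
    ):
        # Skip neighbors whose hit gene is missing from the attributes table.
        if nb["gene_key"] in neighborhoods:
            neighborhoods[nb["gene_key"]]["neighbors"].append(dict(nb))
    return neighborhoods


# Toy database with placeholder values (not real EFI-GNT output).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE attributes (sort_key INTEGER, accession TEXT, "desc" TEXT,
                             family TEXT, ipro_family TEXT, cluster_num INTEGER);
    CREATE TABLE neighbors (gene_key INTEGER, accession TEXT, "desc" TEXT,
                            family TEXT, ipro_family TEXT,
                            rel_start INTEGER, rel_stop INTEGER);
    INSERT INTO attributes VALUES (1, 'HIT0001', 'example hit', 'PF00001', 'IPR000001', 1);
    INSERT INTO neighbors VALUES (1, 'NBR0001', 'example neighbor', 'PF00002', 'IPR000002', -900, -100);
    """
)
result = load_neighborhoods(conn)
conn.close()
```

For a real run, the in-memory connection would be replaced by `sqlite3.connect(SQLITE_DB_PATH)`.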
The primary entry point for executing the pipeline is the provided Jupyter Notebook:
EFI-GNT_Clustering-Gene-Neighborhoods.ipynb
To use the pipeline:
- Open the Jupyter Notebook: Launch `EFI-GNT_Clustering-Gene-Neighborhoods.ipynb` in a Jupyter environment.
- Configure Parameters: Adjust the configuration variables in the notebook's initial cells:
  - `SQLITE_DB_PATH`: Path to your EFI-GNT SQLite database.
  - `ORIGINAL_INPUT_SEQUENCE_ID`: (Optional) UniProt ID of a sequence to highlight in the dendrograms.
  - `DIFFERENTIATE_BY_SSN_CLUSTER`: `True` to cluster within each SSN, `False` to cluster all neighborhoods together.
  - `chosen_distance_threshold`: Jaccard distance threshold for cutting the dendrograms into clusters.
  - `PARALLELIZE_PDIST_ENABLED`: `True` to enable parallel computation of Jaccard distances (recommended for large datasets).
  - Collapsing Options: Adjust `COLLAPSE_IDENTICAL_NEIGHBORHOODS_ACTIVE`, `COLLAPSE_CORE_SIMILARITY_THRESHOLD_ACTIVE`, and `COLLAPSE_FULL_NEIGHBORHOOD_SIMILARITY_THRESHOLD_ACTIVE` as needed to manage redundancy.
  - Output Settings: Define `OUTPUT_DIR`, `REPORT_FILENAME_BASE`, `OUTPUT_FORMATS`, and `DPI` for plots and reports.
- Run Cells: Execute the notebook cells sequentially. The script will:
  - Load data from the specified SQLite database.
  - Extract and weight features for hit genes and their neighbors.
  - Optionally collapse similar neighborhoods.
  - Calculate Jaccard distances.
  - Perform hierarchical clustering.
  - Generate and save dendrogram plots.
  - Produce a detailed report of the clustering results.
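The clustering step and the role of `chosen_distance_threshold` can be sketched with SciPy. This is a minimal example under stated assumptions, not the notebook's code: the distance matrix, the neighborhood labels, and the threshold value of 0.5 are made up, and the example assumes SciPy's `linkage` and `fcluster` are a reasonable stand-in for what the notebook does.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

chosen_distance_threshold = 0.5  # example value only, not a recommendation

# Toy symmetric Jaccard distance matrix for four hypothetical neighborhoods.
labels = ["nbh_A", "nbh_B", "nbh_C", "nbh_D"]
distances = np.array(
    [
        [0.0, 0.1, 0.8, 0.9],
        [0.1, 0.0, 0.7, 0.9],
        [0.8, 0.7, 0.0, 0.2],
        [0.9, 0.9, 0.2, 0.0],
    ]
)

# linkage() expects the condensed (upper-triangle) form of the matrix.
condensed = squareform(distances)
Z = linkage(condensed, method="average")

# Cut the tree wherever the merge distance exceeds the threshold.
cluster_ids = fcluster(Z, t=chosen_distance_threshold, criterion="distance")

clusters = {}
for label, cluster_id in zip(labels, cluster_ids):
    clusters.setdefault(int(cluster_id), []).append(label)
print(clusters)
```

Here `nbh_A` and `nbh_B` merge at 0.1, `nbh_C` and `nbh_D` merge at 0.2, and the two pairs only join at an average distance of 0.825, so a cut at 0.5 yields two clusters. Lowering the threshold produces more, tighter clusters; raising it merges them.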
The pipeline generates the following outputs:
- Dendrogram Plots: Saved to the specified `OUTPUT_DIR` in formats like PDF, SVG, and PNG. These visualizations illustrate the hierarchical relationships between gene neighborhoods.
- Clustering Report: A detailed text file (`.txt`) named based on `REPORT_FILENAME_BASE` in the `OUTPUT_DIR`, containing:
  - Pipeline configuration.
  - Overview of SSN cluster distribution.
  - Details of collapsed neighborhood groups (if enabled).
  - List of neighborhoods within each identified cluster.
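To show how `OUTPUT_DIR`, `OUTPUT_FORMATS`, and `DPI` could interact when plots are written, here is a small sketch using SciPy and Matplotlib. The file name `dendrogram_example`, the toy distance matrix, the labels, and the threshold line are assumptions for illustration; the notebook's own plotting code and naming scheme may differ.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # render to files without needing a display
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Example settings; a temporary directory stands in for a real OUTPUT_DIR.
OUTPUT_DIR = tempfile.mkdtemp()
OUTPUT_FORMATS = ["pdf", "svg", "png"]
DPI = 300
chosen_distance_threshold = 0.5  # example value only

labels = ["nbh_A", "nbh_B", "nbh_C", "nbh_D"]
distances = np.array(
    [
        [0.0, 0.1, 0.8, 0.9],
        [0.1, 0.0, 0.7, 0.9],
        [0.8, 0.7, 0.0, 0.2],
        [0.9, 0.9, 0.2, 0.0],
    ]
)
Z = linkage(squareform(distances), method="average")

fig, ax = plt.subplots(figsize=(6, 3))
dendrogram(Z, labels=labels, ax=ax)
# Mark where the tree would be cut into clusters.
ax.axhline(chosen_distance_threshold, linestyle="--", color="grey")
ax.set_ylabel("Jaccard distance")
fig.tight_layout()

saved_paths = []
for fmt in OUTPUT_FORMATS:
    path = os.path.join(OUTPUT_DIR, f"dendrogram_example.{fmt}")
    fig.savefig(path, dpi=DPI)  # dpi mainly affects raster formats such as PNG
    saved_paths.append(path)
plt.close(fig)
```

Saving the same figure once per format keeps the vector (PDF, SVG) and raster (PNG) outputs visually identical.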
This pipeline and its associated scripts were solely developed by Philipp Trollmann during his first PhD rotation in Dr. Pinghua Liu's lab at Boston University.