FilyCode/Enzyme-Finder-Parts

Enzyme Finder Python Pipelines

Overview

This pipeline clusters gene neighborhoods derived from EFI-GNT (Enzyme Function Initiative - Genome Neighborhood Tool) data. It identifies functionally similar or distinct genes and their genomic contexts to support comparative genomics and phylogenetic enzyme homology studies. Working from the SQLite data that EFI-GNT generates (often from EFI-EST BLAST searches), the pipeline classifies genes by their own attributes and those of their neighbors, then groups them with average-linkage hierarchical clustering.

Features

  • Comprehensive Feature Extraction: Extracts functional annotations (descriptions, PFAM, InterPro IDs) from both target (hit) genes and their genomic neighbors.
  • Weighted Feature Representation: Emphasizes the contribution of target genes and direct neighbors to the overall neighborhood similarity score.
  • Jaccard Distance Calculation: Quantifies dissimilarity between gene neighborhoods using a custom Jaccard distance metric that can optionally be computed in parallel for efficiency.
  • Hierarchical Clustering: Applies average linkage hierarchical clustering to group similar gene neighborhoods.
  • SSN Cluster Differentiation: Optionally performs clustering independently for each Sequence Similarity Network (SSN) cluster derived from EFI-GNT.
  • Neighborhood Collapsing: Reduces redundancy by identifying and collapsing highly similar or identical gene neighborhoods, simplifying complex visualizations.
  • Dynamic Dendrogram Visualization: Generates customizable dendrogram plots with various labeling options and highlighting for specific input sequences.
  • Detailed Reporting: Produces a comprehensive text report summarizing the pipeline configuration, clustering parameters, and detailed results.
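
To make the weighted-feature and Jaccard-distance ideas concrete, here is a minimal sketch of one way such a metric could work. It is not taken from the notebook: the role names, the weight values in `ROLE_WEIGHTS`, and both function names are assumptions made for illustration, and the notebook's actual weighting scheme may differ.

```python
from collections.abc import Iterable

# Hypothetical weights: the README states that target genes and direct
# neighbors count more, but it does not give the actual values.
ROLE_WEIGHTS = {"target": 3.0, "direct_neighbor": 2.0, "distant_neighbor": 1.0}


def weighted_features(genes: Iterable[tuple[str, Iterable[str]]]) -> dict[str, float]:
    """Map each annotation to the largest weight of any gene that carries it."""
    weights: dict[str, float] = {}
    for role, annotations in genes:
        w = ROLE_WEIGHTS[role]
        for annotation in annotations:
            weights[annotation] = max(weights.get(annotation, 0.0), w)
    return weights


def weighted_jaccard_distance(a: dict[str, float], b: dict[str, float]) -> float:
    """Return 1 - sum(min weights) / sum(max weights) over the feature union."""
    union = a.keys() | b.keys()
    if not union:
        return 0.0  # two empty neighborhoods are treated as identical
    shared = sum(min(a.get(f, 0.0), b.get(f, 0.0)) for f in union)
    total = sum(max(a.get(f, 0.0), b.get(f, 0.0)) for f in union)
    return 1.0 - shared / total
```

With all weights set to 1.0 this reduces to the ordinary Jaccard distance on sets, so the weighting only changes how much a mismatch on a target or direct-neighbor annotation costs.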

Input Data

The pipeline requires an SQLite database file generated by EFI-GNT's "Genome Neighborhood Diagrams for GNN" module. This database typically contains:

  • A main attributes table for hit gene (target protein) information.
  • A neighbors table for genes found in the genomic context of the hit genes.

Key columns expected include gene IDs, organism names, functional descriptions (desc), PFAM IDs (family), InterPro IDs (ipro_family), relative positions (rel_start, rel_stop), accession IDs (accession), and SSN cluster IDs (cluster_num).
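
As a rough illustration of reading these tables, the sketch below joins hit genes to their neighbors with Python's built-in `sqlite3` module. The table names `attributes` and `neighbors` and the listed columns come from the description above; the join keys `gene_key` and `sort_key` and the function name `load_neighborhoods` are assumptions, so check them against your own database schema before relying on this.

```python
import sqlite3
from collections import defaultdict


def load_neighborhoods(conn: sqlite3.Connection) -> dict[str, list[dict]]:
    """Return {hit accession: [neighbor rows as dicts]} ordered by position."""
    conn.row_factory = sqlite3.Row
    # "desc" is an SQL keyword, so it has to be quoted as an identifier.
    query = """
        SELECT a.accession AS hit_accession,
               a.cluster_num,
               n.accession AS neighbor_accession,
               n."desc" AS description,
               n.family,
               n.ipro_family,
               n.rel_start,
               n.rel_stop
        FROM attributes AS a
        JOIN neighbors AS n ON n.gene_key = a.sort_key  -- assumed join keys
        ORDER BY a.accession, n.rel_start
    """
    neighborhoods = defaultdict(list)
    for row in conn.execute(query):
        neighborhoods[row["hit_accession"]].append(dict(row))
    return dict(neighborhoods)
```

A typical call would be `load_neighborhoods(sqlite3.connect(SQLITE_DB_PATH))`, using the path configured in the notebook.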

Usage

The primary entry point for executing the pipeline is the provided Jupyter Notebook:

  • EFI-GNT_Clustering-Gene-Neighborhoods.ipynb

To use the pipeline:

  1. Open the Jupyter Notebook: Launch EFI-GNT_Clustering-Gene-Neighborhoods.ipynb in a Jupyter environment.
  2. Configure Parameters: Adjust the configuration variables in the notebook's initial cells:
    • SQLITE_DB_PATH: Path to your EFI-GNT SQLite database.
    • ORIGINAL_INPUT_SEQUENCE_ID: (Optional) UniProt ID of a sequence to highlight in the dendrograms.
    • DIFFERENTIATE_BY_SSN_CLUSTER: True to cluster neighborhoods separately within each SSN cluster, False to cluster all neighborhoods together.
    • chosen_distance_threshold: Jaccard distance threshold for cutting the dendrograms into clusters.
    • PARALLELIZE_PDIST_ENABLED: True to enable parallel computation of Jaccard distances (recommended for large datasets).
    • Collapsing Options: Adjust COLLAPSE_IDENTICAL_NEIGHBORHOODS_ACTIVE, COLLAPSE_CORE_SIMILARITY_THRESHOLD_ACTIVE, and COLLAPSE_FULL_NEIGHBORHOOD_SIMILARITY_THRESHOLD_ACTIVE as needed to manage redundancy.
    • Output Settings: Define OUTPUT_DIR, REPORT_FILENAME_BASE, OUTPUT_FORMATS, and DPI for plots and reports.
  3. Run Cells: Execute the notebook cells sequentially. The notebook will:
    • Load data from the specified SQLite database.
    • Extract and weight features for hit genes and their neighbors.
    • Optionally collapse similar neighborhoods.
    • Calculate Jaccard distances.
    • Perform hierarchical clustering.
    • Generate and save dendrogram plots.
    • Produce a detailed report of the clustering results.
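
The clustering step can be pictured with a small plain-Python stand-in. This sketch is not the notebook's implementation, which may well rely on a scientific library instead. It only shows what average linkage with a distance cut-off, in the role played by chosen_distance_threshold, does to a precomputed distance matrix. The function name `average_linkage_clusters` is hypothetical.

```python
from itertools import combinations


def average_linkage_clusters(dist: list[list[float]], threshold: float) -> list[list[int]]:
    """Merge the closest pair of clusters until no pair is within threshold."""
    clusters = [[i] for i in range(len(dist))]

    def avg(c1: list[int], c2: list[int]) -> float:
        # Average linkage: mean of all pairwise distances between members.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        (x, y), best = min(
            (((x, y), avg(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > threshold:
            break  # every remaining merge lies above the cut height
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]  # y > x, so index x stays valid
    return sorted(clusters)
```

This naive version recomputes every pairwise average on each pass, so it is only suitable for small inputs. A lower threshold yields more, tighter clusters, while a threshold of 1.0 merges everything whose Jaccard distance is defined.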

Output

The pipeline generates the following outputs:

  • Dendrogram Plots: Saved to the specified OUTPUT_DIR in formats like PDF, SVG, and PNG. These visualizations illustrate the hierarchical relationships between gene neighborhoods.
  • Clustering Report: A detailed text file (.txt) in the OUTPUT_DIR, named after REPORT_FILENAME_BASE, containing:
    • Pipeline configuration.
    • Overview of SSN cluster distribution.
    • Details of collapsed neighborhood groups (if enabled).
    • List of neighborhoods within each identified cluster.
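
For the collapsed-group section of the report, the simplest case (collapsing neighborhoods whose feature sets are identical) might look like the sketch below. The function names, the choice of the alphabetically first accession as representative, and the output wording are all assumptions. The notebook's threshold-based collapsing options are not covered here.

```python
from collections import defaultdict


def collapse_identical(neighborhoods: dict[str, set[str]]) -> dict[str, list[str]]:
    """Group neighborhoods with identical feature sets under one representative."""
    by_features: dict[frozenset[str], list[str]] = defaultdict(list)
    for accession in sorted(neighborhoods):
        by_features[frozenset(neighborhoods[accession])].append(accession)
    # The alphabetically first accession stands in for the whole group.
    return {members[0]: members for members in by_features.values()}


def format_collapsed_groups(groups: dict[str, list[str]]) -> str:
    """Render only the groups that actually absorbed other neighborhoods."""
    lines = []
    for representative, members in sorted(groups.items()):
        if len(members) > 1:
            lines.append(f"{representative}: collapsed with {', '.join(members[1:])}")
    return "\n".join(lines)
```

Only the representatives then need to enter the distance calculation, which shrinks the distance matrix and keeps the dendrogram readable.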

Authorship

This pipeline and its associated scripts were solely developed by Philipp Trollmann during his first PhD rotation in Dr. Pinghua Liu's lab at Boston University.

About

A repository of various Python pipelines to find enzyme homologues.
