diff --git a/.codeboarding/Analysis_Visualization.html b/.codeboarding/Analysis_Visualization.html new file mode 100644 index 0000000..37b6306 --- /dev/null +++ b/.codeboarding/Analysis_Visualization.html @@ -0,0 +1,423 @@ + + +
+ + +The `Analysis & Visualization` component serves as an umbrella for functionality related to protein metrics, visualization, and the management of their underlying external dependencies. Based on the analysis summary and the related classes and methods, it can be broken down into four fundamental sub-components, each representing a distinct and crucial aspect of the `proteinflow` library: `Data Management`, `Visualization`, `Metrics and Analysis`, and `External Dependencies and Utilities`.
+ + +This component is responsible for defining and managing the core data structures that represent protein information. It handles the loading of protein entries from various formats, such as PDB files and serialized pickle files, and provides foundational data objects for the entire system. The class hierarchy shows `SAbDabEntry` inheriting from `PDBEntry`, indicating a structured approach to handling different protein data types.
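The inheritance relationship noted above can be illustrated with a minimal, self-contained sketch. The class names mirror proteinflow's, but the attributes and methods are simplified stand-ins, not the library's real API.

```python
# Illustrative sketch only: proteinflow's real classes parse files and
# carry far more state than these stand-ins.
class PDBEntry:
    """Base container for a parsed PDB structure (simplified)."""

    def __init__(self, pdb_id):
        self.pdb_id = pdb_id

    def chains(self):
        # A real entry would return chain IDs parsed from the file.
        return []


class SAbDabEntry(PDBEntry):
    """Antibody-specific entry extending the base class (simplified)."""

    def __init__(self, pdb_id, heavy_chain, light_chain):
        super().__init__(pdb_id)
        self.heavy_chain = heavy_chain
        self.light_chain = light_chain

    def chains(self):
        # Antibody entries know their heavy- and light-chain labels.
        return [self.heavy_chain, self.light_chain]


entry = SAbDabEntry("7abc", heavy_chain="H", light_chain="L")
assert isinstance(entry, PDBEntry)  # specialization of the base entry
```

The point is structural: antibody-specific behavior layers on top of the generic PDB representation rather than duplicating it.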
+Related classes/methods: `SAbDabEntry` (1:1), `PDBEntry` (1:1)
+
+This component focuses on the graphical representation and animation of protein structures. It takes processed protein data and renders it for user viewing, offering functionality such as showing animations from PDB or pickle files and merging multiple protein structures for combined display.
+Related classes/methods: `visualize` (1:1)
+
+This component offers a comprehensive suite of computational tools for analyzing protein sequences and structures. It includes functions for calculating various biological and structural metrics (e.g., BLOSUM62 score, TM-score, language-model perplexity) and for integrating with external models for structure generation (e.g., ESMFold, IgFold).
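As a concrete illustration of substitution-matrix scoring (one of the metrics listed above), the sketch below sums pairwise scores over two pre-aligned sequences. The three-residue matrix is a toy stand-in; proteinflow's metric would use the full BLOSUM62 table.

```python
# Toy three-residue alphabet; a real implementation would use the full
# BLOSUM62 table (20 amino acids plus ambiguity codes).
TOY_MATRIX = {
    ("A", "A"): 4, ("A", "G"): 0, ("G", "G"): 6,
    ("A", "V"): 0, ("G", "V"): -3, ("V", "V"): 4,
}

def substitution_score(seq1, seq2):
    """Sum pairwise substitution scores over two pre-aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to equal length")
    total = 0
    for a, b in zip(seq1, seq2):
        # The matrix is symmetric, so try either key orientation.
        total += TOY_MATRIX.get((a, b), TOY_MATRIX.get((b, a), 0))
    return total

print(substitution_score("AGV", "AGV"))  # 4 + 6 + 4 = 14
```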
+Related classes/methods: `metrics` (1:1)
+
+This component manages optional external dependencies and provides general utility functions. Its primary roles include checking for the availability of required external packages (`requires_extra`) and facilitating the acquisition of visualization views (`_get_view`). It acts as an abstraction layer, ensuring that core functionality degrades gracefully when optional integrations are absent.
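The optional-dependency guard can be sketched as a decorator. proteinflow does expose `requires_extra`, but its exact signature and behavior are not shown here, so treat this stdlib version as an illustration of the pattern rather than the library's implementation.

```python
import functools
import importlib.util

def requires_extra(module_name):
    """Decorator: defer an ImportError until the guarded call (sketch)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Check importability lazily, so merely importing the module
            # that defines the guarded function never fails.
            if importlib.util.find_spec(module_name) is None:
                raise ImportError(
                    f"{func.__name__} needs optional package '{module_name}'"
                )
            return func(*args, **kwargs)
        return wrapper
    return decorator

@requires_extra("json")  # stdlib module, so always available
def always_works():
    return "ok"

@requires_extra("surely_not_installed_pkg")  # fails on call, not on import
def never_works():
    return "unreachable"

print(always_works())  # ok
```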
+Related classes/methods: `requires_extra` (1:1), `_get_view` (1:1)
+
+The `Core Data Management` component is fundamental to `proteinflow` because it establishes the initial pipeline for acquiring, structuring, and preparing raw protein data. It ensures that all subsequent operations, such as feature extraction and model training, have access to high-quality, standardized input. Without these foundational steps, the project would lack the data integrity and accessibility needed to function effectively.
+ + +This component is responsible for fetching raw protein data (PDB and SAbDab files) from external databases and managing their local storage. It acts as the primary entry point for data acquisition.
+Related classes/methods: `proteinflow.download` (1:1)
+
+This class serves as the foundational data structure for parsing and representing information from standard PDB or mmCIF files. It extracts atomic coordinates, sequences, and basic structural properties, including initial ligand information.
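To make the parsing role concrete, here is a minimal reader for a single fixed-width ATOM record. Real parsers (including whatever proteinflow builds on) also handle altlocs, insertion codes, HETATM records, and mmCIF; this sketch only shows where coordinates and identifiers live in the record.

```python
def parse_atom_line(line):
    """Parse one fixed-width PDB ATOM record into a small dict (sketch)."""
    return {
        "atom_name": line[12:16].strip(),   # columns 13-16
        "res_name": line[17:20].strip(),    # columns 18-20
        "chain_id": line[21],               # column 22
        "res_seq": int(line[22:26]),        # columns 23-26
        # x, y, z occupy three 8-character fields (columns 31-54).
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }

record = "ATOM      1  N   MET A   1      38.198  19.582  43.096  1.00 24.46           N"
atom = parse_atom_line(record)
print(atom["chain_id"], atom["res_name"], atom["xyz"][0])  # A MET 38.198
```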
+Related classes/methods: `proteinflow.data.PDBEntry` (1:1)
+
+Extending `PDBEntry`, this specialized class handles antibody structures from the SAbDab database. It adds logic for identifying Complementarity-Determining Regions (CDRs) and managing antibody chain types, building upon the base PDB structure.
+Related classes/methods: `proteinflow.data.SAbDabEntry` (1:1)
+
+This is the central, standardized data model that aggregates and processes information from `PDBEntry` and `SAbDabEntry`. It represents the cleaned, filtered, and unified protein data, ready for feature extraction and downstream analysis.
+Related classes/methods: `proteinflow.data.ProteinEntry` (1:1)
+
+This module is dedicated to the identification, parsing, and detailed processing of ligand molecules associated with protein structures. It handles tasks such as extracting ligand data from PDB files and managing their chemical properties.
+Related classes/methods: `proteinflow.ligand` (1:1)
+
+This component orchestrates the overall data-processing pipeline. It manages the filtering, cleaning, and conversion of raw protein entries into standardized `ProteinEntry` objects, ensuring data quality and preparing the data for further use.
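The filter-then-convert flow described above can be sketched generically. The predicate names and dict-based entries below are hypothetical, not proteinflow's actual checks, which include many more quality filters (resolution cutoffs, missing residues, and so on).

```python
def process_entries(raw_entries, filters, convert):
    """Keep entries that pass every filter, then convert the survivors."""
    return [convert(e) for e in raw_entries if all(f(e) for f in filters)]

# Hypothetical toy checks standing in for real quality filters.
def min_length(entry):
    return len(entry["sequence"]) >= 3

def known_residues(entry):
    # Only the 20 standard one-letter amino-acid codes are allowed.
    return set(entry["sequence"]) <= set("ACDEFGHIKLMNPQRSTVWY")

raw = [
    {"id": "good", "sequence": "ACDEF"},
    {"id": "short", "sequence": "AC"},   # fails min_length
    {"id": "odd", "sequence": "ACX"},    # fails known_residues ('X')
]
print(process_entries(raw, [min_length, known_residues], lambda e: e["id"]))
# ['good']
```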
+Related classes/methods: `proteinflow.processing` (1:1)
+
+This subsystem focuses on organizing and partitioning processed protein data into distinct train, validation, and test sets, often employing clustering techniques to ensure diverse and representative splits. It also provides PyTorch-compatible `Dataset` and `DataLoader` classes for efficient batching and preparation of data, making it ready for machine-learning model training and evaluation.
+ + +This module orchestrates the division of the protein dataset into training, validation, and test sets. It employs advanced strategies, including sequence and structural similarity-based clustering (e.g., using MMseqs2 and Foldseek), to ensure robust data separation and prevent data leakage, crucial for unbiased model evaluation.
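The leakage-prevention idea is to assign whole similarity clusters, not individual entries, to a split. The sketch below takes precomputed clusters as input (in proteinflow they would come from MMseqs2/Foldseek clustering); the cluster and entry names are made up.

```python
import random

def split_by_cluster(clusters, valid_frac=0.1, test_frac=0.1, seed=0):
    """Assign whole clusters to train/valid/test sets.

    Because similar entries share a cluster, no near-duplicate can end up
    on both sides of a split boundary. `clusters` maps cluster name to a
    list of entry identifiers.
    """
    names = sorted(clusters)
    random.Random(seed).shuffle(names)  # deterministic shuffle
    n = len(names)
    n_test = max(1, int(n * test_frac))
    n_valid = max(1, int(n * valid_frac))
    test = {e for c in names[:n_test] for e in clusters[c]}
    valid = {e for c in names[n_test:n_test + n_valid] for e in clusters[c]}
    train = {e for c in names[n_test + n_valid:] for e in clusters[c]}
    return train, valid, test

# Hypothetical clusters of biounit identifiers.
clusters = {
    "c1": ["1abc_A", "2def_B"], "c2": ["3ghi_A"],
    "c3": ["4jkl_C", "4jkl_D"], "c4": ["5mno_A"],
}
train, valid, test = split_by_cluster(clusters)
assert not (train & valid) and not (train & test) and not (valid & test)
```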
+Related classes/methods: `proteinflow.split` (0:0), `proteinflow.split.utils` (0:0), `proteinflow.split.split_data` (0:0), `proteinflow.split._build_dataset_partition` (0:0), `proteinflow.split._split_dataset_with_graphs` (0:0), `proteinflow.split._get_split_dictionaries` (0:0)
+
+This module provides the necessary PyTorch-compatible `Dataset` and `DataLoader` classes, facilitating seamless integration of processed protein data with deep-learning models. It handles efficient data loading, batching, and preparation for training and evaluation.
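The Dataset/DataLoader protocol boils down to `__len__`, `__getitem__`, and a batching iterator. This stdlib sketch mirrors the pattern without importing PyTorch; proteinflow's `ProteinDataset`/`ProteinLoader` add real tensor featurization and padding on top of it.

```python
class ToyProteinDataset:
    """Map-style dataset: __len__ plus index-based __getitem__."""

    def __init__(self, entries):
        self._entries = entries

    def __len__(self):
        return len(self._entries)

    def __getitem__(self, index):
        return self._entries[index]


def batches(dataset, batch_size):
    """Yield consecutive fixed-size batches, like a non-shuffling loader."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


ds = ToyProteinDataset(["seq1", "seq2", "seq3", "seq4", "seq5"])
print([len(b) for b in batches(ds, 2)])  # [2, 2, 1]
```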
+Related classes/methods: `proteinflow.data.torch` (0:0), `proteinflow.data.torch.ProteinDataset` (242:1131), `proteinflow.data.torch.ProteinLoader` (67:239)
+
+This fundamental component defines the structure for encapsulating all relevant information about a single protein entry, including sequence, coordinates, chain IDs, and associated ligand data. It provides methods for parsing, validating, and extracting specific features, serving as the core data representation throughout the data-preparation pipeline. `proteinflow.data.SAbDabEntry` inherits from `proteinflow.data.PDBEntry`, extending the base protein data structure for antibody-specific entries.
+Related classes/methods: `proteinflow.data` (0:0), `proteinflow.data.utils` (0:0), `proteinflow.data.PDBEntry` (0:0), `proteinflow.data.SAbDabEntry` (0:0), `proteinflow.data.utils.from_pickle` (0:0), `proteinflow.data.utils.to_pdb` (0:0), `proteinflow.data.utils.get_chains` (0:0), `proteinflow.data.utils.get_sequence` (0:0), `proteinflow.data.utils.get_coordinates` (0:0), `proteinflow.data.utils.retrieve_ligands_from_pickle` (0:0)
+
+This module specializes in handling ligand-related data within protein entries. It includes functionality for loading ligand information (e.g., SMILES strings) and performing chemical-similarity-based clustering, which can be integrated into data-splitting strategies.
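Tanimoto clustering rests on a simple similarity measure: intersection over union of fingerprint bits. The sketch below computes it on plain Python sets; in practice the fingerprints would be derived from SMILES strings with a cheminformatics library, and the bit sets here are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity: intersection over union of bits."""
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints are identical
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Hypothetical "on" bits for two similar ligands' fingerprints.
ligand_a = {1, 5, 9, 12}
ligand_b = {1, 5, 9, 30}
print(tanimoto(ligand_a, ligand_b))  # 3 shared of 5 total bits -> 0.6
```

A clustering pass would then group ligands whose pairwise Tanimoto similarity exceeds a chosen threshold, so that chemically similar ligands stay in the same data split.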
+Related classes/methods: `proteinflow.ligand` (0:0), `proteinflow.ligand._load_smiles` (653:678), `proteinflow.ligand._merge_chains_ligands` (694:737), `proteinflow.ligand._run_tanimoto_clustering` (983:1001)
+
+This module provides a suite of helper functions that support the intricate logic within the `Data Splitting Module`. These utilities handle tasks such as finding correspondences between protein chains, loading PDB files, merging chains, and managing biounit information during the data-splitting process.
+Related classes/methods: `proteinflow.split.utils` (0:0), `proteinflow.split.utils._find_correspondences` (139:149), `proteinflow.split.utils._load_pdbs` (72:99), `proteinflow.split.utils._merge_chains` (25:69), `proteinflow.split.utils._biounits_in_clusters_dict` (152:164)
+
+The CLI Interface is fundamental because it is the user's gateway to the entire ProteinFlow system. Without it, users could not initiate or control any of the data-pipeline operations. It abstracts away the underlying complexity of the data-processing components, providing a simplified, unified command-line experience. Its role as an orchestrator and dispatcher is critical for coordinating data-related tasks (downloading, generating, splitting) in a structured manner. Its integration with Logging and Reporting is equally vital: it gives users the feedback loop they need to understand the status and outcomes of the processes they initiate, making the system robust and user-friendly.
+ + +The CLI Interface serves as the primary command-line entry point for users to interact with the ProteinFlow data pipeline. Its fundamental role is to orchestrate the entire data processing workflow by translating user commands into specific actions. It acts as a dispatcher, invoking the appropriate backend functions from other core components such as the Data Downloader, Data Generator, and Data Splitter. Furthermore, it integrates with the Logging and Reporting component to provide operational feedback, status updates, and error summaries to the user, ensuring transparency and aiding in debugging. This component is crucial because it provides the user-facing control mechanism, making the complex data pipeline accessible and manageable.
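The dispatcher role can be sketched with stdlib `argparse`: parse a subcommand, then route to the matching backend stub. The subcommand names and the `--tag` option are illustrative; the real proteinflow CLI may be organized differently.

```python
import argparse

def build_parser():
    # Hypothetical subcommands standing in for the real CLI's verbs.
    parser = argparse.ArgumentParser(prog="proteinflow-sketch")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in ("download", "generate", "split"):
        sub.add_parser(name).add_argument("--tag", default="latest")
    return parser

def dispatch(argv):
    """Translate a parsed command into a backend call (stubbed here)."""
    args = build_parser().parse_args(argv)
    backends = {
        "download": lambda a: f"downloading dataset {a.tag}",
        "generate": lambda a: f"generating dataset {a.tag}",
        "split": lambda a: f"splitting dataset {a.tag}",
    }
    return backends[args.command](args)

print(dispatch(["download", "--tag", "test"]))  # downloading dataset test
```

In the real library, the stubbed lambdas would instead invoke the Data Downloader, Data Generator, and Data Splitter components described above.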
+Related classes/methods: `proteinflow.cli` (18:20)
+
+Data Downloader: handles the acquisition of data.
+Related classes/methods: none
+
+Data Generator: manages the synthesis or transformation of data.
+
+Related classes/methods: none
+
+Data Splitter: manages dataset partitioning and re-consolidation.
+
+Related classes/methods: none
+
+Logging and Reporting: provides operational feedback, status updates, and error summaries.
+
+Related classes/methods: none
+The `ProteinFlow` project is structured around a streamlined pipeline for acquiring, processing, organizing, and preparing protein data for machine learning tasks, complemented by analysis and visualization capabilities. The architecture is designed to facilitate efficient handling of large biological datasets.
+ + +The primary command-line interface that serves as the entry point for users to initiate and control the entire data pipeline. It orchestrates the execution of data acquisition, processing, and organization workflows.
+Related classes/methods: `proteinflow.cli` (18:20)
+
+This foundational component is responsible for acquiring raw protein data (PDB and SAbDab files), defining the core data structures for representing proteins and associated ligands, and performing the initial processing steps. This includes filtering, cleaning, and converting raw data into standardized `ProteinEntry` objects, handling quality checks, and managing ligand-specific details.
+Related classes/methods: `proteinflow.data` (1:1), `proteinflow.data.PDBEntry` (1:1), `proteinflow.data.SAbDabEntry` (1:1), `proteinflow.download` (1:1), `proteinflow.processing` (1:1), `proteinflow.ligand` (1:1)
+
+Focuses on organizing and partitioning the processed protein data into distinct train, validation, and test sets, often employing clustering techniques to ensure diverse and representative splits. It also provides PyTorch-compatible `Dataset` and `DataLoader` classes for efficient batching and preparation of data, making it ready for machine-learning model training and evaluation.
+Related classes/methods: `proteinflow.split` (1:1), `proteinflow.data.torch` (1:1)
+
+Offers a comprehensive suite of tools for calculating various protein-related metrics (e.g., sequence similarity, language-model perplexity) and for visualizing protein structures and animations. This component also manages the optional external dependencies required for its advanced functionality.
+Related classes/methods: `proteinflow.metrics` (1:1), `proteinflow.visualize` (1:1), `proteinflow.extra` (1:1)