From 5d32a86e2ee25679afbb9e756582819afb8212a8 Mon Sep 17 00:00:00 2001 From: imilev Date: Sun, 29 Jun 2025 03:16:12 +0300 Subject: [PATCH] Added high-level diagrams --- .codeboarding/External_Tool_Execution.md | 145 +++++++++++++++++ .codeboarding/Machine_Learning_Output.md | 129 +++++++++++++++ .codeboarding/Variant_Data_Processing.md | 107 +++++++++++++ .codeboarding/Variant_Post_processing.md | 145 +++++++++++++++++ .codeboarding/on_boarding.md | 191 +++++++++++++++++++++++ 5 files changed, 717 insertions(+) create mode 100644 .codeboarding/External_Tool_Execution.md create mode 100644 .codeboarding/Machine_Learning_Output.md create mode 100644 .codeboarding/Variant_Data_Processing.md create mode 100644 .codeboarding/Variant_Post_processing.md create mode 100644 .codeboarding/on_boarding.md diff --git a/.codeboarding/External_Tool_Execution.md b/.codeboarding/External_Tool_Execution.md new file mode 100644 index 00000000..60ef6d13 --- /dev/null +++ b/.codeboarding/External_Tool_Execution.md @@ -0,0 +1,145 @@ +```mermaid + +graph LR + + External_Tool_Orchestration["External Tool Orchestration"] + + Alignment_Tool_Execution_Modules["Alignment Tool Execution Modules"] + + Somatic_Caller_Execution_Modules["Somatic Caller Execution Modules"] + + Container_Configuration["Container Configuration"] + + External_Tool_Orchestration -- "orchestrates" --> Alignment_Tool_Execution_Modules + + External_Tool_Orchestration -- "orchestrates" --> Somatic_Caller_Execution_Modules + + External_Tool_Orchestration -- "uses" --> Container_Configuration + + Alignment_Tool_Execution_Modules -- "uses" --> Container_Configuration + + Somatic_Caller_Execution_Modules -- "uses" --> Container_Configuration + +``` + + + 
+[![CodeBoarding](https://img.shields.io/badge/Generated%20by-CodeBoarding-9cf?style=flat-square)](https://github.com/CodeBoarding/GeneratedOnBoardings)[![Demo](https://img.shields.io/badge/Try%20our-Demo-blue?style=flat-square)](https://www.codeboarding.org/demo)[![Contact](https://img.shields.io/badge/Contact%20us%20-%20contact@codeboarding.org-lightgrey?style=flat-square)](mailto:contact@codeboarding.org) + + + +## Details + + + +This subsystem is designed to automate the generation and execution of scripts for various external bioinformatics tools within containerized environments (Docker/Singularity). Its primary goal is to produce initial raw alignment (BAM) and variant (VCF) files by orchestrating a series of specialized tool executions. + + + +### External Tool Orchestration + +This is the central orchestrator of the External Tool Execution subsystem. It is responsible for generating and managing the execution scripts for both alignment and somatic variant calling pipelines. It initiates and oversees the workflows that leverage external bioinformatics tools within containerized environments, ultimately producing the raw alignment and variant files. + + + + + +**Related Classes/Methods**: + + + +- `somaticseq/utilities/dockered_pipelines/makeAlignmentScripts.py` (1:1) + +- `somaticseq/utilities/dockered_pipelines/makeSomaticScripts.py` (1:1) + +- `somaticseq/utilities/dockered_pipelines/run_workflows.py` (1:1) + + + + + +### Alignment Tool Execution Modules + +This component comprises a collection of specialized modules, each encapsulating the specific logic and commands required to run individual external bioinformatics tools for alignment-related tasks (e.g., BWA for alignment, Picard for duplicate marking, merging BAMs/Fastqs, trimming). These modules are invoked and managed by the External Tool Orchestration component to perform the alignment steps of the pipeline. 
+ + + + + +**Related Classes/Methods**: + + + +- `somaticseq/utilities/dockered_pipelines/alignments/align.py` (1:1) + +- `somaticseq/utilities/dockered_pipelines/alignments/markdup.py` (1:1) + +- `somaticseq/utilities/dockered_pipelines/alignments/mergeBams.py` (1:1) + +- `somaticseq/utilities/dockered_pipelines/alignments/mergeFastqs.py` (1:1) + +- `somaticseq/utilities/dockered_pipelines/alignments/spreadFastq.py` (1:1) + +- `somaticseq/utilities/dockered_pipelines/alignments/trim.py` (1:1) + + + + + +### Somatic Caller Execution Modules + +Similar to the alignment modules, this component consists of distinct modules, each dedicated to running a specific external somatic variant calling bioinformatics tool (e.g., MuTect2, VarDict, Strelka2, SomaticSniper). These modules contain the necessary commands and configurations for executing the callers within containerized environments, contributing to the generation of raw variant call files. They are orchestrated by the External Tool Orchestration component. 
+ + + + + +**Related Classes/Methods**: + + + +- `somaticseq/utilities/dockered_pipelines/somatic_mutations/JointSNVMix2.py` (1:1) + +- `somaticseq/utilities/dockered_pipelines/somatic_mutations/LoFreq.py` (1:1) + +- `somaticseq/utilities/dockered_pipelines/somatic_mutations/MuSE.py` (1:1) + +- `somaticseq/utilities/dockered_pipelines/somatic_mutations/MuTect2.py` (1:1) + +- `somaticseq/utilities/dockered_pipelines/somatic_mutations/Scalpel.py` (1:1) + +- `somaticseq/utilities/dockered_pipelines/somatic_mutations/SomaticSniper.py` (1:1) + +- `somaticseq/utilities/dockered_pipelines/somatic_mutations/Strelka2.py` (1:1) + +- `somaticseq/utilities/dockered_pipelines/somatic_mutations/VarDict.py` (1:1) + +- `somaticseq/utilities/dockered_pipelines/somatic_mutations/VarScan2.py` (1:1) + + + + + +### Container Configuration + +This utility component provides common functionalities and options for managing the containerized environments (Docker/Singularity) in which the external bioinformatics tools are executed. It ensures consistency in how containers are utilized across different tool execution modules, abstracting away the complexities of container setup and execution parameters. 
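The container abstraction described above can be pictured with a small sketch. It assumes the two runtimes named in the text (Docker and Singularity) and uses only their standard `docker run --rm -v` and `singularity exec --bind` invocations. The helper name `container_command` and the image name `example/bwa:latest` are hypothetical and are not taken from `container_option.py`.

```python
import shlex


def container_command(tech, image, tool_args, mounts=None):
    """Build an argv list that runs tool_args inside a container.

    Hypothetical helper for illustration only; it is not the API of
    somaticseq's container_option.py.
    """
    mounts = mounts or {}
    if tech == "docker":
        argv = ["docker", "run", "--rm"]
        for host_dir, inside_dir in mounts.items():
            argv += ["-v", f"{host_dir}:{inside_dir}"]
        argv.append(image)
    elif tech == "singularity":
        argv = ["singularity", "exec"]
        for host_dir, inside_dir in mounts.items():
            argv += ["--bind", f"{host_dir}:{inside_dir}"]
        # Singularity can run an image pulled from a Docker registry.
        argv.append(f"docker://{image}")
    else:
        raise ValueError(f"unsupported container technology: {tech}")
    return argv + list(tool_args)


# Example with a made-up image name and mount point.
cmd = container_command(
    "docker", "example/bwa:latest", ["bwa", "mem", "ref.fa", "reads.fq"], {"/data": "/mnt"}
)
print(shlex.join(cmd))
```

Keeping the runtime choice in one function means each tool module only supplies its own arguments, which is the consistency the paragraph above attributes to this component.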
+ + + + + +**Related Classes/Methods**: + + + +- `somaticseq/utilities/dockered_pipelines/container_option.py` (1:1) + + + + + + + + + +### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq) \ No newline at end of file diff --git a/.codeboarding/Machine_Learning_Output.md b/.codeboarding/Machine_Learning_Output.md new file mode 100644 index 00000000..1ff6ff7e --- /dev/null +++ b/.codeboarding/Machine_Learning_Output.md @@ -0,0 +1,129 @@ +```mermaid + +graph LR + + XGBoost_Model_Core["XGBoost Model Core"] + + TSV_to_VCF_Converter["TSV to VCF Converter"] + + Nucleotide_Change_Feature_Generator["Nucleotide Change Feature Generator"] + + SomaticSeq_Pipeline_Orchestrator["SomaticSeq Pipeline Orchestrator"] + + Genomic_File_Utilities["Genomic File Utilities"] + + SomaticSeq_Pipeline_Orchestrator -- "orchestrates" --> XGBoost_Model_Core + + SomaticSeq_Pipeline_Orchestrator -- "orchestrates" --> TSV_to_VCF_Converter + + XGBoost_Model_Core -- "uses" --> Nucleotide_Change_Feature_Generator + + TSV_to_VCF_Converter -- "uses" --> Genomic_File_Utilities + +``` + + + +[![CodeBoarding](https://img.shields.io/badge/Generated%20by-CodeBoarding-9cf?style=flat-square)](https://github.com/CodeBoarding/GeneratedOnBoardings)[![Demo](https://img.shields.io/badge/Try%20our-Demo-blue?style=flat-square)](https://www.codeboarding.org/demo)[![Contact](https://img.shields.io/badge/Contact%20us%20-%20contact@codeboarding.org-lightgrey?style=flat-square)](mailto:contact@codeboarding.org) + + + +## Details + + + +This subsystem embodies the core machine learning functionality of `somaticseq`, focusing on the classification of somatic variants using an XGBoost model and the subsequent conversion of results into the standard VCF format. It integrates several key components to achieve this, from feature engineering to final output generation. 
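As a rough illustration of the step from classifier output to a VCF quality value, the sketch below applies Phred scaling, a common convention for VCF quality scores. This patch does not state which scaling `somaticseq` applies, so the function name `p_to_phred` and the cap of 255 are assumptions made for the example.

```python
from math import log10


def p_to_phred(p, max_phred=255.0):
    """Convert a probability that a call is correct into a Phred-scaled score.

    Hypothetical helper: Phred = -10 * log10(1 - p), capped at max_phred
    so that p == 1.0 does not produce an infinite value.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    error = 1.0 - p
    if error <= 0.0:
        return max_phred
    return min(max_phred, -10.0 * log10(error))


# A few illustrative probabilities, rounded for display.
for prob in (0.5, 0.9, 0.999):
    print(prob, round(p_to_phred(prob), 2))
```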
+ + + +### XGBoost Model Core + +This component encapsulates the machine learning logic, specifically the training and prediction using the XGBoost algorithm. It takes feature-rich TSV data as input and outputs classification results, including prediction scores and feature importance. It is fundamental because it performs the actual machine learning classification, which is the primary purpose of this subsystem. + + + + + +**Related Classes/Methods**: + + + +- `somaticseq/somatic_xgboost.py` (1:1) + + + + + +### TSV to VCF Converter + +Responsible for transforming the classified TSV output from the XGBoost Model Core into the standardized VCF format. It handles the parsing of TSV data, processing variant information, and formatting it into VCF-compliant fields, including quality scores and filtering details. This component is crucial as it translates the internal processing results into a widely accepted and usable genomic data format. It leverages general genomic file utilities for its operations. + + + + + +**Related Classes/Methods**: + + + +- `somaticseq/somatic_tsv2vcf.py` (1:1) + +- `somaticseq/tsv2vcf.py` (1:1) + + + + + +### Nucleotide Change Feature Generator + +This component identifies and categorizes different types of nucleotide changes (e.g., single nucleotide variants (SNVs), insertions, deletions). This categorization is a crucial step in feature engineering, providing essential input features for the XGBoost Model Core to accurately classify somatic variants. It is fundamental because it prepares the data in a machine-learning-ready format, directly impacting the model's performance. + + + + + +**Related Classes/Methods**: + + + +- `somaticseq/ntchange_type.py` (1:1) + + + + + +### SomaticSeq Pipeline Orchestrator + +This component serves as the high-level coordinator for the entire `somaticseq` pipeline. 
Within the context of the `Machine Learning & Output` subsystem, it orchestrates the sequential execution of the XGBoost Model Core for classification and the subsequent TSV to VCF Converter for output formatting. It is fundamental as it defines the overall workflow and ensures the correct execution order of the core machine learning and output generation steps. + + + + + +**Related Classes/Methods**: + + + +- `somaticseq/run_somaticseq.py` (1:1) + + + + + +### Genomic File Utilities + +General utility functions for parsing and handling genomic file formats. + + + + + +**Related Classes/Methods**: _None_ + + + + + + + +### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq) \ No newline at end of file diff --git a/.codeboarding/Variant_Data_Processing.md b/.codeboarding/Variant_Data_Processing.md new file mode 100644 index 00000000..eac60c7d --- /dev/null +++ b/.codeboarding/Variant_Data_Processing.md @@ -0,0 +1,107 @@ +```mermaid + +graph LR + + Genomic_File_Parsing_Read_Information_Extraction["Genomic File Parsing & Read Information Extraction"] + + Feature_Calculation_Annotation["Feature Calculation & Annotation"] + + VCF_to_TSV_Transformation["VCF to TSV Transformation"] + + Feature_Calculation_Annotation -- "uses" --> Genomic_File_Parsing_Read_Information_Extraction + + VCF_to_TSV_Transformation -- "uses" --> Genomic_File_Parsing_Read_Information_Extraction + + VCF_to_TSV_Transformation -- "uses" --> Feature_Calculation_Annotation + +``` + + + +[![CodeBoarding](https://img.shields.io/badge/Generated%20by-CodeBoarding-9cf?style=flat-square)](https://github.com/CodeBoarding/GeneratedOnBoardings)[![Demo](https://img.shields.io/badge/Try%20our-Demo-blue?style=flat-square)](https://www.codeboarding.org/demo)[![Contact](https://img.shields.io/badge/Contact%20us%20-%20contact@codeboarding.org-lightgrey?style=flat-square)](mailto:contact@codeboarding.org) + + + +## Details + + + +The `Variant Data Processing` component within 
`somaticseq` is a critical subsystem responsible for preparing genomic variant data for machine learning applications. It orchestrates the parsing of raw genomic files, the extraction of detailed read-level information, the calculation of comprehensive quantitative features, and the final transformation of data into a machine-learning-ready format. + + + +### Genomic File Parsing & Read Information Extraction + +This component serves as the initial gateway for all genomic data. It is responsible for parsing various genomic file formats (e.g., VCF, BAM, pileup) and extracting fundamental read-level information necessary for downstream feature calculation. It provides the basic utilities to read and interpret raw genomic data. + + + + + +**Related Classes/Methods**: + + + +- `somaticseq.genomic_file_parsers.genomic_file_handlers` (0:0) + +- `somaticseq.genomic_file_parsers.read_info_extractor` (0:0) + +- `somaticseq.genomic_file_parsers.pileup_reader` (0:0) + +- `somaticseq.genomic_file_parsers.pileup_reader:Base_calls` (163:313) + +- `somaticseq.genomic_file_parsers.pileup_reader:Pileup_line` (13:160) + + + + + +### Feature Calculation & Annotation + +This component focuses on deriving quantitative features from genomic data. This includes calculating read-level metrics from BAM alignment files and integrating contextual information. It also handles the annotation of variants with these calculated features, which are crucial inputs for machine learning models. + + + + + +**Related Classes/Methods**: + + + +- `somaticseq.bam_features` (0:0) + +- `somaticseq.sequencing_features` (0:0) + +- `somaticseq.annotate_caller` (0:0) + +- `somaticseq.ntchange_type` (0:0) + + + + + +### VCF to TSV Transformation + +This component is responsible for converting standardized VCF (Variant Call Format) files into a custom tab-separated value (TSV) format. 
During this transformation, it integrates the features calculated by the "Feature Calculation & Annotation" component, producing a comprehensive dataset ready for machine learning model training or prediction. + + + + + +**Related Classes/Methods**: + + + +- `somaticseq.somatic_vcf2tsv` (0:0) + +- `somaticseq.single_sample_vcf2tsv` (0:0) + + + + + + + + + +### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq) \ No newline at end of file diff --git a/.codeboarding/Variant_Post_processing.md b/.codeboarding/Variant_Post_processing.md new file mode 100644 index 00000000..1082ca60 --- /dev/null +++ b/.codeboarding/Variant_Post_processing.md @@ -0,0 +1,145 @@ +```mermaid + +graph LR + + VCF_Combiner["VCF Combiner"] + + VCF_Format_Normalizers["VCF Format Normalizers"] + + Variant_Structure_Modifiers["Variant Structure Modifiers"] + + Caller_Specific_Annotator["Caller-Specific Annotator"] + + Variant_Tally_Aggregator["Variant Tally & Aggregator"] + + VCF_Combiner -- "uses" --> VCF_Format_Normalizers + + VCF_Combiner -- "uses" --> Variant_Structure_Modifiers + + Variant_Tally_Aggregator -- "uses" --> Variant_Structure_Modifiers + +``` + + + +[![CodeBoarding](https://img.shields.io/badge/Generated%20by-CodeBoarding-9cf?style=flat-square)](https://github.com/CodeBoarding/GeneratedOnBoardings)[![Demo](https://img.shields.io/badge/Try%20our-Demo-blue?style=flat-square)](https://www.codeboarding.org/demo)[![Contact](https://img.shields.io/badge/Contact%20us%20-%20contact@codeboarding.org-lightgrey?style=flat-square)](mailto:contact@codeboarding.org) + + + +## Details + + + +This subsystem combines, standardizes, and annotates the VCF output of multiple somatic variant callers, and provides utilities for splitting complex variants, intersecting with BED regions, sorting, and tallying variants across VCFs. + + + +### VCF Combiner + +This component acts as the orchestrator for integrating and standardizing VCF files originating from multiple variant callers. Its primary role is to merge these diverse VCFs and prepare them for subsequent processing steps, ensuring a consolidated view of variants across different calling algorithms. 
It directly calls modules responsible for VCF format normalization and structural modifications. + + + + + +**Related Classes/Methods**: + + + +- `somaticseq.combine_callers.combineSingle` (14:256) + + + + + +### VCF Format Normalizers + +This collection of modules is dedicated to converting and normalizing the specific VCF output formats from various variant callers (e.g., MuTect, VarScan2, VarDict, Strelka) into a standardized internal representation. This ensures consistency regardless of the original caller's output quirks. These modules are invoked by the `VCF Combiner`. + + + + + +**Related Classes/Methods**: + + + +- `somaticseq.vcf_modifier.modify_MuTect` (1:1) + +- `somaticseq.vcf_modifier.modify_VarScan2` (1:1) + +- `somaticseq.vcf_modifier.modify_VarDict` (1:1) + +- `somaticseq.vcf_modifier.modify_ssMuTect2` (1:1) + +- `somaticseq.vcf_modifier.modify_ssStrelka` (1:1) + + + + + +### Variant Structure Modifiers + +This component specializes in manipulating the structure of VCF entries. It handles the breakdown of complex variant representations (e.g., multi-allelic variants, block substitutions) into simpler, individual SNV (Single Nucleotide Variant) and Indel (Insertion/Deletion) records. Additionally, it provides utilities for VCF manipulation such as intersection with BED regions, sorting VCF entries, and identifying unique variant positions. These modules are directly utilized by the `VCF Combiner` and potentially other components for VCF processing. + + + + + +**Related Classes/Methods**: + + + +- `somaticseq.vcf_modifier.split_vcf` (1:1) + +- `somaticseq.vcf_modifier.bed_util` (1:1) + +- `somaticseq.vcf_modifier.getUniqueVcfPositions` (1:1) + +- `somaticseq.vcf_modifier.copy_TextFile` (1:1) + + + + + +### Caller-Specific Annotator + +This component is responsible for processing and enriching the output from individual somatic variant callers. 
It extracts specific metrics, flags, and information relevant to each caller's output, adding valuable context to the variant calls. + + + + + +**Related Classes/Methods**: + + + +- `somaticseq.annotate_caller` (1:1) + + + + + +### Variant Tally & Aggregator + +This component performs detailed analysis, including tallying variants and calculating Variant Allele Frequencies (VAF) across multiple VCF files. It also facilitates the integration of external annotations, such as SNP effects and information from public databases like dbSNP and COSMIC. + + + + + +**Related Classes/Methods**: + + + +- `somaticseq.utilities.tally_variants_from_multiple_vcfs` (1:1) + + + + + + + + + +### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq) \ No newline at end of file diff --git a/.codeboarding/on_boarding.md b/.codeboarding/on_boarding.md new file mode 100644 index 00000000..439026c0 --- /dev/null +++ b/.codeboarding/on_boarding.md @@ -0,0 +1,191 @@ +```mermaid + +graph LR + + Workflow_Orchestration_Control["Workflow Orchestration & Control"] + + External_Tool_Execution["External Tool Execution"] + + Variant_Data_Processing["Variant Data Processing"] + + Machine_Learning_Output["Machine Learning & Output"] + + Variant_Post_processing["Variant Post-processing"] + + Workflow_Orchestration_Control -- "Initiates & Directs" --> External_Tool_Execution + + Workflow_Orchestration_Control -- "Orchestrates" --> Variant_Data_Processing + + Workflow_Orchestration_Control -- "Orchestrates" --> Machine_Learning_Output + + External_Tool_Execution -- "Outputs Raw Data To" --> Variant_Data_Processing + + External_Tool_Execution -- "Outputs Raw VCFs To" --> Variant_Post_processing + + Variant_Data_Processing -- "Receives Input From" --> External_Tool_Execution + + Variant_Data_Processing -- "Receives Processed VCFs From" --> Variant_Post_processing + + Variant_Data_Processing -- "Provides Data To" --> Machine_Learning_Output + + Machine_Learning_Output 
-- "Receives Data From" --> Variant_Data_Processing + + Machine_Learning_Output -- "Outputs Final VCFs To" --> Workflow_Orchestration_Control + + Variant_Post_processing -- "Receives Raw VCFs From" --> External_Tool_Execution + + Variant_Post_processing -- "Provides Processed VCFs To" --> Variant_Data_Processing + + click External_Tool_Execution href "https://github.com/bioinform/somaticseq/blob/master/.codeboarding/External_Tool_Execution.md" "Details" + + click Variant_Data_Processing href "https://github.com/bioinform/somaticseq/blob/master/.codeboarding/Variant_Data_Processing.md" "Details" + + click Machine_Learning_Output href "https://github.com/bioinform/somaticseq/blob/master/.codeboarding/Machine_Learning_Output.md" "Details" + + click Variant_Post_processing href "https://github.com/bioinform/somaticseq/blob/master/.codeboarding/Variant_Post_processing.md" "Details" + +``` + + + +[![CodeBoarding](https://img.shields.io/badge/Generated%20by-CodeBoarding-9cf?style=flat-square)](https://github.com/CodeBoarding/GeneratedOnBoardings)[![Demo](https://img.shields.io/badge/Try%20our-Demo-blue?style=flat-square)](https://www.codeboarding.org/demo)[![Contact](https://img.shields.io/badge/Contact%20us%20-%20contact@codeboarding.org-lightgrey?style=flat-square)](mailto:contact@codeboarding.org) + + + +## Details + + + +The `somaticseq` project is designed as a modular pipeline for somatic variant calling and classification. The architecture can be abstracted into five core components, each with distinct responsibilities and clear interactions, facilitating a robust and scalable workflow. + + + +### Workflow Orchestration & Control + +This component serves as the central command unit, managing the entire SomaticSeq pipeline's execution flow. It handles both single-sample and paired-sample modes, orchestrates parallel processing across genomic regions, and coordinates the sequential and parallel execution of all downstream components. 
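The parallel processing across genomic regions mentioned above depends on dividing the target intervals into pieces of similar size. The following is a minimal sketch of that idea, assuming BED-style half-open `(chrom, start, end)` tuples. The function name `split_regions` is hypothetical and the code is not taken from `split_bed_into_equal_regions`.

```python
from math import ceil


def split_regions(regions, n_chunks):
    """Split (chrom, start, end) intervals into at most n_chunks lists
    whose total lengths are roughly equal.

    Hypothetical helper for illustration; intervals are cut where needed
    so that every chunk except possibly the last has the target length.
    """
    total = sum(end - start for _, start, end in regions)
    target = ceil(total / n_chunks)
    chunks, current, room = [], [], target
    for chrom, start, end in regions:
        while start < end:
            take = min(room, end - start)
            current.append((chrom, start, start + take))
            start += take
            room -= take
            if room == 0:
                # Current chunk is full; start a new one.
                chunks.append(current)
                current, room = [], target
    if current:
        chunks.append(current)
    return chunks


# Example with made-up intervals: 150 bases split three ways.
for chunk in split_regions([("chr1", 0, 100), ("chr2", 0, 50)], 3):
    print(chunk)
```

Each resulting chunk could then be handed to a separate worker process, with the per-region outputs merged afterwards.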
+ + + + + +**Related Classes/Methods**: + + + +- `somaticseq.run_somaticseq` (0:0) + +- `somaticseq.somaticseq_parallel` (0:0) + +- `somaticseq.utilities.split_bed_into_equal_regions` (0:0) + + + + + +### External Tool Execution [[Expand]](./External_Tool_Execution.md) + +This component is responsible for generating and executing scripts that run various external bioinformatics tools (e.g., BWA for alignment, MuTect2, VarDict for somatic calling) within containerized environments (Docker/Singularity). It produces the initial raw alignment (BAM) and variant (VCF) files. + + + + + +**Related Classes/Methods**: + + + +- `somaticseq.utilities.dockered_pipelines.makeAlignmentScripts` (0:0) + +- `somaticseq.utilities.dockered_pipelines.makeSomaticScripts` (0:0) + +- `somaticseq.utilities.dockered_pipelines.run_workflows` (48:65) + + + + + +### Variant Data Processing [[Expand]](./Variant_Data_Processing.md) + +This component handles the intricate process of preparing variant data for machine learning. It includes parsing various genomic file formats, extracting detailed sequencing read-level information, calculating a comprehensive set of quantitative features from BAM alignment files and genomic context, and transforming VCF files into a feature-rich tab-separated value (TSV) format. + + + + + +**Related Classes/Methods**: + + + +- `somaticseq.genomic_file_parsers.genomic_file_handlers` (0:0) + +- `somaticseq.genomic_file_parsers.read_info_extractor` (0:0) + +- `somaticseq.bam_features` (0:0) + +- `somaticseq.sequencing_features` (0:0) + +- `somaticseq.somatic_vcf2tsv` (0:0) + +- `somaticseq.single_sample_vcf2tsv` (0:0) + + + + + +### Machine Learning & Output [[Expand]](./Machine_Learning_Output.md) + +This component embodies the core machine learning functionality. It implements the XGBoost model for both training (building) and predicting (classifying) somatic variants using the feature-rich TSV data. 
Subsequently, it converts the classified TSV results, including prediction scores and filtering information, back into the standard VCF format. + + + + + +**Related Classes/Methods**: + + + +- `somaticseq.somatic_xgboost` (0:0) + +- `somaticseq.somatic_tsv2vcf` (0:0) + +- `somaticseq.tsv2vcf` (45:627) + + + + + +### Variant Post-processing [[Expand]](./Variant_Post_processing.md) + +This component is responsible for the initial processing, combining, and manipulation of VCF outputs generated by various external somatic variant callers. It standardizes their formats, annotates variants with caller-specific information, and provides utilities for VCF manipulation such as intersection with BED regions, splitting complex variants, sorting, and tallying variants from multiple VCFs. + + + + + +**Related Classes/Methods**: + + + +- `somaticseq.combine_callers` (0:0) + +- `somaticseq.annotate_caller` (0:0) + +- `somaticseq.vcf_modifier.bed_util` (0:0) + +- `somaticseq.vcf_modifier.split_vcf` (0:0) + +- `somaticseq.vcf_modifier.modify_MuTect2` (0:0) + +- `somaticseq.vcf_modifier.modify_VarDict` (0:0) + +- `somaticseq.utilities.tally_variants_from_multiple_vcfs` (0:0) + + + + + + + + + +### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq) \ No newline at end of file