Add MSA-based conservation analysis, and plotting module #2195
+1,198
−0
This module implements an MSA‑based evolutionary conservation workflow for protein sequences. It reads a wild‑type sequence plus either BLAST XML or a precomputed MSA, optionally maps residues to a PDB structure, and computes per‑position conservation metrics at both the residue level and the residue‑class level.

The residue‑level scores are: identity fraction (relative to the wild‑type residue), Shannon entropy, normalized conservation (1 − normalized entropy), consensus residue frequency, gap fraction, and two BLOSUM62‑based scores (mean similarity to the wild‑type residue and mean pairwise BLOSUM62 score within the column). The class‑level scores (using H/P/N/B/X classes) are: fraction of sequences in the wild‑type residue's class, class entropy, and consensus class frequency.

The module produces three kinds of output: a tab‑delimited text file containing these metrics, Matplotlib heatmaps for any selected scores, and optional PDB files with chosen metrics encoded in the B‑factor column. A high‑level API (computeConservationFromMSA) supports both CLI use and programmatic access to all metrics, consensus sequences, MSA details, and generated output paths.
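To make the residue‑level metrics concrete, here is a minimal sketch of how identity fraction, Shannon entropy, normalized conservation, consensus frequency, and gap fraction could be computed for a single alignment column. It is not the code in this PR: the function name `column_scores`, the choice to exclude gaps before counting, and the normalization by log2(20) are all assumptions made for illustration, and the module may handle these details differently.

```python
import math
from collections import Counter


def column_scores(column, wt_residue, gap_chars="-."):
    """Score one MSA column against the wild-type residue.

    Hypothetical helper for illustration only. Assumptions: gaps are
    excluded before counting residues, and entropy is normalized by
    log2(20), the maximum for 20 equiprobable amino acids.
    """
    n_total = len(column)
    residues = [c for c in column if c not in gap_chars]
    gap_fraction = (n_total - len(residues)) / n_total if n_total else 0.0

    if not residues:
        # All-gap (or empty) column: nothing to score.
        return {
            "identity": 0.0,
            "entropy": 0.0,
            "conservation": 0.0,
            "consensus": None,
            "consensus_freq": 0.0,
            "gap_fraction": gap_fraction,
        }

    counts = Counter(residues)
    n = len(residues)

    # Fraction of non-gap residues identical to the wild type.
    identity = counts.get(wt_residue, 0) / n

    # Shannon entropy in bits over the observed residue frequencies.
    entropy = sum((k / n) * math.log2(n / k) for k in counts.values())

    # Normalized conservation = 1 - normalized entropy, clamped to [0, 1].
    conservation = 1.0 - min(entropy / math.log2(20), 1.0)

    consensus, consensus_count = counts.most_common(1)[0]

    return {
        "identity": identity,
        "entropy": entropy,
        "conservation": conservation,
        "consensus": consensus,
        "consensus_freq": consensus_count / n,
        "gap_fraction": gap_fraction,
    }


# Toy alignment (made-up sequences); the first row plays the wild type.
msa = ["MKVL", "MKIL", "MR-L", "MKVL"]
wild_type = msa[0]
scores = [column_scores(col, wt) for col, wt in zip(zip(*msa), wild_type)]
```

Each entry of `scores` corresponds to one alignment position. Whether gaps should count toward the denominator of the identity fraction is a design choice; this sketch excludes them and reports the gap fraction separately so that both pieces of information remain available.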