Validity and Improvement Suggestions for H9N2 EVEscape Analysis Workflow without DMS Data

Hello @OATML-Markslab-admin @pascalnotin @nikkithadani @sarahgurev and EVEscape team,

I am working on an evolutionary and immune escape analysis of H9N2 avian influenza virus using the EVEscape pipeline. My dataset consists of whole-genome sequences of my own H9N2 isolates, merged with all publicly available H9N2 sequences from the GISAID database. I have designed a workflow based on the repository and related publications, and I would like to confirm whether the approach is correct and whether there are specific requirements for selecting the reference sequence or PDB file (e.g., should they be derived from early isolates or from specific strains?).  

---

### **Proposed Workflow:**  

1. **Input Data Preparation:**  
   - **Reference Sequence:**  
     Combine my H9N2 isolates with the GISAID sequences, remove duplicates, and make the sequence format consistent. Use the HA protein sequence of a single representative strain as the reference (which strain to choose is question 2 below).  
   - **Multiple Sequence Alignment (MSA):**  
     Align all HA protein sequences with MAFFT, then down-weight redundant sequences and filter out low-coverage sequences and columns:  
     - Reweight sequences with a threshold of 0.01 (suitable for viral datasets).  
     - Retain sequences that cover at least 50% of the reference positions.  
     - Retain alignment columns in which at least 70% of sequences are non-gap.  
   - **PDB Structure:**  
     Use an experimental PDB structure for H9N2 HA if available. Otherwise, generate a predicted structure using AlphaFold or MODELLER.  
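
To make the filtering and reweighting steps above concrete, here is a minimal pure-Python sketch of how I understand them. It is not the repository's code: `filter_msa` and `sequence_weights` are hypothetical helper names, and I am assuming that the 0.01 threshold means "sequences within a normalized Hamming distance of 0.01 share one unit of weight". Please correct me if the pipeline defines these steps differently.

```python
GAPS = "-."

def filter_msa(seqs, ref_id, min_seq_cov=0.5, min_col_cov=0.7):
    """Filter an alignment given as {name: aligned_sequence} (equal lengths)."""
    ref = seqs[ref_id]
    # Only consider columns where the reference has a residue.
    ref_cols = [i for i, c in enumerate(ref) if c not in GAPS]
    # Keep sequences that cover enough of the reference positions.
    kept = {}
    for name, s in seqs.items():
        cov = sum(s[i] not in GAPS for i in ref_cols) / len(ref_cols)
        if cov >= min_seq_cov:
            kept[name] = s
    # Keep columns that are non-gap in enough of the retained sequences.
    cols = [
        i for i in ref_cols
        if sum(s[i] not in GAPS for s in kept.values()) / len(kept) >= min_col_cov
    ]
    return {name: "".join(s[i] for i in cols) for name, s in kept.items()}

def sequence_weights(seqs, theta=0.01):
    """Weight = 1 / (number of sequences closer than theta, self included)."""
    rows = list(seqs.values())
    length = len(rows[0])
    weights = {}
    for name, s in seqs.items():
        neighbours = sum(
            sum(a != b for a, b in zip(s, t)) / length < theta for t in rows
        )
        weights[name] = 1.0 / neighbours
    return weights

# Toy alignment: "b" has too little coverage, and the last column is too gappy.
msa = {"ref": "MKTA", "a": "MKTA", "b": "M---", "c": "MRT-"}
filtered = filter_msa(msa, "ref")
weights = sequence_weights(filtered)
```

In this toy case the two identical sequences end up sharing their weight, while the divergent one keeps a full weight of 1.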

2. **Environment Setup:**  
   - Clone the repository and install dependencies:  
     ```bash
     git clone https://github.com/OATML-Markslab/EVEscape.git
     cd EVEscape
     conda config --add channels conda-forge
     conda create --name evescape_env --file requirements.txt
     conda activate evescape_env
     ```
   - Download pre-trained EVE model checkpoints for hemagglutinin:  
     ```bash
     curl -o EVE_checkpoints_I4EPC4.zip https://marks.hms.harvard.edu/evescape/EVE_checkpoints_I4EPC4.zip
     unzip EVE_checkpoints_I4EPC4.zip
     rm EVE_checkpoints_I4EPC4.zip
     ```

3. **Generate EVE Scores:**  
   Use the pre-trained EVE model to compute evolutionary fitness scores for HA protein mutations:  
   ```bash
   python scripts/generate_eve_scores.py \
     --msa_file data/H9N2_HA_MSA.fasta \
     --checkpoint_dir data/EVE_checkpoints/ \
     --output_file results/EVE_scores_H9N2.csv
   ```

4. **Calculate EVEscape Scores:**  
   - Compute Fitness, Accessibility, and Dissimilarity scores:  
     ```bash
     python scripts/process_protein_data.py \
       --pdb_file data/H9N2_HA.pdb \
       --fasta_file data/H9N2_HA.fasta \
       --evescores_file results/EVE_scores_H9N2.csv \
       --output_dir results/
     ```
   - Consolidate these scores into final EVEscape scores:  
     ```bash
     python scripts/evescape_scores.py \
       --input_dir results/ \
       --output_file results/summaries_with_scores.csv
     ```

5. **Strain-Level Analysis (Optional):**  
   If analyzing multiple strains, merge all strain FASTA files and run:  
   ```bash
   python scripts/score_pandemic_strains.py \
     --fasta_file data/H9N2_strains.fasta \
     --output_file results/summaries_with_gisaid/strain_scores.csv
   ```

6. **Visualization and Interpretation:**  
   Use the output files (e.g., `summaries_with_scores.csv`) to generate plots such as immune escape score distributions or mutation hotspots.
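
As an illustration of the hotspot analysis I have in mind, the sketch below averages per-mutation scores by site and ranks the sites. The column names `i` (site position) and `evescape` (score) are assumptions on my part, as is the helper name `top_sites`; they would need to be adjusted to the actual columns of the output file.

```python
import csv
import io
from collections import defaultdict

def top_sites(csv_text, pos_col="i", score_col="evescape", n=10):
    """Return the n sites with the highest mean score as (position, mean) pairs."""
    scores = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row[score_col] == "":
            continue  # skip mutations without a score
        scores[int(row[pos_col])].append(float(row[score_col]))
    means = {pos: sum(vals) / len(vals) for pos, vals in scores.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Toy table standing in for the real score file (values are made up).
example = "i,mut,evescape\n1,A,0.25\n1,C,0.75\n2,A,0.9\n"
ranking = top_sites(example)
```

For a real run, the text would come from reading the score file from disk instead of the inline toy table.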

---

### **Questions:**  

1. Is this workflow appropriate for analyzing H9N2 immune escape and mutation predictions? Are there any steps or parameters that should be adjusted?  

2. Regarding the reference sequence used in the analysis:
   - Should it be derived from early isolates of H9N2, or can it be based on more recent strains?  
   - Are there specific recommendations for selecting a representative sequence?  

3. For the PDB structure:
   - Are there specific requirements for selecting an appropriate PDB file? For example, should it represent an early isolate?   
   - If using AlphaFold predictions, are there additional considerations to ensure compatibility with downstream analyses?  
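
To make the compatibility question concrete, this is the kind of sanity check I am considering before running the pipeline: confirm that the residue numbering in the structure file matches the reference sequence. It is only a sketch. It assumes standard fixed-column PDB `ATOM` records, a single chain, and that residue number 1 corresponds to the first residue of the reference; deposited structures may use a different numbering scheme or omit part of the protein, in which case an offset or mapping would be needed. The function names are hypothetical.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def pdb_residues(pdb_text, chain="A"):
    """Map residue number -> one-letter amino acid, using CA atoms of one chain."""
    residues = {}
    for line in pdb_text.splitlines():
        is_ca = line.startswith("ATOM") and line[12:16].strip() == "CA"
        if is_ca and line[21] == chain:
            residues[int(line[22:26])] = THREE_TO_ONE.get(line[17:20], "X")
    return residues

def numbering_mismatches(residues, reference):
    """List (position, structure_aa, reference_aa) where the two disagree."""
    mismatches = []
    for pos, aa in sorted(residues.items()):
        ref_aa = reference[pos - 1] if 1 <= pos <= len(reference) else None
        if ref_aa != aa:
            mismatches.append((pos, aa, ref_aa))
    return mismatches

# Toy structure with three residues; the third disagrees with the reference.
toy_pdb = "\n".join(
    "ATOM  {:>5}  CA  {} {}{:>4}".format(n, res, "A", n)
    for n, res in enumerate(["MET", "LYS", "ALA"], start=1)
)
problems = numbering_mismatches(pdb_residues(toy_pdb), "MKT")
```

An empty list would suggest the structure and reference are numbered consistently; any entries point to positions that need remapping before scores are combined.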

Thank you for your time and support! I look forward to your feedback.
