Hello @OATML-Markslab-admin @pascalnotin @nikkithadani @sarahgurev and EVEscape team,
I am working on an evolutionary and immune escape analysis of H9N2 avian influenza virus using the EVEscape pipeline. My dataset consists of whole-genome-sequenced H9N2 isolates, which I have merged with all publicly available H9N2 sequences from the GISAID database. I have designed a workflow based on the repository and the related publications, and I would like to confirm whether this approach is correct and whether there are specific requirements for selecting the reference sequence or the PDB file (e.g., should they be derived from early isolates or from specific strains?).
Proposed Workflow:
Input Data Preparation:
- Reference Sequence:
Combine my H9N2 isolates with sequences from GISAID, remove duplicates, and ensure format consistency. Use the HA protein sequence as the reference.
- Multiple Sequence Alignment (MSA):
Perform MSA using MAFFT on all HA protein sequences. Apply filtering to remove low-quality or redundant sequences:
- Reweight sequences with a threshold of 0.01 (suitable for viral datasets).
- Retain sequences with at least 50% coverage of the reference.
- Retain alignment columns with at least 70% coverage.
- PDB Structure:
Use an experimental PDB structure for H9N2 HA if available. Otherwise, generate a predicted structure using AlphaFold or MODELLER.
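A minimal sketch of the filtering and reweighting described above, to make the thresholds concrete. It assumes the alignment is already loaded as a dict of equal-length aligned strings; the function names `filter_alignment` and `sequence_weights` are hypothetical and are not part of the EVEscape or EVE code. It also drops low-coverage columns outright, which is a simplification and may differ from how the actual pipeline handles them.

```python
from typing import Dict, List

GAP = "-"


def filter_alignment(msa: Dict[str, str], ref_id: str,
                     min_seq_cov: float = 0.5,
                     min_col_cov: float = 0.7) -> Dict[str, str]:
    """Trim to reference columns, then filter sequences and columns by coverage."""
    ref = msa[ref_id]
    # Focus columns: alignment positions where the reference has a residue.
    focus = [i for i, c in enumerate(ref) if c != GAP]
    trimmed = {name: "".join(seq[i] for i in focus) for name, seq in msa.items()}

    # Sequence filter: keep sequences covering >= min_seq_cov of the reference.
    kept = {name: seq for name, seq in trimmed.items()
            if sum(c != GAP for c in seq) / len(focus) >= min_seq_cov}

    # Column filter: keep columns where >= min_col_cov of kept sequences are non-gap.
    n = len(kept)
    good_cols = [j for j in range(len(focus))
                 if sum(seq[j] != GAP for seq in kept.values()) / n >= min_col_cov]
    return {name: "".join(seq[j] for j in good_cols) for name, seq in kept.items()}


def sequence_weights(seqs: List[str], theta: float = 0.01) -> List[float]:
    """Weight each sequence by 1 / (number of sequences within distance theta).

    Distance is the normalized Hamming distance; every sequence counts itself,
    so the neighbour count is always at least 1.
    """
    weights = []
    for a in seqs:
        neighbours = 0
        for b in seqs:
            dist = sum(x != y for x, y in zip(a, b)) / len(a)
            if dist < theta:
                neighbours += 1
        weights.append(1.0 / neighbours)
    return weights
```

The pairwise loop is quadratic in the number of sequences, so for the full GISAID set a vectorized implementation would be needed; the sketch is only meant to pin down what each threshold does.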
Environment Setup:
- Clone the repository and install dependencies:
```bash
git clone https://github.com/OATML-Markslab/EVEscape.git
cd EVEscape
conda config --add channels conda-forge
conda create --name evescape_env --file requirements.txt
conda activate evescape_env
```
- Download pre-trained EVE model checkpoints for hemagglutinin:
```bash
curl -o EVE_checkpoints_I4EPC4.zip https://marks.hms.harvard.edu/evescape/EVE_checkpoints_I4EPC4.zip
unzip EVE_checkpoints_I4EPC4.zip
rm EVE_checkpoints_I4EPC4.zip
```
Generate EVE Scores:
Use the pre-trained EVE model to compute evolutionary fitness scores for HA protein mutations:
```bash
python scripts/generate_eve_scores.py \
  --msa_file data/H9N2_HA_MSA.fasta \
  --checkpoint_dir data/EVE_checkpoints/ \
  --output_file results/EVE_scores_H9N2.csv
```
Calculate EVEscape Scores:
- Compute Fitness, Accessibility, and Dissimilarity scores:
```bash
python scripts/process_protein_data.py \
  --pdb_file data/H9N2_HA.pdb \
  --fasta_file data/H9N2_HA.fasta \
  --evescores_file results/EVE_scores_H9N2.csv \
  --output_dir results/
```
- Consolidate these scores into final EVEscape scores:
```bash
python scripts/evescape_scores.py \
  --input_dir results/ \
  --output_file results/summaries_with_scores.csv
```
Strain-Level Analysis (Optional):
If analyzing multiple strains, merge all strain FASTA files and run:
```bash
python scripts/score_pandemic_strains.py \
  --fasta_file data/H9N2_strains.fasta \
  --output_file results/summaries_with_gisaid/strain_scores.csv
```
Visualization and Interpretation:
Use the output files (e.g., summaries_with_scores.csv) to generate plots such as immune escape score distributions or mutation hotspots.
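For the hotspot part, a rough sketch of how per-mutation scores could be collapsed into per-site scores before plotting. The column names `i` (site position) and `evescape` (score) are assumptions about the layout of summaries_with_scores.csv and should be checked against the actual file; the helper names are hypothetical.

```python
import csv
from typing import Dict, Iterable, List, Mapping, Tuple


def load_rows(path: str) -> List[Dict[str, str]]:
    """Read a CSV file into a list of row dicts keyed by the header names."""
    with open(path, newline="") as fh:
        return list(csv.DictReader(fh))


def site_max_scores(rows: Iterable[Mapping[str, str]],
                    pos_col: str = "i",
                    score_col: str = "evescape") -> Dict[int, float]:
    """Return the highest score observed at each site, skipping empty scores."""
    best: Dict[int, float] = {}
    for row in rows:
        if not row[score_col]:
            continue  # mutation without a score
        pos = int(row[pos_col])
        score = float(row[score_col])
        if pos not in best or score > best[pos]:
            best[pos] = score
    return best


def top_sites(site_scores: Mapping[int, float], n: int = 10) -> List[Tuple[int, float]]:
    """Return the n sites with the highest per-site score, best first."""
    return sorted(site_scores.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Calling `top_sites(site_max_scores(load_rows("results/summaries_with_scores.csv")))` would then give candidate hotspot positions. Taking the per-site maximum is one choice among several; a mean over mutations at each site would highlight different sites.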
Questions:
Is this workflow appropriate for analyzing H9N2 immune escape and mutation predictions? Are there any steps or parameters that should be adjusted?
Regarding the reference sequence used in the analysis:
- Should it be derived from early isolates of H9N2, or can it be based on more recent strains?
- Are there specific recommendations for selecting a representative sequence?
For the PDB structure:
- Are there specific requirements for selecting an appropriate PDB file? For example, should it represent an early isolate?
- If using AlphaFold predictions, are there additional considerations to ensure compatibility with downstream analyses?
Thank you for your time and support! I look forward to your feedback.