Analysis code for "Graph attention with energy features improves the generalizability of identifying functional sequences at a protein interface"

WhiteheadGroup/Graph-Attention-SARS-RBD

Graph Attention - SARS RBD

Analysis code for "Graph attention with energy features improves the generalizability of identifying functional sequences at a protein interface"

Note: The manuscript treats libraries LY010 and LY011 described in this repository as a single library, LY010. Within this repository, LY010 and Cas2 refer to the sequences in this library that have mutations in the second cassette, while LY011 and Cas3 refer to the sequences in this library that have mutations in the third cassette.

LY010 & LY011 Data Processing

Sequencing Coverage Analysis:


sequence_search_csv.py

  • Compares two lists of sequences (a CSV of merged sequences and an Excel sheet of the oligos that were ordered) and returns only the merged sequences that are present in the list of ordered oligos

sequence_search_fastq.py

  • Compares two lists of sequences (a FASTQ of merged sequences and an Excel sheet of the oligos that were ordered) and returns only the merged sequences that are present in the list of ordered oligos
  • Same as sequence_search_csv.py, but takes FASTQ input instead of CSV
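
The comparison described for both scripts amounts to a set-membership filter. The following is a minimal sketch, not the repository's code: it assumes both inputs have already been loaded into Python lists, and the helper name `filter_to_ordered` and the exact, case-insensitive matching rule are assumptions made for illustration.

```python
def filter_to_ordered(merged_seqs, ordered_oligos):
    """Return the merged sequences that exactly match an ordered oligo.

    Hypothetical helper: assumes exact, case-insensitive matching and
    keeps the original order (and duplicates) of the merged reads.
    """
    ordered = {s.strip().upper() for s in ordered_oligos}
    return [s for s in merged_seqs if s.strip().upper() in ordered]


# Toy data, not taken from the libraries described here.
merged = ["ACGTAC", "ttgacc", "GGGCCC"]
ordered = ["ACGTAC", "TTGACC"]
kept = filter_to_ordered(merged, ordered)
```

Building the set once keeps each lookup constant-time, which matters when the merged file holds many more reads than the oligo sheet has entries.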

Variant Filtering:


get_mutations.py

  • Determines the list of amino acid mutations corresponding to each DNA sequence in a CSV file
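
One way to derive amino acid mutations from a DNA sequence is to translate both the WT and the variant and compare them residue by residue. The sketch below assumes in-frame sequences of equal length containing only A, C, G, and T; the function names and the `first_position` parameter are hypothetical and do not come from get_mutations.py.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate an in-frame DNA string; trailing partial codons are ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def get_mutations(wt_dna, variant_dna, first_position=1):
    """List mutations as e.g. 'G11D', numbering residues from first_position."""
    pairs = zip(translate(wt_dna), translate(variant_dna))
    return [
        f"{wt}{pos}{var}"
        for pos, (wt, var) in enumerate(pairs, start=first_position)
        if wt != var
    ]


# Toy sequences: the WT encodes K-G-Y, the variant encodes K-D-Y.
mutations = get_mutations("AAAGGTTAT", "AAAGATTAT", first_position=10)
```

Reads containing ambiguous bases such as N would raise a `KeyError` here; a real pipeline would need to decide whether to skip or flag them.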

wuhan_mut_naming.py

  • Takes in mutation assignments and renames the mutations so that they correspond to the true WT sequence
    • The WT sequence initially used for mutation assignment was incorrect at three positions (498, 501, and 505)
  • Also prints a list of the unique mutations found in the data file
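
Re-expressing mutations against a corrected reference can be sketched as follows. This is one possible approach under stated assumptions, not the logic of wuhan_mut_naming.py: mutations are strings such as `K12E`, `old_ref` and `true_ref` are hypothetical dictionaries giving the residue at each position where the two references disagree, and the positions and residues in the example are placeholders.

```python
import re

MUT_RE = re.compile(r"^([A-Z*])(\d+)([A-Z*])$")


def rename_to_true_wt(mutations, old_ref, true_ref):
    """Re-express a mutation set relative to the corrected reference.

    At every position where the references differ, a variant with no
    recorded mutation carries the old reference residue, so it starts
    out as (true residue, old residue) and is overwritten if a mutation
    was recorded there.
    """
    residues = {pos: (true_ref[pos], old) for pos, old in old_ref.items()}
    for mutation in mutations:
        match = MUT_RE.match(mutation)
        if match is None:
            raise ValueError(f"unrecognised mutation: {mutation!r}")
        wt, pos, new = match.group(1), int(match.group(2)), match.group(3)
        residues[pos] = (true_ref.get(pos, wt), new)
    # Drop entries whose residue equals the true WT residue.
    return [f"{wt}{pos}{new}" for pos, (wt, new) in sorted(residues.items()) if wt != new]


# Placeholder references that disagree at positions 5 and 8.
old_ref = {5: "A", 8: "G"}
true_ref = {5: "S", 8: "T"}
renamed = rename_to_true_wt(["A5S", "K12E"], old_ref, true_ref)

# Unique mutations across several renamed sets, as a sorted list.
unique = sorted({m for muts in [renamed, ["K12E"]] for m in muts})
```

In the example, `A5S` disappears because S is the true WT residue at position 5, while position 8 gains `T8G` because the variant silently carried the old reference residue there.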

variant_filter.py

  • The first part of the code is a direct copy of wuhan_mut_naming.py
    • Generates the list of expected mutations that should be seen in the sequencing data
    • The input sheet should be manually edited to ensure that it contains a line for the WT sequence
  • The second part of the code takes in data generated by the SpikeRBDStabilization code
    • "LY010_LY011_10Jul24_SPK.xlsx" is the probabilities worksheet from "LY010_LY011_17Jun24_Probabilities.xlsx"
      • "LY010_LY011_17Jun24_Probabilities.xlsx" is a single workbook containing all of the output data from the SpikeRBDStabilization code
    • First checks that all sequences have WT mutations at positions 417, 477, and 484, and a mutation at positions 498, 501, and 505
    • Then removes the WT mutations from all mutation sets
    • Finally cross-checks the mutation sets from the sequencing data against the expected mutation sets, and removes any lines corresponding to unexpected mutation sets
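
The three filtering steps above could be sketched roughly as below. This is an illustration under several assumptions rather than the code in variant_filter.py: rows are `(mutation list, count)` pairs, "WT mutations" are passed in explicitly as a set of strings to strip, rows failing the position check are dropped rather than reported, and all positions and residues in the example are placeholders. For the data described here, `required_positions` would hold the six positions listed above.

```python
def position(mutation):
    """Extract the residue number from a string such as 'D3E'."""
    return int(mutation[1:-1])


def filter_variants(rows, required_positions, wt_mutations, expected_sets):
    """Keep rows whose cleaned mutation set is one of the expected sets."""
    expected = {frozenset(s) for s in expected_sets}
    kept = []
    for mutations, count in rows:
        # Step 1: every required position must appear in the mutation set.
        if not required_positions <= {position(m) for m in mutations}:
            continue
        # Step 2: strip the entries designated as WT mutations.
        cleaned = frozenset(m for m in mutations if m not in wt_mutations)
        # Step 3: cross-check against the expected mutation sets.
        if cleaned in expected:
            kept.append((sorted(cleaned, key=position), count))
    return kept


# Placeholder data; the empty expected set stands in for the WT line.
rows = [
    (["A1A", "C2C", "D3E", "F4G"], 120),  # expected set, kept
    (["A1A", "C2C", "D3E", "F4K"], 7),    # unexpected set, removed
    (["A1A", "D3E", "F4G"], 3),           # position 2 missing, removed
]
filtered = filter_variants(
    rows,
    required_positions={1, 2, 3, 4},
    wt_mutations={"A1A", "C2C"},
    expected_sets=[["D3E", "F4G"], []],
)
```

Using `frozenset` makes the comparison independent of the order in which mutations are listed, and including an empty set among the expected sets is one way to honor the requirement that the input sheet contain a line for the WT sequence.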

Resequencing Data Processing:


Flash

  • Paired sequence read files from the limited re-sequencing run are merged with Flash

mutations_and_counts.py

  • Takes in a merged FASTQ file and assigns mutations based on a given WT sequence
  • Outputs a file listing each mutation set and its corresponding read count
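
Counting reads per mutation set can be sketched with a `collections.Counter`. The example below is a minimal illustration, not mutations_and_counts.py itself: it assumes a strict four-line-per-record FASTQ, takes the mutation caller as an argument, and uses a nucleotide-level placeholder caller (`nucleotide_diffs`) with a toy WT sequence purely so the block runs on its own.

```python
import io
from collections import Counter


def read_fastq_sequences(handle):
    """Yield the sequence line of each four-line FASTQ record."""
    for index, line in enumerate(handle):
        if index % 4 == 1:
            yield line.strip().upper()


def count_mutation_sets(handle, call_mutations):
    """Map each mutation set (as a tuple) to the number of reads carrying it."""
    counts = Counter()
    for seq in read_fastq_sequences(handle):
        counts[tuple(call_mutations(seq))] += 1
    return counts


# Placeholder caller: compares nucleotides against a toy WT sequence.
WT = "ACGTAC"


def nucleotide_diffs(seq):
    return [f"{w}{i}{v}" for i, (w, v) in enumerate(zip(WT, seq), start=1) if w != v]


fastq = io.StringIO(
    "@r1\nACGTAC\n+\nIIIIII\n"
    "@r2\nACGAAC\n+\nIIIIII\n"
    "@r3\nACGAAC\n+\nIIIIII\n"
)
counts = count_mutation_sets(fastq, nucleotide_diffs)
```

Passing the caller as a parameter means an amino-acid-level function could be substituted without changing the counting logic; the empty tuple key collects reads identical to the WT.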

variant_filter_resequencing.py

  • Largely the same as variant_filter.py
  • Checks that all mutation sets contain the expected mutations, then removes mutation sets that were not encoded in the oligo pool

count_comparison.py

  • Combines the processed data from the original sequencing run with the processed data from the re-sequencing run
  • Adjusts the total counts in the 1 nM column to correct for WT contamination
  • Performs the calculations necessary to classify each variant as "like-WT", "worse than WT", or "non-binder"
