Analysis code for "Graph attention with energy features improves the generalizability of identifying functional sequences at a protein interface"
Note: The manuscript treats libraries LY010 and LY011 described in this repository as a single library, LY010. Within this repository, LY010 and Cas2 refer to the sequences in this library that have mutations in the second cassette, while LY011 and Cas3 refer to the sequences in this library that have mutations in the third cassette.
- Compares two lists of sequences (CSV of merged sequences, Excel sheet of the oligos that were ordered) and returns only the merged sequences that are present in the list of ordered oligos
- Compares two lists of sequences (FASTQ of merged sequences, Excel sheet of the oligos that were ordered) and returns only the merged sequences that are present in the list of ordered oligos
- same as sequence_search_csv.py but for a different file format
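
The comparison described above can be sketched as a set lookup. This is a minimal illustration, not the repository's code: the function names, the `sequence` column name, and the toy data are all assumptions, and reading the Excel sheet of ordered oligos is left out so the example stays within the standard library.

```python
import csv
import io

def sequences_from_csv(handle, column="sequence"):
    """Yield one column from each row of a CSV file (column name is hypothetical)."""
    for row in csv.DictReader(handle):
        yield row[column]

def keep_ordered(merged, ordered):
    """Return the merged sequences that exactly match one of the ordered oligos."""
    # A set gives constant-time membership tests; case and stray whitespace are ignored.
    ordered_set = {seq.strip().upper() for seq in ordered}
    return [seq for seq in merged if seq.strip().upper() in ordered_set]

# Toy data standing in for the merged-read CSV and the ordered-oligo sheet.
merged_csv = io.StringIO("sequence\nACGTAC\nTTTTTT\nGGGCCC\n")
ordered_oligos = ["ACGTAC", "GGGCCC", "AAAAAA"]

kept = keep_ordered(sequences_from_csv(merged_csv), ordered_oligos)
print(kept)
```

Exact matching is assumed here; if the real scripts tolerate mismatches or trim adapters first, that logic would sit before the lookup.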
- Determines the list of amino acid mutations corresponding to each DNA sequence in a CSV file
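
One way to derive amino acid mutations from a DNA sequence is to translate both the WT and the read with the standard genetic code and compare them residue by residue. The sketch below assumes in-frame sequences of equal length; `translate`, `call_mutations`, the `first_residue` parameter, and the toy codons are hypothetical and are not taken from the repository.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate an in-frame DNA string; unknown codons (e.g. with N) become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))

def call_mutations(wt_dna, read_dna, first_residue):
    """List mutations such as 'N501Y', numbering residues from first_residue."""
    pairs = zip(translate(wt_dna), translate(read_dna))
    return [
        f"{wt}{first_residue + i}{mut}"
        for i, (wt, mut) in enumerate(pairs)
        if wt != mut
    ]

# Toy example: three codons encoding N, G, V, numbered 501 to 503.
print(call_mutations("AATGGTGTT", "TATGGTGTT", first_residue=501))
```

This version reports only positions that differ from WT; a script that also records unchanged positions (for example as `N501N`) would simply drop the `if wt != mut` filter.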
- takes in mutation assignments and renames the mutations so that they correspond to the true WT sequence
- the initial WT sequence used for mutation assignments was incorrect at some positions (498, 501, and 505)
- Also prints out a list of the unique mutations found within the data file
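
The renaming step can be pictured as swapping the WT letter of each mutation label at the affected positions. The sketch below is an assumption-laden illustration: the label format (`Q498R`), the `rename` and `unique_mutations` helpers, and the residues in `TRUE_WT` (the Wuhan-Hu-1 residues Q498, N501, Y505) are chosen for the example and are not confirmed by the repository.

```python
import re

# One-letter WT residue, position, one-letter observed residue (e.g. "Q498R").
MUTATION = re.compile(r"([A-Z*])(\d+)([A-Z*])")

# Illustrative "true WT" residues; an assumption, not read from the repository.
TRUE_WT = {498: "Q", 501: "N", 505: "Y"}

def rename(mutation, true_wt):
    """Replace the WT letter of a label when its position has a corrected residue."""
    match = MUTATION.fullmatch(mutation)
    if match is None:
        raise ValueError(f"unrecognised mutation label: {mutation!r}")
    old_wt, pos, observed = match.groups()
    return f"{true_wt.get(int(pos), old_wt)}{pos}{observed}"

def unique_mutations(mutation_sets):
    """Return every distinct label, sorted by position and then alphabetically."""
    unique = {m for mutations in mutation_sets for m in mutations}
    return sorted(unique, key=lambda m: (int(MUTATION.fullmatch(m).group(2)), m))

# Hypothetical labels assigned against an incorrect WT (R498, Y501, H505).
assigned = [["K417K", "R498R", "Y501N", "H505Y"], ["K417K", "R498R", "Y501Y", "H505H"]]
renamed = [[rename(m, TRUE_WT) for m in mutations] for mutations in assigned]
print(renamed)
print(unique_mutations(renamed))
```

Labels whose observed residue equals the corrected WT residue (such as `N501N`) are kept here rather than dropped, so that a later step can decide how to handle WT entries.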
- first part of the code is a direct copy of wuhan_mut_naming.py
- generates the list of expected mutations that should be seen in the sequencing data
- input sheet should be manually edited to ensure that it contains a line for the WT sequence
- second part of the code takes in data generated by the SpikeRBDStabilization code
- "LY010_LY011_10Jul24_SPK.xlsx" is the probabilities worksheet from "LY010_LY011_17Jun24_Probabilities.xlsx"
- "LY010_LY011_17Jun24_Probabilities.xlsx" is a single workbook containing all of the output data from SpikeRBDStabilization code
- first checks to make sure all sequences have WT mutations at positions 417, 477, and 484, and a mutation at positions 498, 501, and 505
- then removes WT mutations from all mutation sets
- finally crosschecks the mutations corresponding to the sequencing data and expected sets of mutations, and removes any lines corresponding to unexpected mutation sets
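
The three filtering steps above can be sketched as follows. This reflects one reading of the description (WT entries such as `K417K` are required at 417, 477, and 484, and any entry is required at 498, 501, and 505); the data layout, the function names, and the toy counts are hypothetical.

```python
import re

MUTATION = re.compile(r"([A-Z*])(\d+)([A-Z*])")
WT_POSITIONS = (417, 477, 484)       # must carry a WT entry, e.g. "K417K"
MUTATED_POSITIONS = (498, 501, 505)  # must carry an entry of any kind

def parse(label):
    """Split a label such as 'Q498R' into (wt, position, observed)."""
    wt, pos, observed = MUTATION.fullmatch(label).groups()
    return wt, int(pos), observed

def passes_position_check(mutations):
    """Step 1: check the required entries at the six tracked positions."""
    by_pos = {pos: (wt, obs) for wt, pos, obs in map(parse, mutations)}
    wt_ok = all(p in by_pos and by_pos[p][0] == by_pos[p][1] for p in WT_POSITIONS)
    return wt_ok and all(p in by_pos for p in MUTATED_POSITIONS)

def strip_wt(mutations):
    """Step 2: drop WT entries, leaving only real amino acid changes."""
    return frozenset(m for m in mutations if parse(m)[0] != parse(m)[2])

def filter_variants(observed, expected):
    """Step 3: keep only variants whose mutation set was expected."""
    kept = {}
    for mutations, count in observed.items():
        if not passes_position_check(mutations):
            continue
        variant = strip_wt(mutations)
        if variant in expected:
            kept[variant] = kept.get(variant, 0) + count
    return kept

# Toy counts keyed by mutation set; the empty frozenset stands for the WT line.
observed = {
    ("K417K", "S477S", "E484E", "Q498R", "N501Y", "Y505H"): 120,
    ("K417K", "S477S", "E484E", "Q498Q", "N501N", "Y505Y"): 300,
    ("K417N", "S477S", "E484E", "Q498R", "N501Y", "Y505H"): 5,   # fails step 1
    ("K417K", "S477S", "E484E", "Q498A", "N501Y", "Y505H"): 2,   # not expected
}
expected = {frozenset(), frozenset({"Q498R", "N501Y", "Y505H"})}
filtered = filter_variants(observed, expected)
print(filtered)
```

Including an empty set in `expected` mirrors the note that the input sheet needs a line for the WT sequence; without it, WT reads would be discarded as unexpected.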
- "LY010_LY011_10Jul24_SPK.xlsx" is the probabilities worksheet from "LY010_LY011_17Jun24_Probabilities.xlsx"
- sequence read files from the limited re-sequencing data are merged via FLASH
- takes in merged fastq file and assigns mutations based on a given WT sequence
- outputs a file with the list of mutation sets and corresponding read counts
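
Tallying read counts per mutation set from a merged FASTQ file might look like the sketch below. It assumes the usual four-line FASTQ record layout; `toy_assign` is a deliberately simple stand-in that compares characters position by position, and the function names, the `WT` label, and the output column names are hypothetical.

```python
import csv
import io
from collections import Counter

def read_fastq_sequences(handle):
    """Yield the sequence line (second line) of each four-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()

def toy_assign(seq, wt="ACGT"):
    """Stand-in for the real mutation assignment: compare characters to a toy WT."""
    return tuple(
        f"{w}{i}{s}" for i, (w, s) in enumerate(zip(wt, seq), start=1) if w != s
    )

def count_mutation_sets(sequences, assign):
    """Count reads per mutation set; reads with no changes are labelled 'WT'."""
    counts = Counter()
    for seq in sequences:
        counts[" ".join(assign(seq)) or "WT"] += 1
    return counts

def write_counts(counts, handle):
    """Write one row per mutation set, most abundant first."""
    writer = csv.writer(handle)
    writer.writerow(["mutations", "reads"])
    writer.writerows(counts.most_common())

# Toy merged FASTQ with three reads.
fastq = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGA\n+\nIIII\n@r3\nACGA\n+\nIIII\n")
counts = count_mutation_sets(read_fastq_sequences(fastq), toy_assign)

out = io.StringIO()
write_counts(counts, out)
print(out.getvalue())
```

In practice `assign` would be replaced by a codon-aware function that translates each read against the chosen WT sequence.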
- largely the same as variant_filter.py
- makes sure all mutation sets contain expected mutations and then removes mutation sets that were not encoded in the oligo pool
- combines the processed data from the original sequencing run with the processed data from the re-sequencing run
- adjusts the total counts in the 1 nM column to correct for WT contamination
- performs calculations necessary to classify each variant as "like-WT", "worse than WT", or "non-binder"