Hi, thank you for the great resource and the detailed codebase.
This is related to #56, but goes deeper into the preprocessing pipeline.
I have been investigating the conversion from the original experimental data to the ProteinGym format, starting with HIS7_YEAST_Pokusaeva_2019 as a case study. The raw experimental data for this assay is available on GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE99990), where protein variants are represented as partial amino acid sequences (AAseq column), not as mutation strings (e.g., A6E:L7S).
However, HIS7_YEAST_Pokusaeva_2019.csv in substitutions_raw_DMS.zip (https://marks.hms.harvard.edu/proteingym/ProteinGym_v1.1/substitutions_raw_DMS.zip) already contains a mutant column in ProteinGym format. This means the conversion from AAseq-style sequences to ProteinGym format must have happened somewhere — but I cannot find the corresponding script in this repository (https://github.com/OATML-Markslab/ProteinGym) or in the referenced experimental paper's repository (https://github.com/Lcarey/HIS3InterspeciesEpistasis).
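To make the question concrete, here is a minimal Python sketch of the kind of conversion I assume took place: a variant segment from the AAseq column is compared position by position against the matching wild-type segment, and the differences are written as colon-separated substitutions. The function name `aaseq_to_mutant`, the `offset` parameter, the `"WT"` label for unmutated sequences, and the toy sequences are my own assumptions and are not taken from ProteinGym or from the original paper's code. The actual pipeline may handle segment offsets, length differences, or stop codons differently.

```python
def aaseq_to_mutant(wt_segment: str, variant_segment: str, offset: int = 1) -> str:
    """Convert a partial amino acid sequence into a mutation string.

    wt_segment      -- wild-type residues covering the same region (assumed known)
    variant_segment -- the variant's residues for that region (e.g. an AAseq value)
    offset          -- 1-based position of the segment's first residue in the
                       full-length protein (hypothetical parameter)
    """
    if len(wt_segment) != len(variant_segment):
        # Substitution-only comparison; indels would need a real alignment.
        raise ValueError("segments must have equal length")

    mutations = [
        f"{wt}{pos}{var}"
        for pos, (wt, var) in enumerate(zip(wt_segment, variant_segment), start=offset)
        if wt != var
    ]
    # "WT" for an unmutated sequence is an assumed label, not a confirmed convention.
    return ":".join(mutations) if mutations else "WT"


# Toy example with a made-up wild-type segment (not the real HIS7 sequence).
print(aaseq_to_mutant("MKTAYALV", "MKTAYESV"))  # A6E:L7S
```

If the real conversion is roughly this simple, the key missing pieces are the wild-type reference used for each segment and the position offset applied to it, which is why the original script would help.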
My questions:
- Is the preprocessing script that converts the GEO data for HIS7_YEAST_Pokusaeva_2019 into the substitutions_raw_DMS format publicly available?
- If not, would it be possible to share it, even informally?
- I plan to investigate similar preprocessing questions for other assays in ProteinGym beyond HIS7. Is there a general pipeline or per-assay documentation that describes how each raw dataset was processed into the ProteinGym format?
This information is important for reproducibility and also for verifying the licensing chain.
Thank you in advance.