[Q] Preprocessing pipeline from raw experimental data to the substitutions_raw_DMS files #104

@yutaka-fj

Hi, thank you for the great resource and the detailed codebase.

This is related to #56, but goes deeper into the preprocessing pipeline.

I have been investigating the conversion from the original experimental data to the ProteinGym format, starting with HIS7_YEAST_Pokusaeva_2019 as a case study. The raw experimental data for this assay is available on GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE99990), where protein variants are represented as partial amino acid sequences (AAseq column), not as mutation strings (e.g., A6E:L7S).

However, HIS7_YEAST_Pokusaeva_2019.csv in substitutions_raw_DMS.zip (https://marks.hms.harvard.edu/proteingym/ProteinGym_v1.1/substitutions_raw_DMS.zip) already contains a mutant column in ProteinGym format. This means the AAseq-style sequences must have been converted to ProteinGym mutation strings at some point, but I cannot find the corresponding script either in this repository (https://github.com/OATML-Markslab/ProteinGym) or in the referenced experimental paper's repository (https://github.com/Lcarey/HIS3InterspeciesEpistasis).
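
To make the question concrete, here is a minimal sketch of the kind of conversion I mean. It is my own guess, not the ProteinGym authors' method. The function name `to_mutant_string`, the `start` offset parameter, and the toy sequences are all hypothetical, and I am assuming that each AAseq segment aligns one-to-one with a wild-type segment (substitutions only).

```python
def to_mutant_string(wt_segment: str, variant_segment: str, start: int = 1) -> str:
    """Return ProteinGym-style substitutions (e.g. "A6E:L7S") for one variant.

    Assumptions (mine, not confirmed by the ProteinGym authors):
      - variant_segment is aligned to wt_segment and has the same length
        (substitutions only, no insertions or deletions);
      - start is the 1-based position of wt_segment[0] in the full-length
        reference sequence.
    """
    if len(wt_segment) != len(variant_segment):
        raise ValueError("segments differ in length; indels are not handled")
    subs = [
        f"{wt}{pos}{mut}"
        for pos, (wt, mut) in enumerate(zip(wt_segment, variant_segment), start=start)
        if wt != mut
    ]
    # An empty string here stands for "identical to wild type"; how the real
    # pipeline labels wild-type rows is unknown to me.
    return ":".join(subs)


# Toy sequences, made up for illustration (not taken from HIS7 / GSE99990).
wt = "MTEQKALVKR"
print(to_mutant_string(wt, "MTEQKESVKR"))             # A6E:L7S
print(to_mutant_string(wt, "MTEQKESVKR", start=100))  # A105E:L106S
```

Even this simple version leaves open what I cannot determine from the public files: which wild-type reference and position offset were used for each AAseq segment, and how variants that do not align one-to-one were handled.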

My questions:

  1. Is the preprocessing script that converts the GEO data for HIS7_YEAST_Pokusaeva_2019 into the substitutions_raw_DMS format publicly available?
  2. If not, would it be possible to share it, even informally?
  3. I plan to investigate similar preprocessing questions for other assays in ProteinGym beyond HIS7. Is there a general pipeline or per-assay documentation that describes how each raw dataset was processed into the ProteinGym format?

This information is important for reproducibility and also for verifying the licensing chain.

Thank you in advance.
