[Q] Preprocessing pipeline from raw experimental data to the `substitutions_raw_DMS` files

Hi, thank you for the great resource and the detailed codebase.

This is related to #56, but goes deeper into the preprocessing pipeline.

I have been investigating the conversion from the original experimental data to the ProteinGym format, starting with `HIS7_YEAST_Pokusaeva_2019` as a case study. The raw experimental data for this assay is available on GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE99990), where protein variants are represented as partial amino acid sequences (`AAseq` column), not as mutation strings (e.g., `A6E:L7S`).

However, `HIS7_YEAST_Pokusaeva_2019.csv` in `substitutions_raw_DMS.zip` (https://marks.hms.harvard.edu/proteingym/ProteinGym_v1.1/substitutions_raw_DMS.zip) already contains a `mutant` column in ProteinGym format. The conversion from `AAseq`-style sequences to mutation strings must therefore have happened at some intermediate step, but I cannot find the corresponding script in this repository (https://github.com/OATML-Markslab/ProteinGym) or in the repository of the referenced experimental paper (https://github.com/Lcarey/HIS3InterspeciesEpistasis).
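To make the question concrete, here is a minimal sketch of the kind of conversion I have in mind. It is only my guess at the general idea, not the actual ProteinGym preprocessing: the function name `seq_to_mutant`, the `offset` parameter, and the toy wild-type sequence are all hypothetical, and I am assuming that each `AAseq` segment is an ungapped, substitution-only stretch whose start position in the wild-type protein is known.

```python
def seq_to_mutant(variant_seq: str, wt_seq: str, offset: int = 0,
                  wt_label: str = "WT") -> str:
    """Compare a variant segment to the wild type and return a
    colon-separated mutation string such as 'A6E:L7S'.

    offset is the 0-based index in wt_seq where the segment starts
    (assumed to be known). Positions in the output are 1-based.
    Insertions, deletions, and stop codons are not handled.
    """
    if offset < 0 or offset + len(variant_seq) > len(wt_seq):
        raise ValueError("segment does not fit inside the wild-type sequence")

    mutations = []
    for i, aa in enumerate(variant_seq):
        wt_aa = wt_seq[offset + i]
        if aa != wt_aa:
            # <wild-type residue><1-based position><variant residue>
            mutations.append(f"{wt_aa}{offset + i + 1}{aa}")

    # An unmutated segment gets a placeholder label (hypothetical choice).
    return ":".join(mutations) if mutations else wt_label


# Toy wild-type sequence, made up for illustration (not HIS7_YEAST).
toy_wt = "MKTAYALLQR"

# Segment covering positions 4-9 with two substitutions.
print(seq_to_mutant("AYESLQ", toy_wt, offset=3))  # A6E:L7S
```

Even for this simple case, choices such as the segment offset, the position numbering, and the handling of wild-type or truncated sequences affect the resulting `mutant` strings, which is why I would like to see the script that was actually used.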

**My questions:**

1. Is the preprocessing script that converts the GEO data for `HIS7_YEAST_Pokusaeva_2019` into the `substitutions_raw_DMS` format publicly available?
2. If not, would it be possible to share it, even informally?
3. I plan to investigate similar preprocessing questions for other assays in ProteinGym beyond HIS7. Is there a general pipeline or per-assay documentation that describes how each raw dataset was processed into the ProteinGym format?

This information is important for reproducibility and also for verifying the licensing chain.

Thank you in advance.

[Q] Preprocessing pipeline from raw experimental data to the `substitutions_raw_DMS` files #104
