Suggestion on Including More Combinatorially Complete Landscapes #85

@EvenStarArwen

Hi, ProteinGym team,

Thank you for developing this wonderful playground!

I was wondering if you might consider adding more combinatorially complete protein fitness landscapes to the dataset. These landscapes allow model predictions to be evaluated across all possible combinations of mutations, albeit in restricted regions (e.g., 3–4 sites). This in turn helps assess how well models capture higher-order epistasis.
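To illustrate why completeness matters for epistasis, here is a minimal sketch that computes pairwise epistasis on a wild-type background from a complete landscape. It assumes fitness is on an additive scale (e.g., log-enrichment) and uses a made-up two-site toy landscape; `toy_fitness` and `pairwise_epistasis` are hypothetical helpers, not part of ProteinGym.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids


def pairwise_epistasis(fitness, wt, i, j, a, b):
    """Epistasis between residue `a` at site i and residue `b` at site j.

    eps = f(double) - f(single_i) - f(single_j) + f(wt)
    Assumes `fitness` maps every site-residue string to an additive-scale value.
    """
    single_i = wt[:i] + a + wt[i + 1:]
    single_j = wt[:j] + b + wt[j + 1:]
    double = single_i[:j] + b + single_i[j + 1:]
    return fitness[double] - fitness[single_i] - fitness[single_j] + fitness[wt]


def toy_fitness(seq):
    # Hypothetical values: additive per-residue effects plus one interaction term.
    additive = sum(0.1 * AMINO_ACIDS.index(c) for c in seq)
    interaction = 0.5 if seq == "CD" else 0.0
    return additive + interaction


# A complete 2-site landscape: all 20^2 = 400 combinations are present,
# so every single and double mutant needed by the formula can be looked up.
landscape = {"".join(v): toy_fitness("".join(v))
             for v in product(AMINO_ACIDS, repeat=2)}

eps = pairwise_epistasis(landscape, "AA", 0, 1, "C", "D")
print(f"epistasis for A1C + A2D: {eps:.3f}")
```

The same lookup pattern extends to third- and fourth-order terms, which is only possible when every intermediate variant has a measurement.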

It seems that currently the only dataset of this kind is the GB1 landscape from Wu et al., 2016, which covers a designed space of 20^4=160,000 variants. As far as I know, there are various other combinatorially complete landscapes in the literature:

  • The PhoQ landscape from [1] (20^4=160,000 variants).
  • The ParB and Noc landscapes from [2] (20^4=160,000 variants).
  • The TEV and T7 landscapes from [3] (20^4=160,000 and 20^3=8,000 variants, respectively).
  • The TrpB3 and TrpB4 series landscapes from [4] (from 20^3=8,000 to 20^4=160,000 total variants).
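The variant counts above follow from 20^k for k mutated sites. A minimal sketch of how one might check how much of that designed space a measured dataset actually covers is shown below; the function names and the toy `measured` set are assumptions for illustration, not ProteinGym code or real data.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids


def full_space(n_sites, alphabet=AMINO_ACIDS):
    """Yield every residue combination over `n_sites` positions."""
    for combo in product(alphabet, repeat=n_sites):
        yield "".join(combo)


def coverage(measured, n_sites):
    """Fraction of the designed space that appears in `measured`."""
    space = set(full_space(n_sites))
    return len(space & set(measured)) / len(space)


# Sizes of the designed spaces for 3 and 4 sites.
n3 = sum(1 for _ in full_space(3))
n4 = sum(1 for _ in full_space(4))
print(n3, n4)

# Toy example: a 3-site dataset missing every variant with W at the first site.
measured = [v for v in full_space(3) if v[0] != "W"]
cov = coverage(measured, 3)
print(f"coverage: {cov:.2%}")
```

A coverage below 1.0 would indicate variants that are part of the designed space but have no measurement in the dataset.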

I believe that adding these datasets to ProteinGym would further enhance its utility.

References

  1. Anna I. Podgornaia and Michael T. Laub. Pervasive degeneracy and epistasis in a protein-protein interface. Science (2015).
  2. Adam S. B. Jalal et al. Diversification of DNA-binding specificity by permissive and specificity-switching mutations in the ParB/Noc protein family. Cell Rep. (2020).
  3. Boqiang Tu et al. An ultra-high-throughput method for measuring biomolecular activities. bioRxiv (2022).
  4. Kadina E. Johnston et al. A combinatorially complete epistatic fitness landscape in an enzyme active site. PNAS (2024).

Best,
Mingyu
