Zero-shot pathogenicity prediction (Baseline) #19

@merdivane

Historically first: https://www.biorxiv.org/content/10.1101/2021.07.09.450648v2.full.pdf
More extensive study: https://www.biorxiv.org/content/10.1101/2022.09.30.510294v3.full.pdf (new dataset: COSMIC + TCGA, new task: survival prediction)
Another extensive study: https://arxiv.org/pdf/2211.10000.pdf (new task: impact of rescue mutations)
Extension of the baseline to any protein length: https://www.biorxiv.org/content/10.1101/2022.08.25.505311v1.full

The idea is to take a protein language model (PLM) and pre-train it on a large corpus of available protein sequences in a BERT fashion (mask random tokens; the task is to predict them). The logits the model then predicts at a given position for the wild-type (wt) and mutated (mt) tokens have been shown to be an effective predictor of pathogenicity.
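As a rough sketch of the scoring idea described above: one common way to combine the wt and mt logits is the log-odds, log p(mt) − log p(wt), taken at the mutated position. That exact formula is an assumption here, since the issue text does not spell it out. The 20-letter vocabulary, the `mutation_score` helper, and the toy logits below are hypothetical stand-ins; a real PLM would supply its own tokenizer and per-position logits.

```python
import math

# Hypothetical 20-letter amino-acid vocabulary (assumption);
# a real PLM tokenizer defines its own token indices.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mutation_score(position_logits, wt, mt):
    """Log-odds of the mutant vs. wild-type residue at one position.

    position_logits: model logits over AMINO_ACIDS for the (masked) position.
    Assumed reading: a more negative score means the model finds the mutant
    less plausible than the wild type.
    """
    log_probs = log_softmax(position_logits)
    return log_probs[TOKEN_INDEX[mt]] - log_probs[TOKEN_INDEX[wt]]


# Toy logits standing in for real model output (assumption, not real data).
toy_logits = [0.0] * len(AMINO_ACIDS)
toy_logits[TOKEN_INDEX["L"]] = 4.0   # model strongly prefers leucine here
toy_logits[TOKEN_INDEX["P"]] = -2.0  # proline is disfavoured

score = mutation_score(toy_logits, wt="L", mt="P")
print(f"L->P score: {score:.3f}")
```

Because the softmax normalizer cancels in the difference, this score equals the raw logit difference between the mt and wt tokens, so no labelled pathogenicity data is needed to compute it.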
