
ESM-1b scoring strategy & v1.0→v1.1 score changes across all models #99

@ani11452

Hi! Thank you for developing and maintaining this fantastic benchmark! I have been trying to reproduce the ProteinGym zero-shot substitution benchmark and noticed a couple of things I'd appreciate clarification on.


1. ESM-1b Scoring Strategy

The paper (Section A.2, p32) states:

"We predict fitness for ESM models with the masked-marginal approach introduced in Meier et al. [2021]"

This implies all ESM models (ESM-1b, ESM-1v, ESM-2) use masked-marginals. However:

| Script | Line | `scoring_strategy` |
|---|---|---|
| scoring_ESM1b_substitutions.sh | 13 | wt-marginals |
| scoring_ESM1v_substitutions.sh | 19 | masked-marginals |
| scoring_ESM2_substitutions.sh | 30 | masked-marginals |

Question: Was the ESM-1b benchmark result computed with wt-marginals (per the script) or masked-marginals (per the paper)?
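For context, here is a minimal sketch of the two strategies as I understand them from Meier et al. (2021). `toy_logprobs` is a stand-in for a model forward pass, and all names here are illustrative, not the repo's actual API:

```python
import math

MASK = "<mask>"

def toy_logprobs(tokens):
    """Stand-in for a forward pass: per-position log-probs over {A, V}.
    P(A) depends on how many A's appear in the input, so masking a
    residue changes the output (as it would for a real masked LM)."""
    p_a = 0.3 + 0.1 * sum(t == "A" for t in tokens)
    return [{"A": math.log(p_a), "V": math.log(1.0 - p_a)} for _ in tokens]

def score_wt_marginals(seq, mutations, logprobs_fn):
    # One forward pass on the *unmasked* wild type; read log-odds at mutated sites.
    logp = logprobs_fn(list(seq))
    return sum(logp[pos][mt] - logp[pos][wt] for pos, wt, mt in mutations)

def score_masked_marginals(seq, mutations, logprobs_fn):
    # Mask each mutated position before reading its log-odds.
    total = 0.0
    for pos, wt, mt in mutations:
        masked = list(seq)
        masked[pos] = MASK
        logp = logprobs_fn(masked)
        total += logp[pos][mt] - logp[pos][wt]
    return total
```

Even in this toy setup the two scores differ, because masking removes the wild-type residue from the model's context, which is why knowing which strategy produced the ESM-1b numbers matters for replication.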

Additionally, the masked-marginals implementation in compute_fitness.py (lines 486-514) masks each position independently across L forward passes to build a full per-position probability table:

```python
for i in range(batch_tokens.size(1)):              # line 489: loop over ALL positions
    batch_tokens_masked[0, i] = alphabet.mask_idx  # line 491: mask one position at a time
```

This differs from Meier et al.'s strategy (a) (Eq. 1, p4), which masks all mutated positions simultaneously in a single forward pass. For single substitutions the two approaches are mathematically identical, but for multi-mutants they produce different scores, because independent masking gives each masked position more unmasked context. Is this the intended behavior for multi-mutant scoring?
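To make the multi-mutant discrepancy concrete, here is a self-contained sketch of the two masking schemes with a toy model whose output depends on how much context is visible (names and the toy model are mine, not the repo's):

```python
import math

MASK = "<mask>"

def toy_logprobs(tokens):
    """Stand-in forward pass: P(A) at every position grows with the number
    of unmasked tokens, so visible context directly changes the score."""
    p_a = 0.3 + 0.1 * sum(t != MASK for t in tokens)
    return [{"A": math.log(p_a), "V": math.log(1.0 - p_a)} for _ in tokens]

def score_independent(seq, mutations, logprobs_fn):
    # compute_fitness.py-style loop: one forward pass per mutated position,
    # masking only that position each time.
    total = 0.0
    for pos, wt, mt in mutations:
        masked = list(seq)
        masked[pos] = MASK
        logp = logprobs_fn(masked)
        total += logp[pos][mt] - logp[pos][wt]
    return total

def score_joint(seq, mutations, logprobs_fn):
    # Meier et al. strategy (a): mask *all* mutated positions in a single pass.
    masked = list(seq)
    for pos, _, _ in mutations:
        masked[pos] = MASK
    logp = logprobs_fn(masked)
    return sum(logp[pos][mt] - logp[pos][wt] for pos, wt, mt in mutations)
```

For a single substitution both functions mask exactly one position and agree; for a double mutant the joint pass hides one extra residue from the context, so the scores diverge.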


2. Score Changes Between v1.0 and v1.1

Between v1.0 (commit 4d3d391) and v1.1 (commit 8f1ce5f), scores changed for every model in:

benchmarks/DMS_zero_shot/substitutions/Spearman/Summary_performance_DMS_substitutions_Spearman.csv

The v1.1 release note says "Updates to reference file" but doesn't explain why existing model scores changed.

| Model (v1.0 row → v1.1 row) | Column | v1.0 (4d3d391) | v1.1 (8f1ce5f) | Delta |
|---|---|---|---|---|
| ESM-1v single (27→44) | Average_Spearman | 0.385 | 0.374 | −0.011 |
| ESM-1v single (27→44) | Function_Expression | 0.431 | 0.405 | −0.026 |
| ESM-1v single (27→44) | Function_Stability | 0.476 | 0.437 | −0.039 |
| ESM-1v ensemble (17→29) | Average_Spearman | 0.416 | 0.406 | −0.010 |
| ESM-1v ensemble (17→29) | Function_Expression | 0.456 | 0.429 | −0.027 |
| ESM-1b (22→35) | Average_Spearman | 0.399 | 0.394 | −0.005 |
| ESM-1b (22→35) | Function_Expression | 0.427 | 0.406 | −0.021 |
| ESM2 650M (16→27) | Function_Expression | 0.439 | 0.415 | −0.024 |
| CARP 640M (33→47) | Function_Expression | 0.419 | 0.397 | −0.022 |
| Wavenet (48→45) | Average_Spearman | 0.215 | 0.373 | +0.158 |

Note: Rows shifted between versions because new models were added, changing the Model_rank ordering.
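Because the row order shifted, the comparison above was made by keying on the model name rather than the row index. A small sketch of that approach, with toy stand-ins for the two CSVs (the column names here are guesses based on the table above, not necessarily the file's actual headers):

```python
import csv
import io

# Toy stand-ins for the v1.0 and v1.1 summary CSVs; the real files have
# many more rows and columns, and possibly different header names.
V1_0 = """Model_name,Average_Spearman
ESM-1v (single),0.385
ESM-1b,0.399
"""
V1_1 = """Model_name,Average_Spearman
ESM-1b,0.394
ESM-1v (single),0.374
"""

def load_scores(text):
    """Map model name -> score, so row order no longer matters."""
    return {row["Model_name"]: float(row["Average_Spearman"])
            for row in csv.DictReader(io.StringIO(text))}

def score_deltas(old_text, new_text):
    """Per-model score change for models present in both versions."""
    old, new = load_scores(old_text), load_scores(new_text)
    return {m: round(new[m] - old[m], 3) for m in old if m in new}
```

Joining on the name column sidesteps the rank reshuffling caused by newly added models.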

Diffing reference_files/DMS_substitutions.csv between the two commits shows only cosmetic changes:

  • 3 assay ID renames (e.g., PSAE_SYNP2_Tsuboyama_2023_1PSE → PSAE_PICP2_Tsuboyama_2023_1PSE)
  • Reformatted binarization cutoffs (-1.0 → -1)
  • A new pdb_range column

Question: Were the underlying DMS data CSV files updated between v1.0 and v1.1? I'd love to know why this change occurred, in case I'm overlooking something or these are the new numbers I should target in my replication.
